Networking Scale 2023

This is a conference about networking that I was interested in, and I finally got some emails with the presentations. They are mainly from Meta.

Meta’s Network Journey to Enable AI: video – second part interesting.

  • AI fabric (backend: GPU to GPU) hanging from the DC fabric.
  • SPC (Space, Power, Cooling)
  • Fiber, Automation
  • RDMA requires a lossless, low-latency, in-order network -> RoCEv2 (Ethernet) or IB
  • Servers have 8x400G to the TOR. TOR has 400G uplinks to the spines.
  • 1x AI zone per DH (data hall). 1x DC has several DHs.
  • Oversubscribed between zones, eBGP, ECMP.
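
Just to fix those numbers in my head, a back-of-envelope sketch in Python (the 2-servers-per-rack and 16-uplink figures are my own assumptions, not from the talk):

```python
# Rough per-rack bandwidth math for the AI zone above (my assumptions, not Meta's).
servers_per_rack = 2        # assumption
nics_per_server = 8         # 8x400G per server, from the talk
nic_speed_gbps = 400
tor_uplinks = 16            # assumption: 16x400G uplinks from the TOR to the spines

downlink_gbps = servers_per_rack * nics_per_server * nic_speed_gbps
uplink_gbps = tor_uplinks * nic_speed_gbps

print(f"TOR downlink capacity: {downlink_gbps / 1000:.1f} Tbps")        # 6.4 Tbps
print(f"TOR uplink capacity:   {uplink_gbps / 1000:.1f} Tbps")          # 6.4 Tbps
print(f"In-zone oversubscription: {downlink_gbps / uplink_gbps:.1f}:1")  # 1.0:1
```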

Scaling RoCE Networks for AI Training: video — Really really good.

  • RDMA/IB has been used for a long time in research.
  • Training: learning a new capability from existing data (the focus of the video)
  • Inference: applying this capability to new data (real time)
  • Distributed training for complex models. GPU-to-GPU sync -> high BW and low/predictable latency.
  • RoCEv2 with (tuned) PFC/ECN. TE + ECMP (flow multiplexing)
  • Oversubscription is fine at the spine (higher layer)
  • Challenges: load balancing (elephant flows), slow receivers/back pressure, packet loss from L1 issues (those flapping links, faulty optics, cables, etc. xD), debugging (finding job failures)
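
The load-balancing point clicked for me with a toy example: ECMP hashes the 5-tuple, and AI training traffic has so few (and so fat) flows that the hash often piles several of them onto the same uplink. A minimal sketch, entirely my own illustration:

```python
# Toy illustration of low-entropy elephant flows vs ECMP (not Meta's code).
import hashlib
from collections import Counter

UPLINKS = 4

def ecmp_pick(src_ip, dst_ip, src_port, dst_port, proto="udp"):
    """Stand-in for the switch ASIC hash over the 5-tuple."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % UPLINKS

# 8 GPU-to-GPU elephant flows: few endpoints, same RoCEv2 UDP port (4791).
flows = [(f"10.0.0.{i}", f"10.0.1.{i}", 4791, 4791) for i in range(8)]

placement = Counter(ecmp_pick(*f) for f in flows)
print(placement)
# The spread is rarely an even 2/2/2/2: some uplinks end up carrying 3 elephants
# while others sit almost idle, and each elephant is big enough to congest a link.
```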

Traffic Engineering for AI Training Networks: video – both parts are interesting.

  • Non-blocking. RTSW = TOR, CTSW = spine. Fat-tree architecture. 2x servers per rack, 1x server = 8x GPUs. CTSW: 16 downlinks -> 16 uplinks. Up to 208 racks?
  • RoCE since 2020. CTSWs are high-radix, deep-buffer switches.
  • AI workload challenges: low entropy (repetitive, predictable flows), bursty, high-intensity elephant flows.
  • SW-based TE: dynamic routing adapted in real time. Adaptive job placement. Controller (stateless)
  • Data plane: overlay (features from Broadcom chips) and underlay (BGP)
  • Flow granularity: NIC-to-host flow.
  • Handle network failures with minimum convergence time. Backdoor channel with an in-house protocol.
  • Simulation platform. NCCL benchmark.
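
To make the "SW-based TE with a stateless controller" idea concrete for myself, here is a toy sketch: instead of trusting the ECMP hash, a controller deterministically spreads a job's NIC-to-host flows over the uplinks. The names, the 16-plane count and the round-robin policy are my assumptions, not Meta's design:

```python
# Toy sketch of software-based TE for AI flows (my illustration).
from itertools import cycle

UPLINKS = [f"ctsw-plane-{i}" for i in range(16)]   # assumption: 16 planes per rack

def assign_paths(flows):
    """Deterministically round-robin flows over uplinks -> even load by construction.
    Being a pure function of the input keeps the controller stateless."""
    assignment = {}
    planes = cycle(UPLINKS)
    for flow in sorted(flows):
        assignment[flow] = next(planes)
    return assignment

# One flow per NIC: 16 GPUs in rack 1 talking to hosts in rack 7.
flows = [(f"rack1-gpu{i:02d}", "rack7") for i in range(16)]
for flow, plane in assign_paths(flows).items():
    print(flow, "->", plane)
```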

Networking for GenAI Training and Inference Clusters: video – super good!

  • Recommendation model: training 100 GFLOPs/iteration. Inference: a few GFLOPs/s for 100ms latency.
  • LLM: training 1 PetaFLOP/sentence (3 orders of magnitude > recommendation), inference: 10 PF/s for 1s time-to-first-token. +10k GPUs for training. Distributed inference. Needs compute too.
  • LLama2: 70 billion parameters -> 1.7M GPU hours. IB 200G per GPU, 51.2 TB/s bisection BW. 800 ZettaFLOPs. 2 trillion token dataset. 2k A100 GPUs. RoCEv2 was also used (LLama2 34B).
  • +30 ExaFLOPs (30% of H100 GPUs fp8 peak) + LLama 65B training in < 1 day.
  • Massive cluster: 32k GPUs! Model Parallelism.
  • LLM inference: a dual-edge problem. Prefill = large messages (high BW) + decode = small messages (latency sensitive).
  • Scale-out (less BW, larger domain: scalable RDMA (IB or Ethernet), data-parallel traffic) + scale-up (more BW, smaller domain: NVLink 400G, model-parallel traffic)
  • 32k GPUs. TOR (252), spine (18), AGG (18). 3 levels. Oversubscription spine-agg 7:1. 8 clusters, 252 racks per cluster, 16 GPUs per rack (8x252x16 = 32k GPUs). RoCEv2!
  • Model parallelism is harder computationally. Model-parallel traffic: all-reduce/all-to-all, big messages (inside the cluster = scale-up, NVLink). Data-parallel traffic: all-gather & reduce-scatter (between clusters = scale-out).
  • Challenges: latency matters more than for ranking workloads. Reliability !!!!!
  • LLM inference needs a fabric.
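
A quick check of the quoted cluster sizing (just the arithmetic from the bullet above):

```python
# Cluster sizing quoted in the talk: 8 clusters x 252 racks x 16 GPUs per rack.
clusters = 8
racks_per_cluster = 252
gpus_per_rack = 16            # 2 servers x 8 GPUs

total_gpus = clusters * racks_per_cluster * gpus_per_rack
print(total_gpus)             # 32256 -> the "32k GPUs" figure
```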

Scale out vs scale up: storage DB

scale up (vertical): more BW (links), more storage, etc., on the same device

scale out (horizontal): distribute the load across different devices
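
A trivial sketch of the difference, using a toy key-value store (my own example, not from any of the talks):

```python
# Scale up: grow the same node. Scale out: add nodes and spread the keys.
node = {"cpu_cores": 16, "storage_tb": 10}

# Scale up (vertical): the same node gets bigger.
scaled_up = {"cpu_cores": node["cpu_cores"] * 2, "storage_tb": node["storage_tb"] * 4}

# Scale out (horizontal): keep the node size, add more of them and shard the load.
cluster = [dict(node) for _ in range(4)]

def shard_for(key, nodes=cluster):
    """Pick the node responsible for a key (naive hash sharding)."""
    return hash(key) % len(nodes)

print("scaled-up node:", scaled_up)
print("key 'user:42' lives on node", shard_for("user:42"))
```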

Network Observability for AI/HPC Training Workflows: video

  • ROCET: automating RDMA metric collection and analysis for GPU training. Info from hosts/NICs and switches.
  • Reports: out-of-sequence packets, NIC flaps, local ACK timeouts.
  • PARAM + PyTorch. Chakra.
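
Not ROCET itself (I have no idea how it is implemented), but on Linux hosts with NVIDIA/Mellanox NICs this kind of per-NIC RDMA data typically sits in sysfs. A minimal collection sketch, assuming mlx5-style hw_counters (device name, port and counter names may differ per driver/firmware):

```python
# Minimal RDMA counter scrape from sysfs (my sketch, not ROCET).
from pathlib import Path

def read_rdma_counters(device="mlx5_0", port=1):
    """Read the per-port hardware counters exposed by the RDMA driver."""
    counters = {}
    base = Path(f"/sys/class/infiniband/{device}/ports/{port}/hw_counters")
    for entry in base.glob("*"):
        try:
            counters[entry.name] = int(entry.read_text().strip())
        except (ValueError, OSError):
            pass
    return counters

if __name__ == "__main__":
    c = read_rdma_counters()
    for name in ("out_of_sequence", "local_ack_timeout_err", "packet_seq_err"):
        print(name, c.get(name, "n/a"))
```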

EVPN into IXP – ESNOG

From ESNOG (although I am subscribed, I don't really receive notifications…) I saw this presentation, and then I found the main video reference.

BUM: 1.5Mbps -> that goes to all customer ports!! Small devices can’t cope with that.

Rate-limiting for BUM at ingress/egress

Originally the IXP was VPLS-based. Issues with long-lasting flows / long-lived MACs.

Solution: EVPN + ProxyARP/ND Agent.
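
My mental model of why the proxy agent helps, as a toy sketch: once the PE knows the MAC/IP bindings (EVPN type-2 routes), it can answer ARP locally instead of flooding the broadcast to every customer port. All names and numbers here are made up:

```python
# Conceptual sketch of EVPN ARP suppression at an IXP PE (my illustration).
proxy_arp_table = {                       # learned from EVPN MAC/IP advertisements
    "192.0.2.10": "00:11:22:33:44:55",
    "192.0.2.20": "66:77:88:99:aa:bb",
}

def handle_arp_request(target_ip, customer_ports):
    mac = proxy_arp_table.get(target_ip)
    if mac:
        return f"answer locally with {mac} (no flooding)"
    # Unknown target: fall back to flooding, which is where BUM rate-limiting matters.
    return f"flood to {len(customer_ports)} customer ports"

print(handle_arp_request("192.0.2.10", range(300)))   # suppressed
print(handle_arp_request("192.0.2.99", range(300)))   # flooded (rate-limited)
```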

I know EVPN is mainly for datacenters, but the move to IXPs is interesting. Although I remember reading about EVPN on the Cisco IOS-XR platform, which is mainly an ISP device.

GEO-LEO-Starlink

From this blog, I could read an interesting presentation about network performance using satellite services. I guess there are many blogs about the performance of Starlink, but this was the first time I read something about it. I was surprised the results were not that good, even with the latest version of Starlink (which I didn't know about either).

Falcon

I checked out this blog note from Google about Falcon. To be honest, I don't really understand the “implementation”. Is it purely software? Does it interact with merchant Ethernet silicon ASICs? There is so much happening in trying to get Ethernet closer to Infiniband that I wonder how this fits outside Google's infra. At the end of the day, Ethernet has been successful because everybody could use it for nearly anything. Just my opinion.

Ring Memory

Reading this news I was surprised by the mentioned paper, where LLMs can take up to millions of tokens. My knowledge of LLM infrastructure is very limited (and the paper is a bit beyond me…), but I thought the implementation of these models followed a kind of “chain/waterfall” where the output of some GPUs fed other GPUs.

LLM: hardware connection

Good article about LLMs from the hardware/networks perspective. I liked that it wasn't a show-off of Juniper products, as I haven't seen any mention of Juniper kit in LLM deployments at cloud providers, hyperscalers, etc. The points about Infiniband (the comment at the end about the misconceptions of IB is funny) and Ethernet were not new, but I liked the VOQ reference.

Still, as a network engineer, I feel I am missing something about how to make the best network deployment for training LLMs.

Infiniband Essentials

NVIDIA provides this course for free, although I am surprised that there is not much “free” documentation about this technology. I wish they followed the same path as most networking vendors, who want you to learn their technology without many barriers. And it is quite pathetic that you can't really find books about it…

The course is very very high level and very very short. So I didn't become an Infiniband CCIE…

  • Intro to IB

— Elements of IB: IB switch, Subnet Manager (it is like an SDN controller), hosts (clients), adapters (NICs), gateways (convert IB <> Ethernet) and IB routers.

  • Key features

— Simplified mgmt: thanks to the Subnet Manager

— High BW: up to 400G

— CPU offload: RDMA, bypassing the OS.

— Ultra low latency: 1us host to host.

— Network scale-out: 48k nodes in a single subnet. You can connect subnets using IB router.

— QoS: achieves lossless flows.

— Fabric resilience: fast re-routing at the switch level takes 1ms, compared with 5s when relying on the Subnet Manager => self-healing

— Optimal load-balancing: using AR (adaptive routing). Rebalance packets and flows.

— MPI super performance (SHARP – Scalable Hierarchical Aggregation and Reduction Protocol): off-loads collective operations from the CPU/GPU to the switches -> fewer transmissions from the end hosts -> less data sent. I don't really understand this (see the toy sketch after this list).

— Variety of supported topologies: fat-tree, dragonfly+, torus, hypercube and HyperX.
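
Here is how I picture SHARP after reading around a bit; a toy sketch (my own illustration, not NVIDIA code): the switch itself aggregates the contributions, so each host injects its tensor once instead of shuffling partial sums host-to-host as a ring/tree all-reduce does.

```python
# Toy model of in-network reduction (SHARP-like behaviour, heavily simplified).
def switch_allreduce(host_tensors):
    """The 'switch' sums element-wise and sends the result back to every host."""
    reduced = [sum(vals) for vals in zip(*host_tensors)]   # aggregation in the fabric
    return [list(reduced) for _ in host_tensors]           # broadcast back down

hosts = [[1, 2, 3], [10, 20, 30], [100, 200, 300], [1000, 2000, 3000]]
print(switch_allreduce(hosts)[0])   # [1111, 2222, 3333]
# Each host transmitted its tensor exactly once. In a host-based ring all-reduce,
# each host sends roughly 2x its tensor in partial sums across the fabric instead.
```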

  • Architecture:

— Similar layers to the OSI model: application, transport, network, link and physical.

— In IB, applications connect directly to the NIC, bypassing the OS.

— Upper layer protocols:

— MPI: Message Passing Interface

— NCCL: NVIDIA Collective Communication Library

— iSER (iSCSI Extensions for RDMA): RDMA storage protocol.

— IPoIB: IP over IB

— Transport Layer: different from TCP/IP; it creates an end-to-end virtual channel between applications (source and destination), bypassing the OS at both ends.

— Network Layer: this is mainly for IB routers to connect IB subnets. Routers use the GID as the identifier for source and destination.

— Link Layer: each node is identified by a LID (local ID), managed by the Subnet Manager. The switch has a forwarding table mapping LID to port, generated by the Subnet Manager (see the sketch after this list). There is link-level flow control to provide lossless connections.

— Physical Layer: Support for copper (DAC) and optical (AOC) connectors.
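
A minimal sketch of the link-layer forwarding idea (my own illustration, the LID/port values are made up): the Subnet Manager discovers the topology, assigns every node a LID and programs a LID -> output-port table into each switch.

```python
# Toy LID forwarding table, as the Subnet Manager would program it into one switch.
linear_forwarding_table = {
    1: 1,    # LID 1 (an HCA directly attached) reachable via port 1
    2: 1,
    3: 5,    # LID 3 lives behind the uplink on port 5
    4: 7,
}

def forward(dlid):
    """Look up the egress port for a packet's destination LID."""
    port = linear_forwarding_table.get(dlid)
    if port is None:
        # No entry: the Subnet Manager would have to (re)program the table.
        raise ValueError(f"no route for LID {dlid}")
    return port

print(forward(3))   # -> port 5
```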

AI Supercomputer – NVLink

So NVIDIA has an AI supercomputer via this. Meta, Google and MS made comments about it. And based on this, it is a 24-rack setup using the 900 GBps NVLink-C2C interface, so no Ethernet and no Infiniband. Here there is a bit more info about NVLink:

NVLink Switch System forms a two-level, non-blocking, fat-tree NVLink fabric to fully connect 256 Grace Hopper Superchips in a DGX GH200 system. Every GPU in DGX GH200 can access the memory of other GPUs and extended GPU memory of all NVIDIA Grace CPUs at 900 GBps. 

This is the official page for NVLink, but only with the above did I understand that this is like a “new” switching infrastructure.
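
Back-of-envelope numbers for that fabric (my own arithmetic; only the 900 GBps per Superchip and the 256-chip count come from the quote above):

```python
# Aggregate and approximate bisection bandwidth of the DGX GH200 NVLink fabric.
superchips = 256
nvlink_bw_gBps = 900          # GB/s per Grace Hopper Superchip (from the quote)

aggregate = superchips * nvlink_bw_gBps
print(f"Aggregate injection BW: {aggregate / 1000:.1f} TB/s")      # 230.4 TB/s
# In a non-blocking two-level fat-tree, bisection bandwidth is roughly half of that:
print(f"Approx. bisection BW:   {aggregate / 2 / 1000:.1f} TB/s")  # 115.2 TB/s
```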

But it looks like if you want to connect those supercomputers together, you need to use Infiniband. And again, power/cooling is an important subject.