This is a conference about networks that I was interested in, and I finally got some emails with the presentations. They are mainly from Meta.
Meta’s Network Journey to Enable AI: video – second part interesting.
- AI fabric (backend: GPU to GPU) hanging from the DC fabric.
- SPC (Space, Power, Cooling)
- Fiber, Automation
- RDMA requires (lossless, low-latency, in-order) -> RoCEv2 (Ethernet) or IB
- Servers have 8x400G to the TOR; the TOR has 400G links to the spines
- 1x AI zone per DH (data hall). 1x DC has several DHs.
- Oversubscribed between zones, eBGP, ECMP.
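A quick back-of-the-envelope sketch of what "oversubscribed between zones" means in practice. Only the 8x400G-per-server figure comes from the notes above; the rack, uplink and inter-zone counts are invented purely for illustration.

```python
# Toy oversubscription math for an AI zone (illustrative numbers only;
# just the 8x400G per server comes from the talk notes).

def oversubscription(downstream_gbps: float, upstream_gbps: float) -> float:
    """How much traffic could arrive vs. the capacity available to forward it."""
    return downstream_gbps / upstream_gbps

# Hypothetical TOR: 2 servers x 8 x 400G down, 16 x 400G up -> non-blocking.
tor_down = 2 * 8 * 400
tor_up = 16 * 400
print(f"TOR: {oversubscription(tor_down, tor_up):.0f}:1")

# Hypothetical zone boundary: 128 x 400G facing the GPUs, 16 x 400G towards
# the DC fabric -> 8:1, which is why GPU-to-GPU traffic stays inside the zone.
zone_down = 128 * 400
zone_up = 16 * 400
print(f"Zone-to-DC: {oversubscription(zone_down, zone_up):.0f}:1")
```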
Scaling RoCE Networks for AI Training: video — Really really good.
- RDMA/IB has been used for a long time in research.
- Training: learning a new capability from existing data (focus of the video)
- Inference: Applying this capability to new data (real time)
- Distributed training for complex models. GPU to GPU sync -> High BW and low/predictable latency.
- RoCEv2 with (tuned) PFC/ECN. TE + ECMP (flow multiplexing)
- Oversubscription is fine at the spine (higher layer)
- Challenges: load balancing (elephant flows; see the toy sketch after this list), slow receivers/back pressure, packet loss from L1 issues (those flapping links, faulty optics, cables, etc. xD), debugging (finding job failures)
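The load-balancing point is easier to see with a toy example: RoCE traffic is a handful of large, long-lived flows, so plain 5-tuple ECMP hashing can land several elephants on the same uplink. The addresses, ports and uplink count below are made up; only the low-entropy problem itself comes from the talk.

```python
# Toy ECMP hashing over a low-entropy RoCE workload: few flows, nearly
# identical 5-tuples, so collisions on the same uplink are likely.
import hashlib
from collections import Counter

NUM_UPLINKS = 8

def ecmp_uplink(src: str, dst: str, sport: int, dport: int, proto: str = "UDP") -> int:
    """Pick one of NUM_UPLINKS by hashing the 5-tuple (toy hash, not a real ASIC's)."""
    key = f"{src}|{dst}|{sport}|{dport}|{proto}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % NUM_UPLINKS

# 8 GPUs on one host talking to 8 peers: every flow is an elephant,
# and RoCEv2 pins the UDP destination port to 4791.
flows = [(f"10.0.0.{i}", f"10.0.1.{i}", 49152 + i, 4791) for i in range(8)]
load = Counter(ecmp_uplink(*f) for f in flows)
print("flows per uplink:", dict(load))  # often several flows share an uplink while others sit idle
```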
Traffic Engineering for AI Training Networks: video – both parts interesting.
- Non-blocking. RTSW=TOR. CTSW=Spine. Fat-Tree Architecture. 2xServer per rack. 1xserver=8xGPU. CTSW=16 downlinks -> 16 uplinks. Up to 208 racks?
- RoCE since 2020. CTSWs are high-radix, deep-buffer switches.
- AI workload challenges: low entropy (repetitive, predictable flows), bursty, high-intensity elephant flows.
- SW-based TE: dynamic routing adapted in real time. Adaptive job placement. Stateless controller (see the sketch after this list)
- Data plane: overlay (features from Broadcom chips) and underlay (BGP)
- Flow granularity: NIC-to-host flow.
- Handle network failures with minimal convergence time. Backdoor channel with an in-house protocol.
- Simulation platform. NCCL benchmark.
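A minimal sketch of the idea behind software-based TE at NIC-to-host flow granularity, assuming a simple greedy least-loaded placement. This is not Meta's controller (their data plane is an overlay on Broadcom features with BGP underneath); it only illustrates recomputing the flow-to-path mapping from fresh state instead of trusting ECMP hashing. Flow names, demands and switch names are hypothetical.

```python
# Conceptual, stateless TE pass: given the current set of (elephant) flows
# and the available uplinks, recompute the whole mapping greedily.
from dataclasses import dataclass

@dataclass
class Flow:
    src_nic: str
    dst_host: str
    gbps: float  # expected demand, e.g. derived from the job placement

def place_flows(flows: list[Flow], uplinks: list[str]) -> dict[tuple[str, str], str]:
    """Greedy least-loaded assignment; rerun on every failure/topology event."""
    load = {u: 0.0 for u in uplinks}
    mapping: dict[tuple[str, str], str] = {}
    for flow in sorted(flows, key=lambda f: f.gbps, reverse=True):
        target = min(load, key=load.get)      # least-loaded uplink right now
        mapping[(flow.src_nic, flow.dst_host)] = target
        load[target] += flow.gbps
    return mapping

flows = [Flow("nic0", "hostA", 400), Flow("nic1", "hostB", 400),
         Flow("nic2", "hostC", 400), Flow("nic3", "hostA", 200)]
print(place_flows(flows, ["ctsw1", "ctsw2", "ctsw3"]))
```

Because the mapping is recomputed from scratch on each pass, the controller does not need to keep state between runs, which matches the "stateless controller" note above.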
Networking for GenAI Training and Inference Clusters: video – super good!
- Recommendation model: training ~100 GFLOPs/iteration; inference: a few GFLOPs/s for 100 ms latency.
- LLM: training ~1 PetaFLOP/sentence (3 orders of magnitude > recommendation); inference: ~10 PF/s for 1 s time-to-first-token. +10k GPUs for training. Distributed inference. Needs compute too.
- Llama2 70B parameters -> 1.7M GPU-hours. IB 200G per GPU, 51.2 TB/s bisection BW. 800 ZettaFLOPs. 2-trillion-token dataset. 2k A100 GPUs. RoCEv2 was also used (Llama2 34B).
- +30 ExaFLOPs (30% of H100 GPUs' fp8 peak); LLaMA 65B training in < 1 day.
- Massive cluster: 32k GPUs! Model Parallelism.
- LLM inference: dual-edge problem. Prefill large messages (High BW) + Decode small messages (latency sensitive).
- Scale out (-bw, large domain. Scalable RDMA (IB or Ethernet), data parallel traffic) + Scale up (+BW, smaller domain. NVLink 400G, model parallel traffic)
- 32k GPUs. TOR (252), Spine (18), AGG (18). 3 levels. Oversubscription spine-agg 7:1. 8 clusters. 252 racks per cluster. 16 GPUs per rack (8x252x16 = ~32k GPUs; worked numbers after this list). RoCEv2!
- Model parallelism is harder for computation. Model-parallel traffic: all-reduce/all-to-all, big messages (inside the cluster = scale-up, NVLink). Data-parallel traffic: all-gather & reduce-scatter (between clusters = scale-out).
- Challenges: latency matters more than for ranking workloads. Reliability!!!!!
- LLM inference needs a fabric.
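Quick arithmetic on the cluster numbers above; only the figures quoted in the notes are used, everything else (per-link speeds, exact radix) is left out.

```python
# Sanity check of the "32k GPUs" cluster arithmetic from the talk notes.
clusters = 8
racks_per_cluster = 252
gpus_per_rack = 16

total_gpus = clusters * racks_per_cluster * gpus_per_rack
print(total_gpus)  # 32256, i.e. "~32k GPUs"

# 7:1 spine->agg oversubscription: only ~1/7 of the bandwidth a cluster can
# generate can cross the aggregation layer at full rate, so the chatty
# model-parallel collectives (all-reduce / all-to-all) stay inside a cluster
# and only data-parallel traffic (all-gather / reduce-scatter) goes across.
oversub = 7
print(f"cross-cluster capacity vs. in-cluster demand: 1/{oversub}")
```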
Scale out vs scale up (e.g., a storage DB):
- Scale up (vertical): more BW (links), more storage, etc.
- Scale out (horizontal): distribute the load across different devices.
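A toy contrast between the two for the storage-DB example, assuming a hypothetical key-value store; all names and numbers are invented.

```python
# Scale up vs. scale out, in miniature.
import hashlib

# Scale up (vertical): the same single node, just bigger.
node_capacity_tb = 100
node_capacity_tb *= 2          # e.g. add disks / faster links to that one box

# Scale out (horizontal): spread the keys (and the load) across more devices.
def shard_for(key: str, num_nodes: int) -> int:
    """Hash-partition a key onto one of num_nodes nodes."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % num_nodes

keys = ["user:1", "user:2", "user:3", "user:4", "user:5"]
print({k: shard_for(k, 4) for k in keys})
```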
Network Observability for AI/HPC Training Workflows: video