Cloudflare backbone 2024, Cisco AI, Huawei Cloud Monitor, LeetCode, Alibaba HPN, Altman UBI, xAI 100k GPU, CrowdStrike RCA, GitHub deleted data, DGX SuperPod, how SSH works, Grace Hopper Nvidia

Cloudflare backbone 2024: Everything is very high level. A 500% backbone capacity increase since 2021, and use of MPLS + SR-TE. It would be interesting to see how they operate/automate that many PoPs.

Cisco AI: “three of the top four hyperscalers deploying our Ethernet AI fabric”. I assume that is Google, Microsoft, and Meta? AWS is the fourth and the biggest.

Huawei Cloud Monitor: I haven’t read the RD-Probe paper yet. I would expect a git repo with the code 🙂 It also refers to an AWS pdf and video.

Automated LeetCode: One day I should have time to use it and learn more programming, although AI can solve the problems quicker than me 🙂

Alibaba Cloud HPN: LinkedIn, paper, AIDC material

LLM Traffic Pattern: periodic bursty flows, and few of them, which makes load balancing harder

Training is sensitive to failures: GPU, link, switch, etc.

Limitations of Traditional Clos: ECMP (hash polarization causing load imbalance) and the single ToR as a SPOF (see the sketch below)
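Not from the paper, just a toy illustration of the ECMP problem: a Python sketch hashing flows onto 8 equal-cost uplinks. With thousands of small flows the load evens out; with the handful of elephant flows typical of LLM training, some links saturate while others stay idle. The synthetic 5-tuples and the CRC hash are stand-ins for whatever a real ASIC uses.

```python
import random
import zlib
from collections import Counter

def ecmp_link(five_tuple, n_links):
    # Stand-in for a switch ASIC's deterministic 5-tuple hash.
    return zlib.crc32(repr(five_tuple).encode()) % n_links

def link_loads(n_flows, n_links=8, total_gbps=3200):
    random.seed(1)
    # Synthetic 5-tuples; 4791 is the RoCEv2 UDP port.
    flows = [(f"10.0.0.{random.randrange(256)}",
              f"10.0.1.{random.randrange(256)}",
              random.randrange(1024, 65536), 4791, "UDP")
             for _ in range(n_flows)]
    per_flow = total_gbps / n_flows
    load = Counter()
    for f in flows:
        load[ecmp_link(f, n_links)] += per_flow
    return [round(load.get(i, 0)) for i in range(n_links)]

print("10000 mice flows: ", link_loads(10000))  # roughly even
print("16 elephant flows:", link_loads(16))     # badly skewed
```

With only 16 flows, a per-flow hash almost never lands exactly two on each of the 8 links, so some uplinks carry several times their fair share while others carry nothing.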

HPN goals:

- Scalability: up to 100k GPUs

- Performance: low latency (minimum number of hops) and maximum network utilization

- Reliability: use dual ToRs with LACP from the host.

Tier1

– Use single-chip 51.2 Tbps switches; they are more reliable than multi-chip chassis. Dual ToR.

– 1k GPUs in a segment (akin to an NVLink domain), with a rail-optimized network. My back-of-envelope for this is sketched below.
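My back-of-envelope for the segment size, assuming (as I read the paper) a 2×200GbE NIC per GPU with one leg to each ToR, one rail per GPU position, and a non-oversubscribed tier 1; the port split is my own reading, not a quoted figure:

```python
# Back-of-envelope for an HPN tier-1 segment. The 51.2 Tbps single chip,
# dual ToR, and 1k-GPU segment are from the paper; the 2x200G-per-GPU
# split and the 1:1 down/up port split are my assumptions.
chip_tbps = 51.2
port_gbps = 400
ports = int(chip_tbps * 1000 / port_gbps)          # 128 x 400G ports per chip

gpus_per_host = 8                                  # one rail per GPU position
gbps_per_gpu_per_tor = 200                         # 2x200G NIC, one leg per ToR

down_gbps = (ports // 2) * port_gbps               # half the chip toward hosts
gpus_per_rail = down_gbps // gbps_per_gpu_per_tor  # 25600 / 200 = 128
segment_gpus = gpus_per_rail * gpus_per_host       # 128 * 8 = 1024

print(f"{ports} ports/chip, {gpus_per_rail} GPUs/rail, {segment_gpus} GPUs/segment")
```

That lands exactly on the 1k GPUs per segment the paper advertises.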

Tier2: eliminates load imbalance using a dual-plane design. This tier is oversubscribed.

Tier3: connects several pods and can reach 100k GPUs. The front-end network is kept independent.

Altman Universal Basic Income Study: It doesn’t fix all problems, but in my opinion it helps, and it is a good direction.

xAI 100k GPU cluster: 100k liquid-cooled H100s on a single RDMA fabric. It looks like Supermicro was involved for the servers and Juniper only for the front-end network; NVIDIA provides all the Ethernet switches (Spectrum-4). Very interesting. There is confirmation from NVIDIA that Spectrum (i.e. Ethernet) is used, plus more details in a video.

CrowdStrike RCA:

GitHub access to deleted data: I didn’t know about this. Interesting and scary.

Nvidia DGX SuperPod: reference architecture and video. One pod is 32 racks with 4 DGX each (128 systems × 8 = 1024 GPUs per pod), two separate IB fabrics (compute + storage), fat tree, rail-optimized, liquid cooling. 32k GPUs fill a DC. Quick arithmetic check below.
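A quick sanity check of those numbers (my own arithmetic, matching the figures above):

```python
dgx_per_rack = 4                       # DGX systems per compute rack
racks_per_pod = 32
nodes = dgx_per_rack * racks_per_pod   # 128 DGX systems per pod
gpus_per_node = 8                      # H100s per DGX
pod_gpus = nodes * gpus_per_node       # 1024 GPUs per pod
pods_per_dc = 32
print(nodes, pod_gpus, pods_per_dc * pod_gpus)  # 128 1024 32768
```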

How SSH works: So powerful, and I am still so clueless about it. A small sketch below to demystify it for myself.
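A minimal Python sketch using the paramiko library, with comments mapping each step to the protocol phases; the host, user, and key path are placeholders.

```python
import os
import paramiko

# Transport layer: over the TCP connection, client and server exchange
# version strings, negotiate algorithms, and run a key exchange that
# derives the session keys; everything after this point is encrypted.
client = paramiko.SSHClient()

# Host key verification: normally the server's public key is checked
# against known_hosts; auto-adding (as here) trades MITM protection
# for convenience, so don't do this in production.
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())

# Authentication layer: the user proves their identity over the encrypted
# transport, here with a private key (password auth is the other common
# method).
client.connect("example.com", username="me",
               key_filename=os.path.expanduser("~/.ssh/id_ed25519"))

# Connection layer: many channels are multiplexed over one transport;
# exec_command opens a "session" channel and runs a command on it.
stdin, stdout, stderr = client.exec_command("uname -a")
print(stdout.read().decode())

client.close()
```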

Chips and Cheese GH200: Nice analysis of the Nvidia Grace CPU (Arm Neoverse) and the Hopper H100 GPU