Today I Learned
From this blog, I found some interesting links. I didn't listen to the podcast though.
Awesome Connective Repo: where I found this map about submarine cables (it rings a bell, I think I have seen it before) and this map about CDNs.
From this blog, I could read an interesting presentation about network performance using satellite services. I guess there are many blogs about the performance of Starlink, but this was the first time I read something about it. I was surprised the results were not that good, even with the latest version of Starlink (which I didn't know about either).
I checked out this blog note from Google about Falcon. To be honest, I don't really understand the “implementation”. Is it purely software? Does it interact with merchant Ethernet silicon ASICs? There is so much happening trying to get Ethernet on par with InfiniBand that I wonder how this fits outside Google's infra. At the end of the day, Ethernet has been successful because everybody could use it for nearly anything. Just my opinion.
Reading this news I was surprised by the mentioned paper where an LLM can take up to millions of tokens. My knowledge of LLM infrastructure is very limited (and the paper is a bit beyond me…) but I thought the implementation of these models followed a kind of “chain/waterfall”, where the output of some GPUs feeds other GPUs.
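My mental picture of that “chain/waterfall” is basically pipeline parallelism: the model is split into stages and the activations produced on one GPU feed the next one. A minimal sketch of the idea (assuming PyTorch and two CUDA devices; the layer sizes and names are made up):

```python
# Minimal pipeline-parallelism sketch: the model is split across two GPUs
# and the output of the first stage feeds the second one.
# Assumes PyTorch and two CUDA devices; sizes are arbitrary.
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Stage 1 lives on GPU 0, stage 2 on GPU 1.
        self.stage1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.stage2 = nn.Sequential(nn.Linear(4096, 1024)).to("cuda:1")

    def forward(self, x):
        x = self.stage1(x.to("cuda:0"))
        # Activations cross the interconnect (NVLink / PCIe / network) here.
        x = self.stage2(x.to("cuda:1"))
        return x

model = TwoStageModel()
out = model(torch.randn(8, 1024))
print(out.shape)  # torch.Size([8, 1024]), lives on cuda:1
```

In real deployments the stages are kept busy with micro-batches and combined with tensor/data parallelism, but the basic “output of some GPUs feeds other GPUs” shape is this.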
Good article about LLMs from the hardware/network perspective. I liked that it wasn't a show-off of Juniper products, as I haven't seen any mention of Juniper kit in LLM deployments at cloud providers, hyperscalers, etc. The points about InfiniBand (the comment at the end about the misconceptions of IB is funny) and Ethernet were not new, but I liked the VOQ reference.
Still, as a network engineer, I feel I am missing something about how to build the best network deployment for training LLMs.
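Since the article mentions VOQ (virtual output queueing), here is a toy comparison I wrote to convince myself of what it buys you: one queue per output instead of one FIFO per input, so a packet heading to a congested output does not block the packets behind it (head-of-line blocking). This is only a sketch of the concept, not how any real ASIC implements it:

```python
# Toy comparison: a single FIFO per input vs virtual output queues (VOQs).
from collections import deque

NUM_OUTPUTS = 4

voqs = [deque() for _ in range(NUM_OUTPUTS)]   # one queue per output (VOQ)
fifo = deque()                                 # same traffic, single FIFO

def arrive(out_port, packet):
    voqs[out_port].append(packet)
    fifo.append((out_port, packet))

def voq_send_one(blocked_outputs):
    """Send one packet from any VOQ whose output is not backpressured."""
    for out_port, q in enumerate(voqs):
        if q and out_port not in blocked_outputs:
            return out_port, q.popleft()
    return None

def fifo_send_one(blocked_outputs):
    """A plain FIFO can only look at its head: if that packet's output is
    backpressured, everything behind it waits (head-of-line blocking)."""
    if fifo and fifo[0][0] not in blocked_outputs:
        return fifo.popleft()
    return None

# Two packets arrive: the first one goes to output 1, which is congested.
arrive(1, "to-out1")
arrive(2, "to-out2")
blocked = {1}
print("FIFO:", fifo_send_one(blocked))   # None -> to-out2 is stuck behind to-out1
print("VOQ :", voq_send_one(blocked))    # (2, 'to-out2') -> no HOL blocking
```

A real switch runs a proper matching algorithm across all inputs and outputs (something like iSLIP), but the head-of-line point is the same.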
NVIDIA provides this course for free, although I am surprised that there is not much “free” documentation about this technology. I wish they followed the same path as most networking vendors, who want you to learn their technology without many barriers. And it is quite pathetic that you can't really find books about it…
The course is very, very high level and very, very short. So I didn't become an InfiniBand CCIE…
— Elements of IB: IB switches, the Subnet Manager (it is like an SDN controller), hosts (clients), adapters (NICs), gateways (converting IB <> Ethernet) and IB routers.
— Simplified mgmt: thanks to the Subnet Manager.
— High bw: up to 400G.
— CPU offload: RDMA, bypassing the OS.
— Ultra-low latency: 1us host to host.
— Network scale-out: 48k nodes in a single subnet. You can connect subnets using IB routers.
— QoS: achieves lossless flows.
— Fabric resilience: Fast-ReRouting at the switch level takes 1ms compared with 5s using the Traffic Manager => Self-Healing.
— Optimal load balancing: using AR (adaptive routing). Rebalances packets and flows.
— MPI super performance (SHARP – Scalable Hierarchical Aggregation and Reduction Protocol): off-loads operations from CPUs/GPUs to the switches -> fewer retransmissions from end hosts -> less data sent. I don't really understand this yet.
— Variety of supported topologies: fat-tree, dragonfly+, torus, hypercube and HyperX.
— Similar layers to the OSI model: application, transport, network, link and physical.
— In IB, applications talk directly to the NIC, bypassing the OS.
— Upper layer protocols:
— MPI: Message Passing Interface
— NCCL: NVIDIA Collective Communication Library
— iSER: RDMA storage protocol.
— IPoIB: IP over IB
— Transport Layer: different from TCP/IP, it creates an end-to-end virtual channel between applications (source and destination), bypassing the OS at both ends.
— Network Layer: this is mainly at IB routers to connect IB subnets. Routers use GIDs as identifiers for source and destination.
— Link Layer: each node is identified by a LID (local ID), managed by the Subnet Manager. Each switch has a forwarding table with “port vs LID” <- generated by the Subnet Manager (see the toy sketch after this list). There is flow control to provide lossless connections.
— Physical Layer: support for copper (DAC) and optical (AOC) cables.
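To make the Subnet Manager / LID part more concrete for myself, here is the toy sketch mentioned above (my own simplification, not how OpenSM actually works): the SM knows the whole fabric, assigns a LID to every host and pushes a “LID -> output port” forwarding table to each switch, computed from shortest paths.

```python
# Toy "Subnet Manager" sketch: given the fabric topology, assign LIDs to the
# hosts and compute each switch's LID -> output-port table from shortest paths.
from collections import deque

# topology[node] = {local_port: neighbor_node}  (hypothetical two-leaf fabric)
topology = {
    "leaf1":  {1: "hostA", 2: "hostB", 3: "spine1"},
    "leaf2":  {1: "hostC", 2: "hostD", 3: "spine1"},
    "spine1": {1: "leaf1", 2: "leaf2"},
}
hosts = ["hostA", "hostB", "hostC", "hostD"]

# 1. The SM assigns a LID to each host.
lid_of = {host: lid for lid, host in enumerate(hosts, start=1)}

def next_hop_port(switch, target):
    """BFS from `switch` to `target`; return the local output port to use."""
    queue = deque((nbr, port) for port, nbr in topology[switch].items())
    seen = {switch}
    while queue:
        node, first_port = queue.popleft()
        if node == target:
            return first_port
        if node in seen or node not in topology:   # hosts are leaves
            continue
        seen.add(node)
        queue.extend((nbr, first_port) for nbr in topology[node].values())
    return None

# 2. The SM programs every switch with a "LID vs port" table.
forwarding_tables = {
    switch: {lid_of[h]: next_hop_port(switch, h) for h in hosts}
    for switch in topology
}
print(forwarding_tables["spine1"])   # {1: 1, 2: 1, 3: 2, 4: 2}
```

Written this way, the “Subnet Manager is like an SDN controller” analogy from the course makes more sense to me: the switches don't run a distributed routing protocol, they just get their tables programmed.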
So NVIDIA has an AI supercomputer via this. Meta, Google and MS are making comments about it. And based on this, it is a 24-rack setup using the 900GBps NVLink-C2C interface, so no Ethernet and no InfiniBand. Here there is a bit more info about NVLink:
NVLink Switch System forms a two-level, non-blocking, fat-tree NVLink fabric to fully connect 256 Grace Hopper Superchips in a DGX GH200 system. Every GPU in DGX GH200 can access the memory of other GPUs and extended GPU memory of all NVIDIA Grace CPUs at 900 GBps.
This is the official page for NVLink, but only with the above did I understand that this is like a “new” switching infrastructure.
But it looks like if you want to connect up those supercomputers, you need to use InfiniBand. And again, power/cooling is an important subject.
Jericho3 is the new chip from Broadcom to take on NVIDIA InfiniBand. From that article, I don't really understand the “Ramon3” fabric. It seems it can support 18 ports at 800G (based on 144 SerDes at 100G). It has 160 SerDes (16 Tbps) for uplinks to Ramon3. The goal is to reduce the time the nodes wait on the network, so it is not just (port-to-port) latency. Based on Broadcom's testing, swapping a 200Gb InfiniBand switch for a Jericho3 is 10% better. As well, I don't understand what they mean by “perfect load balancing”, because the flow size matters (from my point of view), nor by “congestion free”. Having this working at scale… looks interesting…
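Just to sanity-check the arithmetic from that article (my own back-of-the-envelope, I could be reading it wrong):

```python
# Quick sanity check of the Jericho3 numbers mentioned in the article.
serdes_speed_gbps = 100

downlink_serdes = 144
downlink_capacity = downlink_serdes * serdes_speed_gbps       # 14,400 Gbps
ports_800g = downlink_capacity // 800                         # 800G ports
print(ports_800g, "x 800G downlink ports")                    # 18

uplink_serdes = 160
uplink_capacity = uplink_serdes * serdes_speed_gbps           # 16,000 Gbps
print(uplink_capacity / 1000, "Tbps towards the Ramon3 fabric")  # 16.0
```

If I read it right, the fabric side (16 Tbps) has a bit more capacity than the port side (14.4 Tbps), which I guess is part of the “congestion free” claim.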
But then we have the answer from NVIDIA: Spectrum-X. It is Spectrum-4 switches with BlueField-3 DPUs and software optimization. This is an Ethernet platform. Spectrum-4 looks very impressive, definitely. But this sentence puzzles me: “The world's top hyperscalers are adopting NVIDIA Spectrum-X, including industry-leading cloud innovators.” Most of the links I have been reading lately say that Azure, Meta and Google are using InfiniBand. Now NVIDIA says top hyperscalers are adopting Spectrum-X, when Spectrum-4 only started shipping this quarter?
And finally, why is NVIDIA pushing both Ethernet and InfiniBand? I think this is a good link for that. According to NVIDIA's CEO, InfiniBand is great and nearly “free” if you build for very specific applications (supercomputers, etc.). But for multi-tenant environments you want Ethernet. So that kind of explains why hyperscalers like AWS, GCP and Azure want Ethernet at the end of the day, at least for customer access. If you have just one (commodity) network, it is cheaper and easier to run/maintain, and you don't have vendor lock-in like with IB.
We will see what happens with all this crazy AI/LLM/ML stuff.
Reading different articles (1, 2, 3) I was made aware of this new CPU-GPU-HBM3 architecture from AMD.
As well, Meta has a new DC design for ML/AI using NVIDIA and InfiniBand.
Now, Meta – working with Nvidia, Penguin Computing and Pure Storage – has completed the second phase of the RSC. The full system includes 2,000 DGX A100 systems, totaling a staggering 16,000 A100 GPUs. Each node has dual AMD Epyc “Rome” CPUs and 2TB of memory. The RSC has up to half an exabyte of storage and, according to Meta, one of the largest known flat InfiniBand fabrics in the world, with 48,000 links and 2,000 switches. (“AI training at scale is nothing if we cannot supply the data fast enough to the GPUs, right?” said Kalyan Saladi – a software engineer at Meta – in a presentation at the event.)
And again, cooling is critical.
I haven't played much with ChatGPT, but my first question was “how is the network infrastructure for building something like ChatGPT?” or something similar. Obviously I didn't get the answer I was looking for, and I don't think I asked properly either.
Today I came across this video, and at 3:30 something very interesting starts. As this is an official video, it says the OpenAI cluster built in 2020 for ChatGPT was actually based on 285k AMD CPUs connected with InfiniBand plus 10k V100 GPUs connected with InfiniBand. They don't mention more lower-level details, but it looks like two separate networks? And I have seen in several other pages/videos that M$ is hardcore into InfiniBand.
Then, regarding InfiniBand architectures, it seems the most common are “fat-tree” and “dragonfly”. This video is quite good, although I have to watch it again (or more times) to fully understand it.
This blog, this pdf and Wikipedia (high level) are good for learning about “Fat-Tree”.
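The part that finally stuck with me from those references is the k-ary fat-tree arithmetic (the classic Clos-based design): with k-port switches you get k pods, each with k/2 edge and k/2 aggregation switches, plus (k/2)^2 core switches, supporting k^3/4 hosts at full bisection bandwidth. A quick sketch to play with the numbers:

```python
# k-ary fat-tree sizing, just to play with the numbers.
def fat_tree(k):
    """k must be even: k-port switches everywhere."""
    assert k % 2 == 0
    edge_per_pod = agg_per_pod = k // 2
    pods = k
    core = (k // 2) ** 2
    hosts = (k ** 3) // 4                      # k pods * k/2 edge * k/2 hosts
    switches = pods * (edge_per_pod + agg_per_pod) + core
    return {"k": k, "hosts": hosts, "switches": switches, "core": core}

for k in (4, 16, 64):
    print(fat_tree(k))
# k=4  ->    16 hosts,   20 switches
# k=16 ->  1024 hosts,  320 switches
# k=64 -> 65536 hosts, 5120 switches
```

The k=64 row is roughly the “tens of thousands of nodes in one subnet” scale that the InfiniBand course mentioned.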
Although most of the info I found is “old”, these technologies are not old. Frontier and, it looks like, most supercomputers use them.