EVPN into IXP – ESNOG

From ESNOG (although I am subscribed, I don't really receive notifications…) I saw this presentation, and then I found the main video reference.

BUM traffic: 1.5 Mbps -> and that goes to all customer ports!! Small devices can't cope with that.

Rate-limiting BUM traffic on ingress/egress.

Originally the IXP fabric was VPLS. Issues with long-lasting flows / long-lived MAC entries.

Solution: EVPN + a Proxy-ARP/ND agent.
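
To picture what that proxy agent buys you, here is a minimal Python sketch of the idea (my own toy code, not anything from the presentation or a vendor): the PE learns IP-to-MAC bindings from EVPN MAC/IP routes and answers ARP locally instead of flooding the request as BUM to every member port.

```python
# Toy proxy-ARP agent at an EVPN PE (illustrative only, not a real implementation).
# Bindings would normally be learned from EVPN Type-2 (MAC/IP advertisement) routes.

class ProxyArpAgent:
    def __init__(self):
        self.bindings = {}  # ip -> mac, populated from EVPN MAC/IP routes

    def learn_from_evpn(self, ip, mac):
        """Install a binding received via a Type-2 route."""
        self.bindings[ip] = mac

    def handle_arp_request(self, target_ip, requester):
        """Answer locally if we know the target; otherwise the request
        would be flooded as BUM (exactly what we want to avoid)."""
        mac = self.bindings.get(target_ip)
        if mac:
            return f"ARP reply to {requester}: {target_ip} is-at {mac}"
        return f"unknown {target_ip}: flood as BUM (rate-limited)"


agent = ProxyArpAgent()
agent.learn_from_evpn("192.0.2.10", "aa:bb:cc:00:00:10")
print(agent.handle_arp_request("192.0.2.10", "192.0.2.20"))  # answered locally, no flooding
print(agent.handle_arp_request("192.0.2.99", "192.0.2.20"))  # would still need BUM
```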

I know EVPN is mainly used in datacenters, so the move to IXPs is interesting. Although I remember reading about EVPN on the Cisco IOS-XR platform, which is mainly an ISP device.

GEO-LEO-Starlink

From this blog, I could read an interesting presentation about network performance using satellite services. I guess there are many blogs about the performance of Starlink, but this was the first time I read something about it. I was surprised the results were not that good, even with the latest version of Starlink (which I didn't even know about).

Falcon

I checked out this blog note from Google about Falcon. To be honest, I don't really understand the “implementation”. Is it purely software? Does it interact with merchant Ethernet silicon ASICs? There is so much happening trying to get Ethernet closer to Infiniband that I wonder how this fits outside Google's infra. At the end of the day, Ethernet has been successful because everybody could use it for nearly anything. Just my opinion.

Ring Memory

Reading this news, I was surprised by the mentioned paper where an LLM can take up to millions of tokens of context. My knowledge of LLM infrastructure is very limited (and the paper is a bit beyond me…), but I thought the implementation of these models followed a kind of “chain/waterfall”, where the output of some GPUs fed other GPUs.
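
If the paper is the ring-style approach I think it is, the trick is not a chain/waterfall pipeline but a ring: each GPU holds only one block of the huge context and the blocks rotate around the ring, so every GPU eventually processes every block without any single GPU holding millions of tokens. A toy Python sketch of just the communication pattern (my own simplification, no real attention math):

```python
# Toy illustration of a ring communication pattern for long contexts
# (my own simplification; the real paper does blockwise attention on top of this).

num_devices = 4
# Each device starts with its own block of the context (here just labels).
blocks = [f"block{i}" for i in range(num_devices)]
seen = [[] for _ in range(num_devices)]

current = list(blocks)
for step in range(num_devices):
    # Every device "attends" to the block it currently holds...
    for dev in range(num_devices):
        seen[dev].append(current[dev])
    # ...then passes it to its neighbour (rotate the ring by one position).
    current = [current[(dev - 1) % num_devices] for dev in range(num_devices)]

for dev in range(num_devices):
    print(f"device {dev} processed: {seen[dev]}")
# Every device has seen every block, but never held more than one block at a time.
```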

LLM: hardware connection

Good article about LLMs from the hardware/networking perspective. I liked that it wasn't a show-off of Juniper products, as I haven't seen any mention of Juniper kit in LLM deployments at cloud providers, hyperscalers, etc. The points about Infiniband (the comment at the end about the misconceptions of IB is funny) and Ethernet were not new, but I liked the VOQ reference.
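
The VOQ (virtual output queue) idea itself is simple enough to sketch: instead of one FIFO per input port, where a packet heading to a congested output blocks the packets behind it, each input keeps a separate queue per output port. A toy Python sketch of the data structure (my own simplification, nothing vendor-specific):

```python
from collections import deque

# Toy VOQ structure: one queue per (input port, output port) pair,
# so a congested output cannot head-of-line-block traffic to other outputs.

NUM_INPUTS, NUM_OUTPUTS = 4, 4
voq = {(i, o): deque() for i in range(NUM_INPUTS) for o in range(NUM_OUTPUTS)}

def enqueue(in_port, out_port, packet):
    voq[(in_port, out_port)].append(packet)

def schedule(out_port):
    """Very naive scheduler: serve inputs round-robin for a given output."""
    for in_port in range(NUM_INPUTS):
        q = voq[(in_port, out_port)]
        if q:
            return q.popleft()
    return None

# Packets to a busy output 0 queue up, but a packet to output 1 is not stuck behind them.
enqueue(0, 0, "pkt-A"); enqueue(0, 0, "pkt-B"); enqueue(0, 1, "pkt-C")
print(schedule(1))  # -> pkt-C, served immediately despite the backlog towards output 0
```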

Still, as a network engineer, I feel I am missing something about how to build the best network deployment for training LLMs.

Infiniband Essentials

NVIDIA provides this course for free. Although I was surprised that there is not much “free” documentation about this technology. I wish they followed the same path as most networking vendors, who want you to learn their technology without many barriers. And it is quite pathetic that you can't really find books about it…

The course is very, very high level and very, very short. So I didn't become an Infiniband CCIE…

  • Intro to IB

— Elements of IB: IB switches, the Subnet Manager (it is like an SDN controller), hosts (clients), adapters (NICs), gateways (translating IB <-> Ethernet) and IB routers.

  • Key features

— Simplified management: thanks to the Subnet Manager.

— High bandwidth: up to 400G.

— CPU offload: RDMA, bypassing the OS.

— Ultra-low latency: ~1 µs host to host.

— Network scale-out: up to 48k nodes in a single subnet. Subnets can be connected using IB routers.

— QoS: achieves lossless flows.

— Fabric resilience: fast re-routing at switch level takes ~1 ms, compared with ~5 s via the Subnet Manager => self-healing.

— Optimal load balancing: using AR (adaptive routing). Rebalances packets and flows.

— MPI super performance (SHARP: Scalable Hierarchical Aggregation and Reduction Protocol): offloads collective operations from the CPU/GPU to the switches -> fewer transmissions from the end hosts -> less data on the wire. I don't really understand this yet (I try a toy sketch of the idea after these course notes).

— Variety of supported topologies: fat-tree, DragonFly+, torus, hypercube and HyperX.

  • Architecture:

— Similar layers to the OSI model: application, transport, network, link and physical.

— In IB, applications talk to the NIC directly, bypassing the OS.

— Upper layer protocols:

— MPI: Message Passing Interface

— NCCL: NVIDIA Collective Communication Library

— iSER (iSCSI Extensions for RDMA) and similar RDMA storage protocols.

— IPoIB: IP over IB

— Transport Layer: different from TCP/IP, it creates an end-to-end virtual channel between the source and destination applications, bypassing the OS on both ends.

— Network Layer: this mainly lives at the IB routers that connect IB subnets. Routers use the GID as the identifier for source and destination.

— Link Layer: each node is identified by a LID (Local ID), assigned by the Subnet Manager. The switch has a forwarding table mapping LID to port, generated by the Subnet Manager. Link-level flow control provides lossless connections.

— Physical Layer: support for copper (DAC) and optical (AOC) cables.
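
Coming back to the SHARP point above, my rough understanding is that the switch itself performs the reduction, so each host link only carries that host's data once plus the aggregated result, instead of a naive all-to-all exchange between hosts. A toy Python sketch of that intuition (my own numbers and simplification, not the real protocol):

```python
# Toy comparison: naive host-based exchange vs switch-side aggregation (SHARP-like).
# My own simplification to get the intuition; the real protocol is far richer.

num_hosts = 8
block_len = 1024                       # values each host contributes (e.g. a gradient block)

# Naive all-to-all: every host sends its block to the other N-1 hosts and receives
# N-1 blocks back, so each host link carries 2*(N-1) blocks.
naive_per_link = 2 * (num_hosts - 1) * block_len

# Switch aggregation: each host sends its block once, the switch reduces (sums) the
# blocks, and each host receives a single aggregated block back.
sharp_like_per_link = 2 * block_len

blocks = [[float(h)] * 4 for h in range(num_hosts)]   # tiny example blocks
reduced = [sum(vals) for vals in zip(*blocks)]        # the sum the switch would compute
print("aggregated block       :", reduced)              # [28.0, 28.0, 28.0, 28.0]
print("values per link, naive :", naive_per_link)       # 14336
print("values per link, SHARP :", sharp_like_per_link)  # 2048
```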

AI Supercomputer – NVLink

So NVIDIA has an AI supercomputer via this. Meta, Google and MS have made comments about it. And based on this, it is a 24-rack setup using the 900 GBps NVLink-C2C interface, so no Ethernet and no Infiniband. Here there is a bit more info about NVLink:

NVLink Switch System forms a two-level, non-blocking, fat-tree NVLink fabric to fully connect 256 Grace Hopper Superchips in a DGX GH200 system. Every GPU in DGX GH200 can access the memory of other GPUs and extended GPU memory of all NVIDIA Grace CPUs at 900 GBps. 

This is the official page for NVLink, but only with the above did I understand that this is like a “new” switching infrastructure.
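
Just to put that 900 GBps figure in context, a quick back-of-the-envelope comparison against a single 400 Gbps Infiniband NDR port (my own arithmetic, minding bits vs bytes):

```python
# Back-of-the-envelope comparison (my own arithmetic, mind GB/s vs Gb/s).
nvlink_c2c_GBps = 900                 # per the DGX GH200 description above
ib_ndr_Gbps = 400                     # one NDR Infiniband port
ib_ndr_GBps = ib_ndr_Gbps / 8         # bits -> bytes = 50 GB/s

print(f"NVLink-C2C : {nvlink_c2c_GBps} GB/s")
print(f"IB NDR port: {ib_ndr_GBps:.0f} GB/s")
print(f"Ratio      : {nvlink_c2c_GBps / ib_ndr_GBps:.0f}x")   # ~18x a single 400G port
```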

But it looks like if you want to connect up those supercomputers, you need to use Infiniband. And again, power/cooling is an important subject.

Jericho3 vs Infiniband

Jericho3 is the new chip from Broadcom to take on NVIDIA Infiniband. From that article, I don't really understand the “Ramon3” fabric. It seems it can support 18 ports at 800G (based on 144 SerDes at 100G). It has 160 SerDes (16 Tbps) for uplinks to Ramon3. The goal is to reduce the time the nodes wait on the network, so it is not just (port-to-port) latency. Based on Broadcom testing, swapping a 200Gb Infiniband switch for a Jericho3 is 10% better. As well, I don't understand what they mean by “perfect load balancing” (because flow size matters, from my point of view) and “congestion free”. Having this working at scale… looks interesting…
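
At least the SerDes numbers quoted in the article add up; a quick sanity check (my own arithmetic):

```python
# Sanity-checking the Jericho3 figures quoted above (my own arithmetic).
serdes_Gbps = 100

downlink_serdes = 144
downlink_total_Gbps = downlink_serdes * serdes_Gbps
print("downlink:", downlink_total_Gbps / 1000, "Tb/s ->",
      downlink_total_Gbps // 800, "ports at 800G")            # 14.4 Tb/s -> 18 ports

uplink_serdes = 160
uplink_total_Gbps = uplink_serdes * serdes_Gbps
print("uplink to Ramon3:", uplink_total_Gbps / 1000, "Tb/s")  # 16 Tb/s
```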

But then we have the answer from NVIDIA: Spectrum-X. It is Spectrum-4 switches with BlueField-3 DPUs and software optimization. This is an Ethernet platform. Spectrum-4 definitely looks very impressive. But this sentence puzzles me: “The world’s top hyperscalers are adopting NVIDIA Spectrum-X, including industry-leading cloud innovators.” Most of the links I have been reading lately say that Azure, Meta and Google are using Infiniband. Now NVIDIA says the top hyperscalers are adopting Spectrum-X, when Spectrum-4 only started shipping this quarter?

And finally, why is NVIDIA pushing both Ethernet and Infiniband? I think this is a good link for that. According to the NVIDIA CEO, Infiniband is great and nearly “free” if you build for a very specific application (supercomputers, etc.). But for multi-tenant environments, you want Ethernet. That kind of explains why hyperscalers like AWS, GCP and Azure want Ethernet at the end of the day, at least for customer access. If you have just one (commodity) network, it is cheaper and easier to run/maintain, and you don't have vendor lock-in like with IB.

We'll see what happens with all this crazy AI/LLM/ML stuff.

AMD MI300 + Meta DC

Reading different articles (1, 2, 3), I was made aware of this new CPU-GPU-HBM3 architecture from AMD.

As well, Meta has a new DC design for ML/AI using Nvidia and Infiniband.

Now, Meta – working with Nvidia, Penguin Computing and Pure Storage – has completed the second phase of the RSC. The full system includes 2,000 DGX A100 systems, totaling a staggering 16,000 A100 GPUs. Each node has dual AMD Epyc “Rome” CPUs and 2TB of memory. The RSC has up to half an exabyte of storage and, according to Meta, one of the largest known flat InfiniBand fabrics in the world, with 48,000 links and 2,000 switches. (“AI training at scale is nothing if we cannot supply the data fast enough to the GPUs, right?” said Kalyan Saladi – a software engineer at Meta – in a presentation at the event.)

And again, cooling is critical.