Jericho3-vs-Infiniband

Jericho3 is the new chip from Broadcom to take on NVIDIA's InfiniBand. From that article, I don't really understand the "Ramon3" fabric. It seems it can support 18 ports at 800G (based on 144 SerDes at 100G), and it has 160 SerDes (16 Tbps) for uplinks to Ramon3. The goal is to reduce the time the nodes wait on the network, so it is not just port-to-port latency. Based on Broadcom's testing, swapping a 200G InfiniBand switch for a Jericho3 gives about 10% better performance. As well, I don't understand what they mean by "perfect load balancing" (from my point of view, flow size matters) and "congestion free". Having this working at scale… looks interesting…
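
Quick back-of-the-envelope check of those numbers (my own arithmetic in Python, not anything from Broadcom's datasheet):

    # Jericho3-AI numbers as quoted above; all SerDes assumed to run at 100G
    downlink_serdes = 144
    uplink_serdes = 160

    downlink_gbps = downlink_serdes * 100       # 14,400 Gb/s towards the NICs
    ports_800g = downlink_gbps // 800           # -> 18 x 800G ports
    uplink_tbps = uplink_serdes * 100 / 1000    # -> 16 Tb/s of uplink towards Ramon3

    print(ports_800g, uplink_tbps)              # 18 16.0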

But then we have the answer from NVIDIA: Spectrum-X. It is Spectrum-4 switches with BlueField-3 DPUs and software optimizations, so an Ethernet platform. Spectrum-4 definitely looks very impressive. But this sentence puzzles me: “The world’s top hyperscalers are adopting NVIDIA Spectrum-X, including industry-leading cloud innovators.” Most of the links I have been reading lately say that Azure, Meta and Google are using InfiniBand. Now NVIDIA says the top hyperscalers are adopting Spectrum-X, when Spectrum-4 only started shipping this quarter?

And finally, why is NVIDIA pushing for both Ethernet and InfiniBand? I think this is a good link for that. According to NVIDIA's CEO, InfiniBand is great and nearly "free" if you build for a very specific application (supercomputers, etc). But for multi-tenant environments, you want Ethernet. That kind of explains why hyperscalers like AWS, GCP and Azure want Ethernet at the end of the day, at least for customer access. If you have just one (commodity) network, it is cheaper and easier to run and maintain, and you don't have vendor lock-in like with IB.

We will see what happens with all this crazy AI/LLM/ML stuff.

AMD MI300 + Meta DC

Reading different articles (1, 2, 3), I became aware of this new CPU-GPU-HBM3 architecture from AMD.

As well, Meta has a new DC design for ML/AI using NVIDIA and InfiniBand.

Now, Meta – working with Nvidia, Penguin Computing and Pure Storage – has completed the second phase of the RSC. The full system includes 2,000 DGX A100 systems, totaling a staggering 16,000 A100 GPUs. Each node has dual AMD Epyc “Rome” CPUs and 2TB of memory. The RSC has up to half an exabyte of storage and, according to Meta, one of the largest known flat InfiniBand fabrics in the world, with 48,000 links and 2,000 switches. (“AI training at scale is nothing if we cannot supply the data fast enough to the GPUs, right?” said Kalyan Saladi – a software engineer at Meta – in a presentation at the event.)

And again, cooling is critical.

Fat Tree – Dragonfly – OpenAI infra

I haven't played much with ChatGPT, but my first question was something like "what does the network infrastructure for building something like ChatGPT look like?". Obviously I didn't get the answer I was looking for, and I don't think I asked properly either.

Today I came across this video, and at 3:30 something very interesting starts: since this is an official video, it says the OpenAI cluster built in 2020 for ChatGPT was actually based on 285k AMD CPUs ("InfiniBand" connected) plus 10k V100 GPUs ("InfiniBand" connected). They don't mention any lower-level details, but it looks like two separate networks? And I have seen in several other pages/videos that M$ is hardcore into InfiniBand.

Then, regarding InfiniBand architectures, it seems the most common are "fat-tree" and "dragonfly". This video is quite good, although I have to watch it again (or more) to fully understand it.

This blog, pdf and wikipedia entry (high level) are good for learning about "Fat-Tree".

Although most of the info I found is "old", these technologies are not really old: Frontier uses them, and it looks like most supercomputers do too.
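
Since I keep forgetting the fat-tree maths, here is a tiny Python helper with the sizing of the classic 3-tier k-ary fat-tree (the formulas from the Al-Fares et al. paper; the function itself is just my own scribble):

    def fat_tree(k: int) -> dict:
        """Switch/host counts for a 3-tier k-ary fat-tree built from k-port switches."""
        assert k % 2 == 0, "k must be even"
        edge = agg = k * (k // 2)       # k pods, each with k/2 edge and k/2 aggregation switches
        core = (k // 2) ** 2
        hosts = k ** 3 // 4             # each edge switch serves k/2 hosts
        return {"pods": k, "edge": edge, "agg": agg, "core": core, "hosts": hosts}

    # 48-port switches already give a ~27k-host non-blocking fabric
    print(fat_tree(48))   # {'pods': 48, 'edge': 1152, 'agg': 1152, 'core': 576, 'hosts': 27648}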

Meta Chips – Colovore water-cooling – Google AI TPUv4 – NCCL – PINS P4 – Slingshot – KUtrace

Read 1: Meta is to build its own AI chips. It is currently using 16k A100 GPUs (Google is using 26k H100 GPUs). And it seems Graphcore had some issues in 2020.

Read 2: I didn't know Colovore. It is interesting to see how critical power/cooling actually is, with all the hype in AI and the power constraints in key regions (Ashburn VA…). With proper water cooling you can have a 200 kW rack! And it seems they have the same power as a facility 6x bigger. Cooling via water is cheaper than air cooling.

Read 3: Google is one of the biggest NVIDIA GPU customers although they built TPUv4. MS uses 10k A100 GPUs for training GPT-4, and 25k for GPT-5 (a mix of A100 and H100?). For customers, MS offers an AI supercomputer based on H100s, 400G InfiniBand Quantum-2 switches and ConnectX-7 NICs: 4k GPUs. Google has A3 GPU instances treated like supercomputers and uses "Apollo" optical circuit switching (OCS). "The OCS layer replaces the spine layer in a leaf/spine Clos topology" -> interesting to see what that means and looks like. As well, it uses NVSwitch to interconnect the GPUs' memories so they act like one. As well, they have their own (smart) NICs (DPUs, data processing units, or IPUs, infrastructure processing units?) using P4. Google has its own "inter-server GPU communication stack" as well as NCCL optimizations (a 2016 post!).

Read 4: Via the P4 newsletter. Since Intel bought Barefoot, I kind of assumed the product was nearly dead, but visiting the page and checking these slides, it seems "alive". SONiC + P4 are main players in Google's SDN.

 “Google has pioneered Software-Defined Networking (SDN) in data centers for over a decade. With the open sourcing of PINS (P4 Integrated Network Stack) two years ago, Google has ushered in a new model to remotely configure network switches. PINS brings in a P4Runtime application container to the SONiC architecture and supports extensions that make it easier for operators to realize the benefits of SDN. We look forward to enhancing the PINS capabilities and continue to support the P4 community in the future”

Read 5: Slingshot is another switching technology, coming from Cray supercomputers and trying to compete with InfiniBand. A 2019 link that looks interesting too. And a paper that I don't think I will be able to read, let alone understand.

Read 6: ISC High Performance 2023. I need to try to attend one of these events in the future. There are two interesting talks, although I doubt they will provide any online video or slides.

Talk1: Intro to Networking Technologies for HPC: “InfiniBand (IB), High-speed Ethernet (HSE), RoCE, Omni-Path, EFA, Tofu, and Slingshot technologies are generating a lot of excitement towards building next generation High-End Computing (HEC) systems including clusters, datacenters, file systems, storage, cloud computing and Big Data (Hadoop, Spark, HBase and Memcached) environments. This tutorial will provide an overview of these emerging technologies, their offered architectural features, their current market standing, and their suitability for designing HEC systems. It will start with a brief overview of IB, HSE, RoCE, Omni-Path, EFA, Tofu, and Slingshot. In-depth overview of the architectural features of IB, HSE (including iWARP and RoCE), and Omni-Path, their similarities and differences, and the associated protocols will be presented. An overview of the emerging NVLink, NVLink2, NVSwitch, Slingshot, Tofu architectures will also be given. Next, an overview of the OpenFabrics stack which encapsulates IB, HSE, and RoCE (v1/v2) in a unified manner will be presented. An overview of libfabrics stack will also be provided. Hardware/software solutions and the market trends behind these networking technologies will be highlighted. Sample performance numbers of these technologies and protocols for different environments will be presented. Finally, hands-on exercises will be carried out for the attendees to gain first-hand experience of running experiments with high-performance networks”

Talk2: State-of-the-Art High Performance MPI Libraries and Slingshot Networking: “Many top supercomputers utilize InfiniBand networking across nodes to scale out performance. Underlying interconnect technology is a critical component in achieving high performance, low latency and high throughput, at scale on next-generation exascale systems. The deployment of Slingshot networking for new exascale systems such as Frontier at OLCF and the upcoming El-Capitan at LLNL pose several challenges. State-of-the-art MPI libraries for GPU-aware and CPU-based communication should adapt to be optimized for Slingshot networking, particularly with support for the underlying HPE Cray fabric and adapter to have functionality over the Slingshot-11 interconnect. This poses a need for a thorough evaluation and understanding of slingshot networking with regards to MPI-level performance in order to provide efficient performance and scalability on exascale systems. In this work, we delve into a comprehensive evaluation on Slingshot-10 and Slingshot-11 networking with state-of-the-art MPI libraries and delve into the challenges this newer ecosystem poses.”

Read 7: Slides and video. I was aware of DTrace (although I never used it), so I am not sure how it compares with KUtrace. I guess I will ask ChatGPT 🙂

Read 8: Python as the programming language of choice for AI, ML, etc.

Read 9: M$ “buying” energy from fusion reactors.

VXLAN BGP EVPN Multisite

This is a video that gives a high-level explanation of EVPN Multisite. There is not really any config involved. The pdf for the session "BRKDCN-2913" is easy to find and download. Although this is NX-OS based, Arista has a similar feature called "EVPN Gateway": https://www.arista.com/en/support/toi/eos-4-25-0f/14591-evpn-l3-gateway (needs registration…). Really just one line to add under the EVPN address family to change the next hop to the gateway's address. The implementation looks much simpler than on NX-OS…

This is a summary of the video:


RFC 9014 (DCI EVPN Overlay) defines the Layer-2 extension between two domains

section 3: decoupled GW, VLAN handoff with a WAN edge.
section 4: integrated GW, the GWs talk L2 EVPN directly.
multi-site (BESS version) draft-sharma-bess-multi-site-evpn: supports extension of L2 and L3, unicast and multicast, VPNs. BGWs talk the eBGP EVPN AF.
GW mode: anycast VIP (ECMP: underlay) or multipath VIP (ECMP: underlay and overlay)
type-5: re-originated.
RD: separate RD for VIP and PIP
RT: same for intra/inter-DC
Border GW = EVPN GW

EVPN-IPVPN interop defines the Layer-3 extension between domains; it currently lacks EVPN-to-EVPN interconnects

The multisite draft combines RFC 9014 and EVPN-IPVPN interop with an EVPN-to-EVPN connection: https://datatracker.ietf.org/doc/html/draft-sharma-bess-multi-site-evpn-02

Use cases:
1- Compartmentalization:

  • multiple fabrics, single DC
  • control at the BGW: allows L2/L3 extension. Reduces the remote VTEP count. Expands VTEP scale.
  • BUM packets: the LS replicates only within its fabric, then the BGW replicates to the BGW in the other fabric. Without multi-site, the LS replicates to ALL VTEPs in the fabric.

2- Scale

  • control at the BGW: Reduces the remote VTEP count. Expands VTEP scale.
  • scale through hierarchy: multiply VTEPs by sites
    up to 128 sites per multi-site domain. Up to 256 VTEPs per fabric -> 32,768 VTEPs (see the quick sketch after this list)

3- DC interconnect (DCI)

  • IP reachability and MTU.
    integration with legacy networks.
    hybrid cloud connectivity: extends L3 with VRF awareness.
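
Quick sketch of the scale arithmetic from the bullets above (just my own multiplication in Python, nothing official):

    # Multi-site scale through hierarchy
    sites_per_domain = 128
    vteps_per_fabric = 256

    total_vteps = sites_per_domain * vteps_per_fabric   # 32,768 VTEPs in the whole domain
    # Without the BGW hierarchy a leaf would need state for all of them;
    # with multi-site it only sees the VTEPs of its own fabric plus the BGWs.
    print(total_vteps)                                   # 32768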

Deeper look:
HW support is only important on the BGW; it is not needed on the LS.

tunnels:

  • stitched at the BGW (no recirculation, HW rate)
  • intra-fabric tunnels go LS to LS or LS to BGW
  • inter-fabric tunnels go BGW to BGW
  • only the BGW IPs must be unique… the fabrics are "separated".

BGW deployment considerations:

  • 1) anycast BGW
  • – up to 6 nodes. They are not interconnected, they just share the ASN, nothing else. In LS or SS.
  • – VIP mode: VIP used for tunnel stitching. Focus on scale and convergence. Overlay ECMP.
  • – PIP mode: for 3rd-party interop. Uses the PIP for tunnel stitching. Uses underlay and overlay ECMP.

  • 2) vPC BGW:
  • – only 2 nodes (because of vPC and the peer link). Only in LS.
    – legacy network integration, attachment of FWs and ADCs.

NOTE: both anycast and vPC BGWs must have a multi-site VIP and a PIP; only vPC needs an extra IP for the vPC VIP.
The PIP is needed for establishing BGP and for the Designated Forwarder election (only one BGW forwards per VLAN).

CP and DP:

  • As eBGP is used between sites -> eBGP changes the NH => VXLAN tunnel termination and re-origination + loop prevention (AS-path). Full-mesh eBGP EVPN between sites.
  • underlay/overlay CP deployment: the recommendation within the fabric is IGP as underlay, iBGP as overlay.
  • full-mesh eBGP EVPN between sites OR deploy an RS (route server) -> the RS is in a separate AS and only does CP, like an eBGP RR (RFC 7947): EVPN route reflection, NH unchanged, RT rewrite!

I think this is the white paper mentioned:  https://www.cisco.com/c/en/us/products/collateral/switches/nexus-9000-series-switches/white-paper-c11-739942.html

Another thing: I wish it weren't so painful to simulate NX-OS. It is so easy to spin up a lab with cEOS… on a standard laptop.

BGP Add Path

Some weeks ago I was asked some questions and totally missed that BGP has a feature to advertise more paths than just the best path, which is the default behaviour. So I wanted to learn more about it. The RFC is here; it is good for understanding how the feature is negotiated. I have searched for other links that give a bit more info about the implementation/design details, because, reading the RFC, I didn't notice that this feature is for iBGP, like mentioned here.
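
To make the negotiation part more concrete, here is a minimal sketch (based on RFC 7911; the helper and the defaults are mine, not from any particular BGP implementation) of the ADD-PATH capability both peers have to exchange:

    import struct

    def addpath_capability(afi: int = 1, safi: int = 1, send_receive: int = 3) -> bytes:
        """ADD-PATH capability (code 69) for one AFI/SAFI; send_receive: 1=receive, 2=send, 3=both."""
        value = struct.pack("!HBB", afi, safi, send_receive)
        return struct.pack("!BB", 69, len(value)) + value

    # Once negotiated, every NLRI is prefixed with a 4-byte Path Identifier,
    # which is what lets several paths for the same prefix coexist.
    print(addpath_capability().hex())   # 450400010103 -> IPv4 unicast, send+receive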

Another feature I need to lab up.

BGP Site of Origin (SoO)

SoO is something that I have read about and often forget, so I am trying to stick it in my mind here. I found this link, which I think is quite good.

Definition:

It ensures a loop-free network, in particular with multi-homed MPLS Layer 3 VPN sites. BGP SoO is a tag appended to BGP updates that allows a peer (PE) to mark a particular prefix as belonging to a particular site.

In certain MPLS L3 VPN configurations, the BGP AS-path may not provide the granularity needed to prevent a loop in the control plane. For example, when the CPEs in your sites peer with the SP's PEs (multihomed sites) using the same ASN, you need to use "allowas-in" on the CPEs.

Scenario:

This scenario has two issues:

  • Suboptimal routing
  • Routing loop under failure.

Solution:

Configure a unique SoO code for each multihomed site on the PE routers.
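
A minimal sketch of the mechanics (my own Python based on RFC 4360; the ASN and site values are made up): the SoO is just an extended community, and the PE refuses to re-advertise a route back into the site it was learned from:

    import struct

    def soo(asn: int, site_id: int) -> bytes:
        """Site of Origin as a transitive two-octet-AS extended community (type 0x00, sub-type 0x03)."""
        return struct.pack("!BBHI", 0x00, 0x03, asn, site_id)

    def should_advertise(route_soo: bytes, attached_site_soo: bytes) -> bool:
        """A PE does not send a route back towards the site that originated it."""
        return route_soo != attached_site_soo

    site_a = soo(65000, 100)   # made-up value configured on both PEs facing site A
    print(should_advertise(site_a, site_a))   # False -> no loop, even with allowas-in on the CPEs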

This is just an intro as I want to create a lab with this.

NANOG 86

This is something I had on my to-watch list… and I finally found some time to check out a couple of talks that looked interesting:

  • Emulating Network Topologies in k8s (Google): video

I liked this talk. It is about network simulation with Kubernetes. It reminds me of "containerlab", as it uses containers. I was surprised that Google showed Nokia SR OS and Arista cEOS. Do they use them in production? Funnily enough, there was no Cisco.

I checked KNE and it looks interesting; I should try it at some point.

  •    Towards a new Ethernet for High-Performance Data Centers – Activities and Enhancements in IEEE 802.1: video

It mentions InfiniBand is still king in the top 100 HPC systems. There are 3 improvements in the pipeline for Ethernet QoS to make PFC more "flow" aware. Still some "time" until they hit the market though.

As well, there is a moment where it mentions Azure managed to get RDMA working at 100 km with MACsec. I think this is the video.

This is the whole video list and slides.

CXL

In one meeting somebody mentioned CXL, and I had to look it up.

Interesting:

Eventually CXL is expected to be an all-encompassing cache-coherent interface for connecting any number of CPUs, memory, process accelerators (notably FPGAs and GPUs), and other peripherals.

BFD Multihop

BFD is a protocol I assumed I knew "well", as it is quite straightforward… But after having to check how BFD multihop is configured and works, I noticed I actually had no idea. As usual, I need to read the RFC at some point.

From this link, I learned about the concept of control (hello) and echo packets… and that echo uses the same IP as source and destination… I really like the Wireshark captures.

Copy/Paste from the link

Packet Types

Control Packets

Control packets are used to establish BFD peerings. Essential information is included within these packets, including flags for things such as authentication, in addition to the timer negotiations.
These packets are sent via UDP to the far-side IP, using the bfd-control port 3784.
Because these packets must actually be processed by the peer, they are sent less frequently than the BFD echoes used for sub-second failure detection.


Echo Packets

BFD echo packets are essentially for local use. They are sent with the router's own IP as both source and destination, destined for the UDP bfd-echo port 3785. When an echo packet is received, because the destination IP is not that of the receiving router, it simply forwards it out of the appropriate interface, removing the need to punt it up to the processor.
Because the source and destination IP are those of the local router, BFD can be run asynchronously. That is, you can set up a single side to utilize BFD echo detection, while the other side merely maintains a BFD neighbor relationship through control packets.

And now about BFD multihop. It is a short read, and the main point is that the UDP port is 4784, compared with 3784 for single-hop.
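
To keep the ports straight, a minimal sketch (RFC 5880 mandatory section, my own packing; the lab IP and discriminator are made up) showing that multihop is the same 24-byte control packet, just sent to UDP 4784 instead of 3784:

    import socket
    import struct

    BFD_CONTROL_SINGLE_HOP = 3784   # single-hop control packets
    BFD_ECHO = 3785                 # echo packets (src IP == dst IP == local router)
    BFD_CONTROL_MULTIHOP = 4784     # multihop control packets

    def bfd_control(my_disc: int, your_disc: int, state: int = 1, detect_mult: int = 3) -> bytes:
        """RFC 5880 mandatory section: 24 bytes, timers in microseconds."""
        vers_diag = (1 << 5) | 0      # version 1, diag 0
        state_flags = state << 6      # AdminDown=0, Down=1, Init=2, Up=3; no flags set
        return struct.pack("!BBBBIIIII", vers_diag, state_flags, detect_mult, 24,
                           my_disc, your_disc, 1_000_000, 1_000_000, 0)

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(bfd_control(0x1, 0), ("192.0.2.1", BFD_CONTROL_MULTIHOP))   # lab/test address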

Then, for the specific details of configuring BFD MH in NX-OS, it is better to check the official documentation. That, for example, confirms "Echo mode is not supported for multihop BFD."

Another thing to take into account is CoPP. You need to check whether your device OS matches BFD in its CoPP policies, as multihop BFD goes to the CPU. As well, check if there is any other hardware configuration required.

Another thing that bites me is that when testing this in a software lab, BFD is always down, but at least the routing protocols come up.