NVIDIA GTC March 2023

This week I watched some interesting networking-related videos from NVIDIA GTC. It is a pain that you need to use a “work” email to register…

  • S51839 – Designing the Next Generation of AI Systems:

— A quick summary: it seems any HPC network needs to use InfiniBand… NVIDIA has solutions for all sizes; they can even provide a full POD! And all the big cloud providers offer services based on them.

  • S51112 – How to Design an AI Supercomputer for Fast Distributed Training, and its Use Cases:

— Very interesting talk from NEC Japan. They built a network for HPC with GPUs based on Ethernet switches (not InfiniBand, as seen in the other video). They also rely heavily on RDMA/RoCEv2, and it seems they have dedicated ports in the network for storage, management, etc. They are very happy with Cumulus Linux as the NOS.

  • S51339 – Hit the Ground Running with Data Center Digital Twin Automation:

— NVIDIA Air is an interesting tool for creating labs. I expected the demonstration to show off and build a huge network. “Digital twin” looks like the new buzzword in the network automation world.

  • S51751 – Powering Telco Cloud Services with Open Accelerated Ethernet:

— This one is from Comcast, and it is very interesting how “big” SONiC seems to be getting. NVIDIA is the second-largest contributor to SONiC, after M$! I need to try SONiC at some point.

  • S51204 – Transforming Clouds to Cloud-Native Supercomputing Best Practices with Microsoft Azure:

— Obviously, building NVIDIA-based supercomputers in M$ Azure. Again, all InfiniBand.

And another thing, the Spectrum-4 switch looks insane.

AWS Networking Videos – March 2023

I watched some very interesting videos about AWS networking. They are high level, so they don't reveal the secret sauce you would like to know, but it is nice that this info is out in public.

  • DKNOG – How AWS is evolving its peering-edge in 2023 and onwards link + event:

— Evolution from buying chassis to building your own devices: consume -> create (NOC-less, auto-remediation, active telemetry, etc.) -> innovate (freedom to examine trade-offs, 1U devices). Clearly they use “Clos” networks and Linux-based software.

— Delighted: low complexity + high innovation

— Simplicity Scales

— The view of a router/“brick” as a set of 1U devices is interesting (a 102.8T rack, 200x400G ports for customers, non-blocking; see the sanity-check sketch after this list). And it is very good that they show pictures of the concept of “bricks” and “spines”.

— Challenges with cabling (SN connectors, so no patch rack is needed) and with 400G ZR+ (heat!)

— BGP peering is actually done with a container.

— James Hamilton paper – link + pdf
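
Just to convince myself how a rack of 1U boxes can act as one big non-blocking router, here is a quick back-of-the-envelope sketch in Python. The 32x400G radix per device is my assumption, not a number from the talk:

```python
# Back-of-the-envelope sizing of a "brick": a non-blocking two-tier Clos
# built from identical 1U switches. The 32x400G radix per device is my
# assumption, not a number from the talk.
import math

PORT_SPEED_G = 400      # per-port speed in Gbit/s
RADIX = 32              # ports per 1U device (assumed)
CUSTOMER_PORTS = 200    # 400G ports the brick offers to customers

# Non-blocking: each leaf splits its radix 50/50 between customer-facing
# ports and uplinks towards the spine layer.
down_per_leaf = RADIX // 2
leaves = math.ceil(CUSTOMER_PORTS / down_per_leaf)
uplinks = leaves * (RADIX - down_per_leaf)
spines = math.ceil(uplinks / RADIX)

capacity_t = CUSTOMER_PORTS * PORT_SPEED_G / 1000  # customer Tbit/s

print(f"{leaves} leaves + {spines} spines = {leaves + spines} x 1U devices")
print(f"customer-facing capacity: {capacity_t:.1f} Tbit/s, non-blocking")
```

With those assumed numbers you end up with 20 x 1U boxes giving 80 Tbit/s to customers, the same ballpark as the rack described in the talk.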

  • AWS re:Invent 2022 – Dive deep on AWS networking infrastructure (NET402) – link

— Summary: this is “similar” to the DKNOG talk but longer and with some other details, like:

— “We don't like chassis.” 1+ million devices.

— SRD (Scalable Reliable Datagram) at the NIC level, so one TCP flow is actually load-balanced over several paths; see the toy sketch after this list.

— Hybrid SDN approach: you have controllers to give a big-picture view (I guess that provides the visibility to say “just send 70% of the traffic to this device”, though I am not sure how) plus each device's own capability to deal with changes.

— Telemetry, continuous monitoring, triangulation: being able to detect which port/device is causing the problem (a toy triangulation sketch is below).
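
About the SRD multipath point: the difference with classic per-flow ECMP is easy to see in a toy model. This is purely illustrative; the real SRD implementation is not public:

```python
# Toy model: classic per-flow ECMP pins one flow to a single path, while
# SRD-style per-packet spraying spreads the same flow across all of them.
# Purely illustrative; the real SRD protocol is not public.
import hashlib
from collections import Counter

PATHS = 8
PACKETS = 10_000
FLOW = ("10.0.0.1", "10.0.0.2", 40000, 443, "tcp")  # one 5-tuple

# Per-flow ECMP: hash the 5-tuple once -> every packet takes the same path.
ecmp_path = int(hashlib.sha256(str(FLOW).encode()).hexdigest(), 16) % PATHS
ecmp = Counter(ecmp_path for _ in range(PACKETS))

# SRD-style spraying: pick a path per packet (round-robin here) and let the
# receiver tolerate out-of-order arrival instead of demanding TCP ordering.
spray = Counter(seq % PATHS for seq in range(PACKETS))

print("ECMP :", dict(ecmp))   # all 10k packets on one path
print("Spray:", dict(spray))  # ~1250 packets on each of the 8 paths
```

The price of spraying is out-of-order delivery, which is why it is done below TCP in the Nitro card: the flow gets the bandwidth of all paths instead of just one.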
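
And on triangulation: if you know which devices each probe crossed, the failing probes intersect on the culprit. A minimal sketch of the idea, with made-up device names:

```python
# Toy triangulation: each probe records the devices it traversed. The
# devices common to all failing probes (and absent from healthy ones)
# are the suspects. All device names are made up.
probes = {
    # probe id: (devices on the path, probe succeeded?)
    "p1": (["leaf1", "spine2", "leaf4"], False),
    "p2": (["leaf2", "spine2", "leaf5"], False),
    "p3": (["leaf1", "spine1", "leaf5"], True),
    "p4": (["leaf3", "spine2", "leaf6"], False),
}

failed = [set(path) for path, ok in probes.values() if not ok]
passed = [set(path) for path, ok in probes.values() if ok]

# On every failing path, minus anything a healthy probe crossed.
suspects = set.intersection(*failed) - set().union(*passed)
print("suspect device(s):", suspects)  # -> {'spine2'}
```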

  • AWS re:Invent 2022 – Leaping ahead: The power of cloud network innovation (NET211-L) – link:

— AWS Global Infrastructure: Backbone capacity

— Customer SW/HW

— Everything fails all the time

— GPS locations in fibers! Plus injecting light into the fiber to double-check a fault -> intelligent optical routing/failover -> better than BGP… (a rough sketch of this flow is at the end).

— Termite-resistant fiber sheathing for Australia 🙂

— Nitro card = NIC (offload card)

— SRD: it does not need in-order packet delivery as TCP requires. 25Gbps flows are allowed now.
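
The optical failover flow, as I understood it, is roughly: on a loss-of-light alarm, confirm the fault with a test signal before switching at the optical layer, so you never wait for BGP to converge. A rough sketch; every function name here is a made-up placeholder:

```python
# Conceptual flow of the optical failover, as I understood it from the
# talk. Every function here is a made-up placeholder, not a real API.

def fiber_carries_test_light(span: str) -> bool:
    """Placeholder: inject a test signal and check the far end sees it."""
    return False  # pretend the fault is confirmed

def handle_loss_of_light(span: str, protect_path: str) -> str:
    # 1. An alarm fired on `span`; double-check by injecting light, so a
    #    flapping transceiver does not trigger a false failover.
    if fiber_carries_test_light(span):
        return "false alarm: keep traffic where it is"
    # 2. Fault confirmed: switch at the optical layer right away instead
    #    of withdrawing routes and waiting for BGP to reconverge.
    return f"fault on {span}: failing over to {protect_path}"

print(handle_loss_of_light("SYD-MEL span 3", "protect path via ADL"))
```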