AWS Intent-Driven 2023- Groq – Graviton4 -Liquid Cooling – Petals – Google – Crawler – VAX – dmesg

AWS Reinvent Intent-Driven Network Infra: Interesting video about Intent-driven networking in AWS. This is the paper he shows in the presentation. Same note as last year, leaf-spine, pizza boxes, all home made. The development of the SIDR as the control plane for scale. And somehow the talk about UltraCluster for AI (20k+ GPU). Maybe that is related to this collaboration NVIDIA-AWS. Interesting that there is no mention to QoS, he said no oversubscription. In general, everything is high level, and done in-house, and very likely they facing problems that very few companies in the world are facing. Still would be nice to open all those techs (like Google has done – but never for network infra). As well, I think he hits the nail on the head how he defines himself from Network Engineer to Technologist, as at the end of the day, you touch all topics.

AWS backbone: No chassis, all pizza boxes

Graviton4: More ARM chips in cloud-scale

Groq: Didnt know this “GPU” alternative. Interesting numbers. Let’s see if somebody buys it.

Petals: Run LLMs bittorrent style!

Google view after 18 years: Very nice read about the culture shift in the company, from do not evil, to make lots of many at any cost.

GTP-Crawler: Negative thing, you need the pay version of chatgpt. I wonder, If I crawke cisco, juniper and arista, what would be nearly all network knowledge in the planet? If that crawler can get ALL that date.

Linux/VAX porting: Something that I want to keep (ATP).

dmesg -T: How many times (in even more years!!!!) I wondered how to make those timestamp to something I could compare with then debugging.

VimGPT – Maia AI – Mirai – Reptar – Mellanox Debian – RISC-V DC – Mojo – Moors Law

VimGTP: Very interesting project. I haven’t used it. But thinking aloud, you could use it to interact with sites that dont have API (couriers)? I think with Selenium you can do things like that?

Maia AI: CLoud providers like to be masters of their own destiny so try to build as many things by themselves as possible. So now MS has developed its GPU for AI. It is quite interesting the custom rack they had to built with the sidekick for cooling down the new chips. There are no many figures about the chip (5nm, 105b transistors) to compare with other things in the market.

Reptar: new Intel CPU vulnerability. It looks like is a feature from Ice Lake architecture. It looks like you can crash the cores but no yet take over. Still interesting.

I am not affected ๐Ÿ™‚

$ grep fsrm /proc/cpuinfo
$

Mellanox with Debian: Interesting how you can install a nearly standard Debian into a Mellanox SN2700 switch.

RISC-V into datacenter: Happy to see RISC-V chips in the datacenter. But not clear who is going to use them?

Mirai history: I think most of wired articles read like a holywood movie ๐Ÿ™‚ Although 2016 security issues are “old” school, still interesting how teenagers got that far.

Mojo: Interesting because of the people behind of it… really impressive.

Moor’s law analysis: I liked the part about networks, that is not very common mentioned in these type of analysis.

FP8-LM

From the AlphaSignal email list, that most of the times go over my lame knowledge, I found this piece of info, quite interesting:

FP8-LM: Training FP8 Large Language Models

Goal: Optimize LLM training with FP8 low-bit data formats.
Issue: High cost of LLM computational resources.
Solution: FP8 automatic mixed-precision framework for LLMs.
Results: Reduced memory by 42%, increased speed by 64%.
Insight: FP8 maintains accuracy, optimizes training efficiency.

Repo. Paper

This is something I want to really understand at one point. FP (Floating-Point) instructions can be from several sizes (8, 16, 32, 64). So the bigger, the better precision. I guess for some scientific tasks that is important. But looks like for AI, with FP8 could be good enough.

Limits Computer Performance

Reading across this blog, I came to this statement:

What limits computer performance today is predictability, and the two big ones are instruction/branch predictability, and data locality.

That is from this interview. I dont kown Jim Keller but it is a long and interesting conversation. I liked it when he says he was the laziest person at Tesla!

And actually I found a tab from his company

HotChips

I didnt know anything about this conference until the last two months started to read news about it from different blogs. I am surprised the webpage doesnt link to the videos. This one is quite interesting but after the 15 minute becomes very hardcode for me.

LLM: hardware connection

Good article about LLM from the hardware/networks perspective. I liked it wasnt a show-off from Juniper products, as I haven’t seen any mention of Juniper kit in deployments of LLM in cloud providers, hyperscalers, etc. The points about Infiniband (the comment at the end about the misconceptions of IB is funny) and ethernet were not new but I liked the VOQ reference.

Still as a network engineer, I feel I am missing something about how to make the best network deployment for training LLM.

AI Supercomputer – NVLink

So NVIDIA has an AI supercomputer via this. Meta, Google and MS making comments about it. And based on this, it is a 24 racks setup using 900GBps NVLink-C2C interface, so no ethernet and no infiniband. Here, there is a bit more info about NVLink:

NVLink Switch System forms a two-level, non-blocking, fat-tree NVLink fabric to fully connect 256 Grace Hopper Superchips in a DGX GH200 system. Every GPU in DGX GH200 can access the memory of other GPUs and extended GPU memory of all NVIDIA Grace CPUs at 900 GBps. 

This is the official page for NVlink but only with the above I understood this is like a “new” switching infrastructure.

But looks like if you want to connect up those supercomputers, you need to use infiniband. And again power/cooling is a important subject.

AMD MI300 + Meta DC

Reading different articles: 1, 2, 3 I was made aware of this new architecture of CPU-GPU-HMB3 from AMD.

As well, Meta has a new DC design for ML/AI using Nvidia and Infiniband.

Now, Meta โ€“ working with Nvidia, Penguin Computing and Pure Storage โ€“ has completed the second phase of the RSC. The full system includes 2,000 DGX A100 systems, totaling a staggering 16,000 A100 GPUs. Each node has dual AMD Epyc โ€œRomeโ€ CPUs and 2TB of memory. The RSC has up to half an exabyte of storage and, according to Meta, one of the largest known flat InfiniBand fabrics in the world, with 48,000 links and 2,000 switches. (โ€œAI training at scale is nothing if we cannot supply the data fast enough to the GPUs, right?โ€ said Kalyan Saladi โ€“ a software engineer at Meta โ€“ in a presentation at the event.)

An again, cooling is critical.

Fat Tree – Drangonfly – OpenAI infra

I haven’t played much with ChatGPT but my first question was “how is the network infrastructure for building something like ChatGPT?” or similar. Obviously I didnt have the answer I was looking for and I think i think ask properly neither.

Today, I came to this video and at 3:30 starts something very interesting as this is an official video as says the OpenAI cluster built in 2020 for ChatGTP was actullay based on 285k AMD CPU “infinibad” plus 10k V100 GPU “infiniband” connected. They dont mention more lower level details but looks like two separated networks? And I have seen in several other pages/videos, M$ is hardcode in infiniband.

Then regarding the infiniband architectures, it seems the most common are “fat-tree” and “dragon-fly”. This video is quite good although I have to watch it again (or more) to fully understand.

These blog, pdf and wikipedia (high level) are good for learning about “Fat-Tree”.

Although most info I found is “old”, these technologies are not old. Frontier and looks like most of supercomputers use it.

Meta Chips – Colvore water-cooling – Google AI TPv4 – NCCL – PINS P4 – Slingshot – KUtrace

Read 1. Meta to build its own AI chips. Currently using 16k A100 GPU (Google using 26k H100 GPU). And it seems Graphcore had some issues in 2020.

Read 2. Didnt know Colovore, interesting to see how critical is actually power/cooling with all the hype in AI and power constrains in key regions (Ashburn VA…) And with proper water cooling you can have a 200kw rack! And seems they have the same power as a 6x bigger facility. Cost of cooling via water is cheaper than air-cooled.

Read 3. Google one of biggest NVIDIA GPU customer although they built TPUv4. MS uses 10k A100 GPU for training GPT4. 25k for GPT5 (mix of A100 and H100?) For customer, MS offers AI supercomputer based on H100, 400G infiniband quantum2 switches and ConnectX-7 NICs: 4k GPU. Google has A3 GPU instanced treated like supercomputers and uses “Apollo” optical circuit switching (OCS). “The OCS layer replaces the spine layer in a leaf/spine Clos topology” -> interesting to see what that means and looks like. As well, it uses NVSwitch for interconnect the GPUs memories to act like one. As well, they have their own (smart) NICS (DPU data processing units or infrastructure processing units IPU?) using P4. Google has its own โ€œinter-server GPU communication stackโ€ as well as NCCL optimizations (2016! post).

Read4: Via the P4 newletter. Since Intel bought Barefoot, I kind of assumed the product was nearly dead but visiting the page and checking this slides, it seems “alive”. Sonic+P4 are main players in Google SDN.

 “Google has pioneered Software-Defined Networking (SDN) in data centers for over a decade. With the open sourcing of PINS (P4 Integrated Network Stack) two years ago, Google has ushered in a new model to remotely configure network switches. PINS brings in a P4Runtime application container to the SONiC architecture and supports extensions that make it easier for operators to realize the benefits of SDN. We look forward to enhancing the PINS capabilities and continue to support the P4 community in the future”

Read5: Slingshot is another switching technology coming from Cray supercomputers and trying to compete with Infiniband. A 2019 link that looks interesting too. Paper that I dont thik I will be able to read neither understand.

Read6: ISC High Performance 2023. I need to try to attend one of these events in the future. There are two interesting talks although I doubt they will provide any online video or slides.

Talk1: Intro to Networking Technologies for HPC: “InfiniBand (IB), High-speed Ethernet (HSE), RoCE, Omni-Path, EFA, Tofu, and Slingshot technologies are generating a lot of excitement towards building next generation High-End Computing (HEC) systems including clusters, datacenters, file systems, storage, cloud computing and Big Data (Hadoop, Spark, HBase and Memcached) environments. This tutorial will provide an overview of these emerging technologies, their offered architectural features, their current market standing, and their suitability for designing HEC systems. It will start with a brief overview of IB, HSE, RoCE, Omni-Path, EFA, Tofu, and Slingshot. In-depth overview of the architectural features of IB, HSE (including iWARP and RoCE), and Omni-Path, their similarities and differences, and the associated protocols will be presented. An overview of the emerging NVLink, NVLink2, NVSwitch, Slingshot, Tofu architectures will also be given. Next, an overview of the OpenFabrics stack which encapsulates IB, HSE, and RoCE (v1/v2) in a unified manner will be presented. An overview of libfabrics stack will also be provided. Hardware/software solutions and the market trends behind these networking technologies will be highlighted. Sample performance numbers of these technologies and protocols for different environments will be presented. Finally, hands-on exercises will be carried out for the attendees to gain first-hand experience of running experiments with high-performance networks”

Talk2: State-of-the-Art High Performance MPI Libraries and Slingshot Networking: “Many top supercomputers utilize InfiniBand networking across nodes to scale out performance. Underlying interconnect technology is a critical component in achieving high performance, low latency and high throughput, at scale on next-generation exascale systems. The deployment of Slingshot networking for new exascale systems such as Frontier at OLCF and the upcoming El-Capitan at LLNL pose several challenges. State-of-the-art MPI libraries for GPU-aware and CPU-based communication should adapt to be optimized for Slingshot networking, particularly with support for the underlying HPE Cray fabric and adapter to have functionality over the Slingshot-11 interconnect. This poses a need for a thorough evaluation and understanding of slingshot networking with regards to MPI-level performance in order to provide efficient performance and scalability on exascale systems. In this work, we delve into a comprehensive evaluation on Slingshot-10 and Slingshot-11 networking with state-of-the-art MPI libraries and delve into the challenges this newer ecosystem poses.”

Read7: Slides and Video. I was aware of Dtrace (although never used it) so not sure how to compare with KUtrace. I guess I will ask Chat-GPT ๐Ÿ™‚

Read8: Python as programming of choice for AI, ML, etc.

Read9: M$ “buying” energy from fusion reactors.