AMD MI300 + Meta DC

Reading different articles (1, 2, 3), I became aware of this new CPU-GPU-HBM3 architecture from AMD.

As well, Meta has a new DC design for ML/AI using Nvidia and Infiniband.

Now, Meta – working with Nvidia, Penguin Computing and Pure Storage – has completed the second phase of the RSC. The full system includes 2,000 DGX A100 systems, totaling a staggering 16,000 A100 GPUs. Each node has dual AMD Epyc “Rome” CPUs and 2TB of memory. The RSC has up to half an exabyte of storage and, according to Meta, one of the largest known flat InfiniBand fabrics in the world, with 48,000 links and 2,000 switches. (“AI training at scale is nothing if we cannot supply the data fast enough to the GPUs, right?” said Kalyan Saladi – a software engineer at Meta – in a presentation at the event.)

And again, cooling is critical.

Fat Tree – Dragonfly – OpenAI infra

I haven’t played much with ChatGPT, but my first question was “what does the network infrastructure for building something like ChatGPT look like?” or similar. Obviously I didn’t get the answer I was looking for, and I probably didn’t ask the question properly either.

Today, I came across this video, and at 3:30 something very interesting starts: being an official video, it says the OpenAI cluster built in 2020 for ChatGPT was actually based on 285k AMD CPU cores plus 10k V100 GPUs, InfiniBand-connected. They don’t mention any lower-level details, but it sounds like two separate networks? And from several other pages/videos I have seen, M$ is hardcore into InfiniBand.

Then, regarding InfiniBand architectures, it seems the most common are “fat-tree” and “dragonfly”. This video is quite good, although I will have to watch it again (or a few more times) to fully understand it.

This blog, PDF and Wikipedia (high level) are good for learning about “Fat-Tree”.

Although most of the info I found is “old”, these technologies are not: Frontier uses them, and it looks like most supercomputers do too.
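The classic k-ary fat-tree (from the Al-Fares et al. paper) has closed-form sizes, which helps to appreciate the scale of these fabrics. A small Python sketch of my own (an illustration, not taken from any of the linked posts):

```python
def fat_tree_size(k):
    """Sizes of a classic k-ary fat-tree (k must be even).

    There are k pods; each pod has k/2 edge and k/2 aggregation
    switches, each edge switch connects k/2 hosts, and (k/2)^2 core
    switches sit on top.
    """
    assert k % 2 == 0
    edge = k * (k // 2)
    agg = k * (k // 2)
    core = (k // 2) ** 2
    hosts = k ** 3 // 4
    return {"pods": k, "edge": edge, "agg": agg, "core": core,
            "switches": edge + agg + core, "hosts": hosts}

print(fat_tree_size(4))   # tiny example: 16 hosts, 20 switches
print(fat_tree_size(48))  # 48-port switches already give 27648 hosts
```

With 64-port radix switches the host count passes 65k, which is roughly the scale these GPU clusters operate at.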

Meta Chips – Colovore water-cooling – Google AI TPUv4 – NCCL – PINS P4 – Slingshot – KUtrace

Read 1: Meta to build its own AI chips. It is currently using 16k A100 GPUs (Google is using 26k H100 GPUs). And it seems Graphcore had some issues in 2020.

Read 2: I didn’t know Colovore; interesting to see how critical power/cooling actually is, with all the hype in AI and the power constraints in key regions (Ashburn VA…). With proper water cooling you can have a 200 kW rack! And it seems they pack the same power as a facility 6x their size. Cooling via water is cheaper than air cooling.
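A back-of-the-envelope on what a 200 kW rack means for footprint (assumed round numbers, not Colovore’s actual specs):

```python
# Same total IT load delivered with dense water-cooled racks vs
# typical air-cooled ones (assumed: 30 kW air, 200 kW water).
it_load_kw = 6_000
air_rack_kw, water_rack_kw = 30, 200

racks_air = it_load_kw / air_rack_kw      # 200 air-cooled racks
racks_water = it_load_kw / water_rack_kw  # 30 water-cooled racks
print(racks_air, racks_water)             # ~6.7x fewer racks with water
```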

Read 3: Google is one of the biggest NVIDIA GPU customers, although they built TPUv4. MS used 10k A100 GPUs for training GPT-4, and 25k for GPT-5 (a mix of A100 and H100?). For customers, MS offers an AI supercomputer based on H100s, 400G InfiniBand Quantum-2 switches and ConnectX-7 NICs: 4k GPUs. Google has A3 GPU instances treated like supercomputers and uses “Apollo” optical circuit switching (OCS). “The OCS layer replaces the spine layer in a leaf/spine Clos topology” -> interesting to see what that means and looks like. It also uses NVSwitch to interconnect the GPUs’ memories so they act like one. As well, they have their own (smart) NICs (DPUs, data processing units, or IPUs, infrastructure processing units?) using P4. Google has its own “inter-server GPU communication stack” as well as NCCL optimizations (a post from 2016!).
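On the NCCL mention: the collective NCCL is best known for is the ring all-reduce, where every rank ends up with the elementwise sum of all ranks’ vectors after 2*(n-1) neighbor exchanges. A toy sketch of my own (just the data movement in plain Python, not NCCL’s actual implementation):

```python
def ring_allreduce(ranks):
    """Toy ring all-reduce: each of the n ranks holds one vector;
    after a reduce-scatter and an all-gather phase, every rank holds
    the elementwise sum. NCCL does this chunked over NVLink/InfiniBand;
    here there is no real communication, only the chunk rotation."""
    n = len(ranks)
    data = [list(v) for v in ranks]
    # chunk c covers indices bounds[c]:bounds[c+1]
    bounds = [len(ranks[0]) * c // n for c in range(n + 1)]

    # phase 1: reduce-scatter — after n-1 steps, rank r holds the
    # full sum of chunk (r+1) % n
    for step in range(n - 1):
        for r in range(n):
            c = (r - step) % n            # chunk rank r forwards now
            dst = (r + 1) % n
            for i in range(bounds[c], bounds[c + 1]):
                data[dst][i] += data[r][i]

    # phase 2: all-gather — circulate each finished chunk around the ring
    for step in range(n - 1):
        for r in range(n):
            c = (r + 1 - step) % n        # finished chunk rank r forwards
            dst = (r + 1) % n
            for i in range(bounds[c], bounds[c + 1]):
                data[dst][i] = data[r][i]
    return data

out = ring_allreduce([[1, 2], [10, 20], [100, 200]])
print(out)  # every rank ends with [111, 222]
```

The point of the ring shape is that each rank only ever talks to its neighbor, so the per-rank bandwidth needed is constant regardless of cluster size.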

Read 4: Via the P4 newsletter. Since Intel bought Barefoot, I kind of assumed the product was nearly dead, but visiting the page and checking these slides, it seems “alive”. SONiC + P4 are main players in Google’s SDN.

 “Google has pioneered Software-Defined Networking (SDN) in data centers for over a decade. With the open sourcing of PINS (P4 Integrated Network Stack) two years ago, Google has ushered in a new model to remotely configure network switches. PINS brings in a P4Runtime application container to the SONiC architecture and supports extensions that make it easier for operators to realize the benefits of SDN. We look forward to enhancing the PINS capabilities and continue to support the P4 community in the future”

Read 5: Slingshot is another switching technology, coming from Cray supercomputers, trying to compete with InfiniBand. A 2019 link that looks interesting too. And a paper that I don’t think I will be able to read, let alone understand.

Read 6: ISC High Performance 2023. I need to try to attend one of these events in the future. There are two interesting talks, although I doubt they will provide any online video or slides.

Talk1: Intro to Networking Technologies for HPC: “InfiniBand (IB), High-speed Ethernet (HSE), RoCE, Omni-Path, EFA, Tofu, and Slingshot technologies are generating a lot of excitement towards building next generation High-End Computing (HEC) systems including clusters, datacenters, file systems, storage, cloud computing and Big Data (Hadoop, Spark, HBase and Memcached) environments. This tutorial will provide an overview of these emerging technologies, their offered architectural features, their current market standing, and their suitability for designing HEC systems. It will start with a brief overview of IB, HSE, RoCE, Omni-Path, EFA, Tofu, and Slingshot. In-depth overview of the architectural features of IB, HSE (including iWARP and RoCE), and Omni-Path, their similarities and differences, and the associated protocols will be presented. An overview of the emerging NVLink, NVLink2, NVSwitch, Slingshot, Tofu architectures will also be given. Next, an overview of the OpenFabrics stack which encapsulates IB, HSE, and RoCE (v1/v2) in a unified manner will be presented. An overview of libfabrics stack will also be provided. Hardware/software solutions and the market trends behind these networking technologies will be highlighted. Sample performance numbers of these technologies and protocols for different environments will be presented. Finally, hands-on exercises will be carried out for the attendees to gain first-hand experience of running experiments with high-performance networks”

Talk2: State-of-the-Art High Performance MPI Libraries and Slingshot Networking: “Many top supercomputers utilize InfiniBand networking across nodes to scale out performance. Underlying interconnect technology is a critical component in achieving high performance, low latency and high throughput, at scale on next-generation exascale systems. The deployment of Slingshot networking for new exascale systems such as Frontier at OLCF and the upcoming El-Capitan at LLNL pose several challenges. State-of-the-art MPI libraries for GPU-aware and CPU-based communication should adapt to be optimized for Slingshot networking, particularly with support for the underlying HPE Cray fabric and adapter to have functionality over the Slingshot-11 interconnect. This poses a need for a thorough evaluation and understanding of slingshot networking with regards to MPI-level performance in order to provide efficient performance and scalability on exascale systems. In this work, we delve into a comprehensive evaluation on Slingshot-10 and Slingshot-11 networking with state-of-the-art MPI libraries and delve into the challenges this newer ecosystem poses.”

Read 7: Slides and video. I was aware of DTrace (although I never used it), so I am not sure how it compares with KUtrace. I guess I will ask ChatGPT 🙂

Read 8: Python as the programming language of choice for AI, ML, etc.

Read 9: M$ “buying” energy from fusion reactors.

Wafer-on-Wafer

After reading about wafer scale, I found this other new type of processor:

The Bow IPU is the first processor in the world to use Wafer-on-Wafer (WoW) 3D stacking technology, taking the proven benefits of the IPU to the next level.

An IPU is an “Intelligence Processing Unit”. With the boom of AI/ML, we have a new word.

There is a paper about it, from 2019, but still interesting. I haven’t been able to fully read and understand it, but section 1.2, about the differences between CPU, GPU and IPU, helps to see (at a very high level) how each one works.

CPU: Tends to offer complex cores in relatively small counts. Excels at single-thread performance and control-dominated code, possibly at the expense of energy efficiency and aggregate arithmetic throughput per area of silicon.

GPU: Features smaller cores in a significantly higher count per device (e.g., thousands). GPU cores are architecturally simpler. Excels at regular, dense, numerical, data-flow-dominated workloads that naturally lead to coalesced accesses and a coherent control flow.

IPU: Provides large core counts (1,216 per processor) and offers cores complex enough to execute completely distinct programs. The IPU only offers small, distributed memories that are local and tightly coupled to each core, implemented as SRAM.

From a pure networking view, their pod solution uses 100G RoCEv2. So no InfiniBand.

In general, all this goes beyond my knowledge, but the advances in processor technology with “wafer”-like designs are still interesting. It seems everything used to be focused on CPUs (adding more cores, etc.) and GPUs. But there are new options, although the applications look very niche.

Chips wars

I was reading this news recently about Huawei’s capability of building 14nm chips.

Today, the EDA (Electronic Design Automation) market is largely controlled by three companies: California-based Synopsys and Cadence, as well as Germany's Siemens. 

I wonder, are the export controls good long term? Maybe the cure is worse than the disease…

And based on that, I learned that ASML is the most valuable company in Europe!

As well, there is a book about “chip wars” that I want to read at some point.

CXL

In one meeting somebody mentioned CXL, and I had to look it up.

Interesting:

Eventually, CXL is expected to be an all-encompassing cache-coherent interface for connecting any number of CPUs, memory, process accelerators (notably FPGAs and GPUs), and other peripherals.

Wafer-Scale

Somehow I came across this company, which claims some crazy numbers in just one rack. Then, again nearly by coincidence, I saw this news in an email that mentioned “cerebras” and wafer-scale, a term I had never heard before. So I found some info in Wikipedia, and it all looks amazing. As well, I have learned about Gene Amdahl, as he was the first one to try wafer-scale integration, and about his law. I didn’t know he was such a figure in computer architecture history.

VMware Co-stop / LPM in hardware

This is a very interesting article about how Longest Prefix Matching (LPM) is done in network chips. I remember reading about bloom filters in some Cloudflare blog, but I didn’t think they would be used in network chips too. As well, I had forgotten how critical LPM is in networking.
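As a refresher on what LPM itself does, here is a naive binary-trie sketch of my own (illustration only; real switch chips use TCAMs, compressed tries or bloom-filter-assisted hash lookups as the article describes, not this structure):

```python
import ipaddress

class TrieLPM:
    """Naive binary-trie longest-prefix-match table for IPv4."""

    def __init__(self):
        self.root = {}

    def add(self, prefix, next_hop):
        # walk/create one trie node per prefix bit, store next hop at the end
        net = ipaddress.ip_network(prefix)
        bits = format(int(net.network_address), "032b")[: net.prefixlen]
        node = self.root
        for b in bits:
            node = node.setdefault(b, {})
        node["nh"] = next_hop

    def lookup(self, addr):
        # walk the address bits, remembering the deepest next hop seen
        bits = format(int(ipaddress.ip_address(addr)), "032b")
        node, best = self.root, self.root.get("nh")
        for b in bits:
            if b not in node:
                break
            node = node[b]
            if "nh" in node:
                best = node["nh"]   # a longer (more specific) match
        return best

t = TrieLPM()
t.add("10.0.0.0/8", "A")
t.add("10.1.0.0/16", "B")
print(t.lookup("10.1.2.3"))    # B — the most specific prefix wins
print(t.lookup("10.9.9.9"))    # A
print(t.lookup("192.168.1.1")) # None — no matching prefix
```

The “longest match wins” rule is exactly why LPM is hard to do at line rate: every lookup may have to consider many overlapping prefixes, which is what the TCAM/bloom-filter tricks in the article are accelerating.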

I had to deal lately with some performance issues with an application running in a VM. I am quite a noob regarding virtualization and always assumed the bigger the VM, the better (a very masculine thing, I guess…). But a virtualization expert at work explained the issues with that assumption via this link. I learnt a lot from it (still a noob though). And I agree that most vendors ask for crazy requirements when offering products to run in VMs… which kind of kills the very idea of a virtualization environment, because such a VM effectively requires a dedicated server. So right-sizing your product/VM is very important. I agree with the statement that vendors don’t really do load testing for their VM offerings, and that the high requirements are an excuse to “avoid” problems from customers.

Analog Computing

This is an interesting video about how we can use analog computing. It seems a good fit for the matrix calculations used in AI.

All our technology is digital, but we are reaching limits (power usage, physical limits, etc.), and the “boom” in AI seems likely to benefit from analog computing.
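The usual analog-computing pitch for AI is that a resistive crossbar does a matrix-vector product in one physical step: apply voltages on the rows and, by Ohm’s and Kirchhoff’s laws, each column current is the weighted sum I_j = Σ_i G[i][j]·V[i], with the conductances G encoding the weights. A noise-free simulation of that idea (my own sketch, not from the video):

```python
def crossbar_matvec(G, V):
    """Ideal resistive crossbar: G[i][j] is the conductance (siemens)
    at the crossing of row i and column j, V[i] the row voltage.
    Returns the column currents I_j = sum_i G[i][j] * V[i]."""
    cols = len(G[0])
    return [sum(G[i][j] * V[i] for i in range(len(V))) for j in range(cols)]

G = [[0.5, 1.0],   # one row of conductances per input line
     [2.0, 0.0]]
V = [1.0, 3.0]     # input voltages
print(crossbar_matvec(G, V))  # [6.5, 1.0]
```

Real devices add noise, drift and limited precision, which is why analog accelerators target workloads like neural-net inference that tolerate approximate arithmetic.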

RISC-V

I have been reading this book during my lunch breaks for several months. Most of the time just a couple of pages, to be honest, as my knowledge of CPU architecture is generally very poor. I really enjoyed this subject at uni, and this book was one of my favourites during that time. It was like a bible of CPU architecture. And Patterson is an author of both books.

I remember that there were two main architectures: RISC vs CISC. In a very summarized way, RISC used simple instructions that were easy to parallelize and execute (at the cost of more instructions to execute), while CISC used complex instructions (fewer to execute) that were difficult to scale. So let’s say simplicity (RISC) “won” the race in CPU architecture.

RISC-V is an open standard, so anybody can produce CPUs that execute those instructions. So you can easily get your hands dirty by getting a board.

One of the goals of RISC-V is to learn from the mistakes of all previous architectures and provide a design that works for any type of processor (from embedded to supercomputers) and is efficient, modular and stable.

The book compares RISC-V with existing ISAs such as ARM-32, MIPS-32, Intel x86-32, etc., based on cost, simplicity, performance, isolation from implementation, room for growth, code size and ease of programming.

There were many parts of the book that I couldn’t really understand, but I found chapter 8 quite interesting. It is about how to compute data concurrently. The best-known architecture is Single Instruction Multiple Data (SIMD). The alternative is the vector architecture, and this is what RISC-V uses. The implementation details are way out of my league.

In summary, it was a nice read to refresh a bit my CPU architecture knowledge.