Reading different articles: 1, 2, 3 I was made aware of this new architecture of CPU-GPU-HMB3 from AMD.
As well, Meta has a new DC design for ML/AI using Nvidia and Infiniband.
Now, Meta – working with Nvidia, Penguin Computing and Pure Storage – has completed the second phase of the RSC. The full system includes 2,000 DGX A100 systems, totaling a staggering 16,000 A100 GPUs. Each node has dual AMD Epyc “Rome” CPUs and 2TB of memory. The RSC has up to half an exabyte of storage and, according to Meta, one of the largest known flat InfiniBand fabrics in the world, with 48,000 links and 2,000 switches. (“AI training at scale is nothing if we cannot supply the data fast enough to the GPUs, right?” said Kalyan Saladi – a software engineer at Meta – in a presentation at the event.)
An again, cooling is critical.