Several posts worth reading. There are plenty of things that go over my knowledge. I already posted this one; it is a good refresher.
GPU Fabrics: The first part of the article is where I am most lost, as it is about training and the communication between GPUs depending on how the models are partitioned. There are several references to improvements such as the use of FP8 and different parallelization topologies. It also made NVLink a bit clearer to me (an internal switch for connecting GPUs inside the same server or rack).
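As a rough illustration of the FP8 idea (my own hedged sketch, not from the article, and assuming NVIDIA's Transformer Engine library is available): supported layers can be run in 8-bit floating point under an fp8_autocast context, which is one of the training optimizations the article alludes to.

```python
# Minimal FP8 training-step sketch (assumes NVIDIA Transformer Engine is installed).
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Recipe describing how FP8 scaling factors are handled.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.E4M3)

model = te.Linear(4096, 4096, bias=True).cuda()   # TE layer with FP8 support
inp = torch.randn(512, 4096, device="cuda")

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = model(inp)          # forward pass runs the matmul in FP8

out.sum().backward()          # backward pass also uses FP8 where supported
```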
When it moved on to inter-server traffic, I started to understand things a bit better, like "rail-optimized" (it is like having a "plane" at my old job, where a leaf only connects to one spine instead of all spines; in this case each GPU connects to just one leaf. If your cluster is bigger, then you need spines). I am not keen on modular chassis from an operations point of view, but it is mentioned as an option. Fat-tree CLOS and Dragonfly remind me of InfiniBand, like all RDMA.
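To make the "rail-optimized" idea concrete, here is a toy sketch (my own illustration; the sizes and names are made up): GPU k of every server attaches to leaf (rail) k, so GPUs that share a local index talk through a single leaf, and only cross-rail traffic needs a spine.

```python
# Toy model of a rail-optimized fabric (hypothetical sizes, for illustration only).
# GPU index k on every server plugs into leaf switch k ("rail" k).

GPUS_PER_SERVER = 8  # e.g. a DGX-style server

def rail_of(server: int, gpu: int) -> int:
    """Return the leaf (rail) a given GPU connects to."""
    return gpu  # rail == local GPU index, independent of the server

def path(src: tuple[int, int], dst: tuple[int, int]) -> str:
    """Describe the path between two (server, gpu) endpoints."""
    if src[0] == dst[0]:
        return "NVLink inside the server"
    if rail_of(*src) == rail_of(*dst):
        return f"one leaf (rail {rail_of(*src)})"
    return "leaf -> spine -> leaf (cross-rail)"

print(path((0, 3), (5, 3)))  # same rail: stays on one leaf
print(path((0, 3), (5, 4)))  # different rails: needs a spine hop
```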
And fabric congestion is a big topic with many different approaches: adaptive load balancing (IB again), several congestion control protocols, and mentions of the Google (CSIG) and Amazon (SRD) implementations.
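A toy sketch of the difference between static ECMP hashing and adaptive load balancing (my own simplification, not from the article): ECMP pins a flow to one uplink by hashing its 5-tuple, so a few elephant flows can pile onto the same link, while an adaptive scheme sends the next chunk on whichever uplink is least loaded.

```python
# Static ECMP vs adaptive load balancing, toy illustration only.
import hashlib

LINKS = 4  # uplinks from a leaf towards the spines (made-up number)

def ecmp_link(flow: tuple) -> int:
    """Static ECMP: the 5-tuple hash pins the flow to one uplink for its lifetime."""
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return digest[0] % LINKS

def adaptive_link(link_load: list[int]) -> int:
    """Adaptive LB: pick the currently least-loaded uplink for the next chunk."""
    return min(range(LINKS), key=lambda i: link_load[i])

load = [0] * LINKS
flows = [("10.0.0.1", "10.0.1.1", 17, 4791, i) for i in range(8)]  # RoCE-like flows

for flow in flows:
    load[adaptive_link(load)] += 1_000_000       # adaptive: spreads chunks evenly
print("adaptive load per uplink:", load)

print("ECMP link choices:", [ecmp_link(f) for f in flows])  # static: may cluster on few links
```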
In general, I liked the article because I don't really feel any bias (she works for Juniper) and it is very open about the solutions from different players.
LLM Inference – HW/SW Optimizations: The explanation of LLM inferencing and all the different optimizations is interesting (I doubt I could explain it myself though). The HW optimization section (different custom HW solutions vs GPUs) was a bit more familiar. My summary is that you don't need the same infrastructure (and cost) for inference as for training, and there is an interest for companies to own that part, as it should be better and cheaper than hosting with somebody else.
Network Acceleration for AI/ML workloads: Nice to have a summary of the different "collectives". "Collectives" refer to a set of operations involving communication among a group of processing nodes (like GPUs) to perform coordinated tasks. For example, NCCL (Nvidia Collective Communication Library) efficiently implements the collective operations designed for their GPU architecture. When a model is partitioned across a set of GPUs, NCCL manages all communication between them. Network switches can help offload some or all of the collective operations. Nvidia supports this in their InfiniBand and NVLink switches using SHARP (Scalable Hierarchical Aggregation and Reduction Protocol – proprietary). This is called "in-network computing". For Ethernet, there are no standards yet; the Ultra Ethernet Consortium is working on it, but it will take years until something is seen in production. And Juniper has the programmable Trio architecture (MX routers – paper) that can do this offloading (you need to program it though, in a language similar to C). Still, using switches for this is not a perfect solution. The usage of collectives in inference is less common than their extensive use during the training phase of deep learning models, primarily because inference tasks can often be executed on a single GPU.
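As a hedged sketch of what a collective looks like in practice (my own example, using PyTorch's distributed API on top of the NCCL backend; the tensor size is made up): each GPU holds its own tensor, and an all-reduce leaves every rank with the element-wise sum across all ranks.

```python
# Minimal all-reduce sketch over NCCL (my own example).
# Launch with: torchrun --nproc_per_node=8 allreduce_demo.py
import os
import torch
import torch.distributed as dist

def main() -> None:
    # torchrun sets RANK, WORLD_SIZE and LOCAL_RANK in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each GPU starts with its own tensor (stand-in for a gradient shard).
    t = torch.full((1024,), float(dist.get_rank()), device="cuda")

    # All-reduce: every rank ends up with the sum over all ranks.
    dist.all_reduce(t, op=dist.ReduceOp.SUM)

    if dist.get_rank() == 0:
        print("after all-reduce, t[0] =", t[0].item())  # sum of 0..world_size-1

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```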
On different topics:
Learning at Cambridge: spend fewer hours studying, don't take notes (that's hard for me), and go wild with active learning (work on exercises until you fully understand them).
British Library cyberattack: blog and public lessons-learned report. I know this is happening too often to many different institutions, but this one caught my eye. I think it is a recurrent theme in most government institutions, where upgrading is expensive (because it is not done often), budgets are tight, and IT experts are scarce.
“Our major software systems cannot be brought back in their pre-attack form, either because they are no longer supported by the vendor or because they will not function on the new secure infrastructure that is currently being rolled out”
However, the first detected unauthorised access to our network was identified at the Terminal Services server. Likely a compromised account.
Personally, I wonder what you can get from "stealing" from a library???