{"id":1435,"date":"2023-10-28T18:44:19","date_gmt":"2023-10-28T17:44:19","guid":{"rendered":"https:\/\/blog.thomarite.uk\/?p=1435"},"modified":"2023-10-28T18:44:19","modified_gmt":"2023-10-28T17:44:19","slug":"networking-scale-2023","status":"publish","type":"post","link":"https:\/\/blog.thomarite.uk\/index.php\/2023\/10\/28\/networking-scale-2023\/","title":{"rendered":"Networking Scale 2023"},"content":{"rendered":"\n<p>This is a conference about networks that I was interested in, and I finally got some emails with the <a href=\"https:\/\/atscaleconference.com\/events\/networking-scale-2023\/?\">presentations<\/a>. They are mainly from Meta.<\/p>\n\n\n\n<p>Meta\u2019s Network Journey to Enable AI: <a href=\"https:\/\/www.youtube.com\/watch?v=rJEYoCym-uo\">video<\/a> &#8211; the second part is interesting.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>AI fabric (backend: GPU to GPU) hanging from the DC fabric.<\/li>\n\n\n\n<li>SPC (Space, Power, Cooling)<\/li>\n\n\n\n<li>Fiber, Automation<\/li>\n\n\n\n<li>RDMA requires lossless, low-latency, in-order delivery -> RoCEv2 (Ethernet) or IB<\/li>\n\n\n\n<li>Servers have 8x400G to the TOR. TOR 400G to spines.<\/li>\n\n\n\n<li>1x AI zone per DH. 1x DC has several DHs.<\/li>\n\n\n\n<li>Oversubscribed between zones, eBGP, ECMP.<\/li>\n<\/ul>\n\n\n\n<p>Scaling RoCE Networks for AI Training: <a href=\"https:\/\/www.youtube.com\/watch?v=H564lUSK804\">video<\/a> &#8212; Really, really good.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>RDMA\/IB has been used for a long time in research.<\/li>\n\n\n\n<li>Training: learning a new capability from existing data (the focus of the video)<\/li>\n\n\n\n<li>Inference: applying this capability to new data (real time)<\/li>\n\n\n\n<li>Distributed training for complex models. GPU-to-GPU sync -> high BW and low\/predictable latency.<\/li>\n\n\n\n<li>RoCEv2 with (tuned) PFC\/ECN. 
TE + ECMP (flow multiplexing)<\/li>\n\n\n\n<li>Oversubscription is fine in the spine (higher layer)<\/li>\n\n\n\n<li>Challenges: load balancing (elephant flows), slow receivers\/back pressure, packet loss from L1 issues (those flapping links, faulty optics, cables, etc. xD), debugging (finding job failures)<\/li>\n<\/ul>\n\n\n\n<p>Traffic Engineering for AI Training Networks: <a href=\"https:\/\/www.youtube.com\/watch?v=-24Ud5BjZB0\">video<\/a> &#8211; both parts interesting.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Non-blocking. RTSW=TOR. CTSW=Spine. Fat-tree architecture. 2x servers per rack. 1x server = 8x GPUs. CTSW = 16 downlinks -> 16 uplinks. Up to 208 racks?<\/li>\n\n\n\n<li>RoCE since 2020. CTSWs are high-radix, deep-buffer switches.<\/li>\n\n\n\n<li>AI workload challenges: low entropy (repetitive, predictable flows), bursty, high-intensity elephant flows.<\/li>\n\n\n\n<li>SW-based TE: dynamic routing adapted in real time. Adaptive job placement. Controller (stateless)<\/li>\n\n\n\n<li>Data plane: overlay (features from Broadcom chips) and underlay (BGP)<\/li>\n\n\n\n<li>Flow granularity: NIC-to-host flow.<\/li>\n\n\n\n<li>Handle network failures with minimum convergence time. Backdoor channel with an in-house protocol.<\/li>\n\n\n\n<li>Simulation platform. NCCL benchmark.<\/li>\n<\/ul>\n\n\n\n<p>Networking for GenAI Training and Inference Clusters: <a href=\"https:\/\/www.youtube.com\/watch?v=192S3xNbcEs\">video<\/a> &#8211; Super good!<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Recommendation model: training 100 GFLOPs\/iteration. Inference: a few GFLOPs\/s for 100 ms latency.<\/li>\n\n\n\n<li>LLM: training 1 PetaFLOP\/sentence (3 orders of magnitude > recommendation), inference: 10 PF\/s for 1 s time-to-first-token. +10k GPUs for training. Distributed inference. Needs compute too.<\/li>\n\n\n\n<li>LLama2 70B -> 1.7M GPU-hours. IB 200G per GPU, 51.2 TB\/s bisection BW. 800 ZettaFLOPs. 2-trillion-token dataset. 2k A100 GPUs. 
RoCEv2 was also used (for LLama2 34B).<\/li>\n\n\n\n<li>+30 ExaFLOPs (30% of H100 FP8 peak) -> LLama 65B trained in &lt; 1 day.<\/li>\n\n\n\n<li>Massive cluster: 32k GPUs! Model parallelism.<\/li>\n\n\n\n<li>LLM inference: a dual-edge problem. Prefill: large messages (high BW) + decode: small messages (latency sensitive).<\/li>\n\n\n\n<li>Scale out (-BW, large domain; scalable RDMA (IB or Ethernet), data-parallel traffic) + scale up (+BW, smaller domain; NVLink 400G, model-parallel traffic)<\/li>\n\n\n\n<li>32k GPUs. TOR (252), Spine (18), AGG (18). 3 levels. Oversubscription Spine-AGG 7:1. 8 clusters. 252 racks per cluster. 16 GPUs per rack (8x252x16=32k GPUs). RoCEv2!<\/li>\n\n\n\n<li>Model parallelism is harder on computation. Model-parallel traffic: all-reduce\/all-to-all, big messages (inside the cluster = scale-up, NVLink). Data-parallel traffic: all-gather &amp; reduce-scatter (between clusters = scale-out).<\/li>\n\n\n\n<li>Challenges: latency matters more than for ranking. Reliability!!!<\/li>\n\n\n\n<li>LLM inference needs a fabric.<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"540\" src=\"https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2023\/10\/image-5-1024x540.png\" alt=\"\" class=\"wp-image-1438\" srcset=\"https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2023\/10\/image-5-1024x540.png 1024w, https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2023\/10\/image-5-300x158.png 300w, https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2023\/10\/image-5-768x405.png 768w, https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2023\/10\/image-5-1200x633.png 1200w, https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2023\/10\/image-5.png 1474w\" sizes=\"auto, (max-width: 709px) 85vw, (max-width: 909px) 67vw, (max-width: 1362px) 62vw, 840px\" \/><\/figure>\n\n\n\n<p>Scale out vs scale up: <a 
href=\"https:\/\/blog.purestorage.com\/purely-informational\/scale-out-vs-scale-up-whats-the-difference\/\">storage<\/a>    <a href=\"https:\/\/azure.microsoft.com\/en-au\/resources\/cloud-computing-dictionary\/scaling-out-vs-scaling-up\/\">DB<\/a><\/p>\n\n\n\n<p>Scale up (vertical): more BW (links), more storage, etc.<\/p>\n\n\n\n<p>Scale out (horizontal): distribute the load across different devices.<\/p>\n\n\n\n<p>Network Observability for AI\/HPC Training Workflows: <a href=\"https:\/\/www.youtube.com\/watch?v=-xB_8_z7uuY\">video<\/a><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>ROCET: automating RDMA metric collection and analysis for GPU training. Info from hosts\/NICs and switches.<\/li>\n\n\n\n<li>Reports: out-of-sequence packets, NIC flaps, local ACK timeouts.<\/li>\n\n\n\n<li><a href=\"https:\/\/github.com\/facebookresearch\/param\">PARAM<\/a> + PyTorch. <a href=\"https:\/\/arxiv.org\/pdf\/2305.14516.pdf\">Chakra<\/a>.<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>This is a conference about networks that I was interested in, and I finally got some emails with the presentations. They are mainly from Meta. Meta\u2019s Network Journey to Enable AI: video &#8211; the second part is interesting. Scaling RoCE Networks for AI Training: video &#8212; Really, really good. 
Traffic Engineering for AI Training Networks\u00a0: video &#8211; interesting &hellip; <a href=\"https:\/\/blog.thomarite.uk\/index.php\/2023\/10\/28\/networking-scale-2023\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Networking Scale 2023&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2],"tags":[],"class_list":["post-1435","post","type-post","status-publish","format-standard","hentry","category-networks"],"_links":{"self":[{"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/posts\/1435","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/comments?post=1435"}],"version-history":[{"count":1,"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/posts\/1435\/revisions"}],"predecessor-version":[{"id":1439,"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/posts\/1435\/revisions\/1439"}],"wp:attachment":[{"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/media?parent=1435"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/categories?post=1435"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/tags?post=1435"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}