{"id":1819,"date":"2024-08-26T10:37:37","date_gmt":"2024-08-26T09:37:37","guid":{"rendered":"https:\/\/blog.thomarite.uk\/?p=1819"},"modified":"2024-11-09T18:02:40","modified_gmt":"2024-11-09T18:02:40","slug":"cloudflare-backbone-2024-cisco-ai-leetcode-alibaba-hpn-altman-ubi-xai-100k-gpu-crowdstrike-rca","status":"publish","type":"post","link":"https:\/\/blog.thomarite.uk\/index.php\/2024\/08\/26\/cloudflare-backbone-2024-cisco-ai-leetcode-alibaba-hpn-altman-ubi-xai-100k-gpu-crowdstrike-rca\/","title":{"rendered":"Cloudflare backbone 2024, Cisco AI, Leetcode, Alibaba HPN, Altman UBI, xAI 100k GPU, Crowdstrike RCA, Github deleted data, DGX SuperPod, how ssh works, Grace Hopper Nvidia"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/blog.cloudflare.com\/backbone2024\">Cloudflare backbone 2024<\/a>: Everything very high level. 500% backbone capacity increase since 2021. Use of MPLS + SR-TE. Would be interesting to see how they operate\/automate that many PoPs.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/www.nextplatform.com\/2024\/08\/15\/ai-pervades-the-cisco-stack-but-is-only-starting-to-drive-sales\/\">Cisco AI<\/a>: &#8220;three of the top four hyperscalers deploying our Ethernet AI fabric&#8221; I assume it is Google, Microsoft and Meta? AWS is the fourth and biggest.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/www.theregister.com\/2024\/08\/07\/huawei_cloud_rd_probe\/\">Huawei Cloud Monitor<\/a>: Haven&#8217;t read the paper <a href=\"https:\/\/cs.stanford.edu\/~keithw\/sigcomm2024\/sigcomm24-final696-acmpaginated.pdf\">RD-Probe<\/a>.
I would expect a git repo with the code \ud83d\ude42 It also refers to the AWS <a href=\"https:\/\/storage.googleapis.com\/site-media-prod\/meetings\/NANOG88\/4790\/20230613_Evans_No_Packet_Left_v1.pdf\">pdf<\/a> and <a href=\"https:\/\/www.youtube.com\/watch?v=FixkCbixgMM\">video<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/www.reddit.com\/r\/leetcode\/comments\/1ex7a1k\/i_automated_leetcode_using_claudes_35_sonnet_api\/\">Automated Leetcode<\/a>: One day, I should have time to use it and learn more programming, although AI can solve them quicker than me \ud83d\ude42<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/www.youtube.com\/watch?v=aTTeCwHq5AI\">Alibaba Cloud HPN<\/a>: <a href=\"https:\/\/www.linkedin.com\/feed\/update\/urn:li:activity:7228407318659416064\/\">linkedin<\/a>, <a href=\"https:\/\/dl.acm.org\/doi\/10.1145\/3651890.3672265\">paper<\/a>, <a href=\"https:\/\/github.com\/Yingzhen-ietf\/AIDC-IETF120\/tree\/main\">AIDC material<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">LLM Traffic Pattern: periodic bursty flows, few flows (harder to load-balance)<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Sensitive to failures: GPU, link, switch, etc<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Limitations of Traditional Clos: ECMP (hash polarization) and SPOF in TORs<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">HPN goals: <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">-Scalability: up to 100k GPUs<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">-Performance: low latency (minimum number of hops) and maximum network utilization<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">-Reliability: Use two TORs with LACP from the host.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Tier1<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&#8211; Use single-chip 51.2Tbps switches. They are more reliable.
Dual TOR<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&#8211; 1k GPUs in a segment (like NVLink) Rail-optimized network<\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"571\" height=\"338\" src=\"https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2024\/08\/image.png\" alt=\"\" class=\"wp-image-1820\" style=\"width:360px;height:auto\" srcset=\"https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2024\/08\/image.png 571w, https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2024\/08\/image-300x178.png 300w\" sizes=\"auto, (max-width: 571px) 85vw, 571px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Tier2: Eliminating load imbalance: Using dual plane. It has oversubscription<\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"595\" height=\"389\" src=\"https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2024\/08\/image-1.png\" alt=\"\" class=\"wp-image-1821\" style=\"width:365px;height:auto\" srcset=\"https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2024\/08\/image-1.png 595w, https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2024\/08\/image-1-300x196.png 300w\" sizes=\"auto, (max-width: 595px) 85vw, 595px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"592\" height=\"301\" src=\"https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2024\/08\/image-2.png\" alt=\"\" class=\"wp-image-1822\" style=\"width:399px;height:auto\" srcset=\"https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2024\/08\/image-2.png 592w, https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2024\/08\/image-2-300x153.png 300w\" sizes=\"auto, (max-width: 592px) 85vw, 592px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Tier3: connects several pods. Can reach 100k GPUs.
Independent front-end network<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/www.theregister.com\/2024\/07\/23\/sam_altman_basic_income\/\">Altman Universal Basic Income Study<\/a>: It doesn&#8217;t fix all problems, but in my opinion, it helps, and it is a good direction.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/www.nextplatform.com\/2024\/07\/30\/so-who-is-building-that-100000-gpu-cluster-for-xai\/\">xAI 100k GPU cluster<\/a>: 100k liquid-cooled H100s on a single RDMA fabric. Looks like Supermicro is involved for the servers and Juniper only for the front-end network. NVIDIA provides all Ethernet switches with Spectrum-4. Very interesting. <a href=\"https:\/\/nvidianews.nvidia.com\/news\/spectrum-x-ethernet-networking-xai-colossus?ncid=so-nvsh-520000\">Confirmation<\/a> from NVIDIA (Spectrum used = Ethernet). More <a href=\"https:\/\/www.servethehome.com\/inside-100000-nvidia-gpu-xai-colossus-cluster-supermicro-helped-build-for-elon-musk\/\">details<\/a> with a video.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/www.crowdstrike.com\/wp-content\/uploads\/2024\/08\/Channel-File-291-Incident-Root-Cause-Analysis-08.06.2024.pdf\">Crowdstrike RCA<\/a>:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/trufflesecurity.com\/blog\/anyone-can-access-deleted-and-private-repo-data-github\">Github access deleted data<\/a>: Didn&#8217;t know about it. Interesting and scary.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Nvidia DGX SuperPod: <a href=\"https:\/\/docs.nvidia.com\/https:\/docs.nvidia.com\/dgx-superpod-reference-architecture-dgx-h100.pdf\">reference architecture<\/a>. video. 1 pod is 16 racks with 4 DGX each (128&#215;8=1024 GPU per pod), 2xIB fabric: compute + storage, fat tree, rail-optimized, liquid cooling.
32k GPU fills a DC.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"478\" src=\"https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2024\/08\/image-3-1024x478.png\" alt=\"\" class=\"wp-image-1825\" style=\"width:501px;height:auto\" srcset=\"https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2024\/08\/image-3-1024x478.png 1024w, https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2024\/08\/image-3-300x140.png 300w, https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2024\/08\/image-3-768x359.png 768w, https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2024\/08\/image-3.png 1116w\" sizes=\"auto, (max-width: 709px) 85vw, (max-width: 909px) 67vw, (max-width: 1362px) 62vw, 840px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"752\" height=\"677\" src=\"https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2024\/08\/image-4.png\" alt=\"\" class=\"wp-image-1827\" style=\"width:486px;height:auto\" srcset=\"https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2024\/08\/image-4.png 752w, https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2024\/08\/image-4-300x270.png 300w\" sizes=\"auto, (max-width: 709px) 85vw, (max-width: 909px) 67vw, (max-width: 984px) 61vw, (max-width: 1362px) 45vw, 600px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/blog.bytebytego.com\/p\/ep124-how-does-ssh-work\">How SSH works<\/a>: So powerful, and I am still so clueless about it<\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"719\" height=\"833\" src=\"https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2024\/08\/image-5.png\" alt=\"\" class=\"wp-image-1828\" style=\"width:467px;height:auto\" srcset=\"https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2024\/08\/image-5.png 
719w, https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2024\/08\/image-5-259x300.png 259w\" sizes=\"auto, (max-width: 709px) 85vw, (max-width: 909px) 67vw, (max-width: 984px) 61vw, (max-width: 1362px) 45vw, 600px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/chipsandcheese.com\/2024\/07\/31\/grace-hopper-nvidias-halfway-apu\/\">Chips and Cheese GH200<\/a>: Nice analysis of the Nvidia Grace CPU (ARM Neoverse) and Hopper H100 GPU<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Cloudflare backbone 2024: Everything very high level. 500% backbone capacity increase since 2021. Use of MPLS + SR-TE. Would be interesting to see how they operate\/automate that many PoPs. Cisco AI: &#8220;three of the top four hyperscalers deploying our Ethernet AI fabric&#8221; I assume it is Google, Microsoft and Meta? AWS is the fourth and &hellip; <a href=\"https:\/\/blog.thomarite.uk\/index.php\/2024\/08\/26\/cloudflare-backbone-2024-cisco-ai-leetcode-alibaba-hpn-altman-ubi-xai-100k-gpu-crowdstrike-rca\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Cloudflare backbone 2024, Cisco AI, Leetcode, Alibaba HPN, Altman UBI, xAI 100k GPU, Crowdstrike RCA, Github deleted data, DGX SuperPod, how ssh works, Grace Hopper
Nvidia&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[13,20,2,1],"tags":[],"class_list":["post-1819","post","type-post","status-publish","format-standard","hentry","category-aws","category-economy","category-networks","category-uncategorised"],"_links":{"self":[{"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/posts\/1819","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/comments?post=1819"}],"version-history":[{"count":6,"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/posts\/1819\/revisions"}],"predecessor-version":[{"id":1875,"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/posts\/1819\/revisions\/1875"}],"wp:attachment":[{"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/media?parent=1819"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/categories?post=1819"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/tags?post=1819"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}