{"id":2177,"date":"2026-06-06T13:39:15","date_gmt":"2026-06-06T12:39:15","guid":{"rendered":"https:\/\/blog.thomarite.uk\/?p=2177"},"modified":"2026-06-06T13:39:15","modified_gmt":"2026-06-06T12:39:15","slug":"friction-morning-routine-roce-meta-paper-aws-rng-slurm-rail-optimize-800vdc-phyllo-approach","status":"publish","type":"post","link":"https:\/\/blog.thomarite.uk\/index.php\/2026\/06\/06\/friction-morning-routine-roce-meta-paper-aws-rng-slurm-rail-optimize-800vdc-phyllo-approach\/","title":{"rendered":"Friction, Morning Routine, RoCe Meta Paper, AWS RNG, Slurm, Rail-Optimize, 800VDC, Phyllo, Approach"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/www.youtube.com\/watch?v=EaFhlWqhqvw\">Manson AI<\/a>: You need friction<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/www.youtube.com\/watch?v=P5sKKnWCvzk\">AI Agent<\/a>: narrow focus: goal, proof, steps<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/www.youtube.com\/watch?v=Qj54_WVt_RA\">Bear Grylls&#8217; Morning Routine<\/a>: Cold (never get used to it), barefoot, strength training, 30 minutes<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/engineering.fb.com\/2024\/08\/05\/data-center-engineering\/roce-network-distributed-ai-training-at-scale\/\">RoCE networks for distributed AI training at scale<\/a>: I have managed to read the <a href=\"https:\/\/cs.stanford.edu\/~keithw\/sigcomm2024\/sigcomm24-final246-acmpaginated.pdf\">paper<\/a>! Although in the AI world, two years is an eternity, I think it is still interesting.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>1) <strong>Network Topology<\/strong>: backend network only for GPUs (RDMA NICs), non-blocking. Frontend network: data ingestion, checkpointing, logging.\n\nPod = AI zone\n leaf = RTSW, DAC cables, shallow buffer\n spine = CTSW, deep buffers. 
fiber between leaf-spine.\nSuperSpine = ATSW, oversubscribed, connect AI zones\n\nintra-node -&gt; NVLink\nRoCE: cpu offloading, ethernet (standard)\n\ncollective communication library serves as the sw abstraction between training workloads and the NIC\n                                 schedules verbs calls over QP (Queue Pairs)\n parallelism strategy determines collective: allreduce, allgather, alltoall\n choice logical topology:\n\n------------------\n\n<strong>2) Routing<\/strong>: workload. low entropy flows (few flows) -&gt; ECMP bad (5-tuple udp: src\/dst ip, src\/dst port, protocol), burstiness, elephant flows\n--\n\n RTSW uplinks 1:2 under-subscribed! -&gt; expensive (short-term)\n 1) QP scaling: use destination QP of RoCE packet using the UDF capability in switch to increase entropy -&gt; Enhanced ECMP -&gt; short-term\n 2) Central TE controller -&gt; long-term: CP real-time topology end-to-end cluster, \n                                        flow matrix (flow bps) + CSPF (constrained SPF)\n                                        write in switches dataplane\n                                     DP: TE overrides default BGP routing policy in leaf. Use Exact Match table.\n                           Not good with multiple link failures. Doesn't scale \n 3) Flowlet switching: try to improve 1 and 2. hw-assisted scheme. put packets in different ports in ECMP\n     out-of-order: move packets only after 1\/2 RTT\n     load-aware path assignment: better than TE\n\n------------------\n                     \n<strong>3) Transport<\/strong>: congestion management. Start with DCQCN. packet drops on ACK\/NACK can cause prolonged Local ACK timeout (LAT)\n--\n Tuning DCQCN not great: strict ECN -&gt; minimize PFC (can lead to head-of-line blocking)\n\n 200G, we stayed with relaxed ECN marking, allowing for buffer build up in the CTSW, while keeping default DCQCN settings.\n 400G We proceeded without DCQCN. 
just PFC for flow control\n re-design collective library: two-stage copy\n\n------------------\n\n4) <strong>Operations<\/strong>:\n Change QoS priority of Clear to Send (CTS) messages. In RTSW ASIC, modify DSCP marking for ACK messages\n Tuning VOQ in CTSW\n observability: OOS: out of seq.\n                Link flaps\n                Local ACK timeouts (LAT)\n                PFC watchdog: catch any long-duration PFC pause (&gt;200ms)\n                buffer utilization RTSW\n                reachability (pings)\n                constant latency monitoring loaded and unloaded (catch regressions)\n                baselines!!!\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Perplexity: <a href=\"https:\/\/research.perplexity.ai\/articles\/hosting-qwen-on-blackwell\">Hosting Qwen on Blackwell<\/a>:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">AWS <a href=\"https:\/\/www.aboutamazon.com\/stories\/aws-random-graph-theory-data-center-network-design?&amp;utm_term=36\">RNG<\/a> &#8211; Random Graph Network: The <a href=\"https:\/\/arxiv.org\/pdf\/2604.15261\">paper<\/a> is totally out of my space, but the concept looks brutal. With an operations hat on, how do you troubleshoot it? (ping, traceroute, link congestion, data flow patterns, etc.)<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/coreweave.com\/topics\/what-is-slurm\">Slurm<\/a>: I like the &#8220;Slurm vs. Kubernetes&#8221; comparison<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\">Slurm Workload Manager (short for Simple Linux Utility for Resource Management) has become a cornerstone of large-scale computing. Originally created in the early 2000s to support large-scale high-performance computing (HPC) environments, Slurm is now widely recognized as the de facto scheduler for HPC clusters. 
Today, it orchestrates jobs across thousands of servers and&nbsp;<a href=\"https:\/\/www.coreweave.com\/topics\/what-is-a-gpu\">GPUs<\/a>&nbsp;in some of the world\u2019s most advanced computing environments.&nbsp;<\/p>\n<\/blockquote>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/alex.smola.org\/posts\/37-infiniband-2n\/\">Interview Question<\/a>: 512 GPUs, non-blocking (full bisection) and 2xUFM! I really liked this. I think for once I understand the rail-optimized design (fat-tree = leaf-spine). Just break one leaf-spine link, beautiful!!!<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"581\" src=\"https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2026\/05\/image-5-1024x581.png\" alt=\"\" class=\"wp-image-2194\" srcset=\"https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2026\/05\/image-5-1024x581.png 1024w, https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2026\/05\/image-5-300x170.png 300w, https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2026\/05\/image-5-768x436.png 768w, https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2026\/05\/image-5.png 1201w\" sizes=\"auto, (max-width: 709px) 85vw, (max-width: 909px) 67vw, (max-width: 1362px) 62vw, 840px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/newsletter.semianalysis.com\/p\/inside-the-800vdc-revolution-part\">800VDC<\/a>: Next step in electrical infra in the data center space.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/www.youtube.com\/watch?v=hapSlAP2xrc\">Phyllo by hand<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/www.youtube.com\/shorts\/voGdgHdI-Js\">Approach woman<\/a>: curiosity and no performance. Practice. Be at peace with discomfort and awkwardness. 
Rejection as learning<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Manson AI: You need friction AI Agent: narrow focus: goal, proof, steps Bear Grylls&#8217; Morning Routine: Cold (never get used to it), barefoot, strength training, 30 minutes RoCE networks for distributed AI training at scale: I have managed to read the paper! Although in the AI world, two years is an eternity, I think &hellip; <a href=\"https:\/\/blog.thomarite.uk\/index.php\/2026\/06\/06\/friction-morning-routine-roce-meta-paper-aws-rng-slurm-rail-optimize-800vdc-phyllo-approach\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Friction, Morning Routine, RoCe Meta Paper, AWS RNG, Slurm, Rail-Optimize, 800VDC, Phyllo, Approach&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[21,13,9,27,18,2],"tags":[],"class_list":["post-2177","post","type-post","status-publish","format-standard","hentry","category-automation","category-aws","category-cooking","category-kubernetes","category-maths","category-networks"],"_links":{"self":[{"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/posts\/2177","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/comments?post=2177"}],"version-history":[{"count":11,"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/posts\/2177\/revisions"}],"predecessor-version":[{"id":2198,"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/posts\/2177\/revisions\/2198"}],"wp:attachment":[{"href":"
https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/media?parent=2177"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/categories?post=2177"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/tags?post=2177"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}