Friction, Morning Routine, RoCE Meta Paper, AWS RNG, Slurm, Rail-Optimized, 800VDC, Phyllo, Approach

Manson AI: You need friction

AI Agent: narrow focus: goal, proof, steps

Bear Grylls’ Morning Routine: cold exposure (you never get used to it), barefoot, strength training, 30 minutes

RoCE networks for distributed AI training at scale: I have managed to read the paper! Although two years is an eternity in the AI world, I think it is still interesting.

1) Network Topology: backend network only for GPUs (RDMA NICs), non-blocking. Frontend network: data ingestion, checkpointing, logging.

Pod = AI zone
 leaf = RTSW, DAC cables, shallow buffer
 spine = CTSW, deep buffers. fiber between leaf-spine.
SuperSpine = ATSW, oversubscribed, connect AI zones

intra-node -> NVLink
RoCE: CPU offload, standard Ethernet

collective communication library: the SW abstraction between training workloads and the NIC;
                                  it schedules verbs calls over QPs (Queue Pairs)
 parallelism strategy determines the collective: allreduce, allgather, alltoall
 choice of logical topology:

------------------

2) Routing: the workload has low-entropy flows (few flows) -> ECMP works badly (it hashes the 5-tuple: src/dst IP, src/dst UDP port, protocol), plus burstiness and elephant flows
--

 RTSW uplinks 1:2 under-subscribed! -> expensive (short-term)
 1) QP scaling: hash on the destination QP of the RoCE packet, using the UDF capability in the switch, to increase entropy -> Enhanced ECMP -> short-term
 2) Central TE controller -> long-term
      CP: real-time end-to-end topology of the cluster,
          flow matrix (flow bps) + CSPF (constrained SPF),
          writes the result into the switches' dataplane
      DP: TE overrides the default BGP routing policy in the leaf. Uses an Exact Match table.
      Not good with multiple link failures. Doesn't scale.
 3) Flowlet switching: tries to improve on 1 and 2. Hardware-assisted scheme that puts packets on different ports within the ECMP group
      out-of-order: move packets only after 1/2 RTT
      load-aware path assignment: better than TE
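
A minimal sketch of why low-entropy traffic hurts ECMP and why adding the destination QP to the hash helps. Everything here is an assumption for illustration: CRC32 stands in for the switch hash (real ASICs use their own), the 8 uplinks, the addresses and the QP numbers are made up, and the UDP source port is assumed to be the same for all QPs.

```python
import zlib
from collections import Counter

N_UPLINKS = 8  # hypothetical number of ECMP uplinks on the leaf


def ecmp_port(fields, n_ports=N_UPLINKS):
    """Pick an uplink by hashing header fields (CRC32 stands in for the ASIC hash)."""
    key = "|".join(str(f) for f in fields).encode()
    return zlib.crc32(key) % n_ports


def spread(five_tuple, qps, use_qp):
    """Count packets per uplink for one GPU-to-GPU flow carried over several QPs."""
    counts = Counter()
    for qp in qps:
        # With use_qp=True the destination QP is appended to the hash input,
        # which is the idea behind the UDF-based "Enhanced ECMP".
        fields = five_tuple + ((qp,) if use_qp else ())
        counts[ecmp_port(fields)] += 1
    return counts


# src IP, dst IP, src port, dst port (4791 = RoCEv2), protocol (17 = UDP)
flow = ("10.0.0.1", "10.0.1.1", 49152, 4791, 17)
qps = range(100, 116)  # 16 made-up destination QP numbers

plain = spread(flow, qps, use_qp=False)
enhanced = spread(flow, qps, use_qp=True)
print("5-tuple only :", dict(plain))
print("5-tuple + QP :", dict(enhanced))
```

With the 5-tuple alone every packet of the flow lands on the same uplink; once the QP is part of the hash input the same traffic is spread over several uplinks.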

------------------
                     
3) Transport: congestion management. Started with DCQCN. Packet drops on ACK/NACK can cause a prolonged Local ACK timeout (LAT)
--
 Tuning DCQCN was not great (strict ECN -> minimizes PFC, which can lead to head-of-line blocking)

 200G: we stayed with relaxed ECN marking, allowing buffer build-up in the CTSW, while keeping default DCQCN settings.
 400G: we proceeded without DCQCN, just PFC for flow control.
 Re-designed collective library: two-stage copy
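
To make "strict" versus "relaxed" ECN marking concrete, here is a small sketch of the RED-style marking curve that DCQCN relies on at the switch: no marking below a minimum queue depth, a linear ramp up to `pmax` at the maximum threshold, and marking everything above it. The threshold values are invented for illustration and are not the ones used in the paper.

```python
def ecn_mark_probability(queue_kb, kmin_kb, kmax_kb, pmax):
    """RED-style ECN marking probability for a given egress queue depth."""
    if queue_kb <= kmin_kb:
        return 0.0  # queue is short: never mark
    if queue_kb > kmax_kb:
        return 1.0  # queue is too long: mark every packet
    # Linear ramp from 0 at kmin to pmax at kmax
    return pmax * (queue_kb - kmin_kb) / (kmax_kb - kmin_kb)


# Hypothetical profiles: strict marks early, relaxed lets the buffer build up first
STRICT = dict(kmin_kb=50, kmax_kb=200, pmax=0.2)
RELAXED = dict(kmin_kb=500, kmax_kb=2000, pmax=0.2)

for q in (100, 400, 1000, 2500):
    strict_p = ecn_mark_probability(q, **STRICT)
    relaxed_p = ecn_mark_probability(q, **RELAXED)
    print(f"queue={q} KB strict={strict_p:.3f} relaxed={relaxed_p:.3f}")
```

The relaxed profile starts marking much later, which is the "allowing buffer build-up" trade-off noted for 200G.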

------------------

4) Operations:
 Change the QoS priority of Clear to Send (CTS) messages. In the RTSW ASIC, modify the DSCP marking for ACK messages
 Tuning VOQ in CTSW
 observability: OOS: out-of-sequence packets
                Link flaps
                Local ACK timeouts (LAT)
                PFC watchdog: catch any long-duration PFC pause (>200ms)
                buffer utilization RTSW
                reachability (pings)
                constant latency monitoring, loaded and unloaded (catch regressions)
                baselines!!!
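
The PFC watchdog idea can be sketched in a few lines: take pause intervals per port and flag anything longer than 200 ms. The event format `(port, start_ms, end_ms)` and the port names are hypothetical; a real watchdog runs in the switch and works from ASIC counters.

```python
PFC_WATCHDOG_MS = 200  # threshold from the notes above


def long_pauses(events, threshold_ms=PFC_WATCHDOG_MS):
    """Return (port, duration_ms) for every pause strictly longer than the threshold."""
    return [
        (port, end_ms - start_ms)
        for port, start_ms, end_ms in events
        if end_ms - start_ms > threshold_ms
    ]


# Made-up pause events: (port, start_ms, end_ms)
events = [
    ("eth1/1", 0, 35),     # short pause, normal back-pressure
    ("eth1/2", 100, 450),  # 350 ms, should be flagged
    ("eth1/3", 500, 700),  # exactly 200 ms, not above the threshold
]

stuck = long_pauses(events)
print(stuck)
```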

Perplexity: Hosting Qwen on Blackwell

AWS RNG – Random Graph Network: The paper is totally outside my area, but the concept looks brutal. With an operations hat on, how do you troubleshoot it? (ping, traceroute, link congestion, data flow patterns, etc.)

Slurm: I like the “Slurm vs. Kubernetes” comparison.

Slurm Workload Manager (short for Simple Linux Utility for Resource Management) has become a cornerstone of large-scale computing. Originally created in the early 2000s to support large-scale high-performance computing (HPC) environments, Slurm is now widely recognized as the de facto scheduler for HPC clusters. Today, it orchestrates jobs across thousands of servers and GPUs in some of the world’s most advanced computing environments. 

Interview Question: 512 GPUs, non-blocking (full bisection) and 2xUFM! I really liked this. I think for once I understand the rail-optimized design (fat-tree = leaf-spine). Just break one leaf-spine link, beautiful!!!
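
A back-of-the-envelope sketch of the sizing behind such a question. The assumptions are mine, not from the interview: 64-port switches, 8 GPUs per node with one NIC per GPU, a two-tier leaf-spine where each leaf splits its ports half down and half up, and rail-optimized cabling where GPU k of every node attaches to rail k.

```python
def two_tier_non_blocking(n_gpus, switch_ports, gpus_per_node=8):
    """Size a two-tier non-blocking leaf-spine fabric for n_gpus (one NIC per GPU)."""
    down = switch_ports // 2            # leaf ports facing GPUs
    up = switch_ports - down            # leaf ports facing spines (1:1 with downlinks)
    leaves = -(-n_gpus // down)         # ceiling division
    spines = -(-(leaves * up) // switch_ports)
    nodes = n_gpus // gpus_per_node
    return {
        "nodes": nodes,
        "rails": gpus_per_node,                   # one rail per GPU index
        "leaves_per_rail": -(-nodes // down),
        "leaves": leaves,
        "spines": spines,
        "links_per_leaf_spine_pair": up // spines,
        # Losing one uplink leaves `down` GPU ports sharing `up - 1` uplinks
        "oversub_after_one_uplink_failure": down / (up - 1),
    }


design = two_tier_non_blocking(n_gpus=512, switch_ports=64)
for key, value in design.items():
    print(f"{key}: {value}")
```

Under these assumptions the fabric stops being strictly non-blocking for a leaf as soon as one of its leaf-spine links breaks, since 32 GPU ports then share 31 uplinks.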

800VDC: Next step in electrical infra in DC space.

Phyllo by hand

Approaching women: curiosity, not performance. Practice. Be at peace with discomfort and awkwardness. Rejection as learning.

Genghis Khan

Very interesting book. In the Western world we know a lot about the Roman Empire, Alexander the Great, etc., but we don't look to Asia very often. Yet Genghis Khan and the Mongol empires that followed shaped much of world society at the time, and their influence lasts until now.

He was focused on meritocracy. Part of his war strategy was eliminating the aristocracy of the conquered lands. There was a very strong focus on integration: the Mongols never imposed their culture and allowed full freedom of religious belief. They were brutal in war but never cruel. Torture was common in Europe and in other empires; for the Mongols it was against their beliefs.

They had a very clear war strategy: travel light, strike fast. They carried little luggage and kept a basic diet. They cleared the path so their horses could advance and return, which meant destroying agriculture along their routes of conquest.

It is interesting how Genghis Khan's empire crumbled after his death because he didn't manage his family properly. But still, the new kingdoms kept a balance for a long time.

And something that reminded me of Rome: they had to keep expanding the empire just to keep the capital happy… They introduced paper money, and women ruled while the men were fighting… and their campaigns lasted years!

Trade was critical for the Mongols. They reached Hungary and the Balkans. They traded slaves with Venice and Genoa.

The climate was critical for their success. When the weather became warmer, their pastures in Mongolia were less productive and they had fewer horses, so the base of their strength was undermined.

They were masters of propaganda, spreading fear so that conquering was easier. The empire was based on a good army, good propaganda and good administration (just think of the sheer size of the empire). They founded public education.

The Mongols unified China. I didn't know that. They founded Beijing and started the Forbidden City. They created the Chinese identity, but they followed Mongol customs behind the curtains.

And the end of it was the Plague. The Plague stopped commerce and the movement of people. Without the fluid transit of people and goods, they couldn't keep the empire together.

And it is really shocking how bad a reputation the Mongols have been given in writing, considering their incredible empire and success.