I have been quite surprised by this book. It is based on the psychology work of Alfred Adler, and I take it more as philosophy than anything else.
It is based on teleology: the study of the purpose of a given phenomenon, rather than its cause (that is aetiology). We determine our own lives according to the meaning we give to past experiences, so it negates the influence of the past and of traumas. The important thing is not what one is born with, but what use one makes of that equipment. We need the courage to be happy, because happiness requires change (of lifestyle), and change is scary. This also makes you focus on the present.
All problems are interpersonal relationship problems. Personally, I feel my goal is not to be hurt in relationships with other people. But it is impossible not to get hurt (or hurt somebody).
Feelings of inferiority are subjective assumptions (and excuses), and those we can change. Boasting is an inverted feeling of inferiority. If one really has confidence, one doesn't need to boast.
Life is not a competition (winner vs loser) and this applies to relationships too. This makes you see people as comrades. Only compare yourself with your ideal self. You are the only one worrying about your appearance. When there is competition, there is a power struggle. Step out of the conflict as soon as possible, don't answer the action with a reaction (this is not admitting defeat though), because this escalates into revenge.
Our objectives are: to be self-reliant (I have the ability) and to live in harmony with society (people are my comrades). Our life tasks are: tasks of work, friendship and love.
Life lie = I am making up flaws in other people just so that I can avoid my life tasks, and more, I can avoid interpersonal relationships -> courage.
As we have our tasks, others have their tasks. As we focus on our tasks, looking for recognition is debilitating: it creates a dependency, a vertical relationship (you want horizontal relationships). Do not live to satisfy the expectations of others. This means freedom. Freedom to be disliked by other people. In the same way, you don't have to praise or rebuke. Saying “Thank you” is good enough.
This leads to the goal of interpersonal relationships, that is the feeling of community. This is acquired by your own efforts, active commitment.
1) Network Topology: backend network only for GPUs (RDMA NICs), non-blocking. Frontend network: data ingestion, checkpointing, logging.
Pod = AI zone
leaf = RTSW, DAC cables, shallow buffer
spine = CTSW, deep buffers. fiber between leaf-spine.
SuperSpine = ATSW, oversubscribed, connect AI zones
intra-node -> NVLink
RoCE: CPU offloading, Ethernet (standard)
collective communication library serves as the sw abstraction between training workloads and the NIC
schedules verbs calls over QP (Queue Pairs)
parallelism strategy determines collective: allreduce, allgather, alltoall
choice logical topology:
------------------
2) Routing: workload. low entropy flows (few flows) -> ECMP bad (5-tuple UDP: src/dst IP, src/dst port, protocol), burstiness, elephant flows
--
RTSW uplinks 1:2 under-subscribed! -> expensive (short-term)
1) QP scaling: hash on the destination QP of the RoCE packet, using the UDF capability in the switch, to increase entropy -> Enhanced ECMP -> short-term
2) Central TE controller -> long-term: CP real-time topology end-to-end cluster,
flow matrix (flow bps) + CSPF (constrained SPF)
write in switches dataplane
DP: TE overrides default BGP routing policy in leaf. Use Exact Match table.
Not good with multiple link failures. Doesn't scale
3) Flowlet switching: try to improve 1 and 2. hardware-assisted scheme. put packets on different ports in ECMP
out-of-order: move packets only after 1/2 RTT
load-aware path assignment: better than TE
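
The entropy problem and the QP scaling idea above can be illustrated with a toy hash. This is only a sketch: the SHA-256 hash stands in for whatever the switch ASIC really does, the field names and addresses are made up, and it assumes every queue pair between the two hosts shares the same UDP source port, so the plain 5-tuple is identical for all of them.

```python
import hashlib
from collections import Counter

def ecmp_port(fields, num_uplinks):
    """Pick an uplink by hashing header fields (toy stand-in for a switch hash)."""
    key = "|".join(str(f) for f in fields).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_uplinks

def spread(flows, num_uplinks, use_qp):
    """Count how many flows land on each uplink, with or without the QP in the hash."""
    counts = Counter()
    for flow in flows:
        fields = (flow["src_ip"], flow["dst_ip"], flow["src_port"],
                  flow["dst_port"], flow["proto"])
        if use_qp:
            fields += (flow["dst_qp"],)  # extra entropy, as in QP scaling
        counts[ecmp_port(fields, num_uplinks)] += 1
    return counts

# Hypothetical example: one GPU pair, 16 queue pairs, identical 5-tuple.
# RoCEv2 uses UDP (protocol 17) destination port 4791.
flows = [
    dict(src_ip="10.0.0.1", dst_ip="10.0.1.1", src_port=49152,
         dst_port=4791, proto=17, dst_qp=qp)
    for qp in range(16)
]

print("5-tuple only:", dict(spread(flows, 8, use_qp=False)))
print("5-tuple + QP:", dict(spread(flows, 8, use_qp=True)))
```

With the 5-tuple alone, all 16 flows collapse onto one uplink. Adding the destination QP lets the hash tell them apart, although a static hash can still place them unevenly.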
------------------
3) Transport: congestion management. Start with DCQCN. packet drops on ACK/NACK can cause prolonged Local ACK timeout (LAT)
--
Tuning DCQCN not great (strict ECN -> minimize PFC, which can lead to head-of-line blocking)
200G: we stayed with relaxed ECN marking, allowing for buffer build up in the CTSW, while keeping default DCQCN settings.
400G: we proceeded without DCQCN. just PFC for flow control
re-design collective library: two-stage copy
------------------
4) Operations:
Change QoS priority of Clear to Send (CTS) messages. In RTSW ASIC, modify DSCP marking for ACK messages
Tuning VOQ in CTSW
observability: OOS: out of sequence.
Link flaps
Local ACK timeouts (LAT)
PFC watchdog: catch any long-duration PFC pause (>200ms)
buffer utilization RTSW
reachability (pings)
constant latency monitoring, loaded and unloaded (catch regressions)
baselines!!!
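
A minimal sketch of the PFC watchdog check from the list above, run over exported telemetry rather than inside the switch. The `PauseEvent` record and its fields are hypothetical; only the 200 ms threshold comes from the notes.

```python
from dataclasses import dataclass

PFC_WATCHDOG_MS = 200  # threshold taken from the notes above

@dataclass
class PauseEvent:
    """Hypothetical telemetry record: one PFC pause interval on a port."""
    port: str
    start_ms: int
    end_ms: int

def long_pauses(events, threshold_ms=PFC_WATCHDOG_MS):
    """Return the pause events that lasted longer than the threshold."""
    return [e for e in events if e.end_ms - e.start_ms > threshold_ms]

# Made-up sample data: only the second pause exceeds 200 ms.
events = [
    PauseEvent("rtsw1:eth1", 0, 50),
    PauseEvent("rtsw1:eth2", 100, 450),
    PauseEvent("ctsw3:eth9", 900, 1100),
]
for e in long_pauses(events):
    print(f"{e.port}: paused {e.end_ms - e.start_ms} ms")
```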
AWS RNG – Random Graph Network: The paper is totally out of my space, but the concept looks brutal. With an operations hat on, how would you troubleshoot it? (ping, traceroute, link congestion, data flow patterns, etc)
MRC1 and MRC2 (OCI): Why we need planes (breakouts) and not just a big plane.
As SerDes speeds continue increasing, every microsecond of congestion creates much larger pressure inside the fabric. A 100G transport domain may be manageable. A 400G domain amplifies the same congestion into roughly 4x pressure. An 800G domain, and eventually a 1.6T domain, becomes much harder to coordinate.
This pressure appears as larger switch buffer requirements, larger congestion domains, harder retransmission coordination, larger cache pressure, larger synchronization storms, and harder thermal and power scaling inside ASICs.
At hyperscale, switch ASIC cache and transport coordination become fundamental scaling bottlenecks. Increasing switch buffer size is extremely difficult: high-speed SRAM is expensive, larger cache arrays consume significant power, thermal density rises quickly, die area scaling becomes inefficient, and routing complexity increases dramatically.
Splitting transport into many smaller lanes naturally reduces these pressures. Reliability improvements then emerge as a byproduct, because congestion, retransmission, and buffering become more distributed.
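
To put a number on the "roughly 4x pressure" point, here is a quick back-of-the-envelope calculation of how many bytes one link delivers during a single microsecond of congestion. It assumes line rate and ignores framing overhead.

```python
def bytes_per_microsecond(link_gbps):
    """Bytes arriving on one link in 1 microsecond at line rate.

    1 Gb/s for 1 microsecond is 1000 bits, so this is link_gbps * 1000 / 8.
    """
    return link_gbps * 1000 // 8

for speed in (100, 400, 800, 1600):
    kb = bytes_per_microsecond(speed) / 1000
    print(f"{speed}G: {kb:.1f} KB to absorb per microsecond of congestion")
```

At 100G that is 12.5 KB per microsecond, and it scales linearly with link speed, so every step up in speed asks the same buffer to absorb proportionally more.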
THE QUESTION: which breakout keeps the fabric at the shallowest practical Clos depth while keeping plane count and operations manageable? -> fewer hops, fewer switches, less latency
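
One way to reason about that question is to tabulate the trade-off. The sketch below assumes a 51.2 Tb/s switch ASIC and an 800G NIC per GPU (both are my assumptions, not from the papers) and uses the standard non-blocking folded Clos bound of 2 * (radix / 2)^tiers endpoints.

```python
ASIC_GBPS = 51200  # assumed switch ASIC capacity (51.2 Tb/s)
NIC_GBPS = 800     # assumed NIC bandwidth per GPU

def max_endpoints(radix, tiers):
    """Maximum endpoints of a non-blocking folded Clos with the given switch radix."""
    return 2 * (radix // 2) ** tiers

def breakout_options(asic_gbps, nic_gbps, port_speeds):
    """For each port speed: (speed, switch radix, planes needed, max endpoints in 2 tiers)."""
    rows = []
    for speed in port_speeds:
        radix = asic_gbps // speed   # more, slower ports per switch
        planes = nic_gbps // speed   # planes needed to keep full NIC bandwidth
        rows.append((speed, radix, planes, max_endpoints(radix, 2)))
    return rows

for speed, radix, planes, endpoints in breakout_options(
        ASIC_GBPS, NIC_GBPS, [800, 400, 200, 100]):
    print(f"{speed}G ports: radix {radix}, {planes} plane(s), "
          f"up to {endpoints} GPUs in 2 tiers")
```

Under these assumptions a single 800G plane tops out at 2,048 GPUs in two tiers, while eight 100G planes reach 131,072. The price is eight fabrics to cable and operate.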
Slurm Workload Manager (short for Simple Linux Utility for Resource Management) has become a cornerstone of large-scale computing. Originally created in the early 2000s to support large-scale high-performance computing (HPC) environments, Slurm is now widely recognized as the de facto scheduler for HPC clusters. Today, it orchestrates jobs across thousands of servers and GPUs in some of the world’s most advanced computing environments.
Interview Question: 512 GPUs, non-blocking (full bisection) and 2xUFM! I really liked this. I think for once I understand rail-optimized (fat-tree = leaf-spine). Just break one leaf-spine link, beautiful!!!
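
A rough sizing sketch for that question. The 64-port switch radix is my assumption (the question as noted here does not give one), and the fabric is a plain two-tier leaf-spine with half of each leaf's ports facing GPUs and half facing spines.

```python
def size_two_tier(num_gpus, radix):
    """Size a non-blocking two-tier leaf-spine fabric for num_gpus endpoints."""
    down = radix // 2                     # leaf ports facing GPUs
    up = radix - down                     # leaf ports facing spines (1:1)
    leaves = -(-num_gpus // down)         # ceiling division
    spines = -(-(leaves * up) // radix)   # spine ports needed / ports per spine
    return {"leaves": leaves, "spines": spines, "leaf_spine_links": leaves * up}

def oversubscription(down_ports, working_uplinks):
    """Downlink to uplink ratio of one leaf; 1.0 means non-blocking."""
    return down_ports / working_uplinks

print(size_two_tier(512, 64))
print("healthy leaf:", oversubscription(32, 32))
print("one uplink broken:", round(oversubscription(32, 31), 3))
```

With these numbers the fabric needs 16 leaves and 8 spines. Breaking a single leaf-spine link pushes that leaf above 1:1, which is why the "break one link" follow-up is such a nice test of whether the design is understood.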
800VDC: Next step in electrical infra in DC space.
Very interesting book. In the Western world we know a lot about the Roman Empire, Alexander the Great, etc. But we don't look very often to Asia. And Genghis Khan and the Mongol empires that followed shaped much of world society at the time and up until now.
He was focused on meritocracy. Part of his war strategy was the elimination of the aristocracy of the conquered land. Very strong focus on integration. They never imposed their culture, and they allowed full freedom of religious belief. They were brutal in war but never cruel. Torture was common in Europe and other empires; for them, it was against their beliefs.
They had a very clear war strategy: travel light, strike fast. They carried little luggage and kept a basic diet. They cleared the path for their horses for advancing and returning, so they destroyed agriculture along their conquering paths.
It is interesting how Genghis Khan's empire crumbled after his death because he didn't manage his family properly. But still, the new kingdoms kept a balance for a long time.
And something that reminded me of Rome: they had to keep expanding the empire just to keep the capital happy... They introduced paper money, and women ruled while the men were fighting... and their campaigns lasted years!
Trade was critical for the Mongols. They reached Hungary and the Balkans. They traded slaves with Venice and Genoa.
The climate was critical for their success: when the weather became warmer, their pastures in Mongolia were less productive, they had fewer horses, and so the base of their strength was undermined.
They were masters of propaganda, spreading fear so that conquering was easier. The empire was based on a good army, good propaganda and good administration (just think of the sheer size of the empire). They founded public education.
The Mongols unified China. I didn't know that: they founded Beijing and started the Forbidden City. They created the Chinese identity but they followed Mongol customs behind the curtains.
And the end of it was the Plague. The Plague stopped commerce and people. Without the fluid transit of people and goods, they couldn't keep it together.
And it is really shocking how bad a reputation the Mongols have been given in writing, considering their incredible empire and success.