AI Labs Power, Colossus2, AWS Reinvent 2025, ASML, Power Laws, DeepMind, HotChips, Tax strategy, Groq, Polyglot

AI Labs Powers: Interesting articule for solutions to get power for AI infra

Colossus2: The tricks to get power between states… and cost.

Amazon Leo Architecture + Internet Edge ARC320: I didn’t know AWS was going to compete with Starlink… but having JB’s BlueHorizon, I guess it makes sense.

AWS Network Infra 2025 NET402: From minute 29 is the most interesting for me. Precast fiber duct banks, TE pre-signalled bypass IP tunnels for each path (without RSVP-TE?) and constant recalculation. UltraCluster3, connector improvements (36% reduce link failures! 76% reduce time for cabling!). UltraSwitch with dynamic LB and adaptive routing (like IB and UltraEthernet)

AWS DynamoDB outage lessons learnt

Kubernetes pod networking: Good refresher

Nanog95 summary:

Veritasium: Why ASML is so critical.

Veritasium: Power Laws: I think I get it… but it is scary: forest fires, sanpiles, money, investment…

Extreme success, you can’t be a balanced person: I like this format of smaller conversations. It is difficult to find time for 1h+ videos

Google DeepMind: I like the history behind DeepMind and his founder. I didnt know they tried to beat the Go champion in China…. and disconnected.

HotChips2025:

  • Datacenter Racks part2:
    • Nice the NVIDIA presentation from a mechanical engineer.
    • Meta Catalina pod 33:39 – 4xracks for liquid cooling! for 2xIT racks. 42:39 3 networks: frontend (N-S), backend(E-W) and management/console. Leaking monitoring
    • Google TPU rack Ironwood: I need to research the 3D Torus connection. 1h15m16
  • Networking:
    • Intel IPU E220: This is a NIC although I wasn’t 100% sure until I checked in another site. You can use P4
    • AMD Pensando Pollara 400 (NIC). P4 architecture. UltraEthernet ready. 48:36 95% network utilization (intelligent LB, congestion mgmt (RTT-based), fast failover and loss recovery (select ACK)= the trinity)
    • NVIDIA ConnectX-8 SuperNIC: 1h13m multiplane – as I understand, you use more leaf switches, in the example 4 planes = 4 leaf, and you have 64x gpu scale

switch radix 512, gpu 8x100G, two-level non-blocking

standard nic 2-level non-blocking: 64×32=2k

connect8-x multiplane: 256×512=128k

Spectrum-X: network round-trip time: 5-10us, packet transmisson 2ns

demo 1h24m (with grafana!)

Tax Strategy: ex1, ex2

NVIDIA buys Groq: interesting details between SRAM vs HBM. And it seems the key is “assembly line architecture”

Polyglot: no magic tricks. I need more exposure.