cpu – T.I.L

1MW Rack, Google Global Network, BGP PIC, Cisco Quantum, Deep Wiki, Gemini Languages, Vielleicht

1MW rack: I had to ask ChatGPT regarding the relationship Power and voltage. Shame on me as a son of a electrician.

Power = Voltage × Current (P = V × I)

So +/-400 VDC is 800V and 1MW power -> we need 1250 A. That is provided by the rectifiers?

Voltage (volts) = water pressure

Current (amps) = flow rate (liters per second)

Power (watts) = total water delivered (pressure × flow)

You can have high water pressure (voltage), but if your pipes are too small (not enough amps), you can’t fill a swimming pool (power).

You don’t “send” amps to the rack — you make amps available, and the rack draws what it needs.

Google global network 2025: (video)

BGP PIC: Prefix Independent Convergence (PIC). In summary, it is calculating a backup path and having installed so when having a fault, the convergence is minimize. I think it is like a MPLS FRR LSPs but for BGP. Cisco and Juniper

Cisco Quantum: Everybody to the wagon. D-Wave was the first company that offered quantum systems several years ago

DeepWiki: free explanation of GitHub repositories

Gemini Little Language Lessons: Usefull, I wish I could connect it to netflix so I can have german/english subtitles at the same time 🙂

Music:

Vielleicht: yeah, vielleicht

Hi Ren: so raw, brutal, respect.

Other:

reMarkable Paper Pro: Jealous my kindle paperwhite is so behind.

Huawei AI Cloud, Ironwood TPUv7, do the thing, TV garden Worldwide, Hacker Laws, Daylight, NVIDIA Photonics, Xsight, Finger Strength

Huawei AI Cloud: Power hungry, all optics, etc. Interesting take from China to NVIDIA. And even more interesting, how to fence off all the tariffs and restrictions…

Google Ironwood TPUv7: “It scales up to 9,216 liquid cooled chips linked with breakthrough Inter-Chip Interconnect (ICI) networking spanning nearly 10 MW” I wonder how is the network… but doesnt give low level details, just marketing.

Do the thing.

TV garden: TVs from around the world…. just in case you want to learn a language?

Hacker Laws: So many I dont know

Daylight: Looks so nice!!! And it seems it can read kindle books. Tempting

NVIDIA Photonics: I read about co-packaged from some Sherada post’s… but I didn’t see it coming so fast in production. With my network operations hat on…. how is the troubleshooting done? It the part where the fiber breaks, you replace the whole device? I guess this has been thought very deeply.

The power consumed by optics in the network is enormous and so is the capital expense. Anecdotally, we have heard it said many times that the majority of the cost in a datacenter-scale cluster is in the optical transceivers at both ends of a link and the fiber optic cable between them. Some the pieces that link switches to network interface cards is 75 percent to 80 percent of the cost of a network, with the switches and the NICs making up the other 20 percent to 25 percent.

Xsight: Another network silicon vendor. The article mentions Tofino P4.. I hope doesn’t end that way. I didn’t know anything about Avigdor Willenz

In part, that expectation for big change comes from the fact that Avigdor Willenz is the company’s founding investor. Willenz founded Galileo Technology, a maker of Ethernet switch ASICs that sold to Marvell in 2001 for $2.7 billion, and that wealth has been spread around. Willenz invested in Annapurna Labs, which sold to Amazon Web Services in 2015 for $350 million and which has created its Nitro DPUs, Graviton CPUs, and Trainium and Inferentia AI engines. He was president (now chairman) and first investor in distributed flash block storage maker Lightbits Labs. Willenz was a co-founder of AI chip maker Habana Labs, which sold to Intel in 2019 for $2 billion and is the foundation of its Gaudi compute engine line.

Finger Strength: “I’ve never seen strength like this before” true story

TFCC wrist injury: part of life…

Shelljack, Europe RISC-V, Quantum China, 100G optics teardown, Curiosity (Going long!), SS7 hacking, Juniper Hacking

Shelljack: It is old but still interesting. At least it seems easy to implement

Europe RISC-V: Interesting report about what EU is doing about the CHIP wars and RISC-V. I guess as EU is not pouring billions like USA/China is not making to the news. It was interesting to read about the participation of Spain with UCM and the people behind openchip.

Quantum China: Another quantum chip in the mix. So far everything came from USA.

100G SR4 QSFP28: An optic teardown. There are links to other teardown like 100G QSFP28 DAC and this is more hardcore: 800G ZR+ optic.

Curiosity: This is the best definition of what curios means (and I light years from it…) Ben Jojo is a star: “Trust, but verify”

SS7 hacking: More real than I thought.

Juniper Hacking: Juniper answer. In one sense doesnt surprise me, Mikrotik is famous to feed several bootnets, so why not EOL devices from other vendors?

MCP, Manus, Brain Computer, Spectrum-X, Quantum, DC, Hung Task, Do The Work

MCP: It is “old” news news from Dec 2024 but looks like a big thing now.

Manus: new hype, but looks cool. Need to try.

Brain Computer: You have to replace the neurons….

Spectrum-X with Cisco Silicon: I dont understand this move much. You are selling your Ethernet solution is the best for AI and then you bring a different one?

Quantum Computing: Several news lately from MS Majorana (official)and AWS Ocelot. Still, is being used in real problems? Just PR?

Build your own DC: good intro, I dont think you can find many books about this in amazon?

Hung tasks in linux: nice articule for troubleshooting hung tasks in linux.

Do the work

Photonics in Computing, Usb cable hack, Stutz, Building AI Networks Arista

Lightmatter: Based on this video, they are using photonics to connect chips, looks interesting, I remember Google has something with optical but for networking. But It is pretty clear this is not photonics computing.

Hacking USB cable: impressive, and expensive 🙂

Phil Stutz: Interesting conversation. But somehow, I am still looking for that thing that unlocks me…. can’t find it for the life of me….

Building AI Networks Arista:

- allreduce: collect elements from all nodes, apply a reduction operator(eg sum) then distribute reduction to all nodes
-allgather: collect elements from all nodes, and distribute the to all other nodes
- gpu: cpu for parallelization
- RDMA: RoCE2 GPU memory to GPU memory - origin in IB
- issues: flow collision, trafic polarization. low entropy!!! > dificult to ecmp => Dynamic LB
incast: many2one -> ECN + buffering (in spine!)
- use chassis!

With an operations hat on, dealing with chassis is expensive and no efficient. It kind of a vendor lock-in. AWS is all in pizza boxes and I remember one presentation in Cisco Live where the Cisco EVPN authority recommended pizza boxes.

Potato Pizza, TCP Conversation Completeness, IBM power10, AI developer kit, 2 not 3

This is a pizza that I tried several years ago, and it seems the restaurant is out of business. I have done some pizzas trying to emulate it but never matching that memory. So this is a placeholder to try:

Original Ingredients: Bechamel, Smoked Mozzarella, red onions, pancetta, sliced potatoes, and Buffalo Mozzarella.

Some example1, example2

This a old for today’s news. IBM Power10 Memory network but looks interesting:

...thanks to the coherence IBM already has across NVLink (which is really BlueLink running a slightly different protocol that makes the GPUs think DRAM is really slow but really fat HBM2, in effect, and also makes the CPUs think the HBM2 on the GPUs is really skinny but really fast DRAM).

Checking some wireshark traces last week, I cam across the concept of TCP Conversation Completeness. This was totally new for me. This video gave some idea too. This was useful for me for finding TCP conversation that showed retransmissions when trying to stablish the TCP handshake, and not just showing the retransmission, so I used “tcp.completeness<33” so I should see TCP flows with a sync + RST.

AI developer Kit by NVIDIA: This card looks nice, I would mind to buy it and give a chance to learn, but it is sold out everywhere…. This is a video about it.

2 not 3 (make a choice!):

Quantum AI Chip, InfraHub, python UV, SR controller, Kobe Bryant, Hell

Google Quantum AI: This looks remarkable.

python vu: replacement for pip, pyenv, etc. Need to try

InfraHub: As a network engineer interested in Automation. This looks interesting and I would like to go deeper to fully understand as it is the merge of the typical source of truth (DB) that you can’t get in git.

Segment Routing Controller: This is another thing I played with some years ago, but never found a controller to make TE. I dont see clearly this software is OSS but at least doesnt look like is a vendor-lock…

Kobe Bryant: venting, and it is ok.

Jordan B Peterson: Hell

AWS re:Invent 2024, Oracle Cloud AI, GenCast, videos

AWS re:Invent 2024 – Monday Night:

Graviton evolution: ARM based chip for EC2. 50% new capacity of last 2y is Graviton.
Nitron Cards: security chip too.
AES Trainium2: min 47. 2xHead per rack and then accelerators, and switch. Trainnium != CPU|GPU. And this is a great analysis about Trainium2
Neurnlink: min 60, I guess this is the equivalent of NVLink, etc
Ultraserver, quite beefy pic, min 61.
Networking: min 73: 10p10u is a fabric = 10petabits under 10micro latency.
Cabling proprietary trunk connector 16:1 fiber. min 77. I pretty use i have used pig-tails some years ago, so not sure why this is new?
Firefly optic plug: loopback testing. This is interesting for DC operations. Min 78.
AWS design their own optics, reduced failure
Network topology: Min 81, new protocol SIDR – Scalable Intent Driven Routing. <1s reconvergence. not centralized.
And this is a better summary than mine.

AWS re:Invent 2024 – NET201: The only interesting thing is minute 29 with the usage of hollow core fiber, to improve latency. I assume it is used in very specific parts of the network, looks a bit fragile. Elastic Fabric Adapter, not really good explanation what it is, where doest it run: network, server, nic? but it seems important. Looks like SIDR?

AWS re:Invent 2024 – NET403: I think 401 and 402 were more interesting. There were repeated things from the two other talks. Still worth watching and hopefully there is a new one in 2025.

Oracle Cloud Infra – AI: First time I visit the OCI page about their AI infra.

GenCast: weather predict by Google Mind. Not sure until what point, this can be used by anybody? And how much hardware you need to run it?

we’ve made GenCast an open model and released its code and weights, as we did for our deterministic medium-range global weather forecasting model.

Videos:

510km nonstop – Ross Edgley: I have read several of his books and it is the first time I watch a full interview. Still I am not clear what his dark side is.

A man with few friends or not circle at all – Jordan B Peterson: I need to watch this more often

TPUv6, Alphafold, OOB design, OpenInterpreter, Walkie-Talkies, Zero Trust SSH, Videos, Finger Strength

Google TPUv6 Analysis: “… cloud infrastructure and which also is being tuned up by Google and Nvidia to run Google’s preferred JAX framework (written in Python) and its XLA cross-platform compiler, which speaks both TPU and GPU fluently.” So I guess this is a cross-compiler for CUDA?

“The A3 Ultra instances will be coming out “later this year,” and they will include Google’s own “Titanium” offload engine paired with Nvidia ConnectX-7 SmartNICs, which will have 3.2 Tb/sec of bandwidth interconnecting GPUs in the cluster using Google’s switching tweaks to RoCE Ethernet.” So again custom ethernet tweaks for RoCE, I hope it makes to the UEC? Not sure I understand having a Titanium offload and a connectx-7, are they not the same?

Alphafold: It is open to be used. Haven’t read properly the license.

OOB Design:

Open Interpreter: The next step in LLMs is to control/interact with your system.

In my laptop fails because I have the free version 🙁 need to try a different one, but looks promising!

open-interpreter main$ interpreter --model gpt-3.5-turbo

●                                                                                                                                                                                                                                                                  

Welcome to Open Interpreter.                                                                                                                                                                                                                                       

───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

▌ OpenAI API key not found                                                                                                                                                                                                                                       

To use gpt-4o (recommended) please provide an OpenAI API key.                                                                                                                                                                                                      

To use another language model, run interpreter --local or consult the documentation at docs.openinterpreter.com.                                                                                                                                                   

───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

OpenAI API key: ********************************************************************************************************************************************************************


Tip: To save this key for later, run one of the following and then restart your terminal.                                                                                                                                                                          
MacOS: echo 'export OPENAI_API_KEY=your_api_key' >> ~/.zshrc                                                                                                                                                                                                       
Linux: echo 'export OPENAI_API_KEY=your_api_key' >> ~/.bashrc                                                                                                                                                                                                      
Windows: setx OPENAI_API_KEY your_api_key                                                                                                                                                                                                                          

───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

▌ Model set to gpt-3.5-turbo                                                                                                                                                                                                                                     

Open Interpreter will require approval before running code.                                                                                                                                                                                                        

Use interpreter -y to bypass this.                                                                                                                                                                                                                                 

Press CTRL-C to exit.                                                                                                                                                                                                                                              

> what is my os?
Traceback (most recent call last):

Walkie-Talkies: Out of James Bond world.

Zero Trust SSH. From Cloudflare. And this video I watched some months ago (and it is already 4y).

Finger Strength: I follow similar protocol, although not everyday, for warm up and I think it works. I am not getting that super results but at least my fingers are stronger…. and I am not getting injuries!!!! \o/

Cisco AI/ML DC Infra Challenges: I am not quiet fan of Cisco products but this is a good overview.

Key points:

Create different networks (inter-GPU, front-end, storage, mgmt),
Inter-GPU:
- – non-blocking, rails-optimized (fig.3)
Inter-GPU challenges:
– Packet loss: Use PFC +ECN (flow aware)
– Network delay: “Rich” QoS – proprietary QoS to handle mice flows. Needs good telemetry
– Network congestion: Some kind of communication switch-NIC
– Non-uniform utilization: Most vendors have something proprietary here, some dynamic LB and static-pinning?
– Simultaneous Elephant flows with large bursts: dynamic buffer protection (proprietary)

Videos:

Raoul Pal: Crypto Investment. His company. Go long run, invest a bit you can lose
Scott Galloway: Interesting his political analysis. Trump won and it seems Latins voted massively for him.
Bruce Dickinson: I read Bruce’s books some years ago so I was surprised to see him in a podcast. Need to finish it.
Eric Schmidt: I read one of his books some time ago so again, surprised to find him in a podcast. Still think Google has become evil and most of the good things he says are gone.
Javier Milei: I am not economist but it “seems” things are improving in Argentina. He is a character nonetheless. Need to finish it.
Matthew McConaughey: His book was really refreshing, and seeing him talking is the same. Raw, real.
Alex Honnold: You have to try hard if you want to do hard things.

SemiAnalysis – 100k cluster

This is site that a friend shared with me some months ago. And it is PURE gold from my point of view. They share a lot info free but not all, you have to subscribe/pay for the whole report. I would pay for it if my job were in that “business”

This is the link for a 100k GPU cluster.

It covers all details for building such infrastructure up to the network/hardware side. So from power distribution, cooling, racking, network design, etc. All is there.

It is something to read slowly to try to digest all the info.

This report for electrical systems (p1) shows the power facilities can be as big as the datacenter itself! So it is not rear to read hyperscalers want nuclear reactors.