Potato Pizza, TCP Conversation Completeness, IBM Power10, AI developer kit, 2 not 3

This is a pizza that I tried several years ago, and it seems the restaurant is out of business. I have made several pizzas trying to emulate it, but none matched that memory. So this is a placeholder to try:

Original Ingredients: Bechamel, Smoked Mozzarella, red onions, pancetta, sliced potatoes, and Buffalo Mozzarella.

Some examples: example1, example2


This is old news by today's standards, the IBM Power10 memory network, but it looks interesting:

...thanks to the coherence IBM already has across NVLink (which is really BlueLink running a slightly different protocol that makes the GPUs think DRAM is really slow but really fat HBM2, in effect, and also makes the CPUs think the HBM2 on the GPUs is really skinny but really fast DRAM). 

Checking some Wireshark traces last week, I came across the concept of TCP Conversation Completeness, which was totally new for me. This video gave some idea too. It was useful for finding TCP conversations that showed retransmissions while trying to establish the TCP handshake, instead of just showing the retransmissions, so I used “tcp.completeness<33” expecting to see TCP flows with a SYN + RST.
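Since I had to look up what the number means: Wireshark builds the completeness score as a bitmask of the events it saw in the conversation (SYN=1, SYN-ACK=2, ACK=4, DATA=8, FIN=16, RST=32), so a flow that only saw a SYN and an RST scores 1 + 32 = 33. A minimal sketch of the same idea from the CLI with tshark, assuming a capture file named handshake.pcap:

# Flows scoring exactly SYN + RST (1 + 32 = 33):
tshark -r handshake.pcap -Y "tcp.completeness == 33" -T fields -e tcp.stream -e ip.src -e ip.dst -e tcp.completeness

# The filter I used in the GUI, applied from the CLI:
tshark -r handshake.pcap -Y "tcp.completeness < 33" -T fields -e tcp.stream -e tcp.completeness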


AI developer kit by NVIDIA: This card looks nice, and I wouldn't mind buying it and giving it a chance to learn, but it is sold out everywhere… This is a video about it.


2 not 3 (make a choice!):

Quantum AI Chip, InfraHub, python UV, SR controller, Kobe Bryant, Hell

Google Quantum AI: This looks remarkable.

python uv: a replacement for pip, pyenv, etc. I need to try it.
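A rough sketch of the workflow I want to try, based on the uv documentation (the package and script names are just examples):

# Install uv itself (standalone installer from the docs):
curl -LsSf https://astral.sh/uv/install.sh | sh
# Manage Python versions (the pyenv part):
uv python install 3.12
# Create a virtualenv and install packages through the pip-compatible interface:
uv venv
uv pip install requests
# Run a script inside the project environment:
uv run python script.py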

InfraHub: As a network engineer interested in automation, this looks interesting, and I would like to go deeper to fully understand it, since it merges the typical source of truth (a database) with the versioning you can't get from git alone.

Segment Routing controller: This is another thing I played with some years ago, but I never found a controller to do TE. It is not clear to me whether this software is OSS, but at least it doesn't look like vendor lock-in…

Kobe Bryant: venting, and it is ok.

Jordan B Peterson: Hell

AWS re:Invent 2024, Oracle Cloud AI, GenCast, videos

AWS re:Invent 2024 – Monday Night:

  • Graviton evolution: ARM-based chip for EC2. 50% of the new capacity in the last 2 years is Graviton.
  • Nitro cards: security chip too.
  • AWS Trainium2: min 47. 2x head nodes per rack, then accelerators and a switch. Trainium != CPU|GPU. And this is a great analysis of Trainium2.
  • NeuronLink: min 60. I guess this is the equivalent of NVLink, etc.
  • UltraServer: quite a beefy pic, min 61.
  • Networking: min 73. 10p10u is a fabric = 10 petabits under 10 microseconds of latency.
  • Cabling: proprietary trunk connector, 16:1 fiber, min 77. I am pretty sure I have used pig-tails some years ago, so not sure why this is new?
  • Firefly optic plug: loopback testing. This is interesting for DC operations. Min 78.
  • AWS designs their own optics, with reduced failures.
  • Network topology: min 81. New protocol SIDR (Scalable Intent Driven Routing), <1s reconvergence, not centralized.
  • And this is a better summary than mine.

AWS re:Invent 2024 – NET201: The only interesting thing is minute 29, with the use of hollow-core fiber to improve latency. I assume it is used in very specific parts of the network; it looks a bit fragile. Elastic Fabric Adapter: not a really good explanation of what it is or where it runs (network, server, NIC?), but it seems important. Looks like SIDR?

AWS re:Invent 2024 – NET403: I think 401 and 402 were more interesting. There were repeated things from the two other talks. Still worth watching and hopefully there is a new one in 2025.

Oracle Cloud Infra – AI: First time I have visited the OCI page about their AI infra.

GenCast: weather prediction by Google DeepMind. Not sure to what extent this can be used by anybody, and how much hardware do you need to run it?

we’ve made GenCast an open model and released its code and weights, as we did for our deterministic medium-range global weather forecasting model.

Videos:

510km nonstop – Ross Edgley: I have read several of his books and this is the first time I have watched a full interview. I am still not clear what his dark side is.

A man with few friends or no circle at all – Jordan B Peterson: I need to watch this more often.

TPUv6, Alphafold, OOB design, OpenInterpreter, Walkie-Talkies, Zero Trust SSH, Videos, Finger Strength

Google TPUv6 Analysis: “… cloud infrastructure and which also is being tuned up by Google and Nvidia to run Google’s preferred JAX framework (written in Python) and its XLA cross-platform compiler, which speaks both TPU and GPU fluently.” So I guess this is a cross-compiler for CUDA?

“The A3 Ultra instances will be coming out “later this year,” and they will include Google’s own “Titanium” offload engine paired with Nvidia ConnectX-7 SmartNICs, which will have 3.2 Tb/sec of bandwidth interconnecting GPUs in the cluster using Google’s switching tweaks to RoCE Ethernet.” So again custom Ethernet tweaks for RoCE; I hope it makes it into the UEC? I am not sure I understand having a Titanium offload and a ConnectX-7, aren't they the same?

AlphaFold: It is open to be used. I haven't properly read the license.

OOB Design:

Open Interpreter: The next step in LLMs is to control/interact with your system.

On my laptop it fails because I have the free version 🙁 I need to try a different one, but it looks promising!

open-interpreter main$ interpreter --model gpt-3.5-turbo



Welcome to Open Interpreter.

──────────────────────────────────────────────────────────────────────

▌ OpenAI API key not found

To use gpt-4o (recommended) please provide an OpenAI API key.

To use another language model, run interpreter --local or consult the documentation at docs.openinterpreter.com.

──────────────────────────────────────────────────────────────────────

OpenAI API key: ********************************************************************************************************************************************************************


Tip: To save this key for later, run one of the following and then restart your terminal.
MacOS: echo 'export OPENAI_API_KEY=your_api_key' >> ~/.zshrc
Linux: echo 'export OPENAI_API_KEY=your_api_key' >> ~/.bashrc
Windows: setx OPENAI_API_KEY your_api_key

──────────────────────────────────────────────────────────────────────

▌ Model set to gpt-3.5-turbo

Open Interpreter will require approval before running code.

Use interpreter -y to bypass this.

Press CTRL-C to exit.

> what is my os?
Traceback (most recent call last):

Walkie-Talkies: Straight out of the James Bond world.

Zero Trust SSH, from Cloudflare. And this video I watched some months ago (it is already 4 years old).

Finger strength: I follow a similar protocol, although not every day, for warming up, and I think it works. I am not getting those super results, but at least my fingers are stronger… and I am not getting injured!!!! \o/

Cisco AI/ML DC Infra Challenges: I am not a big fan of Cisco products, but this is a good overview.

Key points:

  • Create different networks (inter-GPU, front-end, storage, mgmt).
  • Inter-GPU: non-blocking, rails-optimized (fig. 3).
  • Inter-GPU challenges:
    • Packet loss: use PFC + ECN (flow aware).
    • Network delay: “rich” QoS – proprietary QoS to handle mice flows; needs good telemetry.
    • Network congestion: some kind of switch-NIC communication.
    • Non-uniform utilization: most vendors have something proprietary here, some dynamic LB and static pinning?
    • Simultaneous elephant flows with large bursts: dynamic buffer protection (proprietary).

Videos:

  • Raoul Pal: crypto investment. His company. Go for the long run; invest only a bit that you can afford to lose.
  • Scott Galloway: His political analysis is interesting. Trump won, and it seems Latinos voted massively for him.
  • Bruce Dickinson: I read Bruce’s books some years ago so I was surprised to see him in a podcast. Need to finish it.
  • Eric Schmidt: I read one of his books some time ago so again, surprised to find him in a podcast. Still think Google has become evil and most of the good things he says are gone.
  • Javier Milei: I am not an economist, but it “seems” things are improving in Argentina. He is a character nonetheless. Need to finish it.
  • Matthew McConaughey: His book was really refreshing, and seeing him talking is the same. Raw, real.
  • Alex Honnold: You have to try hard if you want to do hard things.

SemiAnalysis – 100k cluster

This is a site that a friend shared with me some months ago, and it is PURE gold from my point of view. They share a lot of info for free, but not all of it; you have to subscribe/pay for the whole report. I would pay for it if my job were in that “business”.

This is the link for a 100k GPU cluster.

It covers all the details of building such infrastructure, down to the network/hardware side: power distribution, cooling, racking, network design, etc. It is all there.

It is something to read slowly to try to digest all the info.

This report on electrical systems (p1) shows the power facilities can be as big as the datacenter itself! So it is not rare to read that hyperscalers want nuclear reactors.

Inch, Dig, Microarch, LLM101, Eureka, D-Wave, Etched Transformers, MatMul, Born To Run, Paris Bakeries, Demons

Inch by inch

Biggest impacts on health (35m): Tracking, Environment and Genetics.

Just Dig It: Very interesting company. I just don't understand if it is an NGO or for-profit. Still, a very cool mission.

Microarch club: Looks very interesting but I can’t find time to listen to it.

LLM101: Another one if you have time.

Eureka Labs: The company behind LLM101.

D-Wave 2024: Now everything is about AI, but can quantum computing still be a thing? I remember when the news came out about the first commercial quantum computer.

Etched: Transformers chip

MatMul free: I already linked this paper but these are “applications”: link1 link2

Interview about Born to Run: I liked the book, and somehow YouTube now shows me related things. This is the first video I have watched from Rich Roll's channel and I liked it. It starts with the announcement of the 2nd book, which I think I will buy at some point. I also need to check out Eric Orton's work, as I want to improve my running (and recover from my injury) and see how I can do it with my “broken” knee and age. I am starting to do the exercises recommended at the end, as I want to see if I can get to somewhat more minimal shoes at some point. Let's see how it goes.

Paris Bakeries: I need to get back to Paris. I am going to start collecting sites to visit there:

Apollonia Poilâne
Cedric Grolet
pastry combos

Overcome your demons: I read this book a couple of years ago, and this video is a reminder that I need to read it again.

AI will save the world, Nutanix kernel upgrade, GPU Programming

AI will save the world: A positive view of AI development. The attack on China/Karl Marx at the end is interesting. In general, I feel confident this will be good.

Nutanix kernel upgrade story: This is a bit hardcore for me (and looks a bit old, from 2021), but it is still quite interesting how they did the troubleshooting.

GPU programming: I have never read about how to code for a GPU, and this looks interesting and quite different from what I would do on a CPU. From “Execution Model of the GPU” onwards I started to lose track. Still, it is nice to see a summary at the end and resources/books.

Meta GenAI Infra, Oracle RDMA, Cerebras, Co-packaged optics, devin, figure01, summarize youtube videos, pdf linux cli, levulinic acid

Meta GenAI infra: link. Interesting that they have built two clusters, one Ethernet and the other InfiniBand, both without bottlenecks. I don't understand if Grand Teton is where they install the NVIDIA GPUs? For storage, I would have expected something based on ZFS or similar. For performance, “We also optimized our network routing strategy”. And “debuggability” is critical for a system of this size: how quickly can you detect a faulty cable, port, GPU, etc.?

Oracle RDMA: This is an Ethernet deployment with RDMA. The interesting part is the development of DC-QCN (some ECN enhancement).

Cerebras WSE-3: It looks like, outside NVIDIA and AMD, this is the only other option. I wonder how much you need to change your code to work with this setup? They say it is easier… I like the pictures of the cooling and the racks.

Co-packaged optics: Interesting to see if this becomes the new “normal”. No more flapping links, anybody? Either it is the fiber, or you replace the whole switch…

I have been watching several videos lately and I would like to get a tool that gives a quick summary of a video so I can have notes (and check whether the tool is good). Some tools: summarize.tech, sumtubeai.

video1, video2, video3, video4, video5, video6, video7, video8, video9, video10, video11
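As a first step towards those notes, a rough sketch of what I have in mind: grab the auto-generated subtitles with yt-dlp (assuming it is installed and the video has them) and feed that text to whatever summarizer ends up working:

# Download only the auto-generated English subtitles, not the video itself:
yt-dlp --skip-download --write-auto-subs --sub-langs en "https://www.youtube.com/watch?v=VIDEO_ID"
# The resulting .vtt file is plain text that can be cleaned up and pasted into
# a summarizer or an LLM prompt.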

Devin and Figure01: Looks amazing and scary. I will need one robot for my dream bakery.

I wanted to “extract” some pages from different PDFs into just one file. “qpdf” looks like the tool for it.

qpdf --empty --pages first.pdf 1-2 second.pdf 1 -- combined.pdf
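That takes pages 1-2 from first.pdf and page 1 from second.pdf and writes them into a new combined.pdf.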

levulinic acid: I learnt about it from this news.

Life, Love, Sex, Negative Beliefs, startup regrets, nanog90, Groq LPU, LLM from scratch, ssh3, eBPF BGP, RPKI, TIANHE-3

I hit rock bottom this week. I hope I finally closed one door in my life so I can give myself the chance to open others. Did I make the wrong decision? It is easy to say when you look back. Do I regret it? The most annoying thing is that these are failures, so you can't go back and recover. But I was such a bloody newbie!!!… At least after 5 years…

“For every reason it’s not possible, there are hundreds of people who have faced the same circumstances and succeeded.” Jack Canfield

Head down, crying, cursing, whatever, but forwards. As it has always been.

—-

Somehow I managed to listen to some long videos, something I normally can't do (because of lack of time, etc.).

Negative beliefs, avoid bitterness, aim for greatness (remarkable things), escape the darkness: Jordan B Peterson with Modern Wisdom: video, podcast.

Find and keep Love: video. 1st Get your shit together. Communication is critical. Be careful with your shopping list….

Good Sex: video. Communicate….

Orgasm: video. Haven’t seen it completely yet but very interesting. Use your tongue wisely.

— Other things:

Startup decisions and regrets: page. Interesting. I think most of the things are very specific, but it is still a good read.

NANOG 90: agenda. I didn't watch the videos, but I reviewed several PDFs and these ones look interesting:

Abstract Ponderings: A ten-year retrospective. Rob Shakir – Google: video

https://rob.sh/post/reimagining-network-devices/
https://rob.sh/post/coaching/
https://cdn.rob.sh/files/the-next-spring-forward_2018.pdf
https://research.google/research-areas/networking/

AI Data Center networks – Juniper – video

Using gNOI capabilities to simplify the software upgrade use case: video – I had no idea about gNOI, so this looks interesting. It is crazy that, still in the 21st century, automating a network device is so painful. Thanks to all the vendors for making our lives miserable.

Go for network engineers: video, slides – I always thought Golang had massive potential for network automation, but there was always a lack of support and Python is the king. So it is nice to see that Arista has things to offer.

PTP in Meta: video and blog.

There are more things, but I haven't had the chance to review them.

—-

It looks like there is a new chatbot that is not using the standard NVIDIA GPUs. Groq uses an LPU (Language Processing Unit), and they say it is better than a GPU. They have this paper, but I can't really see the features of that LPU.

Slurp’it: Saw this blog, and the product looks interesting, but although it is free, it is not open source, and at the end of the day you don't want a new vendor lock-in.

Containerlab in Kubernetes: Clabernetes. I would like to play with this one day.

NetDev 0x17: videos and sessions. link. This is quite low-level detail and most of the time beyond my knowledge. Again, something to check out at some point.

LLM from scratch: repo. Looks very interesting. But the book is going to take a long time to hit the market.

ssh3: repo. Interesting experiment.

eBPF and BGP: blog. Really interesting. Another thing I have always wanted to play with.

Orange RPKI: old news, but still interesting to see how much damage RPKI can cause in the wrong hands…

China TIANHE-3 Supercomputer: Very interesting. Link.

AWS Intent-Driven 2023 – Groq – Graviton4 – Liquid Cooling – Petals – Google – Crawler – VAX – dmesg

AWS re:Invent Intent-Driven Network Infra: Interesting video about intent-driven networking in AWS. This is the paper he shows in the presentation. Same notes as last year: leaf-spine, pizza boxes, all home-made. The development of SIDR as the control plane for scale. And somehow the talk covers UltraCluster for AI (20k+ GPUs). Maybe that is related to this NVIDIA-AWS collaboration. Interesting that there is no mention of QoS; he said no oversubscription. In general, everything is high level and done in-house, and very likely they are facing problems that very few companies in the world are facing. Still, it would be nice to open all those techs (like Google has done – but never for network infra). As well, I think he hits the nail on the head with how he defines himself, from network engineer to technologist, as at the end of the day you touch all the topics.

AWS backbone: No chassis, all pizza boxes

Graviton4: More ARM chips at cloud scale.

Groq: I didn't know about this “GPU” alternative. Interesting numbers. Let's see if somebody buys it.

Petals: Run LLMs BitTorrent-style!

Google view after 18 years: A very nice read about the culture shift in the company, from “don't be evil” to making lots of money at any cost.

GPT-Crawler: The negative thing is that you need the paid version of ChatGPT. I wonder, if I crawled Cisco, Juniper, and Arista, would that be nearly all the network knowledge on the planet? If that crawler can get ALL that data.

Linux/VAX porting: Something that I want to keep (ATP).

dmesg -T: How many times (over even more years!!!!) have I wondered how to turn those timestamps into something I could compare with when debugging.
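A quick sketch of the difference (the tail is just to keep the output short):

# Default output: timestamps are seconds since boot, hard to line up with
# application logs.
dmesg | tail -5
# With -T the timestamps are converted to human-readable wall-clock time,
# which is what I actually want when correlating with other logs
# (note: the conversion can drift after suspend/resume).
dmesg -T | tail -5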