aws – T.I.L

MCP, Manus, Brain Computer, Spectrum-X, Quantum, DC, Hung Task, Do The Work

MCP: It is “old” news from Dec 2024 but looks like a big thing now.

Manus: new hype, but looks cool. Need to try.

Brain Computer: You have to replace the neurons….

Spectrum-X with Cisco Silicon: I don’t understand this move much. You sell your Ethernet solution as the best for AI and then you bring in a different one?

Quantum Computing: Several pieces of news lately, from MS Majorana (official) and AWS Ocelot. Still, is it being used on real problems? Or is it just PR?

Build your own DC: good intro. I don’t think you can find many books about this on Amazon.

Hung tasks in Linux: nice article on troubleshooting hung tasks in Linux.

Do the work

DeepSeek, AWS HPC SRD vs Multipath-TCP, OCSP death, AlphaChip, Visual AI agents, Ollama, Local AI, Bob Bowman

Nice analysis about DeepSeek without hype.

AWS HPC: Didn’t know AWS offered HPC services (article from 2021). I liked finding more details about SRD (Scalable Reliable Datagram): multipath LB, out-of-order delivery, congestion control similar to BBR. I wonder, is this not the same as what the Ultra Ethernet Consortium is trying to achieve?

Multipath-TCP: The above probably works in “closed” networks (managed by one entity) but maybe it is not going to work in the wild Internet. This still looks quite far from production. I believe this is like QUIC: somebody like Google deploys it and the rest jump on the bandwagon (more or less).

OCSP death: “OCSP is not making anyone more secure. Browsers are either not checking it or are implementing it in a way that provides no security benefits.”

AlphaChip: As far as I have read, designing chips is one of the most complex things there is, and getting help from AI can speed up the advances in chip design even more. I read that NVIDIA had something similar. This should be applied to ASICs too, so networking benefits as well.

Vision Agent AskUI: need to try

ByteDance UI Agent – UI-TARS: as above

Crawl4AI: Interesting for ingesting your local knowledge-base sites and using them with your local LLM….

Run your AI locally: I tried this on my work MacBook and it worked! I want to create an AI agent for a work project (actually I am dreaming of being able to achieve it….)

Open Web UI + Ollama: I tested this too on my MacBook and it works like magic! You can even use DeepSeek 🙂
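For scripting against the same local setup instead of going through the web UI, a minimal sketch could look like the following. It assumes Ollama is listening on its default local port (11434) and that a model has already been pulled; the model tag `deepseek-r1` is only an example, and the helper names `build_request` and `ask` are made up for this note.

```python
import json
import urllib.request

# Assumption: Ollama running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="deepseek-r1"):
    """Build (but do not send) a non-streaming generate request."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, model="deepseek-r1", timeout=120):
    """Send the prompt to the local Ollama server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Usage, with the server running and the model pulled:
#   print(ask("Explain ECMP in two sentences."))
```

Only the standard library is used, so this should run on a locked-down work laptop without installing anything extra.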

Bolt.diy + DeepSeek: I didn’t manage to install bolt.diy ….

Training your AI: My idea is to get an open-source LLM trained with my data so I can use it to do my “job”. But there was too much advertising in the video and I don’t have access to a GPU… and I don’t have much data either (or that’s what I think).

Bob Bowman (Michael Phelps coach): Show up, do the job.

AWS re:Invent 2024, Oracle Cloud AI, GenCast, videos

AWS re:Invent 2024 – Monday Night:

Graviton evolution: ARM-based chip for EC2. 50% of the new capacity in the last 2 years is Graviton.
Nitro cards: security chip too.
AWS Trainium2: min 47. 2x head per rack, then accelerators and a switch. Trainium != CPU|GPU. And this is a great analysis of Trainium2.
NeuronLink: min 60. I guess this is the equivalent of NVLink, etc.
Ultraserver, quite beefy pic, min 61.
Networking: min 73. 10p10u is a fabric = 10 petabits under 10 microseconds of latency.
Cabling: proprietary trunk connector, 16:1 fiber. Min 77. I am pretty sure I used pigtails some years ago, so not sure why this is new?
Firefly optic plug: loopback testing. This is interesting for DC operations. Min 78.
AWS design their own optics, reduced failure
Network topology: Min 81, new protocol SIDR – Scalable Intent Driven Routing. <1s reconvergence. not centralized.
And this is a better summary than mine.

AWS re:Invent 2024 – NET201: The only interesting thing is minute 29, with the usage of hollow-core fiber to improve latency. I assume it is used in very specific parts of the network; it looks a bit fragile. Elastic Fabric Adapter: not a really good explanation of what it is or where it runs (network, server, NIC?), but it seems important. Looks like SIDR?

AWS re:Invent 2024 – NET403: I think 401 and 402 were more interesting. There were repeated things from the two other talks. Still worth watching and hopefully there is a new one in 2025.

Oracle Cloud Infra – AI: First time I have visited the OCI page about their AI infra.

GenCast: weather prediction by Google DeepMind. Not sure to what extent this can be used by anybody. And how much hardware do you need to run it?

we’ve made GenCast an open model and released its code and weights, as we did for our deterministic medium-range global weather forecasting model.

Videos:

510km nonstop – Ross Edgley: I have read several of his books and this is the first time I have watched a full interview. Still, it is not clear to me what his dark side is.

A man with few friends or no circle at all – Jordan B Peterson: I need to watch this more often

Tesla TCP, Cerebras Inference, Leopold AGI race, Cursor + Sonnet, AI AWS Engineering Infra, NVLink HGX B200 and UALink, Netflix encoding challenges, Food waste snacks, career advice AWS, Thick Skin

Tesla TCP replacement: Instead of buying and spending a lot of money, build what you need. I assume there are very smart people around and real network engineering taking place. It is like a rewrite of TCP that doesn’t break it, so your switches can still play with it. It seems the videos are not available on the Hot Chips webpage yet. And this link looks even better, and even mentions Arista as the switching vendor. (video from hotchips24)

Cerebras Inference: From Hot Chips 2024. I am still blown away by the wafer-scale solution. Obviously, the presentation says its product is the best but I wonder, can you install a “standard” Linux and run your LLM/inference that easily?

Leopold AGI race: Via linkedin, then the source. I read chapter 3, regarding the race to the Trillion-Dollar cluster. It all looks like sci-fi, but I think it may not be that far from reality.

Cursor + Sonnet: Replacement for Copilot? (original) I haven’t used Copilot but at some point I would like to get on the bandwagon, try things and decide for myself.

AI AWS Engineering Infra: low-latency and large-scale networking (\o/), energy efficiency, security, AI chips.

NVLink HGX B200: To be honest, I always forget the concept of NVLink and I tell myself it is an “in-server” switch to connect all GPUs in a rack. Still, this can help:

At a high level, the consortium’s goal (UltraEthernet/ UA) is to develop an open standard alternative to Nvidia’s NVLink that can be used for intra-server or inter-server high-speed connectivity between GPU/Accelerators to build scale-up AI/HPC systems. The plan is to use AMD’s interconnect (Infinity Fabric) as the baseline for this standard.

Netflix encoding challenges: From encoding per connection quality, to per-title, to per-shot. There are still challenges for live streaming. Amazon already does live streaming for sports; have they “solved” the problem? I don’t use Netflix or similar, but the challenges and the engineering behind it are still quite interesting.

Food Waste snacks: Indeed, we need more of this.

Some career advice from AWS: I “get” the point, but you still want to be up to speed (at a certain level) with new technologies; you don’t want to become a dinosaur (ATM, frame-relay, pascal, etc).

Again, it’s not about how much you technically know but how you put into use what you know to generate amazing results for a value chain.

Get the data – be a data-driven nerd if you will – define a problem statement, demonstrate how your solution translates to real value, and fix it.

Thick Skin:

“Not taking things personally is a superpower.” –James Clear

Because “no” is normal.

Before John Paul DeJoria built his billion-dollar empire with Patrón and hair products, he hustled door-to-door selling encyclopedias. His wisdom shared at Stanford Business School on embracing rejection is pure gold (start clip at 5:06).

You see, life is a numbers game. Today’s winners often got rejected the most (but persevered). They kept taking smart shots on goal and, eventually, broke through.

Cloudflare backbone 2024, Cisco AI, Leetcode, Alibaba HPN, Altman UBI, xAI 100k GPU, Crowdstrike RCA, Github deleted data, DGX SuperPod, how ssh works, Grace Hooper Nvidia

Cloudflare backbone 2024: Everything very high level. 500% backbone capacity increase since 2021. Use of MPLS + SR-TE. It would be interesting to see how they operate/automate that many PoPs.

Cisco AI: “three of the top four hyperscalers deploying our Ethernet AI fabric” I assume it is Google, Microsoft and Meta? AWS is the fourth and biggest.

Huawei Cloud Monitor: Haven’t read the RD-Probe paper. I would expect a git repo with the code 🙂 It also refers to an AWS pdf and video.

Automated Leetcode: One day, I should have time to use it and learn more programming, although AI can solve them quicker than me 🙂

Alibaba Cloud HPN: linkedin, paper, AIDC material

LLM Traffic Pattern: periodic bursty flows, few flows (LB is harder)

Sensitive to failures: GPU, link, switch, etc

Limitations of Traditional Clos: ECMP (hash polarization) and SPOF in TORs
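Why “few flows” makes load balancing harder can be sketched with a toy model. This is not from the paper and it illustrates the few-large-flows effect rather than hash polarization itself: per-flow ECMP pins each flow to one uplink, so with a handful of large flows the load moves in flow-sized chunks. The flow tuples, the 400 Gbps rate and the function names are hypothetical.

```python
import hashlib


def ecmp_pick(flow, n_links):
    """Pick an uplink by hashing the flow 5-tuple, as per-flow ECMP does."""
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_links


def per_flow_load(flows, n_links):
    """Every packet of a flow follows the same link, so load moves in flow-sized chunks."""
    load = {link: 0 for link in range(n_links)}
    for flow, gbps in flows:
        load[ecmp_pick(flow, n_links)] += gbps
    return load


def sprayed_load(flows, n_links):
    """Idealised per-packet spraying: total traffic spread evenly over all links."""
    total = sum(gbps for _, gbps in flows)
    return {link: total / n_links for link in range(n_links)}


# Hypothetical example: 8 flows of 400 Gbps over 8 uplinks.
flows = [((f"10.0.0.{i}", "10.0.1.1", 4791, 4791, "udp"), 400) for i in range(8)]
print("per-flow ECMP:", per_flow_load(flows, 8))
print("sprayed      :", sprayed_load(flows, 8))
```

With thousands of small flows the hash averages out; with eight large ones, any two that land on the same link double its load while another link sits idle.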

HPN goals:

-Scalability: up to 100k GPU

-Performance: low latency (minimum amount of hops) and maximum network utilization

-Reliability: Use two TORs with LACP from the host.

Tier1

– Use single-chip switch 51.2Tbps. They are more reliable. Dual TOR

– 1k GPUs in a segment (like nv-link) Rail-optimized network

Tier2: Eliminating load imbalance: Using dual plane. It has oversubscription

Tier3: connects several pod. Can reach 100k GPUs. Independent front-end network

Altman Universal Basic Income Study: It doesn’t fix all problems but, in my opinion, it helps, and it is a good direction.

xAI 100k GPU cluster: 100k liquid-cooled H100s on a single RDMA fabric. Looks like Supermicro is involved for the servers and Juniper only for the front-end network. NVIDIA provides all the Ethernet switches, with Spectrum-4. Very interesting. Confirmation from NVIDIA (Spectrum used = Ethernet). More details with a video.

Crowdstrike RCA:

Github access deleted data: Didn’t know about it. Interesting and scary.

Nvidia DGX SuperPod: reference architecture. video. 1 pod is 16 racks with 4 DGX each (128×8=1024 GPU per pod), 2xIB fabric: compute + storage, fat tree, rail-optimized, liquid cooling. 32k GPU fills a DC.

How SSH works: So powerful, and I am still so clueless about it

Chips and Cheese GH200: Nice analysis for Nvidia Grace CPU (ARM Neoverse) and Hopper H100 GPU

LaVague, S3, Stratego

LaVague: There are web services that don’t have an API, so could this help me to automate the interaction with them? I need to test. Another question: I am not sure if LaVague has an API itself!

S3: I had this on my to-read list for a long time… and after reading it today I was a bit surprised because it wasn’t as technical as I expected. The takeaways are: durability reviews, lightweight formal verification and ownership.

Stratego: I have never played this game but I was surprised that it is more “complex” than chess and go, and by how DeepNash can bluff and do unexpected things.

AWS Intent-Driven 2023- Groq – Graviton4 -Liquid Cooling – Petals – Google – Crawler – VAX – dmesg

AWS Reinvent Intent-Driven Network Infra: Interesting video about intent-driven networking in AWS. This is the paper he shows in the presentation. Same notes as last year: leaf-spine, pizza boxes, all home-made. The development of SIDR as the control plane for scale. And some talk about UltraCluster for AI (20k+ GPU). Maybe that is related to this NVIDIA-AWS collaboration. Interesting that there is no mention of QoS; he said there is no oversubscription. In general, everything is high level and done in-house, and very likely they are facing problems that very few companies in the world face. Still, it would be nice if they opened up all those techs (like Google has done, but never for network infra). Also, I think he hits the nail on the head in how he redefines himself from Network Engineer to Technologist since, at the end of the day, you touch all topics.

AWS backbone: No chassis, all pizza boxes

Graviton4: More ARM chips in cloud-scale

Groq: Didn’t know about this “GPU” alternative. Interesting numbers. Let’s see if somebody buys it.

Petals: Run LLMs bittorrent style!

Google view after 18 years: Very nice read about the culture shift in the company, from “don’t be evil” to making lots of money at any cost.

GPT-Crawler: On the negative side, you need the paid version of ChatGPT. I wonder, if I crawl Cisco, Juniper and Arista, would that be nearly all the network knowledge on the planet? If that crawler can get ALL that data.

Linux/VAX porting: Something that I want to keep (ATP).

dmesg -T: How many times (over even more years!!!!) have I wondered how to turn those timestamps into something I could compare against when debugging.
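What the flag does can be approximated by hand: the bracketed number in a raw dmesg line is seconds since boot, so adding it to the boot time gives a wall-clock timestamp. Below is a minimal sketch of that idea for saved logs. It is Linux only, the function names are made up for this note, and like `dmesg -T` itself the result drifts if the machine has been suspended and resumed.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches the default dmesg prefix, e.g. "[   90.500000] eth0: link up"
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s?(.*)$")


def boot_time_utc(now=None, uptime_path="/proc/uptime"):
    """Boot time = now - uptime (first field of /proc/uptime, Linux only)."""
    now = now or datetime.now(timezone.utc)
    with open(uptime_path) as fh:
        uptime_s = float(fh.read().split()[0])
    return now - timedelta(seconds=uptime_s)


def to_wall_clock(line, boot):
    """Rewrite '[  123.456789] msg' as 'ISO-timestamp msg'; other lines pass through."""
    match = LINE_RE.match(line)
    if not match:
        return line
    stamp = boot + timedelta(seconds=float(match.group(1)))
    return f"{stamp.isoformat(timespec='seconds')} {match.group(2)}"


# Usage on a saved log (file name is hypothetical):
#   boot = boot_time_utc()
#   for raw in open("dmesg.txt"):
#       print(to_wall_clock(raw.rstrip("\n"), boot))
```

This only makes sense on the same host and the same boot the log came from; for logs from another machine the boot time has to be supplied by hand.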

AusNOG 2023

Nice NOG meeting:

Vendor Support API: Interesting how Telstra uses the Juniper TAC API to handle power supply replacements. I was surprised that they are able to get the RMA and just try the replacement. If they don’t need it, they send it back… That saves Telstra time for sure. The problem I can see here is when you need to open tickets for inbound/outbound deliveries in datacenters that don’t have any API at all. If datacenters and big courier companies had APIs as first-class citizens, incredible things could happen. Still, just being able to have zero-touch replacement for power supplies is a start.

No Packet Behind – AWS: I think that until after the first 30 minutes there is nothing new that hasn’t been published in other NOG meetings between 2022 and 2023. At least they mention the name of the latest fabric, Final Cat. They also mention issues with IPv6 deployment.

There are other interesting talks but without video, so the pdf alone doesn’t really give me much (like the AWS live premium talk)

Google Spanner

From an email list, I read something about the Gmail migration to Spanner. I was a bit surprised because I use Gmail and didn’t know anything about it. That email sent me to this page. That migration had to be a monster one! More details here. From the first page, I got a bit more info about Falcon. In summary, that is part of a bigger picture about building the “AI-driven” future infrastructure.

AWS Networking Videos – March 2023

I watched some very interesting videos about AWS networking. They are high level, so they don’t tell you the magic sauce you would like to know, but it is nice that this info is out in public.

DKNOG – How AWS is evolving its peering-edge in 2023 and onwards link + event:

— Evolution from buying chassis to building your own devices: consume -> create (NOC-less, auto-remediation, active telemetry, etc)-> innovate (freedom to examine trade-offs, 1U devices). Clear use of “Clos” networks and their Linux-based software.

— Delighted: low complexity + high innovation

— Simplicity Scales

— The view of a router/brick as a set of 1U devices is interesting (rack 102.8T – 200x400G ports for customers, non-blocking). And it is very good that they have pictures of the concept of “bricks” and “spines”.

— Challenges with cabling (SN connector — no patching rack needed) and 400G ZR+ (heating!)

— BGP peering is actually with a container:

— James Hamilton paper – link + pdf

AWS re:Invent 2022 – Dive deep on AWS networking infrastructure (NET402)– link

— summary: This is “similar” to the DKNOG talk but longer and with some other details like:

— “We don’t like chassis”. 1+ million devices

— SRD (Scalable Reliable Datagram) at NIC level, so one TCP flow is actually load-balanced over several paths

— Hybrid SDN approach: You have controllers to give you a big-picture view (I guess it provides the visibility to say “just send 70% traffic to this device” – but not sure how) and the device’s own capability to deal with changes.

— Telemetry, continuous monitoring, triangulation: Be able to detect which port/device is causing the problem.

AWS re:Invent 2022 – Leaping ahead: The power of cloud network innovation (NET211-L) – link:

— AWS Global Infrastructure: Backbone capacity

— Customer SW/HW

— Everything fails all the time

— GPS locations in fibers! + inject light in fiber to double check fault -> intelligent optical routing/failover -> better than BGP….

— Termite-sheath fibers for Australia 🙂

— Nitro card = NIC (offload card)

— SRD (Scalable Reliable Datagram): does not need in-order packet delivery, as TCP requires. 25 Gbps flows allowed now.
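The multipath idea in these notes (packets of one flow spread over several paths, with the network not required to keep them in order) can be illustrated with a toy model. This is a generic sketch, not AWS’s implementation: the path delays, the send interval and the function names are all hypothetical.

```python
import random


def spray(packets, n_paths, seed=0):
    """Send each packet on a random path; paths differ in delay, so arrival order changes."""
    rng = random.Random(seed)
    delay = [rng.uniform(1.0, 5.0) for _ in range(n_paths)]  # hypothetical per-path latency
    arrivals = []
    for seq, payload in enumerate(packets):
        path = rng.randrange(n_paths)
        # arrival time = send time (one packet every 0.1 units) + delay of the chosen path
        arrivals.append((seq * 0.1 + delay[path], seq, payload))
    arrivals.sort()
    return [(seq, payload) for _, seq, payload in arrivals]


def reorder(arrivals):
    """Buffer early packets and release them in sequence order."""
    buffered, next_seq, out = {}, 0, []
    for seq, payload in arrivals:
        buffered[seq] = payload
        while next_seq in buffered:
            out.append(buffered.pop(next_seq))
            next_seq += 1
    return out


packets = [f"p{i}" for i in range(20)]
arrived = spray(packets, n_paths=4)
print("arrival order:", [seq for seq, _ in arrived])
print("restored     :", reorder(arrived) == packets)
```

The trade-off is that the receiving side has to carry sequence numbers and some buffering, which is the price for letting a single flow use more than one path.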