Jericho3-vs-Infiniband

Jericho3 is the new chip from Broadcom to take on NVIDIA's InfiniBand. From that article, I don't really understand the "Ramon3" fabric. It seems it can support 18 ports at 800G (based on 144 SerDes at 100G), and it has 160 SerDes (16 Tbps) for uplinks to Ramon3. The goal is to reduce the time the nodes wait on the network, so it is not just port-to-port latency. Based on Broadcom's testing, swapping a 200G InfiniBand switch for a Jericho3 gives about 10% better performance. As well, I don't understand what they mean by "perfect load balancing" (from my point of view, flow size matters) and "congestion free". Having this working at scale… looks interesting…
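
Quick back-of-the-envelope check of those numbers (my own arithmetic in Python, not anything from Broadcom's datasheet):

    # Jericho3-AI numbers as quoted above; all SerDes assumed to run at 100G
    downlink_serdes = 144
    uplink_serdes = 160

    downlink_gbps = downlink_serdes * 100       # 14,400 Gb/s towards the NICs
    ports_800g = downlink_gbps // 800           # -> 18 x 800G ports
    uplink_tbps = uplink_serdes * 100 / 1000    # -> 16 Tb/s of uplink towards Ramon3

    print(ports_800g, uplink_tbps)              # 18 16.0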

But then we have the answer from NVIDIA: Spectrum-X. It is Spectrum-4 switches with BlueField-3 DPUs and software optimizations, so an Ethernet platform. Spectrum-4 definitely looks very impressive. But this sentence puzzles me: “The world’s top hyperscalers are adopting NVIDIA Spectrum-X, including industry-leading cloud innovators.” Most of the links I have been reading lately say that Azure, Meta and Google are using InfiniBand. Now NVIDIA says the top hyperscalers are adopting Spectrum-X, when Spectrum-4 only started shipping this quarter?

And finally, why is NVIDIA pushing for both Ethernet and InfiniBand? I think this is a good link for that. According to NVIDIA's CEO, InfiniBand is great and nearly "free" if you build for a very specific application (supercomputers, etc). But for multi-tenant environments, you want Ethernet. That kind of explains why hyperscalers like AWS, GCP and Azure want Ethernet at the end of the day, at least for customer access. If you have just one (commodity) network, it is cheaper and easier to run and maintain, and you don't have vendor lock-in like with IB.

We will see what happens with all this crazy AI/LLM/ML stuff.

AMD MI300 + Meta DC

Reading different articles (1, 2, 3), I became aware of this new CPU-GPU-HBM3 architecture from AMD.

As well, Meta has a new DC design for ML/AI using NVIDIA and InfiniBand.

Now, Meta – working with Nvidia, Penguin Computing and Pure Storage – has completed the second phase of the RSC. The full system includes 2,000 DGX A100 systems, totaling a staggering 16,000 A100 GPUs. Each node has dual AMD Epyc “Rome” CPUs and 2TB of memory. The RSC has up to half an exabyte of storage and, according to Meta, one of the largest known flat InfiniBand fabrics in the world, with 48,000 links and 2,000 switches. (“AI training at scale is nothing if we cannot supply the data fast enough to the GPUs, right?” said Kalyan Saladi – a software engineer at Meta – in a presentation at the event.)

And again, cooling is critical.

Fat Tree – Dragonfly – OpenAI infra

I haven't played much with ChatGPT, but my first question was something like "what does the network infrastructure for building something like ChatGPT look like?". Obviously I didn't get the answer I was looking for, and I don't think I asked properly either.

Today I came across this video, and at 3:30 something very interesting starts: since this is an official video, it says the OpenAI cluster built in 2020 for ChatGPT was actually based on 285k AMD CPUs ("InfiniBand" connected) plus 10k V100 GPUs ("InfiniBand" connected). They don't mention any lower-level details, but it looks like two separate networks? And I have seen in several other pages/videos that M$ is hardcore into InfiniBand.

Then, regarding InfiniBand architectures, it seems the most common are "fat-tree" and "dragonfly". This video is quite good, although I have to watch it again (or more) to fully understand it.

This blog, pdf and wikipedia entry (high level) are good for learning about "Fat-Tree".

Although most of the info I found is "old", these technologies are not really old: Frontier uses them, and it looks like most supercomputers do too.
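
Since I keep forgetting the fat-tree maths, here is a tiny Python helper with the sizing of the classic 3-tier k-ary fat-tree (the formulas from the Al-Fares et al. paper; the function itself is just my own scribble):

    def fat_tree(k: int) -> dict:
        """Switch/host counts for a 3-tier k-ary fat-tree built from k-port switches."""
        assert k % 2 == 0, "k must be even"
        edge = agg = k * (k // 2)       # k pods, each with k/2 edge and k/2 aggregation switches
        core = (k // 2) ** 2
        hosts = k ** 3 // 4             # each edge switch serves k/2 hosts
        return {"pods": k, "edge": edge, "agg": agg, "core": core, "hosts": hosts}

    # 48-port switches already give a ~27k-host non-blocking fabric
    print(fat_tree(48))   # {'pods': 48, 'edge': 1152, 'agg': 1152, 'core': 576, 'hosts': 27648}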

Meta Chips – Colovore water-cooling – Google AI TPUv4 – NCCL – PINS P4 – Slingshot – KUtrace

Read 1: Meta is to build its own AI chips. It is currently using 16k A100 GPUs (Google is using 26k H100 GPUs). And it seems Graphcore had some issues in 2020.

Read 2: I didn't know Colovore. It is interesting to see how critical power/cooling actually is, with all the hype in AI and the power constraints in key regions (Ashburn VA…). With proper water cooling you can have a 200 kW rack! And it seems they have the same power as a facility 6x bigger. Cooling via water is cheaper than air cooling.

Read 3: Google is one of the biggest NVIDIA GPU customers although they built TPUv4. MS uses 10k A100 GPUs for training GPT-4, and 25k for GPT-5 (a mix of A100 and H100?). For customers, MS offers an AI supercomputer based on H100s, 400G InfiniBand Quantum-2 switches and ConnectX-7 NICs: 4k GPUs. Google has A3 GPU instances treated like supercomputers and uses "Apollo" optical circuit switching (OCS). "The OCS layer replaces the spine layer in a leaf/spine Clos topology" -> interesting to see what that means and looks like. As well, it uses NVSwitch to interconnect the GPUs' memories so they act like one. As well, they have their own (smart) NICs (DPUs, data processing units, or IPUs, infrastructure processing units?) using P4. Google has its own "inter-server GPU communication stack" as well as NCCL optimizations (a 2016 post!).

Read 4: Via the P4 newsletter. Since Intel bought Barefoot, I kind of assumed the product was nearly dead, but visiting the page and checking these slides, it seems "alive". SONiC + P4 are main players in Google's SDN.

 “Google has pioneered Software-Defined Networking (SDN) in data centers for over a decade. With the open sourcing of PINS (P4 Integrated Network Stack) two years ago, Google has ushered in a new model to remotely configure network switches. PINS brings in a P4Runtime application container to the SONiC architecture and supports extensions that make it easier for operators to realize the benefits of SDN. We look forward to enhancing the PINS capabilities and continue to support the P4 community in the future”

Read 5: Slingshot is another switching technology, coming from Cray supercomputers and trying to compete with InfiniBand. A 2019 link that looks interesting too. And a paper that I don't think I will be able to read, let alone understand.

Read 6: ISC High Performance 2023. I need to try to attend one of these events in the future. There are two interesting talks, although I doubt they will provide any online video or slides.

Talk1: Intro to Networking Technologies for HPC: “InfiniBand (IB), High-speed Ethernet (HSE), RoCE, Omni-Path, EFA, Tofu, and Slingshot technologies are generating a lot of excitement towards building next generation High-End Computing (HEC) systems including clusters, datacenters, file systems, storage, cloud computing and Big Data (Hadoop, Spark, HBase and Memcached) environments. This tutorial will provide an overview of these emerging technologies, their offered architectural features, their current market standing, and their suitability for designing HEC systems. It will start with a brief overview of IB, HSE, RoCE, Omni-Path, EFA, Tofu, and Slingshot. In-depth overview of the architectural features of IB, HSE (including iWARP and RoCE), and Omni-Path, their similarities and differences, and the associated protocols will be presented. An overview of the emerging NVLink, NVLink2, NVSwitch, Slingshot, Tofu architectures will also be given. Next, an overview of the OpenFabrics stack which encapsulates IB, HSE, and RoCE (v1/v2) in a unified manner will be presented. An overview of libfabrics stack will also be provided. Hardware/software solutions and the market trends behind these networking technologies will be highlighted. Sample performance numbers of these technologies and protocols for different environments will be presented. Finally, hands-on exercises will be carried out for the attendees to gain first-hand experience of running experiments with high-performance networks”

Talk2: State-of-the-Art High Performance MPI Libraries and Slingshot Networking: “Many top supercomputers utilize InfiniBand networking across nodes to scale out performance. Underlying interconnect technology is a critical component in achieving high performance, low latency and high throughput, at scale on next-generation exascale systems. The deployment of Slingshot networking for new exascale systems such as Frontier at OLCF and the upcoming El-Capitan at LLNL pose several challenges. State-of-the-art MPI libraries for GPU-aware and CPU-based communication should adapt to be optimized for Slingshot networking, particularly with support for the underlying HPE Cray fabric and adapter to have functionality over the Slingshot-11 interconnect. This poses a need for a thorough evaluation and understanding of slingshot networking with regards to MPI-level performance in order to provide efficient performance and scalability on exascale systems. In this work, we delve into a comprehensive evaluation on Slingshot-10 and Slingshot-11 networking with state-of-the-art MPI libraries and delve into the challenges this newer ecosystem poses.”

Read 7: Slides and video. I was aware of DTrace (although I never used it), so I am not sure how it compares with KUtrace. I guess I will ask ChatGPT 🙂

Read 8: Python as the programming language of choice for AI, ML, etc.

Read 9: M$ “buying” energy from fusion reactors.

VXLAN BGP EVPN Multisite

This is a video that gives a high-level explanation of EVPN Multisite. There is not really any config involved. The pdf for the session "BRKDCN-2913" is easy to find and download. Although this is NX-OS based, Arista has a similar feature called "EVPN Gateway": https://www.arista.com/en/support/toi/eos-4-25-0f/14591-evpn-l3-gateway (needs registration…). Really just one line to add under the EVPN address family to change the next hop to the gateway's address. The implementation looks much simpler than on NX-OS…

This is a summary of the video:


RFC 9014 (DCI EVPN Overlay) defines the Layer-2 extension between two domains

section 3: decoupled GW, VLAN handoff with a WAN edge.
section 4: integrated GW, the GWs talk L2 EVPN directly.
multi-site (BESS version) draft-sharma-bess-multi-site-evpn: supports extension of L2 and L3, unicast and multicast, VPNs. BGWs talk the eBGP EVPN AF.
GW mode: anycast VIP (ECMP: underlay) or multipath VIP (ECMP: underlay and overlay)
type-5: re-originated.
RD: separate RD for VIP and PIP
RT: same for intra/inter-DC
Border GW = EVPN GW

EVPN-IPVPN interop defines the Layer-3 extension between domains; it currently lacks EVPN-to-EVPN interconnects

The multisite draft combines RFC 9014 and EVPN-IPVPN interop with an EVPN-to-EVPN connection: https://datatracker.ietf.org/doc/html/draft-sharma-bess-multi-site-evpn-02

Use cases:
1- Compartmentalization:

  • multiple fabrics, single DC
  • control at the BGW: allows L2/L3 extension. Reduces the remote VTEP count. Expands VTEP scale.
  • BUM packets: the LS replicates only within its fabric, then the BGW replicates to the BGW in the other fabric. Without multi-site, the LS replicates to ALL VTEPs in the fabric.

2- Scale

  • control at the BGW: Reduces the remote VTEP count. Expands VTEP scale.
  • scale through hierarchy: multiply VTEPs by sites
    up to 128 sites per multi-site domain. Up to 256 VTEPs per fabric -> 32,768 VTEPs (see the quick sketch after this list)

3- DC interconnect (DCI)

  • IP reachability and MTU.
    integration with legacy networks.
    hybrid cloud connectivity: extends L3 with VRF awareness.
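
Quick sketch of the scale arithmetic from the bullets above (just my own multiplication in Python, nothing official):

    # Multi-site scale through hierarchy
    sites_per_domain = 128
    vteps_per_fabric = 256

    total_vteps = sites_per_domain * vteps_per_fabric   # 32,768 VTEPs in the whole domain
    # Without the BGW hierarchy a leaf would need state for all of them;
    # with multi-site it only sees the VTEPs of its own fabric plus the BGWs.
    print(total_vteps)                                   # 32768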

Deeper look:
HW support is only important on the BGW; it is not needed on the LS.

tunnels:

  • stitched at the BGW (no recirculation, HW rate)
  • intra-fabric tunnels go LS to LS or LS to BGW
  • inter-fabric tunnels go BGW to BGW
  • only the BGW IPs must be unique… the fabrics are "separated".

BGW deployment considerations:

  • 1) anycast BGW
  • – up to 6 nodes. They are not interconnected, they just share the ASN, nothing else. In LS or SS.
  • – VIP mode: VIP used for tunnel stitching. Focus on scale and convergence. Overlay ECMP.
  • – PIP mode: for 3rd-party interop. Uses the PIP for tunnel stitching. Uses underlay and overlay ECMP.

  • 2) vPC BGW:
  • – only 2 nodes (because of vPC and the peer link). Only in LS.
    – legacy network integration, attachment of FWs and ADCs.

NOTE: both anycast and vPC BGWs must have a multi-site VIP and a PIP; only vPC needs an extra IP for the vPC VIP.
The PIP is needed for establishing BGP and for the Designated Forwarder election (only one BGW forwards per VLAN).

CP and DP:

  • As eBGP is used between sites -> eBGP changes the NH => VXLAN tunnel termination and re-origination + loop prevention (AS-path). Full-mesh eBGP EVPN between sites.
  • underlay/overlay CP deployment: the recommendation within the fabric is IGP as underlay, iBGP as overlay.
  • full-mesh eBGP EVPN between sites OR deploy an RS (route server) -> the RS is in a separate AS and only does CP, like an eBGP RR (RFC 7947): EVPN route reflection, NH unchanged, RT rewrite!

I think this is the white paper mentioned:  https://www.cisco.com/c/en/us/products/collateral/switches/nexus-9000-series-switches/white-paper-c11-739942.html

Another thing: I wish it weren't so painful to simulate NX-OS. It is so easy to spin up a lab with cEOS… on a standard laptop.

BGP Add Path

Some weeks ago I was asked some questions and totally missed that BGP has a feature to advertise more paths than just the best path, which is the default behaviour. So I wanted to learn more about it. The RFC is here; it is good for understanding how the feature is negotiated. I have searched for other links that give a bit more info about the implementation/design details, because, reading the RFC, I didn't notice that this feature is for iBGP, like mentioned here.
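
To make the negotiation part more concrete, here is a minimal sketch (based on RFC 7911; the helper and the defaults are mine, not from any particular BGP implementation) of the ADD-PATH capability both peers have to exchange:

    import struct

    def addpath_capability(afi: int = 1, safi: int = 1, send_receive: int = 3) -> bytes:
        """ADD-PATH capability (code 69) for one AFI/SAFI; send_receive: 1=receive, 2=send, 3=both."""
        value = struct.pack("!HBB", afi, safi, send_receive)
        return struct.pack("!BB", 69, len(value)) + value

    # Once negotiated, every NLRI is prefixed with a 4-byte Path Identifier,
    # which is what lets several paths for the same prefix coexist.
    print(addpath_capability().hex())   # 450400010103 -> IPv4 unicast, send+receive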

Another feature I need to lab up.

BGP Site of Origin (SoO)

SoO is something that I have read about and often forget, so I am trying to stick it in my mind here. I found this link, which I think is quite good.

Definition:

It ensures a loop-free network, in particular with multi-homed MPLS Layer 3 VPN sites. BGP SoO is a tag appended to BGP updates that allows a peer (PE) to mark a particular prefix as belonging to a particular site.

In certain MPLS L3 VPN configurations, the BGP AS-path may not provide the granularity needed to prevent a loop in the control plane. For example, when the CPEs in your sites peer with the SP's PEs (multihomed sites) using the same ASN, you need to use "allowas-in" on the CPEs.

Scenario:

This scenario has two issues:

  • Suboptimal routing
  • Routing loop under failure.

Solution:

Configure a unique SoO code for each multihomed site on the PE routers.
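
A minimal sketch of the mechanics (my own Python based on RFC 4360; the ASN and site values are made up): the SoO is just an extended community, and the PE refuses to re-advertise a route back into the site it was learned from:

    import struct

    def soo(asn: int, site_id: int) -> bytes:
        """Site of Origin as a transitive two-octet-AS extended community (type 0x00, sub-type 0x03)."""
        return struct.pack("!BBHI", 0x00, 0x03, asn, site_id)

    def should_advertise(route_soo: bytes, attached_site_soo: bytes) -> bool:
        """A PE does not send a route back towards the site that originated it."""
        return route_soo != attached_site_soo

    site_a = soo(65000, 100)   # made-up value configured on both PEs facing site A
    print(should_advertise(site_a, site_a))   # False -> no loop, even with allowas-in on the CPEs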

This is just an intro as I want to create a lab with this.

NANOG 86

This is something I had on my to-watch list… and I finally found some time to check out a couple of talks that looked interesting:

  • Emulating Network Topologies in k8s (Google): video

I liked this talk. It is about network simulation with Kubernetes. It reminds me of "containerlab", as it uses containers. I was surprised that Google showed Nokia SR OS and Arista cEOS. Do they use them in production? Funnily enough, there was no Cisco.

I checked KNE and it looks interesting; I should try it at some point.

  •    Towards a new Ethernet for High-Performance Data Centers – Activities and Enhancements in IEEE 802.1: video

It mentions InfiniBand is still king in the top 100 HPC systems. There are 3 improvements in the pipeline for Ethernet QoS to make PFC more "flow" aware. Still some "time" until they hit the market though.

As well, there is a moment where it mentions Azure managed to get RDMA working at 100 km with MACsec. I think this is the video.

This is the whole video list and slides.

CXL

In one meeting somebody mentioned CXL, and I had to look it up.

Interesting:

Eventually CXL is expected to be an all-encompassing cache-coherent interface for connecting any number of CPUs, memory, process accelerators (notably FPGAs and GPUs), and other peripherals.

BFD Multihop

BFD is a protocol I assumed I knew "well", as it is quite straightforward… But after having to check how BFD multihop is configured and works, I noticed I actually had no idea. As usual, I need to read the RFC at some point.

From this link, I learned about the concept of control (hello) and echo packets… and that echo uses the same IP as source and destination… I really like the Wireshark captures.

Copy/Paste from the link

Packet Types

Control Packets

Control packets are used to establish BFD peerings. Essential information is included within these packets, including flags for things such as authentication, in addition to the timer negotiations.
These packets are sent via UDP to the far-side IP, using the bfd-control port 3784.
Because these packets must actually be processed by the peer, they are sent less frequently than the BFD echoes used for sub-second failure detection.


Echo Packets

BFD echo packets are essentially for local use. They are sent with the router's own IP as both source and destination, destined for the UDP bfd-echo port 3785. When an echo packet is received, because the destination IP is not that of the receiving router, it simply forwards it out of the appropriate interface, removing the need to punt it up to the processor.
Because the source and destination IP are those of the local router, BFD can be run asynchronously. That is, you can set up a single side to utilize BFD echo detection, while the other side merely maintains a BFD neighbor relationship through control packets.

And now about BFD multihop. It is a short read, and the main point is that the UDP port is 4784, compared with 3784 for single-hop.
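
To keep the ports straight, a minimal sketch (RFC 5880 mandatory section, my own packing; the lab IP and discriminator are made up) showing that multihop is the same 24-byte control packet, just sent to UDP 4784 instead of 3784:

    import socket
    import struct

    BFD_CONTROL_SINGLE_HOP = 3784   # single-hop control packets
    BFD_ECHO = 3785                 # echo packets (src IP == dst IP == local router)
    BFD_CONTROL_MULTIHOP = 4784     # multihop control packets

    def bfd_control(my_disc: int, your_disc: int, state: int = 1, detect_mult: int = 3) -> bytes:
        """RFC 5880 mandatory section: 24 bytes, timers in microseconds."""
        vers_diag = (1 << 5) | 0      # version 1, diag 0
        state_flags = state << 6      # AdminDown=0, Down=1, Init=2, Up=3; no flags set
        return struct.pack("!BBBBIIIII", vers_diag, state_flags, detect_mult, 24,
                           my_disc, your_disc, 1_000_000, 1_000_000, 0)

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(bfd_control(0x1, 0), ("192.0.2.1", BFD_CONTROL_MULTIHOP))   # lab/test address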

Then, for the specific details of configuring BFD MH in NX-OS, it is better to check the official documentation. That, for example, confirms "Echo mode is not supported for multihop BFD."

Another thing to take into account is CoPP. You need to check whether your device OS matches BFD in its CoPP policies, as multihop BFD goes to the CPU. As well, check if there is any other hardware configuration required.

Another thing that bites me is that when testing this in a software lab, BFD is always down, but at least the routing protocols come up.