Google Networking, AI Cooling, MATx

OpenFlow at Google – 2012: OpenFlow is used to manage the network and to simulate it. 2 backbones: the first for customer traffic and the second for inter-DC traffic.

UKNOF32 – Google Datacenter networking 2015: Evolution up to Jupiter. Moving from chassis-based solutions to pizza boxes, which have a smaller blast radius than a chassis. These switches have small buffers, but Google uses ECN (QoS) to deal with that.
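To make the small-buffer point concrete, here is a minimal sketch of threshold-based ECN marking on a shallow queue. It is an illustration only: the `EcnQueue` class, the capacity and the marking threshold are hypothetical, and nothing here reflects Google's actual switch configuration. The idea is that packets get marked once the queue passes a threshold, so senders can slow down before the tiny buffer overflows and packets are dropped.

```python
from collections import deque


class EcnQueue:
    """Toy shallow-buffer queue with threshold-based ECN marking (hypothetical)."""

    def __init__(self, capacity, mark_threshold):
        self.capacity = capacity              # max packets the buffer can hold
        self.mark_threshold = mark_threshold  # depth at which marking starts
        self.packets = deque()

    def enqueue(self, packet_id):
        """Return 'dropped', 'marked', or 'queued' for the arriving packet."""
        if len(self.packets) >= self.capacity:
            return "dropped"  # buffer full: tail drop
        # Mark (instead of drop) when the queue is already past the threshold.
        marked = len(self.packets) >= self.mark_threshold
        self.packets.append((packet_id, marked))
        return "marked" if marked else "queued"

    def dequeue(self):
        """Remove and return the oldest (packet_id, marked) pair, or None."""
        return self.packets.popleft() if self.packets else None


# A burst of 6 packets into a 4-packet buffer that marks from depth 2.
queue = EcnQueue(capacity=4, mark_threshold=2)
results = [queue.enqueue(i) for i in range(6)]
print(results)
```

In a real network the marks are echoed back to the sender, which reduces its rate; with senders that react early, the burst above would ideally never reach the drop stage.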

Google DC Network via Optical Circuit 2022: (other video, paper, Google post) Adding optical circuit switches, no more Clos network! Aggregation blocks are connected in a full mesh. Spines are expensive and become bottlenecks. Traffic flows are predictable at large scale, so the network is not built for the worst-case scenario. Drawback: complex topology and routing control! Shortest-path routing is insufficient. TE: variable hedging allows operation at different points along the continuum, trading off optimality under correct prediction against robustness under misprediction -> no more spikes. Hitless topology reconfiguration. It seems it has already been running for 5 years. To be honest, it goes a bit beyond my knowledge.
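To get a feel for the hedging trade-off, here is a toy sketch, not Google's actual traffic engineering algorithm. It assumes a full mesh of aggregation blocks where each demand can use the direct link or a two-hop path through any other block. A hypothetical `hedge` knob goes from 0.0 (everything on the direct link) to 1.0 (spread evenly over all paths). The block names and demand values are made up.

```python
from itertools import permutations


def split_demand(blocks, src, dst, demand, hedge):
    """Return {link: load} for one src->dst demand in a full mesh.

    hedge=0.0 puts everything on the direct link; hedge=1.0 spreads it
    evenly over the direct path and every one-transit path.
    """
    transits = [b for b in blocks if b not in (src, dst)]
    n_paths = 1 + len(transits)
    spread_share = hedge * demand / n_paths
    # The direct link carries the unhedged part plus its share of the spread.
    loads = {(src, dst): (1.0 - hedge) * demand + spread_share}
    for t in transits:
        # A transit path uses two links: src -> t and t -> dst.
        loads[(src, t)] = loads.get((src, t), 0.0) + spread_share
        loads[(t, dst)] = loads.get((t, dst), 0.0) + spread_share
    return loads


def max_link_load(blocks, demands, hedge):
    """Load on the busiest link once every demand has been placed."""
    total = {}
    for (src, dst), demand in demands.items():
        for link, load in split_demand(blocks, src, dst, demand, hedge).items():
            total[link] = total.get(link, 0.0) + load
    return max(total.values())


blocks = ["A", "B", "C", "D"]
uniform = {pair: 1.0 for pair in permutations(blocks, 2)}  # the "predicted" case
skewed = {("A", "B"): 8.0}                                 # an unexpected hot pair

for hedge in (0.0, 0.5, 1.0):
    print(hedge, max_link_load(blocks, uniform, hedge),
          max_link_load(blocks, skewed, hedge))
```

With no hedging the uniform demand fits perfectly but the hot pair overloads a single link; with full hedging the hot pair is absorbed, at the cost of extra transit load when the traffic really is uniform. A real TE system picks a point in between based on how much it trusts the prediction.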

Google TPUv4 + Optical reconfigurable AI Network 2023: Based on the above but for AI at scale, although there is already a TPUv5. The pictures on this page help to get a view of the connectivity. Still complex, though.

Open Compute Project 2023: AI Datacenter – Mainly about how to cool AI infrastructure, given its heavy GPU and power requirements.

MATx: A new company designing hardware for AI models.