Fat Tree – Dragonfly – OpenAI infra

I haven’t played much with ChatGPT, but my first question was “what does the network infrastructure for building something like ChatGPT look like?” or similar. Obviously I didn’t find the answer I was looking for, and I don’t think I even asked the question properly.

Today I came across this video, and at 3:30 something very interesting starts. As it is an official video, it says the OpenAI cluster built in 2020 for ChatGPT was actually based on 285k AMD CPU cores plus 10k V100 GPUs, both InfiniBand-connected. They don’t mention lower-level details, but it looks like two separate networks? And from several other pages/videos I have seen, M$ is hardcore into InfiniBand.

Then, regarding InfiniBand topologies, it seems the most common are “fat-tree” and “dragonfly”. This video is quite good, although I have to watch it again (or several times) to fully understand it.
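To get some intuition about dragonfly, here is my own back-of-the-envelope sketch (not from the video) using the parameters from the original Kim/Dally dragonfly paper: p hosts per router, a routers per group, h global links per router, and at most a·h + 1 groups when every global link is used once.

```python
# Rough sizing of a dragonfly network, using the p / a / h parameters
# from the Kim/Dally dragonfly paper. My own sketch, just for intuition.

def dragonfly_size(p: int, a: int, h: int) -> dict:
    groups = a * h + 1          # max groups when each global link is used once
    routers = a * groups        # total routers in the system
    hosts = p * routers         # total hosts (terminals)
    radix = p + (a - 1) + h     # ports per router: hosts + intra-group + global
    return {"groups": groups, "routers": routers,
            "hosts": hosts, "router_radix": radix}

# Example with a "balanced" configuration (a = 2p = 2h)
print(dragonfly_size(p=4, a=8, h=4))
# -> {'groups': 33, 'routers': 264, 'hosts': 1056, 'router_radix': 15}
```

What I like about this topology is how fast the host count grows: just a handful of ports per router already reaches ~1k hosts with only 3 hops in the worst case.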

This blog post, this PDF and Wikipedia (high level) are good for learning about “Fat-Tree”.
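And to get a feel for the fat-tree numbers, another sketch of mine, assuming the classic 3-level k-ary fat-tree built entirely from identical k-port switches (the Al-Fares et al. construction): k pods, k/2 edge plus k/2 aggregation switches per pod, (k/2)² core switches, and k³/4 hosts at full bisection bandwidth. Real InfiniBand fat-trees are often oversubscribed or pruned, so these are the ideal numbers.

```python
# Back-of-the-envelope sizing for a 3-level k-ary fat-tree made of k-port
# switches (Al-Fares et al.). Full-bisection, non-oversubscribed case.

def fat_tree_size(k: int) -> dict:
    assert k % 2 == 0, "k must be even"
    hosts = k ** 3 // 4            # k pods * (k/2 edge switches) * (k/2 hosts each)
    edge = agg = k * (k // 2)      # k/2 edge and k/2 aggregation switches per pod
    core = (k // 2) ** 2           # core switches
    return {"hosts": hosts, "edge": edge, "aggregation": agg,
            "core": core, "total_switches": edge + agg + core}

# Example: k = 40 ports per switch (roughly the radix of current InfiniBand switches)
print(fat_tree_size(40))
# -> {'hosts': 16000, 'edge': 800, 'aggregation': 800, 'core': 400, 'total_switches': 2000}
```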

Although most of the info I found is “old”, these technologies are not outdated. Frontier uses them, and it looks like most supercomputers do too.