Cloudflare had an outage last week. This time I could really identify with the situation, as it could just as easily happen to me:
- Design: When you aim for HA, even a single patch panel is a SPOF, no matter how much redundancy you have in your transit providers, routers, switches, firewalls, etc. So, look for SPOFs!
- Documentation: For DC stuff, at my current employer we use patchmanager. It is super handy for remote locations and it is our source of truth. Keep in mind that the tool is only as good as you keep it updated... For example, for the PoPs we visit more often and where we make more changes, we find more errors than we would like... For remote PoPs, as we know we are not going to come back for a couple of years, we are much more thorough. For network kit, we have RANCID+Git, so we always know the latest config and when changes were introduced (to within 30-minute intervals, at least).
- Process: We follow a risk assessment for any change we plan to introduce. Then on Thursday we have a CAB meeting to schedule which changes are going to happen during the weekend. The aim is for several people from different teams to understand and have a say in what is going to happen. This has proved very useful. Four pairs of eyes are better than half 🙂 Still, you need to be rigorous in this process.
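
As a rough illustration of the "look for SPOFs" advice in the Design point, here is a minimal sketch in Python. It assumes you can export your physical topology as a list of links (for example from your DC documentation tool). The device names, the `LINKS` list and the `find_spofs` helper are all made up for this example; they are not taken from any real PoP or tool. The idea is simply to remove each device in turn and check whether traffic can still get from one end to the other.

```python
from collections import deque


def reachable(graph, src, dst, removed):
    """Return True if dst can be reached from src without using `removed`."""
    seen = {src}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for neighbour in graph.get(node, ()):
            if neighbour != removed and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False


def find_spofs(links, src, dst):
    """List every device whose loss alone cuts src off from dst."""
    # Build an undirected adjacency map from the list of links.
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    # Brute force: take each device out and test connectivity again.
    return sorted(
        node
        for node in graph
        if node not in (src, dst) and not reachable(graph, src, dst, node)
    )


# Hypothetical PoP: two transits, two routers, two switches...
# and everything cabled through one patch panel.
LINKS = [
    ("internet", "transit-a"),
    ("internet", "transit-b"),
    ("transit-a", "patch-panel-1"),
    ("transit-b", "patch-panel-1"),
    ("patch-panel-1", "router-1"),
    ("patch-panel-1", "router-2"),
    ("router-1", "switch-1"),
    ("router-1", "switch-2"),
    ("router-2", "switch-1"),
    ("router-2", "switch-2"),
    ("switch-1", "servers"),
    ("switch-2", "servers"),
]

print(find_spofs(LINKS, "internet", "servers"))  # ['patch-panel-1']
```

This only looks at single-device failures in the graph you feed it, so it is only as good as the documentation behind it. It will not see shared power, shared conduits or anything else that is missing from the link list.
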
Even taking all this into account, you will have an outage. Have a retrospective, learn from it (no finger pointing) and apply it. Truly agile 😛