I got escalated an issue recently that had caused several outages and needed an urgent fix.
For different reasons, we had asymmetric routing in SITE-A. The normal flow is the green arrow. During the asymmetric routing, the flow is the red line. Routing wise, things should work. BUT, we have firewalls in the path. The firewalls were configured to allow asymmetric connections (I was told). As far as I could see in the config and logs, nothing was dropped in the firewalls during the issue.
So first thing, I fixed the asymmetric routing so it didnt happen again. I took me a while to come up with the solution (and it was quite simple) as I had to understand properly the routing before and during the issue. The diagram is quite simplified at the end of the day.
So during the maintenance window when I applied the fix for the asymmetric routing, I managed to take some traces in the firewalls, as I was trying to understand where the traffic was dropped/lost during the asymmetric scenario. As well, I was not very familiar with several parts of the network and the monitoring, I didnt know which links where already tapped or not. Once I was happy with the routing fix, I tried to take a look at the traces. At high level, I could see the return traffic leaving FW1 and leaving DC1-SW1. Based on that, I started to think that the firewalls were fine…..
In another maintenance, I tried to take more logs in different part of the network and I could see clearly the traffic reaching A-SW1. As I ran of time and missed to tap some links, I couldnt carry on.
So based on the second maintenance, the issue had to be inside SITE-A. Somehow it didnt make sense. I checked I didnt have uRPF enabled. The rest was pure L2 so it couldnt see the L3…
So in the third maintenance, I got all my debugging tools to verify that any network kit was dropping the traffic in SITE-A…. and it was useless. I realized that I could do a tcpdump in the client IP1 i was using for testing and I could see some return traffic!!!!
So, I was just socked. I didnt get it. It didnt make sense.
Somehow, I reviewed the tcp captures I was doing in each interface of both firewalls. I was trying to get to basics.
I was assuming the TCP handshake was completed properly. After paying a bit of attention to the client logs… I could see the TCP handshake completed. And I could see the HTTP GET getting to and leaving DC2-FW…. so why the server IP2 was not answering!!!!???
So back to the tcp handshake and firewall captures, I was comparing step by step. Somehow, I missed that the TCP ACK from client IP2 was reaching DC2-FW…. but it was not leaving DC2-FW!!!! even worse, the HTTP GET it was actually crossing the DC2-FW !!!
SLAP IN THE FACE!!!
This is the TCP handshake. This is networking 101…..
The TCP state-machine in client and server during the asymmetric scenario
So I was asumming that because the client was sending HTTP get, the tcp handshake was completed in both ends!!!!
It didnt make sense why I was seeing TCP SYN-ACK retransmissions from the server IP1…. BECAUSE the TCP ACK from client IP2 never reached.
For that reason server IP2 never answered the HTTP GET, because from its end the tcp hanshake was not completed.
I banged my head several times on the table. I “saw” this during the first maintenance window when I took the tcpdump in the firewalls BUT I didnt pay attention to the basic details.
I trusted too much to see a wireshark trace because it is more visual and shows more info but the clues were all the time in the tcpdump from the firewalls that I didnt bother to pay full attention.
At least, I found out where and why the connections failed during the asymmetric routing scenario. A firewall upgrade did the job.
So all fixed.
Lessons learned:
- without proper foundation, you can’t build knowledge (tcp handshake state in client and server)
- when things dont make sense, get back to basics (tcp handshake)
- get the most of the tools at hand (tcpdump – PSH packets were the HTTP GET !!!!)