{"id":1053,"date":"2022-11-10T23:17:30","date_gmt":"2022-11-10T23:17:30","guid":{"rendered":"https:\/\/blog.thomarite.uk\/?p=1053"},"modified":"2022-11-10T23:46:25","modified_gmt":"2022-11-10T23:46:25","slug":"arp-storms-evpn","status":"publish","type":"post","link":"https:\/\/blog.thomarite.uk\/index.php\/2022\/11\/10\/arp-storms-evpn\/","title":{"rendered":"ARP Storms &#8211; EVPN"},"content":{"rendered":"\n<p>We have had an issue with broadcast storms in our network. Checking the CoPP setup in the switches, we could see massive drops of ARP. This is a good <a href=\"https:\/\/thejordanburnett.com\/how-to-verify-copp-policy-and-drops-in-cisco-nx-os\/\">link<\/a> on how to check CoPP drops in NX-OS.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">N9K# show copp status\nN9K# show policy-map interface control-plane | grep 'dropped [1-9]' | diff<\/pre>\n\n\n\n<p>Having so many ARP drops by CoPP is bad because legitimate ARP requests are very likely being dropped too.<\/p>\n\n\n\n<p>Initially I thought it was related to ARP problems in EVPN, as described in this <a href=\"https:\/\/blog.apnic.net\/2021\/12\/01\/arp-problems-in-evpn\/\">link<\/a>. But after taking a packet capture on a switch, from an interface connected to a server, I could see that over 90% of the ARP requests coming from the server were not getting a reply&#8230; Checking different switches, I could see the same pattern all over the place.<\/p>\n\n\n\n<p>So why was the server making so many ARP requests?<\/p>\n\n\n\n<p>After some time, I managed to get help from a sysadmin with access to the servers, so we could troubleshoot the problem.<\/p>\n\n\n\n<p>But how do you find the process that is triggering the ARP requests? I didn&#8217;t make the effort to think about it and started to search for an easy answer. 
This <a href=\"https:\/\/unix.stackexchange.com\/questions\/343855\/how-does-one-determine-the-process-causing-an-arp-request\">post<\/a> gave me a clue.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">ss does show you connections that have not yet been resolved by arp. They are in state SYN-SENT. The problem is that such a state is only held for a few seconds then the connection fails, so you may not see it. You could try rapid polling for it with\n\nwhile ! ss -p state syn-sent | grep 1.1.1.100; do sleep .1; done<\/pre>\n\n\n\n<p>Somehow I couldn&#8217;t see anything with &#8220;ss&#8221;, so I tried netstat, as it also shows the state of the TCP connection (I wonder what would happen if the connection was UDP instead?)<\/p>\n\n\n\n<p>Initially I tried &#8220;netstat -a&#8221; and it was too slow to catch the &#8220;SYN_SENT&#8221; state.<\/p>\n\n\n\n<p>Shame on me, I had to search for how to show the ports quickly <a href=\"https:\/\/serverfault.com\/questions\/398234\/netstat-continuous-refresh-watch-changes-the-output\">here<\/a>:<\/p>\n\n\n\n<pre id=\"block-7236df2d-73c4-4bb7-ac78-3706caac7ec3\" class=\"wp-block-preformatted\">watch 'netstat -ntup | grep -i syn_sent | awk \"{print \\$4,\\$5,\\$6,\\$7}\"'<\/pre>\n\n\n\n<p>It was slow because it was trying to resolve every IP to a hostname&#8230; :facepalm: That is fixed with &#8220;-n&#8221; (no name resolution).<\/p>\n\n\n\n<p>Anyway, with the command above, I finally managed to see the processes that were in the &#8220;SYN_SENT&#8221; state.<\/p>\n\n\n\n<p>This is not the real thing, just an example:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">#  netstat -ntup | grep -i syn_sent \ntcp        0      1 192.168.1.203:35460     4.4.4.4:23              SYN_SENT    98690\/telnet        \n# \n<\/pre>\n\n\n\n<p>In the real case, we could see that the destination port was TCP 179, so something in the node was trying to talk BGP! They were &#8220;bird&#8221; processes. 
As the node belonged to a Kubernetes cluster, we could see a <a href=\"https:\/\/projectcalico.docs.tigera.io\/reference\/architecture\/overview\">calico<\/a> container acting as the CNI. We then connected to the container and checked the bird config. We could clearly see that the IPs that were not getting ARP replies were configured there.<\/p>\n\n\n\n<p>So in summary, <strong>basic<\/strong> TCP:<\/p>\n\n\n\n<p>Very briefly: TCP is L4 and goes down to IP at L3. To get to L2, you need to know the MAC address of the destination IP, so that triggers the ARP request. Once the MAC is learned, it is cached for the next request. For that reason, the first time you make a connection it is slow (ping, traceroute, etc.).<\/p>\n\n\n\n<p>Now we need to work out why the calico\/bird config is that way, fix it so that it only uses the IPs of real BGP speakers, and then verify that the ARP storms stop.<\/p>\n\n\n\n<p>Hopefully, I will learn a bit about calico.<\/p>\n\n\n\n<p><strong>Notes for UDP<\/strong>:<\/p>\n\n\n\n<p>If I generate a UDP connection to a non-existent IP:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">$ nc -u 4.4.4.4 4000<\/pre>\n\n\n\n<p>netstat tells me the UDP connection is established, and I can&#8217;t see anything in the ARP table for an external IP. For an internal IP (in my own network), I can see an incomplete entry. <strong>Why<\/strong>?<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">#  netstat -ntup | grep -i 4.4.4.4\nudp        0      0 192.168.1.203:42653     4.4.4.4:4000            ESTABLISHED 102014\/nc           \n# \n#  netstat -ntup | grep -i '192.168.1.2:'\nudp        0      0 192.168.1.203:44576     192.168.1.2:4000        ESTABLISHED 102369\/nc           \n# \n#\n# arp -a\n? (192.168.1.2) at &lt;incomplete> on wlp2s0\nsomething.mynet (192.168.1.1) at xx:xx:xx:yy:yy:zz [ether] on wlp2s0\n# \n\n# tcpdump -i wlp2s0 host 4.4.4.4\ntcpdump: verbose output suppressed, use -v[v]... 
for full protocol decode\nlistening on wlp2s0, link-type EN10MB (Ethernet), snapshot length 262144 bytes\n23:35:45.081819 IP 192.168.1.203.50186 > 4.4.4.4.4000: UDP, length 1\n23:35:45.081850 IP 192.168.1.203.50186 > 4.4.4.4.4000: UDP, length 1\n23:35:46.082075 IP 192.168.1.203.50186 > 4.4.4.4.4000: UDP, length 1\n23:35:47.082294 IP 192.168.1.203.50186 > 4.4.4.4.4000: UDP, length 1\n23:35:48.082504 IP 192.168.1.203.50186 > 4.4.4.4.4000: UDP, length 1\n^C\n5 packets captured\n5 packets received by filter\n0 packets dropped by kernel\n# <\/pre>\n\n\n\n<ul class=\"wp-block-list\">\n<li>UDP is <strong>stateless<\/strong>, so there are no connection states to track&#8230; netstat simply reports a connected UDP socket as &#8220;ESTABLISHED&#8221;, even though nothing was negotiated with the other end. Basic TCP\/UDP.<\/li>\n\n\n\n<li>When opening a UDP connection to an external IP, the traffic has to be &#8220;routed&#8221;: my laptop knows it needs to send the packets to the default gateway, so at L2 the destination MAC address is not that of 4.4.4.4 but that of the default gateway. BASIC ROUTING!!! For that reason you don&#8217;t see 4.4.4.4 in the ARP table. \n<ul class=\"wp-block-list\">\n<li>When opening a UDP connection to a local IP, my laptop knows it is in the same network, so it tries to find the destination MAC address using ARP. As nothing answers for 192.168.1.2, the ARP entry stays incomplete.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>We have had an issue with broadcast storms in our network. Checking the CoPP setup in the switches, we could see massive drops of ARP. This is a good link to know how to check CoPP drops in NXOS. 
N9K:# show copp status N9K# show policy-map interface control-plane | grep &#8216;dropped [1-9]&#8217; | diff Having &hellip; <a href=\"https:\/\/blog.thomarite.uk\/index.php\/2022\/11\/10\/arp-storms-evpn\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;ARP Storms &#8211; EVPN&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3,27,2],"tags":[],"class_list":["post-1053","post","type-post","status-publish","format-standard","hentry","category-unix","category-kubernetes","category-networks"],"_links":{"self":[{"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/posts\/1053","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/comments?post=1053"}],"version-history":[{"count":3,"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/posts\/1053\/revisions"}],"predecessor-version":[{"id":1056,"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/posts\/1053\/revisions\/1056"}],"wp:attachment":[{"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/media?parent=1053"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/categories?post=1053"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/tags?post=1053"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}