CCNA DevNet Notes

1) Python Requests status code checks:

r.status_code == requests.codes.ok

2) Docker publish ports:

$ docker run -p 127.0.0.1:80:8080/tcp ubuntu bash

This binds port 8080 of the container to TCP port 80 on 127.0.0.1 of the host machine. You can also specify udp and sctp ports. The Docker User Guide explains in detail how to manipulate ports in Docker.

3) HTTP status codes:

1xx informational
2xx Successful
 201 created
 204 no content (post received by server)
3xx Redirect
 301 moved permanently - future requests should be directed to the given URI
 302 found - requested resource resides temporarily under a different URI
 304 not modified
4xx Client Error
 400 bad request
 401 unauthorized (authentication missing or failed)
 403 forbidden (need permissions)
 404 not found
5xx Server Error
 500 internal server err - generic error message
 501 not implemented
 503 service unavailable
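
As a quick sketch (not part of the original notes), the standard library already knows these codes, so the class of a status code can be derived from its first digit. The `STATUS_CLASSES` table and the `describe` helper are names made up for this example.

```python
from http import HTTPStatus

# Class names follow the headings used in the list above.
STATUS_CLASSES = {
    1: "informational",
    2: "successful",
    3: "redirect",
    4: "client error",
    5: "server error",
}


def describe(code):
    """Return 'class: reason phrase' for a numeric HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for codes it does not know
    return "{}: {}".format(STATUS_CLASSES[code // 100], status.phrase)


for code in (201, 204, 301, 304, 401, 403, 503):
    print(code, describe(code))
```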

4) Python dictionary filters:

my_dict = {8:'u',4:'t',9:'z',10:'j',5:'k',3:'s'}

# filter(function, iterable): the lambda receives each (key, value) pair
new_dict = dict(filter(lambda val: val[0] % 3 == 0, my_dict.items()))

print("Filter dictionary:", new_dict)
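
The same filter can also be written as a dictionary comprehension, which many people find easier to read than `filter` plus `lambda`. This is just an alternative sketch; `by_key` and `by_value` are names invented for the example.

```python
my_dict = {8: 'u', 4: 't', 9: 'z', 10: 'j', 5: 'k', 3: 's'}

# Keep only the items whose key is a multiple of 3.
by_key = {k: v for k, v in my_dict.items() if k % 3 == 0}

# The same pattern can filter on the value instead of the key.
by_value = {k: v for k, v in my_dict.items() if v > 's'}

print(by_key)    # {9: 'z', 3: 's'}
print(by_value)  # {8: 'u', 4: 't', 9: 'z'}
```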

5) HTTP Authentication

Basic: For "Basic" authentication the credentials are constructed by first combining the username and the password with a colon (aladdin:opensesame), and then by encoding the resulting string in base64 (YWxhZGRpbjpvcGVuc2VzYW1l).

Authorization: Basic YWxhZGRpbjpvcGVuc2VzYW1l

---
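import base64

auth_type = 'Basic'
# user and password are assumed to be defined earlier ("pass" is a reserved word)
creds = '{}:{}'.format(user, password)
# b64encode works on bytes: encode first, then decode the result back to str
creds_b64 = base64.b64encode(creds.encode()).decode()
header = {'Authorization': '{} {}'.format(auth_type, creds_b64)}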

Bearer:

Authorization: Bearer <TOKEN>
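
A small round-trip sketch for the Basic scheme, using the aladdin/opensesame example from above. The helper names `basic_auth_header` and `parse_basic_auth` are made up here; the point is that base64 is only an encoding, so anyone who sees the header can recover the credentials.

```python
import base64


def basic_auth_header(user, password):
    """Build the Authorization header value for HTTP Basic authentication."""
    token = base64.b64encode("{}:{}".format(user, password).encode("utf-8"))
    return "Basic " + token.decode("ascii")


def parse_basic_auth(header_value):
    """Recover (user, password) from a Basic header; base64 is not encryption."""
    scheme, _, token = header_value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic Authorization header")
    user, _, password = base64.b64decode(token).decode("utf-8").partition(":")
    return user, password


header = basic_auth_header("aladdin", "opensesame")
print(header)                    # Basic YWxhZGRpbjpvcGVuc2VzYW1l
print(parse_basic_auth(header))  # ('aladdin', 'opensesame')
```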

6) “diff -u file1.txt file2.txt”. link1 link2

The unified format is an option you can add to display output without any redundant context lines

$ diff -u file1.txt file2.txt                                                                                                            
--- file1.txt   2018-01-11 10:39:38.237464052 +0000                                                                                              
+++ file2.txt   2018-01-11 10:40:00.323423021 +0000                                                                                              
@@ -1,4 +1,4 @@                                                                                                                                  
 cat                                                                                                                                             
-mv                                                                                                                                              
-comm                                                                                                                                            
 cp                                                                                                                                              
+diff                                                                                                                                            
+comm
  • The first file is indicated by ---
  • The second file is indicated by +++
  • The first two lines of this output show information about file 1 and file 2: the file name, modification date, and modification time of each file, one per line.
  • The next line starts with two at signs (@@), followed by the line range from the first file prefixed by “-” (in our case starting at line 1 and spanning 4 lines, written as 1,4), then a space, then the line range from the second file prefixed by “+”, and finally two closing at signs (@@).
  • The lines below it display the content of the files and how to modify file1.txt to make it identical to file2.txt:
    - (minus) – the line needs to be deleted from the first file.
    + (plus) – the line needs to be added to the first file.
    (space) – the line is unchanged.
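
To make the hunk header format concrete, here is a minimal parser sketch. `HUNK_RE` and `parse_hunk_header` are names invented for this example, and it only handles the `@@ -start,count +start,count @@` shape described above.

```python
import re

HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def parse_hunk_header(line):
    """Return ((start1, count1), (start2, count2)) for a unified-diff hunk header."""
    match = HUNK_RE.match(line)
    if match is None:
        raise ValueError("not a hunk header: {!r}".format(line))
    start1, count1, start2, count2 = match.groups()
    # When the count is omitted, the range covers a single line.
    return ((int(start1), int(count1 or 1)), (int(start2), int(count2 or 1)))


print(parse_hunk_header("@@ -1,4 +1,4 @@"))  # ((1, 4), (1, 4))
```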

7) Python Testing: Assertions

.assertEqual(a, b)         a == b
.assertTrue(x)             bool(x) is True
.assertFalse(x)            bool(x) is False
.assertIs(a, b)            a is b
.assertIsNone(x)           x is None
.assertIn(a, b)            a in b
.assertIsInstance(a, b)    isinstance(a, b)

*** .assertIs(), .assertIsNone(), .assertIn(), and .assertIsInstance() all have opposite methods, named .assertIsNot(), and so forth.
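
A minimal `unittest` sketch that exercises the assertions from the table, including a few of the opposite methods. The class name `AssertionExamples` and the helper `run_examples` are made up for this example.

```python
import unittest


class AssertionExamples(unittest.TestCase):
    def test_equality_and_truth(self):
        self.assertEqual(1 + 1, 2)
        self.assertTrue([0])       # a non-empty list is truthy
        self.assertFalse("")       # an empty string is falsy

    def test_identity_and_membership(self):
        value = None
        self.assertIsNone(value)
        self.assertIsNot([], [])   # two distinct list objects
        self.assertIn("b", "abc")
        self.assertNotIn(4, [1, 2, 3])
        self.assertIsInstance(3.0, float)
        self.assertNotIsInstance("3", int)


def run_examples():
    """Run the test case programmatically and return the unittest result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(AssertionExamples)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_examples()
```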

ARP Storms – EVPN

We have had an issue with broadcast storms in our network. Checking the CoPP setup in the switches, we could see massive drops of ARP. This is a good link to know how to check CoPP drops in NXOS.

N9K# show copp status
N9K# show policy-map interface control-plane | grep 'dropped [1-9]' | diff

Having so many ARP drops by CoPP is bad because very likely good ARP requests are going to be dropped.

Initially I thought it was related to ARP problems in EVPN like this link. But after taking a packet capture in a switch from an interface connected to a server, I could see that over 90% of the ARP traffic coming from the server was not getting a reply. Checking different switches, I could see the same pattern all over the place.

So why was the server making so many ARP requests?

After some time, I managed to get help from a sysadmin with access to the servers, so I could troubleshoot the problem.

But how do you find the process that is triggering the ARP requests? I didn't make the effort to think about it and started to search for an easy answer. This post gave me a clue.

ss does show you connections that have not yet been resolved by arp. They are in state SYN-SENT. The problem is that such a state is only held for a few seconds then the connection fails, so you may not see it. You could try rapid polling for it with

while ! ss -p state syn-sent | grep 1.1.1.100; do sleep .1; done

Somehow I couldn't see anything with “ss”, so I tried netstat, as it also shows the status of the TCP connection (I wonder what would happen if the connection was UDP instead?)

Initially I tried “netstat -a” and it was too slow to show me “SYN-SENT” status

Shame on me, I had to search for how to show the ports quickly here:

watch "netstat -ntup | grep -i syn_sent | awk '{print \$4,\$5,\$6,\$7}'"

It was slow because it was trying to resolve all IPs to hostnames... :facepalm: That is fixed with “-n” (no-resolve).

Anyway, with the command above, I finally managed to see the processes that were in “SYN_SENT” state.

This is not the real thing, just an example:

#  netstat -ntup | grep -i syn_sent 
tcp        0      1 192.168.1.203:35460     4.4.4.4:23              SYN_SENT    98690/telnet        
# 
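
The grep/awk filtering above can also be sketched in Python, which is handy if you want to log the offending processes over time. This is only an illustration: `syn_sent_sockets` is a made-up helper, the first sample line is the example shown above and the second one is invented to show a socket that gets filtered out.

```python
SAMPLE = """\
tcp        0      1 192.168.1.203:35460     4.4.4.4:23              SYN_SENT    98690/telnet
tcp        0      0 192.168.1.203:22        192.168.1.10:51234      ESTABLISHED 1201/sshd
"""


def syn_sent_sockets(netstat_output):
    """Yield (local, remote, pid, program) for TCP sockets in SYN_SENT."""
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: proto recv-q send-q local remote state pid/program
        if len(fields) < 7 or fields[0] not in ("tcp", "tcp6"):
            continue
        if fields[5] != "SYN_SENT":
            continue
        pid, _, program = fields[6].partition("/")
        yield fields[3], fields[4], pid, program


for entry in syn_sent_sockets(SAMPLE):
    print(entry)
```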

We could see that the destination port was TCP 179, so something in the node was trying to talk BGP! They were “bird” processes. As the node belonged to a kubernetes cluster, we could see a calico container acting as CNI. Then we connected to the container and checked the bird config. We could clearly see that the IPs that didn't get an ARP reply were configured there.

So in summary, basic TCP:

In short: TCP is L4, which goes down to L3 (IP). To get to L2, you need to know the MAC address behind the IP, and that triggers the ARP request. Once the MAC is learned, it is cached for the next request. For that reason, the first time you make a connection it is slow (ping, traceroute, etc).

Now we need to work out why the calico/bird config is that way, fix it to only use the IPs of real BGP speakers, and then verify that the ARP storms stop.

Hopefully, I will learn a bit about calico.

Notes for UDP:

If I generate a UDP connection to a non-existent IP

$ nc -u 4.4.4.4 4000

netstat tells me the UDP connection is established. For an external IP I can’t see anything in the ARP table, while for an internal IP (in my own network) I can see an incomplete entry. Why?

#  netstat -ntup | grep -i 4.4.4.4
udp        0      0 192.168.1.203:42653     4.4.4.4:4000            ESTABLISHED 102014/nc           
# 
#  netstat -ntup | grep -i '192.168.1.2:'
udp        0      0 192.168.1.203:44576     192.168.1.2:4000        ESTABLISHED 102369/nc           
# 
#
# arp -a
? (192.168.1.2) at <incomplete> on wlp2s0
something.mynet (192.168.1.1) at xx:xx:xx:yy:yy:zz [ether] on wlp2s0
# 

# tcpdump -i wlp2s0 host 4.4.4.4
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on wlp2s0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
23:35:45.081819 IP 192.168.1.203.50186 > 4.4.4.4.4000: UDP, length 1
23:35:45.081850 IP 192.168.1.203.50186 > 4.4.4.4.4000: UDP, length 1
23:35:46.082075 IP 192.168.1.203.50186 > 4.4.4.4.4000: UDP, length 1
23:35:47.082294 IP 192.168.1.203.50186 > 4.4.4.4.4000: UDP, length 1
23:35:48.082504 IP 192.168.1.203.50186 > 4.4.4.4.4000: UDP, length 1
^C
5 packets captured
5 packets received by filter
0 packets dropped by kernel
# 
  • UDP is stateless, so there are no connection states to track and netstat just shows it as “ESTABLISHED”. Basic TCP/UDP.
  • When opening a UDP connection to an external IP, the traffic has to be routed: my laptop knows it needs to send the packets to the default gateway, so at L2 the destination MAC address is not that of 4.4.4.4 but the default gateway's MAC. BASIC ROUTING!!! For that reason you don't see 4.4.4.4 in the ARP table.
  • When opening a UDP connection to a local IP, my laptop knows the destination is in the same network, so it tries to find the destination MAC address using ARP (hence the incomplete entry when nothing answers).
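
The on-link versus off-link decision from the bullets above can be sketched with the `ipaddress` module. The helper `arp_target` is made up for this example, and the /24 mask is an assumption (the outputs above only show the addresses, not the prefix length).

```python
import ipaddress


def arp_target(destination, interface_network, default_gateway):
    """Return the IP address the host has to resolve with ARP to reach `destination`."""
    dest = ipaddress.ip_address(destination)
    network = ipaddress.ip_network(interface_network)
    # On-link destination: ARP for the destination itself.
    # Off-link destination: ARP for the next hop (here, the default gateway).
    return str(dest) if dest in network else default_gateway


print(arp_target("192.168.1.2", "192.168.1.0/24", "192.168.1.1"))  # 192.168.1.2
print(arp_target("4.4.4.4", "192.168.1.0/24", "192.168.1.1"))      # 192.168.1.1
```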

TCP Asymmetric

Recently an issue was escalated to me that had caused several outages and needed an urgent fix.

For different reasons, we had asymmetric routing in SITE-A. The normal flow is the green arrow. During the asymmetric routing, the flow is the red line. Routing wise, things should work. BUT, we have firewalls in the path. The firewalls were configured to allow asymmetric connections (I was told). As far as I could see in the config and logs, nothing was dropped in the firewalls during the issue.

So first thing, I fixed the asymmetric routing so it didn't happen again. It took me a while to come up with the solution (and it was quite simple), as I had to properly understand the routing before and during the issue. The diagram is quite simplified at the end of the day.

During the maintenance window when I applied the fix for the asymmetric routing, I managed to take some traces in the firewalls, as I was trying to understand where the traffic was dropped/lost during the asymmetric scenario. I was not very familiar with several parts of the network and the monitoring, and I didn't know which links were already tapped. Once I was happy with the routing fix, I took a look at the traces. At a high level, I could see the return traffic leaving FW1 and leaving DC1-SW1. Based on that, I started to think that the firewalls were fine...

In another maintenance window, I took more logs in different parts of the network and I could clearly see the traffic reaching A-SW1. As I ran out of time and had missed tapping some links, I couldn't carry on.

So based on the second maintenance window, the issue had to be inside SITE-A. Somehow it didn't make sense. I checked that I didn't have uRPF enabled. The rest was pure L2, so it couldn't see the L3...

So in the third maintenance window, I got all my debugging tools ready to verify whether any network kit was dropping the traffic in SITE-A... and it was useless. I realized that I could run tcpdump on the client IP1 I was using for testing, and I could see some return traffic!!!

I was just shocked. I didn't get it. It didn't make sense.

Somehow, I reviewed the tcp captures I was doing in each interface of both firewalls. I was trying to get to basics.

I was assuming the TCP handshake had completed properly. After paying a bit of attention to the client logs, I could see the TCP handshake completed. And I could see the HTTP GET reaching and leaving DC2-FW... so why was the server IP2 not answering!?

So back to the TCP handshake and the firewall captures, comparing step by step. Somehow, I had missed that the TCP ACK from client IP1 was reaching DC2-FW... but it was not leaving DC2-FW!!! Even worse, the HTTP GET was actually crossing DC2-FW!!!

SLAP IN THE FACE!!!

This is the TCP handshake. This is networking 101…..

The TCP state-machine in client and server during the asymmetric scenario

So I was assuming that because the client was sending the HTTP GET, the TCP handshake was completed at both ends!!!

It didn't make sense why I was seeing TCP SYN-ACK retransmissions from the server IP2... BECAUSE the TCP ACK from client IP1 never reached it.

For that reason server IP2 never answered the HTTP GET: from its end, the TCP handshake was not completed.

I banged my head several times on the table. I “saw” this during the first maintenance window when I took the tcpdump in the firewalls, BUT I didn't pay attention to the basic details.

I relied too much on seeing a wireshark trace because it is more visual and shows more info, but the clues were all the time in the tcpdump from the firewalls that I didn't bother to pay full attention to.

At least, I found out where and why the connections failed during the asymmetric routing scenario. A firewall upgrade did the job.

So all fixed.

Lessons learned:

  • without a proper foundation, you can’t build knowledge (tcp handshake state in client and server)
  • when things don't make sense, get back to basics (tcp handshake)
  • get the most out of the tools at hand (tcpdump – PSH packets were the HTTP GET!!!)
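
The step-by-step comparison of the firewall captures can be sketched as a tiny script: list what enters the firewall, list what leaves it, and print what is missing. Everything here is hypothetical (the helper `lost_in_device` and the hand-written packet summaries); real traces would also carry sequence and acknowledgement numbers.

```python
def lost_in_device(ingress, egress):
    """Return the packets seen entering a device but never seen leaving it."""
    remaining = list(egress)
    lost = []
    for packet in ingress:
        if packet in remaining:
            remaining.remove(packet)  # match each egress packet only once
        else:
            lost.append(packet)
    return lost


# Hypothetical summaries (direction, TCP flags) read from two tcpdump captures,
# one per firewall interface.
fw_in = [("client->server", "SYN"), ("client->server", "ACK"), ("client->server", "PSH,ACK")]
fw_out = [("client->server", "SYN"), ("client->server", "PSH,ACK")]

print(lost_in_device(fw_in, fw_out))  # [('client->server', 'ACK')]
```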

Smallest Audience – TCPLS – ByPass CDN WAF – Packet Generator

A bit of mix of things:

Smallest (viable) audience: Specificity is the way

TCPLS: I know about QUIC (just the big picture) but this TCP+TLS implementation looks interesting. Although I am not sure if their test is that meaningful. A more “real” life example would be ideal (packet loss, jitter, etc)

ByPass CDN: I am not well versed in Cloud services but this looks like an interesting article about CDN and WAF from a security perspective. It is the typical example of thinking outside the box: why can’t the attacker be a “customer” of the CDN too?

Packet Generator – BNG Blaster: I knew about TReX but never had the chance to use it, and I know how expensive the commercial solutions are (shocking!), so this looks like a nice tool.

BiDir + SR4

This week at work I had to deal with a couple of physical connections that were new to me. I am more familiar with LC connectors, and MTP is still something I am getting used to. So I learned about BiDir optics (LC), something I had never heard of, and how they differ from SR4 optics (MTP). The advantage of using BiDir is that you can keep your current fiber install in place and upgrade from 10G to 40G without extra cabling cost. With SR4 you need to deploy MTP/MPO cabling, although you can do breakouts (40G to 4x10G or 100G to 4x25G). This is the blog entry I found quite useful for comparing both types. And this one just for BiDir.

As well, somewhere I read (can’t find the link) that SR4 is more common than BiDir because BiDir is proprietary, so it has compatibility issues.

This is a good link about cables and connectors.

SSD + Breakout Cables

I am not as up to speed with storage as I would like to be, but this blog gives you a quick intro to the different types of SSD.

As well, this week I struggled a bit with a 100G port that had to connect as 4x25G to some servers. It had been a while since I dealt with something similar. Initially I thought I had to wait for the servers to power up, so that once the link on the other side came up, the switch would recognize the port as 25G and create the necessary interfaces. But I had eth1/1 only. I was expecting eth1/1, eth1/2, eth1/3 and eth1/4... The three extra interfaces were created ONLY after I “forced” the speed setting to “25G” in eth1/1. I hope I remember this for next time.

Containerlab

Some months ago I read about containerlab (and here). It looked like a very simple way to build multi-vendor labs quickly and easily. I have used gns3 and docker-topo for my labs in the past, but I liked the documentation and the idea of mixing cEOS with FRR images.

As I have grown more comfortable with Arista over the years and I had some images on my laptop, I installed the software (no major issues following the instructions for Debian) and tried the example for a cEOS lab.

It didn't work. The containers started but I didn't get to the Arista CLI, just a bash CLI, and I couldn't see anything running on them... I remembered some Arista-specific processes but none was there. In the following weeks, I tried newer cEOS images but no luck, always stuck at the same point. In the end, I never had enough time (or put in the effort and interest) to troubleshoot the problem properly.

For too many months, I haven't had the chance (I could write a post with excuses) to do much tech self-learning (I could write a book of all the things I would like to learn); it was easier cooking or reading.

But finally, this week, talking with a colleague at work, he mentioned containerlab was great and that he used it. I commented that I had tried and failed. With that, I finally found a bit of interest and time today to give it another go.

Firstly, I made sure I was running the latest containerlab version and that my cEOS was recent enough (4.26.0F), and then went back to basics: check T-H-E logs!

So one thing I noticed after paying attention to the startup logs was a warning about lack of memory in my laptop. So I closed several applications and tried again. My lab looked stuck at the same point:

go:1.16.3|py:3.7.3|tomas@athens:~/storage/technology/containerlabs/ceos$ sudo containerlab deploy --topo ceos-lab1.yaml 
INFO[0000] Parsing & checking topology file: ceos-lab1.yaml 
INFO[0000] Creating lab directory: /home/tomas/storage/technology/containerlabs/ceos/clab-ceos 
INFO[0000] Creating docker network: Name='clab', IPv4Subnet='172.20.20.0/24', IPv6Subnet='2001:172:20:20::/64', MTU='1500' 
INFO[0000] config file '/home/tomas/storage/technology/containerlabs/ceos/clab-ceos/ceos1/flash/startup-config' for node 'ceos1' already exists and will not be generated/reset 
INFO[0000] Creating container: ceos1                    
INFO[0000] config file '/home/tomas/storage/technology/containerlabs/ceos/clab-ceos/ceos2/flash/startup-config' for node 'ceos2' already exists and will not be generated/reset 
INFO[0000] Creating container: ceos2                    
INFO[0003] Creating virtual wire: ceos1:eth1 <--> ceos2:eth1 
INFO[0003] Running postdeploy actions for Arista cEOS 'ceos2' node 
INFO[0003] Running postdeploy actions for Arista cEOS 'ceos1' node 

I did a bit of searching about containerlab and ceos. For example, I could see this blog where the author successfully started up a lab with cEOS, and I could see his logs!

So it was clear, my containers were stuck. So I searched for that message “Running postdeploy actions for Arista cEOS”.

I didn't see anything promising, just links back to the main containerlab ceos page. I read it again and noticed something at the bottom of the page regarding a known issue... So I checked if that applied to me (although I doubted it, as it looked like it was for CentOS...) and indeed it applied to me too!

$ docker logs clab-ceos-ceos2
Failed to mount cgroup at /sys/fs/cgroup/systemd: Operation not permitted

So I started to look for info about what cgroup is: link1, link2

First I wanted to check what cgroup version I was running. With this link, I could see that based on my kernel version, I should have cgroup2:

$ grep cgroup /proc/filesystems
nodev	cgroup
nodev	cgroup2

$ ls /sys/fs/cgroup/memory/
cgroup.clone_children           memory.kmem.tcp.limit_in_bytes      memory.stat
cgroup.event_control            memory.kmem.tcp.max_usage_in_bytes  memory.swappiness
cgroup.procs                    memory.kmem.tcp.usage_in_bytes      memory.usage_in_bytes
cgroup.sane_behavior            memory.kmem.usage_in_bytes          memory.use_hierarchy
dev-hugepages.mount             memory.limit_in_bytes               notify_on_release
dev-mqueue.mount                memory.max_usage_in_bytes           proc-fs-nfsd.mount
docker                          memory.memsw.failcnt                proc-sys-fs-binfmt_misc.mount
machine.slice                   memory.memsw.limit_in_bytes         release_agent
memory.failcnt                  memory.memsw.max_usage_in_bytes     sys-fs-fuse-connections.mount
memory.force_empty              memory.memsw.usage_in_bytes         sys-kernel-config.mount
memory.kmem.failcnt             memory.move_charge_at_immigrate     sys-kernel-debug.mount
memory.kmem.limit_in_bytes      memory.numa_stat                    sys-kernel-tracing.mount
memory.kmem.max_usage_in_bytes  memory.oom_control                  system.slice
memory.kmem.slabinfo            memory.pressure_level               tasks
memory.kmem.tcp.failcnt         memory.soft_limit_in_bytes          user.slice

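As I had “cgroup.*” in my “/sys/fs/cgroup/memory”, I took it as confirmation that I was running cgroup2 (strictly speaking, a per-controller directory such as this one is part of the cgroup v1 layout, so this listing alone does not prove it).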
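
A more direct check, as a rough sketch: the cgroup v2 (unified) root exposes a `cgroup.controllers` file, while v1 has one directory per controller. The helper `cgroup_mode` and its return strings are made up for this example, and the "hybrid" branch assumes systemd's layout where the v2 tree is mounted under `unified`.

```python
import os


def cgroup_mode(root="/sys/fs/cgroup"):
    """Guess the cgroup layout mounted at `root`."""
    # The cgroup v2 (unified) root exposes a cgroup.controllers file.
    if os.path.exists(os.path.join(root, "cgroup.controllers")):
        return "v2 (unified)"
    # Assumed systemd hybrid layout: v2 mounted at <root>/unified next to v1 controllers.
    if os.path.exists(os.path.join(root, "unified", "cgroup.controllers")):
        return "hybrid (v1 controllers + v2 tree)"
    # Per-controller directories such as <root>/memory are the v1 layout.
    if os.path.isdir(os.path.join(root, "memory")):
        return "v1 (legacy)"
    return "unknown"


if __name__ == "__main__":
    print(cgroup_mode())
```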

So how could I change to cgroup1 for docker only?

It seems that I couldn't change that for a single application because it is a parameter that you pass to the kernel at boot time.

I learned that there is something called podman to replace docker in this blog.

So at the end, searching how to change cgroup in Debian, I used this link:

$ cat /etc/default/grub
...
# systemd.unified_cgroup_hierarchy=0 enables cgroupv1 that is needed for containerlabs to run ceos.... 
# https://github.com/srl-labs/containerlab/issues/467
# https://mbien.dev/blog/entry/java-in-rootless-containers-with
GRUB_CMDLINE_LINUX_DEFAULT="quiet systemd.unified_cgroup_hierarchy=0"
....
$ sudo grub-mkconfig -o /boot/grub/grub.cfg
....
$ sudo reboot

Good thing that the laptop rebooted fine! That was a relief 🙂

Then I checked if the change made any difference. It failed, but because containerlab couldn't connect to docker... somehow docker had died. I restarted docker and tried containerlab again...

$ sudo containerlab deploy --topo ceos-lab1.yaml 
INFO[0000] Parsing & checking topology file: ceos-lab1.yaml 
INFO[0000] Creating lab directory: /home/xxx/storage/technology/containerlabs/ceos/clab-ceos 
INFO[0000] Creating docker network: Name='clab', IPv4Subnet='172.20.20.0/24', IPv6Subnet='2001:172:20:20::/64', MTU='1500' 
INFO[0000] config file '/home/xxx/storage/technology/containerlabs/ceos/clab-ceos/ceos1/flash/startup-config' for node 'ceos1' already exists and will not be generated/reset 
INFO[0000] Creating container: ceos1                    
INFO[0000] config file '/home/xxx/storage/technology/containerlabs/ceos/clab-ceos/ceos2/flash/startup-config' for node 'ceos2' already exists and will not be generated/reset 
INFO[0000] Creating container: ceos2                    
INFO[0003] Creating virtual wire: ceos1:eth1 <--> ceos2:eth1 
INFO[0003] Running postdeploy actions for Arista cEOS 'ceos2' node 
INFO[0003] Running postdeploy actions for Arista cEOS 'ceos1' node 
INFO[0145] Adding containerlab host entries to /etc/hosts file 
+---+-----------------+--------------+------------------+------+-------+---------+----------------+----------------------+
| # |      Name       | Container ID |      Image       | Kind | Group |  State  |  IPv4 Address  |     IPv6 Address     |
+---+-----------------+--------------+------------------+------+-------+---------+----------------+----------------------+
| 1 | clab-ceos-ceos1 | 2807cd2f689f | ceos-lab:4.26.0F | ceos |       | running | 172.20.20.2/24 | 2001:172:20:20::2/64 |
| 2 | clab-ceos-ceos2 | e5d2aa4578b5 | ceos-lab:4.26.0F | ceos |       | running | 172.20.20.3/24 | 2001:172:20:20::3/64 |
+---+-----------------+--------------+------------------+------+-------+---------+----------------+----------------------+
$ sudo clab graph -t ceos-lab1.yaml 
INFO[0000] Parsing & checking topology file: ceos-lab1.yaml 
INFO[0000] Listening on :50080...       

After a bit, it seems it worked! And I learned about an option to show a graph of your topology with “graph”.

I checked the ceos container logs

$ docker logs clab-ceos-ceos1
....
[  OK  ] Started SYSV: Eos system init scrip...uns after POST, before ProcMgr).
         Starting Power-On Self Test...
         Starting EOS Warmup Service...
[  OK  ] Started Power-On Self Test.
[  OK  ] Reached target EOS regular mode.
[  OK  ] Started EOS Diagnostic Mode.
[     *] A start job is running for EOS Warmup Service (2min 9s / no limit)Reloading.
$ 
$ docker exec -it clab-ceos-ceos1 Cli
ceos1>
ceos1>enable 
ceos1#show version 
 cEOSLab
Hardware version: 
Serial number: 
Hardware MAC address: 001c.7389.2099
System MAC address: 001c.7389.2099

Software image version: 4.26.0F-21792469.4260F (engineering build)
Architecture: i686
Internal build version: 4.26.0F-21792469.4260F
Internal build ID: c5b41f65-54cd-44b1-b576-b5c48700ee19

cEOS tools version: 1.1
Kernel version: 5.10.0-8-amd64

Uptime: 0 minutes
Total memory: 8049260 kB
Free memory: 2469328 kB

ceos1#
ceos1#show interfaces description 
Interface                      Status         Protocol           Description
Et1                            up             up                 
Ma0                            up             up                 
ceos1#show running-config interfaces ethernet 1
interface Ethernet1
ceos1#

Yes! Finally working!

So now I don't have excuses not to keep learning new things!

BTW, these are the different versions I am using at the moment:

$ uname -a
Linux athens 5.10.0-8-amd64 #1 SMP Debian 5.10.46-4 (2021-08-03) x86_64 GNU/Linux 

$ docker -v
Docker version 20.10.5+dfsg1, build 55c4c88

$ containerlab version

                           _                   _       _     
                 _        (_)                 | |     | |    
 ____ ___  ____ | |_  ____ _ ____   ____  ____| | ____| | _  
/ ___) _ \|  _ \|  _)/ _  | |  _ \ / _  )/ ___) |/ _  | || \ 
( (__| |_|| | | | |_( ( | | | | | ( (/ /| |   | ( ( | | |_) )
\____)___/|_| |_|\___)_||_|_|_| |_|\____)_|   |_|\_||_|____/ 

    version: 0.17.0
     commit: eba1b82
       date: 2021-08-25T09:31:53Z
     source: https://github.com/srl-labs/containerlab
 rel. notes: https://containerlab.srlinux.dev/rn/0.17/

My concern is: how will this cgroup1 setting affect other applications like kubernetes?

BTW, I have the same issue with containerlab as with docker-topo: when I use “Alt+Home(left arrow)” my laptop leaves X-Windows and drops to the tty!!!

BGP AIGP and BGP churn

I have read about BGP churn before: I look up the meaning and then forget about it until next time.

BGP churn = the rate of routing updates that BGP routers must process.

And I can’t really find a straight definition.

As well, I didn't know what the actual meaning of churn was. It is a machine to produce butter, and also the amount of customers who stop using a product.

I read at work something about AIGP from a different department. I searched and found that it is a new BGP optional non-transitive path attribute, and that the BGP decision process is updated for it. And this is from 2015!!! So I am 6 years behind... And there is an RFC: RFC 7311.

Note: BGP routers that do not support the optional non-transitive attributes (e.g. AIGP) must delete such attributes and must not pass them to other BGP peers.

So it seems a good feature if you run a big AS and want BGP to take into account the IGP cost.

Linux+MPLS-Part4

Finally I am trying to setup MPLS L3VPN.

Again, I am following the author's post but adapting it to my environment, using libvirt instead of VirtualBox and Debian10 as the VM. All my data is here.

This is the diagram for the lab:

Difference from lab3 and lab2: we have P1, which is a pure P router, only handling labels; it doesn't do any BGP.

This time the FRR config of all devices is generated automatically via gen_frr_config.py (in lab2 all config was manual).

Again, the environment is configured via Vagrant file + l3vpn_provisioning script. This is a mix of lab2 (install FRR), lab3 (define VRFs) and lab1 (configure MPLS at linux level).

So after some tuning, everything is installed and routing looks correct (although, I don't know why, I have to reload FRR to get the proper generated BGP config in PE1 and PE2. P1 is fine).

So let’s see PE1:

IGP (IS-IS) is up:

PE1# show isis neighbor 
 Area ISIS:
   System Id           Interface   L  State        Holdtime SNPA
   P1                  ens8        2  Up            30       2020.2020.2020
 PE1# 
 PE1# exit
 root@PE1:/home/vagrant# 

BGP is up to PE2 and we can see routes received in AF IPv4VPN:

PE1# 
 PE1# show bgp summary 
 IPv4 Unicast Summary:
 BGP router identifier 172.20.5.1, local AS number 65010 vrf-id 0
 BGP table version 0
 RIB entries 0, using 0 bytes of memory
 Peers 1, using 21 KiB of memory
 Neighbor        V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt
 172.20.5.2      4      65010       111       105        0    0    0 01:39:14            0        0
 Total number of neighbors 1
 IPv4 VPN Summary:
 BGP router identifier 172.20.5.1, local AS number 65010 vrf-id 0
 BGP table version 0
 RIB entries 11, using 2112 bytes of memory
 Peers 1, using 21 KiB of memory
 Neighbor        V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt
 172.20.5.2      4      65010       111       105        0    0    0 01:39:14            2        2
 Total number of neighbors 1
 PE1# 

Check routing tables, we can see prefixes in both VRFs, so that’s good. And the labels needed.

PE1# show ip route vrf all 
 Codes: K - kernel route, C - connected, S - static, R - RIP,
        O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
        T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
        F - PBR, f - OpenFabric,
        > - selected route, * - FIB route, q - queued, r - rejected, b - backup
 VRF default:
 C>* 172.20.5.1/32 is directly connected, lo, 02:19:16
 I>* 172.20.5.2/32 [115/30] via 192.168.66.102, ens8, label 17, weight 1, 02:16:10
 I>* 172.20.5.5/32 [115/20] via 192.168.66.102, ens8, label implicit-null, weight 1, 02:18:34
 I   192.168.66.0/24 [115/20] via 192.168.66.102, ens8 inactive, weight 1, 02:18:34
 C>* 192.168.66.0/24 is directly connected, ens8, 02:19:16
 I>* 192.168.77.0/24 [115/20] via 192.168.66.102, ens8, label implicit-null, weight 1, 02:18:34
 C>* 192.168.121.0/24 is directly connected, ens5, 02:19:16
 K>* 192.168.121.1/32 [0/1024] is directly connected, ens5, 02:19:16
 VRF vrf_cust1:
 C>* 192.168.11.0/24 is directly connected, ens6, 02:19:05
 B>  192.168.23.0/24 [200/0] via 172.20.5.2 (vrf default) (recursive), label 80, weight 1, 02:13:32
 via 192.168.66.102, ens8 (vrf default), label 17/80, weight 1, 02:13:32 
 VRF vrf_cust2:
 C>* 192.168.12.0/24 is directly connected, ens7, 02:19:05
 B>  192.168.24.0/24 [200/0] via 172.20.5.2 (vrf default) (recursive), label 81, weight 1, 02:13:32
 via 192.168.66.102, ens8 (vrf default), label 17/81, weight 1, 02:13:32
 PE1#  

Now check LDP and MPLS labels. Everything looks sane. We have local LDP labels for P1 (16), for the P1-PE2 link 192.168.77.0/24 (17) and for PE2 (18), plus a BGP label for each VRF (80 and 81).

PE1# show mpls table 
  Inbound Label  Type  Nexthop         Outbound Label
  16             LDP   192.168.66.102  implicit-null
  17             LDP   192.168.66.102  implicit-null
  18             LDP   192.168.66.102  17
  80             BGP   vrf_cust1       -
  81             BGP   vrf_cust2       -
 PE1# 
 PE1# show mpls ldp neighbor 
 AF   ID              State       Remote Address    Uptime
 ipv4 172.20.5.5      OPERATIONAL 172.20.5.5      02:20:20
 PE1# 
 PE1# 
 PE1# show mpls ldp binding  
 AF   Destination          Nexthop         Local Label Remote Label  In Use
 ipv4 172.20.5.1/32        172.20.5.5      imp-null    16                no
 ipv4 172.20.5.2/32        172.20.5.5      18          17               yes
 ipv4 172.20.5.5/32        172.20.5.5      16          imp-null         yes
 ipv4 192.168.11.0/24      0.0.0.0         imp-null    -                 no
 ipv4 192.168.12.0/24      0.0.0.0         imp-null    -                 no
 ipv4 192.168.66.0/24      172.20.5.5      imp-null    imp-null          no
 ipv4 192.168.77.0/24      172.20.5.5      17          imp-null         yes
 ipv4 192.168.121.0/24     172.20.5.5      imp-null    imp-null          no
 PE1# 
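
To read the binding table more easily, the rows can be turned into a small lookup keyed by prefix. This is only a sketch: the rows are copied from the output above, while parse_bindings and the tuple layout are hypothetical helpers made up for these notes, not part of FRR.

```python
# Sample rows copied from the "show mpls ldp binding" output on PE1.
BINDINGS = """\
ipv4 172.20.5.1/32        172.20.5.5      imp-null    16                no
ipv4 172.20.5.2/32        172.20.5.5      18          17               yes
ipv4 172.20.5.5/32        172.20.5.5      16          imp-null         yes
ipv4 192.168.77.0/24      172.20.5.5      17          imp-null         yes
"""

def parse_bindings(text):
    """Return {prefix: (local_label, remote_label, in_use)} from binding rows."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6 or fields[0] != "ipv4":
            continue  # skip headers, prompts and blank lines
        _af, prefix, _nexthop, local, remote, in_use = fields
        table[prefix] = (local, remote, in_use == "yes")
    return table

bindings = parse_bindings(BINDINGS)
for prefix, (local, remote, in_use) in bindings.items():
    print(f"{prefix:18} local={local:9} remote={remote:9} in_use={in_use}")
```

The entry for 172.20.5.2/32 (PE2) is the interesting one: local label 18, remote label 17, which matches the 18 -> 17 swap in the MPLS table.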

PE2 shows a similar view.

On P1, our P router, we only care about LDP and IS-IS:

P1# 
 P1# show mpls table 
 Inbound Label  Type  Nexthop         Outbound Label
 16             LDP   192.168.66.101  implicit-null
 17             LDP   192.168.77.101  implicit-null
 P1# show mpls ldp neighbor 
 AF   ID              State       Remote Address    Uptime
 ipv4 172.20.5.1      OPERATIONAL 172.20.5.1      02:23:55
 ipv4 172.20.5.2      OPERATIONAL 172.20.5.2      02:21:01
 P1# 
 P1# show isis neighbor 
 Area ISIS:
   System Id           Interface   L  State        Holdtime SNPA
   PE1                 ens6        2  Up            28       2020.2020.2020
   PE2                 ens7        2  Up            29       2020.2020.2020
 P1# 
 P1# show ip route
 Codes: K - kernel route, C - connected, S - static, R - RIP,
        O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
        T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
        F - PBR, f - OpenFabric,
        > - selected route, * - FIB route, q - queued, r - rejected, b - backup
 K>* 0.0.0.0/0 [0/1024] via 192.168.121.1, ens5, src 192.168.121.253, 02:24:45
 I>* 172.20.5.1/32 [115/20] via 192.168.66.101, ens6, label implicit-null, weight 1, 02:24:04
 I>* 172.20.5.2/32 [115/20] via 192.168.77.101, ens7, label implicit-null, weight 1, 02:21:39
 C>* 172.20.5.5/32 is directly connected, lo, 02:24:45
 I   192.168.66.0/24 [115/20] via 192.168.66.101, ens6 inactive, weight 1, 02:24:04
 C>* 192.168.66.0/24 is directly connected, ens6, 02:24:45
 I   192.168.77.0/24 [115/20] via 192.168.77.101, ens7 inactive, weight 1, 02:21:39
 C>* 192.168.77.0/24 is directly connected, ens7, 02:24:45
 C>* 192.168.121.0/24 is directly connected, ens5, 02:24:45
 K>* 192.168.121.1/32 [0/1024] is directly connected, ens5, 02:24:45
 P1# 

So, as usual, let’s test connectivity. I will ping from CE1 (connected to PE1) to CE3 (connected to PE2); both belong to the same VRF, vrf_cust1.

First of all, I had to modify iptables on my host to avoid unnecessary NAT (iptables masquerade) between CE1 and CE3.

# iptables -t nat -vnL LIBVIRT_PRT --line-numbers
 Chain LIBVIRT_PRT (1 references)
 num   pkts bytes target     prot opt in     out     source               destination         
 1       15  1451 RETURN     all  --  *      *       192.168.77.0/24      224.0.0.0/24        
 2        0     0 RETURN     all  --  *      *       192.168.77.0/24      255.255.255.255     
 3        0     0 MASQUERADE  tcp  --  *      *       192.168.77.0/24     !192.168.77.0/24      masq ports: 1024-65535
 4       18  3476 MASQUERADE  udp  --  *      *       192.168.77.0/24     !192.168.77.0/24      masq ports: 1024-65535
 5        0     0 MASQUERADE  all  --  *      *       192.168.77.0/24     !192.168.77.0/24     
 6       13  1754 RETURN     all  --  *      *       192.168.122.0/24     224.0.0.0/24        
 7        0     0 RETURN     all  --  *      *       192.168.122.0/24     255.255.255.255     
 8        0     0 MASQUERADE  tcp  --  *      *       192.168.122.0/24    !192.168.122.0/24     masq ports: 1024-65535
 9        0     0 MASQUERADE  udp  --  *      *       192.168.122.0/24    !192.168.122.0/24     masq ports: 1024-65535
 10       0     0 MASQUERADE  all  --  *      *       192.168.122.0/24    !192.168.122.0/24    
 11      24  2301 RETURN     all  --  *      *       192.168.11.0/24      224.0.0.0/24        
 12       0     0 RETURN     all  --  *      *       192.168.11.0/24      255.255.255.255     
 13       0     0 MASQUERADE  tcp  --  *      *       192.168.11.0/24     !192.168.11.0/24      masq ports: 1024-65535
 14      23  4476 MASQUERADE  udp  --  *      *       192.168.11.0/24     !192.168.11.0/24      masq ports: 1024-65535
 15       1    84 MASQUERADE  all  --  *      *       192.168.11.0/24     !192.168.11.0/24     
 16      29  2541 RETURN     all  --  *      *       192.168.121.0/24     224.0.0.0/24        
 17       0     0 RETURN     all  --  *      *       192.168.121.0/24     255.255.255.255     
 18      36  2160 MASQUERADE  tcp  --  *      *       192.168.121.0/24    !192.168.121.0/24     masq ports: 1024-65535
 19      65  7792 MASQUERADE  udp  --  *      *       192.168.121.0/24    !192.168.121.0/24     masq ports: 1024-65535
 20       0     0 MASQUERADE  all  --  *      *       192.168.121.0/24    !192.168.121.0/24    
 21      20  2119 RETURN     all  --  *      *       192.168.24.0/24      224.0.0.0/24        
 22       0     0 RETURN     all  --  *      *       192.168.24.0/24      255.255.255.255     
 23       0     0 MASQUERADE  tcp  --  *      *       192.168.24.0/24     !192.168.24.0/24      masq ports: 1024-65535
 24      21  4076 MASQUERADE  udp  --  *      *       192.168.24.0/24     !192.168.24.0/24      masq ports: 1024-65535
 25       0     0 MASQUERADE  all  --  *      *       192.168.24.0/24     !192.168.24.0/24     
 26      20  2119 RETURN     all  --  *      *       192.168.23.0/24      224.0.0.0/24        
 27       0     0 RETURN     all  --  *      *       192.168.23.0/24      255.255.255.255     
 28       1    60 MASQUERADE  tcp  --  *      *       192.168.23.0/24     !192.168.23.0/24      masq ports: 1024-65535
 29      20  3876 MASQUERADE  udp  --  *      *       192.168.23.0/24     !192.168.23.0/24      masq ports: 1024-65535
 30       1    84 MASQUERADE  all  --  *      *       192.168.23.0/24     !192.168.23.0/24     
 31      25  2389 RETURN     all  --  *      *       192.168.66.0/24      224.0.0.0/24        
 32       0     0 RETURN     all  --  *      *       192.168.66.0/24      255.255.255.255     
 33       0     0 MASQUERADE  tcp  --  *      *       192.168.66.0/24     !192.168.66.0/24      masq ports: 1024-65535
 34      23  4476 MASQUERADE  udp  --  *      *       192.168.66.0/24     !192.168.66.0/24      masq ports: 1024-65535
 35       0     0 MASQUERADE  all  --  *      *       192.168.66.0/24     !192.168.66.0/24     
 36      24  2298 RETURN     all  --  *      *       192.168.12.0/24      224.0.0.0/24        
 37       0     0 RETURN     all  --  *      *       192.168.12.0/24      255.255.255.255     
 38       0     0 MASQUERADE  tcp  --  *      *       192.168.12.0/24     !192.168.12.0/24      masq ports: 1024-65535
 39      23  4476 MASQUERADE  udp  --  *      *       192.168.12.0/24     !192.168.12.0/24      masq ports: 1024-65535
 40       0     0 MASQUERADE  all  --  *      *       192.168.12.0/24     !192.168.12.0/24     
#


# iptables -t nat -I LIBVIRT_PRT 13 -s 192.168.11.0/24 -d 192.168.23.0/24 -j RETURN
# iptables -t nat -I LIBVIRT_PRT 29 -s 192.168.23.0/24 -d 192.168.11.0/24 -j RETURN
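
The second rule number is 29 rather than 28 because "-I CHAIN N" makes the new rule number N and pushes the existing ones down, so the first insert shifts every later rule by one. A small sketch of that arithmetic with a plain Python list follows; the "orig-N" names are placeholders for the 40 rules listed above, and insert_rule is a made-up helper, not an iptables binding.

```python
def insert_rule(chain, position, rule):
    """Mimic `iptables -I CHAIN N`: the new rule becomes rule N (1-based)."""
    chain.insert(position - 1, rule)

# Placeholder names for the 40 rules listed above; only their order matters.
chain = [f"orig-{n}" for n in range(1, 41)]

insert_rule(chain, 13, "RETURN 192.168.11.0/24 -> 192.168.23.0/24")
# Every original rule from 13 onwards has moved down by one position.
shifted = chain.index("orig-28") + 1

insert_rule(chain, 29, "RETURN 192.168.23.0/24 -> 192.168.11.0/24")
print(shifted)                    # position of original rule 28 after the first insert
print(chain[12], "|", chain[28])  # the two new RETURN rules
print(chain[13], "|", chain[29])  # the original rules they now precede
```

Original rules 13 and 28 are the first MASQUERADE entries for 192.168.11.0/24 and 192.168.23.0/24, so each RETURN rule lands just before the masquerading for its source network.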

OK, start pinging from CE1 to CE3:

vagrant@CE1:~$ ping 192.168.23.102
 PING 192.168.23.102 (192.168.23.102) 56(84) bytes of data.

No good. Let’s check what the next hop, PE1, is doing. It seems to be sending the traffic double-encapsulated to P1, as expected:

root@PE1:/home/vagrant# tcpdump -i ens8
...
20:29:16.648325 MPLS (label 17, exp 0, ttl 63) (label 80, exp 0, [S], ttl 63) IP 192.168.11.102 > 192.168.23.102: ICMP echo request, id 2298, seq 2627, length 64
20:29:17.672287 MPLS (label 17, exp 0, ttl 63) (label 80, exp 0, [S], ttl 63) IP 192.168.11.102 > 192.168.23.102: ICMP echo request, id 2298, seq 2628, length 64
...

Let’s check what the next hop, P1, does with it. Capturing on PE2’s ens8 (the link to P1) shows that P1 forwards the traffic to PE2 after doing PHP: it removes the top (LDP) label and leaves only the BGP label:

root@PE2:/home/vagrant# tcpdump -i ens8
...
20:29:16.648176 MPLS (label 80, exp 0, [S], ttl 63) IP 192.168.11.102 > 192.168.23.102: ICMP echo request, id 2298, seq 2627, length 64
20:29:17.671968 MPLS (label 80, exp 0, [S], ttl 63) IP 192.168.11.102 > 192.168.23.102: ICMP echo request, id 2298, seq 2628, length 64
...
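
The label handling seen in the two captures can be modelled in a few lines. This is a toy sketch, not how FRR or the kernel implement forwarding: the label values and next hop come from the outputs above, while the dict layout and the function names pe_push and p_forward are invented for illustration.

```python
IMPLICIT_NULL = "implicit-null"

# Values taken from the outputs above; the data layout is illustrative only.
PE1_VPN_ROUTE = {"192.168.23.0/24": [17, 80]}      # "label 17/80" in vrf_cust1 on PE1
P1_LFIB = {17: ("192.168.77.101", IMPLICIT_NULL)}  # P1 "show mpls table", label 17

def pe_push(prefix):
    """Ingress PE: push the transport label (top) and the VPN label (bottom)."""
    return list(PE1_VPN_ROUTE[prefix])

def p_forward(stack, lfib):
    """Transit P router: look up the top label, then swap it or pop it (PHP)."""
    nexthop, out_label = lfib[stack[0]]
    rest = stack[1:]
    if out_label == IMPLICIT_NULL:
        return nexthop, rest            # penultimate hop popping
    return nexthop, [out_label] + rest  # ordinary label swap

at_pe1 = pe_push("192.168.23.0/24")
nexthop, at_pe2 = p_forward(at_pe1, P1_LFIB)
print(at_pe1, "->", nexthop, at_pe2)
```

The stack leaves PE1 as [17, 80] and arrives at PE2 as [80], which is what the tcpdump lines show.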

But then PE2 is not sending anything to CE3. I can’t see any of that traffic on the link:

root@CE3:/home/vagrant# tcpdump -i ens6
 tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
 listening on ens6, link-type EN10MB (Ethernet), capture size 262144 bytes
 20:32:03.174796 STP 802.1d, Config, Flags [none], bridge-id 8000.52:54:00:e2:cb:54.8001, length 35
 20:32:05.158761 STP 802.1d, Config, Flags [none], bridge-id 8000.52:54:00:e2:cb:54.8001, length 35
 20:32:07.174742 STP 802.1d, Config, Flags [none], bridge-id 8000.52:54:00:e2:cb:54.8001, length 35

I have double-checked the configs. All the routing and configuration look sane on PE2:

vagrant@PE2:~$ ip route
 default via 192.168.121.1 dev ens5 proto dhcp src 192.168.121.31 metric 1024 
 172.20.5.1  encap mpls  16 via 192.168.77.102 dev ens8 proto isis metric 20 
 172.20.5.5 via 192.168.77.102 dev ens8 proto isis metric 20 
 192.168.66.0/24 via 192.168.77.102 dev ens8 proto isis metric 20 
 192.168.77.0/24 dev ens8 proto kernel scope link src 192.168.77.101 
 192.168.121.0/24 dev ens5 proto kernel scope link src 192.168.121.31 
 192.168.121.1 dev ens5 proto dhcp scope link src 192.168.121.31 metric 1024 
 vagrant@PE2:~$ 
 vagrant@PE2:~$ ip -4 a
 1: lo:  mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
     inet 127.0.0.1/8 scope host lo
        valid_lft forever preferred_lft forever
     inet 172.20.5.2/32 scope global lo
        valid_lft forever preferred_lft forever
 2: ens5:  mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
     inet 192.168.121.31/24 brd 192.168.121.255 scope global dynamic ens5
        valid_lft 2524sec preferred_lft 2524sec
 3: ens6:  mtu 1500 qdisc pfifo_fast master vrf_cust1 state UP group default qlen 1000
     inet 192.168.23.101/24 brd 192.168.23.255 scope global ens6
        valid_lft forever preferred_lft forever
 4: ens7:  mtu 1500 qdisc pfifo_fast master vrf_cust2 state UP group default qlen 1000
     inet 192.168.24.101/24 brd 192.168.24.255 scope global ens7
        valid_lft forever preferred_lft forever
 5: ens8:  mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
     inet 192.168.77.101/24 brd 192.168.77.255 scope global ens8
        valid_lft forever preferred_lft forever
 vagrant@PE2:~$ 
 vagrant@PE2:~$ 
 vagrant@PE2:~$ 
 vagrant@PE2:~$ 
 vagrant@PE2:~$ ip -M route
 16 as to 16 via inet 192.168.77.102 dev ens8 proto ldp 
 17 via inet 192.168.77.102 dev ens8 proto ldp 
 18 via inet 192.168.77.102 dev ens8 proto ldp 
 vagrant@PE2:~$ 
 vagrant@PE2:~$ ip route show table 10
 blackhole default 
 192.168.11.0/24  encap mpls  16/80 via 192.168.77.102 dev ens8 proto bgp metric 20 
 broadcast 192.168.23.0 dev ens6 proto kernel scope link src 192.168.23.101 
 192.168.23.0/24 dev ens6 proto kernel scope link src 192.168.23.101 
 local 192.168.23.101 dev ens6 proto kernel scope host src 192.168.23.101 
 broadcast 192.168.23.255 dev ens6 proto kernel scope link src 192.168.23.101 
 vagrant@PE2:~$ 
 vagrant@PE2:~$                       
 vagrant@PE2:~$ ip vrf      
 Name              Table
 vrf_cust1           10
 vrf_cust2           20
 vagrant@PE2:~$ 

root@PE2:/home/vagrant# sysctl -a | grep mpls
 net.mpls.conf.ens5.input = 0
 net.mpls.conf.ens6.input = 0
 net.mpls.conf.ens7.input = 0
 net.mpls.conf.ens8.input = 1
 net.mpls.conf.lo.input = 0
 net.mpls.conf.vrf_cust1.input = 0
 net.mpls.conf.vrf_cust2.input = 0
 net.mpls.default_ttl = 255
 net.mpls.ip_ttl_propagate = 1
 net.mpls.platform_labels = 100000
root@PE2:/home/vagrant# 
root@PE2:/home/vagrant# lsmod | grep mpls
 mpls_iptunnel          16384  3
 mpls_router            36864  1 mpls_iptunnel
 ip_tunnel              24576  1 mpls_router
root@PE2:/home/vagrant# 
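
One check that may be worth adding at this point is whether the kernel MPLS table on PE2 has an entry for the VPN label that arrives in the capture (80). The sketch below only parses the "ip -M route" text pasted above; installed_labels is a hypothetical helper, and whether a missing entry is the actual cause of the drop is an assumption, not something confirmed in these notes.

```python
import re

# Copied from the "ip -M route" output on PE2 above.
IP_M_ROUTE = """\
16 as to 16 via inet 192.168.77.102 dev ens8 proto ldp
17 via inet 192.168.77.102 dev ens8 proto ldp
18 via inet 192.168.77.102 dev ens8 proto ldp
"""

def installed_labels(text):
    """Return the set of incoming labels found at the start of each route line."""
    return {int(m.group(1)) for m in re.finditer(r"^(\d+)\b", text, re.MULTILINE)}

labels = installed_labels(IP_M_ROUTE)
vpn_label = 80  # bottom label seen in the tcpdump capture on PE2
print(sorted(labels), vpn_label in labels)
```

With the output above, only the LDP labels 16, 17 and 18 are listed. On PE1, by contrast, "show mpls table" has BGP entries for 80 and 81.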

So I have been puzzled by this issue for the last couple of weeks. I thought iptables was fooling me again and somehow dropping the traffic, but as far as I can see PE2 is not sending anything, and I don’t really know how to troubleshoot FRR in this case. I have asked for help on the FRR list. Let’s see how it goes. I suspect I am doing something wrong, because I am not doing anything new here.