NVIDIA GTC March 2023

I watched some interesting videos this week from NVIDIA GTC related to networking. And it is a pain that you need to use a “work” email to register…

  • S51839 – Designing the Next Generation of AI Systems:

— A quick summary: it seems any HPC network needs to use InfiniBand… NVIDIA has solutions for all sizes. They can provide a POD solution!!! All the cloud providers provide these services.

  • S51112 – How to Design an AI Supercomputer for Fast Distributed Training, and its Use Cases:

— Very interesting talk from NEC Japan. They built a network based on Ethernet switches for HPC with GPUs (and not IB, as seen in the other video). As well, they rely heavily on RDMA/RoCEv2. And it seems they have dedicated ports in the network for storage, management, etc. They are very happy with Cumulus Linux as the NOS.

  • S51339 – Hit the Ground Running with Data Center Digital Twin Automation:

— NVIDIA Air is an interesting tool for creating labs. I expected the demonstration to show off and build a huge network. “Digital twin” looks like the new buzzword in the network automation world.

  • S51751 – Powering Telco Cloud Services with Open Accelerated Ethernet:

— This one is from Comcast. And it is very interesting how “big” SONiC seems to be becoming. And NVIDIA is the second contributor to SONiC after M$! I need to try SONiC at some point.

  • S51204 – Transforming Clouds to Cloud-Native Supercomputing Best Practices with Microsoft Azure:

— Obviously, building NVIDIA-based supercomputers in M$ Azure. Again, all InfiniBand.

And another thing, the Spectrum-4 switch looks insane.

AWS Networking Videos – March 2023

I watched some very interesting videos about AWS networking. They are high level, so they don’t tell you the magic sauce you would like to know, but it is nice that this info is out in public.

  • DKNOG – How AWS is evolving its peering-edge in 2023 and onwards link + event:

— Evolution from buying chassis to building your own devices: consume -> create (NOC-less, auto-remediation, active telemetry, etc.) -> innovate (freedom to examine trade-offs, 1U devices). Clearly, use of “Clos” networks and their own Linux-based software.

— Delighted: low complexity + high innovation

— Simplicity Scales

— It is interesting to view a router/brick as a set of 1U devices (rack 102.8T – 200x400G ports for customers, non-blocking). And it is very good that they show pictures of the concept of “bricks” and “spines”.

— Challenges with cabling (SN connector — no patching rack needed) and 400G ZR+ (heating!)

— BGP peering is actually with a container:

— James Hamilton paper – link + pdf

  • AWS re:Invent 2022 – Dive deep on AWS networking infrastructure (NET402)– link

— Summary: this is “similar” to the DKNOG talk but longer and with some other details, like:

— “We don’t like chassis”. 1+ million devices.

— SRD at the NIC level, so one TCP flow is actually load-balanced across several paths.

— Hybrid SDN approach: you have controllers to give you a big-picture view (I guess it provides the visibility to say “just send 70% of traffic to this device” – but not sure how) and the devices’ own capability to deal with changes.

— Telemetry, continuous monitoring, triangulation: be able to detect which port/device is causing the problem.

  • AWS re:Invent 2022 – Leaping ahead: The power of cloud network innovation (NET211-L) – link:

— AWS Global Infrastructure: Backbone capacity

— Customer SW/HW

— Everything fails all the time

— GPS locations in fibers! + inject light in fiber to double check fault -> intelligent optical routing/failover -> better than BGP….

— Termite-resistant sheath on fibers for Australia 🙂

— Nitro card = NIC (offload card)

— SRD: doesn’t need in-order packet delivery as required by TCP. 25Gbps flows allowed now.

Building DCs with VXLAN BGP EVPN

VXLAN/EVPN is a technology that I have been trying to understand in more detail and depth since I started my current job. All my networking theory/knowledge comes from books, so this one is a good base. Keep in mind that it is a bit “old”, as it was released in 2017. In the last few months I have built up my confidence with VXLAN/EVPN via some issues and testing designs (Arista EVPN L3 gateway).

As I used to do in the past, I took notes on the book and I will put them here too, as it is a good refresher.

1 INTRO

STP for DC issues:

  • Convergence: tree recalculation
  • Unused links: follow tree…
  • Suboptimal forwarding: follow tree…
  • No ECMP
  • Traffic storm (no TTL in L2)
  • Scale: only 4k vlans (12 bits tag)

Leaf/Spine improvements as per above (Clos Network):

  • Scalability
  • Simple
  • Resilience
  • Efficiency
  • No oversubscription
  • ECMP
  • Deterministic latency
  • scale out -> + leaves // scale up (+bw) -> + spines

BUM = Broadcast, Unknown Unicast, Multicast

FabricPath = MAC-in-MAC (technology earlier than VXLAN). Proprietary to Cisco.

VXLAN = standard, MAC in IP/UDP, VNI = 24 bits -> 16M vlans! Flood & Learn (F&L): each network has its L2VNI + multicast group (control-plane). BGP EVPN doesn’t need F&L, so better control-plane.

Border Leaf or Border Spine = for external connectivity.

Route-Reflector (RR) or RP (multicast) in Spine

VXLAN – dataplane / EVPN – control-plane

2 BASICS

In DC, most traffic is east-west.

Limit vlan 12 bits (4k) + multi-tenancy? -> overlay: (indirection) abstraction of existing network tech + extend classic network capabilities. (David Wheeler: problem -> indirection)

Underlay -> increase MTU! (overlay overhead)

Handle BUM in underlay? Multicast (PIM – VNI mapped to multicast group = dst IP outer header) or Ingress Replication (head-end replication).

VTEP = Edge device, encap/decap, build overlay

VNI – Virtual Network ID

VXLAN header: original inner 802.1q header of l2 frame is removed and mapped to a VNI, to complete vxlan header. UDP dst port = 4789, src port = based on inner header

overhead = 50 bytes (14 (l2) +20 (l3) + 8 (l4) + 8 (vxlan) = 50). 54 bytes if optional 802.1q tag (4bytes) is added.

ECMP: 5-tuple: src IP, dst IP, proto, src port, dst port -> but only src port changes in vxlan -> that’s the entropy!
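Since only the outer UDP source port varies, here is a tiny Python sketch (my own toy model, not from the book) of both the 50-byte overhead arithmetic and the entropy trick: hash the inner headers into the source port so underlay ECMP can spread different inner flows over different paths (a real ASIC uses its own hash):

import zlib

# VXLAN overhead: outer L2 + outer IP + outer UDP + VXLAN header
OUTER_L2, OUTER_IP, OUTER_UDP, VXLAN_HDR = 14, 20, 8, 8
print("overhead:", OUTER_L2 + OUTER_IP + OUTER_UDP + VXLAN_HDR)  # 50 bytes

def vxlan_src_port(inner_src_mac, inner_dst_mac, inner_5tuple):
    """Outer dst port is always 4789; the src port carries the entropy.
    Hash the inner frame into the ephemeral range 49152-65535."""
    key = f"{inner_src_mac}{inner_dst_mac}{inner_5tuple}".encode()
    return 49152 + (zlib.crc32(key) % 16384)

print(vxlan_src_port("aa:bb:cc:00:00:01", "aa:bb:cc:00:00:02",
                     ("10.0.0.1", "10.0.0.2", 6, 41000, 443)))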

F&L: Doesn’t scale, no control-plane. Multicast replication has a limit. So Ingress Replication (IR). Every VTEP must be aware of the other VTEPs in the same VNI. The source VTEP has to replicate each packet to each VTEP.

BGP EVPN: solution for F&L. Eliminates unnecessary flooding. EVPN carries host MAC, IP, network, VRF and VTEP info. Once a VTEP detects a host and sends an EVPN update, remote VTEPs don’t age out the entry for that host.

IMPORTANT! Broadcast traffic (ARP, DHCP, etc.) is still flooded!!!

Tenant = VRF

eBGP = implicit next-hop-self for originated

RD = 8 bytes

— type 0: 2 byte ASN + 4 byte value

— type 1: 4 byte IP + 2 byte value

— type 2: 4 byte ASN + 2 byte value

RT: control import/export prefixes in VRF (auto derivation ASN:VNI)

VXLAN EVPN: RFC7432. Focus NVO (Network Virtualization Overlay) -> Route-Types:

— type 2: MAC/IP (host: /32 or /128). Sent once host is learnt. Info about IP is optional. Ext Community: RMAC = Router MAC = source VTEP.

— type 3: Inclusive multicast ethernet tag route -> create distribution list for IR. Generated and sent out immediately as a VNI is configured. Need ASIC support !!!

— type 5: IP prefix route (L3VNI)

“show bgp l2vpn evpn MAC” ->

[Route Type]:[Eth Segment ID]:[Eth Tag Id]:[MAC length]:[MAC]:[IP prefix]:[bit count]

bit count:

— 216: type2 only MAC

— 272: type2 IP/MAC

— 224: type5 IPv4

ExtCommunity: ENCAP:8 -> it is VXLAN

An ARP request triggers learning of the IP-MAC binding.

A MAC learnt via BGP is not aged out via the normal process: only a BGP delete message removes the MAC.

L3 learnt: depends on hw (FIB)

— HRT (Host Route Table): only for /32 or /128 (big)

— LPM (Longest Prefix Match): TCAM (small)

FIB: [Bridge Domain, RMAC] -> BD maps to L3VPN and RMAC maps to dst VTEP MAC

Type5:

— advertises first-hop-routing: prefix where VTEP is default gateway (IP anycast gateway)

— advertise prefixes from other protocols

Host detection:

ARP aging = 1500 sec -> if the ARP request fails -> type-2 deletes are sent. ** Even when the ARP entry is deleted, the MAC-only type-2 is still in the BGP EVPN control plane until MAC aging expires (1800 sec), and then a BGP withdraw is sent.

ARP aging < MAC aging -> avoid unnecessary flooding

Host mobility: VM to send GARP (gratuitous ARP): Highest MAC mobility seq ext community => Best

3 FORWARDING

  • Handling BUM or multidestination traffic:

— MC replication in the underlay:

Use MC in UL => 1x L2VNI = 1x MC IP => problem: 2^24 VNIs available -> it is a stretch for the MC IPs available, sw/hw limits (1000’s of PIM, IGMP, etc.) -> doesn’t scale

How to manage VNI-MC mapping? VNI randomly assigned to MC or MC is localized for a set of VNIs.

— Ingress Replication (IR = HER = Head-End Replication): Unicast mode. The VTEP makes n-1 copies of the BUM packet and sends them as unicast to the n-1 VTEPs of that VNI.

Replication list? Dynamic with BGP EVPN type 3 (IMET). The replication list is updated whenever an L2VNI is configured on a VTEP -> big overhead compared with MC. (See the sketch below.)
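To make the two BUM options above more concrete, a toy Python model (mine, not the book’s) of multicast mode (a VNI mapped into a limited pool of underlay groups) versus ingress replication (n-1 unicast copies built from the EVPN type-3 flood list):

import ipaddress

# Multicast mode: many L2VNIs share a small pool of underlay groups
MC_POOL = [str(ipaddress.ip_address("239.1.1.0") + i) for i in range(256)]

def vni_to_group(vni):
    return MC_POOL[vni % len(MC_POOL)]

# Unicast mode (IR/HER): one copy per remote VTEP in the VNI;
# the flood list would come from EVPN type-3 (IMET) routes.
def ingress_replicate(bum_frame, local_vtep, flood_list):
    return [(local_vtep, remote, bum_frame)
            for remote in flood_list if remote != local_vtep]

print(vni_to_group(30001))
print(ingress_replicate(b"arp-request", "10.0.0.1",
                        ["10.0.0.1", "10.0.0.2", "10.0.0.3"]))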

  • ARP Suppression:

— Use ARP snooping. ARP request -> populates BGP EVPN CP. 1) If the VTEP knows the dst MAC, it responds (this is ARP suppression). If not, using IR or MC, send the ARP to all VTEPs. The egress VTEP that has the host connected receives the ARP reply, makes an EVPN type-2 announcement to all VTEPs + sends the ARP reply (as unicast) to avoid any delay.

NX-OS uses MC for BUM by default = flood L2 locally and to all VTEPs in VNI.

MC group for overlay != MC group for underlay.

IGMP snooping (if supported) is an optional solution; it doesn’t depend on hw, just sw.

  • Distributed IP Anycast Gateway: Implemented at each VTEP, reduces traffic transit. Anycast = one-to-nearest association.

Anycast GW VTEPs share the same MAC -> prevents black-holing for host mobility (AGM = Anycast GW MAC address). The same AGM is used for all default gw IPs -> no hair-pinning.

  • Integrated Routing and Bridging (IRB)

— Asymmetric:

bridge-route-bridge at local VTEP

traffic egressing towards a remote VTEP uses a different VNI than the return traffic from the remote VTEP

requires consistent VNI config in all VTEPs

— Symmetric (NXOS):

bridge-route-route-bridge

egress and return use the same L3VNI. L2VNIs are not used for routing in symmetric IRB.

Not all VNIs need to be configured in all VTEPs, but for a VRF, the L3VNI needs to be configured in all VTEPs.

Inter VRF routing -> route leaking -> external router or firewall.

  • End Point Mobility:

BGP extended community = MAC mobility seq. Higher wins. With each move, seq++ (see the sketch below).

End point move triggered by (update via BGP EVPN CP):

— Reverse ARP: only advertises the new MAC

— Gratuitous ARP: advertises the new MAC/IP

VTEP verifies if endpoint has actually moved.
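A toy illustration (mine) of the mobility rule: the type-2 advertisement carrying the highest MAC mobility sequence wins, and the sequence increments with every move:

from dataclasses import dataclass

@dataclass
class MacRoute:
    vtep: str      # VTEP advertising the MAC (next-hop)
    mob_seq: int   # MAC mobility extended-community sequence

def best_mac_route(routes):
    # highest mobility sequence wins
    return max(routes, key=lambda r: r.mob_seq)

# host was behind 10.0.0.1 (seq 3), then moved behind 10.0.0.2 (seq 4)
routes = [MacRoute("10.0.0.1", 3), MacRoute("10.0.0.2", 4)]
print(best_mac_route(routes).vtep)  # 10.0.0.2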

  • VPC: MCLAG + LACP: Cisco -> vPC: 2 devices: 1 peer link + 1 keepalive link.

— PIP: primary IP. individual per VPC member per VTEP

— VIP: secondary IP in nve interface. Virtual IP = anycast VTEP at VPC level. It is the next-hop used in EVPN typ2/5. ** anycast VTEP != anycast gw

— Orphans: blackhole if using VIP -> solution: “advertise-pip”: VPC members use PIP instead of VIP as NH in originated EVPN type-5 (type-2 still uses VIP)

— Router MAC ext community in typ2/5:

—- PIP uses switch RMAC

—- VIP uses a locally derived MAC based on the VIP. Both VPC members derive the same MAC because they share the same VIP. As the RMAC ext community is non-transitive and VIPs are unique, no issue.

  • DHCP: discover, offer, request, ack. DHCP relay: configured on the default gw: the relay agent uses the default gw IP in the GiAddr field of the DHCP payload. The DHCP server uses the GiAddr field to find the correct scope, and also uses GiAddr as the dst IP for the answer. Problem with anycast gw because all VTEPs use the same IP -> solution: each VTEP DHCP relay uses a unique IP (a loopback) and it must be routable. How to choose the scope then? DHCP option 82.

4 UNDERLAY

  • Considerations

— Clos network = each port equidistant + consistent latency => multistage.

— MTU: vxlan -> avoid fragmentation. vxlan overhead = 50 bytes (14 outer MAC header + 20 outer IP header + 8 UDP + 8 vxlan header; extra 4 if an 802.1q tag is kept in the inner frame). Normal Ethernet MTU 1500 -> Ethernet frame = 1518 (or 1522 with 802.1q); the 18 = 6 MAC src + 6 MAC dst + 2 EtherType + 4 FCS. If using vxlan over a 1500-byte underlay => effective MTU 1450. If using 9000-byte jumbo frames => the underlay needs 9050. Most network kit supports up to 9216 MTU.

— IP Addressing: RID = lo. Use /31 or unnumbered (lo is used for RID) as much as possible. Lo0 (BGP) and Lo1 (VTEP) on IGP. Leaf = Lo0 + Lo1. Spine = only Lo0 because it is not vtep (if using multisite gateway need lo1 as it is vtep). Be sure your ip schema aggregates!!! -> reduce routing table (1x/24 all lo0, 1x/23 all p2p, etc)

  • Unicast Routing

— Underlay routing protocol: OSPF, ISIS, or BGP -> ECMP.

OSPF: use p2p network type instead of broadcast -> only LSA-1 !!! fast convergence !!! and small LSDB. If ipv6 -> OSPFv3 -> dual stack… two protocols!

ISIS: no IP, works on L2 (CLNS). SPF algo. TLVs. NSAP addressing. IP independent.

BGP: path vector (no SPF). If eBGP -> next-hop unchanged (if the spine is not a VTEP). Underlay eBGP -> phy to phy // overlay eBGP -> lo to lo (multihop!). If the eBGP spine acts as a “route reflector” => “retain route-target all”. If using the “Two AS” design (spine not a VTEP) -> spine: ipv4 + evpn => “disable-peer-as-check” // leaf: ipv4 + evpn => allowas-in

  • Multicast Routing: more efficient than unicast but needs one extra protocol

— BUM traffic: unicast mode = ingress replication in underlay // multicast mode = use multicast in underlay.

— Unicast: the VTEP has to generate n-1 copies of the packet. Replication of data traffic is a data-plane operation. VTEP-VNI membership distribution is dynamic via CP BGP EVPN or static via F&L (doesn’t scale!).

— Multicast: PIM Any Source Multicast ASM (PIM SM) or PIM BiDir (depends on hw). Can’t mix PIM modes. RP in the spines!

— PIM ASM Anycast RP: in each spine. 1 IP for all spines -> load balancing. (S,G) at the VTEP.

— PIM BiDIR: (*,G) at RP = Spines. Difference with Anycast, BiDir creates only a shared tree (*,G) on a per multicast group instead of creating a source tree (S,G) per VTEP per multicast group. Redundancy achieved with “phantom” RP that uses lo with different prefix length.

5 MULTITENANCY (L2-> vlan / L3 -> vrf)

  • Bridge Domain: Broadcast domain that represents the scope of a L2 network (vlan). Way of stretching a vlan -> vlan (12bits), vni (24 bits), switch.

  • VLANS in VXLAN: vlan local significant, vni is global significant (per switch, per port)

— L2VNI: RD = RID:(vlan+32767). RT -> auto-derived as AS:L2VNI (RT with eBGP is manual). See the sketch below.
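A quick sketch (mine) of that auto-derivation rule, assuming the NX-OS-style behaviour described above (RD = RID:(vlan+32767), RT = ASN:L2VNI):

def auto_rd_l2vni(router_id, vlan):
    return f"{router_id}:{vlan + 32767}"

def auto_rt_l2vni(asn, l2vni):
    return f"{asn}:{l2vni}"

print(auto_rd_l2vni("192.168.0.11", 10))   # 192.168.0.11:32777
print(auto_rt_l2vni(65001, 30001))         # 65001:30001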

  • L2 Multitenancy:

— VLAN mode: restriction 4K to VNI mapping per switch.

vlan 10
  vn-segment 30001

— Bridge domain mode: a BD is used instead of vlan mode. The BD implements a BDI instead of an SVI. No restriction of 4k VNI mappings -> hw restriction instead.

  • VRF in VXLAN BGP EVPN: VRF-Lite doesn’t scale. L3 at the Leaf. EVPN -> scale CP -> RD+RT

  • L3 Multitenancy: L3VNI global scope, vrf name is local significant. Auto: RD= RID:VRF_ID / RT= AS:L3VNI (RT+eBGP is manual at underlay)

— Summary: 1) Associate L3VNI into VTEP interface 2) core-vlan associated with L3VNI 3) SVI created in VRF

router bgp X
 vrf VRF-A
  address-family ipv4 unicast
    advertise l2vpn evpn
---
interface nve1
  member vni 50001 associate-vrf
---
vlan 2501
  vn-segment 50001
---
interface vlan 2501
  vrf member VRF-A
  no shut
  mtu 9216
  ip forwarding
---
vrf context VRF-A
  vni 50001
  rd auto
  address-family ipv4 unicast
  route-target both auto
  route-target both auto evpn

6 UNICAST FORWARDING

  • Intra-Subnet Unicast Forwarding (Bridging) (Classic Ethernet)

— ARP suppression disabled: ARP request -> BUM mode => Multicast or IR -> BGP EVPN for source MAC

— ARP suppression enabled: ARP snooping -> source MAC -> generates EVPN type-2. If the dst MAC is known by the ingress VTEP then it generates the ARP reply (ARP proxy).

— commands:

show bgp l2vpn evpn vni-id 30001
show l2route evpn mac all        <--|-- verifies FIB is updated
show mac address-table vlan X    <--|

// Announce the L3 GW IP manually
interface vlan 10
 vrf X
 ip address a/b tag 12345
---
route-map RM permit 10
 match tag 12345
---
router bgp Z
 vrf X
  address-family ipv4 unicast
     advertise l2vpn evpn
     redistribute direct route-map RM
  • Inter-subnet unicast forwarding (routing)

Symmetric IRB (bridge-route-route-bridge): VXLAN-routed traffic uses the same L3VNI in each direction. VRF -> L3VNI mapping in all VTEPs.

— Distributed IP Anycast GW: anycast GW MAC (AGM), implemented at each VTEP. Local routing in a VTEP -> no vxlan is used.

— Destination behind a remote VTEP (routing) -> vxlan -> inner MAC header (SMAC = VTEP1 router MAC / DMAC = VTEP2 router MAC). The RMAC is encoded in the BGP EVPN NLRI as an extended community.

— Silent Hosts:

— Dst IP unknown + dst bridge domain is local to the ingress VTEP => IP lookup hits LPM (e.g. /24) -> because of the distributed IP anycast GW -> chooses the local route (lowest AD) -> triggers an ARP request for the dst IP (because unknown) in a different VNI !! -> BUM forwarding -> reaches the other VTEP.

— No L2 extension present:

show bgp l2vpn evpn vni-id X  <-- 1) verify BGP RIB
show bgp ipv4 unicast vrf Y       2) verify RIB (RT worked fine)
show ip arp vrf Y                 3) verify FIB
  • Forwarding with dual-homed endpoints: VPC -> anycast VTEP = VIP. Egress (outer src IP = VIP when traffic leaves the ingress VTEP). Ingress (outer dst IP = VIP when return traffic leaves the egress VTEP -> ECMP to either of the VTEPs behind the VIP).

— orphan: traffic may cross VPC peer-link because NH=VIP. L2/L3 announcements in VPC -> NH=VIP. If routing needed between VTEP1 and VTEP2 (both belong to same VPC) -> BGP or VRF-lite or advertise type5 with “PIP” from each VTEP instead of VIP (preferred)

  • IPv6: Anycast GW MAC (AGM) is shared between ipv4/ipv6. Underlay only ipv4 -> overlay ipv6 communication => NH VTEP=ipv4.

7 MULTICAST FORWARDING

Handling MC in overlay.

EVPN Type 3 -> (unicast is used to handle BUM) the VTEP announces interest in an L2VNI.

Initially, without IGMP snooping in VXLAN, L2 MC is flooded to all VTEPs in that VNI even if they are not interested.

  • L2 MC forwarding = Intra-subnet MC. Same VNI = broadcast domain. In MC mode, underlay maps L2VNI to MC group.

— IGMP in VXLAN BGP EVPN:

— Classic IGMP snooping: traffic is still flooded unconditionally as long as VTEPs are members of that VNI. MC is dropped at the VTEP’s egress.

— Improved IGMP snooping: “ip igmp snooping disable-nve-static-route-port” -> conditional addition of a VTEP to the Outgoing Interface List (OIL) for a given VNI.

  • L2 MC forwarding in VPC: one of the two peers of VPC -> elected DF (lowest cost to RP). Election process: Both VPC peers send PIM join to RP using Anycast VTEP IP (secondary IP in lo1). RP sends only 1 reply to anycast IP, this is hashed to one VPC peer -> the peer with the (S,G) is the DF (S=VTEP anycast IP, G=MC VNI mapping)
  • L3 MC forwarding = inter-subnet MC. Not much info, something expected in 2017.

8 EXTERNAL CONNECTIVITY

  • Placement:

— Border Leaf: VTEP, few flows N-S. Extra hop. No end-points. The spine doesn’t need to be a VTEP.

— Border Spine: Spine becomes VTEP. Most flows N-S

— Extended L3 connectivity (L3 handoff):

— Wiring:

—- Full mesh: most resilient, doesn’t require sync between border nodes.

—- U-shape: sync link between border nodes.

— VRF-Lite/Inter-AS opt-A: BGP + redist + summarization, 802.1q. VRF-Lite-> SVI (needs BFD), subinterface (recommended) + ebgp

— Extended L2 connectivity: End-point mobility -> RARP (non-IP)

  • Classic Ethernet + VPC: VPC -> anycast VTEP IP (secondary IP in lo1) -> NH = anycast VIP (type2). “advertise pip” for type5 NH = VPC physical IP (primary lo1).

* BPDU not transported in VXLAN -> Use VPC between STP switch and VTEPs.

  • Extranet + Shared Services: Internet, DNS, DHCP, etc.

— VRF route-leaking: tenant VRF <-> shared VRF (dhcp, dns, etc) -> route leaking: CP leaking at ingress VTEP, DP leaking at egress VTEP. VXLAN uses VNI associated with source VRF for remote traffic. Problem: force consistent config in VTEP with leaking. Scalability (asymmetric IRB)

— Downstream VNI assignment: the egress VTEP dictates the VNI to be used by the ingress VTEP with downstream VNI assignment via the CP.

9 MULTIPOD, MULTIFABRIC, DCI

  • OTV vs VXLAN: the VXLAN frame is similar to OTV’s. OTV is a transport-agnostic, IP-based solution.

— OTV includes CP and DP. VXLAN only DP (it needs BGP EVPN for CP)

— OTV provides multihoming (redundancy) using a DF per VLAN; it doesn’t need VPC. VXLAN needs VPC to provide multihoming.

— OTV has loop prevention. VXLAN needs BPDU guards + storm control.

— ARP suppression enabled in both. Unknown multicast is dropped in OTV. VXLAN+EVPN doesn’t stop unknown unicast.

  • Multipod: LS-SS + super-spine layer. Prefix scale MAC/IP? The Spine or Super-Spine needs to be a BGP RR. MC -> scale the Outgoing Interface list (OIF). Max 65k LS. A single DP extends pod to pod = single fabric.

  • Multifabric: Difference from multi-pod, complete segregation CP and DP -> interconnect at border -> stitching VNIs, -> DCI design.

  • Interpod / Interfabric: Broadcast storm in overlay reaches all pods if L2 extended to all pods.

— opt-1: Multipod, single DP end to end. problem: failure domain, no separation pods (vxlan encap end to end)

— opt-2: Multifabric: DCI at border of fabric using classic Ethernet (VRF-lite + 802.1q). Better scale, MAC/IP not spread across all VTEPs (VXLAN encap only inside fabric). VXLAN ends at border device. Problem: DCI is bottleneck.

— opt-3: Multisite: option2 + re-originate L3 routing info (MPLS L3EVPN) VXLAN ends at border fabric -> DCI encap in MPLS -> other end removes MPLS and then back to VXLAN.

— opt-4: Multisite L2: option 3 for L2. OTV or EVPN. VNI-VNI stitching.

* Multisite EVPN VXLAN using BGW -> IETF draft-sharma-multi-site-evpn 2016

10 L4-7 SERVICES INTEGRATION

  • Firewalls in VXLAN BGP EVPN:

— routing mode: use L3

— bridging mode: “bump in the wire”, VLAN stitching

— FW redundancy with static routing: ok if HA FW connected to same LS pair (VPC). If FW in different LS -> suboptimal routing -> 2 solutions: 1) static route tracking, 2) static route at remote LS -> static route in ALL LS that need to reach the FW -> LS will learn type2 of FW via active LS.

  • Inter-Tenant / Tenant-Edge FW: security enforcement at edge/exit of a tenant/VRF. VRF stitching located at Border LS.

— Intra-tenant FW: E-W firewall = FW inside the VRF.

— deployment:

—- FW routed mode + default GW for all VLANs => VXLAN only at L2 => no VRFs, no anycast gw.

—- FW bridge mode: all network belong to same subnet. VXLAN + distributed IP anycast gw. FW connected to distributed IP anycast GW LS.

—- PBR: Policy-Base Routing

— Mixing intra-tenant and inter-tenant:

— Intra-tenant:

—-L2 (E-W): FW is GW. LS only extends L2 -> vxlan only l2, no distributed IP anycast gw. BL trunk to FW to extend L2.

—- L3: LS uses distributed IP anycast gw.

— Inter-tenant: default route pointing to FW -> redistribute via BGP EVPN

  • Load Balancer: “stateful”

— one-arm source-NAT: LB connected with 1 link / PO to LS.

— Direct VIP subnet approach: LB VIP + LB physical IP in same range. VIP advertised via type2

— Indirect VIP subnet approach: needs static route (like FW example) -> type5.

— source-NAT -> client IP is hidden, servers return traffic to LB

— service chains: LB+FW: FW belongs to BL, LB belongs to Service Leaf. If 2-Arm LB -> VRF-transit between FW-LB. If 1-arm LB -> no transit-vrf, source NAT.

11 FABRIC MANAGEMENT

  • POAP: out-of-band (mgmt port) needs DHCP relay; inband (front-panel ports)
  • NRFU
  • OAM:
show mac address-table
show l2route evpn mac all
show vlan id X vn-segment
show bgp l2vpn evpn vni-id Z
show bgp l2vpn evpn MAC
show ip arp vrf Y
show forwarding vrf Y adj
show forwarding up local-host-db vrf Y
show l2route evpn mac-ip all
show bgp l2vpn evpn IP
show ip route vrf Y IP
show nve internal bgp remote database
show nve peers detail

ping nve up unknown vrf X payload IP DST SRC port SRC DST proto 6 payload-end vni 50000 verbose
traceroute ...
pathtrace ...

IPv6 BIG TCP / Replace TCP in DC: Homa

This week a colleague passed me this link about a Kubernetes cluster running on Cilium. The interesting point is the high throughput achieved by BIG TCP and IPv6!

The summary (copied) is:

TCP segments in the OS are up to 65K, NIC hardware does the segmentation – we do this now, but the 65K is a limitation of IPv4 addressing.  BIG TCP uses IPv6 and allows much large TCP segments within OS currently 512K but theoretically higher.  End result – better perf (>20% higher in this video) and latency (2.2x faster through the OS).

Then I saw this other video from John Ousterhout. It is a similar topic to the Kubernetes video above, as K8s is used mainly in datacenters.

High performance:
– data throughput: full link speed for large messages
– low tail latency: <10us for short messages? (DC)
– message throughput: 100M short messages per second? (DC)

TCP issues in DC:
1- stream oriented (no load balancing) -> message based
2- connection oriented (can break InfiniBand!, expensive) -> connectionless
3- fair scheduling (bw sharing) -> run to completion (SRPT)
4- sender-driven congestion control (based on buffer occupancy) -> receiver-driven congestion control
5- in-order delivery -> no ordering requirements

As well, the move to the NIC is important (as there is already a lot of NIC offloading).

His proposal, Homa, looks very nice, but I like how he explains how difficult it is going to be for it to be successful. Still worth trying.

VMware Co-stop / LPM in hardware

This is a very interesting article about how Longest Prefix Matching is done in network chips. I remember reading about Bloom filters in some Cloudflare blog, but I didn’t think they would be used in network chips too. As well, I had forgotten how critical LPM is in networking.

I had to deal lately with some performance issues with an application running in a VM. I am quite a noob regarding virtualization and always assumed the bigger the VM the better (very masculine thing, I guess…). But a virtualization expert at work explained to me the issues with that assumption, with this link. I learnt a lot from it (still a noob though). But I agree that I see most vendors asking for crazy requirements when offering products to run in a VM… and that kind of kills the idea of having a virtualization environment, because such a VM basically requires a dedicated server… So right-sizing your product/VM is very important. I agree with the statement that vendors don’t really do load testing for their VM offerings and that the high requirements are an excuse to “avoid” problems from customers.

CCNA DevNet Notes

1) Python Requests status code checks:

r.status_code == requests.codes.ok
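A minimal example of how that check is typically used (the URL is just a placeholder):

import requests

r = requests.get("https://example.com/api/devices", timeout=5)
if r.status_code == requests.codes.ok:   # same as r.status_code == 200
    print(r.json())
else:
    r.raise_for_status()  # raises HTTPError for 4xx/5xx responses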

2) Docker publish ports:

$ docker run -p 127.0.0.1:80:8080/tcp ubuntu bash

This binds port 8080 of the container to TCP port 80 on 127.0.0.1 of the host machine. You can also specify udp and sctp ports. The Docker User Guide explains in detail how to manipulate ports in Docker.

3) HTTP status codes:

1xx informational
2xx Successful
 201 created
 204 no content (post received by server)
3xx Redirect
 301 moved permanently - future requests should be directed to the given URI
 302 found - requested resource resides temporarily under a different URI
 304 not modified
4xx Client Error
 400 bad request
 401 unauthorized (user not authenticated or failed)
 403 forbidden (need permissions)
 404 not found
5xx Server Error
 500 internal server err - generic error message
 501 not implemented
 503 service unavailable

4) Python dictionary filters:

my_dict = {8:'u',4:'t',9:'z',10:'j',5:'k',3:'s'}

# filter(function,iterables)
new_dict = dict(filter(lambda val: val[0] % 3 == 0, my_dict.items()))

print("Filter dictionary:",new_filt)

5) HTTP Authentication

Basic: For "Basic" authentication the credentials are constructed by first combining the username and the password with a colon (aladdin:opensesame), and then by encoding the resulting string in base64 (YWxhZGRpbjpvcGVuc2VzYW1l).

Authorization: Basic YWxhZGRpbjpvcGVuc2VzYW1l

---
import base64

auth_type = 'Basic'
creds = '{}:{}'.format(user, password)  # 'pass' is a reserved word in Python
creds_b64 = base64.b64encode(creds.encode()).decode()
header = {'Authorization': '{} {}'.format(auth_type, creds_b64)}

Bearer:

Authorization: Bearer <TOKEN>

6) “diff -u file1.txt file2.txt”. link1 link2

The unified format is an option you can add to display output without any redundant context lines

$ diff -u file1.txt file2.txt
--- file1.txt   2018-01-11 10:39:38.237464052 +0000
+++ file2.txt   2018-01-11 10:40:00.323423021 +0000
@@ -1,4 +1,4 @@
 cat
-mv
-comm
 cp
+diff
+comm
  • The first file is indicated by ---
  • The second file is indicated by +++
  • The first two lines of the output show information about file 1 and file 2: the file name, modification date, and modification time of each file, one per line.
  • The lines below display the content of the files and how to modify file1.txt to make it identical to file2.txt.
  • - (minus) – the line needs to be deleted from the first file.
    + (plus) – the line needs to be added to the first file.
  • The next line has two at signs @@, followed by a line range from the first file (in our case lines 1 through 4, separated by a comma) prefixed by “-”, then a space, then a line range from the second file prefixed by “+”, and at the end two at signs @@ again. The file content that follows tells us which lines remain unchanged and which lines need to be added or deleted (indicated by the symbols) in file 1 to make it identical to file 2.

7) Python Testing: Assertions

.assertEqual(a, b)	a == b
.assertTrue(x)	        bool(x) is True
.assertFalse(x)	        bool(x) is False
.assertIs(a, b)	        a is b
.assertIsNone(x)	x is None
.assertIn(a, b)	        a in b
.assertIsInstance(a, b)	isinstance(a, b)

*** .assertIs(), .assertIsNone(), .assertIn(), and .assertIsInstance() all have opposite methods, named .assertIsNot(), and so forth.
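A small example (mine) showing a few of these assertions inside a unittest test case:

import unittest

class TestExamples(unittest.TestCase):
    def test_assertions(self):
        self.assertEqual(2 + 2, 4)
        self.assertTrue("vxlan".isalpha())
        self.assertIsNone({}.get("missing"))
        self.assertIn("leaf", ["leaf", "spine"])
        self.assertIsInstance(4789, int)

if __name__ == "__main__":
    unittest.main()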

ARP Storms – EVPN

We have had an issue with broadcast storms in our network. Checking the CoPP setup in the switches, we could see massive drops of ARP. This is a good link to know how to check CoPP drops in NXOS.

N9K# show copp status
N9K# show policy-map interface control-plane | grep 'dropped [1-9]' | diff

Having so many ARP drops by CoPP is bad because very likely good ARP requests are going to be dropped.

Initially I thought it was related to ARP problems in EVPN, like this link. But after taking a packet capture in a switch from an interface connected to a server, I could see that over 90% of the ARP traffic coming from the server was not getting a reply… Checking different switches, I could see the same pattern all over the place.

So why was the server making so many ARP requests?

After some time, I managed to get help from a sysadmin with access to the servers, so we could troubleshoot the problem.

But, how do you find the process that is triggering the ARP requests? I didn’t make the effort to think about it and started to search for an easy answer. This post gave me a clue.

ss does show you connections that have not yet been resolved by arp. They are in state SYN-SENT. The problem is that such a state is only held for a few seconds then the connection fails, so you may not see it. You could try rapid polling for it with

while ! ss -p state syn-sent | grep 1.1.1.100; do sleep .1; done

Somehow I couldn’t see anything with “ss”, so I tried netstat, as it also shows you the status of the TCP connection (I wonder what would happen if the connection was UDP instead???)

Initially I tried “netstat -a” and it was too slow to show me the “SYN-SENT” status.

Shame on me, I had to search for how to show the ports quickly; found it here:

watch netstat -ntup | grep -i syn_sent | awk '{print $4,$5,$6,$7}'

It was slow because it was trying to resolve all IPs to hostnames… :facepalm:. That is fixed with “-n” (don’t resolve).

Anyway, with the command above, I finally managed to see the processes that were in the “SYN_SENT” state.

This is not the real thing, just an example:

#  netstat -ntup | grep -i syn_sent 
tcp        0      1 192.168.1.203:35460     4.4.4.4:23              SYN_SENT    98690/telnet        
# 

We could see that the destination port was TCP 179, so something in the node was trying to talk BGP! They were “bird” processes. As the node belonged to a Kubernetes cluster, we could see a Calico container as the CNI. Then we connected to the container and checked the bird config. We could clearly see that the IPs that didn’t get an ARP reply were configured there.

So in summary, basic TCP:

To summarize: TCP is L4; it then goes down to L3 (IP). To get to L2, you need to know the MAC for the IP, and that is what triggers the ARP request. Once the MAC is learned, it is cached for the next request. For that reason the first time you make a connection it is slower (ping, traceroute, etc.).

Now we need to work out why the Calico/bird config is that way, fix it so it only uses the IPs of real BGP speakers, and then verify the ARP storms stop.

Hopefully, I will learn a bit about calico.

Notes for UDP:

If I generate a UDP connection to a non-existent IP:

$ nc -u 4.4.4.4 4000

netstat tells me the UDP connection is established, and I can’t see anything in the ARP table for an external IP, while for an internal IP (in my own network) I can see an incomplete entry. Why?

#  netstat -ntup | grep -i 4.4.4.4
udp        0      0 192.168.1.203:42653     4.4.4.4:4000            ESTABLISHED 102014/nc           
# 
#  netstat -ntup | grep -i '192.168.1.2:'
udp        0      0 192.168.1.203:44576     192.168.1.2:4000        ESTABLISHED 102369/nc           
# 
#
# arp -a
? (192.168.1.2) at <incomplete> on wlp2s0
something.mynet (192.168.1.1) at xx:xx:xx:yy:yy:zz [ether] on wlp2s0
# 

# tcpdump -i wlp2s0 host 4.4.4.4
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on wlp2s0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
23:35:45.081819 IP 192.168.1.203.50186 > 4.4.4.4.4000: UDP, length 1
23:35:45.081850 IP 192.168.1.203.50186 > 4.4.4.4.4000: UDP, length 1
23:35:46.082075 IP 192.168.1.203.50186 > 4.4.4.4.4000: UDP, length 1
23:35:47.082294 IP 192.168.1.203.50186 > 4.4.4.4.4000: UDP, length 1
23:35:48.082504 IP 192.168.1.203.50186 > 4.4.4.4.4000: UDP, length 1
^C
5 packets captured
5 packets received by filter
0 packets dropped by kernel
# 
  • UDP is stateless so we can’t have states… so it is always going to be “established”. Basic TCP/UDP.
  • When trying to open a UDP connection to an external IP, you need to “route”, so my laptop knows it needs to send the UDP traffic to the default gateway. When getting to L2, the destination MAC address is not 4.4.4.4’s but the default gateway’s MAC. BASIC ROUTING!!!! For that reason you don’t see 4.4.4.4 in the ARP table (see the sketch below).
  • When trying to open a UDP connection to a local IP, my laptop knows it is in the same network, so it should be able to find the destination MAC address using ARP.
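A toy Python sketch (mine) of that “basic routing” decision: if the destination is inside my own subnet, ARP for it directly; otherwise ARP for (and send to) the default gateway:

import ipaddress

def arp_target(my_iface_cidr, default_gw, dst_ip):
    network = ipaddress.ip_interface(my_iface_cidr).network
    dst = ipaddress.ip_address(dst_ip)
    return dst_ip if dst in network else default_gw

print(arp_target("192.168.1.203/24", "192.168.1.1", "192.168.1.2"))  # 192.168.1.2
print(arp_target("192.168.1.203/24", "192.168.1.1", "4.4.4.4"))      # 192.168.1.1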

TCP Asymmetric

I recently got escalated an issue that had caused several outages and needed an urgent fix.

For different reasons, we had asymmetric routing in SITE-A. The normal flow is the green arrow. During the asymmetric routing, the flow is the red line. Routing-wise, things should work. BUT we have firewalls in the path. The firewalls were configured to allow asymmetric connections (I was told). As far as I could see in the config and logs, nothing was dropped in the firewalls during the issue.

So first thing, I fixed the asymmetric routing so it didn’t happen again. It took me a while to come up with the solution (and it was quite simple) as I had to properly understand the routing before and during the issue. The diagram is quite simplified, at the end of the day.

So during the maintenance window when I applied the fix for the asymmetric routing, I managed to take some traces in the firewalls, as I was trying to understand where the traffic was dropped/lost during the asymmetric scenario. As well, I was not very familiar with several parts of the network and the monitoring, so I didn’t know which links were already tapped or not. Once I was happy with the routing fix, I tried to take a look at the traces. At a high level, I could see the return traffic leaving FW1 and leaving DC1-SW1. Based on that, I started to think that the firewalls were fine…

In another maintenance window, I tried to take more logs in different parts of the network and I could clearly see the traffic reaching A-SW1. As I ran out of time and missed tapping some links, I couldn’t carry on.

So based on the second maintenance, the issue had to be inside SITE-A. Somehow it didn’t make sense. I checked that I didn’t have uRPF enabled. The rest was pure L2, so it couldn’t see the L3…

So in the third maintenance window, I got all my debugging tools out to verify whether any network kit was dropping the traffic in SITE-A… and it was useless. Then I realized that I could run a tcpdump on the client IP1 I was using for testing, and I could see some return traffic!!!!

So, I was just shocked. I didn’t get it. It didn’t make sense.

Somehow, I went back and reviewed the TCP captures I had taken on each interface of both firewalls. I was trying to get back to basics.

I was assuming the TCP handshake was completed properly. After paying a bit of attention to the client logs… I could see the TCP handshake completed. And I could see the HTTP GET getting to and leaving DC2-FW… so why was the server IP2 not answering!!!!???

So back to the TCP handshake and firewall captures, comparing step by step. Somehow, I had missed that the TCP ACK from client IP2 was reaching DC2-FW… but it was not leaving DC2-FW!!!! Even worse, the HTTP GET was actually crossing DC2-FW!!!

SLAP IN THE FACE!!!

This is the TCP handshake. This is networking 101…..

The TCP state-machine in client and server during the asymmetric scenario

So I was assuming that because the client was sending the HTTP GET, the TCP handshake was completed on both ends!!!!

It didn’t make sense why I was seeing TCP SYN-ACK retransmissions from the server IP1… BECAUSE the TCP ACK from client IP2 never arrived.

For that reason server IP2 never answered the HTTP GET: from its end, the TCP handshake was not completed.

I banged my head several times on the table. I “saw” this during the first maintenance window when I took the tcpdump in the firewalls, BUT I didn’t pay attention to the basic details.

I trusted the Wireshark trace too much because it is more visual and shows more info, but the clues were there all the time in the tcpdump from the firewalls, which I didn’t bother to pay full attention to.

At least, I found out where and why the connections failed during the asymmetric routing scenario. A firewall upgrade did the job.

So all fixed.

Lessons learned:

  • without proper foundations, you can’t build knowledge (TCP handshake state in client and server)
  • when things don’t make sense, get back to basics (TCP handshake)
  • get the most out of the tools at hand (tcpdump – the PSH packets were the HTTP GET!!!!)

Smallest Audience – TCPLS – ByPass CDN WAF – Packet Generator

A bit of mix of things:

Smallest (viable) audience: Specificity is the way

TCPLS: I know about QUIC (just the big picture), but this TCP+TLS implementation looks interesting. Although I am not sure their test is that meaningful; a more “real-life” example would be ideal (packet loss, jitter, etc.).

ByPass CDN: I am not well versed in cloud services, but this looks like an interesting article about CDNs and WAFs from a security perspective. It is the typical example of thinking outside the box: why can’t the attacker be a “customer” of the CDN too???

Packet Generator – BNG Blaster: I knew about TRex but never had the chance to use it, and I know how expensive the commercial solutions are (shocking!), so this looks like a nice tool.