{"id":1585,"date":"2024-02-05T10:02:56","date_gmt":"2024-02-05T10:02:56","guid":{"rendered":"https:\/\/blog.thomarite.uk\/?p=1585"},"modified":"2024-02-09T11:06:32","modified_gmt":"2024-02-09T11:06:32","slug":"infiniband-professional","status":"publish","type":"post","link":"https:\/\/blog.thomarite.uk\/index.php\/2024\/02\/05\/infiniband-professional\/","title":{"rendered":"Infiniband Professional"},"content":{"rendered":"\n<h1 class=\"wp-block-heading\">1) Intro IB<\/h1>\n\n\n\n<p>open standard: IBTA<br>features:<br>simple mgmt: each fabric has a SM: subnet manager<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>nodes and links discovery<\/li>\n\n\n\n<li>local id assignment: LIDs<\/li>\n\n\n\n<li>routing table calculations and deployment<\/li>\n\n\n\n<li>configure nodes and ports, e.g. QoS<\/li>\n<\/ul>\n\n\n\n<p>high bw: non-blocking, bi-dir. 4 physical lanes (max 12). EDR: 25G per lane \/ HDR: 50G per lane \/ NDR: 100G per lane<br>cpu offload: kernel bypass, RDMA for CPU and GPU.<br>low lat: 1 microsecond for RDMA<br>scale out\/flex: up to 48k nodes in one subnet. Beyond that use IB routers<br>qos:<br>resilience: self healing. 1ms.<br>LB: adaptive routing, dynamic load balancing<br>sharp: mpi super performance: scalable hierarchical aggregation and reduction protocol. offload collective operations from host cpu\/gpu.<br>variety of topologies: fat tree, torus 3d, dragonfly<\/p>\n\n\n\n<p>components:<br>gateway: translate IB&lt;&gt;Ethernet<br>switch, router (between different subnets)<br>hca: host channel adapter (the IB equivalent of a NIC)<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">2) Intro IB Arch<\/h1>\n\n\n\n<h2 class=\"wp-block-heading\">Arch<\/h2>\n\n\n\n<p>L5 Upper: Mgmt protocols: subnet mgmt and subnet svcs. Verbs to interact with Transport Layer<br>L4 Transport: services to complete specified operation. Reassemble and split packets.<br>L3 Network: describes the protocol for routing a packet between subnets<br>L2 Link: describes the packet format and protocols for packet operation. 
(routing within a subnet)<br>L1 Physical: framing and signaling<\/p>\n\n\n\n<p>L2: LRH 8B + (L345) + Trailer (ICRC 4B + VCRC 2B)<br>LRH: Local Route Header: local src and local dst port. Includes SL (Svc Level) and VL (Virtual Lane). VL is the only field that changes while the packet traverses the subnet.<br>ICRC: Invariant CRC \/\/ VCRC: Variant CRC<br>L3: GRH 40B + L45<br>GRH: Global Route Header: present in packets that traverse multiple subnets. Routers forward packets based on GRH. Routers recalculate VCRC but not ICRC.<br>L4: BTH 12B + ETH var + L5<br>BTH: Base Transport Header: operation code (first, last, intermediate or only packet + operation type: send, rdma wr, read, atomic), seq num (PSN) and partition.<br>ETH: Extended Transport Header: conditionally present depending on CoS and operation code.<br>L5: Payload 256-4096B<\/p>\n\n\n\n<p>Wireshark: (L3 only in packets that need to be routed to a different subnet.)<br>Local Route Header -&gt; L2<br>Base Transport -&gt; L4<br>DETH &#8211; Datagram Ext Transport Header -&gt; L4<br>MAD Header &#8211; Common mgmt datagram -&gt; L5<br>SMP (Directed Route) -&gt; L5<br>ICRC &#8211; L2<br>VCRC &#8211; L2<\/p>\n\n\n\n<p><\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"455\" src=\"https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2024\/02\/IB-data-packet-structure-1024x455.png\" alt=\"\" class=\"wp-image-1586\" srcset=\"https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2024\/02\/IB-data-packet-structure-1024x455.png 1024w, https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2024\/02\/IB-data-packet-structure-300x133.png 300w, https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2024\/02\/IB-data-packet-structure-768x341.png 768w, https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2024\/02\/IB-data-packet-structure-1200x533.png 1200w, https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2024\/02\/IB-data-packet-structure.png 1311w\" sizes=\"auto, (max-width: 709px) 
85vw, (max-width: 909px) 67vw, (max-width: 1362px) 62vw, 840px\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Mgmt<\/h2>\n\n\n\n<p>fabric: links, switches and routers that connect channel adapters<br>subnet: ports and links with a common subnet id and managed by the same SM.<br>-router connects subnets<\/p>\n\n\n\n<p>SM: subnet manager. Centralized routing mgmt. plug and play. One master SM, the rest standby.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>discovering topology<\/li>\n\n\n\n<li>assigning local ids to nodes (LIDs)<\/li>\n\n\n\n<li>calculate and program switching forwarding tables<\/li>\n\n\n\n<li>managing elements<\/li>\n\n\n\n<li>monitoring elements<\/li>\n<\/ul>\n\n\n\n<p>Implemented in a server, switch or specialized device.<\/p>\n\n\n\n<p>elements:<br>Manager: active entity<br>Node: managed entity: switch, HCA, router<br>Agent: each node has a SMA (subnet manager agent). Passive, responds to Manager. Can send traps<\/p>\n\n\n\n<p>MADs: standard message format between Agent and Manager<\/p>\n\n\n\n<p>Addressing:<br>L1: GUID: Global Unique Id: unique address burned by vendor in hw: chassis, HCAs, switches, routers and ports.<br>L2: LID: Local Id: Assigned by SM. Unique within the subnet. Src and Dst LIDs are present in LRH. Dst LID is used by switch to send packet.<br>L3: GID: Global Id: identify end port or multicast group. Unique across subnets. Src and dst GID are in GRH. Dst GID is used by router.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">OFED Monitoring Utilities<\/h2>\n\n\n\n<p>OpenFabrics Enterprise Distribution (OFED): sw stack for RDMA and kernel bypass apps.<br>OFED utilities facilitate control, mgmt and diagnosis of IB fabrics.<\/p>\n\n\n\n<p>verify OFED installation: $ ofed_info | head -1<br>verify OFED running: $ \/etc\/init.d\/openibd status<br>verify HCA (nic) installed: $ lspci | grep -i mellanox<br>verify IB running: $ ibstat -&gt; list all local HCAs. info from IB driver. 
GUID, LID, port state, rate<br>verify connectivity: ibping (verify connectivity between hosts). It is a client-server command<br>destination: # ibping -S (server mode)<br>source: # ibping -L<br>verify path: ibtracert: source LID to dst LID.<br># ibtracert SLID DLID ===&gt; You do not have to run the command from the source LID itself !!!<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">3) Physical Layer<\/h1>\n\n\n\n<h2 class=\"wp-block-heading\">Overview<\/h2>\n\n\n\n<p>functions: bit sync, bit rate control, phy topologies, transmission mode<br>specifications: start and end delimiters, data symbols<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">HCAs = Host Channel Adapter = NICs<\/h2>\n\n\n\n<p>connect server to switch. NIC + offload.<br>1 or 2 ports.<br>GUID = MAC<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Media Types and Interconnection<\/h2>\n\n\n\n<p>link width: 1, 4, 8 or 12 lanes. Current usage: only 4<br>link rate = link speed * link width. Max cable length: DAC \/ AOC<br>EDR: Enhanced Data Rate &#8211; 25G per lane = 100G (5m \/ 100m)<br>HDR: High DR &#8211; 50G per lane = 200G (2m \/ 100m)<br>NDR: Next DR &#8211; 100G per lane = 400G (4m)<br>XDR: Extreme DR &#8211; 200G per lane = 800G<\/p>\n\n\n\n<p>DAC: direct attach &#8211; copper cable<br>AOC: active optical cable: each lane: 1xtx 1xrx &#8211; total 8 (more expensive than DAC). 
MultiMode (3-100m)<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Responsibilities<\/h2>\n\n\n\n<p>establish physical link, monitor status, inform link layer, guarantee signal integrity for best Bit Error Rate (BER)<\/p>\n\n\n\n<p>status: polling (no cable connected), disabled, portConfigTraining, LinkUp, LinkErrorRecovery (cable is faulty)<br># ibstat =&gt; shows the status of the hca<\/p>\n\n\n\n<p>BER = number of bit errors \/ total number of bits transferred<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Addressing<\/h2>\n\n\n\n<p>GUID: (like MAC) Globally Unique Id = 64 bit (assigned by vendor)<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>system GUID: abstract several GUIDs in one (like a cluster of devices)<\/li>\n\n\n\n<li>Node GUID: HCA, switches or routers<\/li>\n\n\n\n<li>Port GUID: HCA port.<\/li>\n<\/ul>\n\n\n\n<p>HCA has: 1x System GUID, 1x Node GUID, 1x Port GUID per physical port &#8211;&gt; # ibstat<\/p>\n\n\n\n<p>Switch (Fixed): 1xASIC (1xNode GUID), 1xSystem GUID. It does not have Port GUID<\/p>\n\n\n\n<p>Director (Modular) switch: 1xSystem GUID, each module has 1xNode GUID<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">OFED<\/h2>\n\n\n\n<pre class=\"wp-block-preformatted\"># ibportstate -&gt; state, speed, lanes, etc\n\n# ibswitches -&gt; list switches in the subnet and GUIDs\n\n# ibhosts: list all HCAs in the subnet and GUIDs\n\n# ibnodes: list both HCAs and switches in subnet.<\/pre>\n\n\n\n<h1 class=\"wp-block-heading\">4) Link Layer<\/h1>\n\n\n\n<p>switching inside local subnet<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Link Layer Services<\/h2>\n\n\n\n<p>Packet Mgmt:<br>Link mgmt packets<br>data packets: send, read, write, ack<br>header= LRH 8B + GRH 40B + BTH 12B + ETH var<br>payload= 256-4096B<br>ICRC 4B + VCRC 2B<\/p>\n\n\n\n<p>L2 Addressing<br>routing inside local subnet. 
Each node has LID (local ID) 2B inside LRH<br>LID assigned by Subnet Manager at initialization and when topology changes.<br>HCAs: LID per port<br>Fixed-form switch: 1 LID<br>Modular switches: 1 LID per module<br>Each subnet max 48k unicast LIDs<br>16k multicast LIDs<\/p>\n\n\n\n<p>QoS<br>enables prioritization of apps\/users\/data flows.<br>Service Levels (SL) and Virtual Lanes (VL)<br>SL is in LRH: defines class of packet<br>VL is in LRH: implements multiple logical flows over a single physical link<br>different packets are mapped to different VLs based on SL (marking)<br>each VL has a weight and priority<br>each VL uses different buffers<br>each VL has a scheduler<br>Max 16 VL:<br>special VL: VL15: Subnet Manager traffic only<br>VL0: all data traffic<br>VL1-14: free to use to implement your QoS policy<\/p>\n\n\n\n<p>Packet Forwarding<br>LID is read by switch to route to destination, checking the LFT (Linear Forwarding Table: table of DLID -&gt; Exit Port)<br>Implementing QoS: LFT contains SL to VL mappings<br># ibswitches -&gt; list of switches with LID<br># ibroute 10 &#8211;&gt; shows LFT of switch with LID 10 \/\/ OutPort=000 means the packet is processed by the switch.<\/p>\n\n\n\n<p>Flow Control<br>Lossless Fabric. Flow Control: prevents a fast sender from overwhelming a slow receiver, to avoid drops and retransmissions.<br>Credit based FC: receiver sends credits to sender to indicate availability of receive buffers. Sender waits for credits before transmitting.<br>packets are not held forever. There is a timeout; if it expires, the packet is dropped.<br>Each VL can have a separate FC.<\/p>\n\n\n\n<p>Data Integrity<br>by CRC: Cyclic Redundancy Check. Hash function. If the calculated CRC does not match, the packet is dropped and a resend is requested. end-to-end integrity<br>ICRC: invariant &#8211; all fields that do not change. 32bit<br>VCRC: variant &#8211; whole packet. 
16bit<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">OFED<\/h2>\n\n\n\n<pre class=\"wp-block-code\"><code># iblinkinfo: all nodes in fabric: LID, GUID, hostname, link speeds\n\n# ibnetdiscover: fabric discovery and list all nodes: LID, GUID, hostnames and link speeds. Generates a file with topology<\/code><\/pre>\n\n\n\n<h1 class=\"wp-block-heading\">5) Network Layer<\/h1>\n\n\n\n<h2 class=\"wp-block-heading\">routing solution overview<\/h2>\n\n\n\n<p>connect different subnets (each max 48k nodes)<\/p>\n\n\n\n<p>routing benefits:<br>-scaling<br>-isolation: separation, fault resilience, reliability, availability<br>-subnet management per subnet<br>-connectivity: each subnet can have a different topology<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">network layer overview<\/h2>\n\n\n\n<p>handles routing of packets between subnets using GID in GRH 40B (Global Route Header)<br>unicast and multicast<br>GID: Global ID &#8211; 128 bit &#8212; identifies single port or multicast group: GID= 64bit subnet prefix + port GUID (kind of like ipv6)<br>globally unique across subnets<\/p>\n\n\n\n<p>each HCA port has an automatically assigned default GID (fe80::) that can be used only in the local subnet (kind of like ipv6 link-local)<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">OFED<\/h2>\n\n\n\n<pre class=\"wp-block-code\"><code># ibv_devices -&gt; ib devices installed in server (hcas)\n# ibaddr -&gt; displays GID and LID<\/code><\/pre>\n\n\n\n<h1 class=\"wp-block-heading\">6) Transport Layer<\/h1>\n\n\n\n<h2 class=\"wp-block-heading\">overview<\/h2>\n\n\n\n<p>end-2-end communication services for apps &#8211; virtual channel. segment\/reassembly<br>channel end-points are called Queue Pairs (QPs): Each QP represents one end of a channel. QP bypasses kernel during data transfer. HCA oversees reliability<br>QP has a send and receive queue. QP id is 24 bits. 
apps have direct access to hw: mapping app&#8217;s virtual address into the QP.<br>If an app requires more than 1 connection -&gt; more QPs are created<br>QP workflow: A work queue is the app&#8217;s interface to the IB fabric.<br>If app wants to send\/receive data -&gt; post a Work Request (WR) to a work queue (that is a WQE &#8211; WQ Element)<br>When the HCA completes a WQE, a completion queue element (CQE) is placed on a completion queue.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"642\" src=\"https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2024\/02\/IB-QP-Workflow-1024x642.png\" alt=\"\" class=\"wp-image-1587\" srcset=\"https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2024\/02\/IB-QP-Workflow-1024x642.png 1024w, https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2024\/02\/IB-QP-Workflow-300x188.png 300w, https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2024\/02\/IB-QP-Workflow-768x481.png 768w, https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2024\/02\/IB-QP-Workflow-1200x752.png 1200w, https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2024\/02\/IB-QP-Workflow.png 1307w\" sizes=\"auto, (max-width: 709px) 85vw, (max-width: 909px) 67vw, (max-width: 1362px) 62vw, 840px\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Responsibilities: Three below<\/h2>\n\n\n\n<h2 class=\"wp-block-heading\">segmentation\/reassembly<\/h2>\n\n\n\n<p>segment when message bigger than MTU, done by HCA. HCA receiver side reassembles.<br>payload: 256-4096 bytes<br>default mtu = 4096<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">transport modes<\/h2>\n\n\n\n<p>QP has 4 transport service types. Source\/Destination QPs must have same mode. Service type depends on app.<br>RC: reliable connection<br>UC: unreliable connection<br>RD: reliable datagram<br>UD: unreliable datagram<\/p>\n\n\n\n<p>connected: dedicated QP for one connection in each end. 
Higher performance than datagram but more kernel memory consumed. Most used. Segmentation is supported<br>datagram: single QP serves multiple connections. Segmentation is not supported. More scalable than connected (similar to multicast)<br>reliable: each packet has Packet Seq Num (PSN). Receiver sends acks if packets arrive in order, sends a negative ack otherwise. Send QP has a timer. Similar to TCP.<br>unreliable: no ack.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">partitions<\/h2>\n\n\n\n<p>divide large cluster into small isolated subclusters -&gt; multitenancy, multi apps, security, qos.<br>ports may be members of multiple partitions at once<br>ports in different partitions are unaware of each other.<\/p>\n\n\n\n<p>PKEY: partition id. 16bit in BTH header. Carried in packets and stored in HCA. Used to determine partition membership. The Subnet Manager SM assigns the PKEY to the ports.<\/p>\n\n\n\n<p>membership type: limited vs full<br>limited: cannot communicate with other limited members in the partition. all nodes may communicate with SM. Full&lt;&gt;Limited is always ok (with same PKEY), e.g. storage, network mgmt.<br>default partition is 0x7fff. everything is part of it, assigned by SM, and all are full members, so the default PKEY is 0xffff (65535).<\/p>\n\n\n\n<p>high-order bit (left most) in PKEY records the membership type: 0 = limited \/ 1 = full. The low 15 bits are the partition number -&gt; 0x7fff = 111 1111 1111 1111<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">offloading<\/h2>\n\n\n\n<p>RDMA: remote direct memory access. data read\/write to remote server bypassing CPU in both ends. zero buffer copy.<br>reduce latency, increase throughput, cpu freed up<\/p>\n\n\n\n<p>two methods for offloading:<br>-channel semantic: send\/receive. Sending app has no visibility of the receiver&#8217;s buffer or data structure. Just sends data. Synchronous data flow<br>-memory semantic: rdma read\/write<br>rdma write example<br>receiver side registers a buffer in its memory space and passes it to the sender. Sender uses RDMA send\/write. Async communication. 
sender side does the same.<br>send side puts a WQE. its hca generates CQE. The receiver HCA puts the data directly in the memory, there is no WQE\/CQE in receiver side.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">ofed<\/h2>\n\n\n\n<p>perftest: read\/write and send tests. client-server. cpu same in client and server.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>latency perf test (-h)\n server        client\n  ib_read_lat  ib_read_lat\n  ib_write_lat ib_write_lat\n  ib_send_lat  ib_send_lat\n\nbw perf test (-h)\n server       client\n  ib_read_bw  ib_read_bw\n  ib_write_bw ib_write_bw\n  ib_send_bw  ib_send_bw<\/code><\/pre>\n\n\n\n<h1 class=\"wp-block-heading\">7) Upper Layer<\/h1>\n\n\n\n<h2 class=\"wp-block-heading\">overview<\/h2>\n\n\n\n<p>support upper layer protocols (Native IB RDMA, IPoIB, etc).<br>mgmt svc protocols (Subnet mgmt and subnet services).<br>sw transport verbs to communicate with HCA\/IB fabric (clients of upper layer)<\/p>\n\n\n\n<p>upper layer protocols: MPI (for HPC), IPoIB (enables TCP\/IP over IB), SDP (high perf interface for standard socket apps &#8211; TCP), SRP (SCSI devices over RDMA), iSER (zero copy RDMA to eliminate TCP and iSCSI bottleneck, better than SRP), NFS RDMA (NFS over RDMA)<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">management service protocols<\/h2>\n\n\n\n<p>-subnet mgmt: Uses special mgmt datagram (MAD) class called SMP: subnet mgmt packet -&gt; uses special QP0, always uses VL15 and not subject to flow control.<br>-general services: Uses a MAD called GMP: General mgmt Packet. 
Each port has a QP1 and all GMPs received on QP1 are processed by one GSA (General Service Agent).<br>GMP uses any VL except 15 (default 0), subject to Flow Control<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">sw transport verbs<\/h2>\n\n\n\n<p>verb: describes how an app requests actions from the messaging svc.<br>ie RDMA send: rdma_post_send, rdma_post_recv<br>RDMA write: rdma_post_write<br>RDMA read: rdma_post_read<br>OpenFabrics Alliance: defines verbs specification.<\/p>\n\n\n\n<p>&#8212; Fabric Mgmt &#8212;<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">8) Fabric Init<\/h1>\n\n\n\n<h2 class=\"wp-block-heading\">Init Stages<\/h2>\n\n\n\n<p>subnet has a common Subnet ID. Router connects subnets. Each subnet has SM (discovery topo, assign LIDs to nodes, calculate\/program forwarding tables, manage all elements, monitor changes). SM can be a server, switch or special device. Each node has a SMA (SM Agent) that communicates with SM<\/p>\n\n\n\n<p>1 Phy Fabric Establish: connect all cables<\/p>\n\n\n\n<p>2 Subnet Discovery: Once SM wakes up, it starts discovery with directly connected nodes, and then their neighbors. SM gathers switch info, port info and host info. SM uses SMPs (SM packets)<\/p>\n\n\n\n<p>3 Info gathering: SMPs use VL 15. Two types:<br>-Directed-routed: forwarded based on a vector of port numbers. Not dependent on routing table entries. Provides means to communicate before switches and hosts are configured (before LIDs are assigned). Mainly for discovery. Only SMI (SM interface) allows for these packets.<br>Two types of messages:<br>&#8212; get: SM polls fabric with get.<br>&#8212; get response: answer from devices.<br>Two types of commands:<br>&#8212; get node \/ port info:<br>&#8212; get response node \/ port info:<br>-LID-routed: forwards using switch forwarding table (after SM populates them)<\/p>\n\n\n\n<p>topo info gather: switches, hcas, ports, links. 
Topo described by node GUIDs and port numbers.<br>node info gather: type, number of ports, GUID, description<br>port info gather: MTU, VLs, width (num lanes), speed.<\/p>\n\n\n\n<p>4 LID Assignment: SM assigns LIDs to nodes<br>HCA: 1 LID per port<br>1RU switches (1 ASIC): 1 LID for whole switch<br>Modular switch: 1 LID per module (linecard)<\/p>\n\n\n\n<p>5 Path Establishment<br>min-hop: calculate number of hops required by each port to reach each destination LID. Shortest is best. tie-breaker: port with fewer LIDs assigned.<\/p>\n\n\n\n<p>6 Port Config<br>LID (unique in subnet), width (number of physical lanes), MTU (default 4096), speed.<br>QoS: VLs, SL to VL (mapping table Service Level to VL), VL arbitration<\/p>\n\n\n\n<p>7 Switch Config<br>SM populates the switch&#8217;s LFT with the best routes. LFT: destination LIDs -&gt; exit port. And SL-VL table.<\/p>\n\n\n\n<p>8 Subnet Activation<br>IB port: physical states: polling (after power on, cable not connected), training (establish link sync), linkup (ready to transfer packets)<br>logical states: down (phy is down: polling or training), init (phy is up but only deals with SMP and flow control), armed (verify data transfer is fine. SM sends dummy SMP with VCRC to verify that it is not corrupted), active (SM sends active to port)<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">ofed<\/h2>\n\n\n\n<pre class=\"wp-block-code\"><code># ibswitches: GUID, description, ports and LID.\n# ibroute &lt;switch_LID&gt;<\/code><\/pre>\n\n\n\n<p><\/p>\n\n\n\n<h1 class=\"wp-block-heading\">9) Fabric monitoring<\/h1>\n\n\n\n<h2 class=\"wp-block-heading\">SM properties<\/h2>\n\n\n\n<p>election process master SM: recommended (2xSM, master, standby) Each has priority: 4 bit: default=0, highest=15. 
tie-breaker: lowest GUID<br>SMInfo attribute used by SM to exchange info during subnet discovery and polling: GUID of the port of SM, priority and SM state (master, standby)<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">SM failover \/ handover<\/h2>\n\n\n\n<p>SM Failover: Master SM fails. Running sessions are not affected. New sessions need to wait for new master. By default, LIDs are not reassigned by new master.<br>SM Handover: new SM with higher priority takes over master role.<br>-avoid double failover: 1) avoid handover. 2) master_sm_priority=15 for all SM (and higher than current priority)<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Monitoring<\/h2>\n\n\n\n<p>light sweep: every 10s. SM queries node and port info from all switches: Port status changes, new SM appears, standby SM changes priority<br>A change detected by a light sweep causes a heavy sweep.<\/p>\n\n\n\n<p>heavy sweep: light sweep detects change or SM receives IB trap. -&gt; SM triggers fabric discovery from scratch: topo discovery, new LIDs (if necessary), program fw tables.<br>current flows through unaffected paths are not affected by rediscovery.<\/p>\n\n\n\n<p>host down or leaf switch down -&gt; avoid heavy sweep (no need to recalculate all fw tables in nodes) -&gt; SM configuration: Ucast-cache=True<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">ofed<\/h2>\n\n\n\n<pre class=\"wp-block-code\"><code># sminfo -&gt; master SM: LID, GUID, priority and state\n# smpquery nd -&gt; identify which node is running the SM\n# saquery -s -&gt; query all SMs (master and standby)<\/code><\/pre>\n\n\n\n<h1 class=\"wp-block-heading\">10) IB topologies part1<\/h1>\n\n\n\n<h2 class=\"wp-block-heading\">concepts<\/h2>\n\n\n\n<p>network topology: schematic arrangement of network elements: links, nodes<br>phy topology: how devices are connected<br>logical topo: how data moves from one node to another<br>considerations:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>availability: redundancy and fault 
tolerance<\/li>\n\n\n\n<li>reliability: downtime and delays are unacceptable<\/li>\n\n\n\n<li>performance: locate faults, troubleshoot errors, allocate resources<\/li>\n\n\n\n<li>future growth: add new nodes without affecting performance or user experience<\/li>\n\n\n\n<li>budget: effective and affordable<\/li>\n<\/ul>\n\n\n\n<p>leaf-spine.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>predictable and deterministic latency<\/li>\n\n\n\n<li>scalability<\/li>\n\n\n\n<li>redundancy<\/li>\n\n\n\n<li>increase bw<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">topologies:<\/h2>\n\n\n\n<h2 class=\"wp-block-heading\">fat-tree<\/h2>\n\n\n\n<p>tree like topology where links nearer the top of the hierarchy are &#8220;fatter&#8221; = having more links\/bw, than links further down. thickness = bandwidth<br>It is about oversubscription ratio: downlinks \/ uplinks =&gt; 1:1 (non-blocking)<\/p>\n\n\n\n<p>non-blocking: oversubscription 1:1 in all levels, higher cost. (real fat-trees are often oversubscribed)<br>blocking: oversubscription 2:1, 3:1, 3:2, reduced cost, not full bw, low latency is maintained.<\/p>\n\n\n\n<p>summary: good for hpc, non-blocking or oversubscription, lowest\/deterministic latency (2levels-&gt;3hops, 3levels-&gt;5hops)<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">dragonfly+ (BGP confederation)<\/h2>\n\n\n\n<p>connect groups in full-mesh, inside group leaf-spine. requires adaptive routing.<\/p>\n\n\n\n<p>summary: supports large number of hosts, extending fabric without reserving ports (fat-tree requires recabling), low lat and high bw: flexible and cost reduction<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">torus 3d<\/h2>\n\n\n\n<p>nodes connected in a ring formation in 3D (x,y,z)<br>each node has 2 links in each ring (3rings=3D)=6 links to neighbor switches<br>very scalable and resilient:<\/p>\n\n\n\n<p>summary: good for locality, cabling simpler\/shorter (less cost: effective, power, resilient), main benefit: cost -&gt; good for very large installs. 
Highly fault tolerant.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"595\" height=\"471\" src=\"https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2024\/02\/torus-3d-topo.png\" alt=\"\" class=\"wp-image-1588\" style=\"width:429px;height:auto\" srcset=\"https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2024\/02\/torus-3d-topo.png 595w, https:\/\/blog.thomarite.uk\/wp-content\/uploads\/2024\/02\/torus-3d-topo-300x237.png 300w\" sizes=\"auto, (max-width: 595px) 85vw, 595px\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">adaptive routing (AR)<\/h2>\n\n\n\n<p>load balancing between equal-cost best paths (min-hop), installed in FIB.<br>For every connection the switch will dynamically choose the least congested port. Reduces contention.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">credit loops<\/h2>\n\n\n\n<p>IB uses credit-based flow-control to avoid packet loss in congested switches: a sending port can send packets if it is granted credits by the receiving port<br>credit loop: cyclic buffer dependency (buffers are full) (in some cases you have to reboot a switch to fix!) They can create a deadlock (rarely)<br>avoid credit loops: UpDown routing algo: prevents traffic forwarding from a downstream link to an upstream one. Forbidden: down -&gt; up<br>allowed paths: up, down, up -&gt; down, same level (up-up, down-down)<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">10) IB topologies part2<\/h1>\n\n\n\n<p>routing engines: way paths are chosen = routing protocol. Each RE uses its own algo according to topology<\/p>\n\n\n\n<p>min-hop: topo agnostic. default algo. 2 stages: 1) compute min-hop table on each switch 2) LFT output port assignment in each switch.<br>does not prevent credit loops.<\/p>\n\n\n\n<p>up-down (+AR): fat-tree topo<br>prevent deadlocks (min-hop can&#8217;t)<br>algo: 1) starts with root switches (rank 0). 
2) Find all switches 1 hop away from root -&gt; rank 1 3) Switches 2 hops away from root -&gt; rank 2 4) so on 5) Find shortest path between every pair of endpoints 6) Any path that goes down (away from root) and then up (toward root) is discarded =&gt; rank N -&gt; rank N+1 (down) -&gt; rank N (up)<br>avoid credit loops: forbidden paths go down (away from root) and then up (towards root)<\/p>\n\n\n\n<p>fat-tree (+AR): fat-tree topo. fully-symmetrical fat-tree has its leaf switches connected with the same port index to each spine. Avoid credit loop like UpDown algo (forbidden paths down-up). Can do load-balancing to avoid congestion<\/p>\n\n\n\n<p>torus-2QoS: torus-2\/3d topo. Free of credit loops, two levels of QoS. Self-heal (single failed switch, and\/or multiple failed links) -&gt; rerouting automatic by SM. Short run time, good scaling<\/p>\n\n\n\n<p>dragonfly+ AR: dragonfly topo. Achieving max bw for different traffic patterns requires non-min multi-path routing =&gt; use min-hop+1 routes. You use min-hop+1 based on egress queue load (so you avoid congestion just by following a longer path) Triangle example.<br>Credit loops prevention:<br>-path with down-&gt;up can potentially cause a credit loop.<br>-credit-based flow-control operates per VL: Buffers are allocated per VL. Received credits are granted per virtual lane.<br>-DragonFly+ uses VL increment to avoid credit loops: The VL value is incremented when a packet is forwarded in the down-&gt;up direction. 
2 VLs are enough to prevent credit loops.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>dragonfly connects &#8220;groups&#8221;<\/li>\n<\/ul>\n\n\n\n<p>configure updn routing engine<br>\/etc\/opensm\/opensm.conf -&gt; default location &#8211; SM params<br>opensm -c \/etc\/opensm\/opensm.conf -&gt; creates default SM config.<br>For UpDown: provide the roots GUID list -&gt; # ibswitches -&gt; create list in \/etc\/opensm\/root_guid.conf -&gt; update opensm.conf with:<br>root_guid_file \/etc\/opensm\/root_guid.conf<br>Update opensm.conf with routing engine: routing_engine updn \/\/ or use # opensm -R updn<br>restart opensm: # service opensmd restart<br>check logs: grep table \/var\/log\/opensm.log<\/p>\n\n\n\n<p>&#8212; IB fabric bring up &#8212;<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">11) IB driver installation<\/h1>\n\n\n\n<p>what is OFED?: OpenFabrics Enterprise Distribution: open source sw for RDMA and kernel-bypass apps.<br>nvidia-ofed: supports IB and ethernet. up to 400G. linux\/windows\/VMs.<\/p>\n\n\n\n<p>install ofed linux:<br>hw requirements: 1GB space, supported linux, admin priv<br>prepare install: ofed_info -s (current version). For new install: kernel + os -&gt; uname -a \/ cat \/etc\/os-release<br>hca installed: lspci -v | grep -i mellanox<br>download driver from nvidia site.<br>mount image, install : # mount -o ro,loop MLNX_OFED_FILE.iso \/mnt<br>cd \/mnt &amp;&amp; sudo .\/mlnxofedinstall<br>restart: # \/etc\/init.d\/openibd restart<br>verify: ofed_info | head -1 &#8211;&gt; verify new version installed<br>ibstat &#8211;&gt; verify HCA is discovered as IB node<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">12) HCA firmware upgrade<\/h1>\n\n\n\n<p>hca hw and tools overview: host-channel-adapter. If you install ofed, upgrade hca too. You can upgrade hca itself.<br>MFT tools: MST: NVIDIA software tools service. Flint: firmware burning tool. 
MLXfwreset: tool for loading firmware on 5th gen devices<\/p>\n\n\n\n<p>firmware upgrade steps:<br>hca type: lspci | grep -i mellanox<br>hca info: ibv_devinfo -&gt; hca_id, fw_ver, vendor_part, board_id (PSID)<br>download firmware: search card type and then check every OPN option until you find a PSID that matches board_id (above command)<br>unzip + burn:<br>1) find hca full path: # mst status (or start it: mst start)-&gt; search for \/dev\/mst\/\u2026.<br>2) # flint -d \/dev\/mst\/xxxxxx -i FIRMWARE.bin b \/\/\/ b = burn !!!<br>reset: # mlxfwreset -d CARD reset<br>$ ibstat -&gt; check that the fw version changed<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">13) Running the SM<\/h1>\n\n\n\n<p>SM on a server, switch or NVIDIA UFM. Consider fabric scale (number of nodes): Init fabric, calculate fw tables, conf nodes and monitor changes. Licensing cost. enhanced features<\/p>\n\n\n\n<p>switch: in-band or out-of-band mgmt: mgmt in-band by SM, MLNX-OS has embedded SM. Unmanaged switches do not have an SM.<br>SM for small fabrics (up to 2048 nodes). No support for AR and dragonfly. No additional license.<br>enable sm:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>enable<br># conf t<br># show ib sm<br>disable<br># ib sm<br># show ib sm<br>enable<br>configure sm:<br># ib sm sm-priority 14<br># show ib sm sm-priority<br># ib sm ? ==&gt; options<br># ib sm routing-engine ? =&gt; change routing engine from min-hop (default)!<\/p>\n<\/blockquote>\n\n\n\n<p>server: large-medium fabrics. open-sm included in mlnx-ofed. no license. 
supports AR and dragonfly<br>run opensm<br># opensm -h<br>or run as a daemon<br># \/etc\/init.d\/opensmd start<br># \/etc\/init.d\/opensmd status<br>logging: \/var\/log\/messages (general) + \/var\/log\/opensm.log (detailed errors)<br>config: opensm -c \/etc\/opensm\/opensm.conf -&gt; creates default config file<br>routing engine config: a list, tried one by one until one succeeds: routing_engine ar_updn (nov 2021 default RE is updn with AR)<\/p>\n\n\n\n<p>UFM (Unified Fabric Manager): WebUI solution: telemetry, analytics, etc. Uses OpenSM. Can run on a server as a service, docker or dedicated hw.<br>UFM platforms: telemetry, enterprise (telemetry + enhanced monitoring and mgmt), cyber-ai (telemetry + enterprise + security)<br>enterprise: licensed per managed device. WebUI: settings -&gt; subnet manager, settings -&gt; network management: routing engine<\/p>\n\n\n\n<p>&#8212; IB monitoring &#8212;<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">14) IB diagnostics<br><\/h1>\n\n\n\n<p>node-level:<br>ofed_info: mlnx_ofed driver version<br>lspci: find hca<br>ibstat: link status<br>ibportstate LID PORT<br>ibroute LID: routing table of switch LID<br>ibv_devices: list hcas<br>ibv_devinfo: list hca details<\/p>\n\n\n\n<p>fabric-level:<br>ibswitches: list switches<br>ibhosts: list hcas<br>ibnodes: list all nodes<br>ibnetdiscover: show node-to-node connectivity<br>iblinkinfo: list all nodes and connectivity info<br>sminfo: show master sm<br>ibping<br>ibtracert SLID DLID<br>ibdiagnet<br>ib_write_lat<br>ib_read_lat<br>ib_write_bw<br>ib_read_bw<\/p>\n\n\n\n<p>ibdiagnet: fabric discovery, error detection and diagnostics. part of ibutils2 package. 
part of mlnx_ofed and ufm.<br>fabric discovery, duplicate GUIDs, duplicate node descriptions, LID checks, links in INIT state, counters, error counters check, routing checks, link width and speed checks, topology matching, partition checks and BER test.<\/p>\n\n\n\n<p>dump files: ibdiagnet2.log, .lst, .net_dump, .sm, .pm, .fdbs, .pkey<br>default location: \/var\/tmp\/ibdiagnet2\/*<\/p>\n\n\n\n<p>ibdiagnet -v -h<br>ibdiagnet (without params) runs the default set of checks<br>-i mlx5_2 -p 1 (card and port in card)<\/p>\n\n\n\n<p>ibdiagnet -pc =&gt; reset all port counters<br>ibdiagnet --pm_pause_time SEC =&gt; port counters delta validation<br>ibdiagnet -w FILE -&gt; creates a topology file<\/p>\n\n\n\n<p>ibdiagnet2.pm (port counters) port_xmit_wait: waiting time of packets in the send buffer: high values -&gt; bad!<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">15) Wireshark<\/h1>\n\n\n\n<p>ibdump -d mlx5_0 (device_name) -w FILE.pcap<\/p>\n\n\n\n<p><\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>1) Intro IB open standard: IBTAfeatures:simple mgmt: each fabric has a SM: subnet manager componets:gateway: translate IB&lt;&gt;Ethernetswitch, router (between different subnets)hca: host channel adapter: nic? 2) Intro IB Arch Arch L5 Upper: Mgmt protocols: subnet mgmt and subnet svcs. Verbs to interact with Transport LayerL4 Transport: services to complete specified operation. 
Reassemble and split packets.L3 &hellip; <a href=\"https:\/\/blog.thomarite.uk\/index.php\/2024\/02\/05\/infiniband-professional\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Infiniband Professional&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2],"tags":[],"class_list":["post-1585","post","type-post","status-publish","format-standard","hentry","category-networks"],"_links":{"self":[{"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/posts\/1585","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/comments?post=1585"}],"version-history":[{"count":2,"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/posts\/1585\/revisions"}],"predecessor-version":[{"id":1591,"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/posts\/1585\/revisions\/1591"}],"wp:attachment":[{"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/media?parent=1585"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/categories?post=1585"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.thomarite.uk\/index.php\/wp-json\/wp\/v2\/tags?post=1585"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}