I finished this ebook yesterday. With ebooks I can hardly take notes, so in this case it was a pity but also a bit of a relief.
Really good and better than I expected. As the title says, it is about having range: being a generalist rather than ultra-specialized.
It gives examples where specializing at an early age is good because the domain has well-known rules and immediate feedback, like golf (Tiger Woods) and chess. It was int
But it also provides examples of exceptional sports figures (Federer) who didn’t commit to one sport until a late age.
And that applies not only to sport but to your career and to research. Teams with members from different backgrounds and knowledge produce more than specialized teams. It provides examples from investing to solving science problems. So it is quite shocking.
Although an early specialization will give you a head start, the general view will win in the long run. It is not bad to be very good at something, but trying different things can bring extra benefits. That’s the summary of the book.
Personally, I buy it. I like networks, but I have interests in Linux, automation, hardware, baking, etc. I will never win a Nobel prize (and I don’t need it) but I think it gives me options at the end of the day. You just need to look at the job descriptions where they ask for everything.
I hit rock bottom this week. I hope I finally closed one door in my life so I give myself the chance to open others. Made the wrong decision? It is easy to say when you look back. Do I regret it? The most annoying thing is these are failures so you can’t go back and recover. But I was such a bloody newbie!!!…. At least after 5 years…
“For every reason it’s not possible, there are hundreds of people who have faced the same circumstances and succeeded.” Jack Canfield
Head down, crying, cursing, whatever, but forwards. As it has always been.
—-
Somehow I managed to listen to long videos, something I normally can’t manage (because of lack of time, etc)
Negative Beliefs, avoid bitterness, aim for greatness (remarkable things), escape the darkness: Jordan B Peterson with Modern Wisdom: video, podcast.
Find and keep Love: video. 1st Get your shit together. Communication is critical. Be careful with your shopping list….
Using gNOI capabilities to simplify software upgrade use case: video – I had no idea about gNOI so it looks interesting. It is crazy that still in the 21st century automating a network device is so painful. Thanks to all vendors for making your life miserable.
Go lang for network engineers: video, slides – I always thought that Golang had massive potential for network automation, but there was always a lack of support and Python is the king. So it is nice to see that Arista has things to offer.
There are more things, but I haven’t had the chance to review them.
—-
It looks like there is a new chatbot that is not using the standard NVIDIA GPU. Groq uses an LPU (Language Processing Unit). And they say it is better than a GPU. They have this paper but I can’t really see the features of that LPU.
Slurp’it: Saw this blog, and the product looks interesting, but although it is free, it is not open source, and at the end of the day you don’t want a new vendor lock-in
Container lab in kubernetes: Clabernetes. I would like to play with this one day.
NetDev0x17: videos and sessions. link This is quite low level and most of the time beyond my knowledge. Again, something to take a look at some point.
LLM from scratch: repo. Looks very interesting. But the book is going to take a long time to hit the market.
Really good book. Easy to digest and even easier to take home. And I need to watch this video too. And funny enough, I was watching a bit of this video too (that is quite related – interesting the investing fund points, I need to review)
I highlighted a lot of sentences in the book and I think the summary below is too long, but I still take the following as basic: save, get room for error, define what you want, get your freedom.
0- Intro
I think he is the summary of the whole book. He was a gas station attendant, janitor and investor who was worth over $8m when he died.
Financial success is not a hard science. it is a soft skill, where how you behave is more important than what you know.
Finance is overwhelmingly taught as a math-based field. But knowing what to do tells you nothing about what happens in your head when you try to do it.
We think Finance follows laws like Physics but it is actually guided by people’s behaviour. And that leads to the next point: how I behave may be sane for me but crazy to you.
1- No One’s Crazy
It is easy to say an investment decision was good/bad looking back. We make money decisions based on the information we have in the moment, plugged into our unique mental model of how the world works at that moment. So yes, it can look crazy. And investing for the masses is actually something very new… so we are newbies, whether we like it or not.
Some lessons have to be experienced before they can be understood. So that can explain why something looks crazy if you haven’t gone through it.
2 – Luck and Risk
Bill Gates won the lottery attending one of the few schools in the world with a computer.. his friend Kent Evans died in a mountaineering accident. Both sides of the same coin.
Robert Shiller (Economy Nobel Prize): What do you want to know about investing that we can’t know? The exact role of luck in successful outcomes.
So we always read about the successful people/companies (extreme cases). What proportion of these outcomes were caused by actions that are repeatable vs the role of random risk and luck? So the question is how to tell luck from skill.
So focus less on specific individuals and case studies and more on broad patterns.
To deal with failure, arrange your financial life in a way that those situations will not wipe you out, so you can keep playing until the odds fall in your favor (room for error…) And be able to forgive yourself when judging failures.
Nothing is as good or as bad as it seems
3- Never Enough
Enough: “Yes, but I have something he will never have… enough.”
I think that is another key about “wealth”.
The hardest financial skill is getting the goalpost to stop moving.
Social comparison is the problem here.
Enough is not too little: it is realizing that the opposite, an insatiable appetite for more, will do you no good.
There are many things never worth risking
4-Confounding Compounding
There are over 2000 books written about Warren Buffett. But his success came from investing for over 75 years… His secret is time. That’s how compounding works.
Compounding is not intuitive so it is easy to ignore.
So good investment is about earning pretty good returns for the longest period of time.
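Why time matters more than the rate is easier to see with numbers. A minimal sketch, assuming yearly compounding; the 7% rate and the 1,000 starting amount are made-up values for illustration, not figures from the book:

```python
def future_value(principal: float, annual_rate: float, years: int) -> float:
    """Value of `principal` after compounding `annual_rate` once a year for `years` years."""
    return principal * (1 + annual_rate) ** years


if __name__ == "__main__":
    # Hypothetical inputs: same money and same "pretty good" return, only the time changes.
    for years in (10, 30, 75):
        print(years, "years ->", round(future_value(1_000, 0.07, years)))
```

The function is just the standard compound-interest formula; the only knob that changes between the three printed lines is the number of years.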
5- Getting Wealthy vs Staying Wealthy
Million ways to get wealthy. But just one to stay wealthy: some combination of frugality and paranoia (a.k.a survival). So getting money and keeping money are two totally different skills.
I like the note about Jesse Livermore (I read that book some time ago). He made the biggest fortune ever during the crash of 1929.
And another quote from Nassim Taleb: Having an edge and surviving are two different things: the first requires the second. You need to avoid ruin. At all costs.
Be financially unbreakable: you will get the biggest returns, because you will be able to stick around long enough for compounding to run its magic.
The most important part of every plan is to plan on the plan not going according to plan (a.k.a backup plan or room for error). A plan is only useful if it can survive reality. And a future filled with unknowns is everybody’s reality. So it is anything that lets you live happily with a range of outcomes.
Conservative is avoiding a certain level of risk. Room for error, or margin of safety, is raising the odds of success at a given level of risk by increasing your chances of survival. This is something I think I understand but I need some clear examples in my head.
Optimistic about the future but paranoid about what will prevent you from getting to the future is vital: Being pessimistic is too easy.
6- Tails, You Win.
Another interesting note about Heinz Berggruen. The great investors bought vast quantities of art. A subset of the collections turned out to be great investments, and they were held for a sufficiently long period of time to allow the portfolio return to converge on the return of the best elements. That’s all that happens.
Anything that is huge, profitable, famous or influential is the result of a tail event (Walt Disney, Brad Pitt winning an award, Venture Capital, AWS, iPhone, Microsoft, etc). Tails drive everything.
Same in different ways: Few things account for most results (I think this is the easiest one to understand)
Military genius according to Napoleon: The man who can do the average thing when all those around him are going crazy. Applied to investing: most of the time, today is not that important. What matters is the small number of days when everybody is going crazy… so what do you do???
George Soros: It is not whether you are right or wrong that’s important, but how much money you make when you are right and how much you lose when you are wrong. You can be wrong half of the time and still make a fortune.
7- Freedom
Controlling your time is the highest dividend money pays: control over doing what you want, when you want, with the people you want to, is the broadest lifestyle variable that makes people happy. Using your money to buy time and options has a lifestyle benefit that few luxury goods can compete with. Most stuff we buy means giving away control of our time.
Most of today’s workforce does not do manual labor, so we work with our heads, and it is not that easy to switch off. We are constantly working in our heads, and so we lose control over our time.
8- Man in the Car Paradox
No one is impressed with your possessions as much as you are. Humility, kindness and empathy will bring you more respect than a Ferrari.
9- Wealth is What You Don’t See
Spending money to show off is the fastest way to have less money. Wealth is an option not yet taken to buy something later. It requires self-control. And because it is hidden, we have few models to learn from. It is much easier to follow the Instagram show-off.
10- Save Money
For me this is the key to everything: without savings there are few things you can do.
Building wealth is more related to your saving rate than income or investment returns.
The value of wealth is relative to what you need: high savings rate -> having lower expenses.
One of the most powerful ways to increase your savings is not to raise your income, it is to raise your humility: what you need is just what sits below your ego. Don’t care about what others think about you (and not just about money!)
You spend less if you desire less, and you desire less if you care less about others -> so that goes back to the title of the book… money relies more on psychology than finance.
You don’t need any specific reason to save (car, house, holidays, etc): Savings without a spending goal give you options and flexibility: ability to wait and opportunity to act = time for thinking. And that flexibility and control over your time is an unseen return on wealth. Savings earning 0% can give you more than you think, in the sense of being able to take a lower paid but more satisfying job.
Intelligence is no longer a sustainable advantage (software eats the world). Competitive advantages tilt toward nuanced and soft skills: flexibility is a main one. Again, it is being able to wait for a good opportunity (career, investment, etc). So having more control over your time and options is one of the most valuable currencies in the world: Just Save it.
11- Reasonable > Rational
Aiming to be mostly reasonable works better than trying to be coldly rational. Reasonable is more realistic and you have a better chance of sticking with it for the long run, which is what matters at investing. Think of the fever. It is beneficial, but we fight it because it hurts! Minimizing future regret is hard to rationalize on paper but easy to justify in real life.
12- Surprise!
History is the study of change, it is not a map of the future.
Scott Sagan: Things that have never happened before happen all the time.
Investing is not a hard science. It is people making imperfect decisions with limited information.
The majority of what’s happening at any given moment in the global economy can be tied back to a handful of past events that were nearly impossible to predict.
History can be a misleading guide to the future of the economy and investment because it doesn’t take into account the structural changes of today. For example, when did venture capital start? Before that there were only (if lucky) risk-averse bankers.
Historians are not prophets. Things change.
13- Room for Error
The most important part of every plan is planning on your plan not going according to plan. This is similar to an earlier point.
Kevin Lewis from Bringing Down the House: We have enough money to withstand any swings of bad luck (so you can fight another day)
Benjamin Graham: The purpose of the margin of safety is to render the forecast unnecessary.
Unknowns are always part of life.
Having a gap between what you can technically endure versus what you can emotionally endure is an overlooked version of room for error. Use room for error when estimating your future returns.
Charlie Munger: The best way to achieve felicity is to aim low. (and a paper)
Nassim Taleb: You can be risk loving and yet completely averse to ruin. If you have 95% chance to be right, be sure that the other 5% is not going to wipe you out.
Back to an earlier chapter, it is important to save for the sake of it, for the unknowns.
14- You’ll Change
Long-term planning is harder than it seems because people’s goals and desires change over time. We are poor forecasters of our future selves. We really underestimate how much we will change.
So avoid the extreme ends of financial planning (spending everything vs saving everything)
Accept the reality of change and move on as soon as possible.
Charlie Munger first rule of compounding is to never interrupt it unnecessarily.
End of History Illusion: a psychological illusion in which individuals of all ages believe that they have experienced significant personal growth and changes in tastes up to the present moment, but will not substantially grow or mature in the future. Despite recognizing that their perceptions have evolved, individuals predict that their perceptions will remain roughly the same in the future.
15- Nothing’s Free
You usually get what you pay for. Same for markets. The volatility/uncertainty fee (the price of returns) is the cost of admission to get returns greater than low-fee investments (ie: money in the bank, etc). Ticket to Disneyland vs local fair. You need to convince yourself the market’s fee is worth it. So find the price, then pay it.
16- You & Me
This relates to point 1 “No One’s Crazy”. It relates to economic bubbles too. It is a mistake to think assets have one rational price in a world where investors have different goals and time horizons. And that can trigger bubbles, when the momentum of short-term returns attracts enough money that the makeup of investors shifts from mostly long term to mostly short term. Then the process feeds on itself until it can’t be maintained. So find which game you are playing.
17- The Seduction of Pessimism
Optimism is the best bet for most people because the world tends to get better for most people most of the time.
Optimism is a belief that the odds of a good outcome are in your favor over time, even when there will be setbacks along the way.
But pessimism sounds smarter and gets more attention. Why?
Based on Daniel Kahneman: It is the asymmetric aversion to loss that comes from evolution: losses loom larger than gains. Organisms that treat threats as more urgent than opportunities have a better chance to survive and reproduce.
And again it relates to point 1 “No One’s Crazy”.
As well, bad news about the economy can affect everybody, so you pay attention.
Progress happens too slowly to notice but setbacks happen too quickly to ignore.
And finally, expecting things to be bad is the best way to be pleasantly surprised when they are not.
18- When You’ll Believe Anything
Stories are the most powerful force in the economy. Imagine venture capitalists that put money in things that don’t exist…. Another example is 2009, when we stopped believing house prices would keep rising.
The more you want something to be true, the more likely you are to believe a story that overestimates the odds of it being true. To protect yourself against that, you need a bigger gap between what you want to be true and what you need to be true to have an acceptable outcome.
Everyone has an incomplete view of the world. But we form a complete narrative to fill the gaps.
Daniel Kahneman: The ability to explain the past gives us the illusion that the world is understandable. It gives the illusion that the world makes sense, even when it doesn’t. And that produces big mistakes.
Some years ago, I tried an amazing roasted cauliflower at a rooftop, and wanted to try making it one day. I watched this video and then I had to do it.
INGREDIENTS
Whole cauliflower head
Cooking liquid
2 cups cheap white wine
2 cups veggie stock
3 tbsp olive oil
2 tsp white wine vinegar
2 tbsp brown sugar
5 cloves garlic, slightly crushed
2 bay leaves
1 onion, in quarters
salt and pepper
Flavor Paste
3 tbsp olive oil
1 tsp smoked paprika
1 any dried herb blend
2 tbsp fresh grated parmesan cheese
2 tbsp butter, melted
1 tsp fresh ground black pepper
1 tbsp tomato puree
Garnish
2-3 tsp fresh grated parmesan cheese
fresh chopped parsley
DIRECTIONS
– Mix Cooking Liquid ingredients in a large stockpot and bring to a boil
– Once boiling, reduce heat to medium low and place cauliflower head in the pot and baste with a ladle once or twice as it cooks covered for 10-15 min.
– Preheat oven to 180C. Convection setting on if you have that option.
– While cauliflower is steaming in the stockpot, mix Flavor Paste ingredients in a small bowl.
– After cauliflower has cooked in the stockpot for 10-15 min, remove and transfer to a baking sheet with a raised grate.
– Spoon and spread Flavor Paste all over cauliflower head. Spray lightly with olive oil.
– Place cauliflower in preheated oven for about 30 min.
– Remove from oven and grate parmesan cheese all over and spray again lightly with olive oil
– Place in oven for another 15-20 min or until deep golden brown.
– Transfer to serving platter and garnish with fresh chopped parsley and enjoy!
And this is my result:
My coating wasn’t great and I had it for lunch the next day so it wasn’t the same, but it was still good. And as usual, room for improvement!
Still, it is not the same roasted cauliflower I had that time…. I need to do proper research.
I have done it before but I watched this video and wanted to do it like him, with homemade lasagna sheets.
FOR THE EGG LASAGNA SHEETS
Durum wheat semolina flour 350g (just used fine semolina)
Cake flour 150g (I just used normal white wheat flour)
Spinach (raw) 250g
Eggs 2
Egg yolks 3
FOR THE RAGÙ (MEAT SAUCE)
Minced beef 300g
Pork pancetta 150g (I used a bit of chorizo as I didn’t have it)
Carrots 50g
Celery 50g
Onions 50g
Red wine 1/2 cup (100g)
Tomato puree 300g
Vegetable broth to taste
Salt, pepper and olive oil
FOR THE BÉCHAMEL SAUCE
Butter 70g
White flour 70g
Whole milk 1 liter (likely I used much less)
Salt, pepper and nutmeg
TO SEASON
Butter to taste
Grated Parmesan cheese: 270g (I didn’t have that much and used mozzarella)
Making lasagna sheets with a rolling pin is not the same 🙂
open standard: IBTA
features:
simple mgmt: each fabric has a SM: subnet manager
-nodes and links discovery
-local id assignment: LIDs
-routing table calculations and deployment
-configure nodes and ports, ie: qos
high bw: non-blocking, bi-dir. 4 physical lanes (max 12). EDR: 25G per lane / HDR: 50G per lane / NDR: 100G per lane
cpu offload: kernel bypass, RDMA for CPU and GPU.
low lat: 1 microsecond for RDMA
scale out/flex: up to 48k nodes in one subnet. Beyond that use IB routers
qos:
resilience: self healing. 1ms.
LB: adaptive routing, dynamic load balancing
sharp: scalable hierarchical aggregation and reduction protocol. mpi super performance: offload collective operations from host cpu/gpu.
variety topologies: fat tree, torus 3d, dragonfly
L5 Upper: Mgmt protocols: subnet mgmt and subnet svcs. Verbs to interact with Transport Layer
L4 Transport: services to complete specified operation. Reassemble and split packets.
L3 Network: describes the protocol for routing a packet between subnets
L2 Link: describes the packet format and protocols for packet operation (routing within a subnet)
L1 Physical: framing and signaling
L2: LRH 8B + (L345) + Trailer (ICRC 4B + VCRC 2B)
LRH: Local Route Header: local src and local dst port. Includes SL (Service Level) and VL (Virtual Lane). VL is the only field that changes while the packet traverses the subnet.
ICRC: Invariant CRC // VCRC: Variant CRC
L3: GRH 40B + L45
GRH: Global Route Header: present in packets that traverse multiple subnets. Routers forward packets based on GRH. Routers recalculate VCRC but not ICRC.
L4: BTH 12B + ETH var + L5
BTH: Base Transport Header: operation code (first, last, intermediate or only packet + operation type: send, rdma write, read, atomic), seq num (PSN) and partition.
ETH: Extended Transport Header: conditionally present depending on CoS and operation code.
L5: Payload 256-4096B
Wireshark: (L3 only in packets that need to be routed to a different subnet.)
Local Route Header -> L2
Base Transport -> L4
DETH – Datagram Ext Transport Header -> L4
MAD Header – Common mgmt datagram -> L5
SMP (Directed Route) -> L5
ICRC – L2
VCRC – L2
Mgmt
fabric: links, switches and routers that connect channel adapters
subnet: ports and links with common subnet id and managed by same SM.
-router connects subnets
SM: subnet manager. Centralized routing mgmt. plug and play. One master SM, the rest standby.
discovering topology
assigning local ids to nodes (LIDs)
calculate and program switching forwarding tables
managing elements
monitoring elements
Implemented in a server, switch or specialized device.
elements:
Manager: active entity
Node: managed entity: switch, HCA, router
Agent: each node has a SMA (subnet manager agent). Passive, responds to Manager. Can send traps
MADs: standard message format between Agent and Manager
Addressing:
L1: GUID: Global Unique Id: unique address burned by vendor in hw: chassis, HCAs, switches, routers and ports.
L2: LID: Local Id: Assigned by SM. Unique within the subnet. Src and Dst LIDs are present in LRH. Dst LID is used by switch to send packet.
L3: GID: Global Id: identifies end port or multicast group. Unique across subnets. Src and dst GID are in GRH. Dst GID is used by router.
OFED Monitoring Utilities
OpenFabrics Enterprise Distribution (OFED): sw stack for RDMA and kernel bypass apps. OFED utilities facilitate control, mgmt and diagnosis of IB fabrics.
verify OFED installation: $ ofed_info | head -1
verify OFED running: $ /etc/init.d/openibd status
verify HCA (nic) installed: $ lspci | grep -i mellanox
verify IB running: $ ibstat -> lists all local HCAs. info from IB driver: GUID, LID, port state, rate
verify connectivity: ibping (verifies connectivity between hosts). It is a client-server command
destination: # ibping -S (server mode)
source: # ibping -L
verify path: ibtracert: source LID to dst LID.
# ibtracert ===> You don’t have to run the command from the source LID itself !!!
3) Physical Layer
Overview
functions: bit sync, bit rate control, phy topologies, transmission mode
specifications: start, end delimiter, data symbols
HCAs = Host Channel Adapter = NICs
connect server to switch. NIC + offload. 1 or 2 ports. GUID = MAC
Media Types and Interconnection
link width: 1,4,8,16 lanes. Current usage: only 4
link rate: link speed * link width
                                                   DAC   AOC
EDR: Enhanced Data Rate – 25G per lane = 100G      5m    100m
HDR: High DR – 50G per lane = 200G                 2m    100m
NDR: Next DR – 100G per lane = 400G                4m
XDR: Extreme DR – 200G per lane = 800G
DAC: direct attach – copper cable
AOC: active optical cable: each lane: 1xtx 1xrx – total 8 (more expensive than DAC). MultiMode (3-100m)
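The rule "link rate = link speed * link width" can be checked with a tiny sketch. The per-lane speeds are the ones in these notes; the function and dictionary names are just for illustration:

```python
# Per-lane speed in Gb/s for each generation, as listed in the notes.
LANE_SPEED_GBPS = {"EDR": 25, "HDR": 50, "NDR": 100, "XDR": 200}


def link_rate_gbps(generation: str, lanes: int = 4) -> int:
    """Link rate = per-lane speed * number of lanes (4 lanes is the usual width)."""
    return LANE_SPEED_GBPS[generation] * lanes


if __name__ == "__main__":
    for gen in LANE_SPEED_GBPS:
        print(gen, link_rate_gbps(gen), "Gb/s")
```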
Responsibilities
establish physical link, monitor status, inform link layer, guarantee signal integrity for best Bit Error Rate (BER)
status: polling (no cable connected), disabled, portConfigTraining, LinkUp, LinkError Recovery (cable is faulty) # ibstat => shows you status of hca
BER = number bit errors / total number bit transferred
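The BER formula as a one-liner, with made-up counter values (the function name is mine, not from any IB tool):

```python
def bit_error_rate(bit_errors: int, bits_transferred: int) -> float:
    """BER = number of bit errors / total number of bits transferred."""
    if bits_transferred <= 0:
        raise ValueError("bits_transferred must be positive")
    return bit_errors / bits_transferred


if __name__ == "__main__":
    # Hypothetical counters: 3 errored bits out of 10^12 transferred.
    print(bit_error_rate(3, 10**12))
```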
Addressing
GUID: (like MAC) Globally Unique Id = 64 bit (assigned by vendor)
system GUID: abstract several GUID in one (like a cluster of devices)
Node GUID: HCA, switches or routers
Port GUID: HCA port.
HCA has: 1x System GUID, 1x Node GUID, 1x Port GUID per physical port –> # ibstat
Switch (Fixed): 1xASIC (1xNode GUID), 1xSystem GUID. It doesn’t have Port GUID
Director (Modular) switch: 1xSystem GUID, each module has 1xNode GUID
OFED
# ibportstate -> state, speed, lanes, etc
# ibswitches -> list switches in the subnet and GUIDs
# ibhosts: list all HCAs in the subnet and GUIDs
# ibnodes: list both HCAs and switches in subnet.
4) Link Layer
switching inside local subnet
Link Layer Services
Packet Mgmt:
Link mgmt packets
data packets: send, read, write, ack
header= LRH 8B + GRH 40B + BTH 12B + ETH var
payload= 256-4096B
ICRC 4B + VCRC 2B
L2 Addressing
routing inside local subnet. Each node has LID (local ID) 2B inside LRH
LID assigned by Subnet Manager at initialization and when topology changes.
HCAs: LID per port
Fixed form switch: 1 LID
Modular switches: 1 LID per module
Each subnet max 48k unicast LIDs, 16k multicast LIDs
QoS
enables prioritization of apps/users/data flows. Service Levels (SL) and Virtual Lanes (VL)
SL is in LRH: defines class of packet
VL is in LRH: implements multiple logical flows over a single physical link
different packets are mapped to different VLs based on SL (marking)
each VL has a weight and priority
each VL uses different buffers
each VL has a scheduler
Max 16 VL:
special VL: VL15: Subnet Manager traffic only
VL0: all data traffic
VL1-14: free to use to implement your QoS policy
Packet Forwarding
LID is read by switch to route to destination, checking the LFT (Linear Forwarding Table: table of DLID -> Exit Port)
Implementing QoS: LFT contains SL to VL mappings
# ibswitches -> list of switches with LID
# ibroute –> shows LFT of switch with LID 10 // OutPort=000 means the packet is processed by switch.
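A toy model of the LFT lookup, to fix the idea of "destination LID -> exit port". The table contents are invented, and dropping a packet for an unknown LID is my assumption, not something stated in the course notes:

```python
# Hypothetical LFT of one switch: destination LID -> exit port.
# Port 0 means the packet is addressed to the switch itself.
LFT = {10: 0, 11: 1, 12: 1, 13: 2}


def forward(lft: dict, dlid: int) -> str:
    """Return what the switch does with a packet addressed to `dlid`."""
    port = lft.get(dlid)
    if port is None:
        return "drop"      # assumption: no entry for this LID
    if port == 0:
        return "local"     # processed by the switch itself
    return f"port {port}"


if __name__ == "__main__":
    for dlid in (10, 12, 13, 99):
        print(dlid, "->", forward(LFT, dlid))
```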
Flow Control
Lossless Fabric. Flow Control: prevents a fast sender from overwhelming a slow receiver, to avoid drops and retransmissions.
Credit based FC: receiver sends credits to sender to indicate availability of receive buffers. Sender waits for credits before transmitting.
packets are not held forever. There is a timeout; if it expires, the packet is dropped.
Each VL can have a separate FC.
Data Integrity
by CRC: Cyclic Redundancy Check. Hash function. If calculation of CRC doesn’t match, packet is dropped and a resend is requested. end-to-end integrity
ICRC: invariant – all fields that don’t change. 32bit
VCRC: variant – whole packet. 16bit
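Credit-based flow control is easier to follow as a small simulation. This is only a sketch of the idea (receiver advertises free buffers, sender never sends without a credit); the class names are invented, it models a single VL, and the timeout/drop behaviour is left out:

```python
from collections import deque


class Receiver:
    def __init__(self, buffers: int):
        self.free = buffers      # buffers not yet advertised as credits
        self.queue = deque()     # packets waiting for the application

    def grant(self) -> int:
        """Advertise all currently free buffers to the sender as credits."""
        credits, self.free = self.free, 0
        return credits

    def accept(self, packet) -> None:
        self.queue.append(packet)

    def drain(self, count: int) -> int:
        """Application consumes up to `count` packets, freeing their buffers."""
        count = min(count, len(self.queue))
        for _ in range(count):
            self.queue.popleft()
        self.free += count
        return count


class Sender:
    def __init__(self):
        self.credits = 0

    def send(self, receiver: Receiver, packets) -> list:
        """Send while credits remain; return the packets still waiting."""
        pending = deque(packets)
        while pending and self.credits > 0:
            receiver.accept(pending.popleft())
            self.credits -= 1
        return list(pending)
```

Usage: give the receiver 2 buffers, call `tx.credits += rx.grant()`, and a third packet simply waits at the sender until `rx.drain()` frees a buffer and a new credit is granted. Nothing is ever dropped, which is the point of a lossless fabric.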
OFED
# iblinkinfo: all nodes in fabric: LID, GUID, hostname, link speeds
# ibnetdiscover: fabric discovery and list all nodes: LID, GUID, hostnames and link speeds. Generates a file with topology
5) Network Layer
routing solution overview
connect different subnets (each max 48k nodes)
routing benefits:
-scaling
-isolation: separation, fault resilience, reliability, availability
-subnet management per each subnet
-connectivity: each subnet can have different topology
network layer overview
handles routing of packets between subnets using GID in GRH 40B (Global Routing Header)
unicast and multicast
GID: Global ID – 128 bit – identifies single port or multicast group: GID = 64bit subnet prefix + port GUID (kind of like ipv6). globally unique across subnets
each HCA port has an automatically assigned default GID (fe80::) that can be used only in local subnet (kind of like ipv6 link-local)
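Since a GID is a 64-bit subnet prefix followed by the 64-bit port GUID, it can be built and printed like an IPv6 address. A small sketch using Python's ipaddress module; the GUID value is a made-up example:

```python
import ipaddress

DEFAULT_SUBNET_PREFIX = 0xFE80_0000_0000_0000  # fe80::/64, the local-subnet default


def make_gid(port_guid: int, subnet_prefix: int = DEFAULT_SUBNET_PREFIX) -> ipaddress.IPv6Address:
    """GID = 64-bit subnet prefix (high half) + 64-bit port GUID (low half)."""
    return ipaddress.IPv6Address((subnet_prefix << 64) | port_guid)


def split_gid(gid: ipaddress.IPv6Address) -> tuple:
    """Return (subnet_prefix, port_guid) from a 128-bit GID."""
    value = int(gid)
    return value >> 64, value & 0xFFFF_FFFF_FFFF_FFFF


if __name__ == "__main__":
    guid = 0x0002_C903_0000_1234  # hypothetical port GUID
    print(make_gid(guid))
```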
OFED
# ibv_devices -> ib devices installed in server (hcas)
# ibaddr -> displays GID and LID
6) Transport Layer
overview
end-2-end communication services for apps – virtual channel. segment/reassembly
channel end-points are called Queue Pairs (QPs): Each QP represents one end of a channel. QP bypasses kernel during data transfer. HCA oversees reliability
QP has a send and a receive queue. QP id is 24 bits. apps have direct access to hw: mapping app’s virtual address into the QP.
If an app requires more than 1 connection -> more QPs are created
QP workflow: A work queue is the app’s interface to the IB fabric. If app wants to send/receive data -> post a Work Request (WR) to a work queue (that is a WQE – WQ Element)
When the HCA completes a WQE, a completion queue element (CQE) is placed on a completion queue.
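The WR -> WQE -> CQE workflow can be pictured with plain queues. This is a toy model only (no real verbs API, names are invented, and the "HCA" is just a method call):

```python
from collections import deque


class QueuePair:
    """Toy QP: a send queue, a receive queue and a completion queue."""

    def __init__(self, qp_num: int):
        self.qp_num = qp_num & 0xFFFFFF   # QP id is 24 bits
        self.send_queue = deque()
        self.recv_queue = deque()
        self.completion_queue = deque()

    def post_send(self, payload: bytes) -> None:
        """App posts a Work Request; it sits on the send queue as a WQE."""
        self.send_queue.append({"op": "send", "payload": payload})

    def hca_process_send(self):
        """Stand-in for the HCA: complete one send WQE and report a CQE."""
        if not self.send_queue:
            return None
        wqe = self.send_queue.popleft()
        cqe = {"qp": self.qp_num, "op": wqe["op"], "status": "success"}
        self.completion_queue.append(cqe)
        return cqe
```

The app only touches `post_send` and the completion queue; everything in between is the HCA's job, which is where the kernel bypass comes from.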
Responsibilities: Three below
segmentation/reassembly
segment when message bigger than MTU, done by HCA. HCA receiver side reassembles. payload: 256-4096 bytes default mtu = 4096
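Segmentation and reassembly in miniature: a sketch that only splits a byte string into MTU-sized chunks and joins them back (the real HCA also adds headers and sequence numbers per packet, which is not modelled here):

```python
def segment(message: bytes, mtu: int = 4096) -> list:
    """Split `message` into chunks of at most `mtu` bytes (sender HCA side)."""
    return [message[i:i + mtu] for i in range(0, len(message), mtu)]


def reassemble(segments: list) -> bytes:
    """Join the chunks back in order (receiver HCA side)."""
    return b"".join(segments)


if __name__ == "__main__":
    msg = bytes(10_000)                 # 10,000-byte message of zeros
    parts = segment(msg)
    print([len(p) for p in parts])
```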
transport modes
QP has 4 transport service types. Source/Destination QPs must have same mode. Service type depends on app.
RC: reliable connection
UC: unreliable connection
RD: reliable datagram
UD: unreliable datagram
connected: dedicated QP for one connection at each end. Higher performance than datagram but more kernel memory consumed. Most used. Segmentation is supported
datagram: single QP serves multiple connections. Segmentation is not supported. More scalable than connected (similar to multicast)
reliable: each packet has Packet Seq Num (PSN). Receiver sends Acks if packets arrive in order, sends negative ack otherwise. Send QP has a timer. Similar to TCP.
unreliable: no ack.
partitions
divide large cluster into small isolated subclusters -> multitenancy, multi apps, security, qos. ports may be members of multiple partitions at once. ports in different partitions are unaware of each other.
PKEY: partition id. 16bit in BTH header. Carried in packets and stored in HCA. Used to determine partition membership. The Subnet Manager (SM) assigns the PKEY to the ports.
membership type: limited vs full
limited: can’t communicate with other limited members in the partition. all nodes may communicate with SM. Full<>Limited is always ok (with same PKEY). IE: storage, network mgmt.
default PKEY is 0x7fff. everything is part of that pkey and assigned by SM. And all are full members, so with the membership bit set it reads 0xffff = 65535
high-order bit (left most) in PKEY records the type of membership: 0 = limited / 1 = full -> 0x7fff = 0111 1111 1111 1111 (limited), 0xffff = 1111 1111 1111 1111 (full)
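The membership rules can be written as bit operations on the 16-bit PKEY. A minimal sketch, assuming only what the notes say: the top bit is the membership type, the low 15 bits are the partition, and two limited members cannot talk to each other:

```python
FULL_MEMBER_BIT = 0x8000     # high-order bit: 1 = full, 0 = limited
PARTITION_MASK = 0x7FFF      # low 15 bits identify the partition


def is_full_member(pkey: int) -> bool:
    return bool(pkey & FULL_MEMBER_BIT)


def partition_of(pkey: int) -> int:
    return pkey & PARTITION_MASK


def can_communicate(pkey_a: int, pkey_b: int) -> bool:
    """Same partition, and at least one side must be a full member."""
    same_partition = partition_of(pkey_a) == partition_of(pkey_b)
    return same_partition and (is_full_member(pkey_a) or is_full_member(pkey_b))


if __name__ == "__main__":
    print(can_communicate(0xFFFF, 0x7FFF))   # full <> limited, default partition
    print(can_communicate(0x7FFF, 0x7FFF))   # limited <> limited
```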
offloading
RDMA: remote direct memory access. data read/write to remote server bypassing CPU in both ends. zero buffer copy. reduce latency, increase throughput, cpu freed up
two methods for offloading:
-channel semantic: send/receive. Sending app has no visibility of the receiver’s buffer or data structure. Just sends data. Synchronous data flow
-memory semantic: rdma read/write
rdma write example: receiver side registers a buffer in its memory space and passes it to the sender. Sender uses RDMA send/write. Async communication. sender side does the same. send side puts a WQE. its hca generates CQE. The receiver HCA puts the data directly in memory, there is no WQE/CQE on receiver side.
ofed
perftest: read/write and send tests. client-server. cpu same in client and server.
latency perf test (-h)
server client
ib_read_lat ib_read_lat
ib_write_lat ib_write_lat
ib_send_lat ib_send_lat
bw perf test (-h)
server client
ib_read_bw ib_read_bw
ib_write_bw ib_write_bw
ib_send_bw ib_send_bw
7) Upper Layer
overview
support upper layer protocols (Native IB RDMA, IPoIB, etc). mgmt svc protocols (Subnet mgmt and subnet services). sw transport verbs to communicate with HCA/IB fabric (clients of upper layer)
upper layer protocols: MPI (for HPC), IPoIB (enables TCP/IP over IB), SDP (high perf interface for standard socket apps – TCP), SRP (SCSI devices over RDMA), iSER (zero copy RDMA to eliminate TCP and iSCSI bottleneck, better than SRP), NFS RDMA (NFS over RDMA)
management service protocols
-subnet mgmt: Uses special mgmt datagram (MAD) class called SMP: subnet mgmt packet -> uses special QP0, always uses VL15 and is not subject to flow control.
-general services: Uses MAD called GMP: General mgmt Packet. Each port has a QP1 and all GMPs received on QP1 are processed by one GSA (General Service Agent). GMP uses any VL except 15 (default 0), subject to Flow Control
sw transport verbs
verb: describes how an app requests actions from the messaging svc, e.g.
RDMA send: rdma_post_send, rdma_post_recv
RDMA write: rdma_post_write
RDMA read: rdma_post_read
OpenFabrics Alliance: defines the verbs specification.
— Fabric Mgmt —
8) Fabric Init
Init Stages
a subnet has a common Subnet ID. A router connects subnets. Each subnet has an SM (discovers topo, assigns LIDs to nodes, calculates/programs forwarding tables, manages all elements, monitors changes). The SM can be a server, switch or special device. Each node has an SMA (SM Agent) that communicates with the SM
1 Phy Fabric Establish: connect all cables
2 Subnet Discovery: once the SM wakes up, it starts discovery with directly connected nodes, and then their neighbors. SM gathers switch info, port info and host info. SM uses SMPs (SM packets)
3 Info gathering: SMPs use VL 15. Two types:
-Directed-route: forwarded based on a vector of port numbers. Not dependent on routing table entries. Provides a means to communicate before switches and hosts are configured (before LIDs are assigned). Mainly for discovery. Only the SMI (SM interface) allows these packets.
Two types of messages:
-- get: SM polls fabric with get.
-- get response: answer from devices.
Two types of commands:
-- get node / port info
-- get response node / port info
-LID-routed: forwarded using the switch forwarding table (after the SM populates it)
topo info gather: switches, hcas, ports, links. Topo described by node GUIDs and port numbers.
node info gather: type, number of ports, GUID, description.
port info gather: MTU, VLs, width (num lanes), speed.
4 LIDs Assignment: SM assigns LIDs to nodes.
HCA: 1 LID per port
1RU switches (1 ASIC): 1 LID for whole switch
Modular switch: 1 LID per module (linecard)
5 Paths Establishment
min-hop: calculate the number of hops required by each port to reach each destination LID. Shortest is best. tie-breaker: port with fewer LIDs assigned.
6 Port Config
LID (unique in subnet), width (number of physical lanes), MTU (default 4096), speed. QoS: VLs, SL to VL (mapping table Service Level to VL), VL arbitration
7 Switch Config SM populates the switch’s LFT with the best routes. LFT: destination LIDs -> exit port. And SL-VL table.
8 Subnet Activation
IB port physical states: polling (after power on, cable not connected), training (establish link sync), linkup (ready to transfer packets)
logical states: down (phy is down: polling or training), init (phy is up but only deals with SMP and flow control), armed (verify data transfer is fine: SM sends a dummy SMP with VCRC to verify it is not corrupted), active (SM sends active to port)
ofed
# ibswitches: GUID, description, ports and LID.
# ibroute <switch_LID>
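The path-establishment and switch-config stages above can be sketched in a few lines. This is only an illustration under my own assumptions: the topology, node names and LIDs are made up, and the tie-breaker is simplified to "port with fewer LIDs assigned so far, then lowest port number". It is not how OpenSM is implemented.

```python
from collections import deque

# Hypothetical fabric: node -> {port number: neighbor}.
LINKS = {
    "leaf1": {1: "spine1", 2: "spine2", 3: "hostA"},
    "leaf2": {1: "spine1", 2: "spine2", 3: "hostB", 4: "hostC"},
    "spine1": {1: "leaf1", 2: "leaf2"},
    "spine2": {1: "leaf1", 2: "leaf2"},
    "hostA": {1: "leaf1"},
    "hostB": {1: "leaf2"},
    "hostC": {1: "leaf2"},
}
LIDS = {"hostA": 10, "hostB": 11, "hostC": 12}  # hypothetical LIDs


def hops_to(dest):
    """BFS from dest: hop count from every node to dest."""
    dist = {dest: 0}
    queue = deque([dest])
    while queue:
        node = queue.popleft()
        for neigh in LINKS[node].values():
            if neigh not in dist:
                dist[neigh] = dist[node] + 1
                queue.append(neigh)
    return dist


def build_lft(switch):
    """Return {destination LID: exit port} for one switch (min-hop)."""
    lft = {}
    load = {port: 0 for port in LINKS[switch]}  # LIDs assigned per port
    for host, lid in sorted(LIDS.items(), key=lambda kv: kv[1]):
        dist = hops_to(host)
        best = min(dist[n] for n in LINKS[switch].values())
        candidates = [p for p, n in LINKS[switch].items() if dist[n] == best]
        port = min(candidates, key=lambda p: (load[p], p))  # tie-breaker
        lft[lid] = port
        load[port] += 1
    return lft


print(build_lft("leaf1"))
```

On leaf1 the two remote LIDs end up on different uplinks because of the tie-breaker, which is roughly the kind of LID -> exit port table that ibroute shows.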
9) Fabric monitoring
SM properties
election process for master SM: recommended 2x SM (master, standby). Each has a priority: 4 bits, default=0, highest=15. tie-breaker: lowest GUID.
SMInfo attribute: used by SMs to exchange info during subnet discovery and polling: GUID of the SM port, priority and SM state (master, standby)
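The election rule fits in one function. A minimal sketch, assuming each SM is represented as a (GUID, priority) tuple; that representation and the GUID values are mine.

```python
def elect_master(sms):
    """sms: list of (guid, priority) tuples.

    Highest priority (0..15) wins; the lowest GUID breaks ties.
    Returns the GUID of the elected master.
    """
    for guid, priority in sms:
        if not 0 <= priority <= 15:
            raise ValueError(f"priority must fit in 4 bits: {priority}")
    return min(sms, key=lambda sm: (-sm[1], sm[0]))[0]


# Two SMs left at the default priority 0: the lowest GUID becomes master.
print(hex(elect_master([(0x20, 0), (0x10, 0)])))
```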
SM failover / handover
SM Failover: master SM fails. Running sessions are not affected. New sessions need to wait for the new master. By default, LIDs are not reassigned by the new master.
SM Handover: a new SM with higher priority takes over the master role.
-avoid double failover: 1) avoid handover. 2) master_sm_priority=15 for all SMs (and higher than the current priority)
Monitoring
light sweep: every 10s. SM interrogates node and port info from all switches: port status changes, new SM appears, standby SM changes priority. A change detected by a light sweep causes a heavy sweep.
heavy sweep: light sweep detects a change or SM receives an IB trap -> SM triggers fabric discovery from scratch: topo discovery, new LIDs (if necessary), program fw tables! Current flows through unaffected paths are not affected by the rediscovery.
host down or leaf switch down -> avoid heavy sweep (not need to recalculate all fw tables in nodes) -> SM configuration: Ucast-cache=True
ofed
# sminfo -> master SM: LID, GUID, priority and state
# smpquery nd -> identify which node is running the SM
# saquery -s -> query all SMs (master and standby)
10) IB topologies part1
concepts
network topology: schematic arrangement of network elements: links, nodes
phy topology: how devices are connected
logical topo: how data moves from one node to another
considerations:
future growth: add new nodes without affecting performance or user experience
budget: effective and affordable
leaf-spine.
predictable and deterministic latency
scalability
redundancy
increase bw
topologies:
fat-tree
tree-like topology where links nearer the top of the hierarchy are “fatter” (more links/bw) than links further down. thickness = bandwidth. It is about the oversubscription ratio: downlinks / uplinks => 1:1 (non-blocking)
non-blocking: oversubscription 1:1 at all levels, higher cost (real fat-trees are often oversubscribed).
blocking: oversubscription 2:1, 3:1, 3:2, reduced cost, not full bw, low latency is maintained.
summary: good for hpc, non-blocking or oversubscription, lowest/deterministic latency (2levels->3hops, 3levels->5hops)
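A small worked example of the ratio and the hop counts above. The port counts are made up, and the hop formula 2 x levels - 1 is my generalisation of the two data points in the notes (2 levels -> 3 hops, 3 levels -> 5 hops).

```python
from fractions import Fraction


def oversubscription(downlinks, uplinks):
    """Oversubscription ratio of a switch as a 'down:up' string."""
    ratio = Fraction(downlinks, uplinks)
    return f"{ratio.numerator}:{ratio.denominator}"


def is_non_blocking(downlinks, uplinks):
    """1:1 (or better) means the level is non-blocking."""
    return Fraction(downlinks, uplinks) <= 1


def max_switch_hops(levels):
    """Worst case: up to the top level and back down again."""
    return 2 * levels - 1


# Hypothetical leaf switch: 24 ports to hosts, 12 ports to spines.
print(oversubscription(24, 12), is_non_blocking(24, 12))
print(max_switch_hops(2), max_switch_hops(3))
```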
dragonfly+ (BGP confederation)
connect groups in full-mesh, inside group leaf-spine. requires adaptive routing.
summary: support large number hosts, extending fabric without reserving ports (fat-tree requires recabling), lowlat and high bw: flexible and cost reduction
torus 3d
nodes connected in a ring formation in 3D (x,y,z). Each node has 2 links in each ring (3 rings = 3D) = 6 links to neighbor switches. Very scalable and resilient.
summary: good for locality, cabling simpler/shorter (less cost: cost-effective, power, resilient), main benefit: cost -> good for very large installs. Highly fault tolerant.
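The "2 links per ring, 6 links in total" rule is easy to check with coordinates. A minimal sketch, assuming switches are addressed as (x, y, z) and every ring has at least 3 switches (with fewer, the two neighbors on a ring would be the same switch).

```python
def torus_neighbors(node, dims):
    """Return the 6 neighbors of node=(x, y, z) in a 3D torus of size dims.

    Each axis is a ring, so coordinates wrap around with modulo.
    """
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            coord = list(node)
            coord[axis] = (coord[axis] + step) % dims[axis]
            neighbors.append(tuple(coord))
    return neighbors


# Corner switch of a hypothetical 4x4x4 torus: wrap-around links included.
print(torus_neighbors((0, 0, 0), (4, 4, 4)))
```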
adaptive routing (AR)
load balancing between equal best-cost paths (min-hop), which are installed in the FIB. For every connection the switch dynamically chooses the least congested port. Reduces contention.
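A rough sketch of the AR decision, assuming congestion is measured as egress queue depth per port (my assumption for the example; the real metric is up to the switch) and that ties go to the lowest port number.

```python
def pick_egress_port(min_hop_ports, queue_depth):
    """Pick the least congested port among the equal-cost (min-hop) ports.

    min_hop_ports: ports that all reach the destination with the best cost.
    queue_depth:   {port: packets currently queued} (hypothetical metric).
    """
    return min(min_hop_ports, key=lambda port: (queue_depth[port], port))


# Port 4 is idle but is not a min-hop port, so it is never considered.
depths = {1: 5, 2: 0, 3: 0, 4: 0}
print(pick_egress_port([1, 2, 3], depths))
```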
credit loops
IB uses credit-based flow control to avoid packet loss in congested switches: a sending port can send packets only if it has been granted credits by the receiving port.
credit loop: cyclic buffer dependency (buffers are full). It can (rarely) create a deadlock; in some cases you have to reboot a switch to fix it!
avoid credit loops: UpDown routing algo: prevents forwarding traffic from a downstream link to an upstream one. Forbidden: down -> up. Allowed paths: up, down, up -> down, same level (up-up, down-down)
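The UpDown rule can be sketched by ranking switches by distance from the roots and rejecting any path that turns up after it has gone down. The topology and names are made up, and this is only an illustration of the rule, not the routing engine itself.

```python
from collections import deque

# Hypothetical 2-level fabric: switch -> neighbor switches.
LINKS = {
    "spine1": ["leaf1", "leaf2"],
    "spine2": ["leaf1", "leaf2"],
    "leaf1": ["spine1", "spine2"],
    "leaf2": ["spine1", "spine2"],
}


def rank_switches(links, roots):
    """BFS from the root switches: rank = hops away from the nearest root."""
    rank = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        switch = queue.popleft()
        for neigh in links[switch]:
            if neigh not in rank:
                rank[neigh] = rank[switch] + 1
                queue.append(neigh)
    return rank


def is_legal_path(path, rank):
    """False if the path goes down (away from root) and later up again."""
    gone_down = False
    for here, nxt in zip(path, path[1:]):
        if rank[nxt] > rank[here]:      # moving away from the root: down
            gone_down = True
        elif rank[nxt] < rank[here] and gone_down:  # up after down
            return False
    return True


RANK = rank_switches(LINKS, ["spine1", "spine2"])
print(is_legal_path(["leaf1", "spine1", "leaf2"], RANK))
print(is_legal_path(["spine1", "leaf1", "spine2"], RANK))
```

The first path is up -> down (allowed); the second is down -> up (forbidden).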
10) IB topologies part2
routing engines: the way paths are chosen = routing protocol. Each RE uses its own algo according to the topology
min-hop: topo agnostic. default algo. 2 stages: 1) compute min-hop table on each switch 2) LFT output port assignment in each switch. Doesn't prevent credit loops.
up-down (+AR): fat-tree topo. Prevents deadlocks (min-hop can’t). algo: 1) start with root switches (rank 0). 2) Find all switches 1 hop away from root -> rank 1 3) Switches 2 hops away from root -> rank 2 4) and so on 5) Find the shortest path between every pair of endpoints 6) Any path that goes down (away from root) and then up (toward root) is discarded => rank N -> rank N+1 (down) -> rank N (up). avoid credit loops: forbidden paths go down (away from root) and then up (towards root)
fat-tree (+AR): fat-tree topo. fully-symmetrical fat-tree has its leaf switches connected with the same port index to each spine. Avoid credit loop like UpDown algo (forbidden paths down-up). Can do load-balancing to avoid congestion
torus 2-QoS: torus-2/3d topo. Free of credit loops, two levels of QoS. Self-heal (single failed switch, and/or multiple failed links) -> rerouting automatic by SM. Short run time, good scaling
dragonfly+ AR: dragonfly topo. Achieving max bw for different traffic patterns requires non-minimal multi-path routing => use min-hop+1 routes. You use min-hop+1 based on egress queue load (so you avoid congestion just by following a longer path). Triangle example. Credit loops prevention: -a path with down->up can potentially cause a credit loop. -credit-based flow control operates per VL: buffers are allocated per VL. Received credits are granted per virtual lane. -DragonFly+ uses VL increment to avoid credit loops: the VL value is incremented when a packet is forwarded in the down->up direction. 2 VLs are enough to prevent credit loops.
dragonfly connects “groups”
configure updn routing engine
/etc/opensm/opensm.conf -> default location of SM params
# opensm -c /etc/opensm/opensm.conf -> creates default SM config
For UpDown: provide the roots GUID list -> # ibswitches -> create list in /etc/opensm/root_guid.conf -> update opensm.conf with: root_guid_file /etc/opensm/root_guid.conf
Update opensm.conf with routing engine: routing_engine updn // or use # opensm -R updn
restart opensm: # service opensmd restart
check logs: grep table /var/log/opensm.log
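The VL increment trick can be illustrated per hop. A minimal sketch, assuming a path is given as a list of "up"/"down" hop directions and starts on VL 0; in this illustration a path has at most one down->up turn, which is why it never needs more than 2 VLs.

```python
def assign_vls(directions, start_vl=0):
    """Return the VL used on each hop of a path.

    directions: list of "up" / "down", one entry per hop.
    The VL is incremented whenever the packet turns from down to up,
    so the buffers used after the turn differ from the ones before it.
    """
    vl = start_vl
    vls = []
    previous = None
    for direction in directions:
        if previous == "down" and direction == "up":
            vl += 1
        vls.append(vl)
        previous = direction
    return vls


# Non-minimal route through an intermediate group: up, down, up, down.
print(assign_vls(["up", "down", "up", "down"]))
```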
— IB fabric bring up —
11) IB driver installation
what is OFED?: OpenFabrics Enterprise Distribution: open source sw for RDMA and kernel-bypass apps. nvidia-ofed: supports IB and ethernet. up to 400G. linux/windows/VMs.
install ofed linux:
hw requirements: 1GB space, supported linux, admin priv
prepare install: ofed_info -s (current version). For new install: kernel + os -> uname -a / cat /etc/os-release
hca installed: lspci -v | grep -i mellanox
download driver from nvidia site.
mount image, install: # mount -o ro,loop MLNX_OFED_FILE.iso /mnt
# cd /mnt && sudo ./mlnxofedinstall
restart: # /etc/init.d/openibd restart
verify: ofed_info | head -1 -> verify new version installed
ibstat -> verify HCA is discovered as IB node
12) HCA firmware upgrade
hca hw and tools overview: host-channel-adapter. If you install ofed, upgrade the hca too. You can upgrade the hca by itself. MFT tools: MST: NVIDIA software tools service. Flint: firmware burning tool. MLXfwreset: tool for loading firmware on 5th gen devices
firmware upgrade steps:
hca type: lspci | grep -i mellanox
hca info: ibv_devinfo -> hca_id, fw_ver, vendor_part, board_id (PSID)
download firmware: search card type and then check every OPN option until you find a PSID that matches board_id (above command)
unzip + burn: 1) find hca full path: # mst status (or start it: mst start) -> search for /dev/mst/…. 2) # flint -d /dev/mst/xxxxxx -i FIRMWARE.bin b /// b = burn !!!
reset: # mlxfwreset -d CARD reset
$ ibstat -> check that the fw version changed
13) Running the SM
SM on a server, switch or NVIDIA UFM. Consider fabric scale (number of nodes): init fabric, calculate fw tables, conf nodes and monitor changes. Licensing cost. Enhanced features
switch: inband or outband mgmt: mgmt in-band by SM, MLNX-OS has an embedded SM. Unmanaged switches don't have an SM. SM for small fabrics (up to 2048 nodes). No support for AR and dragonfly. No additional license. enable sm:
# enable
# conf t
# show ib sm -> disable
# ib sm
# show ib sm -> enable
configure sm:
# ib sm sm-priority 14
# show ib sm sm-priority
# ib sm ? ==> options
# ib sm routing-engine ? => change routing engine from min-hop (default)!
server: large-medium fabrics. opensm included in mlnx-ofed. no license. supports AR and dragonfly
run opensm: # opensm -h
or run as a daemon: # /etc/init.d/opensmd start
/etc/init.d/opensmd status
logging: /var/log/messages (general) + /var/log/opensm.log (detailed errors)
config: opensm -c /etc/opensm/opensm.conf -> creates default config file
routing engine config: a list, tries one by one until success: routing_engine ar_updn (nov 2021 default RE is updn with AR)
UFM (Unified Fabric Manager): WebUI solution: telemetry, analytics, etc. Uses OpenSM. Can run on a server as a service, docker or dedicated hw. Editions: telemetry, enterprise (telemetry + enhanced monitoring and mgmt), cyber-ai (telemetry + enterprise + security). enterprise: licensed per managed device. WebUI: settings -> subnet manager, settings -> network management: routing engine
— IB monitoring —
14) IB diagnostics
node-level:
ofed_info: mlnx_ofed driver version
lspci: find hca
ibstat: link status
ibportstate LID PORT
ibroute LID: routing table of switch LID
ibv_devices: list hcas
ibv_devinfo: list hca details
fabric-level:
ibswitches: list switches
ibhosts: list hcas
ibnodes: list all nodes
ibnetdiscover: show node-to-node connectivity
iblinkinfo: list all nodes and connectivity info
sminfo: show master sm
ibping
ibtracert SLID DLID
ibdiagnet
ib_write_lat ib_read_lat ib_write_bw ib_read_bw
ibdiagnet: fabric discovery, error detection and diagnostics. part of ibutils2 package. part of mlnx_ofed and ufm. fabric discovery, duplicated GUIDs, duplicated node descriptions, LID checks, links in INIT state, counters, error counters check, routing checks, link width and speed checks, topology matching, partition checks and BER test.
I remember that several years ago in one of my jobs we used to apply this… I thought it was genius, although I don't remember it being called the Eisenhower Matrix. Anyway, obviously, I never applied it to myself. And today, I have read the best explanation I can remember so far. So I must post it.