Calling Bullshit

This is an interesting book about the flood of data we have to wade through and how difficult it is to figure out what is true and what is not. I feel it often: you read something “scientific” with many numbers, stats, etc. and you kind of believe it has to be true. The same goes for those amazing new pharmaceutical drugs or the latest paper with a dramatic breakthrough.

Interesting points:

With the hype about machine learning, the algorithm itself may be beyond our understanding, but the critical thing is the training data fed into that algorithm. GIGO = Garbage In, Garbage Out. If the training data is “biased” or not relevant, imagine what the result is going to be.

Correlation is not causation. This is a difficult topic because we very easily see causation everywhere, or find one that matches our theory.

Goodhart’s law adapted to normal people: “When a measure becomes a target, it ceases to be a good measure”. That’s so true. Think of your performance review at work, the GPU tests, etc.

Regarding the stats, it is important to pay attention to the axes: do they start at zero? Do they use the same proportions/scales? Be mindful of 3D charts and of “ducks”, decoration that obscures the meaningful data.

If it seems too good (or too bad) to be true, then it probably isn’t true.

“Mathiness”: formulas and expressions that look like “good” math but lack logical coherence and formal rigor. This is very typical for things that are not really easy to quantify (e.g. healthcare quality management): how are things measured? In which units? Etc.

One of my favourite examples is the paper about the fMRI scan of the brain of a dead (!) salmon while it was shown pictures of people in different moods. This was important to make clear that fMRI results are maybe not as perfect as you expect. I assume that has improved nowadays…

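Prosecutor’s fallacy: you need to prove your client is innocent although there is a DNA match in a database. The test has an error rate (a false match) of 1 in 10,000,000. With roughly 50 million people in the database, that means about 5 innocent people will match by chance.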

            Match    No Match
Guilty      1        0
Innocent    5        50,000,000

You are the defence lawyer and you want to focus on the Match column (the left one). It says there are 5 chances out of 6 (5+1) that your client is innocent despite the DNA match.
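The arithmetic behind those numbers can be checked with a few lines of Python. This is only a sketch: the database size of 50 million is an assumption read off the table, and the names are made up for illustration.

```python
from fractions import Fraction

# Counts from the Match column of the table.
guilty_match = 1
innocent_match = 5


def p_innocent_given_match(guilty: int, innocent: int) -> Fraction:
    """Share of the people who match that are innocent."""
    return Fraction(innocent, guilty + innocent)


# Where the 5 comes from: error rate times the number of innocent people
# in the database (assumed to be about 50 million, as in the table).
false_match_rate = Fraction(1, 10_000_000)
database_size = 50_000_000
expected_false_matches = false_match_rate * database_size

print(expected_false_matches)                                # 5
print(p_innocent_given_match(guilty_match, innocent_match))  # 5/6
```

The point is that "1 in 10,000,000" is the chance of a match given innocence, not the chance of innocence given a match. The second one is 5/6 here.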

p-values: null hypothesis and alternative hypothesis. Most papers use a p-value threshold of 0.05 as the bar for a “significant” result (and once that threshold is the target, you have Goodhart’s law again).
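To see why a fixed 0.05 threshold is easy to game, here is a small sketch of how often you get at least one “significant” result by pure chance when you run several tests. It assumes the tests are independent and that the null hypothesis is true in all of them; the function name and the numbers of tests are my own illustration, not from the book.

```python
def chance_of_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of num_tests independent tests
    of a true null hypothesis gives p < alpha."""
    return 1 - (1 - alpha) ** num_tests


for n in (1, 5, 20, 100):
    print(n, round(chance_of_false_positive(n), 3))
```

With 20 tests the chance is already about 64%, so if you try enough things, something will look like a breakthrough.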

Refuting Bullshit:

  • Use “reductio ad absurdum”
  • Be memorable (dead salmon example)
  • Find counterexamples (immune system theory vs trees)
  • Provide analogies (74M$ -> 2sec faster)
  • Redraw figures
  • Deploy a null model

I am leaving out a lot of things that I don’t remember, but the book is worth reading (and more than once).

In summary, the goal is to be a “smart” sceptic and not believe everything thrown at us.