Bloom filters, profiling, performance numbers

One more cloudflare blog that I had in the to-read list:

I had never heard about Bloom filters so that was interesting and the actual uses of them:

I like his point to chose  ‘m’, number of bits in the bit array, to be a power of two (module operation becomes a bitwise AND):

But at the end, it is not all about the Bloom filters. It is understanding how things work under the hood and see if they are actually delivering, if not, you should change your approach. So the debugging section “A secret weapon – a profiler” is very good. Profiling is not one of my strengths so the tools used are the ones I need to understand and use more often:

strace -cf
perf stat -d
perf record
perf record | head -n 20
perf annotate process_line --source
google-perftools' with kcachegrind  

As well the reference to the performance numbers that are good to have in mind:

So I take a copy here:

  • L1 cache reference 0.5 ns
  • Branch mispredict 5 ns
  • L2 cache reference 7 ns
  • Mutex lock/unlock 100 ns
  • Main memory reference 100 ns
  • Compress 1K bytes with Zippy 10,000 ns
  • Send 2K bytes over 1 Gbps network 20,000 ns
  • Read 1 MB sequentially from memory 250,000 ns
  • Round trip within same datacenter 500,000 ns
  • Disk seek 10,000,000 ns
  • Read 1 MB sequentially from network 10,000,000 ns
  • Read 1 MB sequentially from disk 30,000,000 ns
  • Send packet CA->Netherlands->CA 150,000,000 ns 

Things to take in mind:

  • Notice the magnitude differences in the performance of different options.
  • Datacenters are far away so it takes a long time to send anything between them.
  • Memory is fast and disks are slow.
  • By using a cheap compression algorithm a lot (by a factor of 2) of network bandwidth can be saved.
  • Writes are 40 times more expensive than reads.
  • Global shared data is expensive. This is a fundamental limitation of distributed systems. The lock contention in shared heavily written objects kills performance as transactions become serialized and slow.
  • Architect for scaling writes.
  • Optimize for low write contention.
  • Optimize wide. Make writes as parallel as you can.

As well, “The lessons learned” is a great summary of his trip.

  • Sequential memory access great / Random memory access costly -> cache prefetching
  • Advanced data structures to fit L3: optimize for reduced number loads than the amount of memory used.
  • CPU hits the memory wall

So another great post from Marek.