Fast case conversion or how to really compress sparse arrays

Posted on

If we are working with the 7-bit ASCII set, the tables will take 512 bytes.

I would expect 512 bytes (2 x 28 bytes) to be for the 8-bits ASCII set; so I think there’s either a typo or confusion there.

Very cool article.

It’s important to note that these days the cost of bitwise operations/additions/multiplications is generally peanuts compared to a cache miss.

Compressing data-tables means minimizing the cache foot-print, which ultimately means:

  • Less cache misses when executing the case-conversion.

  • Less cache misses when executing the rest of the program; as the case-conversion will have evicted less entries from the cache.

Of course, that’s only the case if your program is cache-bound to start with, but programs become cache-bound early. Using AMD Zen 3 for cache size and AMD Zen 2 for latency numbers:

  • L3 cache: 16 MB per CCD. Miss it? 66ns to access RAM.

  • L2 cache: 512 KB per core. Miss it? 10ns to access L3.

  • L1 cache: 32 KB (data) per core. Miss it? 3ns to access L2.

It’s not rare for an application to require more than 16MB of memory; and you’re sharing with other processes on the same socket. But even if your application only requires 1MB of data, though, the difference between 10ns access vs 66ns access is huge: that’s a 6x factor!

It’s too easy for micro-benchmarks to not take into account the effect of cache misses and therefore look-up tables. Unless your micro-benchmark is specially rigged to trash the cache somehow, LUTs will generally perform super well because — well — the micro-benchmark is the only thing using the CPU cache at that point.

If you are working on a “sporadic” function — eg. you don’t tend to do case-conversion of MBs of data in a loop, you do it on 1 key worth a couple bytes at a time, in between other work — the performance in the context of the application may very well be quite different, though, because suddenly you’re sharing the cache.