* optimize `test_ct_intlog2` while still covering all 128 bit positions
* refactor whirlpool to reduce code bloat:
  replaced the fully unrolled round loop with a runtime loop, reducing
  the instruction count of `process_block` by 80k and yielding an
  approximately 30% performance boost due to improved cache locality.
* use compile-time arrays for `test_ct_intlog2`
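
A minimal sketch of what a compile-time input table for `test_ct_intlog2` might look like, written in Rust purely for illustration. The names `intlog2`, `single_bit_inputs`, `INPUTS`, and `check_all_bit_positions` are hypothetical and are not taken from the project; the stand-in `intlog2` is a plain floor-log2 and makes no constant-time claim.

```rust
// Hypothetical stand-in for the function under test: floor(log2(x)) for x > 0.
// This is NOT the project's implementation and is not claimed to be constant-time.
fn intlog2(x: u128) -> u32 {
    127 - x.leading_zeros()
}

// Builds the test inputs at compile time: entry i has only bit i set.
const fn single_bit_inputs() -> [u128; 128] {
    let mut out = [0u128; 128];
    let mut i = 0;
    while i < 128 {
        out[i] = 1u128 << (i as u32);
        i += 1;
    }
    out
}

// The table is evaluated by the compiler, so the test does no setup work at run time.
const INPUTS: [u128; 128] = single_bit_inputs();

// Checks every one of the 128 bit positions and returns how many were checked.
fn check_all_bit_positions() -> usize {
    let mut checked = 0;
    for (i, &x) in INPUTS.iter().enumerate() {
        assert_eq!(intlog2(x), i as u32);
        checked += 1;
    }
    checked
}
```

Precomputing the inputs as a constant keeps the per-test work to a single pass over the table while still exercising each bit position once.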