- replaced manual unrolling with loop structures and constant arrays
- instruction count reduced from 12445 to 4016
- roughly 1 to 2% performance loss on some benchmarks, but take this
number with a grain of salt
* optimize `test_ct_intlog2` while still covering all 128 bit positions
* refactor whirlpool to reduce code bloat
replaced the fully unrolled round loop with a runtime loop, reducing the
instruction count by 80k in `process_block` and yielding an approximately
30% performance boost due to improved cache locality.
* use compile-time arrays for `test_ct_intlog2`
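
A minimal sketch of the compile-time-array approach for a test like `test_ct_intlog2`, written in Rust purely for illustration (the project's actual language and code are not shown here). The `intlog2` function is a hypothetical stand-in for the routine under test, and `INPUTS` and `check_all_bit_positions` are assumed names. It shows how a table built at compile time plus one loop can still cover all 128 bit positions without a hand-unrolled list of cases:

```rust
/// Hypothetical stand-in for the function under test:
/// floor(log2(x)) for x > 0.
fn intlog2(x: u128) -> u32 {
    127 - x.leading_zeros()
}

/// Table of single-bit inputs, built at compile time:
/// INPUTS[i] == 1 << i for every i in 0..128.
const INPUTS: [u128; 128] = {
    let mut table = [0u128; 128];
    let mut i = 0;
    while i < 128 {
        table[i] = 1u128 << (i as u32);
        i += 1;
    }
    table
};

/// Checks every bit position with a single loop over the table
/// instead of one hand-written case per position.
fn check_all_bit_positions() -> bool {
    INPUTS
        .iter()
        .enumerate()
        .all(|(i, &x)| intlog2(x) == i as u32)
}
```

The same pattern, a constant table indexed by a runtime loop, is the general shape of the unrolled-to-loop refactors listed above: the data moves into an array and the repeated body is emitted once.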