This should give a clearer picture of how fast the procedures actually
run, since the timed loop no longer goes back and forth to the system
for the time on every iteration.
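
As a rough illustration of the idea (not the benchmark code from this change), the clock can be read once before the loop and once after it, with the elapsed time divided by the iteration count. The names `count_byte`, `now_ns`, and `bench_ns_per_call` are hypothetical, and the sketch assumes a POSIX system that provides `clock_gettime`.

```c
#define _POSIX_C_SOURCE 199309L
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the procedure under test. */
static size_t count_byte(const unsigned char *data, size_t len, unsigned char c) {
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        n += (data[i] == c);
    }
    return n;
}

/* Monotonic clock reading in nanoseconds (assumes POSIX clock_gettime). */
static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Average nanoseconds per call. The clock is read only twice, outside
 * the loop, so the cost of asking the system for the time is not
 * charged to every iteration. The accumulated result is written to
 * *sink so the calls have an observable effect. */
static double bench_ns_per_call(const unsigned char *data, size_t len,
                                int rounds, size_t *sink) {
    size_t acc = 0;
    uint64_t start = now_ns();
    for (int i = 0; i < rounds; i++) {
        acc += count_byte(data, len, 'a');
    }
    uint64_t end = now_ns();
    *sink = acc;
    return (double)(end - start) / (double)rounds;
}
```

Calling `bench_ns_per_call(buf, len, rounds, &sink)` with a large enough `rounds` keeps the two clock reads negligible next to the work being measured.
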
- Assume unaligned loads are cheap
- Explicitly use 256-bit or 128-bit SIMD to avoid AVX512
- Limit "vectorized" scanning to 128 bits if SIMD is emulated via SWAR
- Add a few more benchmark cases
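
For readers unfamiliar with SWAR, the following is a minimal sketch of what a 128-bit "vectorized" scan looks like when it is emulated with two 64-bit words per step. It is not the code from this change: `has_zero_byte` and `swar_index_byte` are hypothetical names, and the `memcpy` loads reflect the assumption that unaligned loads are cheap.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero if and only if some byte of x is zero (classic SWAR test). */
static uint64_t has_zero_byte(uint64_t x) {
    const uint64_t ones  = 0x0101010101010101ull;
    const uint64_t highs = 0x8080808080808080ull;
    return (x - ones) & ~x & highs;
}

/* Index of the first occurrence of c in data, or -1 if absent.
 * Scans 16 bytes (two 64-bit words) per step, then finishes with a
 * plain byte loop, which also pinpoints a match found by the SWAR test. */
static ptrdiff_t swar_index_byte(const unsigned char *data, size_t len,
                                 unsigned char c) {
    /* Broadcast c into every byte of a 64-bit word. */
    const uint64_t pattern = 0x0101010101010101ull * c;
    size_t i = 0;

    for (; i + 16 <= len; i += 16) {
        uint64_t lo, hi;
        /* memcpy expresses an unaligned load without undefined behavior. */
        memcpy(&lo, data + i, sizeof lo);
        memcpy(&hi, data + i + 8, sizeof hi);
        /* XOR turns matching bytes into zero bytes. */
        if (has_zero_byte(lo ^ pattern) | has_zero_byte(hi ^ pattern)) {
            break; /* a match lies somewhere in these 16 bytes */
        }
    }

    /* Scalar scan: locates the match in the flagged block, or handles
     * the tail that is shorter than 16 bytes. */
    for (; i < len; i++) {
        if (data[i] == c) {
            return (ptrdiff_t)i;
        }
    }
    return -1;
}
```

Each extra word per step adds work to the scalar registers without any real vector hardware behind it, which is one plausible reason to cap the emulated width at 128 bits.
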