This should result in a clearer idea of how fast the procedures are
running, as the loop can run without going back and forth to the system
for the time.
- Assume unaligned loads are cheap
- Explicilty use 256-bit or 128-bit SIMD to avoid AVX512
- Limit "vectorized" scanning to 128-bits if SIMD is emulated via SWAR
- Add a few more benchmark cases
This new algorithm uses a Scalar->Vector->Scalar iteration loop which
requires no masking off of any incomplete data chunks.
Also, the width was reduced to 32 bytes instead of 64, as I found this
to be about as fast as the previous 64-byte x86 version.
The `original_pattern` introduced a tenuous dependency to the expression
value as a whole, and after some consideration, I decided that it would
be better for the developer to manage their own pattern strings.
In the event you need to print the text representation of a pattern,
it's usually better that you manage the memory of it as well.