[x86_64] Use 4-way poly1305 update
This MR adds a 4-way poly1305 block update for x86_64 that uses AVX2 instructions on a radix 26 representation. It improves throughput by roughly 30% on Comet Lake (figures below). The AVX2 code is used only for inputs of at least 32 blocks, since that is where it starts to make a performance difference. Fat build support will be added later, once this MR is approved.
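For reviewers unfamiliar with the idea, here is a minimal sketch of why a 4-way update can give the same result as the serial one. It is not the code in this MR: it uses a toy 31-bit prime instead of 2^130 - 5, plain C instead of AVX2, and the names `TOY_P`, `mulmod`, `poly_serial` and `poly_4way` are hypothetical. It only illustrates the usual trick of running four independent accumulators that each step by r^4 and are combined with r^4, r^3, r^2, r at the end.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy prime modulus (2^31 - 1). Assumption for illustration only;
   real Poly1305 works modulo 2^130 - 5. */
#define TOY_P 2147483647ULL

/* Modular multiply; both reduced operands are below 2^31, so the
   product fits in 64 bits. */
static uint64_t mulmod(uint64_t a, uint64_t b)
{
  return (a % TOY_P) * (b % TOY_P) % TOY_P;
}

/* Serial Horner evaluation: h = (h + m[i]) * r for each block. */
static uint64_t poly_serial(const uint64_t *m, size_t n, uint64_t r)
{
  uint64_t h = 0;
  for (size_t i = 0; i < n; i++)
    h = mulmod((h + m[i] % TOY_P) % TOY_P, r);
  return h;
}

/* 4-way evaluation: four lanes, each advanced by r^4 per group of
   four blocks. n is assumed to be a multiple of 4; a real
   implementation would pass leftover blocks to the 1-way path. */
static uint64_t poly_4way(const uint64_t *m, size_t n, uint64_t r)
{
  uint64_t r2 = mulmod(r, r);
  uint64_t r3 = mulmod(r2, r);
  uint64_t r4 = mulmod(r2, r2);
  uint64_t lane[4] = { 0, 0, 0, 0 };

  for (size_t i = 0; i + 4 <= n; i += 4)
    for (int j = 0; j < 4; j++)
      lane[j] = (mulmod(lane[j], r4) + m[i + j] % TOY_P) % TOY_P;

  /* Lane j holds blocks j, j+4, j+8, ... and needs a final factor
     of r^(4-j) to line up with the serial result. */
  return (mulmod(lane[0], r4) + mulmod(lane[1], r3)
          + mulmod(lane[2], r2) + mulmod(lane[3], r)) % TOY_P;
}
```

For any block count that is a multiple of 4, `poly_serial` and `poly_4way` return the same value. In a SIMD version the four lanes map onto vector elements, and 26-bit limbs leave headroom so that 32x32-bit products can be accumulated in 64-bit elements before carry propagation.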
Tested on Intel Core i5-10300H
| 1-way (Radix 64) | 4-way (AVX2 Radix 26) |
|---|---|
| 4820.42 Mbyte/s | 6256.05 Mbyte/s |