[x86_64] Implement Poly1305 based on 2^26 using AVX2
This patch adds an optimized version of Poly1305 based on radix 2^26, using AVX2 instructions and YMM registers. It interleaves four blocks horizontally in each loop iteration.
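For readers unfamiliar with the radix 2^26 representation, here is a minimal C sketch of how one 16-byte message block can be split into five 26-bit limbs. It only illustrates the general technique; the helper names (`le32`, `block_to_limbs26`, `LIMB_MASK`) are hypothetical, and the register layout in the actual assembly may differ.

```c
#include <stdint.h>

/* Hypothetical helper: read a 32-bit little-endian word. */
static uint32_t le32(const uint8_t *p)
{
  return (uint32_t)p[0]
       | ((uint32_t)p[1] << 8)
       | ((uint32_t)p[2] << 16)
       | ((uint32_t)p[3] << 24);
}

#define LIMB_MASK 0x3ffffffu  /* low 26 bits */

/* Split a full 16-byte block into five 26-bit limbs (radix 2^26)
   and set the padding bit 2^128 that Poly1305 appends to every
   full block. Limb i holds bits 26*i .. 26*i+25 of the block. */
static void block_to_limbs26(uint32_t h[5], const uint8_t m[16])
{
  h[0] =  le32(m +  0)       & LIMB_MASK;  /* bits   0..25  */
  h[1] = (le32(m +  3) >> 2) & LIMB_MASK;  /* bits  26..51  */
  h[2] = (le32(m +  6) >> 4) & LIMB_MASK;  /* bits  52..77  */
  h[3] = (le32(m +  9) >> 6) & LIMB_MASK;  /* bits  78..103 */
  h[4] = (le32(m + 12) >> 8)               /* bits 104..127 */
       | ((uint32_t)1 << 24);              /* bit 128 (padding) */
}
```

With 26-bit limbs, each limb product fits in a 64-bit lane with room to spare, so several products can be accumulated before carries are propagated. This is what makes the representation a natural fit for 32x32->64-bit SIMD multiplies such as AVX2's `vpmuludq`.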
The patch adds a new configure option, --enable-x86-avx2, to compile the AVX2 files.
All tests in the testsuite pass with this patch applied.
Benchmark of poly1305 update on an Intel Core i5-10300H CPU:
Upstream (standard, radix 2^64) | This patch (AVX2, radix 2^26) |
---|---|
3900.75 Mbyte/s (1.136 cpb) | 6490.70 Mbyte/s (0.691 cpb) |