[PowerPC] Implement _nettle_poly1305_blocks based on radix 2^44
This patch optimizes Poly1305 for the powerpc64 architecture by using the POWER9-specific instruction vmsumudm
for full 64-bit multiplication, processing 4 blocks in parallel in a radix 2^44 representation.
The testsuite passes with this patch applied.
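
For readers unfamiliar with the representation, here is a minimal portable C sketch of radix 2^44 arithmetic modulo 2^130 - 5. It is not the code in this patch: the function names are hypothetical, `unsigned __int128` (a GCC/Clang extension, assumed available) stands in for the 128-bit products that the assembly obtains with vmsumudm, and only one block is handled rather than four in parallel.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128;   /* GCC/Clang extension, assumed available */

#define MASK44 ((UINT64_C(1) << 44) - 1)
#define MASK42 ((UINT64_C(1) << 42) - 1)

/* Read 8 bytes as a little-endian 64-bit integer. */
static uint64_t load_le64(const uint8_t *p)
{
  uint64_t v = 0;
  for (int i = 7; i >= 0; i--)
    v = (v << 8) | p[i];
  return v;
}

/* Split a 16-byte block, plus the 2^128 padding bit, into limbs of
   44, 44 and 41 bits: value = m[0] + m[1]*2^44 + m[2]*2^88. */
static void block_to_radix44(uint64_t m[3], const uint8_t block[16])
{
  uint64_t t0 = load_le64(block);
  uint64_t t1 = load_le64(block + 8);
  m[0] = t0 & MASK44;
  m[1] = ((t0 >> 44) | (t1 << 20)) & MASK44;
  m[2] = (t1 >> 24) | (UINT64_C(1) << 40);   /* 2^128 = 2^40 * 2^88 */
}

/* h = h * r mod 2^130 - 5, left partially reduced (hypothetical helper). */
static void mul_radix44(uint64_t h[3], const uint64_t r[3])
{
  assert(r[0] <= MASK44 && r[1] <= MASK44 && r[2] <= MASK42);

  /* A product of weight 2^132 = 4 * 2^130 is congruent to 4 * 5 = 20. */
  uint64_t s1 = r[1] * 20;
  uint64_t s2 = r[2] * 20;

  /* Each row is a sum of full 64x64-bit products. */
  u128 d0 = (u128)h[0] * r[0] + (u128)h[1] * s2 + (u128)h[2] * s1;
  u128 d1 = (u128)h[0] * r[1] + (u128)h[1] * r[0] + (u128)h[2] * s2;
  u128 d2 = (u128)h[0] * r[2] + (u128)h[1] * r[1] + (u128)h[2] * r[0];

  /* Carry propagation; the carry out of the top limb has weight 2^130
     and wraps around multiplied by 5. */
  d1 += (uint64_t)(d0 >> 44);
  d2 += (uint64_t)(d1 >> 44);
  uint64_t c = (uint64_t)(d2 >> 42);
  h[0] = ((uint64_t)d0 & MASK44) + c * 5;
  h[1] = ((uint64_t)d1 & MASK44) + (h[0] >> 44);
  h[0] &= MASK44;
  h[2] = (uint64_t)d2 & MASK42;
}
```

With three limbs, every partial product fits comfortably in 128 bits, so the three rows can be accumulated without intermediate carries. That property is what makes a multiply-sum instruction a natural fit for this radix.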
Benchmark of Poly1305 update using nettle-benchmark on POWER9:
| C | single block (2^64) | multi blocks (2^44) |
|---|---|---|
| 472.63 Mbyte/s | 658.45 Mbyte/s | 2136.30 Mbyte/s |