nettle merge requests
https://git.lysator.liu.se/nettle/nettle/-/merge_requests
2024-03-28T19:05:54Z

https://git.lysator.liu.se/nettle/nettle/-/merge_requests/63
Implement SHAKE128
2024-03-28T19:05:54Z
Daiki Ueno

https://git.lysator.liu.se/nettle/nettle/-/merge_requests/60
Implement RSA-OAEP encryption/decryption
2024-02-15T19:16:48Z
Daiki Ueno

This extends !20 by Nicolas with a side-channel silent decryption operation.

https://git.lysator.liu.se/nettle/nettle/-/merge_requests/59
Use Test instruction instead of And to check remaining single block
2023-04-07T08:57:47Z
Maamoun TK

We don't need the output of the And instruction when checking whether a single block remains, so I replaced it with the Test instruction.

https://git.lysator.liu.se/nettle/nettle/-/merge_requests/57
[x86_64] Use 2-way GHASH pclmul update
2023-04-03T05:27:37Z
Maamoun TK

I observed that pclmulqdq has a latency of 7 cycles on the Comet Lake architecture and a reciprocal throughput of 7/7 = 1, so a 2-way GHASH block update nearly doubles throughput on that architecture.
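
The gain comes from breaking the serial dependency between block updates: a single chain must wait out the full multiply latency for every block, while two independent multiplies per pair of blocks can overlap in the pipeline. Below is a minimal Python sketch of the algebra behind a 2-way update. It assumes the patch uses the usual aggregated form with a precomputed H^2, which the description does not spell out. The names `gf_mul`, `ghash_1way`, and `ghash_2way` are illustrative and are not nettle's.

```python
import random

R = 0xE1 << 120  # reduction constant for x^128 + x^7 + x^2 + x + 1

def gf_mul(x, y):
    """Multiply in GF(2^128) using GCM bit order (bit 127 of the int is x^0)."""
    z, v = 0, x
    for i in range(127, -1, -1):
        if (y >> i) & 1:
            z ^= v
        # Multiply v by x: shift right, fold in R when a bit falls off.
        v = (v >> 1) ^ R if v & 1 else v >> 1
    return z

def ghash_1way(h, blocks, y=0):
    """Serial update: every multiply depends on the previous one."""
    for x in blocks:
        y = gf_mul(y ^ x, h)
    return y

def ghash_2way(h, blocks, y=0):
    """Two blocks per step: the two multiplies are independent of each other."""
    h2 = gf_mul(h, h)
    even = len(blocks) - len(blocks) % 2
    for i in range(0, even, 2):
        y = gf_mul(y ^ blocks[i], h2) ^ gf_mul(blocks[i + 1], h)
    if len(blocks) % 2:  # leftover single block
        y = gf_mul(y ^ blocks[-1], h)
    return y

rng = random.Random(0)
h = rng.getrandbits(128)
blocks = [rng.getrandbits(128) for _ in range(5)]
print(ghash_1way(h, blocks) == ghash_2way(h, blocks))
```

Both functions compute the same value because ((Y + X1)·H + X2)·H = (Y + X1)·H^2 + X2·H in the field; only the dependency structure differs.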
Tested on Intel Core i5-10300H
| 1-way (Former) | 2-way |
| ------ | ------ |
| 3014.85 Mbyte/s | 6010.53 Mbyte/s |

https://git.lysator.liu.se/nettle/nettle/-/merge_requests/56
[PowerPC] Implement _nettle_poly1305_blocks based on radix 2^44
2022-11-09T19:56:12Z
Maamoun TK

This patch optimizes Poly1305 for the powerpc64 architecture by using the POWER9-specific instruction `vmsumudm` for full 64-bit multiplication, applied to four blocks in parallel based on radix 2^44.
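
To make the radix 2^44 idea concrete, here is a minimal Python sketch under stated assumptions: it splits 130-bit values into three limbs (44 + 44 + 42 bits), multiplies limb by limb using 2^132 ≡ 20 (mod 2^130 - 5), and checks that a four-block step with precomputed powers of r matches the serial update. Python integers stand in for the vector registers, and a plain `% P` replaces the carry propagation an assembly version would perform. The names `mul44`, `update_serial`, and `update_4way` are hypothetical.

```python
import random

P = (1 << 130) - 5
MASK44 = (1 << 44) - 1

def to_limbs(v):
    """Split a value below 2^130 into limbs of 44, 44 and 42 bits."""
    return v & MASK44, (v >> 44) & MASK44, v >> 88

def mul44(a, b):
    """Multiply a*b mod P via radix 2^44 limb products."""
    h0, h1, h2 = to_limbs(a)
    r0, r1, r2 = to_limbs(b)
    s1, s2 = 20 * r1, 20 * r2  # 2^132 = 4 * 2^130, which is 20 mod P
    d0 = h0 * r0 + h1 * s2 + h2 * s1
    d1 = h0 * r1 + h1 * r0 + h2 * s2
    d2 = h0 * r2 + h1 * r1 + h2 * r0
    # Real code would propagate carries between limbs instead of this.
    return (d0 + (d1 << 44) + (d2 << 88)) % P

def update_serial(h, r, msgs):
    for m in msgs:
        h = mul44((h + m) % P, r)
    return h

def update_4way(h, r, msgs):
    """Four blocks per step; len(msgs) must be a multiple of 4."""
    r2 = mul44(r, r)
    r3 = mul44(r2, r)
    r4 = mul44(r3, r)
    for i in range(0, len(msgs), 4):
        m1, m2, m3, m4 = msgs[i:i + 4]
        h = (mul44((h + m1) % P, r4) + mul44(m2, r3)
             + mul44(m3, r2) + mul44(m4, r)) % P
    return h

rng = random.Random(0)
r = rng.getrandbits(128) & 0x0ffffffc0ffffffc0ffffffc0fffffff  # clamped key
msgs = [rng.getrandbits(128) | (1 << 128) for _ in range(8)]   # padded blocks
print(update_serial(0, r, msgs) == update_4way(0, r, msgs))
```

The four products in `update_4way` do not depend on each other, which is what lets a vector unit compute them side by side.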
The testsuite passes all tests with this patch.
Benchmark of poly1305 update using nettle-benchmark on Power9
| C | single block (2^64) | multi blocks (2^44) |
|---|---------------------|---------------------|
| 472.63 Mbyte/s | 658.45 Mbyte/s | 2136.30 Mbyte/s |

https://git.lysator.liu.se/nettle/nettle/-/merge_requests/54
Fix illegal instruction in chacha-2core.asm on POWER7
2022-10-20T19:07:52Z
Maamoun TK

This patch replaces the "vmrgew/vmrgow" instructions with the vector permute instruction in chacha-2core.asm, so that the code depends only on Power ISA 2.06.

https://git.lysator.liu.se/nettle/nettle/-/merge_requests/52
Implement AES-GCM-SIV
2022-09-29T13:17:37Z
Daiki Ueno

This implements AES-GCM-SIV, described in RFC8452, on top of the
existing AES-GCM primitives. In particular, its hash algorithm
POLYVAL is implemented using the GHASH with additional byte order
conversion according to RFC8452 Appendix A.
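
The relationship described in RFC8452 Appendix A can be sketched in Python as follows. This is an illustration under assumptions, not nettle's code: `polyval` is a direct bit-by-bit implementation of the POLYVAL field arithmetic, `polyval_via_ghash` derives the same result from a plain GHASH by reversing bytes and multiplying the key by x in GHASH's field, and all function names are hypothetical.

```python
import random

P_POLYVAL = (1 << 128) | (1 << 127) | (1 << 126) | (1 << 121) | 1
R_GHASH = 0xE1 << 120

def clmul(a, b):
    """Carry-less (polynomial) multiplication over GF(2)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def polyval_dot(a, b):
    """a * b * x^-128 mod P_POLYVAL (bit i of the int is x^i)."""
    c = clmul(a, b)
    for _ in range(128):  # divide by x, 128 times
        if c & 1:
            c ^= P_POLYVAL
        c >>= 1
    return c

def polyval(h, blocks):
    hi, s = int.from_bytes(h, "little"), 0
    for x in blocks:
        s = polyval_dot(s ^ int.from_bytes(x, "little"), hi)
    return s.to_bytes(16, "little")

def mulx_ghash(v):
    """Multiply by x in GHASH's field (bit 127 of the int is x^0)."""
    return (v >> 1) ^ R_GHASH if v & 1 else v >> 1

def ghash_mul(x, y):
    z, v = 0, x
    for i in range(127, -1, -1):
        if (y >> i) & 1:
            z ^= v
        v = mulx_ghash(v)
    return z

def ghash(h, blocks):
    hi, y = int.from_bytes(h, "big"), 0
    for x in blocks:
        y = ghash_mul(y ^ int.from_bytes(x, "big"), hi)
    return y.to_bytes(16, "big")

def polyval_via_ghash(h, blocks):
    """POLYVAL built from GHASH plus byte-order conversion."""
    key = mulx_ghash(int.from_bytes(h[::-1], "big")).to_bytes(16, "big")
    return ghash(key, [x[::-1] for x in blocks])[::-1]

rng = random.Random(1)
h = bytes(rng.getrandbits(8) for _ in range(16))
blocks = [bytes(rng.getrandbits(8) for _ in range(16)) for _ in range(3)]
print(polyval(h, blocks) == polyval_via_ghash(h, blocks))
```

The conversion works because the two field polynomials are bit reversals of each other, so an existing optimized GHASH can be reused with only cheap reordering around it.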
Signed-off-by: Daiki Ueno <dueno@redhat.com>

https://git.lysator.liu.se/nettle/nettle/-/merge_requests/47
[PowerPC] Implement Poly1305 single block update based on radix 2^64
2022-08-06T19:45:05Z
Maamoun TK

This patch optimizes Poly1305 for the powerpc64 architecture by using the POWER9-specific instruction `vmsumudm` for full 64-bit multiplication, applied to a single block based on radix 2^64.
The patch also adds a new `--enable-power9` configure option to compile Power ISA v3.0 code.
The testsuite passes all tests with this patch.
Benchmark of poly1305 update using nettle-benchmark on Power9
| C | This patch |
| ------ | ------ |
| 472.63 Mbyte/s | 657.47 Mbyte/s |

https://git.lysator.liu.se/nettle/nettle/-/merge_requests/51
[S390x] Fix potential compiler error regarding GIEF usage
2022-06-28T15:08:17Z
Maamoun TK

Some GAS variants trigger an error regarding the use of the `clgije/risbg` instructions, which were added in arch8 (z10) by the General-instructions-extension facility (GIEF). This patch fixes that issue by declaring the machine type to the assembler in `memxor.s`.

https://git.lysator.liu.se/nettle/nettle/-/merge_requests/50
Add missing percent sign for chacha s390x-specific vector names
2022-06-14T15:38:05Z
Maamoun TK

This patch fixes the GitLab CI failure of the s390x pipeline job at compile time, caused by a missing symbol in a previous patch.

https://git.lysator.liu.se/nettle/nettle/-/merge_requests/49
Fix a POSIX violation of m4 argument expansion
2022-06-13T17:45:38Z
Maamoun TK

A workaround for expanding multi-digit argument references in the `QR` macro in `chacha-4core.asm`, which is incompatible with POSIX.
See https://www.gnu.org/software/m4/manual/html_node/Arguments.html

https://git.lysator.liu.se/nettle/nettle/-/merge_requests/45
[S390x] Alerting assembler of machine type
2022-05-14T20:35:52Z
Maamoun TK

I have noticed that some GCC versions don't recognize certain vector instructions unless the machine type is declared in the assembly files. This patch addresses this issue by specifying the machine model in each file that uses those vector instructions.

https://git.lysator.liu.se/nettle/nettle/-/merge_requests/44
Refactor s390x-specific code for new ghash organization
2022-02-23T16:51:23Z
Maamoun TK

This patch refactors the GCM code for the s390x architecture to be compatible with the new ghash organization.

https://git.lysator.liu.se/nettle/nettle/-/merge_requests/37
[Arm64] Optimize Chacha20
2022-01-25T18:47:39Z
Maamoun TK

This patch optimizes Chacha20 for the arm64 architecture by following the approach used in the powerpc implementation.
The testsuite passes all tests with this patch.
Benchmark of chacha encrypt/decrypt using nettle-benchmark on gfarm 117
| C | This patch |
| ------ | ------ |
| 197.72 Mbyte/s | 357.58 Mbyte/s |
NOTE: This patch was implemented with both endianness modes in mind but has been tested only on the little-endian variant, for lack of access to a big-endian system.

https://git.lysator.liu.se/nettle/nettle/-/merge_requests/40
[S390x] Optimize Chacha20 with fat build support
2022-01-20T20:27:10Z
Maamoun TK

This patch optimizes Chacha20 for the s390x architecture by following the approach used in the powerpc implementation.
The testsuite passes all tests with this patch.
Benchmark of chacha encrypt/decrypt using nettle-benchmark on z15
| C | This patch |
| ------ | ------ |
| 384.13 Mbyte/s | 1478.10 Mbyte/s |

https://git.lysator.liu.se/nettle/nettle/-/merge_requests/36
[S390x] Optimize SHA3 permute using vector facility
2021-10-31T07:35:21Z
Maamoun TK

This patch optimizes the SHA3 permute function by taking advantage of the vector facility. Vectorizing SHA3 permute is a better fit for the s390x architecture than the SHA3 hardware accelerator, because it implements only the actual permute procedure rather than executing extra procedures that are already handled by other functions in the nettle library. Applying the SHA3 hardware accelerator in a previous patch yielded a 12% performance boost, while this patch yields a ~105% performance increase for SHA3 functions.
The optimized core follows the same optimization procedure that is used in the SHA3 permute implementation for the x86_64 architecture.
| Algorithm | C (Mbyte/s) | Vectorized (Mbyte/s) |
| ------ | ------ | ------ |
| sha3_224 | 235.08 | 483.41 |
| sha3_256 | 226.15 | 460.68 |
| sha3_384 | 172.90 | 357.15 |
| sha3_512 | 120.46 | 243.96 |

https://git.lysator.liu.se/nettle/nettle/-/merge_requests/35
[S390x] Optimize SHA256 and SHA512 compress functions
2021-08-16T20:09:25Z
Maamoun TK

This patch optimizes the SHA256 and SHA512 compress functions for the s390x architecture. The testsuite passes the tests. Benchmark on Z15:
| Algorithm | C | Hardware-accelerated |
| ------ | ------ | ------ |
| SHA256 | 242.76 Mbyte/s | 869.00 Mbyte/s |
| SHA512 | 373.18 Mbyte/s | 1555.21 Mbyte/s |

https://git.lysator.liu.se/nettle/nettle/-/merge_requests/33
[S390x] Optimize SHA1 compress with fat build support
2021-08-10T20:53:25Z
Maamoun TK

This patch optimizes the SHA1 compress function for s390x architectures using the built-in cipher-accelerating instruction KIMD (COMPUTE INTERMEDIATE MESSAGE DIGEST). The patch also adds fat build support for the two functions to pick the supported implementations at run-time.
`make check` passes all tests. Benchmark of SHA-1 by executing `examples/nettle-benchmark`:
| Function | C | Hardware-accelerated |
| ------ | ------ | ------ |
| update | 370.57 Mbyte/s | 816.54 Mbyte/s |

https://git.lysator.liu.se/nettle/nettle/-/merge_requests/34
[AArch64] Optimize AES with fat build support
2021-08-09T14:51:11Z
Maamoun TK

This patch optimizes the AES encrypt/decrypt functions, giving each key size its own implementation so that the key expansion is loaded just once in the function prologue. This yields a considerable performance increase over loading the key expansion for every block iteration. The patch also adds fat build support for the AES functions.
`make check` passes all tests. Benchmark of executing `examples/nettle-benchmark`:
| Algorithm | mode | C (Mbyte/s) | OpenSSL (Mbyte/s) | This patch (Mbyte/s) |
| ------ | ------ | ------ | ------ | ------ |
| aes128 | ECB encrypt | 95.01 | 1037.85 | 2579.62 |
| aes128 | ECB decrypt | 93.47 | 1005.15 | 2577.53 |
| aes192 | ECB encrypt | 79.60 | 893.34 | 2205.53 |
| aes192 | ECB decrypt | 78.34 | 889.17 | 2204.41 |
| aes256 | ECB encrypt | 66.64 | 782.21 | 1925.73 |
| aes256 | ECB decrypt | 65.81 | 781.37 | 1925.79 |

https://git.lysator.liu.se/nettle/nettle/-/merge_requests/30
[S390x] Optimize memxor
2021-08-08T08:54:59Z
Maamoun TK

This patch optimizes the `memxor` function for the `s390x` architecture. The optimized core takes advantage of the `xc` instruction ("Storage-to-storage xor") to implement a high-performance `memxor` function. Unfortunately, the `xc` instruction processes the bytes in left-to-right order, which makes it unsuitable for implementing the `memxor3` function. I tried a workaround for that issue, but it was slower than the C implementation, so I dropped it.
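
Why processing order matters can be seen in a small Python model. It assumes the concern is overlapping operands, where a source region overlaps the start of the destination region; the description above does not state this explicitly. The function names and buffer layout are hypothetical.

```python
def xor3_ascending(buf, dst, a, b, n):
    """Left-to-right: buf[dst+i] = buf[a+i] ^ b[i], as a byte-serial xc would."""
    for i in range(n):
        buf[dst + i] = buf[a + i] ^ b[i]

def xor3_descending(buf, dst, a, b, n):
    """Right-to-left: safe when dst > a, since reads happen before overwrites."""
    for i in reversed(range(n)):
        buf[dst + i] = buf[a + i] ^ b[i]

original = bytearray(range(1, 13))   # 12 bytes: 1..12
b = bytes([0xFF] * 8)
a_off, dst_off, n = 0, 4, 8          # source [0, 8) overlaps destination [4, 12)

# Reference result computed from an untouched copy of the source.
expected = bytearray(original)
for i in range(n):
    expected[dst_off + i] = original[a_off + i] ^ b[i]

asc = bytearray(original)
xor3_ascending(asc, dst_off, a_off, b, n)

desc = bytearray(original)
xor3_descending(desc, dst_off, a_off, b, n)

print(desc == expected, asc == expected)
```

In the ascending version, byte 4 of the source has already been overwritten by the time it is read, so the result is corrupted; the descending version reads each source byte before any write can reach it.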
Benchmark of `memxor` run on z15 with 5.2 GHz CPU frequency
| mode | C | xc-assisted implementation |
| ------ | ------ | ------ |
| aligned | 22552.01 Mbyte/s | 32331.91 Mbyte/s |
| unaligned | 13152.09 Mbyte/s | 32086.29 Mbyte/s |