[S390x] Optimize memxor

Maamoun TK requested to merge mamonet/nettle:s390x-memxor into master

This patch optimizes the memxor function for the s390x architecture. The optimized core uses the xc ("storage-to-storage XOR") instruction to implement a high-performance memxor. Unfortunately, xc processes bytes in left-to-right order, which makes it unsuitable for implementing memxor3. I tried a workaround for that issue, but it was slower than the existing C implementation, so I dropped it.
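To illustrate why processing order can matter for a three-operand XOR, here is a minimal C sketch. It is not code from this patch. It assumes that memxor3 must tolerate one source overlapping the start of the destination, which is my reading of the constraint rather than something stated above. The helper names `xor3_ascending`, `xor3_descending`, and `overlap_ok` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: XOR in ascending (left-to-right) order,
   the order in which xc processes bytes. */
static void xor3_ascending(uint8_t *dst, const uint8_t *a,
                           const uint8_t *b, size_t n)
{
  for (size_t i = 0; i < n; i++)
    dst[i] = a[i] ^ b[i];
}

/* Hypothetical helper: XOR in descending (right-to-left) order. */
static void xor3_descending(uint8_t *dst, const uint8_t *a,
                            const uint8_t *b, size_t n)
{
  while (n-- > 0)
    dst[n] = a[n] ^ b[n];
}

/* Runs f with source a overlapping the start of dst (dst = a + 4) and
   returns 1 if the output matches a result computed from a pristine
   copy of the input, 0 otherwise. */
static int overlap_ok(void (*f)(uint8_t *, const uint8_t *,
                                const uint8_t *, size_t))
{
  uint8_t buf[12], copy[12], b[8], expect[8];
  size_t i;

  for (i = 0; i < sizeof buf; i++)
    buf[i] = (uint8_t)(i + 1);
  memcpy(copy, buf, sizeof buf);

  for (i = 0; i < sizeof b; i++)
    b[i] = (uint8_t)(0xA0 + i);

  /* Reference result, using the untouched copy as source a. */
  for (i = 0; i < sizeof expect; i++)
    expect[i] = copy[i] ^ b[i];

  f(buf + 4, buf, b, 8);
  return memcmp(buf + 4, expect, sizeof expect) == 0;
}
```

Under that overlap assumption, `overlap_ok(xor3_descending)` succeeds while `overlap_ok(xor3_ascending)` fails: the ascending loop overwrites source bytes before it has read them.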

Benchmark of memxor run on z15 with 5.2 GHz CPU frequency

| Mode      | C (Mbyte/s) | xc-assisted implementation (Mbyte/s) |
|-----------|-------------|--------------------------------------|
| aligned   | 22552.01    | 32331.91                             |
| unaligned | 13152.09    | 32086.29                             |
