Force 16-byte alignment on data structures where that can benefit performance
There are several places where assembly code could use aligned load and store instructions if data items were 16-byte aligned, e.g., the ppc64 lvx instruction. It would make sense to make at least union nettle_block16, and the arrays of subkeys for AES and UMAC, 16-byte aligned.
This is an ABI change, and the alignment directives must go in public header files and be compiler-agnostic. It may also need to be conditional on the architecture: if an architecture has no instructions that benefit from 16-byte alignment, maybe we shouldn't require it there.
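
A minimal sketch of what a compiler-agnostic directive might look like, assuming C11 `alignas` where available and falling back to the GCC/Clang and MSVC extensions otherwise. The names `EXAMPLE_ALIGN16`, `union example_block16`, `struct example_aes_ctx`, `example_is_aligned16` and `example_self_check` are hypothetical and not part of nettle's headers; the union members and the subkey count are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical macro (not an existing nettle macro): select an alignment
   specifier that the compiler at hand understands. */
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L
# include <stdalign.h>
# define EXAMPLE_ALIGN16 alignas(16)
#elif defined(__GNUC__)
# define EXAMPLE_ALIGN16 __attribute__((aligned(16)))
#elif defined(_MSC_VER)
# define EXAMPLE_ALIGN16 __declspec(align(16))
#else
/* Unknown compiler: no alignment is forced. */
# define EXAMPLE_ALIGN16
#endif

/* Illustrative stand-in for a 16-byte block type. The specifier is placed
   on a member, which raises the alignment of the whole union. */
union example_block16
{
  EXAMPLE_ALIGN16 uint8_t b[16];
  uint64_t u64[2];
};

/* Illustrative context struct holding an array of subkeys; it inherits
   the 16-byte alignment from its member type. */
struct example_aes_ctx
{
  union example_block16 subkeys[15];
};

/* Returns nonzero if p is a multiple of 16. */
static int
example_is_aligned16 (const void *p)
{
  return ((uintptr_t) p & 15) == 0;
}

/* Checks that an object with automatic storage gets the requested
   alignment, including every array element. */
static void
example_self_check (void)
{
  struct example_aes_ctx ctx;
  unsigned i;
  for (i = 0; i < 15; i++)
    assert (example_is_aligned16 (&ctx.subkeys[i]));
}
```

The specifier sits on a member rather than on the type because C11 `alignas` applies to objects and members, not to a type declaration, and the two extension forms accept the same placement. An architecture condition could wrap the macro definition so that it expands to nothing where the alignment brings no benefit. Note that heap allocations would need separate care, since `malloc` is not guaranteed to return 16-byte aligned memory on every platform.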