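Instead of using an `Image16`, which stores each channel in a separate plane, store the HaldCLUT in an `AlignedBuffer<std::uint16_t>` holding sequential (interleaved) RGBx values, so the channels of one CLUT entry sit next to each other in memory. This gives a speedup of roughly 23% here.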
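
To illustrate the layout change, here is a minimal sketch of packing three channel planes into one buffer of sequential RGBx values. It is not the actual implementation: `PlanarImage`, `interleaveRGBx` and `clutEntry` are hypothetical names, a plain `std::vector` stands in for `AlignedBuffer<std::uint16_t>`, and the padding slot `x` is assumed to be zero.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a planar image: one separate array per channel.
struct PlanarImage {
    std::vector<std::uint16_t> r, g, b;
};

// Pack the three planes into one buffer of sequential R, G, B, x values.
// The fourth slot is padding (assumed zero), so each entry is 8 bytes wide.
std::vector<std::uint16_t> interleaveRGBx(const PlanarImage& img)
{
    assert(img.r.size() == img.g.size() && img.g.size() == img.b.size());

    const std::size_t n = img.r.size();
    std::vector<std::uint16_t> out(4 * n, 0);

    for (std::size_t i = 0; i < n; ++i) {
        out[4 * i + 0] = img.r[i];
        out[4 * i + 1] = img.g[i];
        out[4 * i + 2] = img.b[i];
        // out[4 * i + 3] stays 0 (padding).
    }

    return out;
}

// A lookup reads one contiguous entry instead of three separate planes.
const std::uint16_t* clutEntry(const std::vector<std::uint16_t>& clut, std::size_t index)
{
    return &clut[4 * index];
}
```

With this layout, fetching entry `i` reads `clut[4 * i]` through `clut[4 * i + 2]` from adjacent memory, whereas the planar layout needs one access into each of the three planes. That locality is presumably where the speedup comes from, but the measurement above is the only evidence for it.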