Ingo has provided a solution for the strange Windows crash with
`_mm_cvtpu16_ps()`: It was not an alignment problem, but the use of
MMX instructions which led to the SEGV.
Ingo's solution now omits MMX instructions altogether and is
nevertheless faster than the `_mm_set_ps()` workaround.
Many thanks to @heckflosse!
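
For illustration, one common way to convert four 16-bit values to floats without touching MMX registers is to zero-extend them with SSE2 integer instructions. This is only a sketch of that general technique and not necessarily the code Ingo provided; the helper name `load_u16x4_as_ps` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics (also pulls in the SSE ones)

// Hypothetical helper: converts four consecutive 16-bit values to floats
// using XMM registers only, so no MMX state is involved.
inline __m128 load_u16x4_as_ps(const std::uint16_t* src)
{
    assert(src != nullptr);

    // Load 64 bits (four uint16_t) into the low half of an XMM register.
    // This load has no alignment requirement.
    const __m128i packed =
        _mm_loadl_epi64(reinterpret_cast<const __m128i*>(src));

    // Interleave with zeros to zero-extend each 16-bit lane to 32 bits.
    const __m128i widened = _mm_unpacklo_epi16(packed, _mm_setzero_si128());

    // Every value is at most 65535, so the signed 32-bit conversion is exact.
    return _mm_cvtepi32_ps(widened);
}
```

Because `_mm_cvtpu16_ps()` takes an `__m64` argument, it goes through MMX registers, whereas the sketch above stays in XMM registers throughout.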
Ingo had some cleanup suggestions in #3154, which I tried to implement in
this commit. Although switching to `vfloat2` is a clever idea, I could not
measure any further speedup.
Instead of using an `Image16`, which is organized in planes, store the
HaldCLUT in an `AlignedBuffer<std::uint16_t>` with sequential RGBx
values. This gives a speedup of roughly 23% here.
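
To make the layout change concrete, here is a minimal sketch of packing three separate planes into one sequential RGBx buffer. It uses `std::vector` as a stand-in for `AlignedBuffer<std::uint16_t>`, and the helper name `interleave_rgbx` is hypothetical; the actual code in `clutstore.*` may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: packs planar R, G and B data into a single buffer
// with four values per pixel (R, G, B and one unused padding value).
std::vector<std::uint16_t> interleave_rgbx(
    const std::vector<std::uint16_t>& r,
    const std::vector<std::uint16_t>& g,
    const std::vector<std::uint16_t>& b)
{
    assert(r.size() == g.size() && g.size() == b.size());

    std::vector<std::uint16_t> out(r.size() * 4);

    for (std::size_t i = 0; i < r.size(); ++i) {
        out[4 * i + 0] = r[i];
        out[4 * i + 1] = g[i];
        out[4 * i + 2] = b[i];
        out[4 * i + 3] = 0; // padding, keeps each pixel at 64 bits
    }

    return out;
}
```

With this layout all channels of one CLUT entry sit next to each other in memory, so a lookup touches one contiguous 64-bit block instead of three distant plane locations.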
This commit adds a true LRU cache to `rtengine` which is used in the new
`CLUTStore` class. The code in `clutstore.*` was cleaned up with C++11
features and small optimizations taken from my `clutbench` project.
The `CLUTStore` class was converted to a true singleton.
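
As a rough illustration of what an LRU cache does, the following C++11 sketch combines a `std::list` (recency order) with a `std::unordered_map` (key lookup). The class name `LruCache` and its interface are assumptions made for this example and do not mirror the cache actually added to `rtengine`.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>

// Hypothetical minimal LRU cache: the front of the list is the most
// recently used entry, the back is the next eviction candidate.
template<class K, class V>
class LruCache
{
public:
    explicit LruCache(std::size_t capacity) : capacity_(capacity)
    {
        assert(capacity > 0);
    }

    // Inserts or updates an entry and marks it as most recently used.
    void put(const K& key, V value)
    {
        const auto it = index_.find(key);

        if (it != index_.end()) {
            it->second->second = std::move(value);
            items_.splice(items_.begin(), items_, it->second);
            return;
        }

        if (items_.size() == capacity_) {
            // Evict the least recently used entry.
            index_.erase(items_.back().first);
            items_.pop_back();
        }

        items_.emplace_front(key, std::move(value));
        index_[key] = items_.begin();
    }

    // Copies the value into `out` and returns true on a hit;
    // a hit also marks the entry as most recently used.
    bool get(const K& key, V& out)
    {
        const auto it = index_.find(key);

        if (it == index_.end()) {
            return false;
        }

        items_.splice(items_.begin(), items_, it->second);
        out = it->second->second;
        return true;
    }

    std::size_t size() const { return items_.size(); }

private:
    using Item = std::pair<K, V>;

    std::size_t capacity_;
    std::list<Item> items_;
    std::unordered_map<K, typename std::list<Item>::iterator> index_;
};
```

`std::list::splice` moves a node without invalidating iterators, which is why the map can store list iterators and both `put` and `get` run in constant time on average.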