User jgschaefer [reported an error on pixls.us](https://discuss.pixls.us/t/rt-build-from-git-crash-on-launch-debian-testing-64-bit/1425)
that could be traced back to an empty basename for a HaldCLUT. The
original implementation did not throw an exception because it used
`std::string::substr()` instead of `std::string::erase()`; instead, it
silently assigned the first working profile to `profile_name`.
Ingo has provided a solution for the strange Windows crash with
`_mm_cvtpu16_ps()`: it was not an alignment problem; the use of
MMX instructions led to the SEGV.
Ingo's solution now omits MMX instructions altogether and is
nevertheless faster than the `_mm_set_ps()` workaround.
Many thanks to @heckflosse!
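
For readers unfamiliar with the issue: `_mm_cvtpu16_ps()` takes an `__m64` argument and therefore involves MMX registers. A common MMX-free way to convert four `std::uint16_t` values to floats with plain SSE2 is sketched below. It is an illustration under that assumption, not necessarily the code Ingo contributed, and the function name is hypothetical.

```cpp
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics

// Hypothetical helper: loads four consecutive 16-bit unsigned values
// and returns them as four floats, using only SSE2 (no __m64/MMX).
inline __m128 load_u16x4_as_float(const std::uint16_t* src)
{
    // Load 64 bits (4 x uint16) into the low half of an XMM register;
    // this load has no alignment requirement.
    const __m128i packed = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(src));
    // Zero-extend each 16-bit lane to 32 bits by interleaving with zeros.
    const __m128i widened = _mm_unpacklo_epi16(packed, _mm_setzero_si128());
    // Convert the four 32-bit integers to single-precision floats.
    return _mm_cvtepi32_ps(widened);
}
```
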
Ingo had some cleanup suggestions in #3154, which I tried to implement in
this commit. Although switching to `vfloat2` is a clever idea, I can see
no further speedup from it.
Instead of using an `Image16`, which is organized in planes, store the
HaldCLUT in an `AlignedBuffer<std::uint16_t>` with sequential RGBx
values. This gives a speedup of roughly 23% here.
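
The difference between the two layouts can be sketched as follows. This uses a plain `std::vector` as a stand-in for `AlignedBuffer`, and the function name is hypothetical. In the interleaved form, all channels of one CLUT entry sit next to each other in memory, and the padding `x` slot lets a single entry fill one 4-lane SIMD load.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: packs three separate planes (planar layout)
// into one sequential R,G,B,x buffer (interleaved layout).
// Assumes all three planes have the same size.
std::vector<std::uint16_t> interleave_rgbx(const std::vector<std::uint16_t>& r,
                                           const std::vector<std::uint16_t>& g,
                                           const std::vector<std::uint16_t>& b)
{
    const std::size_t count = r.size();
    std::vector<std::uint16_t> out(count * 4);

    for (std::size_t i = 0; i < count; ++i) {
        out[4 * i + 0] = r[i];
        out[4 * i + 1] = g[i];
        out[4 * i + 2] = b[i];
        out[4 * i + 3] = 0; // padding slot ("x")
    }

    return out;
}
```
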
This commit adds a true LRU cache to `rtengine` which is used in the new
`CLUTStore` class. The code in `clutstore.*` was cleaned up with C++11
features and small optimizations taken from my `clutbench` project.
The `CLUTStore` class was converted to a true singleton.