In a world where the CLR's JIT compiler is generating native code for our desktop apps on the fly and horribly inefficient interpreted languages are serving our web content, both server-side and client-side, the folks at Intel and AMD must think our cheese has slipped off our crackers.
The x86 processor is incredibly complex, incrementally loaded with new instructions that solve common problems in a fraction of the time. Most of these instructions arrive with the SSE (Streaming SIMD Extensions) releases, where SIMD stands for single instruction, multiple data. MMX gave us 64-bit registers back when we all lived with 32-bit operating systems, and SSE then widened them to 128 bits. Now, with the VEX prefix, AVX gives us 256-bit registers, potentially moving up to 512 bits within a couple of years.
The 128-bit XMM registers are divided into components, such as 2 doubles, 4 floats or 16 characters. An operation can then be performed on all of these components in parallel, with a single instruction, whereas scalar processing handles one component at a time. Everything from simple arithmetic to complex comparison and matching is now available at the cost of a single instruction.
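To make the idea of components concrete, here is a minimal C sketch using the SSE intrinsics from <xmmintrin.h>, assuming an x86-64 compiler such as GCC or Clang (where SSE is always enabled). The function names add_scalar and add_sse are hypothetical, and the SIMD version assumes the array length is a multiple of 4 to keep it short. The scalar loop performs one addition per iteration, while _mm_add_ps adds four packed floats with a single ADDPS instruction.

```c
#include <xmmintrin.h> /* SSE: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */

/* Scalar: one float added per loop iteration. */
void add_scalar(const float *a, const float *b, float *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* SIMD: four floats added per ADDPS instruction.
   Assumes n is a multiple of 4 to keep the sketch short. */
void add_sse(const float *a, const float *b, float *out, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);            /* load 4 floats (128 bits) */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb)); /* 4 sums in one instruction */
    }
}
```

Both functions produce the same results; the SIMD loop simply runs a quarter as many iterations.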
The entire pipeline benefits from these SIMD operations. All 128 bits are loaded from memory into a register at once, the operation is performed on each component in parallel, and finally all 128 bits are stored back to memory in one go. With scalar processing, by contrast, each component has to be loaded into a register, processed and stored back to memory separately, one at a time. SIMD can save considerable processing time for complex mathematics, but also for very linear tasks such as string processing: SSE can test 16 characters at the same time, compared to a single character at a time with scalar processing.
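The 16-characters-at-a-time claim can be illustrated with SSE2: broadcast the target character into all 16 byte lanes, compare a 16-byte chunk in one instruction, and collapse the result into a bitmask. This is a minimal sketch, not a tuned memchr; find_byte_sse2 is a hypothetical name, it assumes GCC or Clang for __builtin_ctz, and it uses unaligned loads plus a scalar tail so it never reads past the end of the buffer.

```c
#include <emmintrin.h> /* SSE2: _mm_set1_epi8, _mm_cmpeq_epi8, _mm_movemask_epi8 */
#include <stddef.h>    /* ptrdiff_t, size_t */

/* Returns the index of the first occurrence of c in buf[0..len), or -1. */
ptrdiff_t find_byte_sse2(const char *buf, size_t len, char c)
{
    const __m128i needle = _mm_set1_epi8(c); /* c copied into all 16 byte lanes */
    size_t i = 0;

    /* Main loop: one load and one compare test 16 characters. */
    for (; i + 16 <= len; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
        /* Matching lanes become 0xFF; movemask packs their top bits into bits 0..15. */
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle));
        if (mask != 0) /* lowest set bit = first match (GCC/Clang builtin) */
            return (ptrdiff_t)(i + (size_t)__builtin_ctz((unsigned)mask));
    }

    /* Fewer than 16 bytes left: finish with a scalar loop. */
    for (; i < len; i++)
        if (buf[i] == c)
            return (ptrdiff_t)i;
    return -1;
}
```

The lowest set bit of the mask marks the first matching byte, so a single count-trailing-zeros gives its offset within the chunk. Production implementations usually align their loads instead, which is faster but harder to read.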
We are moving backwards in computing. We have had SSE2 since the Pentium 4, Celeron and Athlon 64, yet Microsoft’s .NET platform only uses SSE for converting between int and float. .NET compiles to native code for the host CPU on the fly! Why does it not check what the processor supports and generate native code that rivals the C++ of even experienced developers? Lua and PHP use hash table lookups with reckless abandon; why do they not take advantage of SSE’s string matching capabilities?
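Checking what the processor supports before choosing a code path is not hard in native code either. The sketch below, assuming GCC or Clang on x86-64, compiles one function for SSE4.2 with the target attribute, checks the CPU at run time with __builtin_cpu_supports, and falls back to a scalar loop otherwise. The SIMD path uses PCMPESTRI (_mm_cmpestri) with _SIDD_CMP_EQUAL_ANY, one of SSE4.2's string matching modes, to test 16 bytes against a set of up to 16 characters in one instruction. The names find_any, find_any_sse42 and find_any_scalar are hypothetical.

```c
#include <nmmintrin.h> /* SSE4.2: _mm_cmpestri, _SIDD_* flags */
#include <stddef.h>    /* ptrdiff_t, size_t */
#include <string.h>    /* memcpy, memchr */

/* Built for SSE4.2 via the GCC/Clang target attribute; call it only after a
   runtime CPU check. Returns the index of the first byte of buf[0..len) that
   occurs in set[0..set_len), or -1. Assumes 0 < set_len <= 16. */
__attribute__((target("sse4.2")))
static ptrdiff_t find_any_sse42(const char *buf, size_t len,
                                const char *set, int set_len)
{
    char set_block[16] = {0};
    memcpy(set_block, set, (size_t)set_len);
    const __m128i vset = _mm_loadu_si128((const __m128i *)set_block);

    for (size_t i = 0; i < len; i += 16) {
        size_t n = len - i < 16 ? len - i : 16;
        char tail[16] = {0};
        const char *p = buf + i;
        if (n < 16) {          /* copy the tail so the 16-byte load stays in bounds */
            memcpy(tail, p, n);
            p = tail;
        }
        __m128i chunk = _mm_loadu_si128((const __m128i *)p);
        /* One PCMPESTRI: compare up to 16 bytes against every byte of the set. */
        int idx = _mm_cmpestri(vset, set_len, chunk, (int)n,
                               _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY |
                               _SIDD_LEAST_SIGNIFICANT);
        if (idx < 16)          /* 16 means no byte in this chunk matched */
            return (ptrdiff_t)(i + (size_t)idx);
    }
    return -1;
}

/* Portable fallback: one byte per iteration. */
static ptrdiff_t find_any_scalar(const char *buf, size_t len,
                                 const char *set, int set_len)
{
    for (size_t i = 0; i < len; i++)
        if (memchr(set, (unsigned char)buf[i], (size_t)set_len) != NULL)
            return (ptrdiff_t)i;
    return -1;
}

/* Decide at run time which implementation this CPU can execute. */
ptrdiff_t find_any(const char *buf, size_t len, const char *set, int set_len)
{
    if (__builtin_cpu_supports("sse4.2"))
        return find_any_sse42(buf, len, set, set_len);
    return find_any_scalar(buf, len, set, set_len);
}
```

A JIT could make the same decision once, when it compiles the code on the user's machine, instead of branching on every call.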
That is all; I just wanted to point out that we are all being stupid.
All right, there is more… Most of the projects I mentioned are either open source or have open source competitors (like Mono to .NET). My hope is that some of the SIMD research I present here will ripple into those projects. There must be reasons why rich companies are passing up basic optimizations that would give them a significant edge. Even SQL database systems are missing these features.
Here are the cons that I know of:
Here is what can be improved:
Here is why: