Going faster in a slowing world

In a world where CLR and JIT are generating native code for our desktop apps on the fly and horribly inefficient interpreted languages are serving our web content, both server-side and client-side, Intel and AMD folks must think our cheese slipped off our crackers.

The x86 processor is incredibly complex, incrementally loaded with new instructions that solve common problems in a fraction of the time.  Most of these instructions arrive with the SSE releases, and they are SIMD instructions: single instruction, multiple data.  We were given 64 bits of data per register when we all lived with our 32-bit operating systems, then it was upped to 128 bits.  Now, with the VEX prefix, AVX is giving us 256 bits for 64-bit operating systems, potentially moving up to 512 bits within a couple of years.

These 128-bit registers are broken up into components, such as 2 doubles, 4 floats or 16 characters.  Operations can then be performed on all of these components in parallel, with a single instruction.  Compare this to scalar processing, which handles one component at a time.  Available to us now is everything from simple mathematics to complex comparison and matching, at the reduced cost of single instructions.
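To make the idea of parallel components concrete, here is a minimal sketch in C using the SSE intrinsics from `<emmintrin.h>`.  The function name `add_floats` is made up for illustration, and the code assumes a compiler that defines `__SSE2__` (GCC, Clang) or targets x64 with MSVC; anywhere else it falls back to the plain scalar loop.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>            /* SSE/SSE2 intrinsics */
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

/* Hypothetical helper: out[i] = a[i] + b[i] for i in [0, n). */
void add_floats(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
    assert(n == 0 || (a && b && out));
#if HAVE_SSE2
    /* Four floats fit in one 128-bit register, so each pass
       performs four additions with a single add instruction. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* load 4 floats, no alignment needed */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar tail, or the whole array when SSE2 is not available. */
    for (; i < n; i++)
        out[i] = a[i] + b[i];
}
```

The unaligned load and store variants are used so the sketch works on any buffer; aligned variants exist as well and are where the alignment concerns start to matter.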

The entire pipeline is improved for these SIMD operations.  All 128 bits are loaded from memory into a register at once.  Then the operations are performed on each component in parallel.  Finally, all 128 bits are stored back to memory in one go.  With scalar processing, in contrast, each component has to be separately loaded into a register, processed and then stored back to memory, one at a time.  SIMD can save considerable amounts of processing time for complex mathematics, but also for very linear processes, such as string processing.  SSE can test 16 characters at the same time, compared to a single character at a time with scalar processing.
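As a rough illustration of testing 16 characters at once, the sketch below searches a buffer for a single byte.  `find_byte` is a hypothetical name, not a library function, and the same compiler assumptions apply (`__SSE2__` or MSVC x64, otherwise the scalar loop does all the work).

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>            /* SSE2 integer intrinsics */
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

/* Hypothetical helper: index of the first byte equal to c in
   buf[0..len), or len when the byte does not occur. */
size_t find_byte(const char *buf, size_t len, char c)
{
    size_t i = 0;
    assert(len == 0 || buf);
#if HAVE_SSE2
    const __m128i needle = _mm_set1_epi8(c);   /* c copied into all 16 lanes */
    for (; i + 16 <= len; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
        __m128i eq    = _mm_cmpeq_epi8(chunk, needle); /* 0xFF where equal */
        int mask      = _mm_movemask_epi8(eq);         /* one bit per byte */
        if (mask != 0) {
            size_t bit = 0;
            while (!(mask & 1)) {                      /* lowest set bit */
                mask >>= 1;
                bit++;
            }
            return i + bit;
        }
    }
#endif
    /* Scalar tail: the last len % 16 bytes, or everything without SSE2. */
    for (; i < len; i++)
        if (buf[i] == c)
            return i;
    return len;
}
```

One load, one compare and one mask extraction cover 16 characters; the scalar version needs a load, a compare and a branch for every single character.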

We are moving backwards in computing.  We have had SSE2 since the Pentium 4, Celeron and Athlon 64, yet Microsoft’s .NET platform only utilizes SSE for converting between int and float.  .NET compiles to the native CPU on the fly!  Why does it not check what the processor supports and generate native code which rivals the C++ programs of even experienced developers?  Lua and PHP utilize hash table lookups with reckless abandon, so why do they not utilize SSE’s string matching capabilities?
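Checking what the processor supports does not take much.  The sketch below picks an implementation at run time through a function pointer.  It leans on `__builtin_cpu_init` and `__builtin_cpu_supports`, which are GCC and Clang builtins for x86 (other compilers would need their own CPUID wrapper); the names `select_sum`, `sum_sse2` and `sum_scalar` are invented for this example.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__)) \
    && defined(__SSE2__)
#include <emmintrin.h>
#define CAN_DISPATCH_SSE2 1
#else
#define CAN_DISPATCH_SSE2 0
#endif

typedef int (*sum_fn)(const int *, size_t);

/* Plain scalar sum, always available. */
int sum_scalar(const int *v, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += v[i];
    return total;
}

#if CAN_DISPATCH_SSE2
/* Sums four ints per instruction, then folds the four lanes together. */
int sum_sse2(const int *v, size_t n)
{
    __m128i acc = _mm_setzero_si128();
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_epi32(acc, _mm_loadu_si128((const __m128i *)(v + i)));
    int lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);
    int total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; i++)                 /* leftover elements */
        total += v[i];
    return total;
}
#endif

/* Asks the CPU once and returns the best implementation it can run. */
sum_fn select_sum(void)
{
#if CAN_DISPATCH_SSE2
    __builtin_cpu_init();              /* GCC/Clang builtin, x86 only */
    if (__builtin_cpu_supports("sse2"))
        return sum_sse2;
#endif
    return sum_scalar;
}
```

A caller would do `sum_fn sum = select_sum();` once at startup and use `sum(values, count)` from then on.  A JIT has an easier job still, since it can make this decision while generating the code instead of through a pointer.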

That is all, I just wanted to point out that we are all being stupid.

All right, there is more…  Most of these projects that I mentioned are either open source or have open source competitors (like Mono to .NET).  My hope is that some of the research into SIMD that I provide here will ripple into those projects.  There must be reasons that rich companies are avoiding basic points of optimization that would give them a significant edge.  Even SQL database systems are missing these features.

Here are the cons that I know of:

  • Task switching in some OSes causes a performance hit when SIMD is being utilized.  The SIMD registers are quite large, and they have to be saved and restored on a task switch when two applications are utilizing SIMD at the same time.
  • Adding SIMD adds a level of uncertainty to compatibility and portability.  It is possible that a function could be slower, or unsupported, on a certain CPU.  Intrinsic functions have not always been portable and are still not available in many compilers, or are poorly optimized.
  • Performance is rarely consistent with SIMD functions.  While the result is usually faster than scalar computation, the performance of those functions can vary depending on other processes utilizing SIMD at the same time or on the alignment of data in memory.
  • Many instructions carry a penalty for working with memory that isn’t aligned to 128-bit boundaries.  Entire architectures may have to be modified to align memory properly for these functions.
  • Supporting multiple platforms can cause complex branching of code.
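The alignment point in the list above is the one most easily shown in code.  This is a small sketch, assuming a C11 library that provides `aligned_alloc` (MSVC does not, and would need `_aligned_malloc` instead); `alloc_floats16`, `is_aligned16` and `scale_aligned` are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

/* Rounds a float count up to a whole number of 4-float registers. */
static size_t padded_count(size_t count)
{
    size_t padded = (count + 3) & ~(size_t)3;
    return padded ? padded : 4;
}

/* Zeroed buffer for `count` floats on a 16-byte boundary, padded to
   whole registers.  Release with free().  Returns NULL on failure. */
float *alloc_floats16(size_t count)
{
    size_t bytes = padded_count(count) * sizeof(float);
    float *p = aligned_alloc(16, bytes);   /* C11; size is a multiple of 16 */
    if (p)
        memset(p, 0, bytes);
    return p;
}

int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Multiplies a buffer from alloc_floats16 by k in place, padding included. */
void scale_aligned(float *v, size_t count, float k)
{
    size_t padded = padded_count(count);
    assert(is_aligned16(v));
#if HAVE_SSE2
    __m128 vk = _mm_set1_ps(k);
    for (size_t i = 0; i < padded; i += 4)   /* aligned load and store */
        _mm_store_ps(v + i, _mm_mul_ps(_mm_load_ps(v + i), vk));
#else
    for (size_t i = 0; i < padded; i++)
        v[i] *= k;
#endif
}
```

Padding the buffer to a whole number of registers removes the scalar tail loop entirely, which is the kind of architectural change the alignment bullet is talking about: the allocator, not just the inner loop, has to know about SIMD.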

Here is what can be improved:

  • Any vector mathematics with 3 or more components (single operations need aligned memory to benefit)
  • Any matrix operations (small matrices need aligned memory to benefit)
  • All string functions (strings of any length)
  • Working with large bitmaps
  • Interpreted languages can tokenize their scripts in a fraction of the time.
  • Hash table (dictionary lookup) functions can be improved substantially, both with SIMD string compare and with the new hash generation instruction (CRC32)
  • 3D mathematics can be performed more rapidly

Here is why:

  • Saving power for mobile devices due to reduced processor and memory utilization
  • More responsive applications due to faster run-time
  • Competitive edge due to utilizing the newest technologies
  • Higher throughput in service applications due to the above
 
Posted by maragnus