<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0"><channel><atom:link rel="hub" href="http://tumblr.superfeedr.com/" xmlns:atom="http://www.w3.org/2005/Atom"/><description>Full disclosure of nothing at all.

</description><title>Josh's Blog</title><generator>Tumblr (3.0; @maragnus)</generator><link>http://blog.maragnus.com/</link><item><title>Optimizing your images</title><description>&lt;p&gt;Most of the bytes associated with a webpage are images.  A site that has a lot of small elements in separate files requires a lot of separate requests.  And for small elements, a significant portion of that traffic is overhead.&lt;/p&gt;
&lt;p&gt;I have built a Windows application to combine images into a single file and output CSS to use it.&lt;!-- more --&gt;&lt;/p&gt;
&lt;p&gt;Some techniques for &lt;a href="http://code.google.com/speed/page-speed/docs/request.html"&gt;reducing overhead&lt;/a&gt; can help.  You can use a server that supports keep-alive, or spread your static content across multiple domains to work around the browser&amp;#8217;s limit on simultaneous connections per domain.  But my preference is to make every request count.&lt;/p&gt;
&lt;p&gt;Tiling smaller images onto a single image is becoming a common solution.  It has given me &lt;a href="http://kyleschaeffer.com/best-practices/pure-css-image-hover/"&gt;pure CSS hover effects&lt;/a&gt; with no need to precache the hover states.  Subsequent pages also load much faster when I combine the smaller elements for the entire site.&lt;/p&gt;
&lt;p&gt;After combining images by hand and counting pixel offsets many times, I have created an application to do all the hard work for me.&lt;/p&gt;
&lt;p&gt;Drag and drop files onto the app and it will stack them together.  Files with a &amp;#8220;_over&amp;#8221; suffix will automatically have Hover as the state.  It will store the new PNG, the &lt;a href="http://dl.dropbox.com/u/19888282/Projects/PngStacker/tiles.css"&gt;generated CSS&lt;/a&gt; and an XML file to load it back in for updates.  The CSS class names will be the filenames.&lt;/p&gt;
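&lt;p&gt;To make the stacking idea concrete, here is a minimal Python sketch of how offsets and class names could be derived.  It is not the PngStacker source: the function name, the sprite filename &amp;#8220;tiles.png&amp;#8221; and the exact CSS layout are assumptions made for illustration, and it only computes the CSS text without touching any image data.&lt;/p&gt;

```python
def build_sprite_css(images, sprite_url="tiles.png"):
    """Stack images vertically and return one CSS rule per image.

    images: list of (filename, width, height) tuples, in stacking order.
    sprite_url is a hypothetical name for the combined PNG.
    """
    rules = []
    y = 0  # vertical offset of the current tile inside the combined image
    for filename, width, height in images:
        name = filename.rsplit(".", 1)[0]  # class name is the file name
        # A "_over" suffix marks the hover state of the base image.
        if name.endswith("_over"):
            selector = "." + name[: -len("_over")] + ":hover"
        else:
            selector = "." + name
        rules.append(
            f"{selector} {{ background: url({sprite_url}) 0 -{y}px no-repeat; "
            f"width: {width}px; height: {height}px; }}"
        )
        y += height  # the next tile starts right below this one
    return "\n".join(rules)


css = build_sprite_css([("home.png", 32, 32), ("home_over.png", 32, 32)])
print(css)
```

&lt;p&gt;Each tile is shown by shifting the background upward by the tile&amp;#8217;s offset, which is why the hover state needs no separate request.&lt;/p&gt;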
&lt;p&gt;&lt;img src="http://dl.dropbox.com/u/19888282/Projects/PngStacker/screen.png"/&gt;&lt;/p&gt;
&lt;p&gt;Download: &lt;a href="http://dl.dropbox.com/u/19888282/Projects/PngStacker/PngStacker.exe"&gt;Executable&lt;/a&gt; (33&amp;#160;KB), &lt;a href="http://dl.dropbox.com/u/19888282/Projects/PngStacker/PngStacker.7z"&gt;C# Source Code&lt;/a&gt; (15&amp;#160;KB), &lt;a href="http://www.microsoft.com/download/en/details.aspx?id=17851"&gt;.NET Framework 4&lt;/a&gt; (Web Installer)&lt;/p&gt;</description><link>http://blog.maragnus.com/post/15865568979</link><guid>http://blog.maragnus.com/post/15865568979</guid><pubDate>Sat, 14 Jan 2012 22:44:00 -0500</pubDate><category>css</category><category>programming</category><category>web</category><category>html</category><category>c</category><category>optimizing</category></item><item><title>Preparing for the first holiday lighting adventure</title><description>&lt;p&gt;My wife bought three strands of lights on clearance last year in anticipation of decorating the outside of our house for the first time this year.  Big multi-colored C9 bulbs with dangling white icicle strands of T1&amp;#160;&lt;a href="http://en.wikipedia.org/wiki/Christmas_lighting_technology#Sizes"&gt;bulbs&lt;/a&gt;.  It didn&amp;#8217;t occur to me that they wouldn&amp;#8217;t reach around the roof; it turns out they had only 10 lighted feet each, and nobody carried those strands anymore.  It was time to hit Google and learn about my options.&lt;!-- more --&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How should I hang them?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;There are a ton of hangers available, and there are tons of things that can hang lights that are not hangers.  I was not able to find any clips designed for hanging lights that would stay up, so I turned to &lt;a href="http://www.instructables.com/id/Hanging-Christmas-Lights-Made-Easy/"&gt;Instructables&lt;/a&gt; and found a great solution.  After checking &lt;a href="http://www.ebay.com/itm/100-ID-Badge-Clips-Clear-Vinyl-Strap-Badge-Holders-/220894035544"&gt;eBay&lt;/a&gt; and &lt;a href="http://www.amazon.com/Metal-Badge-Clips-Clear-Straps/dp/B0038NHBBM"&gt;Amazon&lt;/a&gt;, I purchased a cheap bulk order of 100 pieces, which turned out to be enough to hang two strands of lights over 90&amp;#8217; (28m) with a dozen spares.&lt;/p&gt;
&lt;p&gt;You should definitely inspect each clip and prepare it prior to climbing on the roof.  I attached the clips at regular intervals ahead of time, and kept a pocketful of loose clips ready for tricky spots like corners.&lt;/p&gt;
&lt;p&gt;&lt;img src="http://dl.dropbox.com/u/19888282/Posts/hangers.jpg"/&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How do I pick out which lights to get?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This was a tricky problem.  I visited three stores to set some parameters: Walmart (to set my base point), Home Depot (to set my high point) and then BJ&amp;#8217;s Wholesale Club (for comparison).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;LED lights&lt;/strong&gt;: These lights are a bit more expensive and rarely as bright.  Now that the cons are out of the way, they make up for it in several ways, and turned out cheaper for me up front.  Their light quality is higher even if they are not as bright.  They have a higher maximum chain length, which reduces the need for extension cords.  They consume less electricity, about 13% of what incandescents use.  They don&amp;#8217;t get hot.  The chain length is the selling point for me: I can light my entire house without concern for extension cords.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Maximum chain length&lt;/strong&gt;: The gauge of the wire determines how many lights you can chain together, end-to-end from a single outlet.  Commercial grade incandescents allow about 6 strands, while residential grade ones allow 2 or 3 strands.  If you go too long, you risk damaging your lights and starting a fire.  The alternative is to run an extension cord to each starting point, which is no fun.&lt;/p&gt;
&lt;p&gt;The LED lights are a different story: the ones I bought allow up to 48 strands on a single chain, and that is the minimum that I&amp;#8217;ve seen.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cost per lighted foot&lt;/strong&gt;: Some companies display the &amp;#8220;lighted length&amp;#8221; and some display the total length from plug to plug.  Keep this in mind when you are browsing.  The higher gauge wire on LEDs and commercial grade incandescent lights is tough to work with.  It is springy and doesn&amp;#8217;t cooperate when it is new, so allow a couple of extra inches of lights for each foot of roof line.  Other than that, just divide the price by the lighted length to get the cost per lighted foot, then multiply by the length of your roof line.&lt;/p&gt;
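&lt;p&gt;As a rough worked example of that arithmetic, here is a short Python sketch.  The function name is made up, and the price and strand length passed in at the bottom are hypothetical numbers rather than the ones I paid; the slack of 2 inches per foot is simply my reading of &amp;#8220;a couple&amp;#8221;.&lt;/p&gt;

```python
import math


def lighting_cost(price_per_strand, lighted_feet_per_strand, roof_line_feet,
                  slack_inches_per_foot=2.0):
    """Estimate cost per lighted foot, strands needed and total cost."""
    # Cost per lighted foot: price divided by lighted length.
    cost_per_foot = price_per_strand / lighted_feet_per_strand
    # Add slack because springy new wire does not follow the roof line exactly.
    feet_needed = roof_line_feet * (1 + slack_inches_per_foot / 12)
    # Strands are sold whole, so round up.
    strands = math.ceil(feet_needed / lighted_feet_per_strand)
    return cost_per_foot, strands, strands * price_per_strand


# Hypothetical strand: 12.00 for 24 lighted feet, on a 90 foot roof line.
cost_per_foot, strands, total = lighting_cost(12.0, 24.0, 90.0)
print(f"{cost_per_foot:.2f} per lighted foot, {strands} strands, {total:.2f} total")
```

&lt;p&gt;Comparing the cost per lighted foot rather than the price per box is what makes strands of different lengths comparable across stores.&lt;/p&gt;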
&lt;p&gt;&lt;strong&gt;What did I eventually purchase?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I purchased five strands of &lt;a href="http://www.amazon.com/Noma-Inliten--Icicle-V47615-88-Christmas/dp/B002L878FY"&gt;140 LED icicle lights&lt;/a&gt;, three strands of &lt;a href="http://www.amazon.com/Sylvania-50-LED-C9-Lights/dp/B00459H33O"&gt;50 LED C9 lights&lt;/a&gt; for my 90 feet of roof line and a &lt;a href="http://www.amazon.com/Coleman-Cable-13547-6-Outlet-Sensor/dp/B001XCWLVK"&gt;6 outlet dusk-dawn timer&lt;/a&gt; for around $130.  We plan to add a little each year.  Everything turned out to be Sylvania brand.  These lights have very short unlighted cords, so the gap between the last light and the plug at either end is unnoticeable when they are hung, with no need to bunch them together.  If you use them on your tree, make sure you start with an extension cord.&lt;/p&gt;</description><link>http://blog.maragnus.com/post/13711231960</link><guid>http://blog.maragnus.com/post/13711231960</guid><pubDate>Sat, 03 Dec 2011 22:50:00 -0500</pubDate></item><item><title>Going faster in a slowing world</title><description>&lt;p&gt;In a world where &lt;a href="http://en.wikipedia.org/wiki/Common_Language_Runtime" title="Common Language Runtime"&gt;CLR&lt;/a&gt; and &lt;a href="http://en.wikipedia.org/wiki/Just-in-time_compilation" title="Just-in-time compilation"&gt;JIT&lt;/a&gt; are generating native code for our desktop apps on the fly and horribly inefficient &lt;a href="http://en.wikipedia.org/wiki/Interpreted_language" title="Interpreted languages"&gt;interpreted languages&lt;/a&gt; are serving our web content, both server-side and client-side, Intel and AMD folks must think our cheese slipped off our crackers.&lt;!-- more --&gt;&lt;/p&gt;
&lt;p&gt;The x86 processor is incredibly complex, incrementally loaded with new instructions to solve common problems in fractions of the time.  Most of these instructions are added with &lt;a href="http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions" title="Streaming SIMD Extensions"&gt;SSE&lt;/a&gt; releases.  SIMD stands for single instruction, multiple data.  We were given 64 bits of data when we all lived with our 32-bit operating systems, then it was upped to 128 bits.  Now with the &lt;span&gt;&lt;a href="http://en.wikipedia.org/wiki/VEX_prefix" title="VEX prefix"&gt;VEX prefix&lt;/a&gt;, &lt;/span&gt;&lt;a href="http://en.wikipedia.org/wiki/Advanced_Vector_Extensions" title="Advanced Vector Extensions"&gt;AVX&lt;/a&gt; is giving us &lt;span&gt;256 bits for 64-bit operating systems, potentially moving up to 512 bits within a couple of years.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;These 128-bit registers are broken up into components, such as 2 doubles, 4 floats or 16 characters.  Operations can then be performed on these components in parallel, with a single instruction, whereas scalar processing handles one component at a time.  Available to us now is everything from simple mathematics to complex comparison and matching at the reduced cost of single instructions.&lt;/p&gt;
&lt;p&gt;The entire pipeline is improved for these SIMD operations.  All 128 bits are loaded from memory into a register at once.  Then the operations are performed on each component in parallel.  Finally, all 128 bits are stored back to memory in one go.  In contrast, with &lt;a href="http://en.wikipedia.org/wiki/SISD" title="single instruction, single data"&gt;scalar processing&lt;/a&gt;, each component has to be separately loaded into a register, processed and then stored back to memory, one at a time.  &lt;span&gt;SIMD can save considerable amounts of processing time for complex mathematics, but also for very linear processes, such as string processing.  SSE can test 16 characters at the same time, compared to a single character at a time with scalar processing.&lt;/span&gt;&lt;/p&gt;
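&lt;p&gt;The difference in the number of steps can be modelled with a small Python sketch.  This is only a conceptual model and not real SIMD code: both function names are made up, and the built-in find call on a 16-byte slice merely stands in for the single parallel compare that an SSE instruction would perform on a 128-bit register.&lt;/p&gt;

```python
BLOCK = 16  # bytes that fit in one 128-bit register


def strlen_scalar(buf):
    """Test one byte per step, as scalar code would."""
    steps = 0
    for i, byte in enumerate(buf):
        steps += 1
        if byte == 0:
            return i, steps
    raise ValueError("no terminating zero byte")


def strlen_blocks(buf):
    """Test 16 bytes per step, modelling one parallel compare per block."""
    steps = 0
    for start in range(0, len(buf), BLOCK):
        steps += 1
        # Stand-in for comparing all 16 bytes against zero at once.
        pos = buf[start:start + BLOCK].find(0)
        if pos != -1:
            return start + pos, steps
    raise ValueError("no terminating zero byte")


text = b"single instruction, multiple data" + b"\0"
print(strlen_scalar(text))  # (length, number of single-byte tests)
print(strlen_blocks(text))  # (length, number of 16-byte block tests)
```

&lt;p&gt;Both functions return the same length, but the block version needs roughly one sixteenth as many compare steps, which is the saving the paragraph above describes.&lt;/p&gt;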
&lt;p&gt;&lt;span&gt;We are moving backwards in computing.  Since Pentium 4, Celeron and Athlon 64, we have had SSE2, yet Microsoft&amp;#8217;s .NET platform only utilizes SSE for converting between int and float.  .NET compiles to the native CPU on the fly!  Why does it not &lt;a href="http://softpixel.com/~cwright/programming/simd/cpuid.php" title="CPUID to check for SSE version"&gt;check&lt;/a&gt; what the processor supports and generate native code which rivals C++ programs of even experienced developers?  Lua and PHP utilize hash table lookups with reckless abandon, so why do they not utilize SSE&amp;#8217;s string matching capabilities?&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span&gt;That is all, I just wanted to point out that we are all being stupid.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;All right, there is more&amp;#8230;  Most of these projects that I mentioned are either open source or have open source competitors (like &lt;a href="http://www.mono-project.com" title="Cross platform, open source .NET development framework"&gt;Mono&lt;/a&gt; to .NET).  My hope is that some of the research into SIMD that I provide here will ripple into those projects.  There must be reasons that rich companies are avoiding basic points of optimization that would give them a significant edge.  Even SQL database systems are missing these features.&lt;/p&gt;
&lt;p&gt;Here are the cons that I know of:&lt;/p&gt;
&lt;ul&gt;&lt;li&gt;&lt;a href="http://msdn.microsoft.com/en-us/library/876txxy5(v=VS.71).aspx" title="Task switching and SIMD registers in Windows"&gt;Task switching&lt;/a&gt; in some OSes causes a hit to performance when SIMD is being utilized.  The registers that hold SIMD data are quite large, and they have to be saved and restored on task switches when two applications are utilizing SIMD at the same time.&lt;/li&gt;
&lt;li&gt;Adding SIMD adds a level of uncertainty to compatibility and portability. It is possible that a function could be slower, or unsupported, on a certain CPU.  Intrinsic functions have not always been portable and are still not available in many compilers, or are poorly optimized.&lt;/li&gt;
&lt;li&gt;Performance is rarely consistent with SIMD functions.  While the result is usually faster than scalar computation, the performance of those functions can vary depending on other processes utilizing them at the same time or the alignment of data in memory.&lt;/li&gt;
&lt;li&gt;With many instructions, there is a penalty for working with memory that isn&amp;#8217;t aligned to 128-bit boundaries.  Entire architectures may have to be modified to align memory properly for these functions.&lt;/li&gt;
&lt;li&gt;Supporting multiple platforms can cause complex branching of code.&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;Here is what can be improved:&lt;/p&gt;
&lt;ul&gt;&lt;li&gt;Any vector mathematics with 3 or more components (single operations need aligned memory to benefit)&lt;/li&gt;
&lt;li&gt;Any matrix operations (small matrices need aligned memory to benefit)&lt;/li&gt;
&lt;li&gt;All String functions (strings of any length)&lt;/li&gt;
&lt;li&gt;Working with large bitmaps&lt;/li&gt;
&lt;li&gt;Interpreted languages can &lt;a href="http://en.wikipedia.org/wiki/Tokenization" title="Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens."&gt;tokenize&lt;/a&gt; their scripts in a fraction of the time.&lt;/li&gt;
&lt;li&gt;&lt;a href="http://en.wikipedia.org/wiki/Hash_table" title="data structure that uses keys to find their associated values"&gt;Hash table&lt;/a&gt; (dictionary lookup) functions can be improved exponentially with string compare as well as with new &lt;a href="http://en.wikipedia.org/wiki/Hash_function" title="mapping large data sets to smaller data sets"&gt;hash generation&lt;/a&gt; (CRC32)&lt;/li&gt;
&lt;li&gt;3D mathematics can be performed more rapidly&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;Here is why:&lt;/p&gt;
&lt;ul&gt;&lt;li&gt;Saving power for mobile devices due to reduced processor and memory utilization&lt;/li&gt;
&lt;li&gt;More responsive applications due to faster run-time&lt;/li&gt;
&lt;li&gt;Competitive edge due to utilizing the newest technologies&lt;/li&gt;
&lt;li&gt;Higher throughput in service applications due to the above&lt;/li&gt;
&lt;/ul&gt;</description><link>http://blog.maragnus.com/post/11463220496</link><guid>http://blog.maragnus.com/post/11463220496</guid><pubDate>Fri, 14 Oct 2011 23:00:00 -0400</pubDate><category>SSE</category><category>SIMD</category><category>optimizing</category><category>c++</category><category>programming</category></item><item><title>How low can you go, strlen?</title><description>&lt;p&gt;The final verdict: SSE2 is the best option.  It offers performance between 150% and 388% of the CRT strlen function. 32-bit CRT and libc strlen are quite slow and the 64-bit strlens are about twice as fast.&lt;/p&gt;
&lt;!-- more --&gt;
&lt;p&gt;1. What about SSE4.2?&lt;/p&gt;
&lt;p&gt;First, SSE4.2 is slower.  Though it can run on unaligned memory faster than SSE2 can, it seems searching only for zeros can be done more efficiently with SSE2.&lt;/p&gt;
&lt;p&gt;Second, it is only available on processors designed after 2009.  That is a harsh penalty.&lt;/p&gt;
&lt;p&gt;Noteworthy: The SSE4.2 &lt;span&gt;pcmpistri instruction (&lt;a href="http://msdn.microsoft.com/en-us/library/bb531463.aspx" title="Emits the Streaming SIMD Extensions 4 (SSE4) instruction pcmpistri."&gt;_mm_cmpistri&lt;/a&gt; intrinsic) runs faster when it reads unaligned memory throughout than when it pays the overhead of aligning the initial block of memory first.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;2. Is a separate executable for 64-bit worth it?&lt;/p&gt;
&lt;p&gt;The benefits of a 64-bit operating system show here using the traditional methods without SIMD.  However, when it comes to the 128-bit SIMD instructions, it really doesn&amp;#8217;t matter too much.  You would indeed get a boost running 64-bit code on a 64-bit operating system, but it is negligible for string functions.&lt;/p&gt;
&lt;p&gt;3. What is more memory efficient?&lt;/p&gt;
&lt;p&gt;None of these methods allocate memory.&lt;/p&gt;
&lt;p&gt;4. Is it worth requiring SSE2?&lt;/p&gt;
&lt;p&gt;While you can certainly fall back to a standard implementation, SSE2 has been around for a very long time and it is unlikely that any software that you write today will be run on unsupported hardware.&lt;/p&gt;
&lt;p&gt;5. Is this faster in ICC 12 or MSVC++ 2010 SP1?&lt;/p&gt;
&lt;p&gt;The results were inconclusive and appear equal.&lt;/p&gt;
&lt;p&gt;The &lt;a href="http://dl.dropbox.com/u/19888282/Posts/strlen.cpp"&gt;source code&lt;/a&gt; and &lt;a href="https://docs.google.com/leaf?id=1TdB0cBr7N_V7TZ1H5p-TRp_Vn4sm6OOWGoY6VxxKgMA&amp;amp;hl=en_US" title="simd strlen.xlsx"&gt;raw data&lt;/a&gt; are available.  The tests were run on an Intel i7-2600K @ 3.4GHz with 8GB of RAM in Windows 7 SP1&amp;#160;64-bit.  &lt;span&gt;Y-axis is &lt;/span&gt;&lt;span&gt;the duration in seconds of 100,000 iterations multiplied by 100,000.  The alignment of the string increases by 1 byte for each string length.  The libc function is the &lt;a href="http://sourceware.org/glibc/wiki/GlibcGit"&gt;source code from git&lt;/a&gt;&lt;/span&gt;&lt;span&gt; on &lt;/span&gt;&lt;span&gt;2011-&lt;/span&gt;&lt;span&gt;10-13&lt;/span&gt;&lt;span&gt; updated to utilize 64-bits in VC++&lt;/span&gt;&lt;span&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&amp;#8220;return 0&amp;#8221; is essentially a baseline for comparison since it is the simplest function that can be called.  The tests were performed in Visual Studio 2010 SP1.&lt;/p&gt;
&lt;p&gt;&lt;img alt="strlen performance comparison (CRT, libc, SSE2, SSE4.2)" height="auto" src="http://db.tt/DkkMu1T5" width="100%"/&gt;&lt;/p&gt;</description><link>http://blog.maragnus.com/post/11447793404</link><guid>http://blog.maragnus.com/post/11447793404</guid><pubDate>Fri, 14 Oct 2011 16:56:00 -0400</pubDate><category>SSE</category><category>SSE4.2</category><category>SSE2</category><category>SIMD</category><category>C++</category><category>optimizing</category><category>programming</category></item></channel></rss>

