Project Part 3

Alright, let's start off with the most relevant function: mainSimpleSort (50%, 28%)
Right off the bat, there's a tiny bit of hoisting that was half-done but seems to have been forgotten.
As you can see, lo + h - 1 is loop invariant. lo + h  is actually calculated outside the loop at the very beginning of the function, but it is used as a counter. I will create a separate variable called loHi_1 and hoist this calculation. This hoist can be repeated three times, since there are three separate instances of this condition in three separate loops. Another hoist we can do is move the v+d calculation outside the loop. I've repeated the same procedure, storing this result into a variable called v_d.

At this point in my blog post, I would like to thank the person who created the bZip2 makefile. At the end of the makefile, there are six small sample files that are run through the bZip2 function suite (decompress, compress, recompress) and their results are automatically cmp'd. This means that an effective test of function integrity occurs at compile time. This has saved me innumerable headaches. Thanks, bZip2 author.

Moving on to the second most relevant function: fallbackQSort3 (40%, 0%)
An easy hoist. Introduced variable F_Q_1.
Something strange is going on regarding the variable r in this function. The following code occurs inside a while loop:
At first glance this would be a strong target for optimization, but given that variable r multiplies by itself in a while loop it becomes difficult to optimize. I'm going to leave this code alone for the time being.
This is a difficult function to optimize. Most calculation has already been hoisted out, there are no for loops, and the frequency of function calls adds a lot of ambiguity. Code such as this:
At a first glance it appears to be a candidate for optimization, but it quickly becomes obvious how well this function was designed. I may revisit this function later.

Third most relevant function: fallbackSort (0%, 26%)
And finally, we have some for loops to work with.
Some strength reduction can be performed by removing the *2 and having the loop iterate 64 times instead of 32. Unfortunately, this was another function with only a single optimization.

Fourth most relevant function: mainQSort3 (15%, 0%)
Another immediate and extremely ineffective hoist. Introducing variable M_Q_2. This function is very similar to fallbackQSort3 in it's structure, it's hatred of for loops, and it's difficulty to optimize.

Final relevant function: mainSort (10%, 1%)
This line really bugs me. I'm not knowledgeable enough with bitwise operators to optimize this. I will revisit this line.
Interesting operation occurring here. h is set to 1, then iterated over and multiplied by itself repeatedly. Since h always starts at 1, the loop will always go around five times.
  1. 3 * 1 + 1 = 4
  2. 3 * 4 + 1 = 13
  3. 3 * 13 + 1 = 40
  4. 3 * 40 + 1 = 121
  5. 3 * 121 + 1 = 364
Therefore, h will always be 364. We can then hardcode this, and remove the looping entirely.

I think we can safely assume that all the previous 'optimization' didn't affect performance at all. To hopefully have an actual impact on execution time, i've added some flags to the makefile.
It was clear from my earlier testing that the -O3 flag did a lot to slow execution time. Because of this, I am remaining at -O2 while enabling some potentially helpful flags.
-ftree-vectorize enables autovectorization
-msse2 enables vectorization on x86_64 architecture
-ffast-math enables vectorization of floating-point reductions
At this point, i've had to go back and test some more. During stage 1, I omitted the small-mode operations since I did not think their results would be relevant, but after vectorizing bZip I realized this was not the case. Because of this I have updated the timings spreadsheet from part 1 with the small-mode data. I have also updated the timings sheet with the "optimized" (vectorized) version of the program. Once again, you can see the results here.
The results are fairly self explanatory, the "optimized" version is more of a middle-ground between the O2 and O3 versions of bZip2. But the most interesting thing to note is that the delta of O2 and opt (difference in time between O2 which is considered the 'ideal' version and the vectorized version) is almost always positive for systime. This makes sense, because if SIMD is to be implemented via vectorization it would decrease the amount of CPU cycles overall.

Comments

Popular posts from this blog

Lab 5 x86_64 Update