Monday, April 20, 2020

Project Conclusion

I chose this project because I saw that the program was intended to run slowly. I thought this would lead to more opportunities to make small changes to get performance improvements. The project was unfortunately already pretty well optimized which lead to a bit of a struggle in the last stage. 

As I've posted in my last posts, ZopfliUpdateHash was the bulk of the programs runtime. UpdateHashValue was called very often inside of this function so these were my main targets for optimization. On a coding level, these functions didn't have much that could be changed to improve their algorithms as they were just single lines with a constant run time or simple loops with at worse a linear growth rate. The functions were just called a lot. This lead me to the conclusion that forced inline functions may improve performance. 

Unfortunately time restraints kind of left me unable to actively test this more thoroughly. I may follow up over the summer just for my own interest but for now, this is the end of my project for SPO600.

Stage 3 - Optimization Opportunities

This stage of the project will detail various ways of optimizing the program. My first thought was to go look at the algorithm used to do the compression. Although the compression is slow, it's thorough. There aren't really simple optimizations to be done such as switching to a better sorting algorithm or better search algorithms because the program really doesn't do either. While there are small things that could be done to perhaps save on memory such as changing some of the data members from ints to shorts or doubles to ints, it'd be a pretty negligible gain most likely, and any changes may actually do more harm than good if we consider alignment.

So, with that in mind I started experimenting with compiler options on x86_64 and Aarch. Once again, this didn't really seem to have much in the way of big optimizations. Any compiler option I added varied things by at most 1 second runtime and I really just attribute that to a bit of variance. Changing to 0fast didn't change the run time at all. After reading some of the comments on github it seems that the program is optimized for 03 optimizations right now so this kind of makes sense.

After exploring my options one option I decided might be possible is to ensure that the zopfliUpdateHash function is an inline function. Though it's probably on the larger end of what should be allowed to be an inline function, it's mostly just a few checks on some preprocessor conditions and then 1 line of code or 1 loop depending on which way it goes. Since this function is called very often (it's a short function and it accounts for roughly 50% of the programs runtime), this should cut down the runtime somewhat just by saving on function call overhead. Though this would increase the size of the executable file in the end, if we're optimizing for speed it may be a good option.


UpdateHashValue could also be made into an inline function since it is also called very often.


In conclusion, I didn't find very many options that were easy fixes in the code. The code is well optimized for what it is. Though it is slow, it is thorough. ZopfliUpdateHash is the bulk of the runtime despite being a relatively small function. This function, along with UpdateHashValue are called very often which leads me to believe ensuring they are inline functions would give a small performance boost at the cost of the executable file being larger, which would be depending on circumstance.

Wednesday, April 15, 2020

Assignment, Stage 2 - Profiling

For this stage of the project, we were required to profile the program to see which functions were eating up the most time. I used the same file for compression as I did in stage 1 of the assignment for Aarchie, 8.192 mb. The program took 197.52 seconds to run in this gprof profile. I added the options -pg -g -0g to the compilation options.

Compiler Options

AArch 64
gprof for zopfli on Aarchie

As you can see, 42.52% of the time is spent in the zopfliUpdateHash. zopfliStoreLitLenDistance comes in a distant second using 12.54% of the time in the profile. Everything else trails by a good bit. 

x86_64
On x86_64, I used a file that was roughly 10 times as big as the file I used on AArchie. The file is roughly 83mb in size. 

gprof on x86_64

Same kind of thing going on on x86_64. zopfliUpdateHash takes the bulk of the time. Interestingly, it takes an increased portion of the overall time compared to AArch64. zopfliStoreLitLenDistance is once again a distant second this time taking 9% of the time, with zopfliMaxCachedSublen taking 8.9%. 

It's somewhat interesting to see the different percentages of time that each function takes compared to AArch64. It's worth noting that it could be due to different file sizes, unfortunately it was unavoidable due to the difference in the power of the machines I used. Either way, the function that takes far and away the most time to run is zopfliUpdateHash so that is likely the one I'll be looking into for stage 3 of the project.




Monday, April 6, 2020

Week 11, profiling a program

This week, we learned about the 2 ways to profile a program

Sampling is one approach. In this approach the program is interrupted frequently to determine which function the most time is being used, and more specifically, where in the function the program is spending its time.

The other approach is instrumentation. It involves adding extra code to the program so that function calls can be recorded.

We were also taught about gprof and perf. Gprof works on both sampling and instrumentation. Perf works only on sampling. The man disadvantage to gprof is that it requires putting extra code into the program to make it work. Perf doesn't require this extra step but it gives less information. Because of this, I will likely end up trying to use gprof for the assignment just to get the most accurate readings.

This lesson overall was very helpful for assignment 2.

Thursday, April 2, 2020

Assignment stage 1, AArch64 times

I connected to the Aarchie server to get these times. I had to do a little bit of guess and check in order to make the file small enough to run on Aarchie but big enough to still get long enough times out of (a file the size of the one I compressed on matrix took so long that Aarchie just disconnected me before it could finish). I used a file that was roughly 8mb (8.192mb to be exact). 

-03 compiler optimization

My runtimes averaged out to around 2m and 18.5 seconds for -03. A file roughly a 10th of the size of the one I used on matrix takes a little more than half the time. Not entirely unexpected considering Aarchie isn't exactly the strongest machine but its a pretty stark difference which made finding a good file size for this a little tricky.

-02 compiler optimization
These runtimes averaged out to around 2m and 19.2 seconds. This surprised me because the runtime on matrix varied far more between compiler optimizations than it did on AArchie.

-01 compiler optimization

The runtimes averaged out to around 2m and 20.7 seconds. Again, not much of a different between the compiler optimizations. I hope to do more testing to see if I can make it vary more with a larger runtime (if AArchie will cooperate).

Wednesday, April 1, 2020

Assignment, Stage 1

For my project, I decided to work with Zopfli (https://github.com/google/zopfli). I mostly chose Zopfli because it specifically says it takes longer to perform compression to I thought it would have more opportunities for optimization. Zopfli is a data compression software that can compress into gzip format. It takes a significant amount of time to do it's compression.



The build process for Zopfli was very simple. I cloned the files and just ran the make file, not complications. I tested around with various file sizes to get a test case that took around 4 minutes to run. I used </dev/urandom > to generate a mass amount of junk data for my test case.



x86_64

Benchmarking -03 compiler optimizations (default in the makefile)



Most runs of the program on -03 optimizations took within 3-4 seconds of the above times (4 mins and 6 seconds). The big thing that stuck out to me was how long it took. After testing with a few other programs, Zopfli took by far the longest of any of them.


Benchmarking -02 compiler optimizations


Again, most runs of the program on this setting took within a few seconds of the above times (4 mins and 15 seconds). A small decrease in performance over the -03 optimizations


Benchmarking -01 compiler optimizations


Most times took within a seconds of the above time (4 mins and 22 seconds). Interestingly there's less of a performance swing between -01 and -02 settings than between -03 and -02. Overall, there's a different of around 16 seconds between -01 and -03 run times. I would be interested to see what kind of performance increases could be achieved on larger file compressions.