Monday, April 20, 2020

Project Conclusion

I chose this project because the program was known to run slowly. I thought this would lead to more opportunities to make small changes to get performance improvements. Unfortunately, the project was already pretty well optimized, which led to a bit of a struggle in the last stage.

As I noted in my previous posts, ZopfliUpdateHash accounted for the bulk of the program's runtime. UpdateHashValue was called very often inside that function, so these two were my main targets for optimization. At the code level, there wasn't much that could be changed to improve their algorithms: they were either single lines with constant run time or simple loops with at worst linear growth. The functions were just called a lot. This led me to the conclusion that forcing them to be inline functions might improve performance.

Unfortunately, time constraints left me unable to test this more thoroughly. I may follow up over the summer just for my own interest, but for now, this is the end of my project for SPO600.

Stage 3 - Optimization Opportunities

This stage of the project details various ways of optimizing the program. My first thought was to look at the algorithm used to do the compression. Although the compression is slow, it's thorough. There aren't really simple optimizations to be had, such as switching to a better sorting or searching algorithm, because the program doesn't really rely on either. While there are small things that could be done to save on memory, such as changing some of the data members from ints to shorts or doubles to ints, the gain would most likely be negligible, and such changes may actually do more harm than good once we consider alignment.

So, with that in mind, I started experimenting with compiler options on x86_64 and AArch64. Once again, this didn't really seem to offer much in the way of big optimizations. Any compiler option I added varied the runtime by at most 1 second, and I really just attribute that to a bit of variance. Changing to -Ofast didn't change the run time at all. After reading some of the comments on GitHub, it seems the program is currently tuned for -O3, so this makes sense.

After exploring my options, one change I decided might be possible is to ensure that the ZopfliUpdateHash function is inlined. Though it's probably on the larger end of what should be allowed to be an inline function, it's mostly just a few checks on some preprocessor conditions followed by either 1 line of code or 1 loop, depending on which way it goes. Since this function is called very often (it's a short function, yet it accounts for roughly 50% of the program's runtime), inlining should cut down the runtime somewhat just by saving on function call overhead. It would increase the size of the executable file in the end, but if we're optimizing for speed it may be a good trade-off.


UpdateHashValue could also be made into an inline function since it is also called very often.


In conclusion, I didn't find many easy fixes in the code. The code is well optimized for what it is; though it is slow, it is thorough. ZopfliUpdateHash accounts for the bulk of the runtime despite being a relatively small function. That function, along with UpdateHashValue, is called very often, which leads me to believe that ensuring both are inline functions would give a small performance boost at the cost of a larger executable file, a trade-off whose value depends on circumstance.

Wednesday, April 15, 2020

Assignment, Stage 2 - Profiling

For this stage of the project, we were required to profile the program to see which functions were eating up the most time. I used the same file for compression as I did in stage 1 of the assignment on AArchie, 8.192 MB. The program took 197.52 seconds to run in this gprof profile. I added the options -pg -g -Og to the compilation options.

Compiler Options

AArch64
[gprof output for Zopfli on AArchie]

As you can see, 42.52% of the time is spent in ZopfliUpdateHash. ZopfliStoreLitLenDist comes in a distant second, using 12.54% of the time in the profile. Everything else trails by a good bit.

x86_64
On x86_64, I used a file that was roughly 10 times as big as the file I used on AArchie. The file is roughly 83mb in size. 

[gprof output on x86_64]

The same kind of thing is going on on x86_64. ZopfliUpdateHash takes the bulk of the time; interestingly, it takes a larger portion of the overall time here than on AArch64. ZopfliStoreLitLenDist is once again a distant second, this time taking 9% of the time, with ZopfliMaxCachedSublen taking 8.9%.

It's somewhat interesting to see the different percentages of time that each function takes compared to AArch64. It's worth noting that this could be due to the different file sizes; unfortunately, that was unavoidable given the difference in power between the machines I used. Either way, the function that takes far and away the most time to run is ZopfliUpdateHash, so that is likely the one I'll be looking into for stage 3 of the project.




Monday, April 6, 2020

Week 11, profiling a program

This week, we learned about the two ways to profile a program.

Sampling is one approach: the program is interrupted frequently to determine which function is using the most time and, more specifically, where in that function the program is spending its time.

The other approach is instrumentation. It involves adding extra code to the program so that function calls can be recorded.

We were also taught about gprof and perf. Gprof uses both sampling and instrumentation; perf uses only sampling. The main disadvantage of gprof is that it requires compiling extra code into the program to make it work. Perf doesn't require this extra step, but it gives less information. Because of this, I will likely end up using gprof for the assignment just to get the most detailed readings.

This lesson overall was very helpful for assignment 2.

Thursday, April 2, 2020

Assignment stage 1, AArch64 times

I connected to the AArchie server to get these times. I had to do a little guess-and-check to make the file small enough to run on AArchie but big enough to still get long enough times out of it (a file the size of the one I compressed on Matrix took so long that AArchie just disconnected me before it could finish). I used a file that was roughly 8 MB (8.192 MB to be exact).

-O3 compiler optimization

My runtimes averaged out to around 2 minutes and 18.5 seconds for -O3. A file roughly a tenth of the size of the one I used on Matrix takes a little more than half the time. Not entirely unexpected, considering AArchie isn't exactly the strongest machine, but it's a pretty stark difference, which made finding a good file size for this a little tricky.

-O2 compiler optimization
These runtimes averaged out to around 2 minutes and 19.2 seconds. This surprised me, because the runtime on Matrix varied far more between optimization levels than it did on AArchie.

-O1 compiler optimization

The runtimes averaged out to around 2 minutes and 20.7 seconds. Again, not much of a difference between the optimization levels. I hope to do more testing to see if I can make it vary more with a larger runtime (if AArchie will cooperate).

Wednesday, April 1, 2020

Assignment, Stage 1

For my project, I decided to work with Zopfli (https://github.com/google/zopfli). I mostly chose Zopfli because it specifically says it takes longer to perform compression, so I thought it would have more opportunities for optimization. Zopfli is data compression software that can compress into the gzip format. It takes a significant amount of time to do its compression.



The build process for Zopfli was very simple. I cloned the files and just ran the makefile, no complications. I tested around with various file sizes to get a test case that took around 4 minutes to run. I used /dev/urandom to generate a mass amount of junk data for my test case.



x86_64

Benchmarking -O3 compiler optimizations (the default in the makefile)



Most runs of the program at -O3 took within 3-4 seconds of the above times (4 minutes and 6 seconds). The big thing that stuck out to me was how long it took; after testing with a few other programs, Zopfli took by far the longest of any of them.


Benchmarking -O2 compiler optimizations


Again, most runs of the program at this setting took within a few seconds of the above time (4 minutes and 15 seconds), a small decrease in performance compared to -O3.


Benchmarking -O1 compiler optimizations


Most times were within a second of the above time (4 minutes and 22 seconds). Interestingly, there's less of a performance swing between the -O1 and -O2 settings than between -O3 and -O2. Overall, there's a difference of around 16 seconds between the -O1 and -O3 run times. I would be interested to see what kind of performance increases could be achieved on larger file compressions.

Sunday, March 1, 2020

Lab4

For lab 4, our group chose to do options 1 and 4.

Part 1 proved the most challenging. Our group managed to get the input to accept only numbers from the user (after several somewhat embarrassing mix-ups between hexadecimal values and their decimal equivalents). We couldn't get the numbers to add, though, and this fell by the wayside while we finished part 4.

For part 4, we chose to start the screen on black. Pressing an arrow key increments or decrements (depending on direction) the colour we store for the screen and redraws it. The most challenging part was getting the proper line on the text screen to highlight.

Friday, January 31, 2020

Lab 3

The program we chose to complete for this lab was pong. For something that seemed relatively simple, there were some complications with it. One thing that seems obvious now, but wasn't at the time, was that we needed something to delay the drawing of the "ball" on the screen so that it didn't just instantly fly all over the place and make the game unplayable.

We used the following code for our "sleep" function:

delay:  iny             ; inner loop: count Y up to $FF
        cpy #$FF
        bne delay

        ldy #$00        ; reset the inner counter
        inx             ; outer counter in X
        cpx #$06        ; repeat the inner loop 6 times
        bne delay

The above function just gives the processor busy work for a while until it should draw again. In other languages you can call built-in methods like sleep for this kind of thing. It was interesting to get a look behind the scenes at what the processor could potentially be doing to make these "sleep" functions happen, though in a real system it would probably be working on other tasks rather than just counting.

Thursday, January 23, 2020

Lab 2

For lab 2, we were required to write assembly code to colour the edges of the bitmap screen. The first thing that struck me about this lab is just how hard it is to do even the simplest things in assembly. Even something as simple as adding one value to another is not quite as straightforward as in other languages.

tya         ; move y value to a for the add
adc #$1f    ; add 31 ($1f); with the iny later, this moves one 32-pixel row
tay         ; transfer the new value back to y

The above code, for example, seems much more daunting than just typing "y += 32;" in C or Java. One of the biggest issues I ran into with this lab was that I didn't really understand what the carry flag was or where it came from. It was ultimately explained in class, but until then, it just seemed like my results were getting a 1 added to them at random. It did create some interesting results on the bitmap screen, at least. I learned that CLC should be used before attempting to do addition in most cases.


Below is the code for lab 2.


        lda #$00        ; set a pointer at $40 to point to $0200
        sta $40
        lda #$02
        sta $41

        lda #$05        ; colour
        ldy #$00        ; set index to 0

loop:   sta ($40),y     ; set pixel
        iny             ; increment index
        cpy #$20
        bne loop        ; continue until done the top row

        ldy #$E0        ; index of the last line of page 5
        lda #$05
        sta $41         ; point the pointer at page 5
        lda #$06        ; colour
loop2:  sta ($40),y
        iny
        cpy #$00
        bne loop2       ; continue until done the bottom row

        ldy #$00
        lda #$00        ; reset the pointer at $40 to $0200
        sta $40
        lda #$02
        sta $41

loop3:  clc
        lda #$07
        sta ($40),y     ; colour the left edge
        tya             ; move y value to a for the add
        adc #$1f        ; add 31; with the iny below, this moves one 32-pixel row
        tay             ; transfer the new value back to y
        lda #$04        ; load purple
        sta ($40),y     ; colour the right edge
        iny             ; increment to the start of the next line
        cpy #$00        ; check if finished the page
        bne loop3
        inc $41         ; increment the page
        ldx $41         ; load the page number
        cpx #$06        ; check if past the end of the display
        bne loop3