Lab 3 Goals

By the end of this lab, you should be able to use tools like gprof, time, and valgrind to diagnose performance problems in your code. You will use these tools to identify the portions of your program that are taking up the most time.

Using gprof

gprof is the GNU profiler and is closely integrated with the C compiler, gcc. When gcc is invoked with the -pg option, extra profiling code is generated in the output binary. At a fixed interval, this profiling code wakes up and records the program counter as well as the current stack backtrace. Profilers that work by gathering data at fixed intervals are called sampling profilers, as opposed to instrumenting profilers, which rewrite a program to emit timestamp information at specific program points during execution. The biggest difference is that instrumenting profilers can dramatically change the timing behavior of your program, whereas sampling profilers can miss functions with tiny execution times.

In your walkthrough directory, there is a very simple program that performs a rolling sum of all of the numbers in an array, ignoring overflow conditions. The walkthrough also performs a simple directory traversal. Compile the program so that it will collect profiling information and run it:

$ cd ~/cs154/lab3/walkthrough
$ gcc -g -pg main.c
$ ./a.out 10000

It will take the program a couple of seconds to execute (depending on which machine you are working at) and then it will exit. Because the program was compiled with -pg, running it emitted a gmon.out file, which is what gprof takes as input. Now run gprof to see an analysis of the gathered data:

$ gprof a.out gmon.out

It will first print a flat profile, showing how much time was spent in each of the functions in the program. There are only two in this walkthrough - main and AddToEntry. The column self seconds is the most interesting and useful one. That column reports the amount of time spent executing just the function named. So, main's self seconds does not include time spent in the AddToEntry function, even though main was on the call stack when AddToEntry was called.
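
The flat profile's column layout looks like this (the numbers are omitted here as placeholders; your measurements will differ):

```
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 ...      ...       ...        ...     ...      ...   AddToEntry
 ...      ...       ...        ...     ...      ...   main
```

Reading across a row: self seconds is time sampled inside that function alone, while cumulative seconds is a running total down the (sorted) list.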

The second section is the call graph. For this program, it's not very interesting, but it basically is showing two things: functions called by a function (callees), and places where a function was called (callers).

Since we compiled the program with debugging symbols enabled (the -g option to gcc), we can also get line-by-line profiling information. This expands the previous view to show time spent on individual source code basic blocks.

$ gprof -lb a.out gmon.out

Profiling an unoptimized program, however, is not a good use of time. Why redo work that the compiler can do for you? Compile with the -O2 flag to enable a large set of optimizations and notice the changes in the output reported by gprof.

$ gcc -O2 -g -pg main.c
$ ./a.out 10000
$ gprof -lb a.out gmon.out

Using cachegrind

The Valgrind tool framework has a memory cache performance simulator. Cachegrind will run a program and make an estimation of that program's cache performance. Recompile the walkthrough without profiling turned on, and run it through cachegrind:

$ gcc -O2 -g main.c
$ valgrind --tool=cachegrind --L2=4194304,2,64 ./a.out 2000

The --L2 parameter is required because the version of cachegrind on these machines cannot detect the size of the L2 cache on the processor.

After the tool header and warnings, cachegrind reports three sections of information. The first section covers the total number of instructions and the instruction cache misses. The second section covers data references and misses in the L1 and L2 caches. The final section gives the combined instruction and data reference counts and misses. Of particular importance when diagnosing performance problems is the L2d (Level 2 Data) miss rate: the percentage of memory requests that had to go all the way to main memory rather than being served from a cached value.
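
The summary has roughly this shape (labels as in the version of cachegrind this lab assumes; the values, shown here as placeholders, depend on your program and the simulated cache parameters):

```
I   refs:       ...
I1  misses:     ...
L2i misses:     ...
I1  miss rate:  ...
L2i miss rate:  ...

D   refs:       ...
D1  misses:     ...
L2d misses:     ...
D1  miss rate:  ...
L2d miss rate:  ...

L2 refs:        ...
L2 misses:      ...
L2 miss rate:   ...
```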

For this program and input size, the data fits almost entirely in the cache, so the misses are essentially just the compulsory misses: the initial load of each cache line the first time it is touched.

Using time

While gprof gives a very good overview of how the C code you wrote is running, it doesn't show how much time was spent in system calls on your code's behalf. For example, stat() is an expensive call because it requires the operating system to look up a large amount of information in the filesystem and fill in a structure. However, gprof will report that call as having taken zero time, because stat is system code rather than user code. To see the difference between the two measurements, run time on the walkthrough and observe the difference between the user and sys times. The gprof tool reports only user time.

$ time ./a.out 2000

There should be a significant amount of system time reported by time - more than likely over half of the total runtime, though none of that system time is reported by gprof!

Also, always remember to run time with a fully-qualified path to the program to execute. Otherwise, you'll just see a quick error message and then very short execution times. Being wrong is a very fast operation.

Exercises

First, please create and add a readme.txt file at the top of your lab3 directory with the names of all folks working together. Use this file to record your answers to questions in the exercises. If you work with several people, it's fine to just check in one file in one directory for this exercise.

$ cd ~/cs154/lab3
$ emacs readme.txt
$ svn add readme.txt
$ svn ci -m "Readme with member names"

Exercise 1

In the ex1 directory is a file main.c. This program opens the binary file for AbiWord and counts the approximate number of NUL (zero) bytes within it. Open the source file main.c, look at it, and try to determine where the largest performance problem is without running the program.

Now, use both the gprof and time tools to determine where the greatest opportunity to speed up the program is. Commit a speedier version of the file main.c. Also comment in your readme.txt file about your guess and the actual performance problem.

In particular, remember to compile with -O2 throughout, adding -g and -pg only when building for use with gprof. When measuring with time, compile with just -O2 so you are not also measuring the overhead of collecting profiling information.

Exercise 2

In the ex2 directory is a file main.c. When compiled and run, this program takes an integer argument for the size of an array and then walks along the array writing some values into the cells.

Compile (with no optimization, to avoid loop reordering) and then run the program under cachegrind with a variety of input sizes (50; 100; 500; 1,000; 5,000). Enter your answers to the following questions into your readme.txt file:

Make a straightforward loop transformation to improve data locality. What effect does this loop transformation have on the simulated L1 and L2 data miss rates? Commit your new version of main.c and put your answer in your readme.txt file.

Bonus Problems

Do not attempt these problems until you have completed all of the normal exercises.

BONUS 1 (what's wget doing?)

Download wget (via wget http://ftp.gnu.org/gnu/wget/wget-latest.tar.gz) and create a profiled version by adding the -pg argument to the CFLAGS definition in the Makefile that configure generates in the src directory. Run your built version of wget (it will be in the src/ subdirectory) to retrieve the wget tarball again. Where is wget spending most of its wall clock (overall) time getting the file? What would be the result of optimizing the code to which you have access? Assuming you could make the user code infinitely fast, how much speedup would you get?

BONUS 2 (do you trust your results?)

In the examples above, gprof interrupted the program to sample on a 0.01 second basis. If the program runs for only 10 seconds, how confident would you be if gprof told you that the function you were interested in runs for 1 second? What might you do to increase your confidence? How could you quantify that confidence level?

Commit your answers to your readme.txt file.

Last modified: Wed May 11 16:30:37 CDT 2011