The results here are from a run on csmarlboro.org, in /var/www/csmarlboro/memory
i.e. http://csmarlboro.org/memory

The ./get script fetches the C files from cmu.

 $ more /proc/cpuinfo
 $ make sumarrayrows
 $ make sumarraycols

 $ time ./sumarrayrows
 sum=0
 real    0m0.190s
 user    0m0.125s
 sys     0m0.065s

 $ time ./sumarraycols
 sum=0
 real    0m0.871s
 user    0m0.797s
 sys     0m0.073s

How big are the arrays?
What's the difference between the routines?
Why does cols take four times longer to run?
(A sketch of the two loops is at the end of these notes.)

-- valgrind --

 $ valgrind --tool=cachegrind ./sumarrayrows
 ==15778== Cachegrind, a cache and branch-prediction profiler
 ==15778== Copyright (C) 2002-2010, and GNU GPL'd, by Nicholas Nethercote et al.
 ==15778== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright info
 ==15778== Command: ./sumarrayrows
 ==15778==
 sum=0
 ==15778==
 ==15778== I   refs:      360,886,922
 ==15778== I1  misses:            784
 ==15778== LLi misses:            779
 ==15778== I1  miss rate:        0.00%
 ==15778== LLi miss rate:        0.00%
 ==15778==
 ==15778== D   refs:      218,194,148  (201,391,605 rd + 16,802,543 wr)
 ==15778== D1  misses:      2,098,354  (  1,049,591 rd +  1,048,763 wr)
 ==15778== LLd misses:      2,098,272  (  1,049,516 rd +  1,048,756 wr)
 ==15778== D1  miss rate:         0.9% (        0.5%   +        6.2%  )
 ==15778== LLd miss rate:         0.9% (        0.5%   +        6.2%  )
 ==15778==
 ==15778== LL refs:         2,099,138  (  1,050,375 rd +  1,048,763 wr)
 ==15778== LL misses:       2,099,051  (  1,050,295 rd +  1,048,756 wr)
 ==15778== LL miss rate:          0.3% (        0.1%   +        6.2%  )

 $ valgrind --tool=cachegrind ./sumarraycols
 ==15782== Cachegrind, a cache and branch-prediction profiler
 ==15782== Copyright (C) 2002-2010, and GNU GPL'd, by Nicholas Nethercote et al.
 ==15782== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright info
 ==15782== Command: ./sumarraycols
 ==15782==
 sum=0
 ==15782==
 ==15782== I   refs:      360,886,922
 ==15782== I1  misses:            784
 ==15782== LLi misses:            779
 ==15782== I1  miss rate:        0.00%
 ==15782== LLi miss rate:        0.00%
 ==15782==
 ==15782== D   refs:      218,194,148  (201,391,605 rd + 16,802,543 wr)
 ==15782== D1  misses:     17,826,994  ( 16,778,231 rd +  1,048,763 wr)
 ==15782== LLd misses:     17,826,912  ( 16,778,156 rd +  1,048,756 wr)
 ==15782== D1  miss rate:         8.1% (        8.3%   +        6.2%  )
 ==15782== LLd miss rate:         8.1% (        8.3%   +        6.2%  )
 ==15782==
 ==15782== LL refs:        17,827,778  ( 16,779,015 rd +  1,048,763 wr)
 ==15782== LL misses:      17,827,691  ( 16,778,935 rd +  1,048,756 wr)
 ==15782== LL miss rate:          3.0% (        2.9%   +        6.2%  )

 $ cg_annotate cachegrind.out.15778
 I1 cache:         32768 B, 64 B, 8-way associative
 D1 cache:         32768 B, 64 B, 8-way associative
 LL cache:         6291456 B, 64 B, 24-way associative
 ...

valgrind's cachegrind simulation has
   L1 : the top-level cache, split into I (instructions) and D (data)
   LL : the "last level" cache
"This exactly matches the config of many modern machines."

caches :
   L1 : I1 (32kB) & D1 (32kB)
   L2 : unified (6MB)

8-way associative means that each address maps to one of 8 possible cache spots.
(1 place = "direct mapped"; anywhere = "fully associative";
 see e.g. wikipedia CPU_cache)
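To make that concrete, here's a small sketch (mine, not from the CMU code) of how a
set-associative cache decides which set an address may live in, using the D1 parameters
that cg_annotate reported: 32768 B total / 64 B per line / 8 ways = 64 sets.

 /* Which D1 set can a given address live in?
  * 32768 B / 64 B per line = 512 lines; 512 lines / 8 ways = 64 sets. */
 #include <stdio.h>
 #include <stdint.h>

 #define LINE_SIZE   64                       /* bytes per cache line          */
 #define CACHE_SIZE  32768                    /* total D1 size in bytes        */
 #define WAYS        8                        /* lines per set (associativity) */
 #define NUM_SETS    (CACHE_SIZE / LINE_SIZE / WAYS)   /* = 64 */

 int main(void) {
     int x = 0;                               /* any variable; we just want an address */
     uintptr_t addr = (uintptr_t) &x;
     uintptr_t set  = (addr / LINE_SIZE) % NUM_SETS;
     printf("address 0x%lx maps to D1 set %lu; it may sit in any of %d ways there\n",
            (unsigned long) addr, (unsigned long) set, WAYS);
     return 0;
 }

With WAYS set to 1 this would be direct mapped (exactly one possible spot per address);
with a single set holding all 512 lines it would be fully associative (any spot at all).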
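Back to the questions near the top: the CMU sources aren't reproduced in these notes,
but the two programs presumably differ only in loop order, something like the sketch
below. N = 4096 is a guess chosen to fit the miss counts (4096 * 4096 = 16,777,216 ints),
and the function names here are made up.

 #include <stdio.h>

 #define N 4096                            /* guessed dimension, see above */

 static int a[N][N];                       /* 64 MB of ints */

 static void init(void) {                  /* fill the array once; a write pass    */
     for (int i = 0; i < N; i++)           /* like this would account for the      */
         for (int j = 0; j < N; j++)       /* ~16.8 million data writes seen in    */
             a[i][j] = 0;                  /* both runs                            */
 }

 static int sum_rows(void) {               /* sumarrayrows: row-major scan,   */
     int sum = 0;                          /* consecutive addresses           */
     for (int i = 0; i < N; i++)
         for (int j = 0; j < N; j++)
             sum += a[i][j];
     return sum;
 }

 static int sum_cols(void) {               /* sumarraycols: column-major scan, */
     int sum = 0;                          /* jumps 4*N bytes on every access  */
     for (int j = 0; j < N; j++)
         for (int i = 0; i < N; i++)
             sum += a[i][j];
     return sum;
 }

 int main(void) {
     init();
     printf("sum=%d\n", sum_rows());       /* or sum_cols() in the other program */
     return 0;
 }

With 4-byte ints and the 64 B lines reported above, the row-order scan misses D1 at most
once per 16 reads, while the column-order stride touches a fresh line on nearly every
read: a factor of 16, which is just what the D1 read misses show (about 1.05 million vs
16.8 million). Both programs execute the same instructions (identical I refs and D refs),
so the extra wall-clock time in cols is mostly spent waiting on those misses. And if the
rows really are 4096 ints wide, consecutive elements of one column sit 16384 bytes = 256
lines apart, and 256 mod 64 = 0, so a column scan also keeps landing in the same D1 set.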