The results here are from a run on csmarlboro.org, in /var/www/csmarlboro/memory
i.e. http://csmarlboro.org/memory

The ./get script fetches the C files from cmu.

 $ more /proc/cpuinfo
 $ make sumarrayrows
 $ make sumarraycols

 $ time ./sumarrayrows
 sum=0
 real    0m0.190s
 user    0m0.125s
 sys     0m0.065s

 $ time ./sumarraycols
 sum=0
 real    0m0.871s
 user    0m0.797s
 sys     0m0.073s

How big are the arrays?
What's the difference between the routines?
Why does cols take four times longer to run?
(A sketch of the two loops is at the end of these notes.)

-- valgrind --

 $ valgrind --tool=cachegrind ./sumarrayrows
 ==15778== Cachegrind, a cache and branch-prediction profiler
 ==15778== Copyright (C) 2002-2010, and GNU GPL'd, by Nicholas Nethercote et al.
 ==15778== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright info
 ==15778== Command: ./sumarrayrows
 ==15778==
 sum=0
 ==15778==
 ==15778== I   refs:      360,886,922
 ==15778== I1  misses:            784
 ==15778== LLi misses:            779
 ==15778== I1  miss rate:        0.00%
 ==15778== LLi miss rate:        0.00%
 ==15778==
 ==15778== D   refs:      218,194,148  (201,391,605 rd + 16,802,543 wr)
 ==15778== D1  misses:      2,098,354  (  1,049,591 rd +  1,048,763 wr)
 ==15778== LLd misses:      2,098,272  (  1,049,516 rd +  1,048,756 wr)
 ==15778== D1  miss rate:         0.9% (        0.5%   +        6.2%  )
 ==15778== LLd miss rate:         0.9% (        0.5%   +        6.2%  )
 ==15778==
 ==15778== LL refs:         2,099,138  (  1,050,375 rd +  1,048,763 wr)
 ==15778== LL misses:       2,099,051  (  1,050,295 rd +  1,048,756 wr)
 ==15778== LL miss rate:          0.3% (        0.1%   +        6.2%  )

 $ valgrind --tool=cachegrind ./sumarraycols
 ==15782== Cachegrind, a cache and branch-prediction profiler
 ==15782== Copyright (C) 2002-2010, and GNU GPL'd, by Nicholas Nethercote et al.
 ==15782== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright info
 ==15782== Command: ./sumarraycols
 ==15782==
 sum=0
 ==15782==
 ==15782== I   refs:      360,886,922
 ==15782== I1  misses:            784
 ==15782== LLi misses:            779
 ==15782== I1  miss rate:        0.00%
 ==15782== LLi miss rate:        0.00%
 ==15782==
 ==15782== D   refs:      218,194,148  (201,391,605 rd + 16,802,543 wr)
 ==15782== D1  misses:     17,826,994  ( 16,778,231 rd +  1,048,763 wr)
 ==15782== LLd misses:     17,826,912  ( 16,778,156 rd +  1,048,756 wr)
 ==15782== D1  miss rate:         8.1% (        8.3%   +        6.2%  )
 ==15782== LLd miss rate:         8.1% (        8.3%   +        6.2%  )
 ==15782==
 ==15782== LL refs:        17,827,778  ( 16,779,015 rd +  1,048,763 wr)
 ==15782== LL misses:      17,827,691  ( 16,778,935 rd +  1,048,756 wr)
 ==15782== LL miss rate:          3.0% (        2.9%   +        6.2%  )

 $ cg_annotate cachegrind.out.15778
 I1 cache:         32768 B, 64 B, 8-way associative
 D1 cache:         32768 B, 64 B, 8-way associative
 LL cache:         6291456 B, 64 B, 24-way associative
 ...

valgrind's cachegrind simulation has
   L1 : the top-level cache, split into I (instructions) and D (data)
   LL : the "last level" cache
"This exactly matches the config of many modern machines."

caches :
   L1 : I1 (32kB) & D1 (32kB)
   L2 : unified (6MB)

8-way associative means that each address maps to one of 8 possible cache spots.
(1 place = "direct mapped"; anywhere = "fully associative";
 see e.g. wikipedia CPU_cache)
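To make that concrete, here's a small sketch (mine, not from the CMU code) of how a
set-associative cache decides which set an address may live in, using the D1 parameters
that cg_annotate reported: 32768 B total / 64 B per line / 8 ways = 64 sets.

 /* Which D1 set can a given address live in?
  * 32768 B / 64 B per line = 512 lines; 512 lines / 8 ways = 64 sets. */
 #include <stdio.h>
 #include <stdint.h>

 #define LINE_SIZE   64                       /* bytes per cache line          */
 #define CACHE_SIZE  32768                    /* total D1 size in bytes        */
 #define WAYS        8                        /* lines per set (associativity) */
 #define NUM_SETS    (CACHE_SIZE / LINE_SIZE / WAYS)   /* = 64 */

 int main(void) {
     int x = 0;                               /* any variable; we just want an address */
     uintptr_t addr = (uintptr_t) &x;
     uintptr_t set  = (addr / LINE_SIZE) % NUM_SETS;
     printf("address 0x%lx maps to D1 set %lu; it may sit in any of %d ways there\n",
            (unsigned long) addr, (unsigned long) set, WAYS);
     return 0;
 }

With WAYS set to 1 this would be direct mapped (exactly one possible spot per address);
with a single set holding all 512 lines it would be fully associative (any spot at all).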
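Back to the questions near the top: the CMU sources aren't reproduced in these notes,
but the two programs presumably differ only in loop order, something like the sketch
below. N = 4096 is a guess chosen to fit the miss counts (4096 * 4096 = 16,777,216 ints),
and the function names here are made up.

 #include <stdio.h>

 #define N 4096                            /* guessed dimension, see above */

 static int a[N][N];                       /* 64 MB of ints */

 static void init(void) {                  /* fill the array once; a write pass    */
     for (int i = 0; i < N; i++)           /* like this would account for the      */
         for (int j = 0; j < N; j++)       /* ~16.8 million data writes seen in    */
             a[i][j] = 0;                  /* both runs                            */
 }

 static int sum_rows(void) {               /* sumarrayrows: row-major scan,   */
     int sum = 0;                          /* consecutive addresses           */
     for (int i = 0; i < N; i++)
         for (int j = 0; j < N; j++)
             sum += a[i][j];
     return sum;
 }

 static int sum_cols(void) {               /* sumarraycols: column-major scan, */
     int sum = 0;                          /* jumps 4*N bytes on every access  */
     for (int j = 0; j < N; j++)
         for (int i = 0; i < N; i++)
             sum += a[i][j];
     return sum;
 }

 int main(void) {
     init();
     printf("sum=%d\n", sum_rows());       /* or sum_cols() in the other program */
     return 0;
 }

With 4-byte ints and the 64 B lines reported above, the row-order scan misses D1 at most
once per 16 reads, while the column-order stride touches a fresh line on nearly every
read: a factor of 16, which is just what the D1 read misses show (about 1.05 million vs
16.8 million). Both programs execute the same instructions (identical I refs and D refs),
so the extra wall-clock time in cols is mostly spent waiting on those misses. And if the
rows really are 4096 ints wide, consecutive elements of one column sit 16384 bytes = 256
lines apart, and 256 mod 64 = 0, so a column scan also keeps landing in the same D1 set.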