Following the White Rabbit

When logic and proportion have fallen sloppy dead
And the white knight is talking backwards
And the red queen’s off with her head
Remember what the dormouse said
Feed your head, feed your head

— Jefferson Airplane, White Rabbit

So, this was intended as a quick and smooth addendum to the “50 ways to fill your vector” post, bringing callgrind into the game and ensuring everyone that its instructions counts are a good proxy for walltime performance of your code. This started out as mostly as expected, when measuring the instructions counts in two scenarios:

implementation/cflags -O2 not inlined -O3 inlined
A1 2610061438 2510061428
A2 2610000025 2510000015
A3 2610000025 2510000015
B1 3150000009 2440000009
B2 3150000009 2440000009
B3 3150000009 2440000009
C1 3150000009 2440000009
C3 3300000009 2440000009

The good news here is, that this mostly faithfully reproduces some general observations on the timings from the last post on this topic, although the differences in callgrind are more pronounced in callgrind than in reality:

  • The A implementations are faster than the B and C implementations on -O2 without inlining
  • The A implementations are slower (by a smaller amount) than the B and C implementations on -O3 with inlining

The last post also suggested the expectation that all implementations could — and with a good compiler: should — have the same code and same speed when everything is inline. Apart from the A implementations still differing from the B and C ones, callgrinds instruction count suggest to actually be the case. Letting gcc compile to assembler and comparing the output, one finds:

  • Inline A1-3 compile to the same output on -Os, -O2, -O3 each. There is no difference between -O2 and -O3 for these.
  • Inline B1-3 compile to the same output on -Os, -O2, -O3 each, but they differ between optimization levels.
  • Inline C3 output differs from the others and between optimization levels.
  • Without inlinable constructors, the picture is the same, except that A3 and B3 now differ slightly from their kin as expected.

So indeed most of the implementations generate the same assembler code. However, this is quite a bit at odd with the significant differences in performance measured in the last post, e.g. B1/B2/B3 on -O2 created widely different walltimes. So time to test the assumption that running one implementation for a minute is producing reasonable statistically stable result, by doing 10 1-minute runs for each implementation and see what the standard deviation is. The following is found for walltimes (no inline constructors):

implementation/cflags -Os -O2 -O3 -O3 -march=
A1 80.6 s 78.9 s 78.9 s 79.0 s
A2 78.7 s 78.1 s 78.0 s 79.2 s
A3 80.7 s 78.9 s 78.9 s 78.9 s
B1 84.8 s 80.8 s 78.0 s 78.0 s
B2 84.8 s 86.0 s 78.0 s 78.1 s
B3 84.8 s 82.3 s 79.7 s 79.7 s
C1 84.4 s 85.4 s 78.0 s 78.0 s
C3 86.6 s 85.7 s 78.0 s 78.9 s
no inline measurements
no inline measurements

And with inlining:

implementation/cflags -Os -O2 -O3 -O3 -march=
A1 76.4 s 74.5 s 74.7 s 73.8 s
A2 75.4 s 73.7 s 73.8 s 74.5 s
A3 76.3 s 74.6 s 75.5 s 73.7 s
B1 80.6 s 77.1 s 72.7 s 73.7 s
B2 81.4 s 78.9 s 72.0 s 72.0 s
B3 80.6 s 78.9 s 72.8 s 73.7 s
C1 81.4 s 78.9 s 72.0 s 72.0 s
C3 79.7 s 80.5 s 72.9 s 77.8 s
inline measurements
inline measurements

The standard deviation for all the above values is less than 0.2 seconds. That is … interesting: For example, on -O2 without inlining, B1 and B2 generate the same assembler output, but execute with a very significant difference in hardware (5.2 s difference, or more than 25 standard deviations). So how have logic and proportion fallen sloppy dead here? If the same code is executed — admittedly from two different locations in the binary — how can that create such a significant difference in walltime performance, while not being visible at all on callgrind? A wild guess, which I have not confirmed yet, is cache locality: When not inlining constructors, those might be in CPU cache from one copy of the code in the binary, but not for the other. And by the way, it might also hint at the reasons for the -march= flag (which creates bigger code) seeming so uneffective. And it might explain, why performance is rather consistent when using inline constructors. If so, the impact of this is certainly interesting. It also suggest that allowing inlining of hotspots, like recently done with the low-level sw::Ring class, produces much more performance improvement on real hardware than the meager results measured with callgrind. And it reinforces the warning made in that post about not falling in the trap of mistaking the map for the territory: callgrind is not a “map in the scale of a mile to the mile”.

Addendum: As said in the previous post, I am still interested in such measurements on other hardware or compilers. All measurements above done with gcc 4.8.3 on Intel i5-4200U@1.6GHz.