50 ways to fill your vector …

“The problem is all inside your head” she said to me
“The answer is easy if you take it logically”
— Paul Simon, 50 ways to leave your lover

So recently I tweaked around with these newfangled C++11 initializer lists and created an EasyHack to use them to initialize property sequences in a readable way. This caused a short exchange on the LibreOffice mailing list, which I assume played its part in motivating Stephan's interesting post “On filling a vector”. For all the points being made (also in the quick follow-up on IRC), I wondered how much the theoretical “can use a move constructor” discussion etc. really means once the C++ is translated to e.g. GENERIC, then GIMPLE, then amd64 assembler, and then to the internal RISC instructions of the CPU — with multiple levels of caching in addition.

So I quickly wrote the following benchmark (thanks so much to C++11 for now having the nice std::chrono).

data.hxx:

#include <vector>
struct Data {
    Data();
    Data(int a);
    int m_a;
};
void DoSomething(std::vector<Data>&);

data.cxx:

#include "data.hxx"
// noop in different compilation unit to prevent optimizing out what we want to measure
void DoSomething(std::vector<Data>&) {}
Data::Data() : m_a(4711) {}
Data::Data(int a) : m_a(a+4711) {}

main.cxx:

#include "data.hxx"
#include <iostream>
#include <vector>
#include <chrono>
#include <functional>

void A1(long count) {
    while(--count) {
        std::vector<Data> vec { Data(), Data(), Data() };
        DoSomething(vec);
    }
}

void A2(long count) {
    while(--count) {
        std::vector<Data> vec { {}, {}, {} };
        DoSomething(vec);
    }
}

void A3(long count) {
    while(--count) {
        std::vector<Data> vec { 0, 0, 0 };
        DoSomething(vec);
    }
}

void B1(long count) {
    while(--count) {
        std::vector<Data> vec;
        vec.reserve(3);
        vec.push_back(Data());
        vec.push_back(Data());
        vec.push_back(Data());
        DoSomething(vec);
    }
}

void B2(long count) {
    while(--count) {
        std::vector<Data> vec;
        vec.reserve(3);
        vec.push_back({});
        vec.push_back({});
        vec.push_back({});
        DoSomething(vec);
    }
}

void B3(long count) {
    while(--count) {
        std::vector<Data> vec;
        vec.reserve(3);
        vec.push_back(0);
        vec.push_back(0);
        vec.push_back(0);
        DoSomething(vec);
    }
}

void C1(long count) {
    while(--count) {
        std::vector<Data> vec;
        vec.reserve(3);
        vec.emplace_back(Data());
        vec.emplace_back(Data());
        vec.emplace_back(Data());
        DoSomething(vec);
    }
}

void C3(long count) {
    while(--count) {
        std::vector<Data> vec;
        vec.reserve(3);
        vec.emplace_back(0);
        vec.emplace_back(0);
        vec.emplace_back(0);
        DoSomething(vec);
    }
}

double benchmark(const char* name, std::function<void (long)> testfunc, const long count) {
    const auto start = std::chrono::system_clock::now();
    testfunc(count);
    const auto end = std::chrono::system_clock::now();
    const std::chrono::duration<double> delta = end-start;
    std::cout << count << " " << name << " iterations took " << delta.count() << " seconds." << std::endl;
    return delta.count();
}

int main(int, char**) {
    long count = 10000000;
    while(benchmark("A1", &A1, count) < 60l)
        count <<= 1;
    std::cout << "Going with " << count << " iterations." << std::endl;
    benchmark("A1", &A1, count);
    benchmark("A2", &A2, count);
    benchmark("A3", &A3, count);
    benchmark("B1", &B1, count);
    benchmark("B2", &B2, count);
    benchmark("B3", &B3, count);
    benchmark("C1", &C1, count);
    benchmark("C3", &C3, count);
    return 0;
}

Makefile:

CFLAGS?=-O2
main: main.o data.o
    g++ -o $@ $^

%.o: %.cxx data.hxx
    g++ $(CFLAGS) -std=c++11 -o $@ -c $<
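For reference, the four build configurations measured below can be driven like this (assuming GNU make; the `rm` is needed because the Makefile above has no clean target):

```shell
for flags in "-Os" "-O2" "-O3" "-O3 -march=core-avx2"; do
    rm -f main main.o data.o
    make CFLAGS="$flags"
    ./main
done
```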

Note the objects here are small and trivial to copy, as one would expect from objects passed around as values (expensive-to-copy objects can mostly be passed around via a std::shared_ptr instead). So what did this measure? Here are the results:

Time for 1280000000 iterations on an Intel i5-4200U@1.6GHz (-march=core-avx2) compiled with gcc 4.8.3 without inline constructors:

implementation / CFLAGS      -Os        -O2        -O3      -O3 -march=…
A1                          89.1 s     79.0 s     78.9 s     78.9 s
A2                          89.1 s     78.1 s     78.0 s     80.5 s
A3                          90.0 s     78.9 s     78.8 s     79.3 s
B1                         103.6 s     97.8 s     79.0 s     78.0 s
B2                          99.4 s     95.6 s     78.5 s     78.0 s
B3                         107.4 s     90.9 s     79.7 s     79.9 s
C1                          99.4 s     94.4 s     78.0 s     77.9 s
C3                          98.9 s    100.7 s     78.1 s     81.7 s

creating a three-element vector without inlined constructors
And, for comparison, here are the results if one allows the constructors to be inlined.
Time for 1280000000 iterations on an Intel i5-4200U@1.6GHz (-march=core-avx2) compiled with gcc 4.8.3 with inline constructors:

implementation / CFLAGS      -Os        -O2        -O3      -O3 -march=…
A1                          85.6 s     74.7 s     74.6 s     74.6 s
A2                          85.3 s     74.6 s     73.7 s     74.5 s
A3                          91.6 s     73.8 s     74.4 s     74.5 s
B1                          93.4 s     90.2 s     72.8 s     72.0 s
B2                          93.7 s     88.3 s     72.0 s     73.7 s
B3                          97.6 s     88.3 s     72.8 s     72.0 s
C1                          93.4 s     88.3 s     72.0 s     73.7 s
C3                          96.2 s     88.3 s     71.9 s     73.7 s

creating a three-element vector with inlined constructors
Some observations on these measurements:

  • -march=... is at best neutral: the measured times do not change much in general; it improves performance even slightly in only five out of 16 cases, and the two cases with the most significant change (over 3%) actually get slower. So for the rest of this post, -march=... will be ignored. Sorry, gentooers. 😉
  • There is no silver bullet among the implementations: A1, A2 and A3 are the fastest when constructors are not inlined and the build uses -Os or -O2 (the quickest A* is ~10% faster than the quickest B*/C*). However, with constructors inlined and -O3, the same implementations are the slowest (by 2.4%).
  • Most common release builds are still done with -O2 these days. For those, using initializer lists (A1/A2/A3) seems to have a significant edge over the alternatives, whether constructors are inlined or not. This is in contrast to the conclusions drawn from “constructor counting”, which assumed these variants to be slow because of the additional calls needed.
  • Counting, per build scenario, the quickest implementation and any implementation within 1.5% of it: A1 and A2 share the title here by each being in that group five times.
  • With constructors inlined, everything in the loop except DoSomething() could be inlined. It seems to me that the compiler could — at least in theory — figure out that it is asked for the same thing in all cases: reserve space for three Data objects on the heap, fill each with 4711, make the std::vector<Data> data structure on the stack reflect that, and then hand that to the DoSomething() function it knows nothing about. If the compiler figured that out, all implementations would take the same time. This happens neither at -O2 (where quickest and slowest differ by ~18%) nor at -O3 (~3.6%).

One common mantra in application development is “trust the compiler to optimize”. The observations above show a few cracks in the foundation of that, especially considering that this is all the same version of the same compiler running on the same platform and hardware with the same STL implementation. For huge objects with expensive constructors, the constructor-counting approach might still be valid. Then again, those are rarely statically initialized as a bigger bunch into a vector. For the more common scenario of smaller objects with cheap constructors, my tentative conclusion so far would be to go with A1/A2/A3 — not so much because they are the quickest in the most common build scenarios on my platform, but because their readability is a value of its own, while the performance picture is muddy at best.

And hey, if you want to run the tests above on other platforms or compilers, I would be interested in results!

Note: I did these runs only once per scenario, thus no standard deviation is given. In general the numbers seemed rather stable, but as these are wallclock measurements, one or the other might be an outlier. Caveat emptor.

5 thoughts on “50 ways to fill your vector …”

  1. The -flto flag is, I think, the most interesting recent development in GCC. In my small experience with C, code like this tends to be allocation-bound, so overriding the memory manager should make your program run much faster.
    Functions like malloca are your friend here 🙂

    1. Well, for one, -flto would likely help performance here: by allowing gcc to see that everything in A1() has no side effect at all, it could optimize it — and then the whole loop — out. That is exactly what I do not want here: I am using different compilation units _only_ to prevent the whole benchmark from measuring nothing. Beyond that, I think what I wrote about the “the map is not the territory” problems with -flto still holds.

      As for handcrafted memory management: that is exactly what is to be avoided here. This is not about that “one hotspot function” that gets called a million times during your run. In those situations, yes, you might want to look into tricks with manual memory management of some kind (and often we do). However, in LibreOffice we have thousands of places that are not _the_ hotspot, but in sum still contribute to overall performance. Doing manual memory management in all of those would be wrong: it would be error-prone, a maintenance burden, and would increase the need for test coverage even more. Instead, using the standard library in the best-performing way is the route to take.
