I would kill 500 lines and I would kill 500 more …

I would walk 500 miles and I would walk 500 more
The proclaimers, 500 miles

So I recently noted that github reported I have 1337 commits on LibreOffice since I joined Canonical in February 2011. Looking at those stats, it seems I also deleted some net 155,634 lines over that time in the codebase.

LibreOffice commits

Even though I cant find that mail, I seem to remember that Michael Stahl, when joining the LibreOffice project proclaimed his goal to be to contribute ‘a net negative lines of code.’1) Now I have not looked into the details of the above stats — they might very likely reveal to be caused by some bulk change. Which would be lame, unless its the killing of the old build system, for which I think I can claim some credit. But in general I really love the idea of ‘contributing a net negative number of lines of code’.

So, at the last LibreOffice Hackfest in Cambridge 2), I pushed a set of commits refactoring the UNO bindings of writer tables. It all started so innocent. I was actually aiming to do something completely different: namely give the UNO cursors in Writer (SwUnoCrsr) somewhat saner resource management and drag them screaming and kicking out of the 1980ies. However, once in unotbl.cxx, I found more of “determined Real Programmer can write FORTRAN programs in any language” and copypasta there than I could bear. I thought: “This UNO stuff has decent test coverage, you could refactor it a bit quickly.”.

Of course I was wrong with both sides of that statement: On the one hand, when I started the coverage was 70.1% LOC on that file which is not really as high as I expected. On the other hand, I did not end with “a bit quickly”, rather I went on to refactor away:
dc -e "`git log --author Michaelsen -p dc8697e554417d31501a0d90d731403ede223370^..HEAD sw/source/core/unocore/unotbl.cxx|grep ^+|wc -l` `git log --author Michaelsen -p dc8697e554417d31501a0d90d731403ede223370^..HEAD sw/source/core/unocore/unotbl.cxx|grep ^-|wc -l` - p"
-1015

… a thousand lines. On discovering the lacking test-coverage, I quickly added some more tests — bringing coverage to 77.52% LOC at least now.3) And yes, I also silently fixed the one regression I thereby discovered I had introduced, which nobody seemed to have noticed so far. One thing I noticed in this little refactoring spree is that while C++11s features might look tame compared to more modern programming languages in metrics like avoiding boilerplate, it still outclasses what we had before. Beyond the simplifying refactoring, features like lambdas are really nice for non-interactive (test-driven) debugging, including quickly asserting on the state of variables some over some 10 stackframes up or down without going into major contortions in testcode.

1) By the way, a quick:
dc -e "`git log --author Stahl -p |grep ^+|wc -l` `git log --author Stahl -p |grep ^-|wc -l` - p"
-108686

confirms Michael is more than living up to his personal goals.

2) Speaking of the Hackfest: The other thing I did there was helping/observing Sam Tuke getting setup for his first code contribution. While we made great progress in making this easier than it used to be, we could be a lot better there still. Sadly though, I didnt see a shortcut or simplification we could implement right away.

3) And along with that did bring coverage of unochart.cxx from abismal 4.4% LOC to at least 35.31% LOC  as a collateral damage.

addendum: Note that the writer tables core also increased coverage quite a bit from 54.6% LOC to 65% LOC.

Death or Glory vs. Continuous Integration

But I believe in this and it’s been tested by research
— The Clash, Death and Glory

Thanks to Norbert’s efforts, the LibreOffice project now has a Jenkins setup that not only gives us visibility on how healthy our master branch is, with the results being reported to the ESC regularly: In addition it allows everyone easily testing commits and branches on all major LibreOffice platforms (Linux, OS X, Windows) just by uploading a change to gerrit. Doing so is really easy once you are set up:

./logerrit submit                      # a little helper script in our repo
git push logerrit HEAD:refs/for/master # alternative: plain old git
git review                             # alternative: needs to install the git-review addon

Each of the above commands alone send your work for review and testbuilding to gerrit. The last one needs an additional setup, that is however really helpful and worth it for people working with gerrit from the command-line regulary.

So, what if you have a branch that you want to testbuild? Well, just pushing the branch to gerrit as suggested above still works: gerrit then will create a change for every commit, mark them as depending on each other and testbuild every commit. This is great for a small branch of a handful of commits, but will be annoying and somewhat wasteful for a branch with more than 10-15 commits. In the latter case you might not want a manual review for each commit and also not occupy our builders for each of them. So what’s the alternative, if you have a branch ${mybranch} and want to at least test the final commit to build fine everywhere?

git checkout -b ${mybranch}-ci ${mybranch} # switch to branch ${mybranch}-ci
git rebase -i remotes/logerrit/master      # rebase the branch on master interactively

Now your favourite editor comes up showing the commits of the branch. As your favourite editor will be vim, you can then type:

:2,$s/^pick/s/ | x

To squash all the commits of the branch into one commit. Then do:

git checkout -                                   # go back to whatever branch we where on before
git push logerrit ${mybranch}-ci:refs/for/master # push squashed branch to gerrit for testbuilding
git branch -D ${mybranch}-ci                     # optional: delete squashed branch locally

Now only wait for the builder on Jenkins to report back. This allowed me to find out that our compiler on OS X didnt think of this new struct as a POD-type, while our compilers on Linux and Windows where fine with it (see: “Why does C++ require a user-provided default constructor to default-construct a const object?” for the gory details). Testbuilding on gerrit allowed me to fix this before pushing something broken on a platform to master, which would have spoiled the nifty ability to test your commit before pushing for everyone else: Duly testing your commit on gerrit only to find that the master you build upon was broken by someone else on some platform is not fun.

The above allows you to ensure the end of your branch builds fine on all platforms. But what about the intermediate commits and our test-suites? Well, you can test that each and every commit passes tests quite easily locally:

git rebase -i remotes/logerrit/master --exec 'make check'

This rebases your branch on master (even if its already up to date) and builds and runs all the tests on each commit along the way. In case there is a test breakage, git stops and lets you fix things (just like with traditional troubles on rebases like changes not applying cleanly).

Note: gerrit will close the squashed branch change if you push the branch to master: The squashed commit message ends with the Change-Id of the final commit of the branch. So once that commit is pushed, the gerrit closes the review for the squashed change.

Another note: If the above git commands are too verbose for you (they are for me), consider using gitsh and aliases. Combined they help quite a lot in reducing redundant typing when working with git.

Easterhegg: Vimpressing LibreOffice!

Das ist alles nur geklaut und gestohlen,
nur gezogen und geraubt.
Entschuldigung, das hab ich mir erlaubt.
— die Prinzen, Alles nur geklaut

So, you might have noticed that there was no April Fools post from me this, year unlike previous years. One idea, I had was giving LibreOffice vi-key bindings — except that apparently already exists: vibreoffice. So I went looking for something else and found odpdown by Thorsten, who just started to work on LibreOffice fulltime again, and reading about it I always had the thought that it would be great to be able to run this right from your favourite editor: Vim.

And indeed: That is not hard to do. Here is a very raw video showing how to run presentations right out of vim:

Now, this is a quick hack, Linux only, requires you to have Python3 UNO-bindings installed etc. If you want to play with it: Clone the repo from github and get started. Kudos go out to Thorsten for the original odpdown on which this is piggybagging (“das ist alles nur geklaut”). So: Have fun with this — I will have to install vibreoffice now.

Following the White Rabbit

When logic and proportion have fallen sloppy dead
And the white knight is talking backwards
And the red queen’s off with her head
Remember what the dormouse said
Feed your head, feed your head

— Jefferson Airplane, White Rabbit

So, this was intended as a quick and smooth addendum to the “50 ways to fill your vector” post, bringing callgrind into the game and ensuring everyone that its instructions counts are a good proxy for walltime performance of your code. This started out as mostly as expected, when measuring the instructions counts in two scenarios:

implementation/cflags -O2 not inlined -O3 inlined
A1 2610061438 2510061428
A2 2610000025 2510000015
A3 2610000025 2510000015
B1 3150000009 2440000009
B2 3150000009 2440000009
B3 3150000009 2440000009
C1 3150000009 2440000009
C3 3300000009 2440000009

The good news here is, that this mostly faithfully reproduces some general observations on the timings from the last post on this topic, although the differences in callgrind are more pronounced in callgrind than in reality:

  • The A implementations are faster than the B and C implementations on -O2 without inlining
  • The A implementations are slower (by a smaller amount) than the B and C implementations on -O3 with inlining

The last post also suggested the expectation that all implementations could — and with a good compiler: should — have the same code and same speed when everything is inline. Apart from the A implementations still differing from the B and C ones, callgrinds instruction count suggest to actually be the case. Letting gcc compile to assembler and comparing the output, one finds:

  • Inline A1-3 compile to the same output on -Os, -O2, -O3 each. There is no difference between -O2 and -O3 for these.
  • Inline B1-3 compile to the same output on -Os, -O2, -O3 each, but they differ between optimization levels.
  • Inline C3 output differs from the others and between optimization levels.
  • Without inlinable constructors, the picture is the same, except that A3 and B3 now differ slightly from their kin as expected.

So indeed most of the implementations generate the same assembler code. However, this is quite a bit at odd with the significant differences in performance measured in the last post, e.g. B1/B2/B3 on -O2 created widely different walltimes. So time to test the assumption that running one implementation for a minute is producing reasonable statistically stable result, by doing 10 1-minute runs for each implementation and see what the standard deviation is. The following is found for walltimes (no inline constructors):

implementation/cflags -Os -O2 -O3 -O3 -march=
A1 80.6 s 78.9 s 78.9 s 79.0 s
A2 78.7 s 78.1 s 78.0 s 79.2 s
A3 80.7 s 78.9 s 78.9 s 78.9 s
B1 84.8 s 80.8 s 78.0 s 78.0 s
B2 84.8 s 86.0 s 78.0 s 78.1 s
B3 84.8 s 82.3 s 79.7 s 79.7 s
C1 84.4 s 85.4 s 78.0 s 78.0 s
C3 86.6 s 85.7 s 78.0 s 78.9 s
no inline measurements
no inline measurements

And with inlining:

implementation/cflags -Os -O2 -O3 -O3 -march=
A1 76.4 s 74.5 s 74.7 s 73.8 s
A2 75.4 s 73.7 s 73.8 s 74.5 s
A3 76.3 s 74.6 s 75.5 s 73.7 s
B1 80.6 s 77.1 s 72.7 s 73.7 s
B2 81.4 s 78.9 s 72.0 s 72.0 s
B3 80.6 s 78.9 s 72.8 s 73.7 s
C1 81.4 s 78.9 s 72.0 s 72.0 s
C3 79.7 s 80.5 s 72.9 s 77.8 s
inline measurements
inline measurements

The standard deviation for all the above values is less than 0.2 seconds. That is … interesting: For example, on -O2 without inlining, B1 and B2 generate the same assembler output, but execute with a very significant difference in hardware (5.2 s difference, or more than 25 standard deviations). So how have logic and proportion fallen sloppy dead here? If the same code is executed — admittedly from two different locations in the binary — how can that create such a significant difference in walltime performance, while not being visible at all on callgrind? A wild guess, which I have not confirmed yet, is cache locality: When not inlining constructors, those might be in CPU cache from one copy of the code in the binary, but not for the other. And by the way, it might also hint at the reasons for the -march= flag (which creates bigger code) seeming so uneffective. And it might explain, why performance is rather consistent when using inline constructors. If so, the impact of this is certainly interesting. It also suggest that allowing inlining of hotspots, like recently done with the low-level sw::Ring class, produces much more performance improvement on real hardware than the meager results measured with callgrind. And it reinforces the warning made in that post about not falling in the trap of mistaking the map for the territory: callgrind is not a “map in the scale of a mile to the mile”.

Addendum: As said in the previous post, I am still interested in such measurements on other hardware or compilers. All measurements above done with gcc 4.8.3 on Intel i5-4200U@1.6GHz.

LibreOffice around the world

Around the world, Around the world
— Daft Punk, Around the world

So, you still heard that unfounded myth that it is hard to get involved with and to start contributing to LibreOffice? Still? Even though that there are our Easy Hacks and the LibreOffice developer are a friendly bunch that will help you get started on mailing lists and on IRC? If those alone do not convince you, it might be because it is admittedly much easier to get started if you meet people face to face — like on one of our upcoming Events! Especially our Hackfests are a good way to get started. The next one will be at the University de Las Palmas de Gran Canaria were we had been guests last year already. We presented some introduction talks to the students of the university and then went on to hack on LibreOffice from fixing bugs to implementing new stuff. Here is how that looked like last year:

LibreOffice Hackfest Gran Canaria 2014
LibreOffice Hackfest Gran Canaria 2014

One thing we learned from previous Hackfests was that it is great if newcomers have a way to start working on code right away. While it is rather easy to do that as the 5 minute video on our wiki shows, it might still take some time on some notebooks. So what if you spontaneously show up at the event without a pre-build LibreOffice? Well for that, we now have — thanks to Christian Lohmaier of the Document Foundation staffremote virtual machines prepared for Hackfests, that allow you to get started right away with everything prepared — on rather beefy hardware even, that is.

If you are a student at ULPGC or live in Las Palmas or on the Canary Islands, we invite you to join us to learn how to get started. For students, this is also a very good opportunity get involved and prepare for a Google Summer of Code on LibreOffice. Furthermore, if you are a even casual contributor to LibreOffice code already and want to help out sharing and deepen knowledge on how to work on LibreOffice code, you should get in contact with the Document Foundation — while the event is already very soon now, there still might be travel reimbursal available. You will find all the details on the wiki page for the Hackfest in Las Palmas de Gran Canaria 2015.

LibreOffice Evening Hacking
LibreOffice Evening Hacking in Las Palmas 2014

On the other hand, if two weeks is too short a notice for you, but the rest of this sounds really tempting, there is already the next Hackfest planned, which will take place in Cambridge in the United Kingdom in May. We will be there with a Hackfest for the first time and invite you to join us from anywhere in Europe if you either are a LibreOffice code contributor or if you are interested in learning more on how to become one. Again, there is a wiki page with the details on the LibreOffice Hackfest in Cambridge 2015, and travel reimbursals are available. Contact us!

LibreOffice Evening Hacking
How I imagine Cambridge in May — Photo by Andrew Dunn CC-BY-SA 2.0 via Wikimedia

50 ways to fill your vector …

“The problem is all inside your head” she said to me
“The answer is easy if you take it logically”
— Paul Simon, 50 ways to leave your lover

So recently I tweaked around with these newfangled C++11 initializer lists and created an EasyHack to use them to initialize property sequences in a readable way. This caused a short exchange on the LibreOffice mailing list, which I assumed had its part in motivating Stephans interesting post “On filling a vector”. For all the points being made (also in the quick follow up on IRC), I wondered how much the theoretical “can use a move constructor” discussed etc. really meant when the C++ is translated to e.g. GENERIC, then GIMPLE, then amd64 assembler, then to the internal RISC instructions of the CPU — with multiple levels of caching in addition.

So I quickly wrote the following (thanks so much for C++11 having the nice std::chrono now).

data.hxx:

#include <vector>
struct Data {
    Data();
    Data(int a);
    int m_a;
};
void DoSomething(std::vector<Data>&);

data.cxx:

#include "data.hxx"
// noop in different compilation unit to prevent optimizing out what we want to measure
void DoSomething(std::vector<Data>&) {};
Data::Data() : m_a(4711) {};
Data::Data(int a) : m_a(a+4711) {};

main.cxx:

#include "data.hxx"
#include <iostream>
#include <vector>
#include <chrono>
#include <functional>

void A1(long count) {
    while(--count) {
        std::vector<Data> vec { Data(), Data(), Data() };
        DoSomething(vec);
    }
}

void A2(long count) {
    while(--count) {
        std::vector<Data> vec { {}, {}, {} };
        DoSomething(vec);
    }
}

void A3(long count) {
    while(--count) {
        std::vector<Data> vec { 0, 0, 0 };
        DoSomething(vec);
    }
}

void B1(long count) {
    while(--count) {
        std::vector<Data> vec;
        vec.reserve(3);
        vec.push_back(Data());
        vec.push_back(Data());
        vec.push_back(Data());
        DoSomething(vec);
    }
}

void B2(long count) {
    while(--count) {
        std::vector<Data> vec;
        vec.reserve(3);
        vec.push_back({});
        vec.push_back({});
        vec.push_back({});
        DoSomething(vec);
    }
}

void B3(long count) {
    while(--count) {
        std::vector<Data> vec;
        vec.reserve(3);
        vec.push_back(0);
        vec.push_back(0);
        vec.push_back(0);
        DoSomething(vec);
    }
}

void C1(long count) {
    while(--count) {
        std::vector<Data> vec;
        vec.reserve(3);
        vec.emplace_back(Data());
        vec.emplace_back(Data());
        vec.emplace_back(Data());
        DoSomething(vec);
    }
}

void C3(long count) {
    while(--count) {
        std::vector<Data> vec;
        vec.reserve(3);
        vec.emplace_back(0);
        vec.emplace_back(0);
        vec.emplace_back(0);
        DoSomething(vec);
    }
}

double benchmark(const char* name, std::function<void (long)> testfunc, const long count) {
    const auto start = std::chrono::system_clock::now();
    testfunc(count);
    const auto end = std::chrono::system_clock::now();
    const std::chrono::duration<double> delta = end-start;
    std::cout << count << " " << name << " iterations took " << delta.count() << " seconds." << std::endl;
    return delta.count();
}

int main(int, char**) {
    long count = 10000000;
    while(benchmark("A1", &A1, count) < 60l)
        count <<= 1;
    std::cout << "Going with " << count << " iterations." << std::endl;
    benchmark("A1", &A1, count);
    benchmark("A2", &A2, count);
    benchmark("A3", &A3, count);
    benchmark("B1", &B1, count);
    benchmark("B2", &B2, count);
    benchmark("B3", &B3, count);
    benchmark("C1", &C1, count);
    benchmark("C3", &C3, count);
    return 0;
}

Makefile:

CFLAGS?=-O2
main: main.o data.o
    g++ -o $@ $^

%.o: %.cxx data.hxx
    g++ $(CFLAGS) -std=c++11 -o $@ -c $<

Note the object here is small and trivial to copy as one would expect from objects passed around as values (as expensive to copy objects mostly can be passed around with a std::shared_ptr). So what did this measure? Here are the results:

Time for 1280000000 iterations on a Intel i5-4200U@1.6GHz (-march=core-avx2) compiled with gcc 4.8.3 without inline constructors:

implementation / CFLAGS -Os -O2 -O3 -O3 -march=…
A1 89.1 s 79.0 s 78.9 s 78.9 s
A2 89.1 s 78.1 s 78.0 s 80.5 s
A3 90.0 s 78.9 s 78.8 s 79.3 s
B1 103.6 s 97.8 s 79.0 s 78.0 s
B2 99.4 s 95.6 s 78.5 s 78.0 s
B3 107.4 s 90.9 s 79.7 s 79.9 s
C1 99.4 s 94.4 s 78.0 s 77.9 s
C3 98.9 s 100.7 s 78.1 s 81.7 s

creating a three element vector without inlined constructors
And, for comparison, following are the results, if one allows the constructors to be inlined.
Time for 1280000000 iterations on a Intel i5-4200U@1.6GHz (-march=core-avx2) compiled with gcc 4.8.3 with inline constructors:

implementation / CFLAGS -Os -O2 -O3 -O3 -march=…
A1 85.6 s 74.7 s 74.6 s 74.6 s
A2 85.3 s 74.6 s 73.7 s 74.5 s
A3 91.6 s 73.8 s 74.4 s 74.5 s
B1 93.4 s 90.2 s 72.8 s 72.0 s
B2 93.7 s 88.3 s 72.0 s 73.7 s
B3 97.6 s 88.3 s 72.8 s 72.0 s
C1 93.4 s 88.3 s 72.0 s 73.7 s
C3 96.2 s 88.3 s 71.9 s 73.7 s

creating a three element vector without inlined constructors
Some observations on these measurements:

  • -march=... is at best neutral: The measured times do not change much in general, they only even slightly improve performance in five out of 16 cases, and the two cases with the most significant change in performance (over 3%) are actually hurting the performance. So for the rest of this post, -march=... will be ignored. Sorry gentooers. ;)
  • There is no silver bullet with regard to the different implementations: A1, A2 and A3 are the faster implementations when not inlining constructors and using -Os or -O2 (the quickest A* is ~10% faster than the quickest B*/C*). However when inlining constructors and using -O3, the same implementations are the slowest (by 2.4%).
  • Most common release builds are still done with -O2 these days. For those, using initializer lists (A1/A2/A3) seem too have a significant edge over the alternatives, whether constructors are inlined or not. This is in contrast to the conclusions made from “constructor counting”, which assumed these to be slow because of additional calls needed.
  • The numbers printed in bold are either the quickest implementation in a build scenario or one that is within 1.5% of the quickest implementation. A1 and A2 are sharing the title here by being in that group five times each.
  • With constructors inlined, everything in the loop except DoSomething() could be inline. It seems to me that the compiler could — at least in theory — figure out that it is asked the same thing in all cases. Namely, reserve space for three ints on the heap, fill them each with 4711 and make the ::std::vector<int> data structure on the stack reflect that, then hand that to the DoSomething() function that you know nothing about. If the compiler would figure that out, it would take the same time for all implementations. This doesnt happen either on -O2 (differ by ~18% from quickest to slowest) nor on -O3 (differ by ~3.6%).

One common mantra in applications development is “trust the compiler to optimize”. The above observations show a few cracks in the foundations of that, esp. if you take into account that this is all on the same version of the same compiler running on the same platform and hardware with the same STL implementation. For huge objects with expensive constructors, the constructor counting approach might still be valid. Then again, those are rarely statically initialized as a bigger bunch into a vector. For the more common scenario of smaller objects with cheap constructors, my tentative conclusion so far would be to go with A1/A2/A3 — not so much because they are quickest in the most common build scenarios on my platform, but rather because the readability of them is a value on its own while the performance picture is muddy at best.

And hey, if you want to run the tests above on other platforms or compilers, I would be interested in results!

Note: I did these runs for each scenario only once, thus no standard deviation is given. In general, they seemed to be rather stable, but this being wallclock measurements, one or the other might be outliers. caveat emptor.

SwNodeIndex: Ludicious Speed

No-no-no, light speed is too slow!
Yes, we’ll have to go right to… ludicrous speed!

— Dark Helmet, Spaceballs

So, I recently brought up the topic of writers notes in the LibreOffice ESC call. More specifically: the SwNodeIndex class, which is, if one broadly simplifies an iterator over the container holding all the paragraphs of a text document. Before my modifications, the SwNodes container class had all these SwNodeIndices in a homegrown intrustive double linked list, to be able to ensure these stay valid e.g. if a SwNode gets deleted/removed. Still — as usual with performance topics — wild guesses arent helpful, and measurements should trump over intuition. I used valgrind for that, and measured the number of instructions needed for loading the ODF spec. Since I did the same years and years ago on the old OpenOffice.org performance project, I just checked if we regressed against that. Its comforting that we did not at all — we were much faster, but that measurement has to be taken with a few pounds of salt, as a lot of other things differ between these two measurements (e.g. we now have a completely new build system, compiler versions etc.). But its good we are moving in the right direction.

implementation SwNodes SwNodeIndex total instructions performance linedelta
DEV300_m45 71,727,655 73,784,052 9,823,158,471 ? ?
master@fc93c17a 84,553,232 60,987,760 6,170,762,825 0% 0
std::list 18,461,317 103,461,317 14,502,230,571 -5,725%
(-235% of total)
+12/-70
std::vector 18,986,848 3,707,286,032 9,811,541,380 -2,502% +22/-70
std::unordered_map 18,984,984 82,843,000 7,083,620,244 -627%
(-15% of total)
+16/-70
std::vector rbegin 18,986,848 143,851,229 6,214,602,532 -30%
(-7% of total)
+23/-70
sw::Ring<> 23,447,256 inlined 6,154,660,709 11%
(2.6% of total)
+108/-229

With that comforting knowledge, I started to play around with the code. The first thing I did was to replace the handcrafted intrusive list with a std::list pointing to the SwNodeIndex instances as a member in the SwNodes class. This is expected to slow down things, as now two allocs are needed: one for the SwNodeIndex class and one for the node entry in the std::list. To be honest though, I didnt expect this to slow down the code handling the nodes by a factor of ~57 for the loading of the example document. This whole document loading time (not just the node handling) slows by a factor of ~2.4. So ok, this establishes for certain that this part of the code is highly performance sensitive.

The next thing I tried to get a feel for how the performance reacts was using a std::vector in the SwNodes class. When reserving some memory early, this should severely reduce the amount of allocs needed. And indeed this was quicker than the std::list even with a naive approach just doing a push_back() for insertion and a std::find()/std::erase() for removal. However, the node indices are often temporarily created and quickly destroyed again. Thus adding new indices at the end and searching from the start certainly is not ideal: Thus this is also slower than the intrusive list that was on master by a factor of ~25 for the code doing the node handling.

Searching for a SwNodeIndex from the end of the vector, where we likely just inserted it and then swapping it with the last entry makes the std::vector almost compatitive with the original implementation: but still 30% slower than the original implementation. (The total loading time would only have increased by 0.7% using the vector like this.)

For completeness, I also had a look at a std::unordered_map. It did a bit better than I expected, but still would have slowed down loading by 15% for the example experiment.

Having ruled out that standard containers would do much good here without lots of tweaking, I tried the sw::Ring<> class that I recently rewrote based on Boost.Intrusive as a inline header class. This was 11% quicker than the old implementation, resulting in 2.6% quicker loading for the whole document. Not exactly a heroic archivement, but also not too bad for just some 200 lines touched. So this is now on master.

Why do this linked list outperform the old linked list? Inlining. Especially, the non-inlined constructors and the destructor calling a trivial non-inlined member function. And on top of that, the contructors and the function called by the destructor called two non-inlined friend functions from a different compilation unit, making it extra hard for a compiler to optimize that. Now, link time optimization (LTO) could maybe do something about that someday. However, with LTO being in different states on different platforms and with developers possibly building without LTO for build time performance for some time, requiring the compiler/linker to be extra clever might be a mixed blessing: The developers might run into “the map is not the territory” problems.

my personal take-aways:

  • The SwNodeIndex has quite a relevant impact on performance. If you touch it, handle with care (and with valgrind).
  • The current code has decent performance, further improvement likely need deeper structual work (see e.g. Kendys bplustree stuff).
  • Intrusive linked lists might be cumbersome, but for some scenarios, they are really fast.
  • Inlining can really help (doh).
  • LTO might help someday (or not).
  • friend declarations for non-inline functions across compilation units can be a code smell for possible performance optimization.

Please excuse the extensive writing for a meager 2.6% performance improvement — the intention is to avoid somebody (including me) to redo some or all of the work above just to come to the same conclusion.


Note: Here is how this was measured:

  • gcc 4.8.3
  • boost 1.55.0
  • test document: ODF spec
  • valgrind --tool=callgrind "--toggle-collect=*LoadOwnFormat*" --callgrind-out-file=somefilename.cg ./instdir/program/soffice.bin
  • ./autogen.sh --disable-gnome-vfs --disable-odk --disable-postgresql-sdbc --disable-report-builder --disable-scripting-beanshell --enable-gio --enable-symbols --with-external-tar=... --with-junit=... --with-hamcrest=... --with-system-libs --without-doxygen --without-help --without-myspell-dicts --without-system-libmwaw --without-system-mdds --without-system-orcus --without-system-sane --without-system-vigra --without-system-libodfgen --without-system-libcmis --disable-firebird-sdbc --without-system-libebook --without-system-libetonyek --without-system-libfreehand --without-system-libabw --disable-gnome-vfs --without-system-glm --without-system-glew --without-system-librevenge --without-system-libcdr --without-system-libmspub --without-system-libvisio --without-system-libwpd --without-system-libwps --without-system-libwpg --without-system-libgltf --without-system-libpagemaker --without-system-coinmp --with-jdk-home=...

Follow

Get every new post delivered to your Inbox.