SwNodeIndex: Ludicrous Speed

No-no-no, light speed is too slow!
Yes, we’ll have to go right to… ludicrous speed!

— Dark Helmet, Spaceballs

So, I recently brought up the topic of Writer's nodes in the LibreOffice ESC call. More specifically: the SwNodeIndex class, which is, if one broadly simplifies, an iterator over the container holding all the paragraphs of a text document. Before my modifications, the SwNodes container class kept all these SwNodeIndex instances in a homegrown intrusive doubly linked list, to ensure that they stay valid, e.g. if a SwNode gets deleted or removed.

Still, as usual with performance topics, wild guesses aren't helpful, and measurements should trump intuition. I used valgrind for that and measured the number of instructions needed for loading the ODF spec. Since I did the same years and years ago on the old OpenOffice.org performance project, I first checked whether we had regressed against that. It's comforting that we did not at all: we are much faster. That comparison has to be taken with a few pounds of salt, though, as a lot of other things differ between the two measurements (e.g. we now have a completely new build system, different compiler versions etc.). But it's good that we are moving in the right direction.

implementation     | SwNodes    | SwNodeIndex   | total instructions | performance              | line delta
DEV300_m45         | 71,727,655 | 73,784,052    | 9,823,158,471      | ?                        | ?
master@fc93c17a    | 84,553,232 | 60,987,760    | 6,170,762,825      | 0%                       | 0
std::list          | 18,461,317 | 103,461,317   | 14,502,230,571     | -5,725% (-235% of total) | +12/-70
std::vector        | 18,986,848 | 3,707,286,032 | 9,811,541,380      | -2,502%                  | +22/-70
std::unordered_map | 18,984,984 | 82,843,000    | 7,083,620,244      | -627% (-15% of total)    | +16/-70
std::vector rbegin | 18,986,848 | 143,851,229   | 6,214,602,532      | -30% (-0.7% of total)    | +23/-70
sw::Ring<>         | 23,447,256 | inlined       | 6,154,660,709      | 11% (2.6% of total)      | +108/-229

With that comforting knowledge, I started to play around with the code. The first thing I did was to replace the handcrafted intrusive list with a std::list of pointers to the SwNodeIndex instances, held as a member in the SwNodes class. This is expected to slow things down, as now two allocations are needed: one for the SwNodeIndex itself and one for the node entry in the std::list. To be honest though, I didn't expect this to slow down the code handling the nodes by a factor of ~57 when loading the example document. The whole document loading time (not just the node handling) slows down by a factor of ~2.4. So OK, this establishes for certain that this part of the code is highly performance sensitive.
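To make the "two allocations" point concrete, here is a minimal sketch of a non-intrusive registry. The names `Registry` and `Index` are hypothetical stand-ins, not the real SwNodes and SwNodeIndex classes, and storing the list iterator inside each index is an assumption of this sketch; the experiment itself may well have searched the list on removal instead.

```cpp
#include <cstddef>
#include <list>

// Hypothetical stand-ins for illustration; not the real Writer classes.
class Index;

class Registry {
public:
    using Position = std::list<Index*>::iterator;

    Position add(Index* index) {
        // Extra allocation: std::list allocates one list node per entry,
        // on top of whatever storage the Index object itself uses.
        return m_indices.insert(m_indices.end(), index);
    }
    void remove(Position pos) { m_indices.erase(pos); }
    std::size_t size() const { return m_indices.size(); }

private:
    std::list<Index*> m_indices;
};

class Index {
public:
    // Registers itself on construction (assumption: position is remembered).
    explicit Index(Registry& registry)
        : m_registry(registry), m_pos(registry.add(this)) {}
    // Deregisters on destruction, which frees the list node again.
    ~Index() { m_registry.remove(m_pos); }

    Index(const Index&) = delete;
    Index& operator=(const Index&) = delete;

private:
    Registry& m_registry;
    Registry::Position m_pos;
};
```

Every short-lived `Index` created against a `Registry` costs one list-node allocation and one deallocation, which an intrusive list avoids because the links live inside the object itself.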

The next thing I tried, to get a feel for how the performance reacts, was using a std::vector in the SwNodes class. When reserving some memory early, this should severely reduce the number of allocations needed. And indeed this was quicker than the std::list, even with a naive approach just doing a push_back() for insertion and a std::find() plus std::vector::erase() for removal. However, the node indices are often created temporarily and quickly destroyed again, so adding new indices at the end and searching from the start is certainly not ideal. As a result, this is still slower than the intrusive list that was on master by a factor of ~25 for the code doing the node handling.

Searching for a SwNodeIndex from the end of the vector, where we likely just inserted it, and then swapping it with the last entry makes the std::vector almost competitive with the original implementation, but it is still 30% slower. (The total loading time would only have increased by 0.7% using the vector like this.)
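The two removal strategies for the vector can be sketched roughly as follows. `VectorRegistry` and its member names are made up for illustration, and the default reservation of 1024 entries is an arbitrary assumption; this is not the code from the experiment.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical registry of raw pointers; names are illustrative only.
template <typename T>
class VectorRegistry {
public:
    explicit VectorRegistry(std::size_t reserved = 1024) {
        m_items.reserve(reserved); // reserve early to avoid repeated reallocation
    }

    void add(T* item) { m_items.push_back(item); }

    // Naive removal: search from the front, then shift everything behind the hit.
    bool removeFromFront(T* item) {
        auto it = std::find(m_items.begin(), m_items.end(), item);
        if (it == m_items.end()) return false;
        m_items.erase(it);
        return true;
    }

    // Search from the back, where short-lived entries usually sit, then
    // overwrite the hit with the last entry and drop the tail (order is lost).
    bool removeFromBack(T* item) {
        auto rit = std::find(m_items.rbegin(), m_items.rend(), item);
        if (rit == m_items.rend()) return false;
        *rit = m_items.back();
        m_items.pop_back();
        return true;
    }

    std::size_t size() const { return m_items.size(); }

    bool contains(const T* item) const {
        return std::find(m_items.begin(), m_items.end(), item) != m_items.end();
    }

private:
    std::vector<T*> m_items;
};
```

The second variant gives up the insertion order of the entries, which is fine for a registry that only needs to know which indices exist.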

For completeness, I also had a look at a std::unordered_map. It did a bit better than I expected, but still would have slowed down loading by 15% for the example experiment.

Having ruled out that standard containers would do much good here without lots of tweaking, I tried the sw::Ring<> class that I recently rewrote based on Boost.Intrusive as an inline header class. This was 11% quicker than the old implementation, resulting in 2.6% quicker loading for the whole document. Not exactly a heroic achievement, but also not too bad for just some 200 lines touched. So this is now on master.
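For readers who have not met intrusive rings before, here is a hand-rolled sketch of the idea. It is deliberately not sw::Ring<> and does not use Boost.Intrusive; the class `RingNode` and its members are hypothetical and only show why insertion and removal need neither an allocation nor a search.

```cpp
#include <cstddef>

// Hand-rolled intrusive ring for illustration only;
// this is neither sw::Ring<> nor Boost.Intrusive.
class RingNode {
public:
    // A fresh node is a ring of one: it links to itself, no allocation needed.
    RingNode() : m_prev(this), m_next(this) {}

    // Join the ring that `other` belongs to, directly before `other`.
    explicit RingNode(RingNode* other)
        : m_prev(other->m_prev), m_next(other) {
        m_prev->m_next = this;
        m_next->m_prev = this;
    }

    // Unlinking is two pointer writes, with no search and no deallocation.
    ~RingNode() {
        m_prev->m_next = m_next;
        m_next->m_prev = m_prev;
    }

    RingNode(const RingNode&) = delete;
    RingNode& operator=(const RingNode&) = delete;

    // Count the members of the ring by walking it once.
    std::size_t ringSize() const {
        std::size_t n = 1;
        for (const RingNode* p = m_next; p != this; p = p->m_next) ++n;
        return n;
    }

private:
    RingNode* m_prev;
    RingNode* m_next;
};
```

Since all members are defined inside the class, they are implicitly inline, so a compiler can see the whole constructor and destructor at every call site.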

Why does this linked list outperform the old linked list? Inlining. In particular, the old implementation had non-inlined constructors and a destructor that called a trivial non-inlined member function. On top of that, the constructors and the function called by the destructor called two non-inlined friend functions from a different compilation unit, making it extra hard for a compiler to optimize. Now, link-time optimization (LTO) could maybe do something about that someday. However, with LTO being in different states on different platforms, and with developers possibly building without LTO for build-time performance for some time, requiring the compiler/linker to be extra clever might be a mixed blessing: the developers might run into “the map is not the territory” problems.

My personal take-aways:

  • The SwNodeIndex has quite a relevant impact on performance. If you touch it, handle with care (and with valgrind).
  • The current code has decent performance; further improvements likely need deeper structural work (see e.g. Kendy's bplustree stuff).
  • Intrusive linked lists might be cumbersome, but for some scenarios, they are really fast.
  • Inlining can really help (doh).
  • LTO might help someday (or not).
  • Friend declarations for non-inline functions across compilation units can be a code smell hinting at a possible performance optimization.

Please excuse the extensive writing for a meager 2.6% performance improvement. The intention is to spare somebody (including me) from redoing some or all of the work above just to come to the same conclusion.


Note: Here is how this was measured:

  • gcc 4.8.3
  • boost 1.55.0
  • test document: ODF spec
  • valgrind --tool=callgrind "--toggle-collect=*LoadOwnFormat*" --callgrind-out-file=somefilename.cg ./instdir/program/soffice.bin
  • ./autogen.sh --disable-gnome-vfs --disable-odk --disable-postgresql-sdbc --disable-report-builder --disable-scripting-beanshell --enable-gio --enable-symbols --with-external-tar=... --with-junit=... --with-hamcrest=... --with-system-libs --without-doxygen --without-help --without-myspell-dicts --without-system-libmwaw --without-system-mdds --without-system-orcus --without-system-sane --without-system-vigra --without-system-libodfgen --without-system-libcmis --disable-firebird-sdbc --without-system-libebook --without-system-libetonyek --without-system-libfreehand --without-system-libabw --disable-gnome-vfs --without-system-glm --without-system-glew --without-system-librevenge --without-system-libcdr --without-system-libmspub --without-system-libvisio --without-system-libwpd --without-system-libwps --without-system-libwpg --without-system-libgltf --without-system-libpagemaker --without-system-coinmp --with-jdk-home=...
