Why roll your own RTTI?

I've received a few inquiries as to why C++ RTTI is not desirable in high performance applications. Better-than-anecdotal information can be found in an extremely interesting piece prepared for an ISO technical committee and edited by Lois Goldthwaite - the Technical Report on C++ Performance. Roughly half of this report is devoted to a proposal for hardware addressing methods for C++, but the other half is immediately practical. The answer to "why roll your own" is buried near the end of this article.

To quote the document, its purpose is

  • to give the reader a model of time and space overheads implied by use of
    various C++ language and library features,
  • to debunk widespread myths about performance problems,
  • to present techniques for use of C++ in applications where performance
    matters, and
  • to present techniques for implementing C++ Standard language and library
    facilities to yield efficient code.

Knowledge of all of the above should be an arrow in every C++ programmer's quiver!

The first rich vein in this document begins in Section 5.3, Classes and Inheritance. The overhead of various call methodologies is analyzed. My take on the severity of the differences between the approaches is greater than that of the report's author, but the raw data is very helpful to understanding the issues. Unfortunately the identity of Mystery Compilers one through five is masked.

Section 5.3.5 on multiple inheritance method invocation shows the overhead is much larger than I would have expected. I am amused by the author's note that 25% overhead is minor, but I take the point that one can code so as not to take that branch so often.
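To make the cost concrete, here is a minimal sketch (the class names are mine, not the report's) of the case the report measures: a call through a pointer to the second base class has to adjust the this pointer, typically through a thunk, before the usual virtual dispatch can take place.

    struct Renderable { virtual void draw() = 0; virtual ~Renderable() {} };
    struct Updatable  { virtual void update() = 0; virtual ~Updatable() {} };

    // Sprite's Updatable subobject does not sit at the start of the object,
    // so an Updatable* into a Sprite carries an offset.
    struct Sprite : public Renderable, public Updatable {
        virtual void draw() {}
        virtual void update() {}
    };

    void tick(Updatable* u) {
        u->update();   // dispatch typically goes through a thunk that fixes up 'this'
    }

    int main() {
        Sprite s;
        tick(&s);      // &s is adjusted to point at the Updatable subobject
        return 0;
    }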

Section 5.3.7 is the first shocker. The typeid operator is very slow. Comments earlier in the report that type_info is recorded as strings, together with my own profiling and disassembly showing that dynamic_cast and typeid are dominated by strcmp calls, jibe with these results. I suppose I shouldn't have been shocked, but I was still on the fence as to whether I was somehow misreading what I was seeing.

Next, in Section 5.3.8, we see that dynamic_cast performance is perfectly acceptable only in the case of up casting. Down casting is dreadful, and cross casting (casting between branches in a multiple inheritance hierarchy) is an outrage. The report's author suggests that compiler optimizer writers could perhaps pull up their socks and maybe do a bit better.
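To keep the three cases straight, here is a minimal sketch (the class names are mine): the up cast is a compile-time pointer adjustment, the down cast requires a run-time walk of the type information, and the cross cast requires that walk across branches of the hierarchy.

    struct Streamable   { virtual ~Streamable() {} };
    struct Serializable { virtual ~Serializable() {} };
    struct Asset : public Streamable, public Serializable {};

    void casts(Asset* asset, Streamable* streamable)
    {
        Streamable* up = asset;                             // up cast: free, resolved at compile time
        Asset* down = dynamic_cast<Asset*>(streamable);     // down cast: run-time check, dreadful per the report
        Serializable* cross =
            dynamic_cast<Serializable*>(streamable);        // cross cast: the worst case
        (void)up; (void)down; (void)cross;
    }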

Section 5.4.1 reveals exception handling to suffer from the same drawbacks as dynamic casting. The primary cost comes from dealing with exceptions thrown in constructors and destructors: complex type information and the current state of construction must be maintained so that, if an exception is thrown, all relevant destructors can be invoked.
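A minimal sketch of why that bookkeeping has to exist (the types are mine): if the second member's constructor throws, the runtime must know that the first member was already fully constructed so that its destructor runs during unwinding.

    #include <stdexcept>

    struct Buffer { Buffer() {} ~Buffer() {} };
    struct Device { Device() { throw std::runtime_error("no device"); } };

    struct Context {
        Buffer buffer;   // constructed first and must be destroyed during unwinding
        Device device;   // throws; ~Context() itself never runs
    };

    int main() {
        try {
            Context c;
        } catch (const std::runtime_error&) {
            // the runtime's construction-state tracking ensured ~Buffer() ran
        }
        return 0;
    }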

Section 5.4.2.2 discusses exception specifications. The basic conclusion is that exception specifications are heavyweight: redundant work is done for every exception thrown, because the exception must be rethrown after type checking. Only whole-program analysis could turn this into a compile-time check, so I conclude it is impractical for large projects. This section also shows that an empty throw specification should greatly speed execution by telling the compiler that no type information needs to be baked into the execution context. I haven't checked VS2008, but up to VS2005 the empty throw specification is ignored. gcc does not warn when an empty throw specification is present, but I don't know whether it implements the relevant optimizations.
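For reference, the two flavors of specification the report compares look like this; whether the empty form actually buys you anything depends on the compiler, as noted above.

    #include <stdexcept>

    // A non-empty specification forces a check on every exception that leaves
    // the function: the exception is caught, type-checked, and rethrown, with
    // std::unexpected() called on a mismatch.
    void parse(const char* text) throw(std::runtime_error);

    // An empty specification promises that nothing escapes, which in principle
    // lets the compiler drop unwind bookkeeping at the call sites. As noted
    // above, VS2005 treats this as documentation only.
    void swap_buffers() throw();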

Section 5.6 is a large section on practical remedies a programmer can undertake to make code performant. There is a lot of common sense in this section, and some less worthy homilies. I'll leave it to you to make your own judgements on the advice. None of it is outright wrong, and all of it is worth considering.

Section 5.6.7 is worth special note. I found three useful points here.

  1. If your code is built with exception handling, std::string will be slower than you would expect for the reasons covered in 5.4.1.
  2. The implementation of list::size() is often order n, resulting in quadratic complexity when it is called in a loop; use list.empty() instead of list.size() == 0.
  3. Iostreams are hog-tied by synchronization with C streams. This can be disabled by calling std::ios_base::sync_with_stdio(false) and std::cin.tie(0). Both of these points show up in the sketch after this list.
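Points 2 and 3 are trivial to apply; here is a minimal sketch of both (my own example, not code from the report):

    #include <iostream>
    #include <list>

    int main() {
        // Point 2: list::size() may walk the whole list in older library
        // implementations, so test emptiness with empty() rather than size() == 0.
        std::list<int> work;
        work.push_back(42);
        while (!work.empty()) {     // not: while (work.size() != 0)
            work.pop_front();
        }

        // Point 3: decouple iostreams from C stdio and untie cin from cout so
        // every read doesn't force a flush of cout first.
        std::ios_base::sync_with_stdio(false);
        std::cin.tie(0);
        return 0;
    }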

Section 6 reveals that the inefficiency of IOStreams is due to std::locale. A great many pages are devoted to how locales might be implemented such that they don't inherently suck. We can read this two ways: one, as a well-reasoned plea to standard library implementors to do something better, and two, as a note to ourselves to only use unadorned IOStreams for ASCII I/O. Otherwise, we can certainly use IOStreams, but we must take care to use them well.

On Windows, we must set the mode to binary by specifying the ios::binary open mode flag: ifstream stream(filename, ios::binary); Cross platform, we should bypass the locale layer entirely, since that is where the piles and piles of inefficiency come from (setting stream flags, checking options, locale, character set, and so on), and go straight for the data reads and writes through the stream buffer - ifs.rdbuf()->sgetn() and ofs.rdbuf()->sputn(). I refer you to this gamedev thread for more information. Another IOStreams inefficiency under MSVC is that IOStreams goes through FILE* operations instead of Windows native functionality. Alas! Nonetheless, bypassing the locale and character set handling gets us close to where we need to be.

I wrote a small test program to verify the advice of going to rdbuf. The fopen version of the program ran at roughly the same speed as the ifstream version, although for whatever reason the variability was much higher (generally running at almost the same speed, occasionally a bit faster, and occasionally at half the speed - I can only assume that the half speed sample was a hiccough due to my system doing something else during the test run). I did a similar sample for writing and found the performance to be comparable through rdbuf. I'm including the reading sample program here to save you hunting for cryptic syntax and keywords (the writing code is trivially easy). It might just be me, but compared to fopen/fread this is alphabet soup! In order to get with the program though, I'm willing to give it a shot. I had to define _SCL_SECURE_NO_WARNINGS and _CRT_SECURE_NO_WARNINGS in the preprocessor definitions to suppress the compiler telling me that it's not safe for me to read data because maybe data isn't large enough to hold size bytes. Sure, thanks, I'm fine juggling this particular chainsaw.

    #include <cstdlib>
    #include <fstream>

    #ifdef _MSC_VER
        #define IOS_BINARY std::ios::binary   // Windows needs binary mode to suppress CR/LF translation
    #else
        #define IOS_BINARY std::ios::in       // no translation to worry about elsewhere; plain input mode
    #endif

    int main() {
        std::ifstream i;
        i.open("d:\\foo.bin", IOS_BINARY);

        // Work directly on the stream buffer: seek to the end to find the size,
        // seek back to the beginning, then pull everything in with one sgetn call.
        std::filebuf* pbuf = i.rdbuf();
        std::streamsize size = pbuf->pubseekoff(0, std::ios::end, std::ios::in);
        pbuf->pubseekpos(0, std::ios::in);

        void* data = malloc(size);
        pbuf->sgetn((char*) data, size);
        i.close();

        free(data);
        return 0;
    }
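Since I left the writing code out, here is roughly what the sputn side looks like, under the same assumptions as the reader above (a sketch, not the exact code I ran; the filename and function name are made up):

    #include <fstream>

    // Sketch of the write path: push a block of bytes straight through the
    // stream buffer, bypassing the formatted output and locale machinery.
    void write_blob(const char* data, std::streamsize size)
    {
        std::ofstream o("d:\\out.bin", std::ios::binary);
        o.rdbuf()->sputn(data, size);
        o.close();
    }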

Section 7 is interesting. I do contest the idea in 7.2.2.1 that typeid is a useful alternative to dynamic_cast for determining type compatibility. My experiments so far with rolling my own RTTI show type_info's == operator to be excruciatingly slow. This could be an MSVC thing, but I doubt it: strcmp showing up in a profile (which is what this operator gives you) is never a sign of your program behaving nicely, and I'm sure it will show up in gcc-based builds as well. Of course, this is what led me to create the lightweight type system in this series of articles in the first place.
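For contrast, the kind of check the lightweight type system in this series is after reduces to a pointer comparison rather than a strcmp. Here is a minimal sketch of the idea (an illustration of the exact-type check only, not the actual implementation from the series):

    // Each class exposes a unique static tag; comparing types is comparing
    // addresses, one instruction instead of a string comparison.
    struct TypeTag { const char* name; };

    class Node {
    public:
        virtual ~Node() {}
        static const TypeTag* static_type() { static TypeTag t = { "Node" }; return &t; }
        virtual const TypeTag* type() const { return static_type(); }
    };

    class Light : public Node {
    public:
        static const TypeTag* static_type() { static TypeTag t = { "Light" }; return &t; }
        virtual const TypeTag* type() const { return static_type(); }
    };

    bool is_light(const Node& n) {
        return n.type() == Light::static_type();   // pointer compare, no strcmp
    }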

Section 7.2.2.3 on exceptions is worth reading and underlining; the logic here is better than the dogma you'll generally find in discussions of this topic.

Skipping a lot of pages brings us to Appendix D, where the timing code is available.

Appendix E has a large and worthwhile bibliography.


Content by Nick Porcino (c) 1990-2011