StringMap

The smallest possible attribute system needed a fastest possible lookup. Functionally, a std::map<std::string, int> or std::map<const char*, int, Pred> might have done the job. The performance wasn't there however. I did a bit of research and discovered ternary search trees by Bentley and Sedgewick; wrapped them up with a template to turn what was effectively a set into a map, timed it, and provide the result here. Note that real production code would test for error conditions, and deal with them in a fashion appropriate to your application.

This code is intended to get you thinking about what kind of container might be appropriate to the job you have at hand, and also to encourage you to not limit yourself to what the library vendor provided you or what dogma might dictate.

Timing

For a timing test, I read in a dictionary, added it to a std::map and a StringMap in alphabetical order, shuffled the dictionary and added all the words out of order, then shuffled the dictionary again, and searched for all the words in the map and the StringMap. In general, the results below can be multiplied by two to know how fast StringMap runs in debug, and multiplied by six to ten to know how fast std::map runs in debug. Timings are reported in CPU cycles, and measured under MSVC 2008 with the out of the box STL, optimized build, multi-threaded release libraries, RTTI and exceptions turned on.

Update. I have averaged many more samples. I have retested with all combinations of RTTI and exceptions off and on. Contrary to conventional wisdom, I was not able to measure a statistically significant difference between the performance of all four variations.

Test std::map StringMap
insert, in order 540M 55M
insert, shuffled 540M 85M
find all 260M 85M

Observations

A further benefit of StringMap is that very few heap allocations are involved. The insert time of std::map could be improved with a custom predicator, and a custom allocator, however the find time can't be improved greatly. One of the big differences between a map of strings and a StringMap is that during find, the std::map needs to do a strcmp for every node of the tree it visits, whereas the StringMap advances a character at a time through the search string as it advances through the nodes. This alone makes a huge difference.

The last merit I want to point out is the sheer tiny elegance of Bentley and Sedgewick's code - insert is a mind boggling 50 lines, and find is a mere 22 lines long. (Ignore my clumsy InOrder function, it's not the elegant part!)

If you want to use this code as a set, remove the Data from the Tnode, and all references to it - this will bring the code back in line with the original.

For more information and formal performance analysis on ternary search trees, I refer you to Bentley & Sedgewick's informative article at Dr. Dobb's, and their scholarly research.

Exception Safety

The code as presented here is exception safe. I have replaced the malloc/free in the original posting with new[]/delete[], and added a throw of bad::alloc in the case of filling up all the pools. It is permissible for assignment of data to throw an exception, it is up to the user of the StringMap to catch and respond accordingly - if data cannot be assigned, it is not up to StringMap to do something about it.

 

#ifndef STRINGMAP_H #define STRINGMAP_H #include <stdlib.h> // for malloc/free #include <vector> // for the InOrder function. /* This code is described in "Ternary Search Trees" by Jon Bentley and Robert Sedgewick in the April, 1998, Dr. Dobb's Journal. http://www.ddj.com/windows/184410528 Other algorithms and analysis are available here: http://www.cs.princeton.edu/~rs/strings/ Ternary search trees offer substantial advantages over both binary search trees and digital search tries. We feel that they are superior to hashing in many applications for the following reasons: Efficient and easy to implement No extra overhead for insertion or successful searches Usually substantially faster than hashing for unsuccessful searches Gracefully grow and shrink; hash tables need to be rebuilt after large size changes Support advanced searches, such as partial-match and near-neighbor search. Traversal to report items in sorted order. ------------------------------------------------- I've added the StringMap template wrapper to encapsulate the glboal variables in Bentley & Sedgewick's original code. Otherwise, this implementation is substantially identical. StringMap needs a simple and efficient iterator in order to get rid of the InOrder method I haven't bothered with adding an erase function, or STL compatibility, because my goal is an as-light-as-possible symbol table look up. The extra complexity doesn't help that goal. */ #ifdef HAVE_EXCEPTIONS # define STRINGMAP_ERROR throw std::bad_alloc(); #else # include <assert.h> # define STRINGMAP_ERROR assert(false); #endif template <class Data> class StringMap { public: static const int buffPoolGrowSize = 1000; static const int maxPools = 1000; struct Tnode { char splitchar; Data data; Tnode* lokid; Tnode* eqkid; // contains middle kid if splitchar != 0, else char* pointer to full string (node is a leaf)  Tnode* hikid; }; StringMap() : buf(0), bufn(0), freen(0), size(0), root(0) { memset(freearr, 0, sizeof(freearr)); } ~StringMap() { for (int i = 0; i < freen; i++) delete[](freearr[i]); } // A major cost of insert functions is calling malloc to create each node.  // This function uses the ancient C technique of allocating a buffer of nodes and  // dealing them out as needed. Profiling shows that this eliminates most of the  // time spent in storage allocation.  void insert(char const* s, const Data& data) { char const* instr = s; Tnode* pp; Tnode** p; p = &root; while (pp = *p) { // optimization: save a difference in a comparison  int d; if ((d = *s - pp->splitchar) == 0) { if (*s++ == 0) return; p = &(pp->eqkid); } else if (d < 0) p = &(pp->lokid); else  p = &(pp->hikid); } for (;;) { // *p = (Tptr) malloc(sizeof(Tnode)); reference code; this is what the pools replace  if (bufn-- == 0) { if (freen == maxPools - 1) { STRINGMAP_ERROR } buf = new Tnode[buffPoolGrowSize]; freearr[freen++] = buf; bufn = buffPoolGrowSize - 1; } *p = buf++; pp = *p; pp->splitchar = *s; pp->lokid = pp->eqkid = pp->hikid = 0; if (*s++ == 0) { // store a pointer to every string in the tree; this data will be used by later search  // exploit the fact that a node with a null splitchar cannot have an eqkid  pp->eqkid = (Tnode*) instr; pp->data = data; ++size; // count it  return; } p = &(pp->eqkid); } } // If the search character is less, go to lokid;  // if it is greater, go to hikid;  // if it is equal, go to the next character and eqkid.  bool find(char const* s, Data* data) const  { Tnode* p; p = root; while (p) { if (*s < p->splitchar) p = p->lokid; else if (*s == p->splitchar) { if (*s++ == 0) { *data = p->data; return true; } p = p->eqkid; } else  p = p->hikid; } return false; } void inOrderR(std::vector<std::pair<char const*, Data*> >& nodes, Tnode* p) const  { if (!p) return; inOrderR(nodes, p->lokid); if (p->splitchar) inOrderR(nodes, p->eqkid); else  nodes.push_back(std::pair<char const*, Data*>((char const*)p->eqkid, &p->data)); inOrderR(nodes, p->hikid); } void inOrder(std::vector<std::pair<char const*, Data*> >& nodes) const  { inOrderR(nodes, root); } // print all members of the tree in order  void traverse(Tnode* p) { if (!p) return; traverse(p->lokid); if (p->splitchar) traverse(p->eqkid); else  printf("%s\n", (char *) p->eqkid); traverse(p->hikid); } Tnode* buf; // next available Tnode  int bufn; // number of Tnodes available in current pool  int freen; // number of pools to be freed  Tnode* freearr[maxPools]; // all the allocated pools  int size; // number of nodes  Tnode* root; // root of the tree }; #endif 
programming/code/stringmap

Content by Nick Porcino (c) 1990-2011