Tommy Benchmarks

To evaluate Tommy's performance, an extensive benchmark was done, comparing it to the best data structure libraries available:

Specifically we test:

Note that googlelibchash and concurrencykit are not shown in the graphs because they present a lot of spikes. See the Notes at the end.

The Benchmark

The benchmark consists of storing a set of N pointers to objects and searching for them using integer keys.

Compared to the case of mapping integers to integers, mapping pointers to objects means that the pointers are also dereferenced, to simulate the object access, resulting in additional cache misses. This gives an advantage to implementations that store information in the objects themselves, as the additional cache misses are already implicit.

The tests done are:

The Change, Hit and Miss tests always operate with N objects in the containers. The Insert test starts with an empty container, and the Remove test ends with an empty container. The objects are always dereferenced, as we assume they are being used. This happens even in the Remove case, as we assume they are being deallocated.

All the objects are preallocated in the heap, and the allocation and deallocation time is not included in the test.

The objects contain an integer value field used for consistency checks, an unused payload field of 16 bytes, and any other data required by the data structure.

The objects are identified and stored using unique integer keys. The key domain used is dense, and is defined by the set of N even numbers from 0x80000000 to 0x80000000+2*N.

The use of even numbers leaves missing keys inside the domain for the Miss and Change tests. Those tests use the key domain defined by the set of N odd numbers from 0x80000000+1 to 0x80000000+2*N+1. Note that using additional keys at the corners of the domain would have given an unfair advantage to tries and trees, as they implicitly keep track of the maximum and minimum key values inserted.

The 0x80000000 base allows testing a key domain that doesn't start at 0. Using a 0 base would have given an unfair advantage to implementations that handle it as a special case.

For all the hashtables the keys are hashed using the tommy_inthash_u32() function, which ensures a uniform distribution. This hash function is also reversible, meaning that no collision is caused by hashing the keys. For tries and trees the keys are not hashed, and are used directly.

The tests are repeated using keys in Random mode and in Forward mode. In the forward mode the key values are used in order from the lowest to the highest. In the random mode the key values are used in a completely random order. In the Change test in forward mode, each object is reinserted using the previous key incremented by 1. In random mode each object is reinserted using a completely different and uncorrelated key.

The forward order favors tries and trees, as they use the key directly and get a cache advantage from consecutive keys. The random order favors hashtables, as the hash function already randomizes the key. Real use cases are usually in between, and the random one is the worst case.

Results

The most significant tests depend on your data usage model, but if in doubt, look at Random Hit and Random Change. They represent the real-world worst case.

img_random_hit.png

In the Random Hit graph you can see a vertical split at the 100.000 elements limit. Below this limit the cache of a modern processor is able to contain most of the data, allowing very fast access with almost all the data structures. Above this limit the number of cache misses is the dominant factor, and the curve depends mainly on the number of cache misses required.

rbtree and nedtrie grow as log2(N), as they have two branches per node; tommy_trie_inplace grows as log4(N), tommy_trie as log8(N), and hashtables are almost constant and don't grow. For tommy_trie_inplace and tommy_trie you can change the growth curve by configuring a different number of branches per node.

img_random_change.png

The Random Change graph confirms the vertical split at the 100.000 elements limit. It also shows that hashtables are almost unbeatable with random access.

Random order

Here you can see the whole Random test results on different platforms.

In the Random test, hashtables almost always win, tries come second, and trees are last.

The best choices are tommy_hashdyn, tommy_hashlin, and googledensehash, with tommy_hashlin having the advantage of being real-time friendly and not increasing heap fragmentation.

img_random_insert.png
img_random_hit.png
img_random_miss.png
img_random_change.png
img_random_remove.png

Forward order

Here you can see the whole Forward test results on different platforms.

In the Forward test, tries are the winners. Hashtables are competitive until the cache limit, then they lose against tries. Trees are the slowest.

The best choices are tommy_trie and tommy_trie_inplace: tommy_trie is a bit faster, and tommy_trie_inplace doesn't require a custom allocator.

Note that hashtables are also faster in forward order than in random order. This may seem surprising, as the hash function randomizes the access even with consecutive keys. It happens because the objects are allocated in consecutive memory, and accessing them in order improves cache utilization, even if the hashed key is random.

Note that you can make hashtables reach trie performance by tweaking the hash function so that nearby keys are stored nearby. This is possible if you know the key distribution in advance. For example, in the benchmark you could use something like:

 #define hash(v) (tommy_inthash_u32((v) & ~0xF) + ((v) & 0xF))

which makes keys that differ only in the lowest bits have hashes with the same property, resulting in objects stored nearby and improved cache utilization.

img_forward_insert.png
img_forward_hit.png
img_forward_miss.png
img_forward_change.png
img_forward_remove.png

Size

Here you can see the memory usage of the different data structures.

img_random_size.png

Code

The compilers used in the benchmark are:

The following is pseudo code of the benchmark used. In this case it's written for the C++ unordered_map.

 #define N 10000000 // Number of elements
 #define PAYLOAD 16 // Size of the object

 // Basic object inserted in the collection
 struct obj {
     unsigned value; // Key used for searching
     char payload[PAYLOAD];
 };

 // Custom hash function to avoid using the STL one
 class custom_hash {
 public:
     unsigned operator()(unsigned key) const { return tommy_inthash_u32(key); }
 };

 // Map collection from "unsigned" to "pointer to object"
 typedef std::unordered_map<unsigned, obj*, custom_hash> bag_t;
 bag_t bag;

 // Preallocate objects
 obj* OBJ = new obj[N];

 // Keys used for inserting and searching elements
 unsigned INSERT[N];
 unsigned SEARCH[N];

 // Initialize the keys
 for(unsigned i=0;i<N;++i) {
     INSERT[i] = 0x80000000 + i * 2;
     SEARCH[i] = 0x80000000 + i * 2;
 }

 // If random order is required, shuffle the keys with Fisher-Yates
 // The two key orders are not correlated
 if (test_random) {
     std::random_shuffle(INSERT, INSERT + N);
     std::random_shuffle(SEARCH, SEARCH + N);
 }

Insert benchmark

 for(unsigned i=0;i<N;++i) {
     // Setup the element to insert
     unsigned key = INSERT[i];
     obj* element = &OBJ[i];
     element->value = key;

     // Insert it
     bag[key] = element;
 }

Change benchmark

 for(unsigned i=0;i<N;++i) {
     // Search the element
     unsigned key = SEARCH[i];
     bag_t::iterator j = bag.find(key);
     if (j == bag.end())
         abort();

     // Remove it
     obj* element = j->second;
     bag.erase(j);

     // Reinsert the element with a new key
     // Use +1 in the key to ensure that the new key is unique
     key = INSERT[i] + 1;
     element->value = key;
     bag[key] = element;
 }

Hit benchmark

 for(unsigned i=0;i<N;++i) {
     // Search the element
     // Use a different key order than insertion
     // Use +1 in the key because we run after the "Change" test
     unsigned key = SEARCH[i] + 1;
     bag_t::const_iterator j = bag.find(key);
     if (j == bag.end())
         abort();

     // Ensure that it's the correct element.
     // This operation is like using the object after finding it,
     // and likely involves a cache-miss operation.
     obj* element = j->second;
     if (element->value != key)
         abort();
 }

Miss benchmark

 for(unsigned i=0;i<N;++i) {
     // Search the element
     // All the keys are now shifted by +1 by the "Change" test, and we'll find nothing
     unsigned key = SEARCH[i];
     bag_t::const_iterator j = bag.find(key);
     if (j != bag.end())
         abort();
 }

Remove benchmark

 for(unsigned i=0;i<N;++i) {
     // Search the element
     // Use +1 in the key because we run after the "Change" test
     unsigned key = SEARCH[i] + 1;
     bag_t::iterator j = bag.find(key);
     if (j == bag.end())
         abort();

     // Read the element before removing it,
     // as erasing invalidates the iterator
     obj* element = j->second;
     bag.erase(j);

     // Ensure that it's the correct element.
     if (element->value != key)
         abort();
 }

Other benchmarks

Here are some links to other performance comparisons:

Comparison of Hash Table Libraries

Hash Table Benchmarks

Notes

Here are some notes about the tested data structures that are not part of Tommy.

Google C libchash

It's the C implementation located in the experimental/ directory of the googlesparsehash archive. It has very bad performance in the Change test for some N values. See this graph with a lot of spikes. The C++ version doesn't suffer from this problem.

Google C++ densehash

It doesn't release memory on deletion. To avoid an unfair advantage in the Remove test, we force a periodic resize by calling resize(0) after deletions. The resize is executed when the load factor drops below 20%.

khash

It doesn't release memory on deletion. This gives it an unfair advantage in the Remove test.

nedtrie

I found a crash bug when inserting keys with the value 0. The fix for this issue is now in the nedtries github. We don't use the C++ implementation as it doesn't compile with gcc 4.4.3.

Judy

Sometimes it has bad performance on some specific platforms and for some specific input data sizes. This makes the performance difficult to predict, as it is usually good until you hit one of these cases. See for example this graph with a big, reproducible spike at 50.000 elements.

Concurrency Kit

It has very bad performance in the Change test for some N values. See this graph with a lot of spikes.