Thursday, August 18, 2011

Big Map, String, GC, Memory Consumption and Performance

Hashtables are used in many cases to cache information in memory. There are other alternatives, like in-memory databases or flushing data to disk, but when all you need is a simple key-value relation with very high performance, the plain Java HashMap can do the trick. Usually a cache is allowed to lose information that was not requested recently, on the assumption that recently used data is more likely to be requested again soon than data that has not been touched for a while. But in some cases we prefer the cache to hold as much information as possible, which means a VERY BIG MAP.

The lessons below are based on an application with a HashMap holding 20M entries, running on a 64-bit JVM with a 6 GB heap (-d64 -Xmx6g). The map itself uses much less memory, but for the entire application to work smoothly we need the 6 GB.

When working with a very big HashMap, there are a few considerations to take into account:


  1. The default load factor of 0.75 is reasonable. Don't play with it without understanding what you are doing. The load factor determines when the map should grow and a rehash should occur: this happens when the number of elements in the map exceeds (capacity * load factor). Higher load factor values mean fewer rehash operations but a denser map, and calls to get in a dense map are more costly since buckets hold many entries that must be iterated. Lower load factor values cause more rehash operations and a sparser map, which consumes more memory but is more efficient on get calls. Note that the load factor can be bigger than 1, which means we allow the table (the number of buckets) to be smaller than the number of entries.

    To summarize: don't start by optimizing the load factor; it's a bad start.

  2. Initial capacity is important. Starting the HashMap too small will result in rehash operations during calls to put. Although HashMap doubles its capacity on each resize, failing to provide an appropriate initial capacity may still cost a few expensive resize cycles.
    The recommended initial capacity, for a load factor of 0.75, is: 1.5 * numElements
    (Read more: http://stackoverflow.com/questions/434989/hashmap-intialization-parameters-load-initialcapacity).

    In our case, we compared inserting 20M entries into a map with the initial capacity set to 30M (Java will in fact round the capacity up to the nearest power of 2 above this size) against a map with no initial capacity set. Without the initial capacity, insertion took almost double the time, with the rehash operations and the GC work following them clearly visible in the workload.

    To summarize: initial capacity is important. Set it! (A sizing sketch for points 1 and 2 appears after this list.)

  3. Beware of memory leaks!
    Maps, like any long-living in-memory storage, are a potential source of leaks.
    In our case, the map was populated with Strings read from a file, but only a small portion of each line was needed for the entry's key. So after reading each line, some string tokenizing produced the key and value to store.
    Unfortunately, the entire line read from the file is kept in memory, even though we need only part of it... This is due to the way substring and other similar tokenizing methods are implemented: they return a new String that still references the original one's character array, with offsets into it. This is a known issue (see for example: http://eyalsch.wordpress.com/2009/10/27/stringleaks/).
    The solution is to manually create a new String holding only the portion you want to keep:
    String key = new String(bigLine.substring(from, to));
    In our case, fixing a similar issue resulted in 1.5 GB less memory for the exact same scenario! It turned out that the map keys were keeping a long tail that should have been cut.

    To summarize: beware of holding a substring in a Map! Do it the right way (a loading sketch appears after this list).

  4. In such a huge map, every extra byte per entry adds up to 20 MB across the entire map, so saving a few bytes can be worth the hassle. For example, instead of using String as the value, we decided to use a char array (a simple char[]), thus saving: [1] one level of indirection - the map entry now references the char[] directly instead of going through a String object, which in 64-bit saves an 8-byte reference; [2] the three int fields in the String class: offset, count and hash. The offset and count fields allow a String to point at a partial region of the char[] it holds (directly related to the substring leak mentioned in item 3 above), and we don't need that. Caching the hash value is also redundant, as the Map entry itself does it.

    By using char[] instead of String as our value, we save 20 bytes per entry, which is 400MB total! And without hurting the code or making it more complex.

    Our key is also a String, but it cannot simply be turned into a char[], as a key needs proper hashCode and equals methods. Still, one should consider implementing a dedicated key class that holds only the relevant info - a char[] in our case (one possible shape is sketched after this list). Since the extra reference cannot be saved here, the saving would be the 3 int fields of class String, which totals 240 MB for 20M entries.

    To summarize: caching Strings in big Maps is costly. Think about how to reduce the size of your key and value!

  5. The hashcode is crucial. A bad hashCode on the key - one that creates a poor distribution, where too many keys get the same hash value - can be devastating for performance. On the other hand, though of much less importance, a more efficient hashCode method also has an influence when it runs over at least 20M insertion operations. Read how String's hashCode is implemented to get the idea: http://stackoverflow.com/questions/299304/why-does-javas-hashcode-in-string-use-31-as-a-multiplier

    To summarize: check your key's hashCode method: use some real data and validate that you get a reasonable distribution (a simple distribution check is sketched after this list). If needed, improve the hashCode method, for example by choosing a different multiplier over the key fields.

  6. Know when to stop optimizing. In some cases you see a programmer keep optimizing when there is really nothing left to optimize (well, there is always more to optimize, but it may mean moving parts to C via JNI, or using direct byte arrays - things you had better avoid unless really needed...).
    Calculate the minimal amount of memory you actually need, save your time, and don't try to optimize your application below that.

    To summarize: set goals for your optimization efforts. Make sure the goals are not too aggressive, i.e. feasible without re-writing the entire thing.

  7. The entries of the big map will eventually end up in the GC old generation, which means that in the described usage the old generation is going to be very big. There are two options to handle this: one is to play with the generation ratios (for example via -XX:NewRatio), the other is to set a total heap size big enough to let the old generation grow as needed. If you have enough resources on the machine you can go with the second option; otherwise you will need to dig into the generation-ratio configuration.

    To summarize: size of the old generation section is crucial to prevent unnecessary full GC cycles. Configure the max heap size and/or the generation ratios.
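
Below are a few small Java sketches illustrating the points above. They are sketches only: class names, file names, formats and sizes are placeholders chosen to match the scenario described in this post, not code from the actual application.

First, sizing the map up front (points 1 and 2) so that no rehash occurs while loading; the capacityFor helper is a hypothetical convenience:

    import java.util.HashMap;
    import java.util.Map;

    public class MapSizing {

        // Smallest capacity that keeps 'expected' entries below the resize
        // threshold (capacity * loadFactor), so no rehash happens during load.
        static int capacityFor(int expected, float loadFactor) {
            return (int) Math.ceil(expected / loadFactor);
        }

        public static void main(String[] args) {
            int expectedEntries = 20000000;    // matches the 20M entries discussed above
            float loadFactor = 0.75f;          // the default; rarely worth changing

            // HashMap rounds the requested capacity up to the next power of two.
            Map<String, char[]> cache = new HashMap<String, char[]>(
                    capacityFor(expectedEntries, loadFactor), loadFactor);

            System.out.println("requested capacity: "
                    + capacityFor(expectedEntries, loadFactor));
        }
    }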
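
Next, loading entries from a file without the substring leak (point 3), storing values as char[] (point 4). The file name and the tab-separated format are assumptions for the sake of the example; on the JVMs of the time a bare substring kept the whole line's character array reachable, hence the explicit copies:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    public class LeakFreeLoad {
        public static void main(String[] args) throws IOException {
            Map<String, char[]> cache = new HashMap<String, char[]>(30000000);

            BufferedReader in = new BufferedReader(new FileReader("data.txt")); // placeholder file
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    int sep = line.indexOf('\t');   // assumed format: key<TAB>value
                    if (sep < 0) {
                        continue;
                    }
                    // Copy the key out of the line; a bare substring would keep
                    // the entire line's char[] reachable from the map.
                    String key = new String(line.substring(0, sep));
                    // Store the value as a fresh char[] (toCharArray copies the characters).
                    char[] value = line.substring(sep + 1).toCharArray();
                    cache.put(key, value);
                }
            } finally {
                in.close();
            }
            System.out.println("loaded " + cache.size() + " entries");
        }
    }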
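
For the key side of point 4, one possible shape of a dedicated key class that holds only a char[] plus the equals and hashCode a HashMap key needs. The class name is arbitrary and the hash formula simply mirrors the 31-multiplier scheme used by String:

    import java.util.Arrays;

    // A slim key: just the characters, with equals/hashCode so it can serve as a map key.
    // Unlike String (at the time), there are no offset, count or cached-hash fields to pay for.
    public final class CharKey {

        private final char[] chars;

        public CharKey(char[] chars) {
            this.chars = chars;
        }

        @Override
        public int hashCode() {
            // Same 31-based scheme as String.hashCode(); pick a different multiplier
            // if your real data distributes poorly (see point 5).
            int h = 0;
            for (char c : chars) {
                h = 31 * h + c;
            }
            return h;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) {
                return true;
            }
            if (!(o instanceof CharKey)) {
                return false;
            }
            return Arrays.equals(chars, ((CharKey) o).chars);
        }
    }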
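
Finally, a rough distribution check for point 5: hash a sample of keys into a table of the size the map will actually use and see how uneven the buckets come out. The sample below is synthetic; in practice feed it real keys from your data set (HashMap also applies an extra bit-mixing step on top of hashCode, which this sketch ignores):

    import java.util.ArrayList;
    import java.util.List;

    public class HashSpread {

        // Counts how many keys fall into each bucket: hashCode masked down
        // to a power-of-two table size, roughly the way HashMap places entries.
        static int[] bucketHistogram(List<String> keys, int buckets) {
            int[] histogram = new int[buckets];       // buckets must be a power of two
            for (String key : keys) {
                histogram[key.hashCode() & (buckets - 1)]++;
            }
            return histogram;
        }

        static void report(int[] histogram, int totalKeys) {
            int worst = 0;
            int empty = 0;
            for (int count : histogram) {
                if (count > worst) worst = count;
                if (count == 0) empty++;
            }
            System.out.println("average per bucket: " + ((double) totalKeys / histogram.length)
                    + ", worst bucket: " + worst + ", empty buckets: " + empty);
        }

        public static void main(String[] args) {
            List<String> sample = new ArrayList<String>();
            for (int i = 0; i < 1000000; i++) {
                sample.add("key-" + i);               // stand-in; use real keys here
            }
            int buckets = 1 << 20;                    // illustrative table size
            report(bucketHistogram(sample, buckets), sample.size());
        }
    }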

By optimizing the memory consumption we also got much better performance, as the application went through fewer GC cycles. Other parts of the drill include removing unused entries from the cache in a dedicated thread when a certain threshold is reached, and more... The result is a huge cache running smoothly on a reasonably sized server.