Thursday, August 18, 2011

Big Map, String, GC, Memory Consumption and Performance

Hashtables are used in many cases to cache information in memory. There are other alternatives, like in memory databases, or flushing data to disk, but when all you need is a simple key-value relation and you need very high performance, the simple Java HashMap can do the trick. Usually cache can allow loosing information that was not claimed recently, assuming that recent usage of the data can foresee higher chances of this data to be claimed again soon, compared to less recently used data. But in some cases we prefer the cache to hold as much information as possible, which means a VERY BIG MAP.

Important morals below are based on an application with HashMap holding 20M entries, in 64bit JVM and 6 GB (-d64 -Xmx6g ). The Map itself uses much less memory, but in order for the entire application to work smoothly we need the 6 GB.

When coming to very big HashMap, there are a few considerations to take into account:


  1. The default load factor of 0.75 is reasonable. Don't play with it without understanding what you are doing. The load factor determines when the map should grow and rehash should occur, this happens when number of elements in the map gets above the result of (capacity * load factor). Higher values of load factor means less rehash operations, but a map that is more dense. Calls to get in a dense map are more costly since buckets are full with many entries that shall be iterated. Lower load factor values will create more rehash operations and a sparse map, which is more memory consuming but more efficient in get calls. It should be noticed that load factor can be bigger than 1, which means we allow the table (number of buckets) to be smaller than number of entries.

    To summarize: don't start with optimizing the load factor, it's a bad start.

  2. Initial capacity is important. Starting the HashMap too small will result with rehash operations upon calls to put. Though HashMap doubles itself on resize, still failing to provide appropriate initial capacity may result with a few costly resize cycles.
    The recommended initial capacity, for load factor of 0.75, is: 1.5 * numElements
    (Read more: http://stackoverflow.com/questions/434989/hashmap-intialization-parameters-load-initialcapacity).

    In our case, the performance difference of setting 20M entries into the map, with setting the initial capacity to 30M (Java will in fact set the initial capacity to the closest power of 2 above this size), compared to no initial capacity set, was almost double the insertion time, with very clear workload on rehash and GC work following it.

    To summarize: initial capacity is important. Set it!

  3. Beware of memory leaks!
    Maps, as any long-leaving memory storage entity, are a potential source for leaks.
    In our case, the Map got Strings from file, but needed to use for the entries' keys only a small portion of each line read from the file. Thus, after reading each line from the file, some string tokenizing was done resulting with the key and value to store.
    Unfortunately the entire line read from file is kept in memory, even though we need only part of it... This is due to the way substring and other similar tokenizing methods are implemented, which return a new String, that holds the old one with offsets inside. This is a known issue (for example see: http://eyalsch.wordpress.com/2009/10/27/stringleaks/).
    Solution is to manually create a new String and give it the substring portion you want to keep:
    String key = new String(bigLine.substring(from, to));
    In our case, fixing a similar issue resulted with requiring 1.5GB less memory for the exact same scenario! It turned out that the map keys' kept a long tail that should have been cut.

    To summarize: beware of holding a substring in a Map! Do it the right way.

  4. In such a huge map, every byte in the key-value becomes 20M in the entire map. Thus saving a few bytes can worth the hassle. For example, instead of using String as the value, we decided to use char array (a simple char[]) thus saving: [1] a reference to the string object from the hashmap entry - instead holding a direct reference to char[]. In 64 bit this values to 8 bytes saving! [2] three int fields that are in String class: offset, count and hashmap. The offset and count fields are there to allow String to point at a position which is partial in the char[] it holds (directly related to the substring leak mention in item 3 above). We don't need this info. Also caching the hashmap value is redundant, as the Map entry itself does it.

    By using char[] instead of String as our value, we save 20 bytes per entry, which is 400MB total! And without hurting the code or making it more complex.

    Our key is also a String, but it cannot be turned into char[], as we need a proper hashcode and equals functions. Still, one should consider implementing a dedicated Key class that holds only the relevant info, char[] in our case. Since the extra reference cannot be saved, the saving here could be the 3 int fields in class String, which totals to 240MB for 20M entries.

    To summarize: caching Strings in big Maps is costly. Think how to reduce the size of your key and value!

  5. The hashcode is crucial. Bad hashcode on the key, which creates bad distribution (too many keys get the same hashcode value) can be performance devastating. On the other hand, though of much less importance, a more efficient hashcode method has an influence over at least 20M insertion operations. Read how String hashcode is implemented to get the idea: http://stackoverflow.com/questions/299304/why-does-javas-hashcode-in-string-use-31-as-a-multiplier

    To summarize: check your key hashcode method, use some real data and validate that you get reasonable distribution. If needed, work on improving the hashcode method by using higher power multiplication on the key fields.

  6. Know when to stop optimizing. In some cases you see a programmer keep optimizing when there is really nothing more to optimize (well, there is always more to optimize, but then it may result with taking parts to C and using JNI, or use direct byte arrays, things you would better not go to, if not really needed...).
    Calculate the most minimal amount of memory you need, save your time and don't try to optimize your application into less.

    To summarize: set goals to your optimization efforts. Make sure the goals are not too aggressive, i.e. are feasible without re-writing the entire thing.

  7. The entries of the big map will eventually arrive to the GC old-generation section. It means that in the described usage the old generation is going to be very big. There are two options to handle this, one is to play with the generation ratios, the other option is to set a big enough total heap memory to allow the old generation to be big enough. If you have enough resources on the machine you can go with the second option, otherwise you would need to dig into the generation ratio configuration.

    To summarize: size of the old generation section is crucial to prevent unnecessary full GC cycles. Configure the max heap size and/or the generation ratios.

By optimizing the memory consumtion we got also much better performance, as the application went into less GC cycles. Some other parts of the drill include removing unused entries from the cache, in a dedicated thread, when reaching certain threshold, and more... The result is a huge cache running smoothly on a reasonable sized server.


Friday, April 22, 2011

Decent Code

Retrospecting the posts labled Decent Code I see many good points worth memorizing.



Do you have an API? - the basic question before coding!

Do you have a theory? - the basic question before debugging!


Be proud of your code! - a simple advice, yet powerful, for code reviews

The risks of redundant code (or - Less is More) - another powerful advice for code reviews

Use Explaining Variables! - and another one for code reviews (I should consider a label for code reviews probably)


Resource Files - this is obvious for anybody today, yet ignored too often...

Freaking behavior of a small little C/C++ bug - avoid the non-void not returning a value!

Semi-colon and java.lang.OutOfMemoryError - the methodic way of analyzing OOM in Java programs


and many more...


like for example...

Make sure to have a strict XSD!

or To AJAX or NOT to AJAX? - important to many web developers today which neglect the request-response model for AJAX idol, in many cases for no good reason.


and still, many more...

Thursday, March 31, 2011

The success of iPhone

Three years back, I explained why Nokia is not supporting external applications.


Well, since then Apple has AppStore and Nokia has followed with OVI. And Samsung has its Internet@TV portal.


It appears that focusing on the appliance itself is not enough nowadays.

Thursday, January 20, 2011

Thoughts following 2010 FogBugz and Kiln World Tour





I was attending yesterday Joel Spolsky world tour for FogBugz and Kiln.


I must admit that till yesterday I thought that Joel has a great blog about SW development, but I didn't quite understand - who needs yet another bug tracking tool... is there still a real market for that, when you have today so many good free tools!?


A few thoughts following the event:



  1. Joel's presentation skills do not fall from his writing. The gathering starting with a projected countdown analog clock, Joel went on stage right at the moment when the countdown reached its end - with a pre-planned bug on the clock finish resulting with some fake errors and blue screens, which led Joel to open a bug, start his presentation, then get a response on his newly opened bug "poping accidentally" while in his presentation (interestingly enough, while his Outlook is closed :-) , which led to getting into the code itself using Kiln, comparing versions and eventually "solving" the bug and checking in. A full bug detection and solving cycle in 10 minutes. With a few jokes here and there, and while going quickly through the products' abilities and strength points.


  2. The traditional tools for bug tracking and source control are OK. But even in this "already solved" domain, there is still room for improvements and for new players, either open source or commercial. When we have an idea to develop something we usually tend to check who already done that and how good it is. And when we see that there are already several reasonable solutions we assume that the market is closed for us on that. Well, Joel shows that there is always a market for realy good products, you don't have to invent a new big thing, you may need however to invent many small things.


  3. Distributed Source Version Control - GIT, Mercurial, Kiln - disconnects the developers own copy from a repository, while preserving full history and repository context. This concept has several significant advantages:
    (a) You do not postpone your check-ins, being afraid of hurting the main repository, check-ins are made into your private copy. When you are ready you push your copy back to the main repository. Your check-ins preserve full history during your development! This is also important when you perform a merge and realize that your original file was overriden by a wrong change - you don't have to worry as you have your full history at hand and you can get back to your last check-in.
    (b) You can push your developments to another repository, e.g. to a QA repository, thus allowing to push relevant fixes quickly, without releasing them yet to the development repository.
    (c) Upon merge, it's easier to see for each change the exact origin of it. Managing several customer releases you can see which changes in the main release where merged into the customer release and which not.

Thursday, January 13, 2011

Online IDE for almost any SW lang you can think of

Take a look at this one: http://ideone.com/


It supports:
Ada, Assembler, AWK, Bash, bc, Brainf**k, C, C#, C++ se, C++0x, C99 strict, CLIPS, Clojure, COBOL, COBOL 85, Common Lisp (clisp), D (dmd), Erlang, F#, Factor, Falcon, Forth, Fortran, Go, Groovy, Haskell, Icon, Intercal, Java, JavaScript (rhino), JavaScript (spidermonkey), Lua, Nemerle, Nice, Nimrod, Objective-C, Ocaml, Oz, Pascal (fpc), Pascal (gpc), Perl, Perl 6, PHP, Pike, Prolog (gnu), Prolog (swi), Python, Python 3, R, Ruby, Scala, Scheme (guile), Smalltalk, SQL, Tcl, Text, Unlambda, Visual Basic .NET, Whitespace


It's very useful when you want to check little programs, like the ones I tried when writing Freaking behavior of a small little C/C++ bug:

[1]


[2]

Freaking behavior of a small little C/C++ bug

Oh boy.
Read till the end the event and its root cause. Important morals follow below.





We run systems that on high capacity events handle thousands of transactions per second. One of the most heavy-traffic periods is New-Year's-Eve, the 31st of December, were most of our systems are under heavy stress around the world, stress that tends to difuse to our support teams. Structured and strict preparations usually make us pass this heavy-traffic day properly in most, if not all sites. Which happily was the case also this year.

Shockingly, on January 2nd we had a crash in two sites.

Analyzing the crash led to a timer that instead of re-scheduling itself for every 5 seconds, keeps snapping abruptly in periods of milliseconds.

While still analyzing the case, reproducing it in our labs, the problem vanished as suddenly as it appeared, on the end of the same day. January 3rd, 00:00, systems went back to behave nicely.

That's really odd. How does the bug relates to the date? Is it a coincidence? It doesn't look so, as a second after midnight problem disappears. Trying to reproduce it in the lab we got the same behavior: it is the bug of January 2nd 2011. (By the way, when running the system in our labs in debug mode, problem didn't reproduce! Bug appears only when running without debug! That's common for memory related bugs, smears etc.)

To some of us, it sounded like the iPhone alarm bug. Which was reported also not to work properly on 2011 start, being fixed on its own, by January 3rd.

http://www.tipb.com/2010/12/31/iphone-bugs-alarms-working-2011/


Maybe it's the same bug?

iPhone runs on iOS which is Linux based. We also run on Linux. Maybe there is something with Linux timers on beginning of 2011?
Looking for something in this direction led to nothing.

On the other hand, analytical investigation led to the following:

  1. The timer, when awakes, calls our callbak function. The callback function shall return an int value. Any value except 1 says "OK", 1 says - please call me again.
  2. Our callback function didn't return a value at all
Wait... - is it legal not to return a value from a non-void method?
Unfortunately, in C/C++ it is. And the bevior is undefined. The function do return a value, in some environmnets it will be the last value from the register. And, well, occasionaly it can be 1.
See:
http://stackoverflow.com/questions/1610030/why-can-you-return-from-a-non-void-function-without-returning-a-value-without-pro/1610454#1610454
http://stackoverflow.com/questions/2598084/function-with-missing-return-value-behavior-at-runtime
What shall be done?
Read:
http://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html
-Wreturn-type
Warn whenever a function is defined with a return-type that defaults to int. Also warn about any return statement with no return-value in a function whose return-type is not void (falling off the end of the function body is considered returning without a value), and about a return statement with an expression in a function whose return-type is void. For C++, a function without return type always produces a diagnostic message, even when -Wno-return-type is specified. The only exceptions are `main' and functions defined in system headers. This warning is enabled by -Wall.

Morals
  • Listen to compiler warnings!
    Solve all warnings, you should have a zero warnings policy.
    The problem above could be caught and solved as a warning (-Wreturn-type).
  • If you don't keep a policy of zero warnings, which you should, turn bad warnings as the one above into an error, with a compilation flag, e.g.: -Werror=return-type
  • You may want to test your software in future time, for example, have a test system that runs all the time 30 days ahead, if there is a time related bug it may help catching it on time. It won't probably catch everything, but it could have catch the problem we had above!