Saturday, March 13, 2010

Don’t be smart. Never implement a resource bundle cache!

by Eduardo Rodrigues

Well, first of all, I’d like to apologize for almost 1 year of complete silence. Since I’ve transferred from Oracle Consulting in Brazil to product development at the HQ in California, it’s been a little bit crazy here. It took a while for me to adjust and adapt to this completely new environment. But I certainly can’t complain, cause it’s been AWESOME! Besides, I'm of the opinion that, if there’s nothing really interesting to say, then it’s better to keep quiet :)

Having said that, today I want to share an interesting experience I had recently here at work. First I’ll try to summarize the story to provide some context.

The Problem

A few months ago, a huge transition happened in Oracle’s internal IT production environment when 100% of its employees (something around 80K users) were migrated from old OCS (Oracle Collaboration Suite) 10g to Oracle’s next generation enterprise collaboration solution: Oracle Beehive. Needless to say, the expectations were big and we were all naturally tense, waiting to see how the system would behave.

Within a week with the system up and running, some issues related to the component I work on (open source Zimbra x Oracle Beehive integration) started to pop up. Among those, the most serious was a mysterious memory leak, which had never been detected before during any stress test or even after more than a year of production, but was now causing containers in the mid-tier cluster to crash after a certain period.

After a couple days of heap dump and log files analysis, we discovered that the culprit were 2 different resource caches maintained by 2 different components in Zimbra’s servlet layer, both related to its internationalization capabilities. In summary, one was a skin cache and the other was a resource bundle cache.

Once we dove into Zimbra’s source code, we quickly realized we were not really facing a memory leak per se but an implementation which clearly underestimated the explosive growth in memory consumption that a worldwide deployment like ours has a huge potential to trigger.

Both caches were simply HashMap objects and, ironically, their keys were actually the key to our problem. The map keys were defined as a combination that included client’s locale, user agent and, in the case of the skins cache, the skin name was also included. Well… you can probably imagine how many different combinations of these elements are possible within a worldwide system deployment, right? Absolutely! In our case, each HashMap would quickly reach 200MB. Of course, consuming 400MB out of 1GB of configured heap space with only 2 objects is not very economic (to say the least).

So, OK. Great! We found our root cause (which is awesome enough in this kind of hard-to-analyze-production-only bugs). But now comes the harder part: how can we fix it?!

The Solution

First of all, it’s very important to keep this very important aspect in mind: we were dealing with a source code that wasn’t ours, therefore, keeping changes as minimal as possible was always crucial.

One thing we noticed right away was the fact that we were most likely creating multiple entries in both maps that ended up containing identical copies of a same skin or resource bundle content. That’s because our system only supported 15 distinct locales, which means, every unsupported client locale would fallback to one of the supported locales, ultimately, the default English locale. However, the map key would still be composed with the client’s locale, thus creating a new map entry, and even worse, mapping to a new copy of the fallback locale. Yes, locales and skins that had already been loaded and processed, were constantly being reloaded, reprocessed and added to the caches.

So, our first approach was to perform a small intervention with the only intention to prevent any unsupported client locale from originating a new entry in those maps. Ideally, we would want to change the maps’ key composition but we were not very comfortable with this idea, mainly because we were not sure we fully understood all the consequences of that, and fix the problem causing another was not an option.

Unfortunately, days after patching the system, our containers were crashing with OutOfMemory exceptions again. As we discovered – the hardest way – simply containing the variation of the locale component in the maps’ key composition was enough to slow down the heap consumption but not enough to avoid the OOM crashes.

Now it was time to put our “fears” aside and dig deeper. And we decided to dig in two simultaneous fronts: the skin cache and the resource bundle cache. In this post, I’ll only talk about the resource bundle front leaving the skin cache front to a next post.

When I say “resource bundle”, I’m actually referring to Java’s java.util.ResourceBundle, more specifically its subclass java.util.PropertyResourceBundle. With that in mind, 2 strange things caught my attention while looking carefully into the heap dumps:

  1. Each ResourceBundle instance had a “parent” attribute pointing to its next fallback locale and so on, until the ultimate fallback, the default locale. This means that each loaded resource bundle could actually encapsulate other 2 bundles.
  2. There were multiple ResourceBundle instances (each one with a different memory address) for 1 same locale.

So, number 1 made me realize that the memory consumption issue was even worse than I thought. But number 2 made no sense at all. Why have a cache that is only stocking objects but is not able to reuse existing ones? So I decided to take a look at the source code of class java.util.ResourceBundle in JDK 5.0. The Javadoc says:

Implementations of getBundle may cache instantiated resource bundles and return the same resource bundle instance multiple times.

Well, turns out Sun’s implementation (the one we use) DOES CACHE instantiated resource bundles. Even better, it uses a soft cache, which means all content is stored as soft references, granting the garbage collector the permission to discard one or more of its entries if it decides it needs to free up more heap space. Problem solved! – I thought. I just needed to completely remove the unnecessary resource bundle cache from Zimbra’s code ant let it take advantage of the JVM’s internal soft cache. And that’s exactly what I tried. But, of course, it wouldn’t be that easy…

Since at this point I already knew exactly how to simulate the root cause of our problem, I started debugging my modified code and I was amazed when I saw that the internal JVM’s cache was also stocking up multiple copies of bundles for identical locales. The good thing was that now I could understand what was causing #2 above. But why?! The only logical conclusion was, again, to blame the cache’s key composition.

The JVM’s resource bundle cache also uses a key, which is composed by the bundle’s name + the corresponding java.util.Locale instance + a weak reference to the class loader used to load the bundle. But then, how come a second attempt to load a resource bundle named “/zimbra/msg/ZmMsg_en_us.properties”, corresponding to en_us locale and using the very same class loader was not hitting the cache?

After a couple hours thinking I was loosing my mind, I finally noticed that, in fact, each time a new load attempt was made, the class loader instance, although of the same type, was never the same. And I also noticed that its type was actually an extended class loader implemented by inner-class com.zimbra.webClient.servlet.JspServlet$ResourceLoader. When I checked that code, I immediately realized that class com.zimbra.webClient.servlet.JspServlet, which itself is an extension of the real JspServlet being used in the container, was overriding method service() and creating a new private instance of custom class loader ResourceLoader and forcefully replacing the current thread’s context class loader with this custom one, which was then utilized to load the resource bundles.

My first attempt to solve this mess was to make the custom class loader also override methods hashCode() and equals(Object) so they would actually proxy the parent class loader (which was always the original one that was replaced in method service()). Since the web application’s class loader instance would always be the same during the application’s entire life cycle, both hashCode and equals for the custom loader would consistently return the same results at all times, thus causing the composed keys to match and cached bundles to be reused instead of reloaded and re-cached. And I was wrong once again.

Turns out, as strange as it may look at first sight, when the JVM’s resource bundle cache tries to match keys in its soft cache, instead of calling the traditional equals() to compare the class loader instances, it simply uses the “==” operator, which simply compares their memory addresses. Actually, if we think more about it, we start to understand why they implemented this way. Class loaders are never expected to be constantly instantiated, over and over again, during the life cycle of any application, so why make an overhead method call to equals()?

Finally, now I knew for sure what was the definitive solution. I just needed to transform the private instances of ResourceLoader into a singleton, keeping all the original logic. Bingo! Now I could see the internal bundle cache being hit as it should be. Problem solved, at last!

At the end, after having completely removed the custom resource bundle cache implemented in Zimbra’s servlet layer and performed the necessary changes to make Zimbra take full and proper advantage of the built-in bundle cache offered by the JVM, instead of wasting a lot of time and memory reloading and storing hundreds of instances of resource bundles, mostly multiple copies of identical bundles, I could now confirm that despite all different client locales coming in clients' requests, the JVM’s bundle cache was holding no more than those corresponding to the 15 supported locales. With that, we had finally fixed the memory burning issue for good.

Conclusion

As this article’s title suggests, don’t try to be smarter than the JVM itself without first checking whether it’s doing it’s job well enough or not. Always do carefully read the Javadocs and, if needed, check your JVM’s source code to be sure about its behavior.

And remember…. never implement a resource bundle cache in your application (at least if you’re using Sun’s JVM) and be very careful when implementing and using your own custom class loaders too.

That’s all for now and…
Keep reading!

3 comments:

Anonymous said...

Genial fill someone in on and this post helped me alot in my college assignement. Thank you on your information.

Anonymous said...

Approvingly your article helped me truly much in my college assignment. Hats high to you post, will look forward for more interdependent articles soon as its anecdote of my pick question to read.

Webmaster said...

I appreciate the information. Java is one of the consistent player from the development industry which has been providing the wider scope for developers to come out with different solutions.