Will my hashing cache keys get a conflict?

April 26, 2011 By Shuyang Zhou

The short answer to the title question is YES. However a longer version is : After LPS-16789 you don't really need to worry about it. The possibility for seeing an actual conflict is negligible for most people in most cases.

If your system is pursuing 100% safety in any case, stop reading this blog, go to LPS-16744, that is what you need.

However, if you are willing to take a little risk (It is very very low, keep reading you will see the actual number) by hashing your cache key for store, the reward will be a performance boost on old generation gc improving.

Let's clarify some conceptions first:

  1. No hashing algorithm is strong enough to prevent any conflict from unlimited input source. The reason is there is no one-to-one mapping from an umlimited space to a fixed size space.
  2. No system is 100% safe. Are you sure your colo will never have a power failure? Or your hardware will never have a "heart attack"? (Even the EARTH can not live for ever).

Your system will eventually halt, either by an exceptional reason or for an on schedule maintaining. So if the hashing conflict possibility is lower than those non-preventable situations, I suggest you to just let it go, the performance pay back for this will be much more worthy than your sacrifice.

Before I start to do math for caculating the hashing conflict possibility, I have to make an assumption.

Like no hashing algorithm is strong enough, no hashing algorithm can do the mapping in an absolutely even manner. Liferay's HashCodeCacheKeyGenerator is using the following formula(which is the exact same algorithm as java.lang.String.hashCode() using):

Hash(s) = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

Note: s is a String with n length.

It is not absolutely even for sure, to analyze its flatness is beyound my limited mathematics knowledge, so please pardon me to consider it as an even algorithm from now on for the sake of simplify analyzing.

With the above assumption, the problem now is much clear, let me describe the same problem in a different way:

Throwing n balls into b buckets(n <= b), what is the possibility that we end up with no bucket has more than 1 ball?

While I was trying to remind myself about how to do the Permutation and Combination, my wife surprisingly gave me the follow answer within 1 minute(To be honest, I was pissed off by this. Hate to admit she can do math better than meBut glad for married a smart women)

P = (b/b) * ((b-1)/b) * ((b-2)/b) * ... * ((b-n+1)/b)
    = (b * (b-1) * (b-2) * ... * (b-n+1))/b^n
    = b!/(b^n*(b-n)!)

Then the conflict possibility is:

Pc = 1 - P = 1 - b!/(b^n*(b-n)!)

Now let's put numbers into this formula, we use long (64bit) to hold the hash result, so we have 2^64 usable hash code, which means b = 2^64.
The n is your cache size, although you could try to configure your cache to support a super large size, but unless your using an off heap cache provider, I can promise you mostly likely you will see an OOM before the hashing conflict appears.

A reasonable in-heap cache size in the real world would be about: 10,000 ~ 100,000 entries.
If you are using a cache more than this size, maybe you should consider to break it down into smaller regions and maybe even move them out of the jvm heap, regardless of the hash conflict, a huge cache region is slow for looking up and inserting.

Cache warmup hashing conflict possibility
Cache sizeConflict possibility
10,000 entries2.710e-12
100,000 entries2.710e-10

 

Please be aware the number above is the possibility of conflict when you insert 10,000 and 100,000 entries into an empty cache. I will call this cache warmup hashing conflict rate.

Once the cache is properly populated, its size will remain, however new entries will come in, old entries will evict out, the cache is entering swapping phase.
Now let's analyze the hashing conflict possibility for a swapping cache.

This is actually quite easy, you have n balls inside b buckets(n <= b), no bucket has more than 1 ball, now you throw a new ball toward those buckets, what is the possibility that you end up with putting the new ball into a bucket that already contains a ball? Obviously it is n/b.

The difficult question here is how othen do you throw a new ball(inserting new entry into cache)? For any individual cache the actual insert rate normally is very low, the reason is cache is designed for reading, not writing. If you have a cache that is mainly be written to, rather than reading from. You'd better reconsider your design. In my experience this number should be under 50 ips (insert per second), well this is an experience number, I can not prove it. If you have a way to measure the actual insert rate of your cache, please replace it with your number.

Note: you may see a high cache missing rate from Ehcahe JMX report for certain cache(could be 100+ per second sometimes). In ehcache, missing means go fetch from DB then cache it, which equals to an insert. But this is not the exact same insert rate I am talking here. In my assumption, the input source is unlimited which means it won't generate any same result, every single new entry is different from all history results. This kind of thing does not exist in real world. All input source repeat itself at certain degree. For example, you may find user cache has a super high swapping rate at peak load, maybe 300+ ips(Means within one second, your system has 300+ users login and 300+ users logout). But it is ok, people login/logout, for the same user no matter how many times he login, it only counts once. The reason is the actual cache key for that user is same, insert same key multiple time will not increase the conflict rate at all. Only different keys can cause conflict.

Now let's see what is the possibility of keeping your system running for whole year long without maintaining(not even clearing the cache from Liferay's control panel).

There are 365 * 24 * 60 * 60 = 31536000 seconds in a year, let's say your cache insert rate is 50 ips, then in one year you will insert 31536000 * 50 = 1576800000 times.

For each insert you are taking the risk of n/2^64 to see a conflict, n is your cache size, so we have:

Cache swapping hashing conflict possibility
Cache sizeConflict possibility
10,000 entries8.548e-7
100,000 entries8.548e-6

 

So in summary, within one year if the possibility for your server to be out of service (no matter it is a system failure or an on schedule maintaining) is higher than 2.710e-10 + 8.548e-6 ≈ 8.548e-6(This is almost guaranteed to be true), I suggest you to hash your cache key for store. The possibility to see a conflict is very very low, even it actually happens you can recover it by clearing the cache in control panel.

Let me emphasize this again, if you do need a 100% safe cache key, see LPS-16744, that is the solution.

Direct JSP Servlet

December 21, 2010 By Shuyang Zhou

A lot of Liferay TagLibs use including JSP files to present their content.

Is there any difference between TagLib used JSPs and the normal ones?
Functionally, they are the same thing.
But when we see them from performance view, there is a huge difference.

Whenever we do a jsp include, there are two major performance related steps:
 

  • 1) Jsp servlet looking up.

Code example:
RequestDispatcher requestDispatcher = servletContext.getRequestDispatcher(path);

The ServletContext.getRequestDispatcher(String path) is a heavy app-server api call, which needs to look up app-server's jsp servlet resource.

  • 2) Filter Stack setting up and traversing

Code example:
requestDispatcher.include(request, response);

No matter you want it or not, the invocation will set up the filter stack(by request URL regex pattern mapping), and traverse all the filters in the stack one by one.

For a normal jsp including, there is nothing wrong with these two steps. But for TagLib, things are a little different, And this little difference can bring us huge performance improvement.

For 1) in general cases, path parameter is various, there is a large candidate pool.
But for TagLibs, the candidate pool is relatively smaller, which means we can consider to cache the path mapping to the underneath JSP servlet. This can save us the look up time significantly.

For 2) in general cases, filters are absolutely needed.
But for all Liferay TagLibs, filters are guaranteed not needed. So the setting up and traversing filters are huge performance waste, we should call the JSP servlet directly without bothering the filters.

In LPS-13776, I made an improvement based on these analyzing.
I added a cache to hold path to servlet mapping, so the looking up will only happen once when the jsp is included for the first time. After that all looking up hit the cache.
By using the DirectRequestDispatcher, all include calls is sent directly to the jsp servlet without any filter processing.

You can turn on this feature by setting:
direct.servlet.context.enabled=true
By default this is on.

One drawback about the mapping cache is losing jsp dynamic reloading ability at runtime, which is not acceptable for developers. To overcome this pitfall, I added a timestamp to the mapped jsp servlet, whenever receives a request for a jsp servlet, it will check the jsp file's timestamp, compare to the jsp servlet's timestamp. If the external jsp file is newer, do a reload. This reloading is only required at development time, on production server it is totally a waste.
You can turn off this feature by setting:
direct.servlet.context.reload=false
By default this is on.

New AOP Mechanism

July 7, 2010 By Shuyang Zhou

Liferay is heavily using Spring's aop. Spring's aop is doing an amazing job to make your life easy by making your code clean.

So why i am creating something new, if the exist one is good enough? Spring is taking care of general use cases, but Liferay's use case is a little bit special, so it gives me a chance to improve.

Let's see how does Spring do the general aop first.
Basically the aop is a wrapper on top of the real bean, so it gives you a chance to do extra processing before(or after, or throwing, or finally) real logic. And to make this more powerful, the wrapper can wrap another wrapper, which means multiple layers aop support.

Your extra processing is put inside advice class, and the advice will be binded to certain point-cut at runtime. The following picture demonstrates this.


Besides your own logic inside advices, which parts contribute the most overhead? I mean the overhead purely generated by the framework itself.
There are two parts:

  1. The aop wrapper's creation and invoking.
  2. The runtime wiring between advice and point-cut.


For 1, each aop wrapper is a JDK dynamic proxy object(or CGLib proxy), as you all know the creation of proxy object is kind of heavy. And the generated code introduces a few more stack-calls, so the more aops you have, the deeper the stack is. If you have ever seen liferay's thread dump, you will know what i am talking about.
For 2, the wiring between advice and point-cut is using a regular express like pattern matching which is a complex calculating process. So the more point-cuts you have and the more complex the patterns are, the slower your code is.

Now you should see, to improve the performance, we should create aop wapper objects as fewer as possible, we should use fewer point-cuts.
These rules can not be applied to Spring itself, since for general purpose processing, you can not limit the usage for them.

But for liferay, things are different, the key point is, the aop advices are mainly for service beans.

First, we could limit the point-cut to just service beans, if an advice only cares a few service beans, it can do a second matching by bean name or annotation. Either way is cheaper than regex.

Second, since advice itself will take care of secondly matching, all aop wrappers are just against service beans. So why bother creating a new wrapper for each advice? All advices could share the same wrapper which is created by the first invoked advice, then all other advices will be invoked as a chain. The following picture demonstrates this.


So this reduce Na(advice number) aop wrapper to 1, Np(point-cut number) point-cut matching to 1(still needs a few secondly matching, but they are much cheaper).

Since the advices now are chained up as a linked list, it becomes possible to modify aops structures esaily at runtime, just insert or remove elements from the list. By this way you can take out the unused advices complete(not just disable them, even they are disabled, they still cause stack calls) for performance, or apply new advices without restart the server(of course you have to take care of the thread safe property yourself).

So for a complex system(Liferay system) with a lot of aop usage, this can improve performance significantly.

For more detail info, please see LPS-9793 and LPS-9795

Embed StringBundler into App-server

July 6, 2010 By Shuyang Zhou

If you have read my previous blog, you must already know StringBundler.

StringBundler can improve your String processing performance a lot, but it is still not enough.

Now answer this question: In a web application, which part creates most String garbages?

Normally it is not your Java code, but the template engine for generating the page content. Most likely is Jsps. Jsps will be complied into Servlets before providing services. Jsp's performance depends on Jsp Compiler's optimization skills. Well not a single Jsp Compiler knows StringBundler(It is liferay's internal code), so they must either use StringBuffer, StringBuilder or some similar home-made way. So can we improve jsp's String processing performance? It seems almost impossible without modify Jsp Compiler's code. Thanks to the well designed extendable Jsp framework. It is actually doable!

Let's skip all the explanation and impl detail, directly see the optimization result.

By adding the follow line to your code(you can add it to any where, but for better performance, i suggest to add it into startup progress):

JspFactorySwapper.swap();

You can improve String processing performance significantly. How much that would be actually? In our MessageBoard benchmark test case, this line reduces 66% old generation collection.

Now let me explain what kind of things this simple line do to provide you the improvement.

If you are familiar with Jsp standard, you must know PageContext. It is the center of the bottleneck. It has two performance problems.

  1. It has a JspWriter, and Jsp standard does not provide a standard way to configure that writer's buffer, most Jsp Compilers make it a fix size, a normal size is 8K.
  2. It is responsible for creating BodyContent, and most app-server's BodyContentImpl uses a StringBuilder like way to collect page data.


For 1, if you have read my IO performance blog, you should know this buffer is actually hurting your performance by introduce unnecessary data copying. So we should disable the buffer.

For 2, if we can use StringBundler to collect the data, we can avoid a lot of intermediate data buffers.

Jsp standard provides a way to hack into its internal code, without modify any app-server's code. We can install our own JspFactory instance by:

JspFactory.setDefaultFactory(newJspFactory);

And in our JspFactoryImpl, we wrapper out the actual PageContextImpl to disable the writer buffer and wrapper the BodyContentImpl with a StringBundler enabled wrapper.

The whole process is following Jsp standard, no reflection and code modification is required, so theoretically it should work on all app-servers, as long as their also follow the Jsp standard. And in practice, it works on all app-servers except WebLogic. WebLogic's Jsp Compiler is not following the rules programming to interface, the compiled servlet code assumes all code are their internal impl code, it does downcast everywhere. So i am sorry say we have to do this optimization without WebLogic.

For more impl detail, please see LPS-9099

New Ehcache Replication Mechanism

July 5, 2010 By Shuyang Zhou

2010's benchmark and optimization is reaching the end, it is time to write something about what we did in last few months.

Today we will talk about an EE-only feature, a new ehcache cluster replication mechanism. Since this is only available in EE version portal, i won't talk about impl detail too much. Just some general concepts about what is the problem and how we fix it.

Ehcache supports cluster, by default it uses RMI replication mechanism, which is a point to point communication graph. As you can guess this kind of structure can not scale for large cluster with many nodes. Because each node has to send out N-1 same event to other nodes, when N is too big, this will become a network traffic disaster.



To make things even more worse, ehcache creates a replication thread for each cache entity. In a large system like Liferay Portal, it is very easy to have more than 100 cache entities, this means 100+ cache replication threads. Threads are expensive, because they take resource(memory and cpu power). But these threads are most likely sleeping over 99% time, since they only start to work when a cache entity needs to talk to remote peers. Without regard to thread heap memory(Because this is application depended), just consider stack memory footprint of that 100+ threads. By default on most platform, the thread stack size is 2MB, that means 200+MB. If you include the heap memory size, this number may even reach 500MB(This is just for one node!). Even memory chips are cheap today, we still should not waste them! And massive threads can also cause frequent context switch overhead.

We need to fix both the 1 to N-1 network communication and massive threads bottleneck.

Liferay Portal has a facility called ClusterLink which is basicly an abstract communication channel, and the default impl is using JGroups' UDP multicast to communicate.

By using ClusterLink we can fix the 1 to N-1 network communication easily.
To reduce the replication thread number, we provide a small group of dispatching threads. They are dedicated for delivering cache cluster event to remote peers. Since all cache enities' cluster event will go through one place to network, this gives us a chance to do coalesce, if two modification to the same cache object are close enough, we only need to notify remote peers once to save some network traffic.



(Newer version ehcache supports JGroups replicator, it can also fix the 1 to N-1 network communication, but it cannot fix the massive threads problem and cannot do coalesce.)

For EE customer who is in interested in this feature, you can contact our support engineers for more detail info.

Showing 1 - 5 of 11 results.
Items 5
of 3