Showing posts with label GC. Show all posts
Showing posts with label GC. Show all posts

Friday 14 March 2014

Concurrent mode failure: Tuning JVM GC for Solr

The machine
I have an 8 CPU VM server with 32GB RAM running Solr. My JVM is 1.6.0_37 with the following JVM settings:
-Xms28g
-Xmx28g
-XX:NewSize=6g
-XX:MaxNewSize=6g
-XX:SurvivorRatio=4
-XX:PermSize=512m
-XX:MaxPermSize=512m
-XX:SoftRefLRUPolicyMSPerMB=500
-XX:+PrintCommandLineFlags
-XX:+HeapDumpOnOutOfMemoryError
-XX:+DumpGCHistoryOnOutOfMemory
-XX:+DumpDetailedClassStatisticOnOutOfMemory
-XX:HeapDumpPath=/opt/alfresco/tomcat/dumps
-verbose:gc
-Xloggc:/opt/alfresco/tomcat/dumps/gc-logs/gc-2014-03-13-10-00-07.log
-XX:+GCHistory
-XX:+CMSClassUnloadingEnabled
-XX:+DisableExplicitGC
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails
-XX:+PrintTenuringDistribution
-XX:+UseCompressedOops
-XX:+UseConcMarkSweepGC
-XX:+UseParNewGC

The reason for such huge heap is that the Solr data are about 130 GB and Sorl is heavily utilized from around 100 concurrent threads performing text search on documents.
I notice that Sorl application pauses for some time without responding. I discovered on the GC logs the following problem:
2014-03-17T14:29:23.438+0100: 7991.661: [GC2014-03-17T14:29:23.438+0100: 7991.661: [ParNew (promotion failed)
Desired survivor size 268435456 bytes, new threshold 15 (max 15)
- age   1:  125233576 bytes,  125233576 total
: 2621440K->2228808K(2621440K), 5.9399450 secs]2014-03-17T14:29:29.378+0100: 7997.601: [CMS2014-03-17T14:29:32.715+0100: 8000.938: [CMS-concurrent-sweep: 21.573/33.920 secs] [Times: user=118.80 sys=3.29, real=33.91 secs]
 (concurrent mode failure): 20465547K->11174034K(26214400K), 33.5774490 secs] 22674213K->11174034K(28835840K), [CMS Perm : 47873K->47648K(524288K)], 39.5176760 secs] [Times: user=46.19 sys=2.36, real=39.51 secs]

This issue is summarized in the official ORACLE documentation for JVM v6 as follows:

..a concurrent collection needs to be started at a time such that the collection can finish before the tenured generation becomes full; otherwise the application would observe longer pauses due to concurrent mode failure. There are several ways a concurrent collection can be started. 

See:  Concurrent Mode Failure

The message "concurrent mode failure" signifies that the concurrent collection of the tenured generation did not finish before the tenured generation became full. In other words, the new generation is filling up too fast, it is overflowing to tenured generation but the CMS could not clear out the tenured generation in the background. When a concurrent mode failure happens, the low pause collector does a stop-the-world (STW) collection. All the application threads are stopped, a different algorithm is used to collect the tenured generation (our particular flavor of a mark-sweep-compact), the applications threads are started again, and life goes on....

Seems that a concurrent mode failure is responsible for a "Stop the World" JVM pausing.
See also another wonderful Blog about the same issue here :

In order to treat problem we tune the following JVM flags:
-XX:CMSInitiatingOccupancyFraction=10
Indicating that a concurrent collection will start if the occupancy of the tenured generation exceeds 10% instead of 92% that is the default threshold.
-XX:CMSIncrementalSafetyFactor=100
Indicating to the JVM GC to start a concurrent collection at the next opportunity without any delay.
See also Oracles GC tunning instructions here

Tuesday 28 January 2014

Overriding finalize() for reference count on active threads

The need for this small article came from the implementation of a custom thread controller. In my design I am using a ThreadGroup that holds threads that dynamically load classes and invoke methods of objects created on the fly.

The problem here comes with the ThreadGroup. I want to have the number of all threads in the group and we mean absolute number not something like myThreadGroup.activeCount() that shows only the active threads of the group. Normally ThreadGroup.activeCount() returns an estimate of the number of active threads in this thread group, so we cannot rely on the this.
One solution is to hold a container of references to all the new threads, but then I have to implement custom synchronized code for accessing the container and blah blah...

The solution I finally followed was to implement an object reference count on the parent class CustomPlugin. When a new object of any descendant class rooted from CustomPlugin, reference count is augmented. When a plugin exits execute() method, the curring thread exits run().
Plugin object reference terminated and destroyed by the JVM GC and the number of concurent plugins is decreased to something less than MAX_TOTAL_PLUGINS. Only then a new plugin can be created.

public abstract class CustomPlugin extends Thread implements Plugin{
 
 public static final int MAX_TOTAL_PLUGINS = 3;
 
 private static int refCount = 0;

 public CustomPlugin(String name) throws Exception{
    if(!allowLoad()){
       throw new Exception("Maximum plugin objects already loaded!");
    }
    Thread.currentThread().setName(name + "-" + Thread.currentThread().getId() );
    refCount++;
 }
 

 @Override
 protected void finalize(){
    try {
      super.finalize();
    } catch (Throwable e) {
      e.printStackTrace();
    }finally{
      refCount--; 
    }
 }

 public static boolean allowLoad() {
    return (refCount>=MAX_TOTAL_PLUGINS?false:true);
 }

 public abstract execute(String [] args);
}
Remember that you have only CustomPlugin.MAX_TOTAL_PLUGINS=3 available threads that you can use for your plugins, meaning that only MAX_TOTAL_PLUGINS can be used in total, regardless if their threads are sleeping! When a plugin exits execute() method, the curring thread exits run(). Plugin object reference terminated and destroyed by the JVM GC and the number of concurrent plugins is decreased to something less than MAX_TOTAL_PLUGINS. Only then a new plugin can be called.