We have had a deployment of OpenMRS running on an Ubuntu server for about 5 months. It has been running well.
We are using tomcat7.
This morning the CPU was maxed out and the web server was very sluggish. I logged in, checked top, and saw that Java was using 90-99% of the CPU on the server.
I restarted tomcat7, but this didn't solve the problem.
Rebooting the server did solve it.
Any thoughts on what could have caused this?
Here is the output of the top command.
Sifting through the logs from this morning, I found nothing of note before I tried to restart Tomcat, but I found these entries from after I restarted tomcat7:
SEVERE: IOException while loading persisted sessions: java.io.StreamCorruptedException: unexpected EOF in middle of data block
java.io.StreamCorruptedException: unexpected EOF in middle of data block
at java.io.ObjectInputStream$BlockDataInputStream.refill(ObjectInputStream.java:2539)
at java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2615)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at java.io.ObjectInputStream$BlockDataInputStream.readInt(ObjectInputStream.java:2820)
at java.io.ObjectInputStream.readInt(ObjectInputStream.java:971)
at java.util.HashMap.readObject(HashMap.java:1154)
at sun.reflect.GeneratedMethodAccessor504.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at org.apache.catalina.session.StandardSession.readObject(StandardSession.java:1595)
at org.apache.catalina.session.StandardSession.readObjectData(StandardSession.java:1060)
at org.apache.catalina.session.StandardManager.doLoad(StandardManager.java:282)
at org.apache.catalina.session.StandardManager.load(StandardManager.java:202)
at org.apache.catalina.session.StandardManager.startInternal(StandardManager.java:489)
at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5476)
at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:901)
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:877)
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:632)
at org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:1073)
at org.apache.catalina.startup.HostConfig$DeployWar.run(HostConfig.java:1857)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Exception in thread "Thread-212" org.openmrs.api.APIException: contextDAO is null
at org.openmrs.api.context.Context.getContextDAO(Context.java:173)
at org.openmrs.api.context.Context.closeSession(Context.java:737)
at org.openmrs.api.context.Daemon$4.run(Daemon.java:215)
Apr 15, 2015 9:33:32 AM org.apache.catalina.loader.WebappClassLoader clearReferencesThreads
SEVERE: The web application [/openmrs] appears to have started a thread named [Thread-2] but has failed to stop it. This is very likely to create a memory leak.
Apr 15, 2015 9:33:32 AM org.apache.catalina.loader.WebappClassLoader checkThreadLocalMapForLeaks
SEVERE: The web application [/openmrs] created a ThreadLocal with key of type [org.springframework.core.NamedThreadLocal] (value [Transactional resources]) and a value of type [java.util.HashMap] (value [{org.hibernate.impl.SessionFactoryImpl@271e65fe=org.springframework.orm.hibernate3.SessionHolder@1594c6f6}]) but failed to remove it when the web application was stopped. Threads are going to be renewed over time to try and avoid a probable memory leak.
By analysing the logs, I suspect it could be a threading issue with HashMap usage.
Could you please mention the OpenMRS version and the modules you are using in your setup? That would help in tracking down the issue. Also, if possible, please submit the full Tomcat log.
I know it doesn't help you right now, but a few tricks have worked for me before to diagnose this kind of behaviour:
Take thread dumps of the Java process. If on Linux or Mac, send a kill -3 <pid>, say every 30 seconds. The thread dumps will appear in the Tomcat stdout log (catalina.out). A tool like https://java.net/projects/tda can help you investigate what the threads were actually doing or waiting for, and spot deadlocks. It's non-blocking, so it's quite alright to run it at any time, even while there's still a chance the server recovers.
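A minimal sketch of that loop, assuming Tomcat was started from the Ubuntu tomcat7 package and writes stdout to /var/log/tomcat7/catalina.out (adjust the PID lookup and path for your setup):

    PID=$(pgrep -f org.apache.catalina.startup.Bootstrap)
    for i in $(seq 1 10); do
        kill -3 "$PID"     # non-blocking: asks the JVM to print a thread dump
        sleep 30
    done

The dumps end up in catalina.out; load that file into TDA to compare them.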
It's also worth configuring the JVM to print information about its garbage collection. A useful set of options:
-XX:+HeapDumpOnOutOfMemoryError -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintConcurrentLocks -verbose:gc. Every invocation of the garbage collector will print a line in the logs, and you can see whether it's doing a lot of full GCs without freeing any memory (I tend to call that 'GC death' or 'GC catalepsy').
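For reference, here is roughly how those options could be wired into an Ubuntu tomcat7 install; the file path is an assumption based on the stock package, and the -Xloggc and -XX:HeapDumpPath flags are my own additions so the output lands somewhere predictable:

    # /etc/default/tomcat7  (path assumes the stock Ubuntu tomcat7 package)
    JAVA_OPTS="${JAVA_OPTS} -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/tomcat7 \
        -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
        -Xloggc:/var/log/tomcat7/gc.log -XX:+PrintConcurrentLocks"
    # then restart Tomcat:  sudo service tomcat7 restart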
Hi Adam. I have seen similar behavior when reports are running. You might check whether any reports are running, and/or check the Tomcat log (catalina.out), should this happen again.
@cintiadr @arbaughj You can take a thread dump using jstack -l JAVA_PID > jstack.out. Also, if you are running JDK 1.7.0_50 or above, JFR dumps are very useful when analyzing these situations. If you can provide the thread dump or JFR recordings, it will be really helpful for analyzing the issue.
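For example (the PID lookup and file names are placeholders; the JFR part only works on Oracle JDK 7u40+ and only if the JVM was started with the commercial-features flags):

    PID=$(pgrep -f org.apache.catalina.startup.Bootstrap)   # or take the PID from top
    jstack -l "$PID" > jstack.out

    # JFR sketch: requires the JVM to have been started with
    #   -XX:+UnlockCommercialFeatures -XX:+FlightRecorder
    jcmd "$PID" JFR.start duration=120s filename=/tmp/openmrs.jfr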
Hi Adam,
It's a technical term and it just came to mind while looking at the log, so don't worry much about it. (Basically, HashMaps are used in the code.)
As suggested by everyone, what you could do now is follow these steps (before restarting the server or Tomcat) whenever this occurs again, so we can dig into the source of the problem.
In the stack dumps, look for the thread with the matching hex PID (see the sketch below).
After that you can report back with your findings or file a ticket on the OpenMRS JIRA. Otherwise it's very difficult to say exactly what is causing this problem. Besides an OpenMRS issue, it could also be an Ubuntu server configuration error, an issue with other installed applications, a Java/JRE problem, etc.
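A rough sketch of what matching the hex PID means in practice (the thread id 1337 and the file name jstack.out are just examples):

    top -H -p "$PID"                    # per-thread view; note the TID of the hottest thread
    printf '0x%x\n' 1337                # convert that TID to hex -> 0x539
    grep -A 20 'nid=0x539' jstack.out   # find the matching nid= entry in the stack dump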
Good job! But I think you have missed one step (the 4th step: getting a stack dump of the Java process).
What was the output of jstack PID — for example, in this case, jstack 1326?
UPDATE: Adam, it would be better to get multiple dumps at regular intervals. I have updated the instructions here: https://wiki.openmrs.org/display/docs/Troubleshooting+High+CPU+Utilization#Getting stack dump of Java process.
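Something like this gives a handful of dumps to compare (the count and interval are arbitrary):

    for i in 1 2 3 4 5; do
        jstack -l "$PID" > "jstack-$(date +%H%M%S).out"
        sleep 30
    done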
Looks like the heap is full, probably because of thrashing, or maybe the garbage collector cannot free objects fast enough and, instead of throwing an OutOfMemoryError, keeps trying to stay alive. If it's a production environment, 150 MB is very small; give it some GB. The more the merrier.
Heap
def new generation total 157248K, used 152685K [0x00000000d0000000, 0x00000000daaa0000, 0x00000000daaa0000)
eden space 139776K, 100% used [0x00000000d0000000, 0x00000000d8880000, 0x00000000d8880000)
Not anymore. Newer JVMs default to -server when they detect a 64-bit OS or more than a GB of RAM. I agree that what @lluismf suggested above is necessary for OpenMRS: a larger heap and more PermGen space. In addition, try clearing the logs. The StreamCorruptedException might be due to bad client requests, which could be another cause of the 100% CPU usage. Please come back and report what resolved the problem.
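For what it's worth, on the Ubuntu tomcat7 package the memory settings usually go in /etc/default/tomcat7; the numbers below are only a starting point and not an official OpenMRS recommendation, so size them to your server's RAM:

    # /etc/default/tomcat7  (example values, not a recommendation)
    JAVA_OPTS="${JAVA_OPTS} -Xms512m -Xmx2048m -XX:MaxPermSize=512m"
    # then: sudo service tomcat7 restart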