Detection of a Java memory problem causing overuse of the garbage collector:

[orhus@lxinalta ~]$ top
top - 15:09:00 up 134 days, 14:14,  1 user,  load average: 4,20, 3,91, 2,94
Tasks: 223 total,   1 running, 222 sleeping,   0 stopped,   0 zombie
%Cpu(s): 98,5 us,  1,5 sy,  0,0 ni,  0,0 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
MiB Mem :  31804,2 total,    334,3 free,  29177,8 used,   2292,1 buff/cache
MiB Swap:   4096,0 total,   2266,8 free,   1829,2 used.   1980,2 avail Mem
 Unknown command - try 'h' for help
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1816128 orhus     20   0 9601648   7,4g  42824 S 387,9  23,8 170:32.46 java
1885835 orhus     20   0   54576   4512   3588 R   3,0   0,0   0:00.05 top
      1 root      20   0  247756  10436   6372 S   0,0   0,0  52:29.29 systemd
      2 root      20   0       0      0      0 S   0,0   0,0   0:08.29 kthreadd
      3 root       0 -20       0      0      0 I   0,0   0,0   0:00.00 rcu_gp
      4 root       0 -20       0      0      0 I   0,0   0,0   0:00.00 rcu_par_gp
      5 root       0 -20       0      0      0 I   0,0   0,0   0:00.00 slub_flushwq
      6 root       0 -20       0      0      0 I   0,0   0,0   0:00.00 netns
      8 root       0 -20       0      0      0 I   0,0   0,0   0:00.00 kworker/0:0H-events_highpri
     10 root       0 -20       0      0      0 I   0,0   0,0   1:26.86 kworker/0:1H-kblockd
     11 root       0 -20       0      0      0 I   0,0   0,0   0:00.00 mm_percpu_wq
     12 root      20   0       0      0      0 S   0,0   0,0   0:00.00 rcu_tasks_rude_
     13 root      20   0       0      0      0 S   0,0   0,0   0:00.00 rcu_tasks_trace
     14 root      20   0       0      0      0 S   0,0   0,0   2:02.70 ksoftirqd/0
     15 root      20   0       0      0      0 I   0,0   0,0  21:31.90 rcu_sched
     16 root      rt   0       0      0      0 S   0,0   0,0   0:27.14 migration/0
     18 root      20   0       0      0      0 S   0,0   0,0   0:00.00 cpuhp/0
...

We are on a server with 4 cpu cores. We can see that 98,5% of these 4 cors is used, it means 387,9% of one cpu.

The culprit is the Java process with ID 1816128.

[orhus@lxinalta ~]$ top -H -p 1816128
top - 15:09:49 up 134 days, 14:15,  1 user,  load average: 4,08, 3,91, 2,99
Threads: 245 total,   4 running, 241 sleeping,   0 stopped,   0 zombie
%Cpu(s): 93,8 us,  1,2 sy,  0,0 ni,  4,6 id,  0,0 wa,  0,0 hi,  0,0 si,  0,5 st
MiB Mem :  31804,2 total,    343,6 free,  29167,4 used,   2293,2 buff/cache
MiB Swap:   4096,0 total,   2266,8 free,   1829,2 used.   1990,3 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1816494 orhus     20   0 9603668   7,4g  42824 R  93,0  23,8  21:10.20 GC Thread#2
1816495 orhus     20   0 9603668   7,4g  42824 R  93,0  23,8  21:07.37 GC Thread#3
1816134 orhus     20   0 9603668   7,4g  42824 R  92,0  23,8  21:08.55 GC Thread#0
1816492 orhus     20   0 9603668   7,4g  42824 R  92,0  23,8  21:09.43 GC Thread#1
1816154 orhus     20   0 9603668   7,4g  42824 S   5,0  23,8   0:39.12 VM Thread
1816128 orhus     20   0 9603668   7,4g  42824 S   0,0  23,8   0:00.00 java
1816131 orhus     20   0 9603668   7,4g  42824 S   0,0  23,8   0:12.48 java
1816136 orhus     20   0 9603668   7,4g  42824 S   0,0  23,8   0:00.69 G1 Main Marker
1816138 orhus     20   0 9603668   7,4g  42824 S   0,0  23,8  21:15.46 G1 Conc#0
1816146 orhus     20   0 9603668   7,4g  42824 S   0,0  23,8   0:56.42 G1 Refine#0
1816147 orhus     20   0 9603668   7,4g  42824 S   0,0  23,8   0:28.82 G1 Young RemSet

The top command with the -H option displays the threads associated with the Java process. We can see that almost all of the CPU is being used by four garbage collection threads, which means that our Java application is currently looping in an attempt to free up memory, but is not succeeding sufficiently, so it keeps trying again and again…

The solution is to increase the Xmx parameter of the java application.