今天灵聪这边在把数据灌入到Titan发现,Titan-Server速度非常慢,而且灌到1w多条的时候系统基本不响应了。整个过程Titan server的进程使用CPU非常高,一直徘徊在2000%左右。

这其实是一个很常见的问题。思路很简单,找出哪个java线程占用了大量的CPU,查清楚耗时在哪里,依次解决就可以了。

首先找到titan server的进程号,整个用top或者ps或者jps就可以简单快速的找到。

然后查看其子线程的CPU信息:

$ top -H -p  32763
top - 15:44:32 up 91 days, 20:14,  7 users,  load average: 10.81, 10.77, 7.66
Tasks:  87 total,   1 running,  86 sleeping,   0 stopped,   0 zombie
Cpu(s): 36.3%us,  1.2%sy,  0.1%ni, 62.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  132042748k total, 129237780k used,  2804968k free,   202112k buffers
Swap:        0k total,        0k used,        0k free, 106302180k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
32824 work      20   0 7818m 1.2g  24m S 43.9  0.9   7:41.57 java
32835 work      20   0 7818m 1.2g  24m S 43.9  0.9   7:40.17 java
32837 work      20   0 7818m 1.2g  24m S 43.9  0.9   7:40.15 java
32828 work      20   0 7818m 1.2g  24m S 43.6  0.9   7:40.38 java
32834 work      20   0 7818m 1.2g  24m S 43.6  0.9   7:40.11 java
32836 work      20   0 7818m 1.2g  24m S 43.6  0.9   7:40.25 java
32838 work      20   0 7818m 1.2g  24m S 43.6  0.9   7:39.94 java
32839 work      20   0 7818m 1.2g  24m S 43.6  0.9   7:40.02 java
32841 work      20   0 7818m 1.2g  24m S 43.6  0.9   7:40.28 java
32849 work      20   0 7818m 1.2g  24m S 43.6  0.9   7:40.14 java
32825 work      20   0 7818m 1.2g  24m S 43.2  0.9   7:40.16 java
32826 work      20   0 7818m 1.2g  24m S 43.2  0.9   7:40.34 java
32827 work      20   0 7818m 1.2g  24m S 43.2  0.9   7:40.05 java
32829 work      20   0 7818m 1.2g  24m S 43.2  0.9   7:40.07 java
32830 work      20   0 7818m 1.2g  24m S 43.2  0.9   7:40.42 java
32831 work      20   0 7818m 1.2g  24m S 43.2  0.9   7:40.11 java
32832 work      20   0 7818m 1.2g  24m S 43.2  0.9   7:39.83 java
32833 work      20   0 7818m 1.2g  24m S 43.2  0.9   7:40.30 java
32840 work      20   0 7818m 1.2g  24m S 43.2  0.9   7:40.15 java
32842 work      20   0 7818m 1.2g  24m S 43.2  0.9   7:40.16 java
32843 work      20   0 7818m 1.2g  24m S 43.2  0.9   7:40.37 java
32844 work      20   0 7818m 1.2g  24m S 43.2  0.9   7:40.00 java
32845 work      20   0 7818m 1.2g  24m S 43.2  0.9   7:40.10 java
32846 work      20   0 7818m 1.2g  24m S 43.2  0.9   7:40.02 java
32847 work      20   0 7818m 1.2g  24m S 43.2  0.9   7:40.11 java
32848 work      20   0 7818m 1.2g  24m S 43.2  0.9   7:40.01 java
32850 work      20   0 7818m 1.2g  24m S 43.2  0.9   7:40.28 java
32851 work      20   0 7818m 1.2g  24m S 43.2  0.9   7:40.08 java
32855 work      20   0 7818m 1.2g  24m S 38.9  0.9   2:49.93 java
12540 work      20   0 7818m 1.2g  24m S  4.0  0.9   0:29.16 java
12539 work      20   0 7818m 1.2g  24m S  3.7  0.9   1:02.45 java

发现基本上每个线程都挺平均的(都挺耗CPU的),随便选一个,转成16进制,用jstack查看一下,一下子就发现问题了:

$ jstack 32763 | grep 8038 -A 30
"GC task thread#0 (ParallelGC)" os_prio=0 tid=0x00007fe4ec031000 nid=0x8038 runnable

"GC task thread#1 (ParallelGC)" os_prio=0 tid=0x00007fe4ec032800 nid=0x8039 runnable

"GC task thread#2 (ParallelGC)" os_prio=0 tid=0x00007fe4ec034800 nid=0x803a runnable

"GC task thread#3 (ParallelGC)" os_prio=0 tid=0x00007fe4ec036000 nid=0x803b runnable

"GC task thread#4 (ParallelGC)" os_prio=0 tid=0x00007fe4ec038000 nid=0x803c runnable

"GC task thread#5 (ParallelGC)" os_prio=0 tid=0x00007fe4ec039800 nid=0x803d runnable

"GC task thread#6 (ParallelGC)" os_prio=0 tid=0x00007fe4ec03b800 nid=0x803e runnable

"GC task thread#7 (ParallelGC)" os_prio=0 tid=0x00007fe4ec03d000 nid=0x803f runnable

"GC task thread#8 (ParallelGC)" os_prio=0 tid=0x00007fe4ec03f000 nid=0x8040 runnable

"GC task thread#9 (ParallelGC)" os_prio=0 tid=0x00007fe4ec040800 nid=0x8041 runnable

"GC task thread#10 (ParallelGC)" os_prio=0 tid=0x00007fe4ec042800 nid=0x8042 runnable

"GC task thread#11 (ParallelGC)" os_prio=0 tid=0x00007fe4ec044000 nid=0x8043 runnable

"GC task thread#12 (ParallelGC)" os_prio=0 tid=0x00007fe4ec046000 nid=0x8044 runnable

"GC task thread#13 (ParallelGC)" os_prio=0 tid=0x00007fe4ec047800 nid=0x8045 runnable

"GC task thread#14 (ParallelGC)" os_prio=0 tid=0x00007fe4ec049800 nid=0x8046 runnable

"GC task thread#15 (ParallelGC)" os_prio=0 tid=0x00007fe4ec04b000 nid=0x8047 runnable

是GC线程,而且有15个之多。难道是GC的问题?用jstat看一下就知道了:

$ jstat -gcutil 32763 1000
S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
0.00   0.00 100.00  99.95  78.14  40.96    323    2.155  3543 1009.229 1011.384
0.00   0.00 100.00  99.95  78.14  40.97    323    2.155  3548 1010.225 1012.380
0.00   0.00 100.00  99.95  78.14  40.97    323    2.155  3553 1011.159 1013.314
0.00   0.00 100.00  99.95  78.14  40.97    323    2.155  3558 1012.189 1014.344
0.00   0.00 100.00  99.95  78.15  40.97    323    2.155  3563 1013.142 1015.297
0.00   0.00 100.00  99.95  78.15  40.97    323    2.155  3568 1014.119 1016.274
0.00   0.00 100.00  99.95  78.15  40.97    323    2.155  3573 1015.060 1017.215
0.00   0.00 100.00  99.95  78.15  40.97    323    2.155  3578 1016.030 1018.185
0.00   0.00 100.00  99.95  78.16  40.97    323    2.155  3583 1016.967 1019.122
0.00   0.00 100.00  99.95  78.16  40.97    323    2.155  3588 1017.934 1020.089
0.00   0.00 100.00  99.95  78.16  40.97    323    2.155  3593 1018.921 1021.076
0.00   0.00 100.00  99.95  78.16  40.97    323    2.155  3598 1019.904 1022.059
0.00   0.00 100.00  99.95  78.16  40.98    323    2.155  3602 1020.699 1022.855
0.00   0.00  99.16  99.95  78.16  40.98    323    2.155  3608 1021.837 1023.992
0.00   0.00 100.00  99.95  78.17  40.98    323    2.155  3612 1022.579 1024.734
0.00   0.00 100.00  99.95  78.17  40.98    323    2.155  3617 1023.650 1025.805
0.00   0.00 100.00  99.95  78.07  40.98    323    2.155  3620 1024.456 1026.611
0.00   0.00  99.19  99.95  78.08  40.98    323    2.155  3624 1025.724 1027.879
0.00   0.00 100.00  99.95  78.08  40.98    323    2.155  3628 1026.462 1028.617
0.00   0.00 100.00  99.95  78.08  40.98    323    2.155  3632 1027.567 1029.722
0.00   0.00 100.00  99.95  78.08  40.98    323    2.155  3636 1028.582 1030.737
0.00   0.00  99.20  99.95  78.08  40.99    323    2.155  3641 1029.660 1031.815
0.00   0.00 100.00  99.95  78.04  40.90    323    2.155  3646 1030.532 1032.687
0.00   0.00 100.00  99.95  78.04  40.90    323    2.155  3651 1031.482 1033.637

果然GC非常频繁,基本是Full GC(O区基本都是满的),而且耗时(超过90%的时间都在FGC。。)

那么我们配置的堆大小是怎样的呢?ps一看,才发现默认的配置非常小,/bin/gremlin-server.sh:

JAVA_OPTIONS="-Xms32m -Xmx512m -javaagent:$LIB/jamm-0.3.0.jar"

将其改大,重启观察,发现CPU一直稳定在90%左右了,吞吐率也上了,单机QPS基本稳定在50左右(但是还是很低,难道Titan的性能这么差的?)。

再看一下GC情况,基本只有偶尔的Young GC了:

$ jstat -gcutil  25719 1000
S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
52.59   0.00  47.97   3.13  82.61  47.07     14    0.593     4    1.051    1.644
52.59   0.00  53.20   3.13  82.61  47.07     14    0.593     4    1.051    1.644
52.59   0.00  59.03   3.13  82.61  47.07     14    0.593     4    1.051    1.644
52.59   0.00  64.10   3.13  82.61  47.07     14    0.593     4    1.051    1.644
52.59   0.00  69.62   3.13  82.61  47.07     14    0.593     4    1.051    1.644
52.59   0.00  75.08   3.13  82.61  47.07     14    0.593     4    1.051    1.644
52.59   0.00  80.81   3.13  82.61  47.07     14    0.593     4    1.051    1.644
52.59   0.00  85.70   3.13  82.61  47.07     14    0.593     4    1.051    1.644
52.59   0.00  91.20   3.13  82.61  47.07     14    0.593     4    1.051    1.644
52.59   0.00  96.37   3.13  82.61  47.07     14    0.593     4    1.051    1.644
 0.00  99.80   3.28   3.13  82.09  46.10     15    0.636     4    1.051    1.687
 0.00  99.80   7.88   3.13  82.09  46.10     15    0.636     4    1.051    1.687
 0.00  99.80  12.83   3.13  82.09  46.10     15    0.636     4    1.051    1.687
 0.00  99.80  18.67   3.13  82.09  46.10     15    0.636     4    1.051    1.687
 0.00  99.80  24.01   3.13  82.09  46.10     15    0.636     4    1.051    1.687
 0.00  99.80  29.67   3.13  82.09  46.10     15    0.636     4    1.051    1.687
 0.00  99.80  34.73   3.13  82.09  46.10     15    0.636     4    1.051    1.687
 0.00  99.80  40.22   3.13  82.09  46.10     15    0.636     4    1.051    1.687
 0.00  99.80  45.24   3.13  82.09  46.10     15    0.636     4    1.051    1.687
 0.00  99.80  50.80   3.13  82.09  46.10     15    0.636     4    1.051    1.687
 0.00  99.80  56.30   3.13  82.09  46.10     15    0.636     4    1.051    1.687
 0.00  99.80  61.53   3.13  82.09  46.10     15    0.636     4    1.051    1.687
 0.00  99.80  66.77   3.13  82.09  46.10     15    0.636     4    1.051    1.687

补充信息

1、进制转换

linux命令行可以很方便的进行十六进制和十进制之间的转换:

$ printf "%x\n" 15
f
$ printf "%d\n" 0xf
15

2、Linux下查看某个进程的线程信息(CPU,内存等)

法一:top命令

top命令可以实时显示各个线程情况。可以在启动top命令时指定“-H”选项,该选项会列出所有Linux线程。在top运行时,你也可以通过按“H”键将线程查看模式切换为开或关(Show Theads On/Off)。不过top并不会显示线程之间的关联关系(比如同属于一个进程,虽然按”c”键可以显示详细的进程目录但是还是不大清楚),如果要只显示某个进程下的所有线程信息,那么可以在启动top的时候指定进程号:

top -p <pi> -H

法二:ps命令

在ps命令中,“-T”选项可以开启线程查看

$ ps -T -p

不过这个只能看到线程ID,看不到CPU和运行时间等信息,所以下面这个命令更实用一些:

$ ps -m -p -o THREAD,tid,time

-m 表示 Show threads after processes, -o 表示自定义的输出格式。

法三:htop命令

其实就是加强版的top,显示上做了优化,同时可以支持进程树展示,而且可以直接发送信号给进程。需要另外安装(可以用jumbo安装)。

进入htop后,按”t”表示tree view展示,非常方便直观。