MongoDB Killed by the Linux OOM Killer

Today I happened to notice that one mongod instance in our production MongoDB cluster could no longer be reached. On the server, the mongodb log showed nothing unusual, so I took the lazy route and just started it back up.
About an hour after the restart it vanished again, and the log still contained no exit message whatsoever.
Now this was a real problem: a process disappearing like that looks exactly like a kill -9, but the shell history showed nobody had run one.
So I ran dmesg | grep mongo, and sure enough, there was the problem:

mongod invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
mongod cpuset=/ mems_allowed=0
Pid: 10123, comm: mongod Not tainted 2.6.32-431.el6.x86_64 #1
[ 3235] 0 3235 2568388 270165 6 0 0 mongod
[10113] 0 10113 1094500 749847 0 0 0 mongod
Out of memory: Kill process 3235 (mongod) score 252 or sacrifice child
Killed process 3235, UID 0, (mongod) total-vm:10273552kB, anon-rss:1080440kB, file-rss:220kB
[10113] 0 10113 1179731 454666 0 0 0 mongod
[32534] 0 32534 1549332 1316079 3 0 0 mongod
[32663] 0 32663 232939 37877 2 0 0 mongodump
Out of memory: Kill process 32534 (mongod) score 184 or sacrifice child
Killed process 32534, UID 0, (mongod) total-vm:6197328kB, anon-rss:5263596kB, file-rss:724kB

Comparing this with the mongodb log, the killed PID matched the PID of the most recently started process, so it is confirmed: the process was killed by the Linux OOM (Out of Memory) killer.
The OOM killer is normally triggered when applications request so much memory that the system runs short; to keep the system as a whole stable, the kernel kills one of the processes.

The Linux kernel allocates memory on request, but applications usually do not actually use everything they allocate. To improve performance, the kernel would like to put that unused memory to other work; because it belongs to individual processes and is awkward to reclaim directly, the kernel instead over-commits memory (over-commit memory), indirectly exploiting this "idle" memory to raise overall utilization. Normally this works fine, but trouble starts when most applications really do consume the memory they asked for: their combined demand exceeds physical memory plus swap, and the kernel (the OOM killer) has to kill some processes to free enough space to keep the system running.
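
For reference, the current overcommit policy and the kernel's view of committed memory can be inspected with a few reads from /proc (just a quick sketch; the comments describe the general meaning, not values taken from this server):

cat /proc/sys/vm/overcommit_memory    # 0 = heuristic (default), 1 = always overcommit, 2 = strict
cat /proc/sys/vm/overcommit_ratio     # only consulted when overcommit_memory is 2
grep -E 'CommitLimit|Committed_AS' /proc/meminfo    # memory promised to processes vs. the commit limit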

Scrolling further up in dmesg, I found this:

Node 0 DMA: 1*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15612kB
Node 0 DMA32: 505*4kB 347*8kB 230*16kB 175*32kB 122*64kB 86*128kB 48*256kB 16*512kB 8*1024kB 2*2048kB 0*4096kB = 65660kB
Node 0 Normal: 1227*4kB 1125*8kB 827*16kB 353*32kB 145*64kB 43*128kB 4*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 54244kB
7665 total pagecache pages
5466 pages in swap cache
Swap cache stats: add 16986448, delete 16980982, find 7091644/8157320
Free swap = 0kB
Total swap = 8241144kB
4194288 pages RAM
110410 pages reserved
1934 pages shared
4043507 pages non-shared

So the mongod process disappeared because swap was exhausted. The swap exhaustion in turn was most likely caused by a necessary piece of business logic: when the new cluster service went live, it ran concentrated full-database queries, which used up the remaining swap.

A quick check with

cat /proc/meminfo

and

free -m

showed that the hardware on this box really is modest, and there was not much memory or swap left.
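
To see which processes were actually holding that swap, a per-process view helps; a rough one-liner, assuming the kernel exposes VmSwap in /proc/<pid>/status:

grep VmSwap /proc/[0-9]*/status 2>/dev/null | sort -t: -k3 -nr | head    # largest swap users first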

Sorting the processes by memory usage:

ps -e -o 'pid,comm,rsz,vsz,args' | sort -nrk3

The top three memory consumers were all first-class citizens with PIDs below 5000 that I did not dare to touch, so naturally the memory-hungry mongod with its five-digit PID was the one chosen to be killed. I would have picked it too.

When the OOM killer is triggered, the process to kill is chosen by a kernel algorithm that scores every process; a process's score can be read from

/proc/$pid/oom_score

and the OOM killer's behaviour can be tuned by setting oom_adj, for example:

echo -17 > /proc/$pid/oom_adj

oom_adj can be set from -16 up to 15 (15 is the maximum, -16 the minimum), and -17 disables the OOM killer for that process entirely; the lower the value, the less likely the process is to be killed.
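
As a small illustration (a sketch, not something taken from this incident), the score and adjustment of every running mongod could be dumped like this; note that newer kernels prefer oom_score_adj, which ranges from -1000 to 1000:

for pid in $(pidof mongod); do
    echo "pid=$pid score=$(cat /proc/$pid/oom_score) adj=$(cat /proc/$pid/oom_adj)"
done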

So the cause was found, but the fix is an awkward one. The OOM killer did kill the mongod process, but it did so to keep the whole system stable; even if mongod's score is adjusted, in the end some process will still be chosen and killed. And disabling the OOM killer means that when resources run out the system may simply stop responding and need a reboot, which only hides the problem temporarily: the problem is still there, which goes against the fail-fast idea.

There is a good piece of OOM killer practice on serverfault that is worth a look:

OOM killer is not a way anyone manages memory; it is the Linux kernel's way to handle a fatal failure as a last hope to avoid system lockup!

What you should do is:

make sure you have enough swap. If you are sure, still add more.
implement resource limits! At LEAST for applications you expect will use memory (and even more so if you don't expect them to - those ones usually end up being problematic). See the ulimit -v (or limit addressspace) command in your shell and put it before the application startup in its init script. You should also limit other stuff (like the number of processes, -u, etc)… That way, the application will get an ENOMEM error when there is not enough memory, instead of the kernel giving it non-existent memory and afterwards going berserk killing everything around!
tell the kernel not to overcommit memory. You could do:

echo "0" > /proc/sys/vm/overcommit_memory
or even better (depending on your amount of swap space)

echo "2" > /proc/sys/vm/overcommit_memory; echo "80" > /proc/sys/vm/overcommit_ratio
See Turning off overcommit for more info on that.

That would instruct the kernel to be more careful when giving applications memory it doesn't really have (the similarity to the world's global economic crisis is striking)
as a last-ditch resort, if everything on your system except MongoDB is expendable (but please fix the two points above first!) you can lower the chances of it being killed (or even make sure it won't be killed - even if the alternative is a hung machine with nothing working) by tuning /proc/$pid/oom_score_adj and/or /proc/$pid/oom_score.

echo "-1000" > /proc/$(pidof mongod)/oom_score_adj
See Taming the OOM killer for more info on that subject.
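
For my own notes, applying the first two suggestions on this CentOS 6 box could look roughly like the sketch below. The limit values are placeholders rather than figures sized for this cluster, and mongod memory-maps its data files, so any address-space cap would have to be generous:

# persist a stricter overcommit policy (illustrative values, mirroring the answer above)
echo "vm.overcommit_memory = 2" >> /etc/sysctl.conf
echo "vm.overcommit_ratio = 80" >> /etc/sysctl.conf
sysctl -p

# in the mongod init script, before mongod is started (placeholder limits)
ulimit -v 8388608    # cap the virtual address space, in kB (~8 GB here, purely illustrative)
ulimit -u 4096       # cap the number of user processes/threads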