生产环境一次Redis导致OOM Killer的问题

Redis 引发系统OOM Killer

昨晚（2016-9-5），生产环境的Redis发生警报，一段时间后，内存被降到50%多（之前一直在90%左右），然后发现Redis的进程挂了。第一时间看Redis的log文件，发现有如下信息:

10866:M 05 Sep 20:15:19.711 # Background saving terminated by signal 9
10866:M 05 Sep 20:18:43.898 # Background saving terminated by signal 9
10866:M 05 Sep 20:26:46.434 # Background saving terminated by signal 9
10866:M 05 Sep 20:34:49.161 # Background saving terminated by signal 9
10866:M 05 Sep 20:42:52.406 # Background saving terminated by signal 9
10866:M 05 Sep 20:42:55.332 # Background saving terminated by signal 9
1758:M 05 Sep 21:28:11.114 # Background saving terminated by signal 9
1758:M 05 Sep 21:30:18.479 # Background saving terminated by signal 9
1758:M 05 Sep 21:32:55.275 # Background saving terminated by signal 9

可知Redis收到kill -9的信号终止了.然后，当时第一反应，应该是“有人”人工去kill Redis进程吗？不知道怎么的，当时自己就去查看操作系统日志dmesg -T | grep redis，真的发现是有内幕：

dmesg -T | grep redis | grep "Out of memory"
[Mon Sep  5 20:15:18 2016] Out of memory: Kill process 725 (redis-server) score 517 or sacrifice child
[Mon Sep  5 20:18:42 2016] Out of memory: Kill process 786 (redis-server) score 517 or sacrifice child
[Mon Sep  5 20:26:45 2016] Out of memory: Kill process 914 (redis-server) score 517 or sacrifice child
[Mon Sep  5 20:34:48 2016] Out of memory: Kill process 1022 (redis-server) score 517 or sacrifice child
[Mon Sep  5 20:42:50 2016] Out of memory: Kill process 1127 (redis-server) score 517 or sacrifice child
[Mon Sep  5 20:42:52 2016] Out of memory: Kill process 10866 (redis-server) score 517 or sacrifice child
[Mon Sep  5 20:50:57 2016] Out of memory: Kill process 1235 (redis-server) score 517 or sacrifice child
[Mon Sep  5 20:50:57 2016] Out of memory: Kill process 10866 (redis-server) score 517 or sacrifice child
[Mon Sep  5 21:28:10 2016] Out of memory: Kill process 1886 (redis-server) score 479 or sacrifice child
[Mon Sep  5 21:30:17 2016] Out of memory: Kill process 1758 (redis-server) score 479 or sacrifice child
[Mon Sep  5 21:32:54 2016] Out of memory: Kill process 1972 (redis-server) score 479 or sacrifice child

dmesg -T | grep redis | grep "oom-killer"
[Mon Sep  5 20:26:44 2016] redis-server invoked oom-killer: gfp_mask=0x10200da, order=0, oom_score_adj=0
[Mon Sep  5 21:32:53 2016] redis-server invoked oom-killer: gfp_mask=0x8200da, order=0, oom_score_adj=0

当天查看的Redis内存信息如下:

# Memory
used_memory:7877146344
used_memory_human:7.34G
used_memory_rss:8699490304
used_memory_rss_human:8.10G
used_memory_peak:8462552976
used_memory_peak_human:7.88G
total_system_memory:16828653568
total_system_memory_human:15.67G
used_memory_lua:37888
used_memory_lua_human:37.00K
maxmemory:9573741824
maxmemory_human:8.92G
maxmemory_policy:noeviction
mem_fragmentation_ratio:1.10
mem_allocator:jemalloc-4.0.3

原因

当时服务器还有个MySQL slave在进行复制备份，服务器一共16GB的内存，然后MySQL用掉了5GB，还有11GB内存，除去一些其他的简单的占用和消耗外，估计还有10GB的内存真正给Redis可用。

可以看到上面的INFO，当时分配给Redis最大的内存为差不多9GB，那应试还有1GB可用空间，那到底是什么导致了OS触发OOM Killer机制呢？

原来，Redis当时开启了RDB功能，而Redis自身是通过fork()进程来处理RDB文件的。可以man fork知道，它是精确复制与父进程来处理RDB文件的。

Redis在后台的存储机制依赖于操作系统中fork的copy-on-write：也就是redis fork（创建一个子进程）是父进程的一个完整精确拷贝。子进程转储到磁盘上的数据库然后退出。理论上来说，子进程作为一个副本应该使用和父亲一样多的内存，但是实际上由于大部分现代操作系统的copy-on-write的实现，父进程和子进程将共享内存页。当他被父进程或者子进程改变的时候，一个内存页将被复制。因此，从理论上讲，当子进程存储的时候，所有内存页可能被改变，Linux不能提前告诉子进程多少内存被使用，所以如果overcommit_memory设置被设置为0，创建将会失败，除非有同样多的空闲内存。结果是，如果你有3GB的redis数据并且只有2GB的空闲内存，它将会失败。

把overcommit_memory设置为1来告诉Linux以更加乐观的方式来执行fork操作，并且这确实是你想要的。

参考资料:

Redis FAQ ifeve redis faq中文翻译

overcommit_memory 对fork的影响

我们可以来做一个实验，来一步一步理解这个参数到fork的影响. 测试的环境是:

┌─[sky@sky-linux] - [~] - [2016-09-06 02:43:37]
└─[0] <> uname -a
Linux sky-linux 3.19.0-32-generic #37~14.04.1-Ubuntu SMP Thu Oct 22 09:41:40 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
┌─[sky@sky-linux] - [~] - [2016-09-06 02:46:14]
└─[0] <> free -h 
             total       used       free     shared    buffers     cached
Mem:           15G        10G       4.6G       750M       209M       3.8G
-/+ buffers/cache:       7.0G       8.6G
Swap:           0B         0B         0B

可以看到，还有8.6GB空闲内存。默认的情况下，overcommit_memory的值是：

┌─[sky@sky-linux] - [~] - [2016-09-06 02:47:16]
└─[0] <> cat /proc/sys/vm/overcommit_memory 
0

即它是以试探式分配内存，如果不够的话，就直接拒绝该次的内存分配.

示例代码：父进程分配了5GB，这时再fork()一个子进程，看看会发生什么事：

#include<stdio.h>
#include <unistd.h>
int main()
{
  void *ptr_one;
  pid_t childPID;
  ptr_one = calloc(1342177280, sizeof(int));

  if (ptr_one == 0)
		{
			printf("ERROR: Out of memory\n");
			return 1;
		}
  childPID = fork();
  if(childPID >= 0) // fork was successful
    {
      if(childPID == 0) // child process
        {
          printf("\n child process\n");
          sleep(60);
        }
      else //Parent process
        {
          printf("\n parent process\n");
          sleep(60);
        }
    }
  else // fork failed
    {
      printf("\n Fork failed, quitting!!!!!!\n");
      return 1;
    }
  return 0;
}

overcommit_memory = 0

编译运行：

┌─[sky@sky-linux] - [/tmp] - [2016-09-06 02:56:40]
└─[0] <> gcc mall.c 
mall.c: In function ‘main’:
mall.c:7:13: warning: incompatible implicit declaration of built-in function ‘calloc’ [enabled by default]
   ptr_one = calloc(1342177280, sizeof(int));
             ^
┌─[sky@sky-linux] - [/tmp] - [2016-09-06 02:56:44]
└─[0] <> ./a.out 

 Fork failed, quitting!!!!!!

因为空闲内存是8.6G，而父进程使用了5GB，这时再fork一个子进程的话，那5GB+5GB=10GB，这时是不够内存分配的，所以OS直接拒绝了该次内存申请.

overcommit_memory = 1

sky-linux# echo "1" > /proc/sys/vm/overcommit_memory

┌─[sky@sky-linux] - [/tmp] - [2016-09-06 02:56:48]
└─[1] <> ./a.out 

 parent process

 child process

这时查看内存使用情况:

sky-linux# free -h
             total       used       free     shared    buffers     cached
Mem:           15G        15G       227M       606M       116M       3.2G
-/+ buffers/cache:        11G       3.6G

8.6GB-5GB=3.6GB，那为什么不是占用10GB，而只是5GB呢？这是因为Linux使用copy-on-write机制.它只对要修改的数据块进行copy-and-write。

overcommit_memory = 2

于禁止overcommit (vm.overcommit_memory=2) 还需要细说：首先，如何才算是overcommit呢？kernel设有一个阈值，申请的内存总数超过这个阈值就算overcommit，在/proc/meminfo中可以看到这个阈值的大小：

# grep -i commit /proc/meminfo
CommitLimit:     5967744 kB
Committed_AS:    5363236 kB
1
2
3
# grep -i commit /proc/meminfo
CommitLimit:     5967744 kB
Committed_AS:    5363236 kB

CommitLimit: 就是overcommit的阈值，申请的内存总数超过CommitLimit的话就算是overcommit。那么这个阈值是如何计算出来的呢？它既不是物理内存的大小，也不是free memory的大小，它是通过内核参数间接设置的，公式如下： CommitLimit = (Physical RAM * vm.overcommit_ratio / 100) + Swap

这个是非常严格的，这样子，因为它是never overcommit。运行上面的代码:

┌─[sky@sky-linux] - [/tmp] - [2016-09-06 04:23:00]
└─[1] <> ./a.out
ERROR: Out of memory

可以看到它直接报错了，即申请5GB内存不成功。因为CommitLimit=15GB * 50 /100 = 7.5GB，即最多可以分配7.5GB空间，但已经用了7GB，这时再申请5GB的话，已经超过了7.5GB的限制，所以OS直接拒绝了该次内存申请。

参考资料: * linuxperf * coolshell

解决

加内存咯（如果已经充分利用了Redis的存储结构来保存了最适合的数据，即Redis层已经最优了，不然的话，可以先优化Redis层先）
将Redis的Maxmemory调到可用的物理内存的3/5，并且将overcommit_memory=1，这样子虽然可以避免OOM，不过，这样子可以让我们知道，物理内存是瓶颈了。

调整Redis的OOM权重，即不能让OS的OOM机制Kill掉redis，但这样子会导致OOM Killer其他进程哦:

echo "-17" > /proc/redis的进程IP/oom_adj

新版本的Linux参数修改为：

[Tue Sep  6 16:50:06 2016] echo (349): /proc/5360/oom_adj is deprecated, please use /proc/5360/oom_score_adj instead.
即使用 oom_score_adj 来代替.

OOM Killer是会kill掉 /proc/PID/oom_score（新版本为oom_score_adj)中分数越高的进程.(oom_score_adj的范围为-1000到1000)

可以模拟触发OOM：

echo f > /proc/sysrq-trigger

执行完上面的命令后，查看

dmesg -T

查找出系统中首先会被Kill的进程

dstat --top-oom

for i in /proc/*/oom_score; do pid=$(echo "${i}" | cut -d/ -f3); echo "oom_score=$(cat "${i}"), PID=${pid}, exe=$(readlink -e /proc/${pid}/exe)"; done 2> /dev/null | sort -rn -t, -k 1.11 | head -n 50

参考资料

查看历史命令的执行时间

export HISTTIMEFORMAT="%d/%m/%y %T "

这有个问题，就是之前的历史命令，会统一为执行这个export的命令，然后之后的命令才会记录下时间.

Contents