How the OOM Killer Works

Date: 2015-08-19

Recently, the Redis process on one of our production machines kept getting killed by the system. syslog (/var/log/syslog) recorded the following at the time:

Aug 19 11:16:31 jsldg kernel: [4836881.950256] Out of memory: Kill process 11063 (redis-cli) score 538 or sacrifice child
Aug 19 11:16:31 jsldg kernel: [4836881.950338] Killed process 11063 (redis-cli) total-vm:4649480kB, anon-rss:3098176kB, file-rss:4kB

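These entries show that the kernel's OOM (Out of Memory) Killer was triggered and chose the Redis process (reported as redis-cli, PID 11063) as its victim.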
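The second log line carries the victim's PID, command name and memory footprint. As a hedged sketch, assuming the exact line format shown above (it may differ between kernel versions), the hypothetical helper parse_killed_line below extracts those fields from a syslog line with sscanf; the struct oom_kill_record type is likewise made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for the fields of a "Killed process" log line. */
struct oom_kill_record {
        int pid;
        char comm[64];               /* command name, e.g. "redis-cli" */
        unsigned long total_vm_kb;   /* total virtual memory, kB */
        unsigned long anon_rss_kb;   /* resident anonymous memory, kB */
        unsigned long file_rss_kb;   /* resident file-backed memory, kB */
};

/* Parse one syslog line; returns 1 on success, 0 if the line does not match
 * the (assumed) "Killed process ..." format. */
int parse_killed_line(const char *line, struct oom_kill_record *rec)
{
        const char *p;

        assert(line != NULL && rec != NULL);
        p = strstr(line, "Killed process ");
        if (p == NULL)
                return 0;
        /* %63[^)] reads the command name up to the closing parenthesis */
        if (sscanf(p, "Killed process %d (%63[^)]) "
                      "total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB",
                   &rec->pid, rec->comm, &rec->total_vm_kb,
                   &rec->anon_rss_kb, &rec->file_rss_kb) != 5)
                return 0;
        return 1;
}
```

Applied to the log above, it yields PID 11063, command redis-cli and an anon-rss of 3098176 kB (about 2.95 GiB), which is almost all of the victim's resident memory (file-rss is only 4 kB).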

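When the kernel cannot satisfy a memory allocation even after trying to reclaim memory, Linux calls the out_of_memory function, which picks a process that is using too much memory and kills it. To choose the victim, out_of_memory calls select_bad_process, which returns the process with the highest badness score, roughly the one using the most memory.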

The OOM Killer code lives in mm/oom_kill.c. The relevant part of out_of_memory is:

p = select_bad_process(&points, totalpages, mpol_mask, force_kill);
/* Found nothing?!?! Either we hang forever, or we panic. */
if (!p) {
        dump_header(NULL, gfp_mask, order, NULL, mpol_mask);
        panic("Out of memory and no killable processes...\n");
}
if (p != (void *)-1UL) {
        oom_kill_process(p, gfp_mask, order, points, totalpages, NULL,
                         nodemask, "Out of memory");
        killed = 1;
}

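Once a victim has been found, out_of_memory calls oom_kill_process to kill it. If select_bad_process finds no candidate at all, the kernel panics; if it returns the special value (void *)-1UL, the scan was aborted because another task is already being killed, and no new victim is chosen.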

select_bad_process picks the process to kill: it iterates over every thread of every process and keeps the one with the highest score. Its implementation is:

static struct task_struct *select_bad_process(unsigned int *ppoints,
                unsigned long totalpages, const nodemask_t *nodemask,
                bool force_kill)
{
        struct task_struct *g, *p;
        struct task_struct *chosen = NULL;
        unsigned long chosen_points = 0;

        rcu_read_lock();
        for_each_process_thread(g, p) {
                unsigned int points;

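                switch (oom_scan_process_thread(p, totalpages, nodemask,
                                                force_kill)) {
                case OOM_SCAN_SELECT:
                        /* task marked to be killed first: pick it outright */
                        chosen = p;
                        chosen_points = ULONG_MAX;
                        /* fall through */
                case OOM_SCAN_CONTINUE:
                        /* unkillable or without an mm: skip this thread */
                        continue;
                case OOM_SCAN_ABORT:
                        /* a task is already dying and will free memory: pick nobody */
                        rcu_read_unlock();
                        return (struct task_struct *)(-1UL);
                case OOM_SCAN_OK:
                        /* eligible: compute its badness below */
                        break;
                };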
                points = oom_badness(p, NULL, nodemask, totalpages);
                if (!points || points < chosen_points)
                        continue;
                /* Prefer thread group leaders for display purposes */
                if (points == chosen_points && thread_group_leader(chosen))
                        continue;

                chosen = p;
                chosen_points = points;
        }
        if (chosen)
                get_task_struct(chosen);
        rcu_read_unlock();

        /* scale to thousandths of totalpages; logged as "score" */
        *ppoints = chosen_points * 1000 / totalpages;
        return chosen;
}
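To make the control flow easier to follow outside the kernel, here is a user-space sketch of the same loop. It is a simplified model, not kernel code: struct fake_task, enum scan_result, pick_victim and the PICK_* codes are hypothetical stand-ins for task_struct, the OOM_SCAN_* verdicts and the kernel's NULL / (void *)-1UL returns, and each task's points are supplied directly instead of coming from oom_badness.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-ins for the oom_scan_process_thread() verdicts. */
enum scan_result { SCAN_OK, SCAN_CONTINUE, SCAN_SELECT, SCAN_ABORT };

/* Hypothetical task record: scan verdict, badness points, group-leader flag. */
struct fake_task {
        enum scan_result scan;
        unsigned long points;
        int is_leader;
};

#define PICK_NONE  (-1)   /* no eligible task (kernel: NULL, then panic) */
#define PICK_ABORT (-2)   /* scan aborted (kernel: (void *)-1UL) */

/* Returns the victim's index or a PICK_* code; *score receives the
 * winner's points in thousandths of totalpages. */
int pick_victim(const struct fake_task *tasks, size_t n,
                unsigned long totalpages, unsigned long *score)
{
        int chosen = PICK_NONE;
        unsigned long chosen_points = 0;
        size_t i;

        assert(totalpages > 0);
        for (i = 0; i < n; i++) {
                unsigned long points;

                switch (tasks[i].scan) {
                case SCAN_SELECT:
                        /* forced pick: no later task can beat ULONG_MAX */
                        chosen = (int)i;
                        chosen_points = ULONG_MAX;
                        continue;
                case SCAN_CONTINUE:
                        continue;
                case SCAN_ABORT:
                        return PICK_ABORT;
                case SCAN_OK:
                        break;
                }
                points = tasks[i].points;
                if (!points || points < chosen_points)
                        continue;
                /* on a tie, keep an already-chosen thread-group leader */
                if (points == chosen_points && chosen >= 0 &&
                    tasks[chosen].is_leader)
                        continue;
                chosen = (int)i;
                chosen_points = points;
        }
        /* sketch choice: report a forced pick as 1000 instead of scaling ULONG_MAX */
        *score = (chosen_points == ULONG_MAX) ? 1000
                                               : chosen_points * 1000 / totalpages;
        return chosen;
}
```

For example, with totalpages = 1000 and candidates scoring 100, 400 and 400 (the first 400-point task being a thread-group leader), the first 400-point task wins and the reported score is 400.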

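select_bad_process calls oom_badness to compute a score (points) for each eligible thread; a score of 0 means the thread is skipped, and the thread with the highest points is the one that gets killed.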
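oom_badness itself is not listed here. The sketch below models its heuristic as found in kernels of this period (roughly 3.x): the baseline is the task's resident pages plus swap entries plus page-table pages, tasks with CAP_SYS_ADMIN get a 3% discount, and oom_score_adj (-1000 to 1000) shifts the result in thousandths of totalpages, with -1000 exempting the task entirely. The struct mem_snapshot type and the badness function are hypothetical simplifications: the real function also checks whether the task is killable at all and reads these counters from the task's mm, and later kernels changed some details.

```c
#include <assert.h>
#include <stdbool.h>

#define OOM_SCORE_ADJ_MIN (-1000)

/* Hypothetical per-task memory snapshot; all sizes are in pages. */
struct mem_snapshot {
        unsigned long rss;       /* resident pages */
        unsigned long swap;      /* swapped-out pages */
        unsigned long pgtables;  /* page-table pages */
        bool is_admin;           /* stands in for CAP_SYS_ADMIN */
        long oom_score_adj;      /* -1000 .. 1000 */
};

/* Sketch of the 3.x-era badness heuristic; 0 means "never pick". */
unsigned long badness(const struct mem_snapshot *t, unsigned long totalpages)
{
        long points;
        long adj = t->oom_score_adj;

        assert(totalpages > 0);
        if (adj == OOM_SCORE_ADJ_MIN)
                return 0;                        /* OOM-exempt task */

        /* baseline: memory that killing the task would release */
        points = (long)(t->rss + t->swap + t->pgtables);

        if (t->is_admin)
                points -= points * 3 / 100;      /* 3% discount for root */

        /* oom_score_adj is expressed in thousandths of totalpages */
        adj *= (long)(totalpages / 1000);
        points += adj;

        /* an eligible task never scores 0 */
        return points > 0 ? (unsigned long)points : 1;
}
```

With totalpages = 10000, a non-root task holding 1000 pages scores 1000; raising its oom_score_adj to 500 adds 500 * (10000 / 1000) = 5000 points, for 6000 in total, which select_bad_process would report as score 6000 * 1000 / 10000 = 600. For a system-wide OOM in these kernels, totalpages is physical RAM plus swap, so the score 538 in the log above means the victim's badness came to roughly 53.8% of that total.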