uWSGI re-fork master: Notes on the Pitfalls

Zero downtime Python application deployment

0x00 References

Based on the documents above, I settled on re-fork master as the way to implement zero downtime deployment.

0x01 First Try

The uWSGI configuration file:

[uwsgi]
chdir = /home/foo/helloworld
module = helloworld.wsgi
processes  = 4
socket = /var/run/uwsgi/app.sock
pidfile = /var/run/uwsgi/app.pid
vacuum = false

master-fifo = /tmp/app.new.fifo
master-fifo = /tmp/app.running.fifo

if-exists = /tmp/app.running.fifo
  hook-accepting1-once = writefifo:/tmp/app.running.fifo q
endif =
hook-accepting1-once = writefifo:/tmp/app.new.fifo 1P

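First, two master-fifo entries are configured, with app.new.fifo as the default (slot 0); see The Master FIFO.
The expected deployment flow:
First deployment: no uWSGI process is running beforehand, and neither app.running.fifo nor app.new.fifo exists. uWSGI starts with app.new.fifo as its master fifo. Because app.running.fifo does not exist, hook-accepting1-once = writefifo:/tmp/app.running.fifo q is not executed. hook-accepting1-once = writefifo:/tmp/app.new.fifo 1P runs once, when the first worker is ready to accept requests: it switches the master fifo to slot 1 (app.running.fifo) and updates the master pid in the pid file.
Subsequent deployments: echo f > /tmp/app.running.fifo tells uWSGI to re-fork the master, and uWSGI forks a new set of uWSGI processes. Because app.running.fifo now exists, hook-accepting1-once = writefifo:/tmp/app.running.fifo q runs once, when the first new worker is ready to accept requests, and tells the old uWSGI processes to exit gracefully (stop accepting new requests and exit after finishing the requests already accepted). hook-accepting1-once = writefifo:/tmp/app.new.fifo 1P also runs once, switching the new master's fifo to slot 1 (app.running.fifo) and updating the master pid in the pid file.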

It looked perfect, but trying it out revealed problems:

The master fifo file is created asynchronously. On the first deployment, app.new.fifo may not exist yet when hook-accepting1-once = writefifo:/tmp/app.new.fifo 1P runs, which makes startup fail. For details, see the GitHub issue: Zerg dance: writefifo race condition

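The fix is already described in that issue: use spinningfifo instead of writefifo, i.e. hook-accepting1-once = spinningfifo:/tmp/app.new.fifo 1P. (The spinningfifo code has been merged into uWSGI core but no release includes it yet; in the meantime it can be loaded as a plugin via uwsgi-spinningfifo.)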
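The idea behind a "spinning" write can be illustrated outside uWSGI. The following is a minimal Python sketch, not uWSGI's actual spinningfifo implementation: it keeps retrying until the FIFO exists and has a reader, then writes the command. The function name spin_write_fifo and the timeout and interval parameters are hypothetical, chosen only for this illustration.

```python
import errno
import os
import time


def spin_write_fifo(path, data, timeout=10.0, interval=0.05):
    """Retry until `path` can be opened for writing, then write `data`.

    Illustrative only: this mimics the retry idea, not uWSGI's code.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_NONBLOCK makes open() fail fast instead of blocking:
            # ENOENT -> the FIFO has not been created yet,
            # ENXIO  -> the FIFO exists but nobody is reading it yet.
            fd = os.open(path, os.O_WRONLY | os.O_NONBLOCK)
        except OSError as exc:
            if exc.errno not in (errno.ENOENT, errno.ENXIO):
                raise
            if time.monotonic() >= deadline:
                raise TimeoutError(f"no reader on {path} after {timeout}s") from exc
            time.sleep(interval)
            continue
        try:
            os.write(fd, data)
        finally:
            os.close(fd)
        return
```

A plain writefifo corresponds to a single attempt; the spinning variant trades a short busy-wait for tolerance of the asynchronous FIFO creation.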

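With the first problem solved, I tried a second deployment (echo f > /tmp/app.running.fifo). Sometimes it succeeded, and sometimes echo blocked forever. lsof | grep fifo showed that the app.running.fifo opened by uWSGI was in the deleted state, even though the file /tmp/app.running.fifo existed. Searching GitHub turned up a related issue, Race condition graceful restart while fork master, which describes the race as follows: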

After #race_condition_point there is 2 ways:

1. the old master receive 'q' and delete and create fifo-file, after that new master delete and create fifo-file and change master-fifo to 1

2. the new master delete and create fifo file, change master-fifo to 1 and after this old master delete and create fifo file (because we send 'q' him)

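Clearly, the second case produces the state described above: the new master is left listening on a FIFO that the old master has since unlinked, while the path points at a fresh FIFO that nobody reads, so echo blocks.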
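The "deleted but still exists" state can be reproduced without uWSGI. The sketch below is a simulation under assumptions: the two masters are stood in for by plain file descriptors in one process, and the helper names fifo_has_reader and demo are hypothetical. It follows the ordering of the second case in the issue: one side opens the FIFO, then the other side unlinks and recreates the same path.

```python
import errno
import os
import tempfile


def fifo_has_reader(path):
    """Return True if some process has the FIFO at `path` open for reading."""
    try:
        fd = os.open(path, os.O_WRONLY | os.O_NONBLOCK)
    except OSError as exc:
        if exc.errno == errno.ENXIO:  # FIFO exists, but nobody reads it
            return False
        raise
    os.close(fd)
    return True


def demo():
    path = os.path.join(tempfile.mkdtemp(), "app.running.fifo")

    # Stand-in for the new master: create the FIFO and listen on it.
    os.mkfifo(path)
    new_master_fd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
    before = fifo_has_reader(path)

    # Stand-in for the old master acting late: recreate the same path.
    os.unlink(path)
    os.mkfifo(path)
    after = fifo_has_reader(path)

    # The listener still holds the unlinked inode (what lsof reports as deleted).
    same_inode = os.fstat(new_master_fd).st_ino == os.stat(path).st_ino
    os.close(new_master_fd)
    return before, after, same_inode


if __name__ == "__main__":
    print(demo())
```

With a blocking open, as echo uses, the writer would wait indefinitely for a reader on the recreated path; the non-blocking open here turns that hang into an ENXIO error that can be checked.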

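So the root cause is that the old master and the new master share app.running.fifo, which creates a race condition when app.running.fifo is recreated. The GitHub issue above has not been resolved yet. The workaround I came up with is simple: stop sharing app.running.fifo by adding an app.quit.fifo dedicated to telling the old uWSGI master to exit gracefully.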

0x02 Second Try

The new uWSGI configuration file:

[uwsgi]
chdir = /home/foo/helloworld
module = helloworld.wsgi
processes  = 4
socket = /var/run/uwsgi/app.sock
pidfile = /var/run/uwsgi/app.pid
vacuum = false

master-fifo = /tmp/app.new.fifo
master-fifo = /tmp/app.running.fifo
master-fifo = /tmp/app.quit.fifo

if-exists = /tmp/app.running.fifo
  hook-accepting1-once = writefifo:/tmp/app.running.fifo 2q
endif =
hook-accepting1-once = spinningfifo:/tmp/app.new.fifo 1P

With app.quit.fifo added, when a deployment is triggered with echo f > /tmp/app.running.fifo, hook-accepting1-once = writefifo:/tmp/app.running.fifo 2q tells the old master to switch its master-fifo to slot 2 (app.quit.fifo) and then exit gracefully (q).

Works great.
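Since the P command rewrites the pid file with the new master's pid, a deploy script could confirm the handover by watching that file after sending f. The following is a rough sketch under assumptions: the helper names read_pid and wait_for_new_master and the timeout and interval defaults are hypothetical, and the path in the usage note is the pidfile from the configuration above.

```python
import os
import time


def read_pid(pidfile):
    """Return the PID stored in `pidfile`, or None if missing or not yet valid."""
    try:
        with open(pidfile) as fh:
            return int(fh.read().strip())
    except (FileNotFoundError, ValueError):
        return None


def wait_for_new_master(pidfile, old_pid, timeout=30.0, interval=0.2):
    """Poll until `pidfile` names a master other than `old_pid`."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        pid = read_pid(pidfile)
        if pid is not None and pid != old_pid:
            return pid
        time.sleep(interval)
    raise TimeoutError(f"{pidfile} still points at {old_pid}")
```

A deploy script might call read_pid("/var/run/uwsgi/app.pid") before writing f to /tmp/app.running.fifo, then call wait_for_new_master with that value and treat a TimeoutError as a failed deployment.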

0x03 Caveats

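re-fork master is powerful but also dangerous. If you use uWSGI features such as attach-daemon to manage background processes, re-fork master also starts a new set of those processes, so you must make sure they can run as multiple instances without side effects.
If uwsgi --build-plugin produces no output at all, try installing missing dependencies such as python, gcc, etc.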

messense

Some rights reserved

Except where otherwise noted, content on this page is licensed under a Creative Commons Attribution-NonCommercial 4.0 International license.