在有些时候,会遇到PHP进程异常卡死的情况。面对这种情况,首先考虑到的就是分析代码进行优化改进,或者重启进程。但是,这种方式来排查不一定能找到根本原因。因为有些时候异常PHP进程卡死的原因可能是非常奇葩的问题,比如外部资源异常如DB、Redis与第三方API等。上一次关于超级话题签到提醒定时任务PHP进程异常的处理,问题竟然出现在外部的DB的链接上。由于网络原因导致读取DB没有响应卡死。
这一次情况比较严重,超级话题积分系统的计算有一部分是通过Trigger(类似队列)接受全站数据进行积分计算与入库。由于用户反馈问题,来到队列机查看进程情况。如下:
$ ps -ef | grep Trigger | grep Comment
www 10693 1 7 16:06 ? 00:04:09 /usr/local/bin/php /path/cli.php request_uri=/Cli/Trigger/Comment/proc_num/1/
www 20382 1 7 16:55 ? 00:00:26 /usr/local/bin/php /path/cli.php request_uri=/Cli/Trigger/Comment/proc_num/3/
www 31051 1 8 14:22 ? 00:13:07 /usr/local/bin/php /path/cli.php request_uri=/Cli/Trigger/Comment/proc_num/4/
www 34964 1 8 14:55 ? 00:11:04 /usr/local/bin/php /path/cli.php request_uri=/Cli/Trigger/Comment/proc_num/2/
之后,我们查看某个进程的处理日志。如下:
$ tail -f comment_1.log
Progress:2016-11-29 11:08:44 Time:2016-11-29 16:05:15
Progress:2016-11-29 11:33:53 Time:2016-11-29 16:05:33
[START] 2016-11-29 16:06:11
发现进程(pid 10693)在重启之后卡主。
于是利用strace查看进程处理过程:
$ strace -p 10693
Process 10693 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>) = 1
recvfrom(4, "VALUE comment$huati_comment$frey"..., 8196, MSG_DONTWAIT, NULL, NULL) = 105
sendto(4, "get comment$huati_comment$freya2"..., 76, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 76
recvfrom(4, 0x29e95a8, 8196, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=4, events=POLLIN}], 1, 5000) = 1 ([{fd=4, revents=POLLIN}])
recvfrom(4, "VALUE comment$huati_comment$frey"..., 8196, MSG_DONTWAIT, NULL, NULL) = 105
open("/data1/www/privdata/huati.weibo.com/cli_beat/beat_trigger-comment-1.tmp", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 5
fstat(5, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
lseek(5, 0, SEEK_CUR) = 0
write(5, "1480410489", 10) = 10
close(5) = 0
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
sendto(4, "get comment$huati_comment$freya2"..., 76, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 76
recvfrom(4, 0x29e95a8, 8196, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=4, events=POLLIN}], 1, 5000) = 1 ([{fd=4, revents=POLLIN}])
recvfrom(4, "VALUE comment$huati_comment$frey"..., 8196, MSG_DONTWAIT, NULL, NULL) = 105
sendto(4, "get comment$huati_comment$freya2"..., 76, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 76
recvfrom(4, 0x29e95a8, 8196, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=4, events=POLLIN}], 1, 5000) = 1 ([{fd=4, revents=POLLIN}])
recvfrom(4, "VALUE comment$huati_comment$frey"..., 8196, MSG_DONTWAIT, NULL, NULL) = 105
sendto(4, "get comment$huati_comment$freya2"..., 76, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 76
recvfrom(4, 0x29e95a8, 8196, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=4, events=POLLIN}], 1, 5000) = 1 ([{fd=4, revents=POLLIN}])
recvfrom(4, "VALUE comment$huati_comment$frey"..., 8196, MSG_DONTWAIT, NULL, NULL) = 105
sendto(4, "get comment$huati_comment$freya2"..., 76, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 76
recvfrom(4, 0x29e95a8, 8196, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
如strace信息可以看出,该异常PHP进程一直在循环的读取,貌似进入死循环:
sendto(4, "get comment$huati_comment$freya2"..., 76, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 76
recvfrom(4, 0x29e95a8, 8196, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=4, events=POLLIN}], 1, 5000) = 1 ([{fd=4, revents=POLLIN}])
recvfrom(4, "VALUE comment$huati_comment$frey"..., 8196, MSG_DONTWAIT, NULL, NULL) = 105
接着利用lsof来查看该进程的所有链接资源,如下:
lsof -p 10693
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
php 10693 www 0r CHR 1,3 0t0 3938 /dev/null
php 10693 www 1w CHR 1,3 0t0 3938 /dev/null
php 10693 www 2w FIFO 0,8 0t0 957455406 pipe
php 10693 www 3ur REG 8,5 0 152 /tmp/.ZendSem.KdjyYc (deleted)
php 10693 www 4u IPv4 957501475 0t0 TCP freya216:36076->xx.xx.xx.xx:xxxx (ESTABLISHED)
所以,可以看出该PHP异常进程是在不断的读取FD值为4u对应的资源,该资源正式trigger的服务资源。
通过上述的判断,可以初步断定出现在trigger服务上,接着就是找相关部门的人继续深入排查了。
注:已处理敏感信息,如有问题请及时联系站长。
文章来源:胡旭博客 => 在Linux下通过strace与lsof命令排查PHP异常进程
转载请注明出处,违者必究!