1.17+ k8s使用cni的vxlan模式下63秒超时的一次定位
flannel的vxlan和calco的vxlan模式都有这个问题,原因是v1.17+的k8s会引起内核的某个bug而不是cni的软件层面,weave没有,后面说。
写这篇文章的日期是 05/28
,发现是上周五也就是 05/23
号,文章从时间线写起(因为很多时候想发文章但是没空,所以文章的发布日期是 05/23
)
由来
上周五我经过同事的工位看到同事的桌面是k8s的get po的输出,问他咋开始学k8s了,他说跟着视频学下,看了下用的kubeadm部署了一套 1.18.2
的集群。1.18的kube-proxy的ipvs包的parseIP有bug,我推荐他换 v1.17.5
。他当时在部署一个入门的svc实验,换了后还是无法解析域名(实际上后面对照组 v1.18.2
也解析不了)。使用dig命令排查了下,下面是对照:
dig @ +short kubernetes.default.svc.cluster.local dig @10.96.0.10 +short kubernetes.default.svc.cluster.local
很多市面上的kubeadm部署教程都是直接kubeadm init,所以我推荐同事去按照我文章的kubeadm部署一套后再试试,叫他用v1.17的最新版本 v1.17.5
,结果还是上面一样。coredns实际上还有metrics的http接口,从http层测了下:
curl -I 10.96.0.10:9153/metrics curl -I :9153/metrics
涉及到本次排查的信息为
$ kubectl get node -o wide NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME master Ready master 7d8h v1.18.2 10.0.100.3 CentOS Linux 7 (Core) 3.10.0-957.el7.x86_64 docker://19.3.8 node1 Ready 7d7h v1.18.2 10.0.100.4 CentOS Linux 7 (Core) 3.10.0-957.el7.x86_64 docker://19.3.8 node2 Ready 7d7h v1.18.2 10.0.100.15 CentOS Linux 7 (Core) 3.10.0-957.el7.x86_64 docker://19.3.8 $ kubectl get po -o wide -n kube-system -l k8s-app=kube-dns NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES coredns-546565776c-v5wwg 1/1 Running 2 25h 10.244.2.73 node2
多次尝试发现很久的时间都是一样,用time命令观察了下一直是63秒返回。包括其他任何svc都是这样
$ time curl -I 10.96.0.10:9153/metrics HTTP/1.1 200 OK Content-Type: text/plain; version=0.0.4; charset=utf-8 Date: Wed, 25 May 2020 08:39:35 GMT real 1m3.091s user 0m0.002s sys 0m0.007s
proxyMode是ipvs,用ipvsadm看下超时的时候的状态,一直是 SYN_RECV
,也就是发送了SYN,没收到回包
# ipvsadm -lnc |& grep 9153 TCP 00:59 SYN_RECV 10.96.0.10:41282 10.96.0.10:9153 10.244.2.73:9153
抓包
因为cni使用的flannel,用的vxlan模式。master上抓9153和flannel.1的8472端口,coredns的pod所在node上抓flannel的vxlan包,下面三个是对应的:
[root@master /root]# tcpdump -nn -i flannel.1 port 9153 tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on flannel.1, link-type EN10MB (Ethernet), capture size 262144 bytes 16:30:56.705696 IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [S], seq 911217171, win 43690, options [mss 65495,sackOK,TS val 17148909 ecr 0,nop,wscale 7], length 0 16:30:57.708489 IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [S], seq 911217171, win 43690, options [mss 65495,sackOK,TS val 17149912 ecr 0,nop,wscale 7], length 0 16:30:59.712458 IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [S], seq 911217171, win 43690, options [mss 65495,sackOK,TS val 17151916 ecr 0,nop,wscale 7], length 0 16:31:03.716441 IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [S], seq 911217171, win 43690, options [mss 65495,sackOK,TS val 17155920 ecr 0,nop,wscale 7], length 0 16:31:11.732562 IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [S], seq 911217171, win 43690, options [mss 65495,sackOK,TS val 17163936 ecr 0,nop,wscale 7], length 0 16:31:27.764498 IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [S], seq 911217171, win 43690, options [mss 65495,sackOK,TS val 17179968 ecr 0,nop,wscale 7], length 0 16:31:59.828493 IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [S], seq 911217171, win 43690, options [mss 65495,sackOK,TS val 17212032 ecr 0,nop,wscale 7], length 0 16:31:59.829565 IP 10.244.2.73.9153 > 10.244.0.0.2201: Flags [S.], seq 435819916, ack 911217172, win 27960, options [mss 1410,sackOK,TS val 17212067 ecr 17212032,nop,wscale 7], length 0 16:31:59.829611 IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [.], ack 1, win 342, options [nop,nop,TS val 17212033 ecr 17212067], length 0 16:31:59.829714 IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [P.], seq 1:88, ack 1, win 342, options [nop,nop,TS val 17212033 ecr 17212067], length 87 16:31:59.829897 IP 10.244.2.73.9153 > 10.244.0.0.2201: Flags [.], ack 88, win 219, options [nop,nop,TS val 17212067 ecr 17212033], length 0 16:31:59.831300 IP 10.244.2.73.9153 > 10.244.0.0.2201: Flags [P.], seq 1:113, ack 88, win 219, options [nop,nop,TS val 17212069 ecr 17212033], length 112 16:31:59.831322 IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [.], ack 113, win 342, options [nop,nop,TS val 17212034 ecr 17212069], length 0 16:31:59.831435 IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [F.], seq 88, ack 113, win 342, options [nop,nop,TS val 17212035 ecr 17212069], length 0 16:31:59.831633 IP 10.244.2.73.9153 > 10.244.0.0.2201: Flags [F.], seq 113, ack 89, win 219, options [nop,nop,TS val 17212069 ecr 17212035], length 0 16:31:59.831660 IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [.], ack 114, win 342, options [nop,nop,TS val 17212035 ecr 17212069], length 0
[root@master /root]# tcpdump -nn -i eth0 port 8472 tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes 16:30:56.705718 IP 10.0.100.3.48683 > 10.0.100.15.8472: OTV, flags [I] (0x08), overlay 0, instance 1 IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [S], seq 911217171, win 43690, options [mss 65495,sackOK,TS val 17148909 ecr 0,nop,wscale 7], length 0 16:30:57.708523 IP 10.0.100.3.48683 > 10.0.100.15.8472: OTV, flags [I] (0x08), overlay 0, instance 1 IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [S], seq 911217171, win 43690, options [mss 65495,sackOK,TS val 17149912 ecr 0,nop,wscale 7], length 0 16:30:59.712478 IP 10.0.100.3.48683 > 10.0.100.15.8472: OTV, flags [I] (0x08), overlay 0, instance 1 IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [S], seq 911217171, win 43690, options [mss 65495,sackOK,TS val 17151916 ecr 0,nop,wscale 7], length 0 16:31:03.716452 IP 10.0.100.3.48683 > 10.0.100.15.8472: OTV, flags [I] (0x08), overlay 0, instance 1 IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [S], seq 911217171, win 43690, options [mss 65495,sackOK,TS val 17155920 ecr 0,nop,wscale 7], length 0 16:31:11.732590 IP 10.0.100.3.48683 > 10.0.100.15.8472: OTV, flags [I] (0x08), overlay 0, instance 1 IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [S], seq 911217171, win 43690, options [mss 65495,sackOK,TS val 17163936 ecr 0,nop,wscale 7], length 0 16:31:27.764513 IP 10.0.100.3.48683 > 10.0.100.15.8472: OTV, flags [I] (0x08), overlay 0, instance 1 IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [S], seq 911217171, win 43690, options [mss 65495,sackOK,TS val 17179968 ecr 0,nop,wscale 7], length 0 16:31:59.828541 IP 10.0.100.3.56618 > 10.0.100.15.8472: OTV, flags [I] (0x08), overlay 0, instance 1 IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [S], seq 911217171, win 43690, options [mss 65495,sackOK,TS val 17212032 ecr 0,nop,wscale 7], length 0 16:31:59.829521 IP 10.0.100.15.56771 > 10.0.100.3.8472: OTV, flags [I] (0x08), overlay 0, instance 1 IP 10.244.2.73.9153 > 10.244.0.0.2201: Flags [S.], seq 435819916, ack 911217172, win 27960, options [mss 1410,sackOK,TS val 17212067 ecr 17212032,nop,wscale 7], length 0 16:31:59.829617 IP 10.0.100.3.56618 > 10.0.100.15.8472: OTV, flags [I] (0x08), overlay 0, instance 1 IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [.], ack 1, win 342, options [nop,nop,TS val 17212033 ecr 17212067], length 0 16:31:59.829729 IP 10.0.100.3.56618 > 10.0.100.15.8472: OTV, flags [I] (0x08), overlay 0, instance 1 IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [P.], seq 1:88, ack 1, win 342, options [nop,nop,TS val 17212033 ecr 17212067], length 87 16:31:59.829883 IP 10.0.100.15.34571 > 10.0.100.3.8472: OTV, flags [I] (0x08), overlay 0, instance 1 IP 10.244.2.73.9153 > 10.244.0.0.2201: Flags [.], ack 88, win 219, options [nop,nop,TS val 17212067 ecr 17212033], length 0 16:31:59.831292 IP 10.0.100.15.34571 > 10.0.100.3.8472: OTV, flags [I] (0x08), overlay 0, instance 1 IP 10.244.2.73.9153 > 10.244.0.0.2201: Flags [P.], seq 1:113, ack 88, win 219, options [nop,nop,TS val 17212069 ecr 17212033], length 112 16:31:59.831327 IP 10.0.100.3.56618 > 10.0.100.15.8472: OTV, flags [I] (0x08), overlay 0, instance 1 IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [.], ack 113, win 342, options [nop,nop,TS val 17212034 ecr 17212069], length 0 16:31:59.831448 IP 10.0.100.3.56618 > 10.0.100.15.8472: OTV, flags [I] (0x08), overlay 0, instance 1 IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [F.], seq 88, ack 113, win 342, options [nop,nop,TS val 17212035 ecr 17212069], length 0 16:31:59.831612 IP 10.0.100.15.34571 > 10.0.100.3.8472: OTV, flags [I] (0x08), overlay 0, instance 1 IP 10.244.2.73.9153 > 10.244.0.0.2201: Flags [F.], seq 113, ack 89, win 219, options [nop,nop,TS val 17212069 ecr 17212035], length 0 16:31:59.831665 IP 10.0.100.3.56618 > 10.0.100.15.8472: OTV, flags [I] (0x08), overlay 0, instance 1 IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [.], ack 114, win 342, options [nop,nop,TS val 17212035 ecr 17212069], length 0
[root@node2 /root]# tcpdump -nn -i eth0 port 8472 tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes 16:31:59.836137 IP 10.0.100.3.56618 > 10.0.100.15.8472: OTV, flags [I] (0x08), overlay 0, instance 1 IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [S], seq 911217171, win 43690, options [mss 65495,sackOK,TS val 17212032 ecr 0,nop,wscale 7], length 0 16:31:59.836328 IP 10.0.100.15.56771 > 10.0.100.3.8472: OTV, flags [I] (0x08), overlay 0, instance 1 IP 10.244.2.73.9153 > 10.244.0.0.2201: Flags [S.], seq 435819916, ack 911217172, win 27960, options [mss 1410,sackOK,TS val 17212067 ecr 17212032,nop,wscale 7], length 0 16:31:59.836811 IP 10.0.100.3.56618 > 10.0.100.15.8472: OTV, flags [I] (0x08), overlay 0, instance 1 IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [.], ack 1, win 342, options [nop,nop,TS val 17212033 ecr 17212067], length 0 16:31:59.836910 IP 10.0.100.3.56618 > 10.0.100.15.8472: OTV, flags [I] (0x08), overlay 0, instance 1 IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [P.], seq 1:88, ack 1, win 342, options [nop,nop,TS val 17212033 ecr 17212067], length 87 16:31:59.836951 IP 10.0.100.15.34571 > 10.0.100.3.8472: OTV, flags [I] (0x08), overlay 0, instance 1 IP 10.244.2.73.9153 > 10.244.0.0.2201: Flags [.], ack 88, win 219, options [nop,nop,TS val 17212067 ecr 17212033], length 0 16:31:59.838385 IP 10.0.100.15.34571 > 10.0.100.3.8472: OTV, flags [I] (0x08), overlay 0, instance 1 IP 10.244.2.73.9153 > 10.244.0.0.2201: Flags [P.], seq 1:113, ack 88, win 219, options [nop,nop,TS val 17212069 ecr 17212033], length 112 16:31:59.838522 IP 10.0.100.3.56618 > 10.0.100.15.8472: OTV, flags [I] (0x08), overlay 0, instance 1 IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [.], ack 113, win 342, options [nop,nop,TS val 17212034 ecr 17212069], length 0 16:31:59.838621 IP 10.0.100.3.56618 > 10.0.100.15.8472: OTV, flags [I] (0x08), overlay 0, instance 1 IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [F.], seq 88, ack 113, win 342, options [nop,nop,TS val 17212035 ecr 17212069], length 0 16:31:59.838703 IP 10.0.100.15.34571 > 10.0.100.3.8472: OTV, flags [I] (0x08), overlay 0, instance 1 IP 10.244.2.73.9153 > 10.244.0.0.2201: Flags [F.], seq 113, ack 89, win 219, options [nop,nop,TS val 17212069 ecr 17212035], length 0 16:31:59.838836 IP 10.0.100.3.56618 > 10.0.100.15.8472: OTV, flags [I] (0x08), overlay 0, instance 1 IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [.], ack 114, win 342, options [nop,nop,TS val 17212035 ecr 17212069], length 0
先看上面的第一部分,搜了下资料 https://blog.csdn.net/u010039418/article/details/78234570
得知tcp默认SYN报文最大retry 5次,每次超时了翻倍, 1s -> 3s -> 7s -> 15s -> 31s -> 63s
。只有63秒的时候node的机器上才收到了vxlan的报文。说明pod所在node压根没收到63秒之前的。
一般lvs的dr模式下tcp的时间戳混乱或者其他几个arp的内核参数不对下 SYN
是一直收不到的而不是63秒后有结果,所以和内核相关参数无关。于是同样上面的步骤tcpdump抓包,加上-w选项把抓的包导出下来导入到wireshark里准备看看
报文分析
9153的包wireshark里看63秒前面都是tcp的SYN重传,看到了master上向外发送的vxlan报文的时候有了发现。下面的截图是放github上的,网络不好会看不到截图
可以看到udp的checksum是 0xffff
,我对udp报文不太熟悉,udp的header的checksum没记错的话crc32校验的,不可能是这种两个字节都置1的0xffff,明显就是udp的header的校验出错了。后面几个正常包的checksum都是missing的
wireshark的 编辑
-> 首选项
-> Protocols
-> UDP
的 Validate the UDP checksum if possible
勾上更直观看
解决
搜了下 wireshark linux udp checksum incorrect
,都是推荐把checksum offload disable掉就行了,例如我这里是flannel,则是
$ /sbin/ethtool -K flannel.1 tx-checksum-ip-generic off Actual changes: tx-checksumming: off tx-checksum-ip-generic: off tcp-segmentation-offload: off tx-tcp-segmentation: off [requested on] tx-tcp-ecn-segmentation: off [requested on] tx-tcp6-segmentation: off [requested on] tx-tcp-mangleid-segmentation: off [requested on] udp-fragmentation-offload: off [requested on]
再测下正常
$ time curl -I 10.96.0.10:9153 HTTP/1.1 404 Not Found Content-Type: text/plain; charset=utf-8 X-Content-Type-Options: nosniff Date: Wed, 27 May 2020 02:14:04 GMT Content-Length: 19 real 0m0.009s user 0m0.005s sys 0m0.003s
你以为这样就完了?其实并没有,因为我自己维护了一套ansible部署k8s的方案,每次新版本发布我都会实际测下。并且同事反映了他同样云主机开出来用我ansible部署 v1.17.5
没有这个问题。这就很奇怪了,原因后面说,请接着继续看
什么是checksum offload
资料:
- https://wiki.wireshark.org/CaptureSetup/Offloading
- https://zh.wikipedia.org/wiki/TCP%E6%A0%A1%E9%AA%8C%E5%92%8C%E5%8D%B8%E8%BD%BD
- Checksum Offload)是网络卡的一个功能选项。如果该选项开启,则网络卡会负责计算需要发送或者接收到的TCP消息的校验和,从而节省CPU的计算开销。此时,在需要发送的TCP消息到达网络卡前,系统会在TCP报头的校验和字段填充一个随机值。
但是,尽管TCP校验和卸载能够降低CPU的计算开销,但受到计算能力的限制,某些环境下的一些网络卡计算速度不如主频超过400MHz的CPU快
正文
对照组
很奇怪的就是为啥就是我的ansible部署的二进制就正常没这个问题,而kubeadm部署的就不正常,后面我花时间整了以下几个对照组(期间同事也帮我做了几个条件下的测试,但是不是系统用错了就是版本整错了。。。),终于找到了问题的范围,下面是我自己统计的对照组信息,kubeadm和ansible版本均为 1.17.5
测试。os不重要,因为最终排查出和os无关
os | type(kubeadm or ansible) | flannel version | flannel is running in pod? | will 63 sec |
---|---|---|---|---|
7.6 | kubeadm | v0.11.0 | yes | yes |
7.6 | kubeadm | v0.12.0 | yes | yes |
7.6 | kubeadm | v0.11.0 | no | yes |
7.6 | kubeadm | v0.12.0 | no | yes |
7.6 | ansible | v0.11.0 | yes | no |
7.6 | ansible | v0.12.0 | yes | no |
7.6 | ansible | v0.11.0 | no | no |
7.6 | ansible | v0.12.0 | no | no |
这就看起来很迷了。但是排查出和flannel无关,感觉kube-proxy有关系,然后今天 05/28
针对kube-proxy做了个对照组
os | type(kubeadm or ansible) | kube-proxy version | kube-proxy is running in pod? | will 63 sec |
---|---|---|---|---|
7.6 | kubeadm | v1.17.5 | yes | yes |
7.6 | kubeadm | v1.17.5 | no | no |
7.6 | kubeadm | v1.16.9 | yes | no |
7.6 | kubeadm | v1.16.9 | no | no |
7.6 | ansible | v1.17.5 | yes | yes |
7.6 | ansible | v1.17.5 | no | no |
可以看出就是1.17以上的版本如果使用pod则会有这个问题,而非pod则不会,github上compare了下 v1.17.0
和 v1.16.3
https://github.com/kubernetes/kubernetes/compare/v1.16.3…v1.17.0
发现了Dockerfile的改动 https://github.com/kubernetes/kubernetes/commit/fed582333f639dc22e879f4bbb258e403c210c30
见上面的commit 1.17.0里的Dockerfile是没有安装 iptabes
,而是利用脚本去替代 iptables
,hack下镜像验证下
FROM registry.aliyuncs.com/google_containers/kube-proxy:v1.17.5 RUN rm -f /usr/sbin/iptables && clean-install iptables
构建的镜像推送到了dockerhub上 zhangguanzhang/hack-kube-proxy:v1.17.5
$ kubectl -n kube-system get ds kube-proxy -o yaml | grep image: image: zhangguanzhang/hack-kube-proxy:v1.17.5
测试访问成功
time curl -I 10.96.0.10:9153 HTTP/1.1 404 Not Found Content-Type: text/plain; charset=utf-8 X-Content-Type-Options: nosniff Date: Thu, 28 May 2020 04:47:21 GMT Content-Length: 19 real 0m0.008s user 0m0.003s sys 0m0.003s
对于这个问题我在flannel的pr下面也参与了回复, https://github.com/coreos/flannel/pull/1282