1.17+ k8s使用cni的vxlan模式下63秒超时的一次定位

flannel的vxlan和calco的vxlan模式都有这个问题,原因是v1.17+的k8s会引起内核的某个bug而不是cni的软件层面,weave没有,后面说。

写这篇文章的日期是 05/28 ,发现是上周五也就是 05/23 号,文章从时间线写起(因为很多时候想发文章但是没空,所以文章的发布日期是 05/23 )

由来

上周五我经过同事的工位看到同事的桌面是k8s的get po的输出,问他咋开始学k8s了,他说跟着视频学下,看了下用的kubeadm部署了一套 1.18.2 的集群。1.18的kube-proxy的ipvs包的parseIP有bug,我推荐他换 v1.17.5 。他当时在部署一个入门的svc实验,换了后还是无法解析域名(实际上后面对照组 v1.18.2 也解析不了)。使用dig命令排查了下,下面是对照:

dig @ +short kubernetes.default.svc.cluster.local
dig @10.96.0.10 +short kubernetes.default.svc.cluster.local

很多市面上的kubeadm部署教程都是直接kubeadm init,所以我推荐同事去按照我文章的kubeadm部署一套后再试试,叫他用v1.17的最新版本 v1.17.5 ,结果还是上面一样。coredns实际上还有metrics的http接口,从http层测了下:

curl -I 10.96.0.10:9153/metrics
curl -I :9153/metrics

涉及到本次排查的信息为

$ kubectl get node -o wide
NAME     STATUS   ROLES    AGE    VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION          CONTAINER-RUNTIME
master   Ready    master   7d8h   v1.18.2   10.0.100.3            CentOS Linux 7 (Core)   3.10.0-957.el7.x86_64   docker://19.3.8
node1    Ready       7d7h   v1.18.2   10.0.100.4            CentOS Linux 7 (Core)   3.10.0-957.el7.x86_64   docker://19.3.8
node2    Ready       7d7h   v1.18.2   10.0.100.15           CentOS Linux 7 (Core)   3.10.0-957.el7.x86_64   docker://19.3.8

$ kubectl get po -o wide -n kube-system -l k8s-app=kube-dns
NAME                       READY   STATUS    RESTARTS   AGE   IP            NODE    NOMINATED NODE   READINESS GATES
coredns-546565776c-v5wwg   1/1     Running   2          25h   10.244.2.73   node2              

多次尝试发现很久的时间都是一样,用time命令观察了下一直是63秒返回。包括其他任何svc都是这样

$ time    curl -I 10.96.0.10:9153/metrics
HTTP/1.1 200 OK
Content-Type: text/plain; version=0.0.4; charset=utf-8
Date: Wed, 25 May 2020 08:39:35 GMT


real    1m3.091s
user    0m0.002s
sys 0m0.007s

proxyMode是ipvs,用ipvsadm看下超时的时候的状态,一直是 SYN_RECV ,也就是发送了SYN,没收到回包

# ipvsadm -lnc |& grep 9153
TCP 00:59  SYN_RECV    10.96.0.10:41282   10.96.0.10:9153    10.244.2.73:9153

抓包

因为cni使用的flannel,用的vxlan模式。master上抓9153和flannel.1的8472端口,coredns的pod所在node上抓flannel的vxlan包,下面三个是对应的:

[root@master /root]# tcpdump -nn -i flannel.1 port 9153
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on flannel.1, link-type EN10MB (Ethernet), capture size 262144 bytes
16:30:56.705696 IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [S], seq 911217171, win 43690, options [mss 65495,sackOK,TS val 17148909 ecr 0,nop,wscale 7], length 0
16:30:57.708489 IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [S], seq 911217171, win 43690, options [mss 65495,sackOK,TS val 17149912 ecr 0,nop,wscale 7], length 0
16:30:59.712458 IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [S], seq 911217171, win 43690, options [mss 65495,sackOK,TS val 17151916 ecr 0,nop,wscale 7], length 0
16:31:03.716441 IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [S], seq 911217171, win 43690, options [mss 65495,sackOK,TS val 17155920 ecr 0,nop,wscale 7], length 0
16:31:11.732562 IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [S], seq 911217171, win 43690, options [mss 65495,sackOK,TS val 17163936 ecr 0,nop,wscale 7], length 0
16:31:27.764498 IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [S], seq 911217171, win 43690, options [mss 65495,sackOK,TS val 17179968 ecr 0,nop,wscale 7], length 0
16:31:59.828493 IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [S], seq 911217171, win 43690, options [mss 65495,sackOK,TS val 17212032 ecr 0,nop,wscale 7], length 0
16:31:59.829565 IP 10.244.2.73.9153 > 10.244.0.0.2201: Flags [S.], seq 435819916, ack 911217172, win 27960, options [mss 1410,sackOK,TS val 17212067 ecr 17212032,nop,wscale 7], length 0
16:31:59.829611 IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [.], ack 1, win 342, options [nop,nop,TS val 17212033 ecr 17212067], length 0
16:31:59.829714 IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [P.], seq 1:88, ack 1, win 342, options [nop,nop,TS val 17212033 ecr 17212067], length 87
16:31:59.829897 IP 10.244.2.73.9153 > 10.244.0.0.2201: Flags [.], ack 88, win 219, options [nop,nop,TS val 17212067 ecr 17212033], length 0
16:31:59.831300 IP 10.244.2.73.9153 > 10.244.0.0.2201: Flags [P.], seq 1:113, ack 88, win 219, options [nop,nop,TS val 17212069 ecr 17212033], length 112
16:31:59.831322 IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [.], ack 113, win 342, options [nop,nop,TS val 17212034 ecr 17212069], length 0
16:31:59.831435 IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [F.], seq 88, ack 113, win 342, options [nop,nop,TS val 17212035 ecr 17212069], length 0
16:31:59.831633 IP 10.244.2.73.9153 > 10.244.0.0.2201: Flags [F.], seq 113, ack 89, win 219, options [nop,nop,TS val 17212069 ecr 17212035], length 0
16:31:59.831660 IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [.], ack 114, win 342, options [nop,nop,TS val 17212035 ecr 17212069], length 0

[root@master /root]# tcpdump -nn -i eth0 port 8472
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
16:30:56.705718 IP 10.0.100.3.48683 > 10.0.100.15.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [S], seq 911217171, win 43690, options [mss 65495,sackOK,TS val 17148909 ecr 0,nop,wscale 7], length 0
16:30:57.708523 IP 10.0.100.3.48683 > 10.0.100.15.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [S], seq 911217171, win 43690, options [mss 65495,sackOK,TS val 17149912 ecr 0,nop,wscale 7], length 0
16:30:59.712478 IP 10.0.100.3.48683 > 10.0.100.15.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [S], seq 911217171, win 43690, options [mss 65495,sackOK,TS val 17151916 ecr 0,nop,wscale 7], length 0
16:31:03.716452 IP 10.0.100.3.48683 > 10.0.100.15.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [S], seq 911217171, win 43690, options [mss 65495,sackOK,TS val 17155920 ecr 0,nop,wscale 7], length 0
16:31:11.732590 IP 10.0.100.3.48683 > 10.0.100.15.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [S], seq 911217171, win 43690, options [mss 65495,sackOK,TS val 17163936 ecr 0,nop,wscale 7], length 0
16:31:27.764513 IP 10.0.100.3.48683 > 10.0.100.15.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [S], seq 911217171, win 43690, options [mss 65495,sackOK,TS val 17179968 ecr 0,nop,wscale 7], length 0
16:31:59.828541 IP 10.0.100.3.56618 > 10.0.100.15.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [S], seq 911217171, win 43690, options [mss 65495,sackOK,TS val 17212032 ecr 0,nop,wscale 7], length 0
16:31:59.829521 IP 10.0.100.15.56771 > 10.0.100.3.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.244.2.73.9153 > 10.244.0.0.2201: Flags [S.], seq 435819916, ack 911217172, win 27960, options [mss 1410,sackOK,TS val 17212067 ecr 17212032,nop,wscale 7], length 0
16:31:59.829617 IP 10.0.100.3.56618 > 10.0.100.15.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [.], ack 1, win 342, options [nop,nop,TS val 17212033 ecr 17212067], length 0
16:31:59.829729 IP 10.0.100.3.56618 > 10.0.100.15.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [P.], seq 1:88, ack 1, win 342, options [nop,nop,TS val 17212033 ecr 17212067], length 87
16:31:59.829883 IP 10.0.100.15.34571 > 10.0.100.3.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.244.2.73.9153 > 10.244.0.0.2201: Flags [.], ack 88, win 219, options [nop,nop,TS val 17212067 ecr 17212033], length 0
16:31:59.831292 IP 10.0.100.15.34571 > 10.0.100.3.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.244.2.73.9153 > 10.244.0.0.2201: Flags [P.], seq 1:113, ack 88, win 219, options [nop,nop,TS val 17212069 ecr 17212033], length 112
16:31:59.831327 IP 10.0.100.3.56618 > 10.0.100.15.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [.], ack 113, win 342, options [nop,nop,TS val 17212034 ecr 17212069], length 0
16:31:59.831448 IP 10.0.100.3.56618 > 10.0.100.15.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [F.], seq 88, ack 113, win 342, options [nop,nop,TS val 17212035 ecr 17212069], length 0
16:31:59.831612 IP 10.0.100.15.34571 > 10.0.100.3.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.244.2.73.9153 > 10.244.0.0.2201: Flags [F.], seq 113, ack 89, win 219, options [nop,nop,TS val 17212069 ecr 17212035], length 0
16:31:59.831665 IP 10.0.100.3.56618 > 10.0.100.15.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [.], ack 114, win 342, options [nop,nop,TS val 17212035 ecr 17212069], length 0

[root@node2 /root]# tcpdump -nn  -i eth0 port 8472
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
16:31:59.836137 IP 10.0.100.3.56618 > 10.0.100.15.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [S], seq 911217171, win 43690, options [mss 65495,sackOK,TS val 17212032 ecr 0,nop,wscale 7], length 0
16:31:59.836328 IP 10.0.100.15.56771 > 10.0.100.3.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.244.2.73.9153 > 10.244.0.0.2201: Flags [S.], seq 435819916, ack 911217172, win 27960, options [mss 1410,sackOK,TS val 17212067 ecr 17212032,nop,wscale 7], length 0
16:31:59.836811 IP 10.0.100.3.56618 > 10.0.100.15.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [.], ack 1, win 342, options [nop,nop,TS val 17212033 ecr 17212067], length 0
16:31:59.836910 IP 10.0.100.3.56618 > 10.0.100.15.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [P.], seq 1:88, ack 1, win 342, options [nop,nop,TS val 17212033 ecr 17212067], length 87
16:31:59.836951 IP 10.0.100.15.34571 > 10.0.100.3.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.244.2.73.9153 > 10.244.0.0.2201: Flags [.], ack 88, win 219, options [nop,nop,TS val 17212067 ecr 17212033], length 0
16:31:59.838385 IP 10.0.100.15.34571 > 10.0.100.3.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.244.2.73.9153 > 10.244.0.0.2201: Flags [P.], seq 1:113, ack 88, win 219, options [nop,nop,TS val 17212069 ecr 17212033], length 112
16:31:59.838522 IP 10.0.100.3.56618 > 10.0.100.15.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [.], ack 113, win 342, options [nop,nop,TS val 17212034 ecr 17212069], length 0
16:31:59.838621 IP 10.0.100.3.56618 > 10.0.100.15.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [F.], seq 88, ack 113, win 342, options [nop,nop,TS val 17212035 ecr 17212069], length 0
16:31:59.838703 IP 10.0.100.15.34571 > 10.0.100.3.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.244.2.73.9153 > 10.244.0.0.2201: Flags [F.], seq 113, ack 89, win 219, options [nop,nop,TS val 17212069 ecr 17212035], length 0
16:31:59.838836 IP 10.0.100.3.56618 > 10.0.100.15.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.244.0.0.2201 > 10.244.2.73.9153: Flags [.], ack 114, win 342, options [nop,nop,TS val 17212035 ecr 17212069], length 0

先看上面的第一部分,搜了下资料 https://blog.csdn.net/u010039418/article/details/78234570

得知tcp默认SYN报文最大retry 5次,每次超时了翻倍, 1s -> 3s -> 7s -> 15s -> 31s -> 63s 。只有63秒的时候node的机器上才收到了vxlan的报文。说明pod所在node压根没收到63秒之前的。

一般lvs的dr模式下tcp的时间戳混乱或者其他几个arp的内核参数不对下 SYN 是一直收不到的而不是63秒后有结果,所以和内核相关参数无关。于是同样上面的步骤tcpdump抓包,加上-w选项把抓的包导出下来导入到wireshark里准备看看

报文分析

9153的包wireshark里看63秒前面都是tcp的SYN重传,看到了master上向外发送的vxlan报文的时候有了发现。下面的截图是放github上的,网络不好会看不到截图

可以看到udp的checksum是 0xffff ,我对udp报文不太熟悉,udp的header的checksum没记错的话crc32校验的,不可能是这种两个字节都置1的0xffff,明显就是udp的header的校验出错了。后面几个正常包的checksum都是missing的

wireshark的 编辑 -> 首选项 -> Protocols -> UDPValidate the UDP checksum if possible 勾上更直观看

解决

搜了下 wireshark linux udp checksum incorrect ,都是推荐把checksum offload disable掉就行了,例如我这里是flannel,则是

$ /sbin/ethtool -K flannel.1 tx-checksum-ip-generic off
Actual changes:
tx-checksumming: off
    tx-checksum-ip-generic: off
tcp-segmentation-offload: off
    tx-tcp-segmentation: off [requested on]
    tx-tcp-ecn-segmentation: off [requested on]
    tx-tcp6-segmentation: off [requested on]
    tx-tcp-mangleid-segmentation: off [requested on]
udp-fragmentation-offload: off [requested on]

再测下正常

$ time curl -I 10.96.0.10:9153
HTTP/1.1 404 Not Found
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Wed, 27 May 2020 02:14:04 GMT
Content-Length: 19


real    0m0.009s
user    0m0.005s
sys 0m0.003s

你以为这样就完了?其实并没有,因为我自己维护了一套ansible部署k8s的方案,每次新版本发布我都会实际测下。并且同事反映了他同样云主机开出来用我ansible部署 v1.17.5 没有这个问题。这就很奇怪了,原因后面说,请接着继续看

什么是checksum offload

资料:

正文

对照组

很奇怪的就是为啥就是我的ansible部署的二进制就正常没这个问题,而kubeadm部署的就不正常,后面我花时间整了以下几个对照组(期间同事也帮我做了几个条件下的测试,但是不是系统用错了就是版本整错了。。。),终于找到了问题的范围,下面是我自己统计的对照组信息,kubeadm和ansible版本均为 1.17.5 测试。os不重要,因为最终排查出和os无关

os type(kubeadm or ansible) flannel version flannel is running in pod? will 63 sec
7.6 kubeadm v0.11.0 yes yes
7.6 kubeadm v0.12.0 yes yes
7.6 kubeadm v0.11.0 no yes
7.6 kubeadm v0.12.0 no yes
7.6 ansible v0.11.0 yes no
7.6 ansible v0.12.0 yes no
7.6 ansible v0.11.0 no no
7.6 ansible v0.12.0 no no

这就看起来很迷了。但是排查出和flannel无关,感觉kube-proxy有关系,然后今天 05/28 针对kube-proxy做了个对照组

os type(kubeadm or ansible) kube-proxy version kube-proxy is running in pod? will 63 sec
7.6 kubeadm v1.17.5 yes yes
7.6 kubeadm v1.17.5 no no
7.6 kubeadm v1.16.9 yes no
7.6 kubeadm v1.16.9 no no
7.6 ansible v1.17.5 yes yes
7.6 ansible v1.17.5 no no

可以看出就是1.17以上的版本如果使用pod则会有这个问题,而非pod则不会,github上compare了下 v1.17.0v1.16.3 https://github.com/kubernetes/kubernetes/compare/v1.16.3…v1.17.0

发现了Dockerfile的改动 https://github.com/kubernetes/kubernetes/commit/fed582333f639dc22e879f4bbb258e403c210c30

见上面的commit 1.17.0里的Dockerfile是没有安装 iptabes ,而是利用脚本去替代 iptables ,hack下镜像验证下

FROM registry.aliyuncs.com/google_containers/kube-proxy:v1.17.5
RUN rm -f /usr/sbin/iptables && 
    clean-install iptables

构建的镜像推送到了dockerhub上 zhangguanzhang/hack-kube-proxy:v1.17.5

$ kubectl -n kube-system get ds kube-proxy -o yaml | grep image:
        image: zhangguanzhang/hack-kube-proxy:v1.17.5

测试访问成功

 time curl -I 10.96.0.10:9153
HTTP/1.1 404 Not Found
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Thu, 28 May 2020 04:47:21 GMT
Content-Length: 19


real    0m0.008s
user    0m0.003s
sys 0m0.003s

对于这个问题我在flannel的pr下面也参与了回复, https://github.com/coreos/flannel/pull/1282