The graceful shutdown problem in Golang HTTP services

Background

If a web service can isolate its own changes (deploys, restarts) from the traffic it serves, that is good for its stability and availability; this is what "graceful" restart mechanisms are for. Implementations differ, but the principle is much the same. The specific implementation I use is at github.com/cgCodeLife/… . While testing it I noticed that during a graceful hot upgrade of the service, the Golang client would occasionally see errors such as EOF, read: connection reset by peer, or the connection being closed while idle. This article analyzes the underlying cause based on my own test code and observations, and is meant as a public place for discussion; I hope someone can offer good suggestions.

Conclusion

Although a web service can implement graceful shutdown, it is not ideal: the client will occasionally run into connection errors, regardless of the level of concurrency.

Test environment

golang client

golang version: 1.10

HTTP protocol: 1.1

Keep-alive: tried both with and without

Concurrency: tried both 1 and 30

Requests per connection: 1 and 1000; in the runs with a single request per connection, no connection errors were observed on the client

Request: POST with a string of a dozen or so bytes

golang server

golang version: 1.10

Response data: the server's own PID, about 7 bytes

Problem analysis

Golang client code

package main

import (
    "bytes"
    "fmt"
    "io/ioutil"
    "net/http"
    "sync"

    log "github.com/sirupsen/logrus"
)

func main() {
    var wg sync.WaitGroup
    var count int
    var rw sync.RWMutex
TEST:
    for i := 0; i < 30; i++ { // 30 concurrent clients
        wg.Add(1)
        go func() {
            defer wg.Done()
            // Each goroutine gets its own Transport, i.e. its own
            // connection pool; keep-alive is left enabled.
            tr := http.Transport{DisableKeepAlives: false}
            client := &http.Client{Transport: &tr}
            for i := 0; i < 1000; i++ { // 1000 requests per connection
                f, err := ioutil.ReadFile("data")
                if err != nil {
                    fmt.Println("read file err", err)
                    return
                }
                fmt.Println(len(f))
                reader := bytes.NewReader(f)
                rw.Lock()
                count += 1
                index := count
                rw.Unlock()
                resp, err := client.Post("http://0.0.0.0:8888", "application/x-www-form-urlencoded", reader)
                if err != nil {
                    rw.RLock()
                    currentCount := count
                    rw.RUnlock()
                    // Abort the whole test on the first connection error.
                    log.Fatal(err, index, currentCount)
                }
                data, err := ioutil.ReadAll(resp.Body)
                // Close immediately instead of defer: a defer inside this
                // loop would pile up and hold every response body open
                // until the goroutine exits.
                resp.Body.Close()
                if err != nil {
                    log.Fatal(err)
                }
                log.Printf("data[%s]", string(data))
            }
        }()
    }
    wg.Wait()
    goto TEST // repeat the whole batch indefinitely
}

Golang server code

package main

import (
    "fmt"
    "io/ioutil"
    "net/http"
    "os"
    "strconv"

    graceful "github.com/cgCodeLife/graceful2"
    log "github.com/sirupsen/logrus"
)

func main() {
    server := graceful.NewServer()
    handler := http.HandlerFunc(handle)
    server.Register("0.0.0.0:8888", handler)
    err := server.Run()
    if err != nil {
        log.Fatal(err)
    }
}

func handle(w http.ResponseWriter, r *http.Request) {
    defer r.Body.Close()
    _, err := ioutil.ReadAll(r.Body)
    if err != nil {
        // Printf, not Println: the message contains format verbs.
        fmt.Printf("read body error[%s] pid[%d]\n", err, os.Getpid())
    }

    // Reply with this worker's PID so the client can tell which
    // process served each request across a hot upgrade.
    w.Write([]byte(strconv.Itoa(os.Getpid())))
}

Screenshots from the experiments

One connection, one request, concurrency 1

One connection, 1000 requests, concurrency 1

One connection, one request, concurrency 30 (connection resources should have been exhausted here, yet no EOF or reset style errors were triggered)

One connection, 1000 requests, concurrency 30

Here is a brief description of how the graceful implementation I use works: it follows a master-worker model. The master stays resident, only handling signals and sending a terminate signal to the worker; the worker runs the web service and performs a shutdown once it receives that signal. That is all the logic there is.
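To make that concrete, here is a minimal sketch of just the worker side, built on the standard library only. The real implementation (including the listener handoff from the master) lives in github.com/cgCodeLife/graceful2; the 5-second drain timeout below is an arbitrary choice of mine.

package main

import (
    "context"
    "log"
    "net/http"
    "os"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    srv := &http.Server{
        Addr: "0.0.0.0:8888",
        Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("ok"))
        }),
    }

    done := make(chan struct{})
    go func() {
        // The master delivers SIGTERM to this worker process.
        sigCh := make(chan os.Signal, 1)
        signal.Notify(sigCh, syscall.SIGTERM)
        <-sigCh

        // Close the listeners, then poll until all connections are
        // idle, giving up after 5 seconds.
        ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        defer cancel()
        if err := srv.Shutdown(ctx); err != nil {
            log.Printf("shutdown: %v", err)
        }
        close(done)
    }()

    // ListenAndServe returns ErrServerClosed as soon as Shutdown
    // starts, so block until the drain actually finishes.
    if err := srv.ListenAndServe(); err != http.ErrServerClosed {
        log.Fatal(err)
    }
    <-done
}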

Let's look at the Shutdown code (src/net/http/server.go, starting at line 2536):

// shutdownPollInterval is how often we poll for quiescence
// during Server.Shutdown. This is lower during tests, to
// speed up tests.
// Ideally we could find a solution that doesn't involve polling,
// but which also doesn't have a high runtime cost (and doesn't
// involve any contentious mutexes), but that is left as an
// exercise for the reader.
var shutdownPollInterval = 500 * time.Millisecond

// Shutdown gracefully shuts down the server without interrupting any
// active connections. Shutdown works by first closing all open
// listeners, then closing all idle connections, and then waiting
// indefinitely for connections to return to idle and then shut down.
// If the provided context expires before the shutdown is complete,
// Shutdown returns the context's error, otherwise it returns any
// error returned from closing the Server's underlying Listener(s).
//
// When Shutdown is called, Serve, ListenAndServe, and
// ListenAndServeTLS immediately return ErrServerClosed. Make sure the
// program doesn't exit and waits instead for Shutdown to return.
//
// Shutdown does not attempt to close nor wait for hijacked
// connections such as WebSockets. The caller of Shutdown should
// separately notify such long-lived connections of shutdown and wait
// for them to close, if desired. See RegisterOnShutdown for a way to
// register shutdown notification functions.
func (srv *Server) Shutdown(ctx context.Context) error {
    atomic.AddInt32(&srv.inShutdown, 1)
    defer atomic.AddInt32(&srv.inShutdown, -1)

    srv.mu.Lock()
    lnerr := srv.closeListenersLocked()
    srv.closeDoneChanLocked()
    for _, f := range srv.onShutdown {
        go f()
    }
    srv.mu.Unlock()

    ticker := time.NewTicker(shutdownPollInterval)
    defer ticker.Stop()
    for {
        if srv.closeIdleConns() {
            return lnerr
        }
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-ticker.C:
        }
    }
}

Shutdown does two main things:

1. Stop listening (close the listeners)

2. Close all idle connections (sketched below)
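For reference, step 2 is done by closeIdleConns in the same file. The sketch below is paraphrased from memory of the Go 1.10 source, so treat it as approximate rather than verbatim: it only closes connections whose recorded state is StateIdle, and reports whether everything else has drained.

func (s *Server) closeIdleConns() bool {
    s.mu.Lock()
    defer s.mu.Unlock()
    quiescent := true
    for c := range s.activeConn {
        st, ok := c.curState.Load().(ConnState)
        if !ok || st != StateIdle {
            // An active connection keeps Shutdown polling.
            quiescent = false
            continue
        }
        c.rwc.Close()
        delete(s.activeConn, c)
    }
    return quiescent
}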

Where do idle connections come from? When a request arrives, the conn is marked active, and it is set back to idle only once the current request has been fully handled. So, as I understand it, when many requests are sent over a single connection, it is quite likely that at the moment Shutdown scans, a request is still being processed in the handler and the conn is in the active state; that is the situation in which the client runs into connection problems.
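This active/idle bookkeeping can be observed from the outside through the standard http.Server.ConnState hook. The toy server below (my own illustrative sketch) logs every transition; with a keep-alive client you can watch a connection flip to StateActive while a request is in the handler and back to StateIdle after the response is written, which is exactly the state closeIdleConns keys off.

package main

import (
    "log"
    "net"
    "net/http"
)

func main() {
    srv := &http.Server{
        Addr: "0.0.0.0:8888",
        Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("ok"))
        }),
        // ConnState fires on every connection state transition:
        // StateNew -> StateActive -> StateIdle -> ... -> StateClosed.
        ConnState: func(c net.Conn, s http.ConnState) {
            log.Printf("conn %s -> %s", c.RemoteAddr(), s)
        },
    }
    log.Fatal(srv.ListenAndServe())
}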

Verification

Everything above is a hypothesis based on the observed behavior, so how do we prove it? The verification experiments are as follows.

One connection sending continuously; the packet capture is below (8888 is the server port):

Clearly, the service was gone before the server side had finished releasing its connections.

One connection, sending continuously with a 1-second interval between requests:

A complete four-way teardown (FIN handshake).

Solution

My view is that it is not realistic to avoid this problem entirely on the server side, because the client holds the initiative over when requests are sent; so the client has to cooperate and handle these network errors itself.
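Concretely, the client can treat these errors as retryable. The sketch below is one hedged way to do it; postWithRetry, the retry count, and the backoff are all my own illustrative choices, and blindly retrying a POST is only safe if the request is idempotent or the server deduplicates it.

package main

import (
    "bytes"
    "fmt"
    "io/ioutil"
    "net/http"
    "time"
)

// postWithRetry retries a POST a few times when the connection is
// torn down underneath us (EOF, connection reset by peer, etc.).
func postWithRetry(client *http.Client, url string, body []byte) (*http.Response, error) {
    var lastErr error
    for attempt := 0; attempt < 3; attempt++ {
        resp, err := client.Post(url, "application/x-www-form-urlencoded",
            bytes.NewReader(body))
        if err == nil {
            return resp, nil
        }
        lastErr = err
        time.Sleep(100 * time.Millisecond) // crude backoff
    }
    return nil, lastErr
}

func main() {
    client := &http.Client{Transport: &http.Transport{}}
    resp, err := postWithRetry(client, "http://0.0.0.0:8888", []byte("hello"))
    if err != nil {
        fmt.Println("request failed after retries:", err)
        return
    }
    defer resp.Body.Close()
    data, _ := ioutil.ReadAll(resp.Body)
    fmt.Printf("data[%s]\n", data)
}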