How do I handle the hourly Bigtable connection shutdowns?

Problem description:

I have Go services with a persistent Bigtable client. The services make hundreds of read/write operations on Bigtable per second.

Every hour after the service boots, I experience hundreds of errors like this one:

    Retryable error: rpc error: code = Unavailable desc = the connection is draining, retrying in 74.49241ms

The errors are followed by an increase in processing time that I cannot afford when they occur.

I was able to figure out that the Bigtable client uses a pool of gRPC connections.

It seems that the Bigtable gRPC server has a connection maxAge of 1 hour, which would explain the error above and the increased processing time during reconnection.

A maxAgeGrace configuration is supposed to give additional time to complete in-flight operations and to avoid all pool connections terminating at the same time.
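
For context, this is the kind of server-side gRPC keepalive configuration I am assuming is in play. It is illustrative only; I obviously cannot see Bigtable's actual settings, and the values here are guesses based on the behaviour I observe:

    package main

    import (
        "net"
        "time"

        "google.golang.org/grpc"
        "google.golang.org/grpc/keepalive"
    )

    // Illustrative only: how a gRPC server limits connection lifetime. I assume
    // Bigtable's frontend uses something equivalent to MaxConnectionAge of about
    // 1 hour, with MaxConnectionAgeGrace letting in-flight RPCs finish.
    func main() {
        srv := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
            MaxConnectionAge:      time.Hour,        // connection is gracefully closed after this age
            MaxConnectionAgeGrace: 30 * time.Second, // extra time for pending RPCs before a hard close
        }))
        lis, err := net.Listen("tcp", ":0")
        if err != nil {
            panic(err)
        }
        _ = srv.Serve(lis)
    }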

I increased the connection pool size from the default of 4 to 12, with no real benefit.
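
For reference, this is roughly how I raise the pool size (a minimal sketch; the project and instance IDs are placeholders, and WithGRPCConnectionPool comes from google.golang.org/api/option):

    package main

    import (
        "context"

        "cloud.google.com/go/bigtable"
        "google.golang.org/api/option"
    )

    func main() {
        ctx := context.Background()
        // WithGRPCConnectionPool raises the number of underlying gRPC
        // connections; the Go Bigtable client defaults to 4.
        client, err := bigtable.NewClient(ctx, "my-project", "my-instance", // placeholder IDs
            option.WithGRPCConnectionPool(12))
        if err != nil {
            panic(err)
        }
        defer client.Close()
        // ... hundreds of reads/writes per second against client ...
    }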

How do I prevent processing time from increasing during reconnections and stop these errors from happening, given that my traffic will keep growing?

Cloud Bigtable clients use a pool of gRPC connections to connect to Bigtable. The Java client uses a channel pool per HBase connection, and each channel pool has multiple gRPC connections. gRPC connections are shut down every hour (or after 15 minutes of inactivity), and the underlying gRPC infrastructure performs a reconnect. The first request on each new connection performs a number of setup tasks, such as TLS handshakes and warming server-side caches. These operations are fairly expensive and may cause the latency spikes you are seeing.

Bigtable is designed to be a high-throughput system, and with sustained query volume the amortized cost of these reconnections should be negligible. However, if the client application has very low QPS or long idle periods between queries and cannot tolerate these latency spikes, it can create a new HBase connection (Java) or a new CBT client (Go) every 30-40 minutes and run no-op calls (exists on the HBase client, or read a small row) on the new connection/client to prime the underlying gRPC connections (one call per connection; for HBase the default is twice the number of CPUs, and Go has 4 connections by default). Once primed, you can swap in the new connection/client for the main operations in the client application. Here is sample Go code for this workaround.
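
(A minimal sketch of the priming-and-swapping idea, assuming the standard cloud.google.com/go/bigtable client; the table name, priming row key, rotation interval, and project/instance IDs are placeholders to adapt.)

    package main

    import (
        "context"
        "log"
        "sync/atomic"
        "time"

        "cloud.google.com/go/bigtable"
        "google.golang.org/api/option"
    )

    const poolSize = 4 // Go client default; issue one priming call per pooled connection

    // newPrimedClient creates a fresh Bigtable client and reads one small row per
    // gRPC connection so TLS handshakes and server-side cache warm-up happen
    // before the client is used for real traffic.
    func newPrimedClient(ctx context.Context, project, instance string) (*bigtable.Client, error) {
        client, err := bigtable.NewClient(ctx, project, instance,
            option.WithGRPCConnectionPool(poolSize))
        if err != nil {
            return nil, err
        }
        tbl := client.Open("my-table") // placeholder table name
        for i := 0; i < poolSize; i++ {
            if _, err := tbl.ReadRow(ctx, "priming-row-key"); err != nil { // placeholder row key
                log.Printf("priming read %d: %v", i, err)
            }
        }
        return client, nil
    }

    // rotateClient periodically swaps in a freshly primed client so the hourly
    // server-side connection shutdown never hits the client serving real traffic.
    // Callers read the current client with current.Load().(*bigtable.Client).
    func rotateClient(ctx context.Context, current *atomic.Value, project, instance string, every time.Duration) {
        for {
            time.Sleep(every)
            fresh, err := newPrimedClient(ctx, project, instance)
            if err != nil {
                log.Printf("client rotation failed: %v", err)
                continue
            }
            if old, _ := current.Swap(fresh).(*bigtable.Client); old != nil {
                old.Close() // in production, drain in-flight requests before closing
            }
        }
    }

    func main() {
        ctx := context.Background()
        var current atomic.Value
        client, err := newPrimedClient(ctx, "my-project", "my-instance") // placeholder IDs
        if err != nil {
            log.Fatal(err)
        }
        current.Store(client)
        go rotateClient(ctx, &current, "my-project", "my-instance", 35*time.Minute)
        // ... use current.Load().(*bigtable.Client) for the real reads/writes ...
        select {}
    }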

I suspect that this may be due to a bug that was introduced in a recent grpc-go release and has just been fixed. Basically, instead of reconnecting immediately when a connection goes away, we incorrectly wait 1s before reconnecting. Please try again with grpc-go master head. Thanks!
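
If you are using Go modules, pulling master head looks something like this:

    go get google.golang.org/grpc@master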