在线等，Kafka如果丢了消息怎么办？

<div id="js_content">
点击上方码猿技术专栏轻松关注，设为星标！ 
及时获取有趣有料的技术
<img src="https://beijingoptbbs.oss-cn-beijing.aliyuncs.com/cs/5606289-502108076712b96399d531ea10adc544.png">
“
Kafka 存在丢消息的问题，消息丢失会发生在 Broker，Producer 和 Consumer 三种。
<img src="https://beijingoptbbs.oss-cn-beijing.aliyuncs.com/cs/5606289-ec2f249365f102d1dceeed94c50165eb.png">
Broker
Broker 丢失消息是由于 Kafka 本身的原因造成的，Kafka 为了得到更高的性能和吞吐量，将数据异步批量的存储在磁盘中。 
消息的刷盘过程，为了提高性能，减少刷盘次数，Kafka 采用了批量刷盘的做法。即，按照一定的消息量，和时间间隔进行刷盘。 
这种机制也是由于 Linux 操作系统决定的。将数据存储到 Linux 操作系统种，会先存储到页缓存（Page cache）中，按照时间或者其他条件进行刷盘（从 Page Cache 到 file），或者通过 fsync 命令强制刷盘。 
数据在Page Cache中时，如果系统挂掉，数据会丢失。 
<img src="https://beijingoptbbs.oss-cn-beijing.aliyuncs.com/cs/5606289-51149214c1c1c8cab29bcc97350bc49d.png">
Broker 在 Linux 服务器上高速读写以及同步到 Replica
上图简述了 Broker 写数据以及同步的一个过程。Broker 写数据只写到 Page Cache 中，而 Page Cache 位于内存。 
这部分数据在断电后是会丢失的。Page Cache 的数据通过 Linux 的 flusher 程序进行刷盘。 
刷盘触发条件有三： 
<ul><li>主动调用 sync 或 fsync 函数。</li><li>可用内存低于阈值。</li><li>dirty data 时间达到阈值。dirty 是 Page Cache 的一个标识位，当有数据写入到 Page Cache 时，Page Cache 被标注为 dirty，数据刷盘以后，dirty 标志清除。</li></ul>
Broker 配置刷盘机制，是通过调用 fsync 函数接管了刷盘动作。从单个 Broker 来看，Page Cache 的数据会丢失。 
Kafka 没有提供同步刷盘的方式。同步刷盘在 RocketMQ 中有实现，实现原理是将异步刷盘的流程进行阻塞，等待响应，类似 Ajax 的 callback 或者是 Java 的 future。 
下面是一段 RocketMQ 的源码：
<pre class="blockcode"><code class="language-go">GroupCommitRequest request = new GroupCommitRequest(result.getWroteOffset() + result.getWroteBytes());
service.putRequest(request);
boolean flushOK = request.waitForFlush(this.defaultMessageStore.getMessageStoreConfig().getSyncFlushTimeout()); // 刷盘
</code></pre>
也就是说，理论上，要完全让 Kafka 保证单个 Broker 不丢失消息是做不到的，只能通过调整刷盘机制的参数缓解该情况。 
比如，减少刷盘间隔，减少刷盘数据量大小。时间越短，性能越差，可靠性越好（尽可能可靠）。这是一个选择题。
为了解决该问题，Kafka 通过 Producer 和 Broker 协同处理单个 Broker 丢失参数的情况。 
一旦 Producer 发现 Broker 消息丢失，即可自动进行 retry。除非 retry 次数超过阈值（可配置），消息才会丢失。 
此时需要生产者客户端手动处理该情况。那么 Producer 是如何检测到数据丢失的呢？是通过 ack 机制，类似于 http 的三次握手的方式。
The number of acknowledgments the producer requires the leader to have received before considering a request complete. This controls the durability of records that are sent. The following settings are allowed: 
acks=0 If set to zero then the producer will not wait for any acknowledgment from the server at all. The record will be immediately added to the socket buffer and considered sent. No guarantee can be made that the server has received the record in this case, and the retries configuration will not take effect (as the client won’t generally know of any failures). The offset given back for each record will always be set to -1. 
acks=1 This will mean the leader will write the record to its local log but will respond without awaiting full acknowledgement from all followers. In this case should the leader fail immediately after acknowledging the record but before the followers have replicated it then the record will be lost. 
acks=allThis means the leader will wait for the full set of in-sync replicas to acknowledge the record. This guarantees that the record will not be lost as long as at least one in-sync replica remains alive. This is the strongest available guarantee. This is equivalent to the acks=-1 setting.
h