Spark Checkpointing: Failure Recovery

xiaoxiao 2021-02-28

Checkpointing

A streaming application must operate 24/7 and hence must be resilient to failures unrelated to the application logic (e.g., system failures, JVM crashes, etc.). For this to be possible, Spark Streaming needs to checkpoint enough information to a fault-tolerant storage system such that it can recover from failures. There are two types of data that are checkpointed.

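Put plainly, a streaming application has to be available around the clock, so it must tolerate failures that have nothing to do with its own logic, such as system errors or JVM crashes. To achieve this, Spark Streaming saves enough checkpoint information to a fault-tolerant storage system that it can recover from those failures. Two types of data are checkpointed.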

Metadata checkpointing - Saving of the information defining the streaming computation to fault-tolerant storage like HDFS. This is used to recover from failure of the node running the driver of the streaming application (discussed in detail later). Metadata includes:

Configuration - The configuration that was used to create the streaming application.

DStream operations - The set of DStream operations that define the streaming application.

Incomplete batches - Batches whose jobs are queued but have not completed yet.

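Metadata checkpointing, restated: the information that defines the streaming computation is saved to fault-tolerant storage such as HDFS, so that the application can recover when the node running its driver fails (discussed in detail later). The metadata covers the configuration used to create the streaming application, the set of DStream operations that make up the computation, and incomplete batches, meaning batches whose jobs have been queued but have not yet finished.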
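As a rough illustration of how metadata checkpointing supports driver recovery, the sketch below uses PySpark's `StreamingContext.getOrCreate`. On a clean start it builds a new context and registers a checkpoint directory; after a driver restart it rebuilds the context from the checkpoint data found in that directory. The checkpoint path, socket host and port, batch interval, and the function names `create_context` and `run` are placeholders chosen for this example, and the sketch assumes a Spark version that still ships the DStream API.

```python
# Hypothetical values for illustration only; replace with your own.
CHECKPOINT_DIR = "hdfs:///tmp/streaming-checkpoint"  # any fault-tolerant storage path
SOURCE_HOST, SOURCE_PORT = "localhost", 9999         # assumed text socket source
BATCH_SECONDS = 10                                   # assumed batch interval


def create_context():
    """Build a fresh StreamingContext; used only when no checkpoint exists."""
    # Imported lazily so this module can be loaded where PySpark is absent.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="MetadataCheckpointSketch")
    ssc = StreamingContext(sc, BATCH_SECONDS)

    # The DStream operations defined here are part of the checkpointed metadata.
    lines = ssc.socketTextStream(SOURCE_HOST, SOURCE_PORT)
    lines.flatMap(lambda line: line.split(" ")).count().pprint()

    # Enable checkpointing to the fault-tolerant directory.
    ssc.checkpoint(CHECKPOINT_DIR)
    return ssc


def run():
    """Recover the context from the checkpoint if present, else start fresh."""
    from pyspark.streaming import StreamingContext

    ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
    ssc.start()
    ssc.awaitTermination()
```

Calling `run()` on a machine with PySpark installed starts the job. Keeping all DStream setup inside `create_context` matters in this pattern, because that function is skipped when the context is restored from the checkpoint.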

Data checkpointing - Saving of the generated RDDs to reliable storage. This is necessary in some stateful transformations that combine data across multiple batches. In such transformations, the generated RDDs depend on RDDs of previous batches, which causes the length of the dependency chain to keep increasing with time. To avoid such unbounded increases in recovery time (proportional to dependency chain), intermediate RDDs of stateful transformations are periodically checkpointed to reliable storage (e.g. HDFS) to cut off the dependency chains.

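Data checkpointing, restated: the generated RDDs are saved to reliable storage, which is required by stateful transformations that combine data across batches. In those transformations each generated RDD depends on the RDDs of earlier batches, so the dependency chain grows longer over time. To keep recovery time (which is proportional to the length of that chain) from growing without bound, the intermediate RDDs of stateful transformations are periodically checkpointed to reliable storage such as HDFS, which cuts off the earlier dependencies.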
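The effect of periodic checkpointing on the dependency chain can be shown with a toy model in plain Python. This is not Spark code: the function `lineage_lengths` and its parameters are invented for this illustration. It assumes each batch's state adds one link to the chain, and that a checkpoint resets the chain because the state can then be read back from reliable storage instead of being recomputed.

```python
def lineage_lengths(num_batches, checkpoint_every=None):
    """Return the dependency-chain length after each batch in a toy model.

    Each batch's state depends on the previous batch's state, so the chain
    grows by one per batch. When a batch is checkpointed, its state is
    assumed to be saved to reliable storage and the chain restarts at zero.
    """
    lengths = []
    chain = 0
    for batch in range(1, num_batches + 1):
        chain += 1  # new state depends on the previous state
        if checkpoint_every and batch % checkpoint_every == 0:
            chain = 0  # checkpoint cuts off the dependency chain
        lengths.append(chain)
    return lengths


without_checkpoints = lineage_lengths(10)
with_checkpoints = lineage_lengths(10, checkpoint_every=4)
print(max(without_checkpoints))  # grows with the number of batches
print(max(with_checkpoints))     # stays below the checkpoint interval
```

In this model the longest chain without checkpoints equals the number of batches processed, while with a checkpoint every `k` batches it never exceeds `k - 1`, which is why recovery time stays bounded.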

To summarize, metadata checkpointing is primarily needed for recovery from driver failures, whereas data or RDD checkpointing is necessary even for basic functioning if stateful transformations are used.

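In short, metadata checkpointing is mainly used to recover from driver failures, while data (RDD) checkpointing is required for even basic operation whenever stateful transformations are used.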
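To make the second half of that summary concrete, here is a hedged sketch of a stateful running word count in PySpark. `updateStateByKey` combines data across batches, so a checkpoint directory has to be set with `ssc.checkpoint` before the job is started. The directory, socket source, intervals, and the names `update_count` and `build_stateful_context` are assumptions made for this example.

```python
CHECKPOINT_DIR = "hdfs:///tmp/stateful-checkpoint"  # hypothetical path


def update_count(new_values, running_count):
    """Add this batch's counts for a key to the total carried over so far."""
    # running_count is None the first time a key is seen.
    return sum(new_values) + (running_count or 0)


def build_stateful_context():
    """Create a StreamingContext with a running word count (needs PySpark)."""
    # Imported lazily so update_count can be used without PySpark installed.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="DataCheckpointSketch")
    ssc = StreamingContext(sc, 10)  # 10-second batches (assumed)

    # Stateful transformations need a checkpoint directory.
    ssc.checkpoint(CHECKPOINT_DIR)

    lines = ssc.socketTextStream("localhost", 9999)  # assumed source
    words = lines.flatMap(lambda line: line.split(" "))
    totals = words.map(lambda word: (word, 1)).updateStateByKey(update_count)

    # Optionally set how often the state RDDs are checkpointed, in seconds.
    totals.checkpoint(50)
    totals.pprint()
    return ssc
```

The update function is kept separate from the Spark wiring so that its behavior can be checked on its own, for example `update_count([1, 1], 3)` returns `5`.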

When reposting, please cite the original URL: https://www.6miu.com/read-1250266.html
