(版本定制)第16课：Spark Streaming源码解读之数据清理内幕彻底解密

蔷薇525 · 发表于 2019-1-31 07:37:56

　　本期内容：
　　 1、Spark Streaming元数据清理详解
　　 2、Spark Streaming元数据清理源码解析
　　

　　一、如何研究Spark Streaming元数据清理

　　操作DStream的时候会产生元数据，所以要解决RDD的数据清理工作就一定要从DStream入手。因为DStream是RDD的模板，DStream之间有依赖关系。
DStream的操作产生了RDD,接收数据也靠DStream，数据的输入，数据的计算，输出整个生命周期都是由DStream构建的。由此，DStream负责RDD的整个生命周期。因此研究的入口的是DStream。
　　基于Kafka数据来源，通过Direct的方式访问Kafka,DStream随着时间的进行，会不断的在自己的内存数据结构中维护一个HashMap,HashMap维护的就是时间窗口，以及时间窗口下的RDD.按照Batch Duration来存储RDD以及删除RDD.
　　Spark Streaming本身是一直在运行的，在自己计算的时候会不断的产生RDD，例如每秒Batch Duration都会产生RDD,除此之外可能还有累加器，广播变量。由于不断的产生这些对象，因此Spark Streaming有自己的一套对象，元数据以及数据的清理机制。
　　Spark Streaming对RDD的管理就相当于JVM的GC

　　

　　二、源码解析
　　Spark Streaming是通过我们设定的Batch Durations来不断的产生RDD，Spark Streaming清理元数据跟时钟有关，因为数据是周期性的产生，所以肯定是周期性的释放，这些都跟JobGenerator有关，所以我们先从这开始研究。
　　

　　1、RecurringTimer: 消息循环器将消息不断的发送给EventLoop
　　
= RecurringTimer(...millisecondslongTime => .post((Time(longTime))))　　2、eventLoop：onReceive接收到消息
　　
(): = synchronized {
(!= ) = EventLoop[JobGeneratorEvent]() {
(event: JobGeneratorEvent): = processEvent(event)
(e: ): = {
   jobScheduler.reportError(e)
}
  }
.start()
(.) {
restart()
  } {
startFirstTime()
  }
}　　3、在processEvent中接收清理元数据消息
　　
/** Processes all events */
private def processEvent(event: JobGeneratorEvent) {
  logDebug("Got event " + event)
  event match {
case GenerateJobs(time) => generateJobs(time)
case ClearMetadata(time) => clearMetadata(time) //清理元数据
case DoCheckpoint(time, clearCheckpointDataLater) =>
   doCheckpoint(time, clearCheckpointDataLater)
case ClearCheckpointData(time) => clearCheckpointData(time) //清理checkpoint
  }
}　　具体的方法实现内容就不再这里说，我们进一步分析下这些清理动作是在什么时候被调用的，在Spark Streaming应用程序中，最终Job是交给JobHandler来执行的，所以我们分析下JobHandler
　　
private class JobHandler(job: Job) extends Runnable with Logging {
import JobScheduler._

def run() {
try {
val formattedTime = UIUtils.formatBatchTime(
      job.time.milliseconds, ssc.graph.batchDuration.milliseconds, showYYYYMMSS = false)
val batchUrl = s"/streaming/batch/?id=${job.time.milliseconds}"
val batchLinkText = s"[output operation ${job.outputOpId}, batch time ${formattedTime}]"

ssc.sc.setJobDescription(
s"""Streaming job from $batchLinkText""")
      ssc.sc.setLocalProperty(BATCH_TIME_PROPERTY_KEY, job.time.milliseconds.toString)
      ssc.sc.setLocalProperty(OUTPUT_OP_ID_PROPERTY_KEY, job.outputOpId.toString)

// We need to assign `eventLoop` to a temp variable. Otherwise, because
      // `JobScheduler.stop(false)` may set `eventLoop` to null when this method is running, then
      // it's possible that when `post` is called, `eventLoop` happens to null.
var _eventLoop = eventLoop
if (_eventLoop != null) {
      _eventLoop.post(JobStarted(job, clock.getTimeMillis()))
// Disable checks for existing output directories in jobs launched by the streaming
      // scheduler, since we may need to write output to an existing directory during checkpoint
      // recovery; see SPARK-4835 for more details.
PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
         job.run()
      }
      _eventLoop = eventLoop
if (_eventLoop != null) {
_eventLoop.post(JobCompleted(job, clock.getTimeMillis()))
      }
      } else {
// JobScheduler has been stopped.
}
   } finally {
      ssc.sc.setLocalProperty(JobScheduler.BATCH_TIME_PROPERTY_KEY, null)
      ssc.sc.setLocalProperty(JobScheduler.OUTPUT_OP_ID_PROPERTY_KEY, null)
   }
}
  }
}　　当Job完成的时候，会发JobCompleted消息给onReceive，通过processEvent来执行具体的方法
　　
private def processEvent(event: JobSchedulerEvent) {
try {
event match {
case JobStarted(job, startTime) => handleJobStart(job, startTime)
case JobCompleted(job, completedTime) => handleJobCompletion(job, completedTime)
case ErrorReported(m, e) => handleError(m, e)
}
  } catch {
case e: Throwable =>
   reportError("Error in job scheduler", e)
  }
}private def handleJobCompletion(job: Job, completedTime: Long) {
val jobSet = jobSets.get(job.time)
  jobSet.handleJobCompletion(job)
  job.setEndTime(completedTime)
listenerBus.post(StreamingListenerOutputOperationCompleted(job.toOutputOperationInfo))
  logInfo("Finished job " + job.id + " from job set of time " + jobSet.time)
if (jobSet.hasCompleted) {
jobSets.remove(jobSet.time)
jobGenerator.onBatchCompletion(jobSet.time)
logInfo("Total delay: %.3f s for time %s (execution: %.3f s)".format(
   jobSet.totalDelay / 1000.0, jobSet.time.toString,
jobSet.processingDelay / 1000.0
))
listenerBus.post(StreamingListenerBatchCompleted(jobSet.toBatchInfo))
  }
  job.result match {
case Failure(e) =>
   reportError("Error running job " + job, e)
case _ =>
  }
}　　通过jobGenerator.onBatchCompletion来清理元数据
　　
/**
* Callback called when a batch has been completely processed.
*/
def onBatchCompletion(time: Time) {
eventLoop.post(ClearMetadata(time))
}　　到这里Spark Streaming清理元数据的步骤基本上完成了
　　

　　

备注：

资料来源于：DT_大数据梦工厂（Spark发行版本定制）

更多私密内容，请关注微信公众号：DT_Spark

如果您对大数据Spark感兴趣，可以免费听由王家林老师每天晚上20：00开设的Spark永久免费公开课，地址YY房间号：68917580

账号		自动登录	找回密码
密码			立即注册

VMware vcenter+vSphere 6.5 U2共享

【跟谁学】韩宇极简英语课-技术人员不得不

用Zabbix通过JMX方式监控weblogic

winhex数据恢复教程（非常巨大，内容丰富）

Symantec Backup Exec 2015 2016/2012 BE20

NetScaler VPX部署之：NetScaler Gateway调

zabbix3.4.1安装部署+微信推送信息+大屏显

[经验分享] (版本定制)第16课：Spark Streaming源码解读之数据清理内幕彻底解密

扫码加入运维网微信交流群