With the overview of the Spark RPC framework in place, this post analyzes the communication that takes place during Spark startup, using the Standalone deployment mode.

During startup the communication is mainly between the Master and the Workers. A Worker first sends a registration message to the Master; the Master replies with either a registration-success or a registration-failure message; once a Worker receives the success reply, it starts sending heartbeat messages to the Master at a fixed interval.
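Before walking through the source, here is a heavily simplified, self-contained sketch of that handshake. The case classes below are trimmed stand-ins for Spark's real deploy messages (which carry more fields), so the field lists and values are illustrative only:

// Simplified stand-ins for the messages exchanged during registration.
case class RegisterWorker(workerId: String, host: String, port: Int, cores: Int, memory: Int)

sealed trait RegisterWorkerResponse
case class RegisteredWorker(masterUrl: String) extends RegisterWorkerResponse
case class RegisterWorkerFailed(reason: String) extends RegisterWorkerResponse
case object MasterInStandby extends RegisterWorkerResponse

case class Heartbeat(workerId: String)

object HandshakeSketch extends App {
  // Master-side decision, mirroring the three branches analyzed later in this post.
  def masterHandle(msg: RegisterWorker, standby: Boolean, known: Set[String]): RegisterWorkerResponse =
    if (standby) MasterInStandby
    else if (known.contains(msg.workerId)) RegisterWorkerFailed("Duplicate worker ID")
    else RegisteredWorker("spark://master:7077")

  // Worker side: register, then heartbeat only if registration succeeded.
  val reply = masterHandle(RegisterWorker("worker-1", "node1", 7078, 4, 8192), standby = false, known = Set.empty)
  reply match {
    case RegisteredWorker(url)   => println(s"registered with $url; start sending Heartbeat(worker-1)")
    case RegisterWorkerFailed(r) => println(s"registration failed: $r")
    case MasterInStandby         => println("master in standby; ignore and retry later")
  }
}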
A Spark cluster is normally started with the start-all.sh script, which starts the Master first and then the Workers.

start-master.sh ultimately runs the org.apache.spark.deploy.master.Master class on the Master node, and start-slaves.sh ultimately runs org.apache.spark.deploy.worker.Worker on each slave node.
The communication during startup is analyzed step by step below.
def main(argStrings: Array[String]) {
  Utils.initDaemon(log)
  val conf = new SparkConf
  val args = new WorkerArguments(argStrings, conf)
  val rpcEnv = startRpcEnvAndEndpoint(args.host, args.port, args.webUiPort, args.cores,
    args.memory, args.masters, args.workDir, conf = conf)
  rpcEnv.awaitTermination()
}
Starting from the Worker's main method: it first calls startRpcEnvAndEndpoint to start the RpcEnv and register the endpoint.
startRpcEnvAndEndpoint calls RpcEnv.create(systemName, host, port, conf, securityMgr). As described earlier, RpcEnv.create() creates a NettyRpcEnv. Creating the NettyRpcEnv initializes the Dispatcher, which in turn creates an Inbox; when the Inbox is initialized it puts an OnStart message into its messages queue. The MessageLoop then calls inbox.process() to consume that message, and after the match case dispatch the call ends up in endpoint.onStart():
NettyRpcEnv -> Dispatcher -> EndpointData -> Inbox -> messages.add(OnStart) -> MessageLoop -> data.inbox.process(Dispatcher.this) -> Worker.onStart()
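To make the chain above concrete, here is a heavily simplified, self-contained sketch of how a message queued at Inbox construction time ends up invoking the endpoint's onStart(). SimpleInbox, SimpleEndpoint and the demo object are illustrative stand-ins, not Spark's actual classes:

import java.util.concurrent.LinkedBlockingQueue

// Illustrative stand-ins for RpcEndpoint / InboxMessage; not Spark's real types.
sealed trait InboxMessage
case object OnStart extends InboxMessage
case class OneWayMessage(content: Any) extends InboxMessage

trait SimpleEndpoint {
  def onStart(): Unit
  def receive(content: Any): Unit
}

// The inbox enqueues OnStart at construction, mirroring Spark's Inbox.
class SimpleInbox(endpoint: SimpleEndpoint) {
  private val messages = new LinkedBlockingQueue[InboxMessage]()
  messages.add(OnStart)

  def post(msg: InboxMessage): Unit = messages.add(msg)

  // Called by the message loop; dispatches via pattern matching like Inbox.process().
  def process(): Unit = messages.take() match {
    case OnStart          => endpoint.onStart()
    case OneWayMessage(c) => endpoint.receive(c)
  }
}

object MessageLoopDemo extends App {
  val inbox = new SimpleInbox(new SimpleEndpoint {
    def onStart(): Unit = println("onStart() called - registration would start here")
    def receive(content: Any): Unit = println(s"received: $content")
  })
  inbox.process()                      // consumes OnStart -> onStart()
  inbox.post(OneWayMessage("hello"))
  inbox.process()                      // consumes the one-way message
}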
So the method to focus on is onStart():
override def onStart() {
  assert(!registered)
  logInfo("Starting Spark worker %s:%d with %d cores, %s RAM".format(
    host, port, cores, Utils.megabytesToString(memory)))
  logInfo(s"Running Spark version ${org.apache.spark.SPARK_VERSION}")
  logInfo("Spark home: " + sparkHome)
  createWorkDir()
  shuffleService.startIfEnabled()
  webUi = new WorkerWebUI(this, workDir, webUiPort)
  webUi.bind()
  workerWebUiUrl = s"http://$publicAddress:${webUi.boundPort}"
  registerWithMaster()
  metricsSystem.registerSource(workerSource)
  metricsSystem.start()
  // Attach the worker metrics servlet handler to the web ui after the metrics system is started.
  metricsSystem.getServletHandlers.foreach(webUi.attachHandler)
}
The registration-related work happens in the registerWithMaster() method:
private def registerWithMaster() {
  // onDisconnected may be triggered multiple times, so don't attempt registration
  // if there are outstanding registration attempts scheduled.
  registrationRetryTimer match {
    case None =>
      registered = false
      registerMasterFutures = tryRegisterAllMasters()
      connectionAttemptCount = 0
      registrationRetryTimer = Some(forwordMessageScheduler.scheduleAtFixedRate(
        new Runnable {
          override def run(): Unit = Utils.tryLogNonFatalError {
            Option(self).foreach(_.send(ReregisterWithMaster))
          }
        },
        INITIAL_REGISTRATION_RETRY_INTERVAL_SECONDS,
        INITIAL_REGISTRATION_RETRY_INTERVAL_SECONDS,
        TimeUnit.SECONDS))
    case Some(_) =>
      logInfo("Not spawning another attempt to register with the master, since there is an" +
        " attempt scheduled already.")
  }
}
The key method here is tryRegisterAllMasters():
private def tryRegisterAllMasters(): Array[JFuture[_]] = {
  masterRpcAddresses.map { masterAddress =>
    registerMasterThreadPool.submit(new Runnable {
      override def run(): Unit = {
        try {
          logInfo("Connecting to master " + masterAddress + "...")
          val masterEndpoint = rpcEnv.setupEndpointRef(masterAddress, Master.ENDPOINT_NAME)
          registerWithMaster(masterEndpoint)
        } catch {
          case ie: InterruptedException => // Cancelled
          case NonFatal(e) => logWarning(s"Failed to connect to master $masterAddress", e)
        }
      }
    })
  }
}
Because there may be several Masters (for high availability), the Worker has to send its registration message to each of them. It does so through the registerMasterThreadPool thread pool, whose number of threads equals the number of Masters.
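A minimal, self-contained sketch of that pattern (the master addresses and the register function below are made up for illustration; in Spark the pool is a daemon thread pool sized to masterRpcAddresses.length, and each task calls rpcEnv.setupEndpointRef followed by registerWithMaster):

import java.util.concurrent.{Callable, Executors, Future => JFuture}

object RegisterAllMastersSketch extends App {
  // Hypothetical master addresses; in Spark these come from masterRpcAddresses.
  val masterAddresses = Seq("spark://master1:7077", "spark://master2:7077")

  // One thread per master so all registration attempts can run concurrently,
  // mirroring registerMasterThreadPool being sized to the number of masters.
  val pool = Executors.newFixedThreadPool(masterAddresses.length)

  // Stand-in for rpcEnv.setupEndpointRef(...) + registerWithMaster(...).
  def register(address: String): String = s"sent RegisterWorker to $address"

  val futures: Seq[JFuture[String]] = masterAddresses.map { address =>
    pool.submit(new Callable[String] {
      override def call(): String = register(address)
    })
  }

  futures.foreach(f => println(f.get()))
  pool.shutdown()
}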
As the code shows, registration first obtains a reference to the Master, val masterEndpoint = rpcEnv.setupEndpointRef(masterAddress, Master.ENDPOINT_NAME), and then calls registerWithMaster(masterEndpoint):
private def registerWithMaster(masterEndpoint: RpcEndpointRef): Unit = {
  masterEndpoint.ask[RegisterWorkerResponse](RegisterWorker(
    workerId, host, port, self, cores, memory, workerWebUiUrl))
    .onComplete {
      // This is a very fast action so we can use "ThreadUtils.sameThread"
      case Success(msg) =>
        Utils.tryLogNonFatalError {
          handleRegisterResponse(msg)
        }
      case Failure(e) =>
        logError(s"Cannot register with master: ${masterEndpoint.address}", e)
        System.exit(1)
    }(ThreadUtils.sameThread)
}
In registerWithMaster, the Worker uses the Master endpoint reference to call ask (the ask flow was analyzed in the previous post): it sends the registration message and waits for the reply. On success it enters handleRegisterResponse; on failure it logs the error and exits.
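As an analogy in plain Scala Futures (not Spark's RpcEndpointRef API; all names below are illustrative), the ask pattern amounts to sending a request, getting back a Future of the typed reply, and registering a completion callback on a chosen ExecutionContext, here one that mimics ThreadUtils.sameThread:

import scala.concurrent.{ExecutionContext, Future, Promise}
import scala.util.{Failure, Success}

object AskPatternSketch extends App {
  // Stand-in for ThreadUtils.sameThread: run the callback on the completing thread.
  implicit val sameThread: ExecutionContext = new ExecutionContext {
    def execute(runnable: Runnable): Unit = runnable.run()
    def reportFailure(cause: Throwable): Unit = cause.printStackTrace()
  }

  // Hypothetical reply type; Spark's is RegisterWorkerResponse.
  case class RegisteredOk(masterUrl: String)

  // Stand-in for masterEndpoint.ask[RegisterWorkerResponse](RegisterWorker(...)).
  def ask(message: String): Future[RegisteredOk] = {
    val promise = Promise[RegisteredOk]()
    promise.success(RegisteredOk("spark://master:7077"))  // pretend the Master replied
    promise.future
  }

  ask("RegisterWorker").onComplete {
    case Success(msg) => println(s"registered with ${msg.masterUrl}")  // handleRegisterResponse
    case Failure(e) =>
      println(s"cannot register: ${e.getMessage}")
      sys.exit(1)
  }
}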
The previous post on the Spark RPC framework walked through the ask sequence diagram: the Worker's registration message is sent out through a TransportClient; when the Master receives it, the message eventually reaches the Dispatcher's postMessage, which puts it into an Inbox. Inbox.process() then handles it, and after the match case dispatch the call lands in the Master's receiveAndReply() method:
override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
  case RegisterWorker(id, workerHost, workerPort, workerRef, cores, memory, workerWebUiUrl) =>
    logInfo("Registering worker %s:%d with %d cores, %s RAM".format(
      workerHost, workerPort, cores, Utils.megabytesToString(memory)))
    if (state == RecoveryState.STANDBY) {
      context.reply(MasterInStandby)
    } else if (idToWorker.contains(id)) {
      context.reply(RegisterWorkerFailed("Duplicate worker ID"))
    } else {
      val worker = new WorkerInfo(id, workerHost, workerPort, cores, memory,
        workerRef, workerWebUiUrl)
      if (registerWorker(worker)) {
        persistenceEngine.addWorker(worker)
        context.reply(RegisteredWorker(self, masterWebUiUrl))
        schedule()
      } else {
        val workerAddress = worker.endpoint.address
        logWarning("Worker registration failed. Attempted to re-register worker at same " +
          "address: " + workerAddress)
        context.reply(RegisterWorkerFailed("Attempted to re-register worker at same address: "
          + workerAddress))
      }
    }
  ...
}
In the code above, the Master first checks whether it is in STANDBY state and, if so, replies with MasterInStandby; if the worker already appears in the registration list, it replies with a RegisterWorkerFailed message.

In the normal case, registerWorker adds the worker to workers, and the Master replies to the Worker with a RegisteredWorker message indicating successful registration.
Here is the registerWorker() method:
private def registerWorker(worker: WorkerInfo): Boolean = {
  // There may be one or more refs to dead workers on this same node (w/ different ID's),
  // remove them.
  workers.filter { w =>
    (w.host == worker.host && w.port == worker.port) && (w.state == WorkerState.DEAD)
  }.foreach { w =>
    workers -= w
  }

  val workerAddress = worker.endpoint.address
  if (addressToWorker.contains(workerAddress)) {
    val oldWorker = addressToWorker(workerAddress)
    if (oldWorker.state == WorkerState.UNKNOWN) {
      // A worker registering from UNKNOWN implies that the worker was restarted during recovery.
      // The old worker must thus be dead, so we will remove it and accept the new worker.
      removeWorker(oldWorker)
    } else {
      logInfo("Attempted to re-register worker at same address: " + workerAddress)
      return false
    }
  }

  workers += worker
  idToWorker(worker.id) = worker
  addressToWorker(workerAddress) = worker
  if (reverseProxy) {
    webUi.addProxyTargets(worker.id, worker.webUiAddress)
  }
  true
}
Back on the Worker side, the Master's reply is handled in handleRegisterResponse (triggered from the registerWithMaster call shown above):

private def handleRegisterResponse(msg: RegisterWorkerResponse): Unit = synchronized {
  msg match {
    case RegisteredWorker(masterRef, masterWebUiUrl) =>
      logInfo("Successfully registered with master " + masterRef.address.toSparkURL)
      registered = true
      changeMaster(masterRef, masterWebUiUrl)
      forwordMessageScheduler.scheduleAtFixedRate(new Runnable {
        override def run(): Unit = Utils.tryLogNonFatalError {
          self.send(SendHeartbeat)
        }
      }, 0, HEARTBEAT_MILLIS, TimeUnit.MILLISECONDS)
      if (CLEANUP_ENABLED) {
        logInfo(
          s"Worker cleanup enabled; old application directories will be deleted in: $workDir")
        forwordMessageScheduler.scheduleAtFixedRate(new Runnable {
          override def run(): Unit = Utils.tryLogNonFatalError {
            self.send(WorkDirCleanup)
          }
        }, CLEANUP_INTERVAL_MILLIS, CLEANUP_INTERVAL_MILLIS, TimeUnit.MILLISECONDS)
      }

      val execs = executors.values.map { e =>
        new ExecutorDescription(e.appId, e.execId, e.cores, e.state)
      }
      masterRef.send(WorkerLatestState(workerId, execs.toList, drivers.keys.toSeq))

    case RegisterWorkerFailed(message) =>
      if (!registered) {
        logError("Worker registration failed: " + message)
        System.exit(1)
      }

    case MasterInStandby =>
      // Ignore. Master not yet ready.
  }
}

// Send a heartbeat every (heartbeat timeout) / 4 milliseconds
private val HEARTBEAT_MILLIS = conf.getLong("spark.worker.timeout", 60) * 1000 / 4
As the code above shows, the heartbeat interval is determined by HEARTBEAT_MILLIS, which is derived from the spark.worker.timeout setting: the Worker sends a heartbeat every quarter of that timeout. With the default of 60 seconds that is 60 * 1000 / 4 = 15000 ms, i.e. one heartbeat every 15 seconds.
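A dependency-free sketch of that derivation (the Map below stands in for SparkConf, purely for illustration):

object HeartbeatIntervalSketch extends App {
  // Same formula as HEARTBEAT_MILLIS: a quarter of spark.worker.timeout, in milliseconds.
  def heartbeatMillis(conf: Map[String, String]): Long =
    conf.get("spark.worker.timeout").map(_.toLong).getOrElse(60L) * 1000 / 4

  println(heartbeatMillis(Map.empty))                             // 15000 ms with the 60 s default
  println(heartbeatMillis(Map("spark.worker.timeout" -> "120")))  // 30000 ms when the timeout is raised to 120 s
}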