@changedi
2017-04-18T09:09:51.000000Z
Big Data, HDFS
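All of the analysis below is based on a single-node installation of Hadoop 2.6.4. The steps follow the installation guide; see "Hadoop single-node installation". Note that the script and source excerpts quoted below (hadoop-functions.sh, the --workers option) appear to come from a newer 3.x code line, so some details may differ from 2.6.4.
Some groundwork on the important script files:
- Assume Hadoop is installed in the directory HADOOP_HOME.
- The key script file is hadoop-functions.sh.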
The first step is to format the NameNode:

    $ bin/hdfs namenode -format

This runs the HADOOP_HOME/bin/hdfs script. For the namenode subcommand it sets three important variables: HADOOP_SUBCMD_SUPPORTDAEMONIZATION, HADOOP_CLASSNAME, and (through hadoop_add_param) HADOOP_OPTS.

    namenode)
      HADOOP_SUBCMD_SUPPORTDAEMONIZATION="true"
      HADOOP_CLASSNAME='org.apache.hadoop.hdfs.server.namenode.NameNode'
      hadoop_add_param HADOOP_OPTS hdfs.audit.logger "-Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER}"
    ;;

At the end, the script executes:

    hadoop_java_exec "${HADOOP_SUBCMD}" "${HADOOP_CLASSNAME}" "${HADOOP_SUBCMD_ARGS[@]}"

hadoop_java_exec is a function declared in hadoop-functions.sh. Its job is to start a Java process that runs the command.

    function hadoop_java_exec
    {
      # run a java command.  this is used for
      # non-daemons

      local command=$1
      local class=$2
      shift 2

      hadoop_debug "Final CLASSPATH: ${CLASSPATH}"
      hadoop_debug "Final HADOOP_OPTS: ${HADOOP_OPTS}"
      hadoop_debug "Final JAVA_HOME: ${JAVA_HOME}"
      hadoop_debug "java: ${JAVA}"
      hadoop_debug "Class name: ${class}"
      hadoop_debug "Command line options: $*"

      export CLASSPATH
      #shellcheck disable=SC2086
      exec "${JAVA}" "-Dproc_${command}" ${HADOOP_OPTS} "${class}" "$@"
    }

So the whole command chain boils down to running the main method of org.apache.hadoop.hdfs.server.namenode.NameNode with the argument -format.
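To make the final exec line concrete, here is a minimal Java sketch of how the argument vector is assembled. The class JavaExecSketch and its method are hypothetical helpers written for illustration, not part of Hadoop, and the audit logger value passed in main is only an example value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that mirrors the last line of hadoop_java_exec:
//   exec "${JAVA}" "-Dproc_${command}" ${HADOOP_OPTS} "${class}" "$@"
class JavaExecSketch {
    static List<String> buildCommand(String java, String subcmd, String hadoopOpts,
                                     String className, String... args) {
        List<String> cmd = new ArrayList<>();
        cmd.add(java);
        cmd.add("-Dproc_" + subcmd);  // tags the JVM, e.g. -Dproc_namenode
        // HADOOP_OPTS is expanded unquoted in the script, so it splits on whitespace.
        for (String opt : hadoopOpts.trim().split("\\s+")) {
            if (!opt.isEmpty()) {
                cmd.add(opt);
            }
        }
        cmd.add(className);
        cmd.addAll(Arrays.asList(args));
        return cmd;
    }

    public static void main(String[] argv) {
        List<String> cmd = buildCommand("java", "namenode",
                "-Dhdfs.audit.logger=INFO,NullAppender",  // example value only
                "org.apache.hadoop.hdfs.server.namenode.NameNode", "-format");
        System.out.println(String.join(" ", cmd));
    }
}
```

The -Dproc_namenode property has no effect on the JVM itself; it mainly makes the process easy to identify in a process listing.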

    public static void main(String argv[]) throws Exception {
      if (DFSUtil.parseHelpArgument(argv, NameNode.USAGE, System.out, true)) {
        System.exit(0);
      }

      try {
        StringUtils.startupShutdownMessage(NameNode.class, argv, LOG);
        NameNode namenode = createNameNode(argv, null);
        if (namenode != null) {
          namenode.join();
        }
      } catch (Throwable e) {
        LOG.error("Failed to start namenode.", e);
        terminate(1, e);
      }
    }

startupShutdownMessage prints startup information to the console. On Unix systems it also registers the logger with the signal handlers, so that an error-level log entry is written when one of the signals { "TERM", "HUP", "INT" } is received. The point is that when a system signal terminates the process, the log shows which signal caused the exit.

    if (SystemUtils.IS_OS_UNIX) {
      try {
        SignalLogger.INSTANCE.register(LOG);
      } catch (Throwable t) {
        LOG.warn("failed to register any UNIX signal loggers: ", t);
      }
    }

Next comes createNameNode. It first parses the -format argument into StartupOption.FORMAT and then calls the format method. Because no cluster ID was specified, the system generates a new clusterId, for example CID-d2425dab-c066-4a67-954f-32228c22abe6.

    private static boolean format(Configuration conf, boolean force,
        boolean isInteractive) throws IOException {
      String nsId = DFSUtil.getNamenodeNameServiceId(conf);
      String namenodeId = HAUtil.getNameNodeId(conf, nsId);
      initializeGenericKeys(conf, nsId, namenodeId);
      checkAllowFormat(conf);

      if (UserGroupInformation.isSecurityEnabled()) {
        InetSocketAddress socAddr = DFSUtilClient.getNNAddress(conf);
        SecurityUtil.login(conf, DFS_NAMENODE_KEYTAB_FILE_KEY,
            DFS_NAMENODE_KERBEROS_PRINCIPAL_KEY, socAddr.getHostName());
      }

      Collection<URI> nameDirsToFormat = FSNamesystem.getNamespaceDirs(conf);
      List<URI> sharedDirs = FSNamesystem.getSharedEditsDirs(conf);
      List<URI> dirsToPrompt = new ArrayList<URI>();
      dirsToPrompt.addAll(nameDirsToFormat);
      dirsToPrompt.addAll(sharedDirs);
      List<URI> editDirsToFormat =
          FSNamesystem.getNamespaceEditsDirs(conf);

      // if clusterID is not provided - see if you can find the current one
      String clusterId = StartupOption.FORMAT.getClusterId();
      if(clusterId == null || clusterId.equals("")) {
        //Generate a new cluster id
        clusterId = NNStorage.newClusterID();
      }
      System.out.println("Formatting using clusterid: " + clusterId);

      FSImage fsImage = new FSImage(conf, nameDirsToFormat, editDirsToFormat);
      try {
        FSNamesystem fsn = new FSNamesystem(conf, fsImage);
        fsImage.getEditLog().initJournalsForWrite();

        if (!fsImage.confirmFormat(force, isInteractive)) {
          return true; // aborted
        }

        fsImage.format(fsn, clusterId);
      } catch (IOException ioe) {
        LOG.warn("Encountered exception during format: ", ioe);
        fsImage.close();
        throw ioe;
      }
      return false;
    }
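The cluster ID fallback in format can be sketched in isolation. The following is a minimal illustration that assumes a new ID is simply the prefix CID- followed by a random UUID, which matches the shape of the example ID above; ClusterIdSketch and resolveClusterId are hypothetical names, not Hadoop APIs.

```java
import java.util.UUID;

// Hypothetical stand-in for the "use the given cluster ID or generate one" step.
class ClusterIdSketch {
    // Returns the supplied ID, or a fresh "CID-<uuid>" when none was provided.
    static String resolveClusterId(String provided) {
        if (provided == null || provided.isEmpty()) {
            // Assumption: prefix plus random UUID, as the example ID suggests.
            return "CID-" + UUID.randomUUID();
        }
        return provided;
    }

    public static void main(String[] args) {
        System.out.println("Formatting using clusterid: " + resolveClusterId(null));
    }
}
```

Because the ID is random, formatting twice produces two different cluster IDs, which is why a DataNode that registered under an old ID cannot simply join a freshly formatted NameNode.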
Next an FSImage is constructed. This sets the default checkpoint directory, sets up the storage, and initializes the edit log. NNStorage manages the storage directories, and FSEditLog is the edit log object.

    protected FSImage(Configuration conf,
                      Collection<URI> imageDirs,
                      List<URI> editsDirs)
        throws IOException {
      this.conf = conf;

      storage = new NNStorage(conf, imageDirs, editsDirs);
      if(conf.getBoolean(DFSConfigKeys.DFS_NAMENODE_NAME_DIR_RESTORE_KEY,
                         DFSConfigKeys.DFS_NAMENODE_NAME_DIR_RESTORE_DEFAULT)) {
        storage.setRestoreFailedStorage(true);
      }

      this.editLog = FSEditLog.newInstance(conf, storage, editsDirs);
      archivalManager = new NNStorageRetentionManager(conf, storage, editLog);
    }

With the file system image in place, the FSNamesystem can be constructed. It is the container for namespace state and carries all of the NameNode's bookkeeping work. The constructor is long, so it is not reproduced here; its steps are:
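1. Create the KeyProvider. This example has no security configuration, so the result is "no KeyProvider found".
2. Read dfs.namenode.fslock.fair and construct the FSNamesystemLock. The default is true, i.e. a fair read-write lock.
3. Set up the user and permissions.
4. Check whether HA is enabled.
5. Initialize the BlockManager and the managers it delegates to, including DatanodeManager (manages DataNode decommissioning [DecommissionManager] and other DataNode activity), HeartbeatManager (manages the heartbeats received from DataNodes), and BlockIdManager (allocates and manages generation stamps and block IDs).
6. Construct the FSDirectory, a purely in-memory structure that works together with FSNamesystem to manage the NameNode's namespace and builds the INodes.
7. Initialize the CacheManager, which manages the DataNodes' caches.
8. Initialize the RetryCache, which caches non-idempotent requests that the RPC server has already processed successfully, so that retries can be handled.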
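The idea behind step 8 can be sketched with a small map keyed by client and call ID. This is a simplified illustration, not Hadoop's RetryCache: the class and method names are hypothetical, and entry expiry is left out.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical, simplified retry cache: a non-idempotent operation runs once
// per (clientId, callId); a retried request gets the cached result instead.
class RetryCacheSketch {
    private final Map<String, Object> results = new ConcurrentHashMap<>();

    @SuppressWarnings("unchecked")
    <T> T callOnce(String clientId, int callId, Supplier<T> operation) {
        String key = clientId + "#" + callId;
        // computeIfAbsent runs the operation only when the key is not cached yet.
        return (T) results.computeIfAbsent(key, k -> operation.get());
    }

    public static void main(String[] args) {
        int[] executions = {0};
        RetryCacheSketch cache = new RetryCacheSketch();
        Supplier<Boolean> createFile = () -> { executions[0]++; return true; };

        cache.callOnce("client-1", 7, createFile);  // original request
        cache.callOnce("client-1", 7, createFile);  // retry after a lost response
        System.out.println("executions: " + executions[0]);
    }
}
```

Without such a cache, a client that retries after losing the response to a successful create or delete would see a spurious failure on the second attempt.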
This completes the FSNamesystem initialization. Finally, FSImage's format method is executed to do the actual formatting, and then the NameNode shuts down.
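The second step is to start the NameNode and the DataNode, using the following script:

    $ sbin/start-dfs.sh

The core of the script:

    #---------------------------------------------------------
    # namenodes

    NAMENODES=$("${HADOOP_HDFS_HOME}/bin/hdfs" getconf -namenodes 2>/dev/null)

    if [[ -z "${NAMENODES}" ]]; then
      NAMENODES=$(hostname)
    fi

    echo "Starting namenodes on [${NAMENODES}]"
    hadoop_uservar_su hdfs namenode "${HADOOP_HDFS_HOME}/bin/hdfs" \
        --workers \
        --config "${HADOOP_CONF_DIR}" \
        --hostnames "${NAMENODES}" \
        --daemon start \
        namenode ${nameStartOpt}

    HADOOP_JUMBO_RETCOUNTER=$?

That is, it first runs hdfs getconf -namenodes to look up all NameNodes in the configuration, and then runs hdfs namenode to start the NameNode. From the analysis above we know that the hdfs script starts a Java process for the given subcommand, and the namenode subcommand again maps to the main method of the NameNode class. The other steps are the same; only the logic inside createNameNode differs because the arguments differ. The start script passes no further arguments to namenode, so the default branch runs:

    default: {
      DefaultMetricsSystem.initialize("NameNode");
      return new NameNode(conf);
    }

The core is the NameNode constructor. It first sets the NameNode address through setClientNamenodeAddress; by default this is the value configured for fs.defaultFS, hdfs://localhost:9000.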
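As a small illustration of what the fs.defaultFS value carries, the sketch below splits such a URI into host and port using only the JDK. DefaultFsSketch, toAddress, and the fallbackPort parameter are hypothetical and only stand in for the parsing Hadoop does internally; the fallback value used in main is likewise an assumption.

```java
import java.net.InetSocketAddress;
import java.net.URI;

// Hypothetical helper: turns an fs.defaultFS style URI into host and port.
class DefaultFsSketch {
    // fallbackPort is used when the URI carries no explicit port.
    static InetSocketAddress toAddress(String defaultFs, int fallbackPort) {
        URI uri = URI.create(defaultFs);
        if (!"hdfs".equals(uri.getScheme())) {
            throw new IllegalArgumentException("not an hdfs URI: " + defaultFs);
        }
        int port = uri.getPort() == -1 ? fallbackPort : uri.getPort();
        // Unresolved on purpose: the sketch should not trigger a DNS lookup.
        return InetSocketAddress.createUnresolved(uri.getHost(), port);
    }

    public static void main(String[] args) {
        InetSocketAddress addr = toAddress("hdfs://localhost:9000", 8020);
        System.out.println(addr.getHostString() + ":" + addr.getPort());
    }
}
```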
Next the NameNode is initialized:

    protected void initialize(Configuration conf) throws IOException {
      if (conf.get(HADOOP_USER_GROUP_METRICS_PERCENTILES_INTERVALS) == null) {
        String intervals = conf.get(DFS_METRICS_PERCENTILES_INTERVALS_KEY);
        if (intervals != null) {
          conf.set(HADOOP_USER_GROUP_METRICS_PERCENTILES_INTERVALS,
            intervals);
        }
      }

      UserGroupInformation.setConfiguration(conf);
      loginAsNameNodeUser(conf);

      NameNode.initMetrics(conf, this.getRole());
      StartupProgressMetrics.register(startupProgress);

      pauseMonitor = new JvmPauseMonitor();
      pauseMonitor.init(conf);
      pauseMonitor.start();
      metrics.getJvmMetrics().setPauseMonitor(pauseMonitor);

      if (NamenodeRole.NAMENODE == role) {
        startHttpServer(conf);
      }

      loadNamesystem(conf);

      rpcServer = createRpcServer(conf);

      initReconfigurableBackoffKey();

      if (clientNamenodeAddress == null) {
        // This is expected for MiniDFSCluster. Set it now using
        // the RPC server's bind address.
        clientNamenodeAddress =
            NetUtils.getHostPortString(getNameNodeAddress());
        LOG.info("Clients are to use " + clientNamenodeAddress + " to access"
            + " this namenode/service.");
      }
      if (NamenodeRole.NAMENODE == role) {
        httpServer.setNameNodeAddress(getNameNodeAddress());
        httpServer.setFSImage(getFSImage());
      }

      startCommonServices(conf);
      startMetricsLogger(conf);
    }

Several steps matter here. startHttpServer starts an HTTP server whose default address is http://0.0.0.0:50070. HDFS's default HTTP server is a Jetty server; once it is up, opening the page shows monitoring information for the whole HDFS. Then the namesystem is loaded. The parameters are checked first, and since this is a local setup, two warnings are logged:
2017-02-11 21:59:28,765 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Only one image storage directory (dfs.namenode.name.dir) configured. Beware of data loss due to lack of redundant storage directories!
2017-02-11 21:59:28,765 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Only one namespace edits storage directory (dfs.namenode.edits.dir) configured. Beware of data loss due to lack of redundant storage directories!
Ignoring the single-directory issue for image and edit log storage, the next step is the same as in the format logic: construct the FSNamesystem. Then comes loadFSImage. After the FSImage is loaded, the code decides whether it needs to be saved, using:

    final boolean needToSave = staleImage && !haEnabled && !isRollingUpgrade();

In single-node mode all of these values are false. Since staleImage is false, needToSave is false as well, so FSImage's saveNamespace method is not called.
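The condition is easy to misread because two of its three inputs are negated. A minimal sketch that evaluates it for a few cases (the class NeedToSaveSketch is hypothetical and only restates the expression above):

```java
// Hypothetical helper that restates the needToSave expression as a function.
class NeedToSaveSketch {
    static boolean needToSave(boolean staleImage, boolean haEnabled, boolean rollingUpgrade) {
        // Save only when the image is stale AND neither HA nor a rolling upgrade is active.
        return staleImage && !haEnabled && !rollingUpgrade;
    }

    public static void main(String[] args) {
        // Single-node case from the text: every input is false.
        System.out.println("single node:        " + needToSave(false, false, false));
        // A stale image without HA or rolling upgrade is the one case that saves.
        System.out.println("stale, non-HA:      " + needToSave(true, false, false));
        // With HA enabled the image is not saved at this point, even if stale.
        System.out.println("stale, HA enabled:  " + needToSave(true, true, false));
    }
}
```

In the single-node case the two negated terms are both true, so it is the false staleImage alone that keeps the result false.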
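When loading finishes, a log line like this appears:
2017-02-11 21:59:29,472 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Finished loading FSImage in 349 msecs
It indicates that the FSImage has been loaded.
Next the RPC server is initialized. The class involved is RPC.Server, a Protobuf-based RPC server that handles client requests.
In the last two lines of the method, startCommonServices starts all the *Manager services as well as the httpServer and rpcServer, and also starts every ServicePlugin that is configured. startMetricsLogger turns on metrics logging.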
The DataNode part of the start script:

    #---------------------------------------------------------
    # datanodes (using default workers file)

    echo "Starting datanodes"
    hadoop_uservar_su hdfs datanode "${HADOOP_HDFS_HOME}/bin/hdfs" \
        --workers \
        --config "${HADOOP_CONF_DIR}" \
        --daemon start \
        datanode ${dataStartOpt}
    (( HADOOP_JUMBO_RETCOUNTER=HADOOP_JUMBO_RETCOUNTER + $? ))

This runs hdfs datanode with no arguments. A DataNode stores a set of blocks that hold the actual file data. It communicates with the NameNode, and also with other DataNodes and even with clients. The only mapping a DataNode maintains is from a block to its stream of bytes.
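The "block to bytes" mapping can be pictured with a toy in-memory store. This is only a conceptual sketch with hypothetical names; a real DataNode keeps block data in files on its storage directories, not in a Java map.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Toy model of a DataNode's view of the world: block ID -> bytes.
// It knows nothing about file names or directories; that is the NameNode's job.
class BlockStoreSketch {
    private final Map<Long, byte[]> blocks = new HashMap<>();

    void writeBlock(long blockId, byte[] data) {
        blocks.put(blockId, data.clone());  // defensive copy
    }

    byte[] readBlock(long blockId) {
        byte[] data = blocks.get(blockId);
        if (data == null) {
            throw new IllegalArgumentException("unknown block " + blockId);
        }
        return data.clone();
    }

    public static void main(String[] args) {
        BlockStoreSketch store = new BlockStoreSketch();
        store.writeBlock(1073741825L, "hello hdfs".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(store.readBlock(1073741825L), StandardCharsets.UTF_8));
    }
}
```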
DataNode initialization starts by initializing the metrics system. Then comes the core piece of code, the DataNode constructor:

    DataNode(final Configuration conf,
             final List<StorageLocation> dataDirs,
             final StorageLocationChecker storageLocationChecker,
             final SecureResources resources) throws IOException {
      super(conf);
      this.tracer = createTracer(conf);
      this.tracerConfigurationManager =
          new TracerConfigurationManager(DATANODE_HTRACE_PREFIX, conf);
      this.fileIoProvider = new FileIoProvider(conf, this);
      this.blockScanner = new BlockScanner(this);
      this.lastDiskErrorCheck = 0;
      this.maxNumberOfBlocksToLog = conf.getLong(DFS_MAX_NUM_BLOCKS_TO_LOG_KEY,
          DFS_MAX_NUM_BLOCKS_TO_LOG_DEFAULT);

      this.usersWithLocalPathAccess = Arrays.asList(
          conf.getTrimmedStrings(DFSConfigKeys.DFS_BLOCK_LOCAL_PATH_ACCESS_USER_KEY));
      this.connectToDnViaHostname = conf.getBoolean(
          DFSConfigKeys.DFS_DATANODE_USE_DN_HOSTNAME,
          DFSConfigKeys.DFS_DATANODE_USE_DN_HOSTNAME_DEFAULT);
      this.supergroup = conf.get(DFSConfigKeys.DFS_PERMISSIONS_SUPERUSERGROUP_KEY,
          DFSConfigKeys.DFS_PERMISSIONS_SUPERUSERGROUP_DEFAULT);
      this.isPermissionEnabled = conf.getBoolean(
          DFSConfigKeys.DFS_PERMISSIONS_ENABLED_KEY,
          DFSConfigKeys.DFS_PERMISSIONS_ENABLED_DEFAULT);
      this.pipelineSupportECN = conf.getBoolean(
          DFSConfigKeys.DFS_PIPELINE_ECN_ENABLED,
          DFSConfigKeys.DFS_PIPELINE_ECN_ENABLED_DEFAULT);

      confVersion = "core-" +
          conf.get("hadoop.common.configuration.version", "UNSPECIFIED") +
          ",hdfs-" +
          conf.get("hadoop.hdfs.configuration.version", "UNSPECIFIED");

      this.volumeChecker = new DatasetVolumeChecker(conf, new Timer());

      // Determine whether we should try to pass file descriptors to clients.
      if (conf.getBoolean(HdfsClientConfigKeys.Read.ShortCircuit.KEY,
          HdfsClientConfigKeys.Read.ShortCircuit.DEFAULT)) {
        String reason = DomainSocket.getLoadingFailureReason();
        if (reason != null) {
          LOG.warn("File descriptor passing is disabled because " + reason);
          this.fileDescriptorPassingDisabledReason = reason;
        } else {
          LOG.info("File descriptor passing is enabled.");
          this.fileDescriptorPassingDisabledReason = null;
        }
      } else {
        this.fileDescriptorPassingDisabledReason =
            "File descriptor passing was not configured.";
        LOG.debug(this.fileDescriptorPassingDisabledReason);
      }

      this.socketFactory = NetUtils.getDefaultSocketFactory(conf);

      try {
        hostName = getHostName(conf);
        LOG.info("Configured hostname is " + hostName);
        startDataNode(dataDirs, resources);
      } catch (IOException ie) {
        shutdown();
        throw ie;
      }
      final int dncCacheMaxSize =
          conf.getInt(DFS_DATANODE_NETWORK_COUNTS_CACHE_MAX_SIZE_KEY,
              DFS_DATANODE_NETWORK_COUNTS_CACHE_MAX_SIZE_DEFAULT) ;
      datanodeNetworkCounts =
          CacheBuilder.newBuilder()
              .maximumSize(dncCacheMaxSize)
              .build(new CacheLoader<String, Map<String, Long>>() {
                @Override
                public Map<String, Long> load(String key) throws Exception {
                  final Map<String, Long> ret = new HashMap<String, Long>();
                  ret.put("networkErrors", 0L);
                  return ret;
                }
              });

      initOOBTimeout();
      this.storageLocationChecker = storageLocationChecker;
    }
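The most important call in it is startDataNode. Its core steps are, in summary:
1. Register the MBean.
2. Create a TcpPeerServer listening on port 50010. This server handles communication with clients and with other DataNodes; it does not use Hadoop's IPC mechanism.
3. Start the JvmPauseMonitor, which tracks JVM pauses and logs an entry whenever it detects one.
4. Initialize the IpcServer, listening on port 50020.
5. Construct a BPOfferService and start its BPServiceActor thread. A BPServiceActor is a thread that first handshakes with the NameNode as a pre-registration step, then registers the DataNode with the NameNode, and after that periodically sends heartbeats to the NameNode and processes the commands returned in the responses.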
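The pause detection in step 3 relies on a simple trick: a thread sleeps for a fixed interval and checks how much longer than that interval it was actually away. The sketch below shows the idea only; the class name, the 500 ms interval, and the 1000 ms threshold are assumptions chosen for illustration rather than values taken from the source.

```java
// Hypothetical sketch of sleep-based JVM pause detection.
class PauseMonitorSketch {
    static final long SLEEP_MS = 500;            // assumed sleep interval
    static final long WARN_THRESHOLD_MS = 1000;  // assumed reporting threshold

    // Extra time beyond the requested sleep; a large value hints at a GC or host pause.
    static long extraPauseMs(long startNanos, long endNanos) {
        return (endNanos - startNanos) / 1_000_000 - SLEEP_MS;
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 3; i++) {
            long start = System.nanoTime();
            Thread.sleep(SLEEP_MS);
            long extra = extraPauseMs(start, System.nanoTime());
            if (extra > WARN_THRESHOLD_MS) {
                System.out.println("Detected pause of approximately " + extra + "ms");
            }
        }
        System.out.println("monitor loop finished");
    }
}
```

The monitor cannot tell what caused a pause; it only notices that the whole process, including its own thread, stopped for longer than expected.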
In detail, step 5 corresponds to the following code:

    public void run() {
      LOG.info(this + " starting to offer service");

      try {
        while (true) {
          // init stuff
          try {
            // setup storage
            connectToNNAndHandshake();
            break;
          } catch (IOException ioe) {
            // Initial handshake, storage recovery or registration failed
            runningState = RunningState.INIT_FAILED;
            if (shouldRetryInit()) {
              // Retry until all namenode's of BPOS failed initialization
              LOG.error("Initialization failed for " + this + " "
                  + ioe.getLocalizedMessage());
              sleepAndLogInterrupts(5000, "initializing");
            } else {
              runningState = RunningState.FAILED;
              LOG.error("Initialization failed for " + this + ". Exiting. ", ioe);
              return;
            }
          }
        }

        runningState = RunningState.RUNNING;
        if (initialRegistrationComplete != null) {
          initialRegistrationComplete.countDown();
        }

        while (shouldRun()) {
          try {
            offerService();
          } catch (Exception ex) {
            LOG.error("Exception in BPOfferService for " + this, ex);
            sleepAndLogInterrupts(5000, "offering service");
          }
        }
        runningState = RunningState.EXITED;
      } catch (Throwable ex) {
        LOG.warn("Unexpected exception in block pool " + this, ex);
        runningState = RunningState.FAILED;
      } finally {
        LOG.warn("Ending block pool service for: " + this);
        cleanUp();
      }
    }
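The control flow of run has two phases: retry initialization until it succeeds or is given up, then keep offering service while tolerating errors. A stripped-down sketch of that shape follows. ActorLoopSketch, NameNodeLink, and the attempt limit are hypothetical simplifications; the real code decides about retries through shouldRetryInit and about stopping through shouldRun.

```java
import java.io.IOException;

// Hypothetical two-phase loop: "retry the handshake, then serve until told to stop".
class ActorLoopSketch {
    enum State { EXITED, FAILED }

    // Stand-in for the NameNode-facing calls.
    interface NameNodeLink {
        void handshake() throws IOException;        // connect, handshake, register
        boolean offerService() throws IOException;  // one service round; false means stop
    }

    static State run(NameNodeLink nn, int maxInitAttempts, long retrySleepMs) {
        int attempts = 0;
        while (true) {  // phase 1: initialization with retries
            try {
                nn.handshake();
                break;
            } catch (IOException e) {
                attempts++;
                if (attempts >= maxInitAttempts || !sleep(retrySleepMs)) {
                    return State.FAILED;
                }
            }
        }
        while (true) {  // phase 2: service loop that survives transient errors
            try {
                if (!nn.offerService()) {
                    return State.EXITED;
                }
            } catch (IOException e) {
                if (!sleep(retrySleepMs)) {
                    return State.FAILED;
                }
            }
        }
    }

    // Sleeps and reports false if the thread was interrupted.
    private static boolean sleep(long ms) {
        try {
            Thread.sleep(ms);
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        NameNodeLink flaky = new NameNodeLink() {
            public void handshake() throws IOException {
                if (++calls[0] < 3) {
                    throw new IOException("NameNode not reachable yet");
                }
            }
            public boolean offerService() {
                return false;  // stop right after the first round
            }
        };
        System.out.println(run(flaky, 5, 10) + " after " + calls[0] + " handshake attempts");
    }
}
```

Keeping the two loops separate means a DataNode that starts before its NameNode keeps retrying the handshake, while an error during normal service never sends it back through registration from scratch.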
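The BPServiceActor thread does the following:
1. It sends a versionRequest to the NameNode to obtain the NameNode's namespace and version information. The response is a NamespaceInfo.
2. It initializes the storage from the NamespaceInfo, formatting it first. After initialization a UUID is generated, as the following log lines show: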
2017-02-11 21:59:33,901 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Setting up storage: nsid=537369943;bpid=BP-503975772-192.168.0.109-1486821555429;lv=-56;nsInfo=lv=-60;cid=CID-c79cc043-b282-435c-a0f6-d5a55b23e87e;nsid=537369943;c=0;bpid=BP-503975772-192.168.0.109-1486821555429;dnuuid=null
2017-02-11 21:59:33,902 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Generated and persisted new Datanode UUID 43ed99d1-20c6-4d71-919c-e9a70cb75c6e
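3. The actual handshake: it sends a registerDatanode request to the NameNode. The NameNode handles the request and uses the DatanodeManager to perform registerDatanode. The NameNode log then shows entries like these: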
2017-02-11 21:59:34,090 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* registerDatanode: from DatanodeRegistration(127.0.0.1, datanodeUuid=43ed99d1-20c6-4d71-919c-e9a70cb75c6e, infoPort=50075, ipcPort=50020, storageInfo=lv=-56;cid=CID-c79cc043-b282-435c-a0f6-d5a55b23e87e;nsid=537369943;c=0) storage 43ed99d1-20c6-4d71-919c-e9a70cb75c6e
2017-02-11 21:59:34,099 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Number of failed storage changes from 0 to 0
2017-02-11 21:59:34,100 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/127.0.0.1:50010
2017-02-11 21:59:34,189 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Number of failed storage changes from 0 to 0
2017-02-11 21:59:34,189 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Adding new storage ID DS-7d302778-acd6-4366-be5e-9dbf7ad22c4d for DN 127.0.0.1:50010
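4. It calls offerService and starts sending heartbeats periodically. Each heartbeat carries the DataNode's name, its data transfer port, its total capacity, and its remaining bytes. When the NameNode receives a heartbeat, it processes it in handleHeartbeat.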
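The heartbeat exchange in step 4 can be modeled as a small record plus a receiver that remembers when each DataNode was last heard from. The sketch below is a conceptual illustration with hypothetical names; in particular, the liveness check and its expiry parameter are assumptions added to show why the NameNode tracks heartbeats, not a description of the actual HeartbeatManager logic.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the NameNode side of heartbeat handling.
class HeartbeatSketch {
    // The four fields named in the text: name, transfer port, capacity, remaining bytes.
    static final class Heartbeat {
        final String datanodeName;
        final int xferPort;
        final long capacityBytes;
        final long remainingBytes;

        Heartbeat(String datanodeName, int xferPort, long capacityBytes, long remainingBytes) {
            this.datanodeName = datanodeName;
            this.xferPort = xferPort;
            this.capacityBytes = capacityBytes;
            this.remainingBytes = remainingBytes;
        }
    }

    private final Map<String, Heartbeat> latest = new HashMap<>();
    private final Map<String, Long> lastSeenMs = new HashMap<>();

    // Records the newest stats and the arrival time for the sending DataNode.
    void handleHeartbeat(Heartbeat hb, long nowMs) {
        latest.put(hb.datanodeName, hb);
        lastSeenMs.put(hb.datanodeName, nowMs);
    }

    // Assumption for illustration: a node counts as alive if heard from within expiryMs.
    boolean isAlive(String name, long nowMs, long expiryMs) {
        Long seen = lastSeenMs.get(name);
        return seen != null && nowMs - seen <= expiryMs;
    }

    // Returns -1 for a DataNode that has never sent a heartbeat.
    long remainingBytes(String name) {
        Heartbeat hb = latest.get(name);
        return hb == null ? -1 : hb.remainingBytes;
    }

    public static void main(String[] args) {
        HeartbeatSketch nameNode = new HeartbeatSketch();
        long now = System.currentTimeMillis();
        nameNode.handleHeartbeat(new Heartbeat("127.0.0.1:50010", 50010, 100_000L, 60_000L), now);
        System.out.println("alive: " + nameNode.isAlive("127.0.0.1:50010", now + 3_000, 30_000));
        System.out.println("remaining: " + nameNode.remainingBytes("127.0.0.1:50010"));
    }
}
```

Passing the clock value in as a parameter keeps the sketch deterministic, which makes the expiry rule easy to check without waiting for real time to pass.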
At this point the NameNode and the DataNode are both running normally, and HDFS startup is complete.