@yangwenbo · 2022-06-20

Tsinghua University - FIB Lab

Spark Cluster Setup

1. Preliminary Preparation

1.1 Environment Details

| Hostname | IP | Role | JDK | Hadoop | ZooKeeper | Scala | Spark | OS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| hadoop-spark01 | 192.168.200.43 | Master node | 1.8 | 3.2.0 | 3.7.0 | 2.12.16 | 3.2.0 | Ubuntu 20.04 |
| hadoop-spark02 | 192.168.200.44 | Slave1 node | 1.8 | 3.2.0 | 3.7.0 | 2.12.16 | 3.2.0 | Ubuntu 20.04 |
| hadoop-spark03 | 192.168.200.45 | Slave2 node | 1.8 | 3.2.0 | 3.7.0 | 2.12.16 | 3.2.0 | Ubuntu 20.04 |

1.2 Add Hostname Mappings (run on all three nodes)

  # Add hostname mappings
  root@hadoop-spark01:~# vim /etc/hosts
  root@hadoop-spark01:~# tail -3 /etc/hosts
  192.168.200.43 hadoop-spark01
  192.168.200.44 hadoop-spark02
  192.168.200.45 hadoop-spark03
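
To confirm that the new entries resolve correctly, a quick check like the one below can be run on each node (a minimal sketch; getent reads the entries from /etc/hosts):

  # Each hostname should resolve to the IP listed above
  for host in hadoop-spark01 hadoop-spark02 hadoop-spark03; do
      getent hosts "$host"
  done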

1.3 Create the hadoop User (run on all three nodes)

  # Create the hadoop user
  root@hadoop-spark01:~# adduser hadoop --home /home/hadoop
  # Grant the new hadoop user sudo privileges
  root@hadoop-spark01:~# adduser hadoop sudo

1.4 Install SSH and Configure Passwordless SSH Login

Both cluster and single-node modes rely on SSH login (similar to a remote login: you log in to a Linux host and run commands on it). Ubuntu ships with the SSH client by default, but the SSH server still needs to be installed:

  # Install openssh-server
  hadoop@hadoop-spark01:~$ sudo apt-get -y install openssh-server

  # Generate a key pair
  hadoop@hadoop-spark01:~$ ssh-keygen -t rsa
  # Distribute the public key to every node
  hadoop@hadoop-spark01:~$ ssh-copy-id 192.168.200.43
  hadoop@hadoop-spark01:~$ ssh-copy-id 192.168.200.44
  hadoop@hadoop-spark01:~$ ssh-copy-id 192.168.200.45
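
Before moving on, it is worth confirming that passwordless login works to every node. A minimal sketch (it assumes the hostnames from section 1.2 resolve; BatchMode makes ssh fail instead of prompting for a password):

  # Verify passwordless SSH from this node to all three nodes
  for host in hadoop-spark01 hadoop-spark02 hadoop-spark03; do
      ssh -o BatchMode=yes "$host" hostname \
          && echo "OK: $host" \
          || echo "FAILED: $host (re-run ssh-copy-id for this host)"
  done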

1.5 Download the Installation Packages

Go to the Spark official website.

Be careful to choose the right package here. Since Spark is normally used together with Hadoop, download the package that is pre-built with Hadoop dependencies.

The download page only lists the latest Spark releases. It is generally not advisable to take the newest one; stepping back one or two minor versions from the latest is safer.

You then need to pick the Spark package that matches your Hadoop version. Only Hadoop 2.7 and Hadoop 3.2 builds are offered, so the higher one, the Spark package built for Hadoop 3.2, is chosen here.

The Hadoop version must also match the Spark build, so go to the Hadoop official website and download Hadoop 3.2.

ZooKeeper is used here for high availability; download a release from the ZooKeeper official website.

Download Scala 2.12.16 from the Scala official website.

2. Install the ZooKeeper Cluster

2.1 Install the Java Environment (run on all three nodes)

  # Unpack the JDK
  hadoop@hadoop-spark01:~$ sudo tar xf jdk-8u162-linux-x64.tar.gz -C /usr/local/
  hadoop@hadoop-spark01:~$ sudo mv /usr/local/jdk1.8.0_162 /usr/local/jdk

  # Add environment variables
  hadoop@hadoop-spark01:~$ vim ~/.bashrc
  hadoop@hadoop-spark01:~$ tail -4 ~/.bashrc
  # jdk
  export JAVA_HOME=/usr/local/jdk
  export PATH=$JAVA_HOME/bin:$JAVA_HOME/jre/bin:$PATH
  export CLASSPATH=.:$CLASSPATH:$JAVA_HOME/lib:$JAVA_HOME/lib/tools.jar
  # Make the environment variables take effect immediately
  hadoop@hadoop-spark01:~$ source ~/.bashrc

  # Check that Java is installed correctly
  hadoop@hadoop-spark01:~$ java -version
  java version "1.8.0_162"
  Java(TM) SE Runtime Environment (build 1.8.0_162-b12)
  Java HotSpot(TM) 64-Bit Server VM (build 25.162-b12, mixed mode)

2.2 Install the ZooKeeper Cluster (run on all three nodes)

2.2.1 Install ZooKeeper
  # Unpack ZooKeeper
  hadoop@hadoop-spark01:~$ sudo tar xf apache-zookeeper-3.7.0-bin.tar.gz -C /usr/local/
  hadoop@hadoop-spark01:~$ sudo mv /usr/local/apache-zookeeper-3.7.0-bin /usr/local/zookeeper
  hadoop@hadoop-spark01:~$ sudo chown -R hadoop:hadoop /usr/local/zookeeper
2.2.2 Create the ZooKeeper Data and Log Directories
  # Create the ZooKeeper data and log directories
  hadoop@hadoop-spark01:~$ mkdir -p /usr/local/zookeeper/data
  hadoop@hadoop-spark01:~$ mkdir -p /usr/local/zookeeper/logs
2.2.3 Edit the ZooKeeper Configuration File
  # Copy zoo_sample.cfg to zoo.cfg; zoo.cfg is the file that is actually used
  hadoop@hadoop-spark01:~$ cd /usr/local/zookeeper/conf/
  hadoop@hadoop-spark01:/usr/local/zookeeper/conf$ cat zoo_sample.cfg | egrep -v "^$|^#" >zoo.cfg
  hadoop@hadoop-spark01:/usr/local/zookeeper/conf$ vim zoo.cfg
  hadoop@hadoop-spark01:/usr/local/zookeeper/conf$ cat zoo.cfg
  # Heartbeat interval between ZooKeeper servers and clients, in milliseconds
  tickTime=2000
  # Leader/follower initial connection time limit (in ticks)
  initLimit=10
  # Leader/follower sync time limit (in ticks)
  syncLimit=5
  # Data directory
  dataDir=/usr/local/zookeeper/data
  # Log directory
  dataLogDir=/usr/local/zookeeper/logs
  # Enable the four-letter-word admin commands
  4lw.commands.whitelist=*
  # Client connection port
  clientPort=2181
  # Cluster members
  server.1=192.168.200.43:2888:3888
  server.2=192.168.200.44:2888:3888
  server.3=192.168.200.45:2888:3888

  # NOTE: the myid value differs on every machine; the node listed as server.2 must echo 2, and so on
  hadoop@hadoop-spark01:/usr/local/zookeeper/conf$ echo 1 > /usr/local/zookeeper/data/myid
  hadoop@hadoop-spark02:/usr/local/zookeeper/conf$ echo 2 > /usr/local/zookeeper/data/myid
  hadoop@hadoop-spark03:/usr/local/zookeeper/conf$ echo 3 > /usr/local/zookeeper/data/myid
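
If you prefer to write all three myid files from hadoop-spark01 in one go, a loop such as the following can be used instead (a sketch only; it relies on the passwordless SSH configured in section 1.4 and on the server.N numbering shown above):

  # Write a distinct myid to each node: server.1 -> 1, server.2 -> 2, server.3 -> 3
  id=1
  for host in hadoop-spark01 hadoop-spark02 hadoop-spark03; do
      ssh "$host" "echo $id > /usr/local/zookeeper/data/myid"
      id=$((id + 1))
  done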
2.2.4 Add the Java Environment Variable
  hadoop@hadoop-spark01:/usr/local/zookeeper/conf$ cd /usr/local/zookeeper/bin/
  hadoop@hadoop-spark01:/usr/local/zookeeper/bin$ vim zkEnv.sh
  hadoop@hadoop-spark01:/usr/local/zookeeper/bin$ sed -n "32p" zkEnv.sh
  export JAVA_HOME=/usr/local/jdk
2.2.5 Configure a Start/Stop Script
  hadoop@hadoop-spark01:/usr/local/zookeeper/bin$ cd /etc/init.d/
  hadoop@hadoop-spark01:/etc/init.d$ sudo vim zookeeper
  hadoop@hadoop-spark01:/etc/init.d$ cat zookeeper
  #!/bin/bash
  #chkconfig:2345 20 90
  #description:zookeeper
  #processname:zookeeper
  export JAVA_HOME=/usr/local/jdk
  case $1 in
  start) sudo /usr/local/zookeeper/bin/zkServer.sh start;;
  stop) sudo /usr/local/zookeeper/bin/zkServer.sh stop;;
  status) sudo /usr/local/zookeeper/bin/zkServer.sh status;;
  restart) sudo /usr/local/zookeeper/bin/zkServer.sh restart;;
  *) echo "require start|stop|status|restart" ;;
  esac

  # Make the script executable
  hadoop@hadoop-spark01:/etc/init.d$ sudo chmod +x zookeeper
2.2.6 Start ZooKeeper
  # Start ZooKeeper
  hadoop@hadoop-spark01:/etc/init.d$ service zookeeper start
  # Check the listening port
  hadoop@hadoop-spark01:/etc/init.d$ netstat -anp | grep 2181
  tcp6 0 0 :::2181 :::* LISTEN 4702/java
  tcp6 0 0 127.0.0.1:50654 127.0.0.1:2181 TIME_WAIT -
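
After ZooKeeper has been started on all three nodes, confirm that the ensemble has elected a leader. A quick check (a sketch; the commented alternative uses netcat and relies on the 4lw.commands.whitelist=* setting from zoo.cfg):

  # Each server reports its role; expect one "Mode: leader" and two "Mode: follower"
  for host in hadoop-spark01 hadoop-spark02 hadoop-spark03; do
      echo "--- $host ---"
      ssh "$host" /usr/local/zookeeper/bin/zkServer.sh status
      # Alternative: echo srvr | nc "$host" 2181 | grep Mode
  done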

3. Build the HDFS HA Cluster (run on all three nodes)

3.1 Install Hadoop 3.2.0

  # Install Hadoop
  hadoop@hadoop-spark01:~$ sudo tar xf hadoop-3.2.0.tar.gz -C /usr/local/
  hadoop@hadoop-spark01:~$ sudo mv /usr/local/hadoop-3.2.0 /usr/local/hadoop
  hadoop@hadoop-spark01:~$ sudo chown -R hadoop:hadoop /usr/local/hadoop

  # Verify that Hadoop works
  hadoop@hadoop-spark01:~$ /usr/local/hadoop/bin/hadoop version
  Hadoop 3.2.0
  Source code repository https://github.com/apache/hadoop.git -r e97acb3bd8f3befd27418996fa5d4b50bf2e17bf
  Compiled by sunilg on 2019-01-08T06:08Z
  Compiled with protoc 2.5.0
  From source with checksum d3f0795ed0d9dc378e2c785d3668f39
  This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-3.2.0.jar

3.2 Configure the PATH Variable

  # Add environment variables
  hadoop@hadoop-spark01:~$ vim ~/.bashrc
  hadoop@hadoop-spark01:~$ tail -2 ~/.bashrc
  # hadoop
  export PATH=$PATH:/usr/local/hadoop/bin:/usr/local/hadoop/sbin
  # Make the environment variables take effect immediately
  hadoop@hadoop-spark01:~$ source ~/.bashrc

3.3 Configure the HA Cluster

For the cluster/distributed mode, the configuration files under /usr/local/hadoop/etc/hadoop need to be modified. Only the settings required for a normal startup are covered here, across five files: workers, core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml.

3.3.1 Edit workers
  hadoop@hadoop-spark01:~$ cd /usr/local/hadoop/etc/hadoop/
  hadoop@hadoop-spark01:/usr/local/hadoop/etc/hadoop$ vim workers
  hadoop@hadoop-spark01:/usr/local/hadoop/etc/hadoop$ cat workers
  hadoop-spark01
  hadoop-spark02
  hadoop-spark03
3.3.2 Edit core-site.xml

Change core-site.xml to the following:

  hadoop@hadoop-spark01:/usr/local/hadoop/etc/hadoop$ nano core-site.xml
  hadoop@hadoop-spark01:/usr/local/hadoop/etc/hadoop$ cat core-site.xml
  ......
  <configuration>
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://ns1</value>
    </property>
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/usr/local/hadoop/tmp</value>
    </property>
    <property>
      <name>hadoop.http.staticuser.user</name>
      <value>hadoop</value>
    </property>
    <property>
      <name>ha.zookeeper.quorum</name>
      <value>hadoop-spark01:2181,hadoop-spark02:2181,hadoop-spark03:2181</value>
    </property>
    <property>
      <name>hadoop.proxyuser.hadoop.hosts</name>
      <value>*</value>
    </property>
    <property>
      <name>hadoop.proxyuser.hadoop.groups</name>
      <value>*</value>
    </property>
  </configuration>
3.3.3 Edit hdfs-site.xml
  hadoop@hadoop-spark01:/usr/local/hadoop/etc/hadoop$ nano hdfs-site.xml
  hadoop@hadoop-spark01:/usr/local/hadoop/etc/hadoop$ cat hdfs-site.xml
  ......
  <configuration>
    <property>
      <name>dfs.ha.automatic-failover.enabled.ns</name>
      <value>true</value>
    </property>
    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>
    <property>
      <name>dfs.permissions.enabled</name>
      <value>false</value>
    </property>
    <property>
      <name>dfs.nameservices</name>
      <value>ns1</value>
    </property>
    <property>
      <name>dfs.blocksize</name>
      <value>134217728</value>
    </property>
    <property>
      <name>dfs.ha.namenodes.ns1</name>
      <value>nn1,nn2</value>
    </property>
    <!-- RPC address of nn1 -->
    <property>
      <name>dfs.namenode.rpc-address.ns1.nn1</name>
      <value>hadoop-spark01:9000</value>
    </property>
    <!-- HTTP address of nn1 (external access) -->
    <property>
      <name>dfs.namenode.http-address.ns1.nn1</name>
      <value>hadoop-spark01:50070</value>
    </property>
    <!-- RPC address of nn2 -->
    <property>
      <name>dfs.namenode.rpc-address.ns1.nn2</name>
      <value>hadoop-spark02:9000</value>
    </property>
    <!-- HTTP address of nn2 (external access) -->
    <property>
      <name>dfs.namenode.http-address.ns1.nn2</name>
      <value>hadoop-spark02:50070</value>
    </property>
    <!-- Where the NameNode edit log is stored on the JournalNodes (usually co-located with ZooKeeper) -->
    <property>
      <name>dfs.namenode.shared.edits.dir</name>
      <value>qjournal://hadoop-spark01:8485;hadoop-spark02:8485/ns1</value>
    </property>
    <!-- Local directory where the JournalNodes store their data -->
    <property>
      <name>dfs.journalnode.edits.dir</name>
      <value>/usr/local/hadoop/journaldata</value>
    </property>
    <!-- Enable automatic failover when a NameNode fails -->
    <property>
      <name>dfs.ha.automatic-failover.enabled.ns1</name>
      <value>true</value>
    </property>
    <!-- Java class that HDFS clients use to determine which NameNode is currently active -->
    <property>
      <name>dfs.client.failover.proxy.provider.ns1</name>
      <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
    </property>
    <!-- Fencing methods used during failover; several are available (see the official docs). sshfence logs in to the old active NameNode and kills it -->
    <property>
      <name>dfs.ha.fencing.methods</name>
      <value>
      sshfence
      shell(/bin/true)
      </value>
    </property>
    <!-- SSH private key for passwordless login; only needed when the sshfence mechanism is used -->
    <property>
      <name>dfs.ha.fencing.ssh.private-key-files</name>
      <value>/home/hadoop/.ssh/id_rsa</value>
    </property>
    <!-- Timeout for the sshfence mechanism; like the property above, it can be omitted when a script-based fencing method is used -->
    <property>
      <name>dfs.ha.fencing.ssh.connect-timeout</name>
      <value>30000</value>
    </property>
    <!-- Enable automatic failover; can be skipped if automatic failover is not wanted -->
    <property>
      <name>dfs.ha.automatic-failover.enabled</name>
      <value>true</value>
    </property>
    <!-- Directory for the HDFS namespace metadata -->
    <property>
      <name>dfs.name.dir</name>
      <value>file:/usr/local/hadoop/name</value>
    </property>
    <!-- Directory for HDFS block data -->
    <property>
      <name>dfs.data.dir</name>
      <value>file:/usr/local/hadoop/data</value>
    </property>
    <property>
      <name>dfs.webhdfs.enabled</name>
      <value>true</value>
    </property>
  </configuration>
3.3.4 Edit mapred-site.xml

Configure mapred-site.xml as follows:

  hadoop@hadoop-spark01:/usr/local/hadoop/etc/hadoop$ nano mapred-site.xml
  hadoop@hadoop-spark01:/usr/local/hadoop/etc/hadoop$ cat mapred-site.xml
  .....
  <configuration>
    <!-- Tell Hadoop to run MapReduce (MR) jobs on YARN -->
    <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
    </property>
    <property>
      <name>yarn.application.classpath</name>
      <value>/usr/local/hadoop/etc/hadoop/mapreduce/*,/usr/local/hadoop/etc/hadoop/mapreduce/lib/*</value>
    </property>
  </configuration>
3.3.5 Edit yarn-site.xml

Configure yarn-site.xml as follows:

  hadoop@hadoop-spark01:/usr/local/hadoop/etc/hadoop$ nano yarn-site.xml
  hadoop@hadoop-spark01:/usr/local/hadoop/etc/hadoop$ cat yarn-site.xml
  ......
  <configuration>
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
    </property>
    <!-- Site specific YARN configuration properties -->
    <!-- Enable ResourceManager HA -->
    <property>
      <name>yarn.resourcemanager.ha.enabled</name>
      <value>true</value>
    </property>
    <!-- Cluster ID for the ResourceManagers -->
    <property>
      <name>yarn.resourcemanager.cluster-id</name>
      <value>yrc</value>
    </property>
    <!-- Two ResourceManagers are used; declare their logical IDs -->
    <property>
      <name>yarn.resourcemanager.ha.rm-ids</name>
      <value>rm1,rm2</value>
    </property>
    <!-- Host of rm1 -->
    <property>
      <name>yarn.resourcemanager.hostname.rm1</name>
      <value>hadoop-spark01</value>
    </property>
    <!-- Host of rm2 -->
    <property>
      <name>yarn.resourcemanager.hostname.rm2</name>
      <value>hadoop-spark02</value>
    </property>
    <property>
      <name>yarn.application.classpath</name>
      <value>/usr/local/hadoop/etc/hadoop:/usr/local/hadoop/share/hadoop/common/lib/*:/usr/local/hadoop/share/hadoop/common/*:/usr/local/hadoop/share/hadoop/hdfs:/usr/local/hadoop/share/hadoop/hdfs/lib/*:/usr/local/hadoop/share/hadoop/hdfs/*:/usr/local/hadoop/share/hadoop/mapreduce/lib/*:/usr/local/hadoop/share/hadoop/mapreduce/*:/usr/local/hadoop/share/hadoop/yarn:/usr/local/hadoop/share/hadoop/yarn/lib/*:/usr/local/hadoop/share/hadoop/yarn/*</value>
    </property>
    <!-- ZooKeeper ensemble address -->
    <property>
      <name>yarn.resourcemanager.zk-address</name>
      <value>hadoop-spark01:2181,hadoop-spark02:2181,hadoop-spark03:2181</value>
    </property>
    <!-- Enable recovery so that running applications survive a ResourceManager failure (default is false) -->
    <property>
      <name>yarn.resourcemanager.recovery.enabled</name>
      <value>true</value>
    </property>
    <!-- Auxiliary service run by the NodeManagers (mapreduce_shuffle) -->
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
    </property>
    <property>
      <name>yarn.resourcemanager.address.rm1</name>
      <value>hadoop-spark01:8032</value>
    </property>
    <property>
      <name>yarn.resourcemanager.address.rm2</name>
      <value>hadoop-spark02:8032</value>
    </property>
    <property>
      <name>yarn.resourcemanager.scheduler.address.rm2</name>
      <value>hadoop-spark02:8030</value>
    </property>
    <property>
      <name>yarn.resourcemanager.webapp.address.rm2</name>
      <value>hadoop-spark02:8088</value>
    </property>
    <property>
      <name>yarn.resourcemanager.resource-tracker.address.rm2</name>
      <value>hadoop-spark02:8031</value>
    </property>
    <property>
      <name>yarn.resourcemanager.admin.address.rm2</name>
      <value>hadoop-spark02:8033</value>
    </property>
    <property>
      <name>yarn.resourcemanager.ha.admin.address.rm2</name>
      <value>hadoop-spark02:23142</value>
    </property>
    <property>
      <name>yarn.resourcemanager.scheduler.address.rm1</name>
      <value>hadoop-spark01:8030</value>
    </property>
    <property>
      <name>yarn.resourcemanager.webapp.address.rm1</name>
      <value>hadoop-spark01:8088</value>
    </property>
    <property>
      <name>yarn.resourcemanager.resource-tracker.address.rm1</name>
      <value>hadoop-spark01:8031</value>
    </property>
    <property>
      <name>yarn.resourcemanager.admin.address.rm1</name>
      <value>hadoop-spark01:8033</value>
    </property>
    <property>
      <name>yarn.resourcemanager.ha.admin.address.rm1</name>
      <value>hadoop-spark01:23142</value>
    </property>
    <!-- Store ResourceManager state in the ZooKeeper ensemble (by default it is kept in the FileSystem state store) -->
    <property>
      <name>yarn.resourcemanager.store.class</name>
      <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
    </property>
    <property>
      <name>yarn.nodemanager.vmem-check-enabled</name>
      <value>false</value>
      <description>
      </description>
    </property>
  </configuration>
3.3.6 Configure the Hadoop JAVA_HOME Variable
  hadoop@hadoop-spark01:/usr/local/hadoop/etc/hadoop$ vim hadoop-env.sh
  hadoop@hadoop-spark01:/usr/local/hadoop/etc/hadoop$ sed -n "54p" hadoop-env.sh
  export JAVA_HOME=/usr/local/jdk
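
This section's heading says to apply the changes on all three nodes. If you edited the configuration files only on hadoop-spark01, one way to push the whole configuration directory to the other two nodes is shown below (a sketch; it assumes rsync is installed and that the hadoop user owns /usr/local/hadoop on every node):

  # Push the Hadoop configuration directory from hadoop-spark01 to the other nodes
  for host in hadoop-spark02 hadoop-spark03; do
      rsync -av /usr/local/hadoop/etc/hadoop/ "$host":/usr/local/hadoop/etc/hadoop/
  done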
3.3.7 Start the JournalNodes First
  hadoop@hadoop-spark01:~$ hadoop-daemon.sh start journalnode
  hadoop@hadoop-spark02:~$ hadoop-daemon.sh start journalnode
  hadoop@hadoop-spark03:~$ hadoop-daemon.sh start journalnode
3.3.8 Format the NameNode on the Master node (do this only once; do not format again when restarting Hadoop later). The commands are:
  # Format the NameNode
  hadoop@hadoop-spark01:~$ hdfs namenode -format
  # Copy the formatted Hadoop directory to hadoop-spark02 (the other NameNode)
  hadoop@hadoop-spark01:~$ scp -r /usr/local/hadoop hadoop@hadoop-spark02:/usr/local/
3.3.9 Format ZKFC on hadoop-spark01
  hadoop@hadoop-spark01:~$ hdfs zkfc -formatZK

3.4 Start Hadoop

3.4.1 First, Modify the Start/Stop Scripts

Startup is done on the Master node; run the following commands:

  # Add the following lines to start-dfs.sh and stop-dfs.sh
  hadoop@hadoop-spark01:/usr/local/hadoop/sbin$ vim start-dfs.sh
  hadoop@hadoop-spark01:/usr/local/hadoop/sbin$ tail -6 start-dfs.sh
  HDFS_NAMENODE_USER=hadoop
  HDFS_DATANODE_USER=hadoop
  HDFS_DATANODE_SECURE_USER=hadoop
  HDFS_SECONDARYNAMENODE_USER=hadoop
  HDFS_JOURNALNODE_USER=hadoop
  HDFS_ZKFC_USER=hadoop
  hadoop@hadoop-spark01:/usr/local/hadoop/sbin$ vim stop-dfs.sh
  hadoop@hadoop-spark01:/usr/local/hadoop/sbin$ tail -6 stop-dfs.sh
  HDFS_NAMENODE_USER=hadoop
  HDFS_DATANODE_USER=hadoop
  HDFS_DATANODE_SECURE_USER=hadoop
  HDFS_SECONDARYNAMENODE_USER=hadoop
  HDFS_JOURNALNODE_USER=hadoop
  HDFS_ZKFC_USER=hadoop

  # Add the following lines to start-yarn.sh and stop-yarn.sh
  hadoop@hadoop-spark01:/usr/local/hadoop/sbin$ vim start-yarn.sh
  hadoop@hadoop-spark01:/usr/local/hadoop/sbin$ tail -3 start-yarn.sh
  YARN_RESOURCEMANAGER_USER=hadoop
  HADOOP_SECURE_DN_USER=hadoop
  YARN_NODEMANAGER_USER=hadoop
  hadoop@hadoop-spark01:/usr/local/hadoop/sbin$ vim stop-yarn.sh
  hadoop@hadoop-spark01:/usr/local/hadoop/sbin$ tail -3 stop-yarn.sh
  YARN_RESOURCEMANAGER_USER=hadoop
  HADOOP_SECURE_DN_USER=hadoop
  YARN_NODEMANAGER_USER=hadoop
3.4.2 Start Hadoop (run on the Master node)
  # Start Hadoop
  hadoop@hadoop-spark01:/usr/local/hadoop/sbin$ start-all.sh
  # Start the web service for viewing job history
  hadoop@hadoop-spark01:/usr/local/hadoop/sbin$ mr-jobhistory-daemon.sh start historyserver
3.4.3 Check the Processes with jps
  # hadoop-spark01
  hadoop@hadoop-spark01:~$ jps
  62704 NameNode
  64068 JobHistoryServer
  63256 DFSZKFailoverController
  64137 Jps
  62106 JournalNode
  63691 NodeManager
  63547 ResourceManager
  62877 DataNode
  # hadoop-spark02
  hadoop@hadoop-spark02:~$ jps
  61860 NodeManager
  61239 NameNode
  60983 JournalNode
  61368 DataNode
  62153 Jps
  61612 DFSZKFailoverController
  61743 ResourceManager
  # hadoop-spark03
  hadoop@hadoop-spark03:~$ jps
  60000 DataNode
  60148 NodeManager
  60246 Jps
  59853 JournalNode
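
Beyond jps, the HA state of the NameNodes and ResourceManagers can be queried directly. A quick sanity check (nn1/nn2 and rm1/rm2 are the IDs defined earlier in hdfs-site.xml and yarn-site.xml); one of each pair should report active and the other standby:

  # One NameNode should be active, the other standby
  hadoop@hadoop-spark01:~$ hdfs haadmin -getServiceState nn1
  hadoop@hadoop-spark01:~$ hdfs haadmin -getServiceState nn2
  # Likewise for the two ResourceManagers
  hadoop@hadoop-spark01:~$ yarn rmadmin -getServiceState rm1
  hadoop@hadoop-spark01:~$ yarn rmadmin -getServiceState rm2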
3.4.4 Verify That the DataNodes Started Correctly

The key thing to check is that Live datanodes is not 0.

  hadoop@hadoop-spark01:~$ hdfs dfsadmin -report
  Configured Capacity: 156133490688 (145.41 GB)
  Present Capacity: 107745738752 (100.35 GB)
  DFS Remaining: 107745652736 (100.35 GB)
  DFS Used: 86016 (84 KB)
  DFS Used%: 0.00%
  Replicated Blocks:
  Under replicated blocks: 0
  Blocks with corrupt replicas: 0
  Missing blocks: 0
  Missing blocks (with replication factor 1): 0
  Low redundancy blocks with highest priority to recover: 0
  Pending deletion blocks: 0
  Erasure Coded Block Groups:
  Low redundancy block groups: 0
  Block groups with corrupt internal blocks: 0
  Missing block groups: 0
  Low redundancy blocks with highest priority to recover: 0
  Pending deletion blocks: 0

  -------------------------------------------------
  Live datanodes (3):

  Name: 192.168.200.43:9866 (hadoop-spark01)
  Hostname: hadoop-spark01
  Decommission Status : Normal
  Configured Capacity: 52044496896 (48.47 GB)
  DFS Used: 28672 (28 KB)
  Non DFS Used: 13456187392 (12.53 GB)
  DFS Remaining: 35914149888 (33.45 GB)
  DFS Used%: 0.00%
  DFS Remaining%: 69.01%
  Configured Cache Capacity: 0 (0 B)
  Cache Used: 0 (0 B)
  Cache Remaining: 0 (0 B)
  Cache Used%: 100.00%
  Cache Remaining%: 0.00%
  Xceivers: 1
  Last contact: Fri Jun 17 14:13:41 CST 2022
  Last Block Report: Fri Jun 17 13:57:02 CST 2022
  Num of Blocks: 0

  Name: 192.168.200.44:9866 (hadoop-spark02)
  Hostname: hadoop-spark02
  Decommission Status : Normal
  Configured Capacity: 52044496896 (48.47 GB)
  DFS Used: 28672 (28 KB)
  Non DFS Used: 13455249408 (12.53 GB)
  DFS Remaining: 35915087872 (33.45 GB)
  DFS Used%: 0.00%
  DFS Remaining%: 69.01%
  Configured Cache Capacity: 0 (0 B)
  Cache Used: 0 (0 B)
  Cache Remaining: 0 (0 B)
  Cache Used%: 100.00%
  Cache Remaining%: 0.00%
  Xceivers: 1
  Last contact: Fri Jun 17 14:13:41 CST 2022
  Last Block Report: Fri Jun 17 13:57:02 CST 2022
  Num of Blocks: 0

  Name: 192.168.200.45:9866 (hadoop-spark03)
  Hostname: hadoop-spark03
  Decommission Status : Normal
  Configured Capacity: 52044496896 (48.47 GB)
  DFS Used: 28672 (28 KB)
  Non DFS Used: 13453922304 (12.53 GB)
  DFS Remaining: 35916414976 (33.45 GB)
  DFS Used%: 0.00%
  DFS Remaining%: 69.01%
  Configured Cache Capacity: 0 (0 B)
  Cache Used: 0 (0 B)
  Cache Remaining: 0 (0 B)
  Cache Used%: 100.00%
  Cache Remaining%: 0.00%
  Xceivers: 1
  Last contact: Fri Jun 17 14:13:41 CST 2022
  Last Block Report: Fri Jun 17 13:57:02 CST 2022
  Num of Blocks: 0
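
As a further smoke test of HDFS itself, you can write a small file and read it back (a sketch using standard hdfs dfs commands; the /tmp/smoke path is just an example):

  # Write a file to HDFS and read it back
  hadoop@hadoop-spark01:~$ echo "hello hdfs" > /tmp/hello.txt
  hadoop@hadoop-spark01:~$ hdfs dfs -mkdir -p /tmp/smoke
  hadoop@hadoop-spark01:~$ hdfs dfs -put -f /tmp/hello.txt /tmp/smoke/
  hadoop@hadoop-spark01:~$ hdfs dfs -cat /tmp/smoke/hello.txt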

You can also open http://hadoop-spark01:50070/ in a browser on the Linux machine to check the status of the NameNodes and DataNodes through the web UI. If anything fails, check the startup logs to find the cause.


At this point, the Hadoop cluster setup is complete.

4. Install Scala (evaluate whether this is actually needed)

4.1 Install Scala

  hadoop@hadoop-spark01:~$ sudo tar xf scala-2.12.16.tgz -C /usr/local/
  hadoop@hadoop-spark01:~$ sudo mv /usr/local/scala-2.12.16 /usr/local/scala
  hadoop@hadoop-spark01:~$ sudo chown -R hadoop:hadoop /usr/local/scala/

4.2 Configure Environment Variables

  hadoop@hadoop-spark01:~$ vim ~/.bashrc
  hadoop@hadoop-spark01:~$ tail -3 ~/.bashrc
  # scala
  export SCALA_HOME=/usr/local/scala
  export PATH=$PATH:$SCALA_HOME/bin
  # Make the environment variables take effect immediately
  hadoop@hadoop-spark01:~$ source ~/.bashrc

4.3 Check That the Installed Scala Version Is Correct

  hadoop@hadoop-spark01:~$ scala -version
  Scala code runner version 2.12.16 -- Copyright 2002-2022, LAMP/EPFL and Lightbend, Inc.

4.4 Test That Scala Works

  hadoop@hadoop-spark01:~$ scala
  Welcome to Scala 2.12.16 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_162).
  Type in expressions for evaluation. Or try :help.
  scala> :quit

5. Spark Cluster Setup (run on all three nodes)

5.1 Install Spark (Spark on YARN)

  hadoop@hadoop-spark01:~$ sudo tar xf spark-3.2.0-bin-hadoop3.2.tgz -C /usr/local/
  hadoop@hadoop-spark01:~$ sudo mv /usr/local/spark-3.2.0-bin-hadoop3.2 /usr/local/spark
  hadoop@hadoop-spark01:~$ sudo chown -R hadoop:hadoop /usr/local/spark/

5.2 Edit the Spark Configuration Files

5.2.1 Edit spark-env.sh
  hadoop@hadoop-spark01:~$ cd /usr/local/spark/conf/
  hadoop@hadoop-spark01:/usr/local/spark/conf$ cp spark-env.sh.template spark-env.sh

  hadoop@hadoop-spark01:/usr/local/spark/conf$ vim spark-env.sh
  hadoop@hadoop-spark01:/usr/local/spark/conf$ tail -22 spark-env.sh
  # JDK
  export JAVA_HOME=/usr/local/jdk
  # Hadoop directory
  export HADOOP_HOME=/usr/local/hadoop
  # Hadoop configuration directory
  export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
  # YARN configuration directory
  export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
  # Spark directory
  export SPARK_HOME=/usr/local/spark
  # Spark executable directory
  export PATH=$SPARK_HOME/bin:$PATH
  # Master node
  export SPARK_MASTER_HOST=hadoop-spark01
  # Job submission port
  export SPARK_MASTER_PORT=7077
  # Memory available to each worker
  export SPARK_WORKER_MEMORY=1g
  # Cores available to each worker
  export SPARK_WORKER_CORES=1
  # Port of the Spark master web UI
  export SPARK_MASTER_WEBUI_PORT=8089
5.2.2 Edit the workers File
  hadoop@hadoop-spark01:/usr/local/spark/conf$ cp workers.template workers
  hadoop@hadoop-spark01:/usr/local/spark/conf$ vim workers
  hadoop@hadoop-spark01:/usr/local/spark/conf$ tail -3 workers
  #localhost
  hadoop-spark02
  hadoop-spark03

5.3 Configure Spark Environment Variables

  hadoop@hadoop-spark01:/usr/local/spark/conf$ vim ~/.bashrc
  hadoop@hadoop-spark01:/usr/local/spark/conf$ tail -3 ~/.bashrc
  # spark
  export SPARK_HOME=/usr/local/spark
  export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
  hadoop@hadoop-spark01:/usr/local/spark/conf$ source ~/.bashrc

5.4 Start Spark

5.4.1 Rename the Startup Scripts
  # Spark's start/stop script names clash with the Hadoop cluster's, so rename them
  hadoop@hadoop-spark01:/usr/local/spark/conf$ cd /usr/local/spark/sbin/
  hadoop@hadoop-spark01:/usr/local/spark/sbin$ mv start-all.sh start-all-spark.sh
  hadoop@hadoop-spark01:/usr/local/spark/sbin$ mv stop-all.sh stop-all-spark.sh
5.4.2 Start Spark (run on the Master node)
  # Start the Spark cluster
  hadoop@hadoop-spark01:/usr/local/spark/sbin$ start-all-spark.sh
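
Before checking the web UI, you can confirm that the standalone daemons came up: a Master process on hadoop-spark01 and a Worker on each node listed in the workers file (a small sketch using jps over SSH):

  # Expect Master on hadoop-spark01 and Worker on hadoop-spark02/03
  for host in hadoop-spark01 hadoop-spark02 hadoop-spark03; do
      echo "--- $host ---"
      ssh "$host" "jps | egrep 'Master|Worker'"
  done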
5.4.3 Check the Web UI

http://hadoop-spark01:8089/

6. Test Whether the Spark and Hadoop Clusters Are Configured Correctly

6.1 Create a test.json File

  hadoop@hadoop-spark01:~$ vim test.json
  hadoop@hadoop-spark01:~$ cat test.json
  {"DEST_COUNTRY_NAME":"United States","ORIGIN_COUNTRY_NAME":"Romania","count":1}

6.2 Open the Spark Shell

  hadoop@hadoop-spark01:~$ spark-shell
  Setting default log level to "WARN".
  To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
  2022-06-20 16:34:08,277 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
  Spark context Web UI available at http://hadoop-spark01:4040
  Spark context available as 'sc' (master = local[*], app id = local-1655714049138).
  Spark session available as 'spark'.
  Welcome to
        ____              __
       / __/__  ___ _____/ /__
      _\ \/ _ \/ _ `/ __/  '_/
     /___/ .__/\_,_/_/ /_/\_\   version 3.2.0
        /_/

  Using Scala version 2.12.15 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_162)
  Type in expressions to have them evaluated.
  Type :help for more information.

  scala> val testDF = spark.read.json("file:///home/hadoop/test.json")
  testDF: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]

  scala> testDF.write.format("parquet").save("/spark-dir/parquet/test")

  scala>

The spark-shell session should look like the output above. Now open the Hadoop web UI and browse to the output path; if Parquet files have been written under /spark-dir/parquet/test, the Spark and Hadoop clusters are working together correctly.
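
As an additional end-to-end check of Spark on YARN, the SparkPi example bundled with the distribution can be submitted to the cluster (a sketch only; the examples jar file name may differ slightly in your download, so adjust the path if needed):

  # Submit the bundled SparkPi example to YARN; look for "Pi is roughly ..." in the output
  hadoop@hadoop-spark01:~$ spark-submit \
      --master yarn \
      --class org.apache.spark.examples.SparkPi \
      $SPARK_HOME/examples/jars/spark-examples_2.12-3.2.0.jar 100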
