@yangwenbo 2022-07-08T11:49:32.000000Z

Tsinghua University - FIB Lab

Spark Cluster Setup (Test)

1. Preliminary Preparation (perform on all four nodes)

1.1 Environment Details

Name     IP              Role         JDK  Hadoop  ZooKeeper  Spark  OS
spark01  192.168.200.43  Master node  1.8  3.2.0   3.7.0      3.2.0  Ubuntu 22.04
spark02  192.168.200.44  Slave1 node  1.8  3.2.0   3.7.0      3.2.0  Ubuntu 22.04
spark03  192.168.200.45  Slave2 node  1.8  3.2.0   3.7.0      3.2.0  Ubuntu 22.04
spark04  192.168.200.46  Slave3 node  1.8  3.2.0   /          3.2.0  Ubuntu 22.04

1.2 Add Hostname Mappings

  #Add hostname mappings
  root@spark01:~# vim /etc/hosts
  root@spark01:~# tail -4 /etc/hosts
  192.168.200.43 spark01
  192.168.200.44 spark02
  192.168.200.45 spark03
  192.168.200.46 spark04
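
With the mappings in place on every node, a quick round of pings is an easy way to confirm that each hostname resolves; for example:

  #Each hostname should answer a single ping
  root@spark01:~# for h in spark01 spark02 spark03 spark04; do ping -c 1 $h >/dev/null && echo "$h OK"; done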

1.3 Create the hadoop User

  #Create the hadoop user
  root@spark01:~# adduser hadoop --home /home/hadoop
  #Grant the new hadoop user administrator privileges
  root@spark01:~# adduser hadoop sudo

1.4 Mount the Disks

  #Format the disks
  root@spark01:~# mkfs.ext4 /dev/sdb
  root@spark01:~# mkfs.ext4 /dev/sdc
  #Create the mount points
  root@spark01:~# mkdir /data1
  root@spark01:~# mkdir /data2
  #Mount the disks
  root@spark01:~# mount /dev/sdb /data1/
  root@spark01:~# mount /dev/sdc /data2/
  #Change the owner and group of the directories
  root@spark01:~# chown hadoop:hadoop /data1
  root@spark01:~# chown hadoop:hadoop /data2

  #Add the mounts to /etc/fstab so they persist across reboots
  root@spark01:~# vim /etc/fstab
  root@spark01:~# tail -2 /etc/fstab
  /dev/sdb /data1 ext4 defaults 0 0
  /dev/sdc /data2 ext4 defaults 0 0
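
As an optional sanity check, mount -a re-reads /etc/fstab (so a typo in the new entries shows up immediately) and df -h confirms that both data disks are mounted:

  #Apply /etc/fstab and confirm the data disks are mounted
  root@spark01:~# mount -a
  root@spark01:~# df -h | grep data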

1.5 Install SSH and Configure Passwordless SSH Login

Both cluster and single-node modes rely on SSH logins (similar to remote login: you log in to a Linux host and run commands on it). Ubuntu ships with the SSH client by default, but the SSH server still needs to be installed:

  #Install openssh-server
  hadoop@spark01:~$ sudo apt-get -y install openssh-server

  #Generate a key pair
  hadoop@spark01:~$ ssh-keygen -t rsa
  #Distribute the public key
  hadoop@spark01:~$ ssh-copy-id 192.168.200.43
  hadoop@spark01:~$ ssh-copy-id 192.168.200.44
  hadoop@spark01:~$ ssh-copy-id 192.168.200.45
  hadoop@spark01:~$ ssh-copy-id 192.168.200.46
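
If the key was distributed successfully, logging in to the other nodes should no longer prompt for a password; each of the following commands should print the remote hostname straight away:

  #Verify passwordless login to every node
  hadoop@spark01:~$ ssh spark02 hostname
  hadoop@spark01:~$ ssh spark03 hostname
  hadoop@spark01:~$ ssh spark04 hostname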

1.6 Download the Installation Packages

Go to the Spark official website.

Take care to choose the right package here: Spark is normally used together with Hadoop, so you need the build that bundles the Hadoop dependencies.

However, this page only lists the latest Spark release. Picking the very latest version is generally not recommended; step back one or two minor versions from it instead.

At this point you need the Spark package that matches a Hadoop version. Only Hadoop 2.7 and 3.2 builds are offered, so choose the higher one, i.e. the Spark package built for Hadoop 3.2.

The Hadoop version must match Spark as well, so go to the Hadoop official website and choose Hadoop 3.2.

ZooKeeper is used here for high availability; go to the ZooKeeper official website and download a release.

2. Install the Java Environment (perform on all four nodes)

  #Unpack the JDK
  hadoop@spark01:~$ sudo tar xf jdk-8u162-linux-x64.tar.gz -C /usr/local/
  hadoop@spark01:~$ sudo mv /usr/local/jdk1.8.0_162 /usr/local/jdk

  #Add the environment variables
  hadoop@spark01:~$ vim ~/.bashrc
  hadoop@spark01:~$ tail -4 ~/.bashrc
  #jdk
  export JAVA_HOME=/usr/local/jdk
  export PATH=$JAVA_HOME/bin:$JAVA_HOME/jre/bin:$PATH
  export CLASSPATH=.:$CLASSPATH:$JAVA_HOME/lib:$JAVA_HOME/lib/tools.jar
  #Make the environment variables take effect immediately
  hadoop@spark01:~$ source ~/.bashrc

  #Check whether Java was installed successfully
  hadoop@spark01:~$ java -version
  java version "1.8.0_162"
  Java(TM) SE Runtime Environment (build 1.8.0_162-b12)
  Java HotSpot(TM) 64-Bit Server VM (build 25.162-b12, mixed mode)

3. Install the ZooKeeper Cluster (perform on three nodes)

3.1 Install ZooKeeper

  #Unpack ZooKeeper
  hadoop@spark01:~$ sudo tar xf apache-zookeeper-3.7.0-bin.tar.gz -C /usr/local/
  hadoop@spark01:~$ sudo mv /usr/local/apache-zookeeper-3.7.0-bin /usr/local/zookeeper
  hadoop@spark01:~$ sudo chown -R hadoop:hadoop /usr/local/zookeeper

3.2 Create the ZooKeeper Data and Log Directories

  #Create the ZooKeeper data directory and log directory
  hadoop@spark01:~$ mkdir -p /usr/local/zookeeper/data
  hadoop@spark01:~$ mkdir -p /usr/local/zookeeper/logs

3.3 Modify the ZooKeeper Configuration File

  #Copy zoo_sample.cfg to zoo.cfg; zoo.cfg is the file that is actually used
  hadoop@spark01:~$ cd /usr/local/zookeeper/conf/
  hadoop@spark01:/usr/local/zookeeper/conf$ cat zoo_sample.cfg | egrep -v "^$|^#" >zoo.cfg
  hadoop@spark01:/usr/local/zookeeper/conf$ vim zoo.cfg
  hadoop@spark01:/usr/local/zookeeper/conf$ cat zoo.cfg
  #Heartbeat interval between the ZooKeeper server and clients, in milliseconds
  tickTime=2000
  #Leader/follower initial connection time limit
  initLimit=10
  #Leader/follower sync time limit
  syncLimit=5
  #Data directory
  dataDir=/usr/local/zookeeper/data
  #Log directory
  dataLogDir=/usr/local/zookeeper/logs
  #Enable the four-letter commands (e.g. version queries)
  4lw.commands.whitelist=*
  #Client connection port
  clientPort=2181
  #Cluster configuration
  server.1=192.168.200.43:2888:3888
  server.2=192.168.200.44:2888:3888
  server.3=192.168.200.45:2888:3888

  # [Important!] The myid differs on each machine: the node listed as server.2 must echo 2
  hadoop@spark01:/usr/local/zookeeper/conf$ echo 1 > /usr/local/zookeeper/data/myid
  hadoop@spark02:/usr/local/zookeeper/conf$ echo 2 > /usr/local/zookeeper/data/myid
  hadoop@spark03:/usr/local/zookeeper/conf$ echo 3 > /usr/local/zookeeper/data/myid

3.4 Add the Environment Variable

  hadoop@spark01:/usr/local/zookeeper/conf$ cd /usr/local/zookeeper/bin/
  hadoop@spark01:/usr/local/zookeeper/bin$ vim zkEnv.sh
  hadoop@spark01:/usr/local/zookeeper/bin$ sed -n "32p" zkEnv.sh
  export JAVA_HOME=/usr/local/jdk

3.5 Configure a Start/Stop Script

  hadoop@spark01:/usr/local/zookeeper/bin$ cd /etc/init.d/
  hadoop@spark01:/etc/init.d$ sudo vim zookeeper
  hadoop@spark01:/etc/init.d$ cat zookeeper
  #!/bin/bash
  #chkconfig:2345 20 90
  #description:zookeeper
  #processname:zookeeper
  export JAVA_HOME=/usr/local/jdk
  case $1 in
  start) sudo /usr/local/zookeeper/bin/zkServer.sh start;;
  stop) sudo /usr/local/zookeeper/bin/zkServer.sh stop;;
  status) sudo /usr/local/zookeeper/bin/zkServer.sh status;;
  restart) sudo /usr/local/zookeeper/bin/zkServer.sh restart;;
  *) echo "require start|stop|status|restart" ;;
  esac

  #Make the script executable
  hadoop@spark01:/etc/init.d$ sudo chmod +x zookeeper

3.6 Start ZooKeeper

  #Start ZooKeeper
  hadoop@spark01:/etc/init.d$ service zookeeper start
  #Check the listening port
  hadoop@spark01:~$ netstat -anp | grep 2181
  tcp6 0 0 :::2181 :::* LISTEN
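
Once all three nodes are up, zkServer.sh status shows each node's role: one node should report leader and the other two follower (with the four-letter commands whitelisted above, echo stat | nc localhost 2181 gives similar information):

  #Check the role of each ZooKeeper node
  hadoop@spark01:~$ /usr/local/zookeeper/bin/zkServer.sh status
  hadoop@spark02:~$ /usr/local/zookeeper/bin/zkServer.sh status
  hadoop@spark03:~$ /usr/local/zookeeper/bin/zkServer.sh status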

4. HDFS HA Cluster Setup (perform on all four nodes)

4.1 Install Hadoop 3.2.0

  #Install Hadoop
  hadoop@spark01:~$ sudo tar xf hadoop-3.2.0.tar.gz -C /usr/local/
  hadoop@spark01:~$ sudo mv /usr/local/hadoop-3.2.0 /usr/local/hadoop
  hadoop@spark01:~$ sudo chown -R hadoop:hadoop /usr/local/hadoop
  hadoop@spark01:~$ sudo chmod -R g+w /usr/local/hadoop/

  #Verify that Hadoop is usable
  hadoop@spark01:~$ /usr/local/hadoop/bin/hadoop version
  Hadoop 3.2.0
  Source code repository https://github.com/apache/hadoop.git -r e97acb3bd8f3befd27418996fa5d4b50bf2e17bf
  Compiled by sunilg on 2019-01-08T06:08Z
  Compiled with protoc 2.5.0
  From source with checksum d3f0795ed0d9dc378e2c785d3668f39
  This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-3.2.0.jar

4.2 Configure the PATH Variable

  #Add the environment variable
  hadoop@spark01:~$ vim ~/.bashrc
  hadoop@spark01:~$ tail -2 ~/.bashrc
  #hadoop
  export PATH=$PATH:/usr/local/hadoop/bin:/usr/local/hadoop/sbin
  #Make the environment variable take effect immediately
  hadoop@spark01:~$ source ~/.bashrc

4.3 Configure the HA Cluster

For the cluster/distributed mode, the configuration files under /usr/local/hadoop/etc/hadoop need to be modified. Only the settings required for a normal startup are covered here, spread across five files: workers, core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml.

4.3.1 Modify the workers File
  hadoop@spark01:~$ cd /usr/local/hadoop/etc/hadoop/
  hadoop@spark01:/usr/local/hadoop/etc/hadoop$ vim workers
  hadoop@spark01:/usr/local/hadoop/etc/hadoop$ cat workers
  spark01
  spark02
  spark03
  spark04
4.3.2 Modify core-site.xml

Modify core-site.xml to the following content:

  hadoop@spark01:/usr/local/hadoop/etc/hadoop$ nano core-site.xml
  hadoop@spark01:/usr/local/hadoop/etc/hadoop$ cat core-site.xml
  ......
  <configuration>
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://ns1</value>
    </property>
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/usr/local/hadoop/tmp</value>
    </property>
    <property>
      <name>hadoop.http.staticuser.user</name>
      <value>hadoop</value>
    </property>
    <property>
      <name>ha.zookeeper.quorum</name>
      <value>spark01:2181,spark02:2181,spark03:2181</value>
    </property>
    <property>
      <name>hadoop.proxyuser.hadoop.hosts</name>
      <value>*</value>
    </property>
    <property>
      <name>hadoop.proxyuser.hadoop.groups</name>
      <value>*</value>
    </property>
    <!-- Default umask for new files and directories -->
    <property>
      <name>fs.permissions.umask-mode</name>
      <value>002</value>
    </property>
  </configuration>
4.3.3 Modify hdfs-site.xml
  hadoop@spark01:/usr/local/hadoop/etc/hadoop$ nano hdfs-site.xml
  hadoop@spark01:/usr/local/hadoop/etc/hadoop$ cat hdfs-site.xml
  ......
  <configuration>
    <property>
      <name>dfs.ha.automatic-failover.enabled.ns</name>
      <value>true</value>
    </property>
    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>
    <property>
      <name>dfs.permissions.enabled</name>
      <value>false</value>
    </property>
    <property>
      <name>dfs.nameservices</name>
      <value>ns1</value>
    </property>
    <property>
      <name>dfs.blocksize</name>
      <value>134217728</value>
    </property>
    <property>
      <name>dfs.ha.namenodes.ns1</name>
      <value>nn1,nn2</value>
    </property>
    <!-- RPC address of nn1, i.e. where nn1 runs -->
    <property>
      <name>dfs.namenode.rpc-address.ns1.nn1</name>
      <value>spark01:9000</value>
    </property>
    <!-- HTTP address of nn1, for external access -->
    <property>
      <name>dfs.namenode.http-address.ns1.nn1</name>
      <value>spark01:50070</value>
    </property>
    <!-- RPC address of nn2, i.e. where nn2 runs -->
    <property>
      <name>dfs.namenode.rpc-address.ns1.nn2</name>
      <value>spark02:9000</value>
    </property>
    <!-- HTTP address of nn2, for external access -->
    <property>
      <name>dfs.namenode.http-address.ns1.nn2</name>
      <value>spark02:50070</value>
    </property>
    <!-- Where the NameNode metadata is stored on the JournalNodes (usually co-located with ZooKeeper) -->
    <property>
      <name>dfs.namenode.shared.edits.dir</name>
      <value>qjournal://spark01:8485;spark02:8485/ns1</value>
    </property>
    <!-- Where the JournalNodes store their data on local disk -->
    <property>
      <name>dfs.journalnode.edits.dir</name>
      <value>/usr/local/hadoop/journaldata</value>
    </property>
    <!-- Enable automatic NameNode failover -->
    <property>
      <name>dfs.ha.automatic-failover.enabled.ns1</name>
      <value>true</value>
    </property>
    <!-- Proxy class that HDFS clients use to reach the NameNode and determine which node is active -->
    <property>
      <name>dfs.client.failover.proxy.provider.ns1</name>
      <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
    </property>
    <!-- Fencing methods used during failover; several options exist (see the official docs). sshfence logs in to the old active NameNode remotely and kills it -->
    <property>
      <name>dfs.ha.fencing.methods</name>
      <value>
      sshfence
      shell(/bin/true)
      </value>
    </property>
    <!-- SSH key for passwordless login; only needed when the sshfence mechanism is used -->
    <property>
      <name>dfs.ha.fencing.ssh.private-key-files</name>
      <value>/home/hadoop/.ssh/id_rsa</value>
    </property>
    <!-- Timeout for the sshfence mechanism; can be omitted if failover is done with a script instead -->
    <property>
      <name>dfs.ha.fencing.ssh.connect-timeout</name>
      <value>30000</value>
    </property>
    <!-- Enable automatic failover; can be left out if you do not use it -->
    <property>
      <name>dfs.ha.automatic-failover.enabled</name>
      <value>true</value>
    </property>
    <!-- Directories for the HDFS filesystem metadata -->
    <property>
      <name>dfs.name.dir</name>
      <value>file:/data1/name,/data2/name</value>
    </property>
    <!-- Directories for the HDFS filesystem data -->
    <property>
      <name>dfs.data.dir</name>
      <value>file:/data1/data,/data2/data</value>
    </property>
    <property>
      <name>dfs.webhdfs.enabled</name>
      <value>true</value>
    </property>
    <!-- Disk space reserved for non-DFS use -->
    <property>
      <name>dfs.datanode.du.reserved</name>
      <value>1073741824</value>
    </property>
    <!-- Balance replica placement across the DataNode disks -->
    <property>
      <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
      <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
    </property>
    <!-- Whether to enable permission checking in HDFS -->
    <property>
      <name>dfs.permissions.enabled</name>
      <value>true</value>
    </property>
    <!-- Whether to enable ACLs in HDFS (default false) -->
    <property>
      <name>dfs.namenode.acls.enabled</name>
      <value>true</value>
    </property>
  </configuration>
4.3.4 Modify mapred-site.xml

Configure mapred-site.xml with the following content:

  hadoop@spark01:/usr/local/hadoop/etc/hadoop$ nano mapred-site.xml
  hadoop@spark01:/usr/local/hadoop/etc/hadoop$ cat mapred-site.xml
  ......
  <configuration>
    <!-- Tell Hadoop that MapReduce jobs run on YARN -->
    <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
    </property>
    <property>
      <name>yarn.application.classpath</name>
      <value>/usr/local/hadoop/etc/hadoop/mapreduce/*,/usr/local/hadoop/etc/hadoop/mapreduce/lib/*</value>
    </property>
  </configuration>
4.3.5 Modify yarn-site.xml

Configure yarn-site.xml with the following content:

  hadoop@spark01:/usr/local/hadoop/etc/hadoop$ nano yarn-site.xml
  hadoop@spark01:/usr/local/hadoop/etc/hadoop$ cat yarn-site.xml
  ......
  <configuration>
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
    </property>
    <!-- Site specific YARN configuration properties -->
    <!-- Enable ResourceManager HA (enabled by default) -->
    <property>
      <name>yarn.resourcemanager.ha.enabled</name>
      <value>true</value>
    </property>
    <!-- Cluster id for the ResourceManagers -->
    <property>
      <name>yarn.resourcemanager.cluster-id</name>
      <value>yrc</value>
    </property>
    <!-- Two ResourceManagers are used; list their ids -->
    <property>
      <name>yarn.resourcemanager.ha.rm-ids</name>
      <value>rm1,rm2</value>
    </property>
    <!-- Host of rm1 -->
    <property>
      <name>yarn.resourcemanager.hostname.rm1</name>
      <value>spark01</value>
    </property>
    <!-- Host of rm2 -->
    <property>
      <name>yarn.resourcemanager.hostname.rm2</name>
      <value>spark02</value>
    </property>
    <property>
      <name>yarn.application.classpath</name>
      <value>/usr/local/hadoop/etc/hadoop:/usr/local/hadoop/share/hadoop/common/lib/*:/usr/local/hadoop/share/hadoop/common/*:/usr/local/hadoop/share/hadoop/hdfs:/usr/local/hadoop/share/hadoop/hdfs/lib/*:/usr/local/hadoop/share/hadoop/hdfs/*:/usr/local/hadoop/share/hadoop/mapreduce/lib/*:/usr/local/hadoop/share/hadoop/mapreduce/*:/usr/local/hadoop/share/hadoop/yarn:/usr/local/hadoop/share/hadoop/yarn/lib/*:/usr/local/hadoop/share/hadoop/yarn/*</value>
    </property>
    <!-- Address of the ZooKeeper cluster -->
    <property>
      <name>yarn.resourcemanager.zk-address</name>
      <value>spark01:2181,spark02:2181,spark03:2181</value>
    </property>
    <!-- Enable automatic recovery: if an RM dies mid-job, recovery kicks in (default false) -->
    <property>
      <name>yarn.resourcemanager.recovery.enabled</name>
      <value>true</value>
    </property>
    <!-- Auxiliary service run by the NodeManagers (default mapreduce_shuffle) -->
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
    </property>
    <property>
      <name>yarn.resourcemanager.address.rm1</name>
      <value>spark01:8032</value>
    </property>
    <property>
      <name>yarn.resourcemanager.address.rm2</name>
      <value>spark02:8032</value>
    </property>
    <property>
      <name>yarn.resourcemanager.scheduler.address.rm2</name>
      <value>spark02:8030</value>
    </property>
    <property>
      <name>yarn.resourcemanager.webapp.address.rm2</name>
      <value>spark02:8088</value>
    </property>
    <property>
      <name>yarn.resourcemanager.resource-tracker.address.rm2</name>
      <value>spark02:8031</value>
    </property>
    <property>
      <name>yarn.resourcemanager.admin.address.rm2</name>
      <value>spark02:8033</value>
    </property>
    <property>
      <name>yarn.resourcemanager.ha.admin.address.rm2</name>
      <value>spark02:23142</value>
    </property>
    <property>
      <name>yarn.resourcemanager.scheduler.address.rm1</name>
      <value>spark01:8030</value>
    </property>
    <property>
      <name>yarn.resourcemanager.webapp.address.rm1</name>
      <value>spark01:8088</value>
    </property>
    <property>
      <name>yarn.resourcemanager.resource-tracker.address.rm1</name>
      <value>spark01:8031</value>
    </property>
    <property>
      <name>yarn.resourcemanager.admin.address.rm1</name>
      <value>spark01:8033</value>
    </property>
    <property>
      <name>yarn.resourcemanager.ha.admin.address.rm1</name>
      <value>spark01:23142</value>
    </property>
    <!-- Store the ResourceManager state in the ZooKeeper cluster (default is the FileSystem store) -->
    <property>
      <name>yarn.resourcemanager.store.class</name>
      <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
    </property>
    <property>
      <name>yarn.nodemanager.vmem-check-enabled</name>
      <value>false</value>
      <description>
      </description>
    </property>
    <!-- Maximum disk utilization percentage after which a disk is marked as bad -->
    <property>
      <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
      <value>95.0</value>
    </property>
    <!-- Memory available per node, in MB (default 8 GB) -->
    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>3072</value>
    </property>
    <!-- Minimum memory a single task can request (default 1024 MB) -->
    <property>
      <name>yarn.scheduler.minimum-allocation-mb</name>
      <value>1024</value>
    </property>
    <!-- Maximum memory a single task can request (default 8192 MB) -->
    <property>
      <name>yarn.scheduler.maximum-allocation-mb</name>
      <value>3072</value>
    </property>
    <!-- Number of CPU cores that can be allocated to containers -->
    <property>
      <name>yarn.nodemanager.resource.cpu-vcores</name>
      <value>4</value>
    </property>
    <!-- Minimum number of vcores a single task can request (default 1) -->
    <property>
      <name>yarn.scheduler.minimum-allocation-vcores</name>
      <value>1</value>
    </property>
    <!-- Maximum number of vcores a single task can request (default 4) -->
    <property>
      <name>yarn.scheduler.maximum-allocation-vcores</name>
      <value>4</value>
    </property>
  </configuration>
4.3.6 Configure the Hadoop JAVA Environment Variable
  hadoop@spark01:/usr/local/hadoop/etc/hadoop$ vim hadoop-env.sh
  hadoop@spark01:/usr/local/hadoop/etc/hadoop$ sed -n "54p" hadoop-env.sh
  export JAVA_HOME=/usr/local/jdk
4.3.7 Start the JournalNodes First
  hadoop@spark01:~$ hadoop-daemon.sh start journalnode
  hadoop@spark02:~$ hadoop-daemon.sh start journalnode
  hadoop@spark03:~$ hadoop-daemon.sh start journalnode
  hadoop@spark04:~$ hadoop-daemon.sh start journalnode
4.3.8 Format the NameNode on the Master node (run this only once; do not format the NameNode again when starting Hadoop later):
  #Format the NameNode
  hadoop@spark01:~$ hdfs namenode -format
  #Copy the formatted Hadoop directory to spark02 (the other NameNode)
  hadoop@spark01:~$ scp -r /usr/local/hadoop hadoop@spark02:/usr/local/
4.3.9 Format ZKFC on spark01
  hadoop@spark01:~$ hdfs zkfc -formatZK
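
If the format succeeded, ZooKeeper should now contain a /hadoop-ha znode for the ns1 nameservice; this can be confirmed from the ZooKeeper CLI as an optional check:

  #Check that the HA znode was created in ZooKeeper
  hadoop@spark01:~$ /usr/local/zookeeper/bin/zkCli.sh -server spark01:2181 ls /hadoop-ha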

4.4 Start Hadoop

4.4.1 Modify the Startup Scripts First

Startup is performed on the Master node. Run the following commands:

  #Add the following lines to start-dfs.sh and stop-dfs.sh
  hadoop@spark01:~$ cd /usr/local/hadoop/sbin/
  hadoop@spark01:/usr/local/hadoop/sbin$ vim start-dfs.sh
  hadoop@spark01:/usr/local/hadoop/sbin$ tail -6 start-dfs.sh
  HDFS_NAMENODE_USER=hadoop
  HDFS_DATANODE_USER=hadoop
  HDFS_DATANODE_SECURE_USER=hadoop
  HDFS_SECONDARYNAMENODE_USER=hadoop
  HDFS_JOURNALNODE_USER=hadoop
  HDFS_ZKFC_USER=hadoop
  hadoop@spark01:/usr/local/hadoop/sbin$ vim stop-dfs.sh
  hadoop@spark01:/usr/local/hadoop/sbin$ tail -6 stop-dfs.sh
  HDFS_NAMENODE_USER=hadoop
  HDFS_DATANODE_USER=hadoop
  HDFS_DATANODE_SECURE_USER=hadoop
  HDFS_SECONDARYNAMENODE_USER=hadoop
  HDFS_JOURNALNODE_USER=hadoop
  HDFS_ZKFC_USER=hadoop

  #Add the following lines to start-yarn.sh and stop-yarn.sh
  hadoop@spark01:/usr/local/hadoop/sbin$ vim start-yarn.sh
  hadoop@spark01:/usr/local/hadoop/sbin$ tail -3 start-yarn.sh
  YARN_RESOURCEMANAGER_USER=hadoop
  HADOOP_SECURE_DN_USER=hadoop
  YARN_NODEMANAGER_USER=hadoop
  hadoop@spark01:/usr/local/hadoop/sbin$ vim stop-yarn.sh
  hadoop@spark01:/usr/local/hadoop/sbin$ tail -3 stop-yarn.sh
  YARN_RESOURCEMANAGER_USER=hadoop
  HADOOP_SECURE_DN_USER=hadoop
  YARN_NODEMANAGER_USER=hadoop
4.4.2 Start Hadoop (run on the Master node)
  #Start Hadoop
  hadoop@spark01:/usr/local/hadoop/sbin$ start-all.sh
  #Start the web service for browsing job history
  hadoop@spark01:/usr/local/hadoop/sbin$ mr-jobhistory-daemon.sh start historyserver
4.4.3 Check jps
  #spark01
  hadoop@spark01:/usr/local/hadoop/sbin$ jps
  2832 DataNode
  2705 NameNode
  4118 Jps
  4055 JobHistoryServer
  3543 ResourceManager
  2361 JournalNode
  3675 NodeManager
  3211 DFSZKFailoverController
  #spark02
  hadoop@spark02:/usr/local/hadoop/sbin$ jps
  3024 NodeManager
  3176 Jps
  2506 NameNode
  2586 DataNode
  2826 DFSZKFailoverController
  2330 JournalNode
  2940 ResourceManager
  #spark03
  hadoop@spark03:/usr/local/hadoop/sbin$ jps
  2325 JournalNode
  2747 Jps
  2636 NodeManager
  2447 DataNode
  #spark04
  hadoop@spark04:/usr/local/hadoop/sbin$ jps
  2772 JournalNode
  3146 Jps
  3037 NodeManager
  2895 DataNode
4.4.4 Verify That the DataNodes Started Properly

The key point is that Live datanodes is not 0.

  hadoop@spark01:/usr/local/hadoop/sbin$ hdfs dfsadmin -report
  Configured Capacity: 207955689472 (193.67 GB)
  Present Capacity: 148957691904 (138.73 GB)
  DFS Remaining: 148957593600 (138.73 GB)
  DFS Used: 98304 (96 KB)
  DFS Used%: 0.00%
  Replicated Blocks:
  Under replicated blocks: 0
  Blocks with corrupt replicas: 0
  Missing blocks: 0
  Missing blocks (with replication factor 1): 0
  Low redundancy blocks with highest priority to recover: 0
  Pending deletion blocks: 0
  Erasure Coded Block Groups:
  Low redundancy block groups: 0
  Block groups with corrupt internal blocks: 0
  Missing block groups: 0
  Low redundancy blocks with highest priority to recover: 0
  Pending deletion blocks: 0
  -------------------------------------------------
  Live datanodes (4):

  Name: 192.168.200.46:9866 (spark01)
  Hostname: spark01
  Decommission Status : Normal
  Configured Capacity: 51988922368 (48.42 GB)
  DFS Used: 24576 (24 KB)
  Non DFS Used: 12092940288 (11.26 GB)
  DFS Remaining: 37221879808 (34.67 GB)
  DFS Used%: 0.00%
  DFS Remaining%: 71.60%
  Configured Cache Capacity: 0 (0 B)
  Cache Used: 0 (0 B)
  Cache Remaining: 0 (0 B)
  Cache Used%: 100.00%
  Cache Remaining%: 0.00%
  Xceivers: 1
  Last contact: Sun Jun 26 22:58:08 CST 2022
  Last Block Report: Sun Jun 26 22:54:49 CST 2022
  Num of Blocks: 0

  Name: 192.168.200.47:9866 (spark02)
  Hostname: spark02
  Decommission Status : Normal
  Configured Capacity: 51988922368 (48.42 GB)
  DFS Used: 24576 (24 KB)
  Non DFS Used: 12083855360 (11.25 GB)
  DFS Remaining: 37230964736 (34.67 GB)
  DFS Used%: 0.00%
  DFS Remaining%: 71.61%
  Configured Cache Capacity: 0 (0 B)
  Cache Used: 0 (0 B)
  Cache Remaining: 0 (0 B)
  Cache Used%: 100.00%
  Cache Remaining%: 0.00%
  Xceivers: 1
  Last contact: Sun Jun 26 22:58:07 CST 2022
  Last Block Report: Sun Jun 26 22:54:51 CST 2022
  Num of Blocks: 0

  Name: 192.168.200.48:9866 (spark03)
  Hostname: spark03
  Decommission Status : Normal
  Configured Capacity: 51988922368 (48.42 GB)
  DFS Used: 24576 (24 KB)
  Non DFS Used: 12107624448 (11.28 GB)
  DFS Remaining: 37207195648 (34.65 GB)
  DFS Used%: 0.00%
  DFS Remaining%: 71.57%
  Configured Cache Capacity: 0 (0 B)
  Cache Used: 0 (0 B)
  Cache Remaining: 0 (0 B)
  Cache Used%: 100.00%
  Cache Remaining%: 0.00%
  Xceivers: 1
  Last contact: Sun Jun 26 22:58:07 CST 2022
  Last Block Report: Sun Jun 26 22:54:43 CST 2022
  Num of Blocks: 0

  Name: 192.168.200.49:9866 (spark04)
  Hostname: spark04
  Decommission Status : Normal
  Configured Capacity: 51988922368 (48.42 GB)
  DFS Used: 24576 (24 KB)
  Non DFS Used: 12017266688 (11.19 GB)
  DFS Remaining: 37297553408 (34.74 GB)
  DFS Used%: 0.00%
  DFS Remaining%: 71.74%
  Configured Cache Capacity: 0 (0 B)
  Cache Used: 0 (0 B)
  Cache Remaining: 0 (0 B)
  Cache Used%: 100.00%
  Cache Remaining%: 0.00%
  Xceivers: 1
  Last contact: Sun Jun 26 22:58:07 CST 2022
  Last Block Report: Sun Jun 26 22:54:43 CST 2022
  Num of Blocks: 0
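
The HA state of the two NameNodes can also be checked directly; one should report active and the other standby:

  #Check which NameNode is active and which is standby
  hadoop@spark01:/usr/local/hadoop/sbin$ hdfs haadmin -getServiceState nn1
  hadoop@spark01:/usr/local/hadoop/sbin$ hdfs haadmin -getServiceState nn2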

You can also open http://spark01:50070/ in a browser on the Linux system and check the status of the NameNodes and DataNodes through the web page. If something did not start successfully, the startup logs are the place to look for the cause.


At this point the Hadoop cluster setup is complete.

5. Spark Cluster Setup (perform on all four nodes)

5.1 Install Spark (Spark on YARN)

  hadoop@spark01:~$ sudo tar xf spark-3.2.0-bin-hadoop3.2.tgz -C /usr/local/
  hadoop@spark01:~$ sudo mv /usr/local/spark-3.2.0-bin-hadoop3.2 /usr/local/spark
  hadoop@spark01:~$ sudo chown -R hadoop:hadoop /usr/local/spark/

5.2 Modify the Spark Configuration Files

5.2.1 Modify spark-env.sh
  hadoop@spark01:~$ cd /usr/local/spark/conf/
  hadoop@spark01:/usr/local/spark/conf$ cp spark-env.sh.template spark-env.sh

  hadoop@spark01:/usr/local/spark/conf$ vim spark-env.sh
  hadoop@spark01:/usr/local/spark/conf$ tail -22 spark-env.sh
  # JDK
  export JAVA_HOME=/usr/local/jdk
  # Hadoop directory
  export HADOOP_HOME=/usr/local/hadoop
  # Hadoop configuration directory
  export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
  # YARN configuration directory
  export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
  # Spark directory
  export SPARK_HOME=/usr/local/spark
  # Spark executable directory
  export PATH=$SPARK_HOME/bin:$PATH
  # Master node
  export SPARK_MASTER_HOST=spark01
  # Job submission port
  export SPARK_MASTER_PORT=7077
  # Port for the Spark master web UI
  export SPARK_MASTER_WEBUI_PORT=8089
5.2.2 Modify the workers File
  hadoop@spark01:/usr/local/spark/conf$ cp workers.template workers
  hadoop@spark01:/usr/local/spark/conf$ vim workers
  hadoop@spark01:/usr/local/spark/conf$ tail -4 workers
  spark01
  spark02
  spark03
  spark04

5.3 Configure the Spark Environment Variables

  hadoop@spark01:/usr/local/spark/conf$ vim ~/.bashrc
  hadoop@spark01:/usr/local/spark/conf$ tail -3 ~/.bashrc
  #spark
  export SPARK_HOME=/usr/local/spark
  export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
  hadoop@spark01:/usr/local/spark/conf$ source ~/.bashrc

5.4 Start Spark

5.4.1 Rename the Startup Scripts
  #Spark's start/stop script names clash with the Hadoop cluster's, so rename the Spark scripts
  hadoop@spark01:/usr/local/spark/conf$ cd /usr/local/spark/sbin/
  hadoop@spark01:/usr/local/spark/sbin$ mv start-all.sh start-all-spark.sh
  hadoop@spark01:/usr/local/spark/sbin$ mv stop-all.sh stop-all-spark.sh
5.4.2 Start Spark (run on the Master node)
  #Start the Spark cluster
  hadoop@spark01:/usr/local/spark/sbin$ start-all-spark.sh
5.4.3 Check the Web UI

http://spark01:8089/

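If the web UI is not reachable, jps is a quick way to confirm that the Spark daemons are running: spark01 should show a Master process, and every node should show a Worker:

  #Spark daemons should appear alongside the Hadoop processes
  hadoop@spark01:/usr/local/spark/sbin$ jps | egrep "Master|Worker"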

6. Test Whether the Spark and Hadoop Clusters Are Configured Correctly

6.1 Create the test.json File

  hadoop@spark01:~$ vim test.json
  hadoop@spark01:~$ cat test.json
  {"DEST_COUNTRY_NAME":"United States","ORIGIN_COUNTRY_NAME":"Romania","count":1}

6.2 Launch the Spark Shell

  hadoop@spark01:~$ spark-shell
  Setting default log level to "WARN".
  To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
  2022-06-27 11:39:33,573 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
  Spark context Web UI available at http://spark01:4040
  Spark context available as 'sc' (master = local[*], app id = local-1656301174643).
  Spark session available as 'spark'.
  Welcome to
        ____              __
       / __/__  ___ _____/ /__
      _\ \/ _ \/ _ `/ __/  '_/
     /___/ .__/\_,_/_/ /_/\_\   version 3.2.0
        /_/
  Using Scala version 2.12.15 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_162)
  Type in expressions to have them evaluated.
  Type :help for more information.
  scala>

  scala> val testDF = spark.read.json("file:///home/hadoop/test.json")
  testDF: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
  scala> testDF.write.format("parquet").save("/spark-dir/parquet/test")

The write should complete in the spark-shell without errors.

You can then open the Hadoop web UI and browse to /spark-dir/parquet/test: if the output files have been created there, the Spark setup was successful.
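
The same check can be done from the command line; a successful write leaves the Parquet part files and a _SUCCESS marker under the output path:

  #List the Parquet output written by the Spark job
  hadoop@spark01:~$ hdfs dfs -ls /spark-dir/parquet/test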

7. Create Hadoop/Spark Users

  #Create the users
  hadoop@spark01:~$ sudo adduser usera --home /home/usera
  hadoop@spark01:~$ sudo adduser userb --home /home/userb
  hadoop@spark01:~$ sudo adduser userc --home /home/userc
  #Add them to the hadoop group
  hadoop@spark01:~$ sudo gpasswd -a usera hadoop
  hadoop@spark01:~$ sudo gpasswd -a userb hadoop
  hadoop@spark01:~$ sudo gpasswd -a userc hadoop
  #Create per-user directories under /user on HDFS
  hadoop@spark01:~$ hdfs dfs -mkdir -p /user/usera
  hadoop@spark01:~$ hdfs dfs -mkdir -p /user/userb
  hadoop@spark01:~$ hdfs dfs -mkdir -p /user/userc
  #Grant ownership of the HDFS user directories
  hadoop@spark01:~$ hdfs dfs -chown -R usera:usera /user/usera
  hadoop@spark01:~$ hdfs dfs -chown -R userb:userb /user/userb
  hadoop@spark01:~$ hdfs dfs -chown -R userc:userc /user/userc
  hadoop@spark01:~$ hdfs dfs -chmod -R 770 /user/usera
  hadoop@spark01:~$ hdfs dfs -chmod -R 770 /user/userb
  hadoop@spark01:~$ hdfs dfs -chmod -R 770 /user/userc
  #Copy the environment-variable file to the new users
  hadoop@spark01:~$ sudo cp /home/hadoop/.bashrc /home/usera/
  hadoop@spark01:~$ sudo cp /home/hadoop/.bashrc /home/userb/
  hadoop@spark01:~$ sudo cp /home/hadoop/.bashrc /home/userc/
  #Make the environment variables take effect for each user
  hadoop@spark01:~$ su - usera
  usera@spark01:~$ source .bashrc
  hadoop@spark01:~$ su - userb
  userb@spark01:~$ source .bashrc
  hadoop@spark01:~$ su - userc
  userc@spark01:~$ source .bashrc
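
As an end-to-end check, each new user should be able to write into their own HDFS directory; for example, as usera:

  #As usera, write a small test file into /user/usera and list it
  usera@spark01:~$ echo "hello" > /tmp/usera-test.txt
  usera@spark01:~$ hdfs dfs -put /tmp/usera-test.txt /user/usera/
  usera@spark01:~$ hdfs dfs -ls /user/usera/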

8. Configure Start on Boot

8.1 Create the rc.local Symlink

  root@spark02:~# ln -fs /lib/systemd/system/rc-local.service /etc/systemd/system/rc-local.service

8.2 Modify the File Content

  #Append to the end of the file
  root@spark02:~# vim /etc/systemd/system/rc-local.service
  root@spark02:~# tail -3 /etc/systemd/system/rc-local.service
  [Install]
  WantedBy=multi-user.target
  Alias=rc-local.service

8.3 Create the /etc/rc.local File

  #Create the /etc/rc.local file
  root@spark02:~# touch /etc/rc.local
  #Make it executable
  root@spark02:~# chmod +x /etc/rc.local

8.4 Populate /etc/rc.local

  vim /etc/rc.local

  ##spark01
  #Start ZooKeeper
  sudo -u hadoop sh -c '/usr/local/zookeeper/bin/zkServer.sh start'
  #Start Hadoop
  sudo -u hadoop sh -c '/usr/local/hadoop/sbin/start-all.sh'
  #Start the web service for browsing job history
  sudo -u hadoop sh -c '/usr/local/hadoop/sbin/mr-jobhistory-daemon.sh start historyserver'
  #Start Spark
  sudo -u hadoop sh -c '/usr/local/spark/sbin/start-all-spark.sh'

  ##spark02
  #Start ZooKeeper
  sudo -u hadoop sh -c '/usr/local/zookeeper/bin/zkServer.sh start'

  ##spark03
  #Start ZooKeeper
  sudo -u hadoop sh -c '/usr/local/zookeeper/bin/zkServer.sh start'

  ##spark04
  #Nothing to start at boot on spark04: it runs no ZooKeeper, and its Hadoop/Spark daemons are started from the Master node
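
On Ubuntu 22.04 the rc-local unit usually still has to be enabled so that /etc/rc.local actually runs at boot; a minimal sketch of that step:

  #Enable and start the rc-local service, then check its status
  root@spark02:~# systemctl enable rc-local.service
  root@spark02:~# systemctl start rc-local.service
  root@spark02:~# systemctl status rc-local.service

Note that /etc/rc.local is executed as a shell script, so it usually needs to begin with a #!/bin/bash line before the commands above are added.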

9. Fixing pyspark Startup Errors

9.1 Error 1

  #Error message
  WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

  #Fix (no restart required)
  hadoop@spark01:~$ vim /usr/local/spark/conf/spark-env.sh
  hadoop@spark01:~$ tail -1 /usr/local/spark/conf/spark-env.sh
  export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native

9.2 Error 2

  #Error message
  WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.

  #Relevant note from the official documentation
  To make Spark runtime jars accessible from YARN side, you can specify spark.yarn.archive or spark.yarn.jars. For details please refer to Spark Properties. If neither spark.yarn.archive nor spark.yarn.jars is specified, Spark will create a zip file with all jars under $SPARK_HOME/jars and upload it to the distributed cache.

  #Fix (no restart required)
  #The local Spark jars need to be uploaded to HDFS
  hadoop@spark01:~$ hdfs dfs -mkdir -p /system/spark-jars
  hadoop@spark01:~$ hdfs dfs -put /usr/local/spark/jars/* /system/spark-jars/
  hadoop@spark01:~$ hdfs dfs -chmod -R 755 /system/spark-jars/
  #Modify the spark-defaults.conf configuration file
  hadoop@spark01:~$ cd /usr/local/spark/conf/
  hadoop@spark01:/usr/local/spark/conf$ cp spark-defaults.conf.template spark-defaults.conf
  hadoop@spark01:/usr/local/spark/conf$ vim spark-defaults.conf
  hadoop@spark01:/usr/local/spark/conf$ tail -1 spark-defaults.conf
  spark.yarn.jars hdfs://spark01:9000/system/spark-jars/*
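
With spark.yarn.jars pointing at HDFS, a short job submitted to YARN makes a good end-to-end test; the SparkPi example that ships with Spark can be used for this (the example-jar filename below assumes the spark-3.2.0-bin-hadoop3.2 build):

  #Submit the bundled SparkPi example to YARN as a smoke test
  hadoop@spark01:~$ spark-submit \
        --class org.apache.spark.examples.SparkPi \
        --master yarn \
        --deploy-mode cluster \
        /usr/local/spark/examples/jars/spark-examples_2.12-3.2.0.jar 100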