@yangwenbo
2022-06-20T17:02:25.000000Z
Tsinghua University FIB Lab
Hostname | IP | Role | JDK | Hadoop | ZooKeeper | Scala | Spark | OS |
---|---|---|---|---|---|---|---|---|
hadoop-spark01 | 192.168.200.43 | Master node | 1.8 | 3.2.0 | 3.7.0 | 2.12.16 | 3.2.0 | ubuntu20.04 |
hadoop-spark02 | 192.168.200.44 | Slave1 node | 1.8 | 3.2.0 | 3.7.0 | 2.12.16 | 3.2.0 | ubuntu20.04 |
hadoop-spark03 | 192.168.200.45 | Slave2 node | 1.8 | 3.2.0 | 3.7.0 | 2.12.16 | 3.2.0 | ubuntu20.04 |
#Add hostname mappings
root@hadoop-spark01:~# vim /etc/hosts
root@hadoop-spark01:~# tail -3 /etc/hosts
192.168.200.43 hadoop-spark01
192.168.200.44 hadoop-spark02
192.168.200.45 hadoop-spark03
#Create the hadoop user
root@hadoop-spark01:~# adduser hadoop --home /home/hadoop
#Grant the new hadoop user sudo (administrator) privileges
root@hadoop-spark01:~# adduser hadoop sudo
Both cluster mode and single-node mode rely on SSH login (similar to remote login: you log in to a Linux host and run commands on it). Ubuntu ships with the SSH client by default, but the SSH server still needs to be installed:
#Install openssh-server
hadoop@hadoop-spark01:~$ sudo apt-get -y install openssh-server
#Generate an SSH key pair
hadoop@hadoop-spark01:~$ ssh-keygen -t rsa
#Distribute the public key to every node
hadoop@hadoop-spark01:~$ ssh-copy-id 192.168.200.43
hadoop@hadoop-spark01:~$ ssh-copy-id 192.168.200.44
hadoop@hadoop-spark01:~$ ssh-copy-id 192.168.200.45
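After distributing the key, it is worth confirming that passwordless login works to every node (a quick check; the same key generation and distribution is normally repeated on hadoop-spark02 and hadoop-spark03 as well, since the HA fencing configured later also logs in over SSH):
#Each command should print the remote hostname without asking for a password
hadoop@hadoop-spark01:~$ ssh hadoop-spark01 hostname
hadoop@hadoop-spark01:~$ ssh hadoop-spark02 hostname
hadoop@hadoop-spark01:~$ ssh hadoop-spark03 hostname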
Go to the official Spark website to download Spark.
Take care to pick a suitable package: Spark is normally used together with Hadoop, so download a build that bundles the Hadoop dependencies.
The download page only offers the latest Spark release; it is usually better not to run the very latest version, so step back one or two minor versions from it.
You then need the Spark package built against the matching Hadoop version. Only Hadoop 2.7 and Hadoop 3.2 builds are offered, so we choose the Spark package built for the newer Hadoop 3.2.
The Hadoop version must match the Spark build as well, so download Hadoop 3.2 from the official Hadoop website.
ZooKeeper is used for high availability; download a release from the official ZooKeeper website.
Download Scala 2.12.16 from the official Scala website.
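For reference, the packages used in this guide can also be fetched from the command line roughly as follows (the URLs assume the Apache archive and the Lightbend download site; adjust them if you prefer another mirror; the Oracle JDK 8 archive usually has to be downloaded manually from Oracle's website):
hadoop@hadoop-spark01:~$ wget https://archive.apache.org/dist/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz
hadoop@hadoop-spark01:~$ wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz
hadoop@hadoop-spark01:~$ wget https://archive.apache.org/dist/zookeeper/zookeeper-3.7.0/apache-zookeeper-3.7.0-bin.tar.gz
hadoop@hadoop-spark01:~$ wget https://downloads.lightbend.com/scala/2.12.16/scala-2.12.16.tgz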
#Extract the JDK
hadoop@hadoop-spark01:~$ sudo tar xf jdk-8u162-linux-x64.tar.gz -C /usr/local/
hadoop@hadoop-spark01:~$ sudo mv /usr/local/jdk1.8.0_162 /usr/local/jdk
#Add environment variables
hadoop@hadoop-spark01:~$ vim ~/.bashrc
hadoop@hadoop-spark01:~$ tail -4 ~/.bashrc
#jdk
export JAVA_HOME=/usr/local/jdk
export PATH=$JAVA_HOME/bin:$JAVA_HOME/jre/bin:$PATH
export CLASSPATH=.:$CLASSPATH:$JAVA_HOME/lib:$JAVA_HOME/lib/tools.jar
#Make the environment variables take effect immediately
hadoop@hadoop-spark01:~$ source ~/.bashrc
#Check that Java was installed successfully
hadoop@hadoop-spark01:~$ java -version
java version "1.8.0_162"
Java(TM) SE Runtime Environment (build 1.8.0_162-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.162-b12, mixed mode)
#Extract ZooKeeper
hadoop@hadoop-spark01:~$ sudo tar xf apache-zookeeper-3.7.0-bin.tar.gz -C /usr/local/
hadoop@hadoop-spark01:~$ sudo mv /usr/local/apache-zookeeper-3.7.0-bin /usr/local/zookeeper
hadoop@hadoop-spark01:~$ sudo chown -R hadoop:hadoop /usr/local/zookeeper
#Create the ZooKeeper data and log directories
hadoop@hadoop-spark01:~$ mkdir -p /usr/local/zookeeper/data
hadoop@hadoop-spark01:~$ mkdir -p /usr/local/zookeeper/logs
#Copy zoo_sample.cfg to zoo.cfg; zoo.cfg is the file that is actually used
hadoop@hadoop-spark01:~$ cd /usr/local/zookeeper/conf/
hadoop@hadoop-spark01:/usr/local/zookeeper/conf$ cat zoo_sample.cfg | egrep -v "^$|^#" >zoo.cfg
hadoop@hadoop-spark01:/usr/local/zookeeper/conf$ vim zoo.cfg
hadoop@hadoop-spark01:/usr/local/zookeeper/conf$ cat zoo.cfg
#Heartbeat interval between ZooKeeper servers and clients, in milliseconds
tickTime=2000
#Leader/follower initial connection time limit (in ticks)
initLimit=10
#Leader/follower sync time limit (in ticks)
syncLimit=5
#Data directory
dataDir=/usr/local/zookeeper/data
#Transaction log directory
dataLogDir=/usr/local/zookeeper/logs
#Enable the four-letter-word commands (e.g. stat, ruok)
4lw.commands.whitelist=*
#Client connection port
clientPort=2181
#Cluster members (server.N=host:peerPort:electionPort)
server.1=192.168.200.43:2888:3888
server.2=192.168.200.44:2888:3888
server.3=192.168.200.45:2888:3888
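The same ZooKeeper installation (with this zoo.cfg) must also exist on hadoop-spark02 and hadoop-spark03 before their myid files can be written. Assuming the same directory layout on every node, one way to copy it over is shown below (this presumes the hadoop user can write to /usr/local on the targets; otherwise copy into the home directory first and sudo mv it):
hadoop@hadoop-spark01:/usr/local/zookeeper/conf$ scp -r /usr/local/zookeeper hadoop@hadoop-spark02:/usr/local/
hadoop@hadoop-spark01:/usr/local/zookeeper/conf$ scp -r /usr/local/zookeeper hadoop@hadoop-spark03:/usr/local/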
# [Note!] The myid value differs on every machine: the host configured as server.2 must echo 2, and so on
hadoop@hadoop-spark01:/usr/local/zookeeper/conf$ echo 1 > /usr/local/zookeeper/data/myid
hadoop@hadoop-spark02:/usr/local/zookeeper/conf$ echo 2 > /usr/local/zookeeper/data/myid
hadoop@hadoop-spark03:/usr/local/zookeeper/conf$ echo 3 > /usr/local/zookeeper/data/myid
hadoop@hadoop-spark01:/usr/local/zookeeper/conf$ cd /usr/local/zookeeper/bin/
hadoop@hadoop-spark01:/usr/local/zookeeper/bin$ vim zkEnv.sh
hadoop@hadoop-spark01:/usr/local/zookeeper/bin$ sed -n "32p" zkEnv.sh
export JAVA_HOME=/usr/local/jdk
hadoop@hadoop-spark01:/usr/local/zookeeper/bin$ cd /etc/init.d/
hadoop@hadoop-spark01:/etc/init.d$ sudo vim zookeeper
hadoop@hadoop-spark01:/etc/init.d$ cat zookeeper
#!/bin/bash
#chkconfig:2345 20 90
#description:zookeeper
#processname:zookeeper
export JAVA_HOME=/usr/local/jdk
case $1 in
start) sudo /usr/local/zookeeper/bin/zkServer.sh start;;
stop) sudo /usr/local/zookeeper/bin/zkServer.sh stop;;
status) sudo /usr/local/zookeeper/bin/zkServer.sh status;;
restart) sudo /usr/local/zookeeper/bin/zkServer.sh restart;;
*) echo "require start|stop|status|restart" ;;
esac
#Make the script executable
hadoop@hadoop-spark01:/etc/init.d$ sudo chmod +x zookeeper
#Start ZooKeeper
hadoop@hadoop-spark01:/etc/init.d$ service zookeeper start
#Check the listening port
hadoop@hadoop-spark01:/etc/init.d$ netstat -anp | grep 2181
tcp6 0 0 :::2181 :::* LISTEN 4702/java
tcp6 0 0 127.0.0.1:50654 127.0.0.1:2181 TIME_WAIT -
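Since zoo.cfg whitelists all four-letter-word commands, the ensemble can be checked quickly once ZooKeeper has been started on all three nodes (a suggested check; nc may need to be installed with apt first):
#One node should report Mode: leader and the other two Mode: follower
hadoop@hadoop-spark01:/etc/init.d$ service zookeeper status
#"ruok" should answer imok; "stat" prints connection and mode details
hadoop@hadoop-spark01:/etc/init.d$ echo ruok | nc 192.168.200.43 2181
hadoop@hadoop-spark01:/etc/init.d$ echo stat | nc 192.168.200.43 2181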
#Install Hadoop
hadoop@hadoop-spark01:~$ sudo tar xf hadoop-3.2.0.tar.gz -C /usr/local/
hadoop@hadoop-spark01:~$ sudo mv /usr/local/hadoop-3.2.0 /usr/local/hadoop
hadoop@hadoop-spark01:~$ sudo chown -R hadoop:hadoop /usr/local/hadoop
#Verify that Hadoop works
hadoop@hadoop-spark01:~$ /usr/local/hadoop/bin/hadoop version
Hadoop 3.2.0
Source code repository https://github.com/apache/hadoop.git -r e97acb3bd8f3befd27418996fa5d4b50bf2e17bf
Compiled by sunilg on 2019-01-08T06:08Z
Compiled with protoc 2.5.0
From source with checksum d3f0795ed0d9dc378e2c785d3668f39
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-3.2.0.jar
#Add environment variables
hadoop@hadoop-spark01:~$ vim ~/.bashrc
hadoop@hadoop-spark01:~$ tail -2 ~/.bashrc
#hadoop
export PATH=$PATH:/usr/local/hadoop/bin:/usr/local/hadoop/sbin
#Make the environment variables take effect immediately
hadoop@hadoop-spark01:~$ source ~/.bashrc
To configure cluster/distributed mode, the configuration files under the /usr/local/hadoop/etc/hadoop directory must be modified. Only the settings required for a normal startup are covered here, in five files: workers, core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml (plus JAVA_HOME in hadoop-env.sh).
hadoop@hadoop-spark01:~$ cd /usr/local/hadoop/etc/hadoop/
hadoop@hadoop-spark01:/usr/local/hadoop/etc/hadoop$ vim workers
hadoop@hadoop-spark01:/usr/local/hadoop/etc/hadoop$ cat workers
hadoop-spark01
hadoop-spark02
hadoop-spark03
Modify core-site.xml as follows:
hadoop@hadoop-spark01:/usr/local/hadoop/etc/hadoop$ nano core-site.xml
hadoop@hadoop-spark01:/usr/local/hadoop/etc/hadoop$ cat core-site.xml
......
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://ns1</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/tmp</value>
</property>
<property>
<name>hadoop.http.staticuser.user</name>
<value>hadoop</value>
</property>
<property>
<name>ha.zookeeper.quorum</name>
<value>hadoop-spark01:2181,hadoop-spark02:2181,hadoop-spark03:2181</value>
</property>
<property>
<name>hadoop.proxyuser.hadoop.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hadoop.groups</name>
<value>*</value>
</property>
</configuration>
hadoop@hadoop-spark01:/usr/local/hadoop/etc/hadoop$ nano hdfs-site.xml
hadoop@hadoop-spark01:/usr/local/hadoop/etc/hadoop$ cat hdfs-site.xml
......
<configuration>
<property>
<name>dfs.ha.automatic-failover.enabled.ns</name>
<value>true</value>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.permissions.enabled</name>
<value>false</value>
</property>
<property>
<name>dfs.nameservices</name>
<value>ns1</value>
</property>
<property>
<name>dfs.blocksize</name>
<value>134217728</value>
</property>
<property>
<name>dfs.ha.namenodes.ns1</name>
<value>nn1,nn2</value>
</property>
<!-- RPC address of nn1 (the host where nn1 runs) -->
<property>
<name>dfs.namenode.rpc-address.ns1.nn1</name>
<value>hadoop-spark01:9000</value>
</property>
<!-- HTTP address of nn1 (for external web access) -->
<property>
<name>dfs.namenode.http-address.ns1.nn1</name>
<value>hadoop-spark01:50070</value>
</property>
<!-- RPC address of nn2 (the host where nn2 runs) -->
<property>
<name>dfs.namenode.rpc-address.ns1.nn2</name>
<value>hadoop-spark02:9000</value>
</property>
<!-- HTTP address of nn2 (for external web access) -->
<property>
<name>dfs.namenode.http-address.ns1.nn2</name>
<value>hadoop-spark02:50070</value>
</property>
<!-- Where the NameNode edit log is stored on the JournalNodes (usually co-located with ZooKeeper) -->
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://hadoop-spark01:8485;hadoop-spark02:8485;hadoop-spark03:8485/ns1</value>
</property>
<!-- Local directory where the JournalNodes store their data -->
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/usr/local/hadoop/journaldata</value>
</property>
<!-- Enable automatic failover when the active NameNode fails -->
<property>
<name>dfs.ha.automatic-failover.enabled.ns1</name>
<value>true</value>
</property>
<!-- Java class that HDFS clients use to locate the active NameNode when accessing the file system -->
<property>
<name>dfs.client.failover.proxy.provider.ns1</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<!-- Fencing methods used during failover; several are available (see the official docs). Here sshfence (log in remotely and kill the old active NameNode) is used, with shell(/bin/true) as a fallback -->
<property>
<name>dfs.ha.fencing.methods</name>
<value>
sshfence
shell(/bin/true)
</value>
</property>
<!-- Passwordless SSH private key, needed only when the sshfence mechanism is used -->
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/home/hadoop/.ssh/id_rsa</value>
</property>
<!-- sshfence connection timeout; like the key above, it can be omitted if you fence with a shell script instead -->
<property>
<name>dfs.ha.fencing.ssh.connect-timeout</name>
<value>30000</value>
</property>
<!-- Enable automatic failover globally; can be left out if you do not use automatic failover -->
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<!-- Directory where the HDFS NameNode metadata is stored -->
<property>
<name>dfs.name.dir</name>
<value>file:/usr/local/hadoop/name</value>
</property>
<!-- Directory where the HDFS DataNode blocks are stored -->
<property>
<name>dfs.data.dir</name>
<value>file:/usr/local/hadoop/data</value>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
</configuration>
Configure mapred-site.xml as follows:
hadoop@hadoop-spark01:/usr/local/hadoop/etc/hadoop$ nano mapred-site.xml
hadoop@hadoop-spark01:/usr/local/hadoop/etc/hadoop$ cat mapred-site.xml
.....
<configuration>
<!-- Tell Hadoop to run MapReduce (MR) jobs on YARN -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.application.classpath</name>
<value>/usr/local/hadoop/share/hadoop/mapreduce/*,/usr/local/hadoop/share/hadoop/mapreduce/lib/*</value>
</property>
</configuration>
Configure yarn-site.xml as follows:
hadoop@hadoop-spark01:/usr/local/hadoop/etc/hadoop$ nano yarn-site.xml
hadoop@hadoop-spark01:/usr/local/hadoop/etc/hadoop$ cat yarn-site.xml
......
<configuration>
<!-- Auxiliary service run by the NodeManagers; mapreduce_shuffle is required for MapReduce -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<!-- Site specific YARN configuration properties -->
<!-- Enable ResourceManager HA (disabled by default) -->
<property>
<name>yarn.resourcemanager.ha.enabled</name>
<value>true</value>
</property>
<!-- Cluster ID for the ResourceManager HA pair -->
<property>
<name>yarn.resourcemanager.cluster-id</name>
<value>yrc</value>
</property>
<!-- Two ResourceManagers are used; give them logical IDs -->
<property>
<name>yarn.resourcemanager.ha.rm-ids</name>
<value>rm1,rm2</value>
</property>
<!-- Host of rm1 -->
<property>
<name>yarn.resourcemanager.hostname.rm1</name>
<value>hadoop-spark01</value>
</property>
<!-- Host of rm2 -->
<property>
<name>yarn.resourcemanager.hostname.rm2</name>
<value>hadoop-spark02</value>
</property>
<property>
<name>yarn.application.classpath</name>
<value>/usr/local/hadoop/etc/hadoop:/usr/local/hadoop/share/hadoop/common/lib/*:/usr/local/hadoop/share/hadoop/common/*:/usr/local/hadoop/share/hadoop/hdfs:/usr/local/hadoop/share/hadoop/hdfs/lib/*:/usr/local/hadoop/share/hadoop/hdfs/*:/usr/local/hadoop/share/hadoop/mapreduce/lib/*:/usr/local/hadoop/share/hadoop/mapreduce/*:/usr/local/hadoop/share/hadoop/yarn:/usr/local/hadoop/share/hadoop/yarn/lib/*:/usr/local/hadoop/share/hadoop/yarn/*</value>
</property>
<!-- Addresses of the ZooKeeper ensemble -->
<property>
<name>yarn.resourcemanager.zk-address</name>
<value>hadoop-spark01:2181,hadoop-spark02:2181,hadoop-spark03:2181</value>
</property>
<!-- Enable recovery so that running applications survive a ResourceManager failure; the default is false -->
<property>
<name>yarn.resourcemanager.recovery.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.address.rm1</name>
<value>hadoop-spark01:8032</value>
</property>
<property>
<name>yarn.resourcemanager.address.rm2</name>
<value>hadoop-spark02:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address.rm2</name>
<value>hadoop-spark02:8030</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address.rm2</name>
<value>hadoop-spark02:8088</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address.rm2</name>
<value>hadoop-spark02:8031</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address.rm2</name>
<value>hadoop-spark02:8033</value>
</property>
<property>
<name>yarn.resourcemanager.ha.admin.address.rm2</name>
<value>hadoop-spark02:23142</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address.rm1</name>
<value>hadoop-spark01:8030</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address.rm1</name>
<value>hadoop-spark01:8088</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address.rm1</name>
<value>hadoop-spark01:8031</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address.rm1</name>
<value>hadoop-spark01:8033</value>
</property>
<property>
<name>yarn.resourcemanager.ha.admin.address.rm1</name>
<value>hadoop-spark01:23142</value>
</property>
<!-- Store the ResourceManager state in the ZooKeeper ensemble (by default it is stored in the FileSystem) -->
<property>
<name>yarn.resourcemanager.store.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
</configuration>
hadoop@hadoop-spark01:/usr/local/hadoop/etc/hadoop$ vim hadoop-env.sh
hadoop@hadoop-spark01:/usr/local/hadoop/etc/hadoop$ sed -n "54p" hadoop-env.sh
export JAVA_HOME=/usr/local/jdk
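#Start a JournalNode on each of the three nodes before formatting the NameNode (Hadoop must already be unpacked and configured on all of them)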
hadoop@hadoop-spark01:~$ hadoop-daemon.sh start journalnode
hadoop@hadoop-spark02:~$ hadoop-daemon.sh start journalnode
hadoop@hadoop-spark03:~$ hadoop-daemon.sh start journalnode
#Format the NameNode
hadoop@hadoop-spark01:~$ hdfs namenode -format
#Copy the formatted Hadoop directory to hadoop-spark02 (the other NameNode)
hadoop@hadoop-spark01:~$ scp -r /usr/local/hadoop hadoop@hadoop-spark02:/usr/local/
hadoop@hadoop-spark01:~$ hdfs zkfc -formatZK
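hdfs zkfc -formatZK creates the HA state znode for the nameservice in ZooKeeper. As an optional sanity check (the znode path follows from the ns1 nameservice configured above), it can be listed with the ZooKeeper client; the listing should contain ns1:
hadoop@hadoop-spark01:~$ /usr/local/zookeeper/bin/zkCli.sh -server hadoop-spark01:2181 ls /hadoop-ha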
Startup is performed on the Master node. Run the following commands:
#Add the following lines to start-dfs.sh and stop-dfs.sh
hadoop@hadoop-spark01:/usr/local/hadoop/sbin$ vim start-dfs.sh
hadoop@hadoop-spark01:/usr/local/hadoop/sbin$ tail -6 start-dfs.sh
HDFS_NAMENODE_USER=hadoop
HDFS_DATANODE_USER=hadoop
HDFS_DATANODE_SECURE_USER=hadoop
HDFS_SECONDARYNAMENODE_USER=hadoop
HDFS_JOURNALNODE_USER=hadoop
HDFS_ZKFC_USER=hadoop
hadoop@hadoop-spark01:/usr/local/hadoop/sbin$ vim stop-dfs.sh
hadoop@hadoop-spark01:/usr/local/hadoop/sbin$ tail -6 stop-dfs.sh
HDFS_NAMENODE_USER=hadoop
HDFS_DATANODE_USER=hadoop
HDFS_DATANODE_SECURE_USER=hadoop
HDFS_SECONDARYNAMENODE_USER=hadoop
HDFS_JOURNALNODE_USER=hadoop
HDFS_ZKFC_USER=hadoop
Add the following lines to start-yarn.sh and stop-yarn.sh
hadoop@hadoop-spark01:/usr/local/hadoop/sbin$ vim start-yarn.sh
hadoop@hadoop-spark01:/usr/local/hadoop/sbin$ tail -3 start-yarn.sh
YARN_RESOURCEMANAGER_USER=hadoop
HADOOP_SECURE_DN_USER=hadoop
YARN_NODEMANAGER_USER=hadoop
hadoop@hadoop-spark01:/usr/local/hadoop/sbin$ vim stop-yarn.sh
hadoop@hadoop-spark01:/usr/local/hadoop/sbin$ tail -3 stop-yarn.sh
YARN_RESOURCEMANAGER_USER=hadoop
HADOOP_SECURE_DN_USER=hadoop
YARN_NODEMANAGER_USER=hadoop
#Start Hadoop
hadoop@hadoop-spark01:/usr/local/hadoop/sbin$ start-all.sh
#Start the JobHistory server so job history can be viewed in the web UI
hadoop@hadoop-spark01:/usr/local/hadoop/sbin$ mr-jobhistory-daemon.sh start historyserver
#hadoop-spark01
hadoop@hadoop-spark01:~$ jps
62704 NameNode
64068 JobHistoryServer
63256 DFSZKFailoverController
64137 Jps
62106 JournalNode
63691 NodeManager
63547 ResourceManager
62877 DataNode
#hadoop-spark02
hadoop@hadoop-spark02:~$ jps
61860 NodeManager
61239 NameNode
60983 JournalNode
61368 DataNode
62153 Jps
61612 DFSZKFailoverController
61743 ResourceManager
#hadoop-spark03
hadoop@hadoop-spark03:~$ jps
60000 DataNode
60148 NodeManager
60246 Jps
59853 JournalNode
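With all daemons up, the HA roles can also be checked from the command line; one NameNode should report active and the other standby, and likewise for the two ResourceManagers:
hadoop@hadoop-spark01:~$ hdfs haadmin -getServiceState nn1
hadoop@hadoop-spark01:~$ hdfs haadmin -getServiceState nn2
hadoop@hadoop-spark01:~$ yarn rmadmin -getServiceState rm1
hadoop@hadoop-spark01:~$ yarn rmadmin -getServiceState rm2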
The key thing to check is that "Live datanodes" is not 0:
hadoop@hadoop-spark01:~$ hdfs dfsadmin -report
Configured Capacity: 156133490688 (145.41 GB)
Present Capacity: 107745738752 (100.35 GB)
DFS Remaining: 107745652736 (100.35 GB)
DFS Used: 86016 (84 KB)
DFS Used%: 0.00%
Replicated Blocks:
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
Low redundancy blocks with highest priority to recover: 0
Pending deletion blocks: 0
Erasure Coded Block Groups:
Low redundancy block groups: 0
Block groups with corrupt internal blocks: 0
Missing block groups: 0
Low redundancy blocks with highest priority to recover: 0
Pending deletion blocks: 0
-------------------------------------------------
Live datanodes (3):
Name: 192.168.200.43:9866 (hadoop-spark01)
Hostname: hadoop-spark01
Decommission Status : Normal
Configured Capacity: 52044496896 (48.47 GB)
DFS Used: 28672 (28 KB)
Non DFS Used: 13456187392 (12.53 GB)
DFS Remaining: 35914149888 (33.45 GB)
DFS Used%: 0.00%
DFS Remaining%: 69.01%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Fri Jun 17 14:13:41 CST 2022
Last Block Report: Fri Jun 17 13:57:02 CST 2022
Num of Blocks: 0
Name: 192.168.200.44:9866 (hadoop-spark02)
Hostname: hadoop-spark02
Decommission Status : Normal
Configured Capacity: 52044496896 (48.47 GB)
DFS Used: 28672 (28 KB)
Non DFS Used: 13455249408 (12.53 GB)
DFS Remaining: 35915087872 (33.45 GB)
DFS Used%: 0.00%
DFS Remaining%: 69.01%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Fri Jun 17 14:13:41 CST 2022
Last Block Report: Fri Jun 17 13:57:02 CST 2022
Num of Blocks: 0
Name: 192.168.200.45:9866 (hadoop-spark03)
Hostname: hadoop-spark03
Decommission Status : Normal
Configured Capacity: 52044496896 (48.47 GB)
DFS Used: 28672 (28 KB)
Non DFS Used: 13453922304 (12.53 GB)
DFS Remaining: 35916414976 (33.45 GB)
DFS Used%: 0.00%
DFS Remaining%: 69.01%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Fri Jun 17 14:13:41 CST 2022
Last Block Report: Fri Jun 17 13:57:02 CST 2022
Num of Blocks: 0
You can also open http://hadoop-spark01:50070/ in a browser on the Linux machine to check the status of the NameNodes and DataNodes through the web UI. If something is not working, inspect the startup logs to find the cause.
At this point the Hadoop cluster setup is complete.
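Optionally, a small example job can be submitted to confirm that HDFS and YARN work end to end (a quick test; the examples jar ships with the Hadoop 3.2.0 distribution):
#Compute pi with 2 mappers and 10 samples each; the job should finish with status SUCCEEDED
hadoop@hadoop-spark01:~$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0.jar pi 2 10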
hadoop@hadoop-spark01:~$ sudo tar xf scala-2.12.16.tgz -C /usr/local/
hadoop@hadoop-spark01:~$ sudo mv /usr/local/scala-2.12.16 /usr/local/scala
hadoop@hadoop-spark01:~$ sudo chown -R hadoop:hadoop /usr/local/scala/
hadoop@hadoop-spark01:~$ vim ~/.bashrc
hadoop@hadoop-spark01:~$ tail -3 ~/.bashrc
#scala
export SCALA_HOME=/usr/local/scala
export PATH=$PATH:$SCALA_HOME/bin
#Make the environment variables take effect immediately
hadoop@hadoop-spark01:~$ source ~/.bashrc
hadoop@hadoop-spark01:~$ scala -version
Scala code runner version 2.12.16 -- Copyright 2002-2022, LAMP/EPFL and Lightbend, Inc.
hadoop@hadoop-spark01:~$ scala
Welcome to Scala 2.12.16 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_162).
Type in expressions for evaluation. Or try :help.
scala> :quit
hadoop@hadoop-spark01:~$ sudo tar xf spark-3.2.0-bin-hadoop3.2.tgz -C /usr/local/
hadoop@hadoop-spark01:~$ sudo mv /usr/local/spark-3.2.0-bin-hadoop3.2 /usr/local/spark
hadoop@hadoop-spark01:~$ sudo chown -R hadoop:hadoop /usr/local/spark/
hadoop@hadoop-spark01:~$ cd /usr/local/spark/conf/
hadoop@hadoop-spark01:/usr/local/spark/conf$ cp spark-env.sh.template spark-env.sh
hadoop@hadoop-spark01:/usr/local/spark/conf$ vim spark-env.sh
hadoop@hadoop-spark01:/usr/local/spark/conf$ tail -22 spark-env.sh
# jdk
export JAVA_HOME=/usr/local/jdk
# Hadoop home directory
export HADOOP_HOME=/usr/local/hadoop
# Hadoop configuration directory
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
# YARN configuration directory
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
# Spark home directory
export SPARK_HOME=/usr/local/spark
# Spark executables directory
export PATH=$SPARK_HOME/bin:$PATH
# Master node
export SPARK_MASTER_HOST=hadoop-spark01
# Port for job submission
export SPARK_MASTER_PORT=7077
# Memory available to each worker
export SPARK_WORKER_MEMORY=1g
# Number of cores available to each worker
export SPARK_WORKER_CORES=1
# Port of the Spark master web UI
export SPARK_MASTER_WEBUI_PORT=8089
hadoop@hadoop-spark01:/usr/local/spark/conf$ cp workers.template workers
hadoop@hadoop-spark01:/usr/local/spark/conf$ vim workers
hadoop@hadoop-spark01:/usr/local/spark/conf$ tail -3 workers
#localhost
hadoop-spark02
hadoop-spark03
hadoop@hadoop-spark01:/usr/local/spark/conf$ vim ~/.bashrc
hadoop@hadoop-spark01:/usr/local/spark/conf$ tail -3 ~/.bashrc
#spark
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
hadoop@hadoop-spark01:/usr/local/spark/conf$ source ~/.bashrc
#Spark's start/stop script names clash with the Hadoop cluster's, so rename Spark's scripts
hadoop@hadoop-spark01:/usr/local/spark/conf$ cd /usr/local/spark/sbin/
hadoop@hadoop-spark01:/usr/local/spark/sbin$ mv start-all.sh start-all-spark.sh
hadoop@hadoop-spark01:/usr/local/spark/sbin$ mv stop-all.sh stop-all-spark.sh
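Scala and Spark must also be present on the worker nodes (hadoop-spark02 and hadoop-spark03) listed in the workers file before the cluster is started. Assuming the same layout as on the master, they can be copied over in the same way as before:
hadoop@hadoop-spark01:/usr/local/spark/sbin$ scp -r /usr/local/scala /usr/local/spark hadoop@hadoop-spark02:/usr/local/
hadoop@hadoop-spark01:/usr/local/spark/sbin$ scp -r /usr/local/scala /usr/local/spark hadoop@hadoop-spark03:/usr/local/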
Start the Spark cluster:
hadoop@hadoop-spark01:/usr/local/spark/sbin$ start-all-spark.sh
hadoop@hadoop-spark01:~$ vim test.json
hadoop@hadoop-spark01:~$ cat test.json
{"DEST_COUNTRY_NAME":"United States","ORIGIN_COUNTRY_NAME":"Romania","count":1}
hadoop@hadoop-spark01:~$ spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2022-06-20 16:34:08,277 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://hadoop-spark01:4040
Spark context available as 'sc' (master = local[*], app id = local-1655714049138).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.2.0
/_/
Using Scala version 2.12.15 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_162)
Type in expressions to have them evaluated.
Type :help for more information.
scala> val testDF = spark.read.json("file:///home/hadoop/test.json")
testDF: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
scala> testDF.write.format("parquet").save("/spark-dir/parquet/test")
scala>
The terminal output of the write is shown above.
Visit the Hadoop web UI and browse to the output directory; if the Parquet output files have been generated there, the Spark setup succeeded.
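To double-check from the command line, the Parquet output written by the spark-shell session can be listed directly in HDFS, and an example job can be submitted to YARN (the spark-examples jar name below matches the Spark 3.2.0 / Scala 2.12 binary distribution):
#The directory should contain a _SUCCESS marker and one or more part-*.parquet files
hadoop@hadoop-spark01:~$ hdfs dfs -ls /spark-dir/parquet/test
#Optionally run SparkPi on YARN to confirm the Spark-on-YARN integration
hadoop@hadoop-spark01:~$ spark-submit --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi /usr/local/spark/examples/jars/spark-examples_2.12-3.2.0.jar 10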