@zqbinggong
2018-06-12T20:16:42.000000Z
hadoop
The Definitive Guide
! All pictures are screenshots from the book 'Hadoop: The Definitive Guide, Fourth Edition', by Tom White (O'Reilly). Copyright © 2015 Tom White, 978-1-491-90163-2.
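Components in Hadoop are configured through Hadoop's own configuration API. A Configuration reads its property values from resources, that is, XML files that define name-value pairs using a simple structure.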
Configuration conf = new Configuration();
conf.addResource("configuration-1.xml");
assertThat(conf.get("color"), is("yellow"));
assertThat(conf.getInt("size", 0), is(10));
assertThat(conf.get("breadth", "wide"), is("wide")); // "breadth" is not defined in the XML, so the supplied default "wide" is returned
Configuration conf = new Configuration();
conf.addResource("configuration-1.xml");
conf.addResource("configuration-2.xml"); // overrides properties with the same name defined in configuration-1.xml
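The override rule can be sketched in plain Java. This is not Hadoop's Configuration class: LayeredProperties and its methods are hypothetical names, and the contents of both resources (including marking weight as final) are assumed for illustration. The sketch models two behaviors: a resource added later wins, and a property marked final in an earlier resource cannot be overridden.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for the override behavior of a layered configuration. */
final class LayeredProperties {
    private final Map<String, String> values = new HashMap<>();
    private final Set<String> finals = new HashSet<>();

    /** Adds one resource; finalNames lists the properties this resource marks as final. */
    void addResource(Map<String, String> resource, Set<String> finalNames) {
        for (Map.Entry<String, String> e : resource.entrySet()) {
            if (finals.contains(e.getKey())) {
                continue; // an earlier resource declared this property final
            }
            values.put(e.getKey(), e.getValue());
        }
        finals.addAll(finalNames);
    }

    /** Returns the property value, or defaultValue when the property is not defined. */
    String get(String name, String defaultValue) {
        return values.getOrDefault(name, defaultValue);
    }

    /** Builds the two-resource example; every value here is assumed for illustration. */
    static LayeredProperties example() {
        LayeredProperties conf = new LayeredProperties();
        conf.addResource(
            Map.of("color", "yellow", "size", "10", "weight", "heavy"),
            Set.of("weight"));
        conf.addResource(Map.of("size", "12", "weight", "light"), Set.of());
        return conf;
    }

    public static void main(String[] args) {
        LayeredProperties conf = example();
        System.out.println(conf.get("size", "0"));      // 12, the later resource wins
        System.out.println(conf.get("weight", "none")); // heavy, final in the first resource
    }
}
```

In practice, marking a property final is how an administrator stops client-side resources from changing it.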
<property>
<name>size-weight</name>
<value>${size},${weight}</value>
<description>Size and weight</description>
</property>
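The ${size},${weight} value above is expanded from other properties when it is read. The following is a minimal plain-Java sketch of that kind of expansion, not Hadoop's implementation: VariableExpander is a hypothetical name, the property values are assumed, and the sketch does a single pass and ignores system properties.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that mimics ${name} expansion in property values. */
final class VariableExpander {
    // Matches ${name} and captures the name.
    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces each ${name} with its value from props; unknown names are left untouched. */
    static String expand(String value, Map<String, String> props) {
        Matcher m = VAR.matcher(value);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String replacement = props.getOrDefault(m.group(1), m.group(0));
            // quoteReplacement keeps '$' and '\' in the replacement literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> props = Map.of("size", "12", "weight", "heavy"); // assumed values
        System.out.println(expand("${size},${weight}", props)); // prints 12,heavy
    }
}
```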
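Problem: when developing a Hadoop application, you often need to switch between running it locally and running it on a cluster.
Solution: keep Hadoop configuration files that hold the connection settings for each cluster, and specify which one to use when you run a Hadoop application or tool.
Best practice: keep these files outside the Hadoop installation directory tree, which makes switching easy and avoids duplicating or losing settings.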
hadoop fs -conf conf/hadoop-localhost.xml -ls
ToolRunner.run(new ConfigurationPrinter(), args)
public static int run(Configuration conf,Tool tool,String[] args)
Runs the given Tool by Tool.run(String[]), after parsing with the given generic arguments. Uses the given Configuration, or builds one if null. Sets the Tool's configuration with the possibly modified version of the conf.
1. hadoop ConfigurationPrinter -conf conf/hadoop-localhost.xml
2. hadoop ConfigurationPrinter -D color=yellow
public class ConfigurationPrinter extends Configured implements Tool {
static {
// Note: a Configuration loads only core-default.xml and core-site.xml by default
Configuration.addDefaultResource("hdfs-default.xml");
}
@Override
public int run(String[] args) throws Exception {
Configuration conf = getConf(); // getConf() comes from the Configurable interface; Configured is its simplest implementation
for (Entry<String, String> entry: conf) {
System.out.printf("%s=%s\n", entry.getKey(), entry.getValue());
}
return 0;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new ConfigurationPrinter(), args);
System.exit(exitCode);
}
}
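The -D color=yellow usage shown earlier works because ToolRunner strips generic options from the argument list and applies them to the Configuration before Tool.run() is called. Below is a simplified plain-Java model of that split. GenericOptionsSketch is a hypothetical class, it handles only the "-D property=value" form, and it is not Hadoop's GenericOptionsParser.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified model of separating -D options from tool arguments. */
final class GenericOptionsSketch {
    /** Properties collected from "-D property=value" pairs, in command-line order. */
    final Map<String, String> properties = new LinkedHashMap<>();
    /** Arguments left over for the tool itself. */
    final List<String> remainingArgs = new ArrayList<>();

    GenericOptionsSketch(String[] args) {
        for (int i = 0; i < args.length; i++) {
            if (args[i].equals("-D") && i + 1 < args.length) {
                // Split only on the first '=' so values may themselves contain '='.
                String[] kv = args[++i].split("=", 2);
                properties.put(kv[0], kv.length > 1 ? kv[1] : "");
            } else {
                remainingArgs.add(args[i]);
            }
        }
    }

    public static void main(String[] args) {
        GenericOptionsSketch parsed =
            new GenericOptionsSketch(new String[] {"-D", "color=yellow", "input", "output"});
        System.out.println(parsed.properties);    // {color=yellow}
        System.out.println(parsed.remainingArgs); // [input, output]
    }
}
```

Because the tool only ever sees the remaining arguments, the same driver code works whether or not generic options are passed on the command line.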
Goal: write a job driver, then run it on the development machine against test data.
public class MaxTemperatureDriver extends Configured implements Tool {
@Override
public int run(String[] args) throws Exception {
Job job = Job.getInstance(getConf(), "Max temperature"); // Job.getInstance replaces the deprecated Job constructor
job.setJarByClass(getClass());
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(MaxTemperatureMapper.class);
job.setCombinerClass(MaxTemperatureReducer.class);
job.setReducerClass(MaxTemperatureReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new MaxTemperatureDriver(), args);
System.exit(exitCode);
}
}
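Besides providing flexible configuration options, making the application implement Tool also improves testability, because an arbitrary Configuration can be injected.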
@Test
public void test() throws Exception {
Configuration conf = new Configuration();
conf.set("fs.defaultFS", "file:///");
conf.set("mapreduce.framework.name", "local");
conf.setInt("mapreduce.task.io.sort.mb", 1);
Path input = new Path("input/ncdc/micro");
Path output = new Path("output");
FileSystem fs = FileSystem.getLocal(conf);
fs.delete(output, true); // delete old output
MaxTemperatureDriver driver = new MaxTemperatureDriver();
driver.setConf(conf);
int exitCode = driver.run(new String[] {
input.toString(), output.toString() });
assertThat(exitCode, is(0));
checkOutput(conf, output); // compares the actual output with the expected output (where does the expected output come from?)
}
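The notes do not show checkOutput itself. One plausible way to implement such a check is a line-by-line comparison between the job's output file and a known-good file kept with the test sources. The sketch below is an assumption about how that could look, not the book's code. OutputChecker and firstDifference are hypothetical names, and the sample records in main are illustrative.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical helper that compares job output with an expected result. */
final class OutputChecker {
    /** Returns the 1-based number of the first differing line, or 0 if both lists match. */
    static int firstDifference(List<String> actual, List<String> expected) {
        int common = Math.min(actual.size(), expected.size());
        for (int i = 0; i < common; i++) {
            if (!actual.get(i).equals(expected.get(i))) {
                return i + 1;
            }
        }
        // Same prefix: the lists differ only if one has extra lines.
        return actual.size() == expected.size() ? 0 : common + 1;
    }

    /** File-based variant: reads both files as UTF-8 and compares their lines. */
    static int firstDifference(Path actual, Path expected) throws IOException {
        return firstDifference(
            Files.readAllLines(actual, StandardCharsets.UTF_8),
            Files.readAllLines(expected, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        List<String> expected = List.of("1949\t111", "1950\t22"); // illustrative records
        List<String> actual = List.of("1949\t111", "1950\t22");
        System.out.println(firstDifference(actual, expected)); // 0 means the outputs match
    }
}
```

Returning the line number rather than a boolean makes a failing test message point straight at the mismatch.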
The client-side user classpath set by hadoop jar <jar> consists of:
On a cluster (and this includes pseudodistributed mode), map and reduce tasks run in separate JVMs, and their classpaths are not controlled by HADOOP_CLASSPATH. HADOOP_CLASSPATH is a client-side setting and only sets the classpath for the driver JVM, which submits the job.
Corresponding options for including library dependencies for a job:
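The user's JAR file is added to the end of both the client classpath and the task classpath. This means that if Hadoop uses a different, incompatible version of a library that your code also uses, conflicts can occur. In that case you need to change the classpath order so that your classes are picked up first.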
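Why the order matters can be shown with a small plain-Java model: a class is loaded from the first classpath entry that provides it. ClasspathOrderSketch, the class name com.example.Json and the version labels are all hypothetical. In Hadoop 2 the switches that put user classes first are, as far as I know, the HADOOP_USER_CLASSPATH_FIRST environment variable on the client and the mapreduce.job.user.classpath.first property for tasks; the sketch only models the ordering effect, not those settings.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical model of first-match-wins class lookup along a classpath. */
final class ClasspathOrderSketch {
    /**
     * Each classpath entry maps a class name to the version label it provides.
     * Returns the version found in the first entry that contains the class, or null.
     */
    static String resolve(String className, List<Map<String, String>> classpath) {
        for (Map<String, String> entry : classpath) {
            String version = entry.get(className);
            if (version != null) {
                return version;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, String> hadoopLibs = Map.of("com.example.Json", "1.0"); // assumed
        Map<String, String> userJar = Map.of("com.example.Json", "2.0");    // assumed

        // Default order: user JAR last, so Hadoop's copy of the library is used.
        System.out.println(resolve("com.example.Json", List.of(hadoopLibs, userJar))); // 1.0
        // User classes first: the version bundled with the job is used.
        System.out.println(resolve("com.example.Json", List.of(userJar, hadoopLibs))); // 2.0
    }
}
```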
We unset the HADOOP_CLASSPATH environment variable because we don’t have any third-party dependencies for this job. If it were left set to target/classes/ (from earlier in the chapter), Hadoop wouldn’t be able to find the job JAR; it would load the MaxTemperatureDriver class from target/classes rather than the JAR, and the job would fail.
unset HADOOP_CLASSPATH
hadoop jar hadoop-examples.jar v2.MaxTemperatureDriver \
-conf conf/hadoop-cluster.xml input/ncdc/all max-temp
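Problem: how to turn a data processing problem into the MapReduce model.
The ChainMapper library class that comes with Hadoop can chain many mappers into a single mapper. Combined with ChainReducer, a chain of mappers, followed by a reducer and another chain of mappers, can run in a single MapReduce job.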
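The idea of chaining several mappers into one can be illustrated without Hadoop: each stage turns one record into zero or more records, and the chain feeds the output of one stage into the next. The sketch below is an analogy in plain Java, not the Hadoop chaining API. MapperChainSketch, the comma-separated input format and the 9999 missing-value marker are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical analogy for chaining small map stages into one map function. */
final class MapperChainSketch {
    /** Chains stages so that the records emitted by one stage feed the next. */
    @SafeVarargs
    static Function<String, List<String>> chain(Function<String, List<String>>... stages) {
        return input -> {
            List<String> records = List.of(input);
            for (Function<String, List<String>> stage : stages) {
                List<String> next = new ArrayList<>();
                for (String record : records) {
                    next.addAll(stage.apply(record));
                }
                records = next;
            }
            return records;
        };
    }

    /** Filtering stage: drops records whose reading is the assumed marker 9999. */
    static Function<String, List<String>> dropMissing() {
        return line -> {
            if (line.endsWith(",9999")) {
                return List.of();
            }
            return List.of(line);
        };
    }

    /** Projection stage: rewrites "year,temp" as a tab-separated key and value. */
    static Function<String, List<String>> toTabSeparated() {
        return line -> {
            String[] fields = line.split(",");
            return List.of(fields[0] + "\t" + fields[1]);
        };
    }

    /** The filter followed by the projection, combined into a single function. */
    static Function<String, List<String>> example() {
        return chain(dropMissing(), toTabSeparated());
    }

    public static void main(String[] args) {
        System.out.println(example().apply("1950,22"));   // one record: 1950<TAB>22
        System.out.println(example().apply("1950,9999")); // [] because the record is filtered out
    }
}
```

Keeping parsing, filtering and projection in separate stages makes each one easy to test and reuse, which is the motivation for chaining.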
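Problem: when a workflow contains more than one job, how do you make the jobs run in order?
The main consideration is whether the jobs form a linear chain or a more complex directed acyclic graph (DAG) of jobs.
1. For a linear chain, the simplest approach is to run each job one after another: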
JobClient.runJob(conf1);
JobClient.runJob(conf2);
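The important property of a linear chain is that a later job must not start if an earlier one fails. A minimal plain-Java sketch of that control flow follows. LinearWorkflow is a hypothetical name, and each job is modeled as a BooleanSupplier that returns true on success, in the same spirit as the boolean returned by job.waitForCompletion(true) in the driver above.

```java
import java.util.List;
import java.util.function.BooleanSupplier;

/** Hypothetical runner for a linear chain of jobs. */
final class LinearWorkflow {
    /**
     * Runs the jobs in order and stops at the first failure.
     * Returns how many jobs completed successfully.
     */
    static int runAll(List<BooleanSupplier> jobs) {
        int completed = 0;
        for (BooleanSupplier job : jobs) {
            if (!job.getAsBoolean()) {
                break; // later jobs depend on this one, so do not run them
            }
            completed++;
        }
        return completed;
    }

    /** Three stand-in jobs where the second one fails. */
    static int demo() {
        List<BooleanSupplier> jobs = List.of(() -> true, () -> false, () -> true);
        return runAll(jobs);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // 1: only the first job completed
    }
}
```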
Apache Oozie is a system for running workflows of dependent jobs. It is composed of two main parts: