@tsing1226 2015-12-12T16:26:02.000000Z

Advanced Hive: Enterprise Optimization

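1. Why do some SQL queries run MapReduce while others do not?

This is controlled by the following property in the hive-site.xml configuration file: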

<property>
  <name>hive.fetch.task.conversion</name>
  <value>minimal</value>
  <description> Some select queries can be converted to single FETCH task minimizing latency.
Currently the query should be single sourced not having any subquery and should not have
any aggregations or distincts (which incurs RS), lateral views and joins.
1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
2. more: SELECT, FILTER, LIMIT only (TABLESAMPLE, virtual columns)
  </description>
</property>

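When hive.fetch.task.conversion is set to more, queries that consist only of SELECT, FILTER, and LIMIT (including TABLESAMPLE and virtual columns) are served by a single fetch task and do not go through MapReduce. During development it is usually set to more.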
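The effect of the setting can be tried per session. This is a minimal sketch, assuming a hypothetical table `emp` with columns `empno`, `ename`, `sal`, and `deptno` (none of these names come from the original notes). Newer Hive versions may additionally apply an input-size threshold before converting a query to a fetch task.

```sql
-- Assumption: `emp` and its columns are hypothetical, for illustration only.
SET hive.fetch.task.conversion=more;

-- Projection + filter + LIMIT: eligible for a single fetch task under "more".
SELECT empno, ename
FROM emp
WHERE sal > 3000
LIMIT 10;

-- GROUP BY aggregation: not eligible for fetch conversion, so a job is expected.
SELECT deptno, AVG(sal) AS avg_sal
FROM emp
GROUP BY deptno;
```

Switching the value back to minimal and rerunning the first query is a quick way to see the difference in latency.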

2. Execution plan

EXPLAIN [EXTENDED|DEPENDENCY] query
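As a sketch of how the syntax above is typically used, the statements below ask Hive for the plan of one query in the three forms. The table `emp` and its columns are assumptions made for illustration.

```sql
-- Assumption: `emp` (deptno, sal) is a hypothetical table.

-- Basic plan: stage dependencies and the operator tree of each stage.
EXPLAIN
SELECT deptno, AVG(sal) AS avg_sal FROM emp GROUP BY deptno;

-- Extended plan: adds further detail such as file paths.
EXPLAIN EXTENDED
SELECT deptno, AVG(sal) AS avg_sal FROM emp GROUP BY deptno;

-- Dependency view: lists the tables and partitions the query reads.
EXPLAIN DEPENDENCY
SELECT deptno, AVG(sal) AS avg_sal FROM emp GROUP BY deptno;
```

A plan with only a fetch stage indicates that no MapReduce job will be launched for the query.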

3. Splitting large tables and reusing intermediate result sets

In a log file, each record often carries a very large number of fields, yet a typical analysis uses only a few of them.

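Solution: based on the business requirements, extract the fields to be analyzed from the original table into a corresponding business table (sub-table), and run the actual analysis against that business table.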
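One way to build such a sub-table is CREATE TABLE AS SELECT. The sketch below assumes a hypothetical wide table `raw_access_log` and three illustrative columns; the real field list depends on the business requirement.

```sql
-- Assumption: `raw_access_log` and the selected columns are hypothetical.
-- Keep only the fields the page-view analysis actually needs.
CREATE TABLE log_page_views
STORED AS ORC
AS
SELECT user_id, url, request_time
FROM raw_access_log;

-- Downstream analysis reads the narrow table instead of the wide one.
SELECT url, COUNT(*) AS views
FROM log_page_views
GROUP BY url;
```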

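Problem: in practice, several business workloads share a common data set. For example, the user table, order table, and product table need to be joined into one result set, and many workloads then analyze that join result.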

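Solution: extract the intermediate result set shared by many workloads into a single Hive table, and have each workload run its analysis against that table.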
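The shared join can be materialized once and reused. This is a sketch under stated assumptions: the tables `users`, `orders`, and `products` and all column names are hypothetical.

```sql
-- Assumption: all table and column names below are hypothetical.
-- Materialize the three-way join once.
CREATE TABLE user_order_product
AS
SELECT u.user_id, u.user_name,
       o.order_id, o.amount,
       p.product_id, p.product_name
FROM orders o
JOIN users u ON o.user_id = u.user_id
JOIN products p ON o.product_id = p.product_id;

-- Each workload then queries the materialized result instead of re-joining.
SELECT product_id, SUM(amount) AS revenue
FROM user_order_product
GROUP BY product_id;
```

The trade-off is extra storage and a refresh step whenever the source tables change.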

4. External tables and partitioned tables

Use the two together; use multi-level partitions
union: merge the results of multiple tables
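A sketch of an external table with two partition levels, followed by a UNION ALL over two tables. The table names, columns, and the HDFS path are assumptions made for illustration.

```sql
-- Assumption: names and the path /data/logs/access_log are hypothetical.
CREATE EXTERNAL TABLE IF NOT EXISTS access_log (
  ip STRING,
  url STRING,
  request_time STRING
)
PARTITIONED BY (log_month STRING, log_day STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/logs/access_log';

-- Register one day of data as a second-level partition.
ALTER TABLE access_log ADD IF NOT EXISTS
PARTITION (log_month='201512', log_day='12');

-- Merge rows from two tables; the subquery form also works on older Hive versions.
SELECT t.url
FROM (
  SELECT url FROM access_log WHERE log_month = '201512' AND log_day = '12'
  UNION ALL
  SELECT url FROM access_log_archive WHERE log_month = '201511'
) t;
```

Filtering on the partition columns lets Hive read only the matching directories rather than the whole table.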

5. Data storage

Storage format (textfile, orcfile, parquet)
Data compression (snappy)
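As an illustration of combining a columnar format with Snappy, the sketch below converts a text table to ORC. The table names and columns are hypothetical; `orc.compress` is the ORC table property that selects the codec.

```sql
-- Assumption: `page_views_text` and `page_views_orc` are hypothetical tables.
CREATE TABLE page_views_orc (
  user_id STRING,
  url STRING
)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");

-- Rewrite the text data into the compressed columnar table.
INSERT OVERWRITE TABLE page_views_orc
SELECT user_id, url
FROM page_views_text;
```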

6. SQL optimization

Optimize the statements;
Join filter
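One common reading of "Join filter" is to filter each input before joining so that fewer rows reach the join. That reading is an interpretation, not something the notes spell out, and the tables and columns below are hypothetical.

```sql
-- Assumption: `orders`, `users`, and their columns are hypothetical.
-- Filter and project inside the subqueries, then join the reduced inputs.
SELECT o.order_id, u.user_name
FROM (
  SELECT order_id, user_id
  FROM orders
  WHERE order_date = '2015-12-12'
) o
JOIN (
  SELECT user_id, user_name
  FROM users
  WHERE status = 'active'
) u
ON o.user_id = u.user_id;
```

Hive's optimizer can push some predicates down on its own; writing the filters explicitly makes the intent visible regardless.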

7. MapReduce

JVM reuse

mapreduce.job.jvm.numtasks
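A minimal sketch of setting this property for a session. The value 10 is illustrative, and the setting only matters on a runtime that honors JVM reuse, which is an assumption here.

```sql
-- Illustrative value: allow one task JVM to run up to 10 tasks of the same job.
SET mapreduce.job.jvm.numtasks=10;
```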

Number of reducers

mapreduce.job.reduces
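The reducer count can be fixed per session, or left for Hive to estimate from the input size. The values below are illustrative only.

```sql
-- Option 1: fix the number of reducers for the session.
SET mapreduce.job.reduces=4;

-- Option 2: let Hive estimate the count from the input size.
SET mapreduce.job.reduces=-1;
-- Bytes of input handled per reducer when estimating.
SET hive.exec.reducers.bytes.per.reducer=256000000;
-- Upper bound on the estimated reducer count.
SET hive.exec.reducers.max=100;
```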

Speculative execution

mapreduce.map.speculative                          true
hive.mapred.reduce.tasks.speculative.execution     true
mapreduce.reduce.speculative                       true

During development, these are usually set to false.
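For a development session, the three properties listed above can be switched off with SET statements, as in this short sketch:

```sql
-- Disable speculative execution for map tasks.
SET mapreduce.map.speculative=false;
-- Disable speculative execution for reduce tasks.
SET mapreduce.reduce.speculative=false;
-- Disable Hive's own switch for reducer speculation.
SET hive.mapred.reduce.tasks.speculative.execution=false;
```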

Number of map tasks

hive.merge.size.per.task   256000000
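hive.merge.size.per.task is the target size of the files Hive produces when it merges small output files at the end of a job; fewer, larger files generally mean fewer map tasks for the jobs that read them later. A sketch of the related merge settings, with illustrative values:

```sql
-- Merge small files at the end of a map-only job.
SET hive.merge.mapfiles=true;
-- Merge small files at the end of a map-reduce job.
SET hive.merge.mapredfiles=true;
-- Target size of each merged file, in bytes.
SET hive.merge.size.per.task=256000000;
-- Trigger a merge when the average output file size falls below this many bytes.
SET hive.merge.smallfiles.avgsize=16000000;
```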

Parallel execution

<property>
  <name>hive.exec.parallel.thread.number</name>
  <value>8</value>
  <description>How many jobs at most can be executed in parallel</description>
</property>
<property>
  <name>hive.exec.parallel</name>
  <value>true</value>
  <description>Whether to execute jobs in parallel</description>
</property>
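The same two properties can be set per session. Parallel execution only helps when a query has stages that do not depend on each other, such as the two branches of a UNION ALL. The tables in this sketch are hypothetical.

```sql
-- Allow independent stages of one query to run at the same time.
SET hive.exec.parallel=true;
-- Upper bound on the number of jobs running in parallel.
SET hive.exec.parallel.thread.number=8;

-- Assumption: `orders` and `users` are hypothetical tables.
-- The two aggregations are independent stages and can run concurrently.
SELECT t.src, t.cnt
FROM (
  SELECT 'orders' AS src, COUNT(*) AS cnt FROM orders
  UNION ALL
  SELECT 'users' AS src, COUNT(*) AS cnt FROM users
) t;
```

Running stages in parallel uses more cluster resources at once, so the benefit depends on available capacity.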