@zhou-si
        
        2016-12-08T02:40:44.000000Z
    hive
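The big data era has arrived... training courses of wildly varying quality are everywhere... I share, and sharing makes me happy! Share your knowledge with the world!
A few words first: learning most big data frameworks basically starts with compiling them. If you choose a CDH release you can skip this step, but for learning purposes, knowing how to build from source gives you a deeper understanding, and it lets you try new versions before they show up in CDH.
3.1 Environment preparation: a CentOS 6.x or 7.x system with network access (NAT mode recommended) and the firewall turned off. Install the required packages: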
yum install -y svn ncurses-devel gcc* lzo-devel zlib-devel autoconf automake libtool cmake openssl-devel  
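Install a JDK. Any version from 1.7 up will do. It is best to set a global environment variable: export JAVA_HOME=xxxxx
Install Maven. The latest version, 3.3.9 at the time of writing, is recommended. Edit the mirror configuration in Maven's settings.xml by adding the following inside <mirrors></mirrors>: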
<mirror>
<id>nexus-osc</id>
<mirrorOf>central</mirrorOf>
<name>Nexus osc</name>
<url>https://repo.maven.apache.org/maven2</url>
</mirror>
<!-- Note: the old oschina mirror has been shut down; with it the build reports connection-failure warnings
<mirror>
<id>CN</id>
<mirrorOf>central</mirrorOf>
<name>OSChina Central</name>
<url>http://maven.oschina.net/content/groups/public/</url>
</mirror>
-->
It is best to set a global environment variable here as well: vi /etc/profile and add export MAVEN_HOME=xxxxx
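The two export lines for JAVA_HOME and MAVEN_HOME normally end up in /etc/profile. As a minimal sketch, the helper below appends an export line only when that exact line is not already present, so it is safe to run more than once. The function name add_export and the install paths in the example comments are hypothetical; substitute your own.

```shell
# add_export FILE NAME VALUE
# Append "export NAME=VALUE" to FILE unless that exact line already exists.
add_export() (
    file="$1"
    line="export $2=$3"
    # -x: match the whole line, -F: treat the pattern as a fixed string
    if ! grep -qxF "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
)

# Example (hypothetical install paths; run as root, then `source /etc/profile`):
#   add_export /etc/profile JAVA_HOME /usr/local/jdk1.7
#   add_export /etc/profile MAVEN_HOME /usr/local/apache-maven-3.3.9
```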
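Download the Hive source package: http://mirrors.cnnic.cn/apache/hive/hive-2.1.0/  Since this is for learning, I suggest using the newest version; every pitfall you hit is something learned.
After extracting it, enter the source directory and run: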
 mvn clean install  -Pdist -DskipTests -Dhadoop-23.version=2.7.3 -Dspark.version=2.0.2
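To explain: build and write the result to disk, skip the tests, set the Hadoop version to 2.7.3 (currently the latest) and the Spark version to 2.0.2 (currently the latest). A word of warning: this command is guaranteed to fail! The class org/apache/hive/spark/client/RemoteDriver.java imports org.apache.spark.JavaSparkListener. In my view this is a bug.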
 [ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project spark-client: Compilation failure: Compilation failure:
[ERROR] /opt/modules/apache-hive-2.1.0-src/spark-client/src/main/java/org/apache/hive/spark/client/RemoteDriver.java:[46,24] cannot find symbol
[ERROR] symbol: class JavaSparkListener
[ERROR] location: package org.apache.spark
[ERROR] /opt/modules/apache-hive-2.1.0-src/spark-client/src/main/java/org/apache/hive/spark/client/RemoteDriver.java:[444,40] cannot find symbol
[ERROR] symbol: class JavaSparkListener
[ERROR] location: class org.apache.hive.spark.client.RemoteDriver
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluen ... ojoFailureException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn <goals> -rf :spark-client
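The use of JavaSparkListener is new in the Hive 2.1 source and, in my opinion, will probably be removed in 2.2. The class exists in Spark 1.x but not in Spark 2.x, which is why the build fails against Spark 2.0.2.
Solutions: 1. Set the Spark version to 1.6.x and this error goes away.
         2. Download the latest Hive source from GitHub (https://github.com/apache/hive), extract it and build; the error disappears.
         Opening the source shows that the JavaSparkListener import is gone from the RemoteDriver class.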
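To see in advance which of the two fixes applies to the source tree you downloaded, you can check whether it still contains the import. This is only a sketch: both function names are hypothetical, and 1.6.3 is an assumed example of a 1.6.x release, not a version the build requires.

```shell
# Succeeds (exit status 0) if any file under DIR still contains the import
# of org.apache.spark.JavaSparkListener.
imports_java_spark_listener() {
    grep -rqF 'import org.apache.spark.JavaSparkListener;' "$1"
}

# Print the Spark version to pass to -Dspark.version for the source tree DIR.
# 1.6.3 is an assumed example of a 1.6.x release; 2.0.2 is the version used above.
choose_spark_version() {
    if imports_java_spark_listener "$1"; then
        echo "1.6.3"
    else
        echo "2.0.2"
    fi
}

# Example, run from the root of the Hive source tree:
#   mvn clean install -Pdist -DskipTests -Dhadoop-23.version=2.7.3 \
#       -Dspark.version="$(choose_spark_version spark-client/src)"
```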
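The final tar package is generated in packaging/target. For 2.2.0 the name looks like this: apache-hive-2.2.0-SNAPSHOT-bin.tar.gz
Any other errors are unlikely to be anything exotic; they usually come down to network bandwidth or Maven configuration.
2.1 Extract the Hive package and edit the configuration files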
tar -zxvf apache-hive-2.2.0-SNAPSHOT-bin.tar.gz
cd apache-hive-2.2.0-SNAPSHOT-bin/conf
cp hive-default.xml.template hive-site.xml
vi hive-site.xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?><!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--><configuration>
<!-- WARNING!!! This file is auto generated for documentation purposes ONLY! -->
<!-- WARNING!!! Any changes you make to this file will be ignored by Hive. -->
<!-- WARNING!!! You must make your changes in hive-site.xml instead. -->
<!-- Hive Execution Parameters -->
<property>
<name>hive.metastore.schema.verification</name>
<value>true</value>
</property>
<property>
<name>hive.server2.long.polling.timeout</name>
<value>5000</value>
<description>Time in milliseconds that HiveServer2 will wait, before responding to asynchronous calls that use long polling</description>
</property>
<property>
<name>hive.server2.thrift.bind.host</name>
<value>com.cloudera.archive.slave02</value>
<description>Bind host on which to run the HiveServer2 Thrift interface.
Can be overridden by setting $HIVE_SERVER2_THRIFT_BIND_HOST</description>
</property>
<property>
<name>hive.server2.thrift.port</name>
<value>10000</value>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://com.cloudera.archive.slave02:3306/hive22db?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>
<property>
<name>hive.cli.print.current.db</name>
<value>true</value>
</property>
<property>
<name>hive.cli.print.header</name>
<value>true</value>
</property>
<!-- Output compression -->
<property>
<name>hive.exec.compress.output</name>
<value>true</value>
</property>
<!-- Intermediate compression; makes the shuffle phase somewhat faster -->
<property>
<name>hive.exec.compress.intermediate</name>
<value>true</value>
</property>
<!-- Merge small files at the end of a map-reduce job; default false -->
<property>
<name>hive.merge.mapredfiles</name>
<value>true</value>
</property>
<!-- This flag should be set to true to enable vectorized mode of query execution. The default value is false. -->
<property>
<name>hive.vectorized.execution.enabled</name>
<value>true</value>
</property>
<!-- Whether to execute jobs in parallel. Applies to MapReduce jobs that can run in parallel, for example jobs processing different source tables before a join. As of Hive 0.14, also applies to move tasks that can run in parallel, for example moving files to insert targets during multi-insert -->
<property>
<name>hive.exec.parallel</name>
<value>true</value>
</property>
<!-- Chooses execution engine. Options are: mr (Map reduce, default), tez (Tez execution, for Hadoop 2 only), or spark (Spark execution, for Hive 1.1.0 onward). -->
<property>
<name>hive.execution.engine</name>
<value>tez</value>
</property>
</configuration>
vi hive-env.sh and add the Hadoop and Hive settings:
export HADOOP_HOME=/opt/modules/hadoop-2.7.3
export HADOOP_CONF_DIR=/opt/modules/hadoop-2.7.3/etc/hadoop
export HIVE_CONF_DIR=/opt/modules/apache-hive-2.2.0-SNAPSHOT-bin/conf
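Of course, if you only want to get started, you do not need this many parameters. The four most basic ones are enough: javax.jdo.option.ConnectionURL, javax.jdo.option.ConnectionDriverName, javax.jdo.option.ConnectionUserName, javax.jdo.option.ConnectionPassword
If you read carefully, you will notice that the comments on the parameters above say something important: Hive 2.x already recommends Tez or Spark as the execution engine. If you only want MR, use a Hive 1.x release.
You can even leave out those four parameters. Hive's default metastore database is Derby, which supports only a single session. I doubt anyone uses it in production; for your own testing anything goes, but I still recommend MySQL as the Hive metastore database.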
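The four-parameter minimal configuration mentioned above can be sketched as a small generator. The function name write_minimal_hive_site is hypothetical, and the example values in the comments are the ones used in this post; adjust them to your own MySQL host and credentials.

```shell
# write_minimal_hive_site FILE URL DRIVER USER PASSWORD
# Write a hive-site.xml containing only the four metastore connection
# properties. Values are written as-is, so they must not contain
# XML special characters such as & or <.
write_minimal_hive_site() (
    out="$1"; url="$2"; driver="$3"; user="$4"; password="$5"
    cat > "$out" <<EOF
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>${url}</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>${driver}</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>${user}</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>${password}</value>
</property>
</configuration>
EOF
)

# Example with the values used in this post:
#   write_minimal_hive_site conf/hive-site.xml \
#       'jdbc:mysql://com.cloudera.archive.slave02:3306/hive22db?createDatabaseIfNotExist=true' \
#       com.mysql.jdbc.Driver root 123456
```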
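2.2 Install and configure MySQL
You can install from an rpm package; the advantage is that you can pick the version yourself.
I recommend installing with yum: it is convenient and quick, and MySQL does not need to be the latest version.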
yum install -y mysql-server
vi /etc/my.cnf and add default-character-set=utf8
chkconfig mysqld on
chkconfig mysqld --list
mysqld 0:off 1:off 2:on 3:on 4:on 5:on 6:off    (2, 3, 4 and 5 showing "on" means mysqld starts at boot)
service mysqld start
Set the root password once the server is running:
/usr/bin/mysqladmin -uroot password '123456'    (equivalently: mysqladmin -u root password 123456)
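Create the metastore database. Use the same database name as in javax.jdo.option.ConnectionURL (hive22db in the hive-site.xml above); hive is used here as an example.
mysql> create database hive;
Set the character set of the hive database
mysql> alter database hive character set latin1;
Open up access for the root user from other machines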
mysql> grant all privileges on *.* to 'root'@'%' identified by '123456' with grant option;
mysql> grant all privileges on *.* to 'root'@'hostname' identified by '123456' with grant option;
mysql> grant all privileges on *.* to 'root'@'ip' identified by '123456' with grant option;
mysql> flush privileges;
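In the statements above, 'hostname' and 'ip' are placeholders for the machines that need to connect. As a rough sketch, the helper below prints one grant statement per host using the same syntax, so the output can be reviewed and then piped into the mysql client. The function name print_grants and the host values in the example comment are hypothetical.

```shell
# print_grants USER PASSWORD HOST...
# Print one GRANT statement per host, followed by FLUSH PRIVILEGES.
# The statements use the same syntax as the ones shown above.
print_grants() (
    user="$1"; password="$2"
    shift 2
    for host in "$@"; do
        printf "grant all privileges on *.* to '%s'@'%s' identified by '%s' with grant option;\n" \
            "$user" "$host" "$password"
    done
    printf 'flush privileges;\n'
)

# Example (hypothetical host name and IP address):
#   print_grants root 123456 '%' slave02 192.168.1.12 | mysql -uroot -p
```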
Get the JDBC driver jar and copy it into the lib directory of the Hive installation
Install the MySQL connector with yum
yum install -y mysql-connector-java
cp /usr/share/java/mysql-connector-java-5.1.17.jar /usr/local/hive/lib
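The version number in the jar file name depends on what yum installed, so the hard-coded 5.1.17 may not match your system. A minimal sketch that copies whichever connector jar is present is shown below; the function name copy_mysql_connector is hypothetical, and the directories in the example comment are the ones used in the cp command above.

```shell
# copy_mysql_connector SRC_DIR HIVE_LIB_DIR
# Copy every mysql-connector-java*.jar found in SRC_DIR into HIVE_LIB_DIR.
# Returns a non-zero status with a message if no such jar exists.
copy_mysql_connector() (
    src_dir="$1"; lib_dir="$2"
    found=0
    for jar in "$src_dir"/mysql-connector-java*.jar; do
        [ -e "$jar" ] || continue      # the glob matched nothing
        cp "$jar" "$lib_dir"/ && found=1
    done
    if [ "$found" -eq 0 ]; then
        echo "no mysql-connector-java jar found in $src_dir" >&2
        exit 1
    fi
)

# Example:
#   copy_mysql_connector /usr/share/java /usr/local/hive/lib
```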
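Type hive on the command line. If MR is the default execution engine, or if you switch the execution engine to MR, Hive prints a deprecation notice: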
hive (default)> set hive.execution.engine=mr;
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
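That is the official recommendation, so it needs no further comment. Getting to know the newer engines early can only help, and doing it yourself is the best way to learn.
You may also see these warnings: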
which: no hbase in xxx
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
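The first line does not matter unless you really intend to use HBase. Beeline will be covered later. First, a benchmark test: create a table.
More updates to follow...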
insert overwrite table tb_name
select site,product_id,case when site='ae' and wish_product_id is null then product_id else wish_product_id end as wish_product_id from tb_name;