
Building a Flume + Kafka Log Collection and Analysis System

BigData


Version History

  @Title Building a Flume + Kafka Log Collection and Analysis System
  @Version v1.0
  @Timestamp 2016-02-18 16:20
  @Author Nicholas
  @Mail redskirt@outlook.com

Integrating Flume and Kafka

  1. Download the flumeng-kafka-plugin: https://github.com/beyondj2ee/flumeng-kafka-plugin
  2. Extract the flume-conf.properties file from the plugin, modify it, and place the modified configuration file in the flume-ng/conf directory (a minimal configuration sketch follows below).
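For reference, here is a minimal flume-kafka.properties sketch. It assumes a spooldir source watching the log directory and the KafkaSink shipped with flumeng-kafka-plugin; the agent name (producer), sink name (sink1), topic (topic1), and the producer.sinks.sink1.custom.topic.name property match what is used later in this article, while the spool directory, broker address, and the remaining values are placeholders to adapt, with the sink property names following the plugin's sample configuration.

  # Agent "producer": spooldir source -> memory channel -> Kafka sink (flumeng-kafka-plugin)
  producer.sources = source1
  producer.channels = channel1
  producer.sinks = sink1

  # Watch a local directory for new log files (path is a placeholder)
  producer.sources.source1.type = spooldir
  producer.sources.source1.spoolDir = /usr/application/tmp/logs
  producer.sources.source1.channels = channel1

  # KafkaSink provided by flumeng-kafka-plugin
  producer.sinks.sink1.type = org.apache.flume.plugins.KafkaSink
  producer.sinks.sink1.metadata.broker.list = 192.168.53.230:9092
  producer.sinks.sink1.serializer.class = kafka.serializer.StringEncoder
  producer.sinks.sink1.request.required.acks = 1
  producer.sinks.sink1.custom.encoding = UTF-8
  producer.sinks.sink1.custom.topic.name = topic1
  producer.sinks.sink1.channel = channel1

  # Simple in-memory channel
  producer.channels.channel1.type = memory
  producer.channels.channel1.capacity = 10000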

Start Flume to collect logs from the monitored directory and push them into Kafka

  [root@master conf]# flume-ng agent -n producer -c conf -f /etc/flume-ng/conf/flume-kafka.properties -Dflume.root.logger=INFO,console

Importing Messages into HDFS with a Custom Kafka Consumer
Run the HadoopConsumer jar with three arguments: zookeeper, groupId, and topicId. Note that groupId and topicId must match producer.sinks.sink1.custom.topic.name in the flume-kafka.properties configuration file.

  [root@master ~]# java -cp /usr/application/tmp/search-query-mq-0.0.1-SNAPSHOT.jar org.redskirt.sq.mq.comsumer.HadoopConsumer 192.168.53.230:2181 topic1 topic1
  [root@master tmp]# hadoop fs -tail /user/root/search-query/date=2016-03-25/data


Creating a Partitioned Hive Table

Hive uses partitions to divide a table coarsely into separate directories, which makes queries more efficient. For example, with log files that carry timestamps, if the logs are split into different directories by date, then records with the same date end up in the same partition directory, and queries restricted to a particular date can be answered far more efficiently.

  CREATE EXTERNAL TABLE data(time varchar(8), userid varchar(30), query string, pagerank int, clickrank int, site string)
  PARTITIONED BY (date string)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

A column declared with PARTITIONED BY looks no different on the surface from an ordinary column. The difference is that the table does not store the partition column's values in the data files; instead, rows are placed under a directory named after the partition value, and the value is recovered from the path. Consequently, when loading data into a partitioned table, the partition value must be specified explicitly.

  hive> desc data;
  OK
  time varchar(8)
  userid varchar(30)
  query string
  pagerank int
  clickrank int
  site string
  date string
  # Partition Information
  # col_name data_type comment
  date string
  Time taken: 0.107 seconds, Fetched: 12 row(s)
  # Load data into the data table
  LOAD DATA INPATH '/user/root/search-query/date=2016-03-25/data'
  INTO TABLE data
  PARTITION(date='2016-03-25');
  # Likewise, prepare two more copies of the data to simulate two more days and load them into the Hive table as partitions
  LOAD DATA INPATH '/user/root/search-query/date=2016-03-26/data'
  INTO TABLE data
  PARTITION(date='2016-03-26');
  LOAD DATA INPATH '/user/root/search-query/date=2016-03-27/data'
  INTO TABLE data
  PARTITION(date='2016-03-27');

Three days of log records were simulated and loaded into three separate partitions (screenshot omitted).

Viewing a table's partitions

  hive> show partitions data;
  OK
  date=2016-03-25
  date=2016-03-26
  date=2016-03-27
  Time taken: 0.085 seconds, Fetched: 3 row(s)
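As a quick check of partition pruning: because the partition value is encoded in the directory path, a query that filters on the partition column only reads the matching directory rather than the whole table. A minimal sketch against the table above:

  -- Only the files under the date=2016-03-26 directory are scanned
  SELECT count(*) FROM data WHERE date='2016-03-26';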

Since data was created as an external table, the underlying files still exist after the table is dropped (a sketch of how to re-attach them follows the listing below).

  hive> drop table if exists data;
  OK
  Time taken: 0.217 seconds
  [root@master data]# hadoop fs -tail /user/hive/warehouse/data/date=2016-03-25/data
  spbbs.asp?boardid=22&id=452584&page=&star=3
  23:59:59 6722332983806055 [全国公安通缉犯名单] 1 4 www.dream88.cn/dispost.asp?BoardID=14&PostID=331
  23:59:59 6391786550589877 [福利彩票] 8 1 www.zhcw.com/
  23:59:59 06550975741118936 [堰塞湖的形成] 2 5 news.sina.com.cn/c/2008-05-28/045015632057.shtml
  23:59:59 045933662148281984 [徐娜是谁] 3 2
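Because the files survive the DROP TABLE, the table can be re-attached later. A sketch of how this could be done, assuming the same schema and warehouse location as above (this step is not part of the original run):

  -- Recreate the external table over the existing warehouse directory
  CREATE EXTERNAL TABLE data(time varchar(8), userid varchar(30), query string, pagerank int, clickrank int, site string)
  PARTITIONED BY (date string)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION '/user/hive/warehouse/data';

  -- Re-register the existing partition directories in the metastore
  MSCK REPAIR TABLE data;
  -- or add them one by one:
  ALTER TABLE data ADD PARTITION (date='2016-03-25') LOCATION '/user/hive/warehouse/data/date=2016-03-25';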

Hive DDL Syntax

  CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
  (col_name data_type, ...)
  [PARTITIONED BY (col_name data_type, ...)]
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY
  (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [SKEWED BY (col_name, col_name, ...)]
  [ [ROW FORMAT row_format] [STORED AS file_format] ]
  [LOCATION hdfs_path]

Exporting Hive Data to MySQL with Sqoop

  # List all databases in MySQL
  [root@master data]# sqoop list-databases --connect jdbc:mysql://localhost:3306/ --username root --password root
  Warning: /opt/cloudera/parcels/CDH-5.3.8-1.cdh5.3.8.p0.5/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
  Please set $ACCUMULO_HOME to the root of your Accumulo installation.
  16/03/25 23:46:47 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.3.8
  16/03/25 23:46:47 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
  16/03/25 23:46:48 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
  information_schema
  cm
  cm_
  hive
  mysql
  search_query
  test
  # Create the table structure in MySQL
  DROP TABLE IF EXISTS `date_2016_03_25`;
  CREATE TABLE `date_2016_03_25` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `time` time DEFAULT NULL,
  `userid` varchar(30) COLLATE utf8_bin DEFAULT NULL,
  `query` varchar(50) COLLATE utf8_bin DEFAULT NULL,
  `pagerank` int(2) DEFAULT NULL,
  `clickrank` int(2) DEFAULT NULL,
  `site` varchar(100) COLLATE utf8_bin DEFAULT NULL,
  `date` date DEFAULT NULL,
  PRIMARY KEY (`id`)
  ) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
  SET FOREIGN_KEY_CHECKS = 1;
  # Export the Hive table data into MySQL (a variant with explicit column mapping is sketched below)
  [root@master data]# sqoop export --connect jdbc:mysql://master:3306/search_query --username root --password root --table date_2016_03_25 --input-fields-terminated-by '\t' --export-dir /user/hive/warehouse/data/date=2016-03-25/data
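The MySQL table above also contains id and date columns that are not present in the exported files (six tab-separated fields per line), so it can help to tell Sqoop explicitly which columns those fields map to. A hedged variant of the export command using the standard --columns option (whether it was needed in the original run is not recorded):

  sqoop export --connect jdbc:mysql://master:3306/search_query \
    --username root --password root \
    --table date_2016_03_25 \
    --columns time,userid,query,pagerank,clickrank,site \
    --input-fields-terminated-by '\t' \
    --export-dir /user/hive/warehouse/data/date=2016-03-25/data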

Processing the Table Data

SELECT syntax

  SELECT [ALL | DISTINCT] select_expr, select_expr, ...
  FROM table_reference
  [WHERE where_condition]
  [GROUP BY col_list]
  [CLUSTER BY col_list
  | [DISTRIBUTE BY col_list] [SORT BY col_list]
  ]
  [LIMIT number]
  1. Count the number of searches for each keyword on 2016-03-25, sort by search count in descending order, and store the result in the table keyword_20160325.

    hive> create table keyword_20160325 as select query,count(query) as count from data where data.date='2016-03-25' group by data.query order by count desc;
    hive> desc keyword_20160325;
    OK
    query string
    count bigint
    Time taken: 0.082 seconds, Fetched: 2 row(s)
    hive> select * from keyword_20160325 limit 10;
    OK
    [哄抢救灾物资] 133812
    [汶川地震原因] 117532
    [封杀莎朗斯通] 25298
    [一个暗娼的自述] 19516
    [广州军区司令员] 17322
    [暗娼李湘] 17168
    [成都警方扫黄现场] 10742
    [百度] 9916
    [尼泊尔地图] 9772
    [现役解放军中将名单] 9442
    Time taken: 0.066 seconds, Fetched: 10 row(s)
    # Query in MySQL
    mysql> desc date_2016_03_25;
    +-----------+-------------+------+-----+---------+-------+
    | Field | Type | Null | Key | Default | Extra |
    +-----------+-------------+------+-----+---------+-------+
    | time | varchar(30) | YES | | NULL | |
    | userid | varchar(30) | YES | | NULL | |
    | query | text | YES | | NULL | |
    | pagerank | int(2) | YES | | NULL | |
    | clickrank | int(2) | YES | | NULL | |
    | site | text | YES | | NULL | |
    +-----------+-------------+------+-----+---------+-------+
    6 rows in set (0.00 sec)
    mysql> select query, count(query) as count from date_2016_03_25 group by query order by count desc limit 10;
    +-------------------------------+--------+
    | query | count |
    +-------------------------------+--------+
    | [哄抢救灾物资] | 133453 |
    | [汶川地震原因] | 122856 |
    | [封杀莎朗斯通] | 27381 |
    | [一个暗娼的自述] | 16460 |
    | [暗娼李湘] | 15675 |
    | [广州军区司令员] | 13092 |
    | [尼泊尔地图] | 11516 |
    | [百度] | 10181 |
    | [吕秀莲到大陆] | 9795 |
    | [杨丞琳辱华惨痛下场] | 9598 |
    +-------------------------------+--------+
    10 rows in set (34.40 sec)
  2. Compute the number of requests per second (TPS), sort by request count in descending order, and write the result to the table time_20160325.

    hive> select time, count(time) as count from data where data.date='2016-03-25' group by time order by count desc limit 10;
    ...
    OK
    15:38:11 116
    21:32:17 112
    16:32:51 112
    17:35:17 112
    15:34:37 110
    16:35:53 110
    21:09:29 110
    17:41:19 108
    16:15:39 108
    15:16:05 106
    Time taken: 59.953 seconds, Fetched: 10 row(s)
    hive> create table time_20160325 as select time, count(time) as count from data where data.date='2016-03-25' group by time order by count desc;
  3. Create a table to store, for each keyword, today's versus yesterday's search counts, the increment, and the increment rate; the table data lives at '/user/hive/warehouse/log_analysis/keyword_20160325_20160326'.

    hive> create external table if not exists keyword_20160325_20160326(keyword string, count int, increment int, incrementrate double) location '/user/hive/warehouse/log_analysis/keyword_20160325_20160326';
    OK
    Time taken: 0.083 seconds
  4. Join the two daily keyword statistics tables, keyword_20160325 and keyword_20160326, to compute, for keywords that appear on both days, today's search count, the increment over yesterday, and the increment rate; sort by increment rate in descending order and write the result to the keyword_20160325_20160326 table.

    # First create and populate the keyword_20160326 table
    hive> create table keyword_20160326 as select query,count(query) as count from data where data.date='2016-03-26' group by data.query order by count desc;
    insert overwrite table keyword_20160325_20160326 select cur.query,cur.count,cur.count - yes.count as increment,(cur.count - yes.count)/yes.count as incrementrate from keyword_20160325 cur join keyword_20160326 as yes on (cur.query = yes.query) order by incrementrate desc;
    hive> select cur.query,cur.count,cur.count - yes.count as increment,(cur.count - yes.count)/yes.count as incrementrate from keyword_20160326 cur join keyword_20160325 as yes on (cur.query = yes.query) order by incrementrate desc limit 5;
    ...
    Total MapReduce CPU Time Spent: 12 seconds 610 msec
    OK
    [哄抢救灾物资] 66906 -66906 -0.5
    [汶川地震原因] 58766 -58766 -0.5
    [封杀莎朗斯通] 12649 -12649 -0.5
    [一个暗娼的自述] 9758 -9758 -0.5
    [广州军区司令员] 8661 -8661 -0.5
    Time taken: 37.813 seconds, Fetched: 5 row(s)
  5. Compute the PV (page view) count for date='2016-03-25', i.e. the total number of page views across all users; every page a user opens is counted once.

    hive> select count(1) as pv from data where date='2016-03-25';
    ...
    OK
    3448528
    Time taken: 28.189 seconds, Fetched: 1 row(s)
  6. View the top 10 most-visited search result URLs.

    hive> select site as request,count(site) as numrequest from data where date='2016-03-25' group by site order by numrequest desc limit 10;
    ...
    OK
    news.21cn.com/social/daqian/2008/05/29/4777194_1.shtml 61724
    news.21cn.com/zhuanti/domestic/08dizhen/2008/05/19/4733406.shtml 38506
    pic.news.mop.com/gs/2008/0528/12985.shtml 33100
    www.tudou.com/programs/view/2F3E6SGHFLA/ 29150
    bjyouth.ynet.com/view.jsp?oid=40472396 24904
    www.17tech.com/news/20080531107270.shtml 22760
    www.baidu.com/ 12086
    www.taihainet.com/news/military/jslwt/2007-06-27/141544.shtml 8622
    www.17tech.com/news/20080531107274.shtml 7628
    news.vnet.cn/photo/292_6.html 7574
    Time taken: 84.934 seconds, Fetched: 10 row(s)
  7. For the users with the top-N search volumes, obtain the detailed statistics of their search keywords (a fuller sketch follows the query below).

    hive> select userid,count(userid) as numuser from data where date='2016-03-25' group by userid order by numuser desc;
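The query above only ranks users by how many searches they issued. To also break a top user's activity down by keyword, one option (a sketch, not from the original run; N = 10 here) is to join the data back against the list of top users:

    -- Per-keyword counts for the 10 most active users on 2016-03-25
    SELECT d.userid AS userid, d.query AS query, count(*) AS numquery
    FROM data d
    JOIN (
        SELECT userid, count(userid) AS numuser
        FROM data
        WHERE date='2016-03-25'
        GROUP BY userid
        ORDER BY numuser DESC
        LIMIT 10
    ) topu ON (d.userid = topu.userid)
    WHERE d.date='2016-03-25'
    GROUP BY d.userid, d.query
    ORDER BY userid, numquery DESC;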

When exporting data to MySQL, the target table must already exist, otherwise the export fails. The errors below are caused by Sqoop failing to map the fields it parses from the files to the columns of the MySQL table, so Sqoop has to be given the field delimiter of the input files in order to parse them correctly: Hive's default field delimiter is '\001', while Sqoop's default is ','.

  Error: java.io.IOException: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
  The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
  at org.apache.sqoop.mapreduce.ExportOutputFormat.getRecordWriter(ExportOutputFormat.java:79)
  at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.<init>(MapTask.java:644)
  at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
  at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:415)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
  Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
  The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
  # Fix: replace localhost in the JDBC URL with the IP address or the hostname master
  # The following problem came up next
  Error: java.io.IOException: Can't export data, please check failed map task logs
  at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:112)
  at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
  at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
  at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
  at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:784)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
  at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:415)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
  Caused by: java.lang.RuntimeException: Can't parse input data: '20:57:15 6401633234309676 [比亚迪f1] 2 jsp.auto.sohu.com/view/subbrand-sbid-279443.html'
  at date_2016_03_25.__loadFromFields(date_2016_03_25.java:507)
  at date_2016_03_25.parse(date_2016_03_25.java:420)
  at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:83)
  ... 10 more
  Caused by: java.lang.NumberFormatException: For input string: "20:57:15 6401633234309676 [比亚迪f1] 2 3 jsp.auto.sohu.com/view/subbrand-sbid-279443.html"
  at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
  at java.lang.Integer.parseInt(Integer.java:492)
  at java.lang.Integer.valueOf(Integer.java:582)
  at date_2016_03_25.__loadFromFields(date_2016_03_25.java:469)
  ... 12 more
  # Another problem encountered:
  Error: java.io.IOException: Can't export data, please check failed map task logs
  at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:112)
  at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
  at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
  at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
  at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:784)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
  at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:415)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
  Caused by: java.lang.RuntimeException: Can't parse input data: 'download.it.com.cn/softweb/software/firewall/antivirus/20067/17938.html'
  at date_2016_03_25.__loadFromFields(date_2016_03_25.java:464)
  at date_2016_03_25.parse(date_2016_03_25.java:382)
  at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:83)
  ... 10 more
  Caused by: java.util.NoSuchElementException
  at java.util.ArrayList$Itr.next(ArrayList.java:834)
  at date_2016_03_25.__loadFromFields(date_2016_03_25.java:459)
  ... 12 more

After fixing the previous problem the export was retried; the map tasks now run, but fail partway through with the following error:

  Error: java.io.IOException: Can't export data, please check failed map task logs
  at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:112)
  at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
  at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
  at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
  at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:784)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
  at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:415)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
  Caused by: java.lang.RuntimeException: Can't parse input data: '家电维修(中、高级)技能培训项目任务书]'
  at date_2016_03_25.__loadFromFields(date_2016_03_25.java:421)
  at date_2016_03_25.parse(date_2016_03_25.java:344)
  at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:83)
  ... 10 more
  Caused by: java.lang.NumberFormatException: For input string: "家电维修(中、高级)技能培训项目任务书]"
  at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
  at java.lang.Integer.parseInt(Integer.java:492)
  at java.lang.Integer.valueOf(Integer.java:582)
  at date_2016_03_25.__loadFromFields(date_2016_03_25.java:408)
  ... 12 more

Grepping the source data for the part matching 家电维修(中、高级)技能培训项目任务书] gives:

  [root@master ~]# hadoop fs -text /user/hive/warehouse/data/date=2016-03-25/data |grep 家电维修(中、高级)技能培训项目任务书]
  21:59:14 8160510088862797 [1 家电维修(中、高级)技能培训项目任务书] 4 1 www.tech.net.cn/tab/tab200404023.doc
  22:03:48 8160510088862797 [1 家电维修(中、高级)技能培训项目任务书] 8 2 www.sict.edu.cn/deng/file/1\xa8\xb2?\xa8\xb0\xa8\xba?\xa1\xa4?D??oD\xa1\xea?\xa1\xa7\xa8\xa6\xa8\xa8???t/1\xa8\xb2?\xa8\xb0\xa8\xba?\xa1\xa4?D???\xa6\xcc\xa8\xa8?\xa1\xe3\xa8\xb0\xa6\xcc?oD\xa1\xea?\xa1\xa7\xa8\xa6\xa8\xa8\xa8\xa8???\xa8\xba\xa8\xa6.doc

So the field [1 家电维修(中、高级)技能培训项目任务书] contains an extra tab in the middle, which throws off the tab-based field splitting when the record is read.
Fix: this problem does not stop the remaining map tasks, so to save time the failed records can simply be ignored and the export allowed to continue. Otherwise, the only option is to GET the source data to the local filesystem, delete the lines containing the stray tab (or replace it with a space ' '), and re-upload the file to HDFS; a rough clean-up sketch follows.
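A possible clean-up, assuming valid records have exactly six tab-separated fields (paths as in the listing above); this variant drops the malformed lines rather than patching them:

  # Copy the partition file to the local filesystem
  hadoop fs -get /user/hive/warehouse/data/date=2016-03-25/data ./data_2016_03_25
  # Keep only lines with exactly six tab-separated fields (drops the rows with the stray tab)
  awk -F'\t' 'NF == 6' ./data_2016_03_25 > ./data_2016_03_25.clean
  # Replace the file in HDFS with the cleaned copy
  hadoop fs -rm /user/hive/warehouse/data/date=2016-03-25/data
  hadoop fs -put ./data_2016_03_25.clean /user/hive/warehouse/data/date=2016-03-25/data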

After the export succeeded, querying the data in MySQL showed that the Chinese text was garbled:

  mysql> select * from date_2016_03_25 limit 10;
  +----------+-------------------+-------------+----------+-----------+----------------------------------------------------+
  | time | userid | query | pagerank | clickrank | site |
  +----------+-------------------+-------------+----------+-----------+----------------------------------------------------+
  | 18:04:18 | 5853765130623645 | [?????] | 2 | 3 | pics.skyhits.com/HtmlPage/2005-06-10/XZ_4332_1.htm |
  | 18:04:18 | 17840627555496796 | [???????] | 3 | 10 | www.xflove.com/MusicPlay/10/b797670a2ddd3378.html |
  | 18:04:18 | 38336111772650055 | [?????] | 6 | 8 | ks.cn.yahoo.com/question/1306060803181.html |
  | 18:04:18 | 677837215166377 | [????????] | 1 | 1 | zhidao.baidu.com/question/46344034 |
  | 18:04:19 | 07869829328567912 | [??????] | 3 | 3 | www.17tech.com/news/20080531107270.shtml |
  | 18:04:19 | 7764140658031031 | [?????????] | 2 | 2 | www.xici.net/b321606/d37714654.htm |
  | 18:04:19 | 11911560432779528 | [????] | 1 | 3 | mogi.moliyo.com/ |
  | 18:04:19 | 7462123601217611 | [????] | 1 | 1 | oneone258.blog.sohu.com/66590341.html |
  | 18:04:19 | 3425538072999014 | [?????] | 2 | 2 | you.video.sina.com.cn/b/7065867-1198766455.html |
  | 18:04:19 | 2949443329241388 | [?????] | 5 | 12 | iask.sina.com.cn/b/9406403.html |
  +----------+-------------------+-------------+----------+-----------+----------------------------------------------------+
  10 rows in set (0.00 sec)

Fix
Edit my.cnf and add the following settings:

  [root@master ~]# vim /etc/my.cnf
  [mysql]
  default-character-set=utf8
  [client]
  default-character-set=utf8
  [mysqld]
  default-character-set=utf8
  character_set_server=utf8
  [root@Hadoop48 ~]# service mysqld restart
  # Check the character set settings
  mysql> \s
  --------------
  mysql Ver 14.14 Distrib 5.1.73, for redhat-linux-gnu (x86_64) using readline 5.1
  Connection id: 16
  Current database:
  Current user: root@localhost
  SSL: Not in use
  Current pager: stdout
  Using outfile: ''
  Using delimiter: ;
  Server version: 5.1.73 Source distribution
  Protocol version: 10
  Connection: Localhost via UNIX socket
  Server characterset: utf8
  Db characterset: utf8
  Client characterset: utf8
  Conn. characterset: utf8
  UNIX socket: /var/lib/mysql/mysql.sock
  Uptime: 13 sec
  Threads: 15 Questions: 253 Slow queries: 0 Opens: 24 Flush tables: 1 Open tables: 17 Queries per second avg: 19.461
  --------------
  mysql> show variables like "char%";
  +--------------------------+----------------------------+
  | Variable_name | Value |
  +--------------------------+----------------------------+
  | character_set_client | utf8 |
  | character_set_connection | utf8 |
  | character_set_database | utf8 |
  | character_set_filesystem | binary |
  | character_set_results | utf8 |
  | character_set_server | utf8 |
  | character_set_system | utf8 |
  | character_sets_dir | /usr/share/mysql/charsets/ |
  +--------------------------+----------------------------+
  8 rows in set (0.00 sec)

Querying the table again, the Chinese text now displays correctly (an alternative fix is sketched after the listing).

  mysql> select * from date_2016_03_25 limit 3;
  +----------+-------------------+----------------------+----------+-----------+--------------------------------------------------------------------------------+
  | time | userid | query | pagerank | clickrank | site |
  +----------+-------------------+----------------------+----------+-----------+--------------------------------------------------------------------------------+
  | 00:00:00 | 2982199073774412 | [360安全卫士] | 8 | 3 | download.it.com.cn/softweb/software/firewall/antivirus/20067/17938.html |
  | 00:00:00 | 07594220010824798 | [哄抢救灾物资] | 1 | 1 | news.21cn.com/social/daqian/2008/05/29/4777194_1.shtml |
  | 00:00:00 | 5228056822071097 | [75810部队] | 14 | 5 | www.greatoo.com/greatoo_cn/list.asp?link_id=276&title=%BE%DE%C2%D6%D0%C2%CE%C5 |
  +----------+-------------------+----------------------+----------+-----------+--------------------------------------------------------------------------------+
  3 rows in set (0.00 sec)
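As an alternative to changing my.cnf (not used in the original run), the connection character set can also be forced on the Sqoop side by appending the standard Connector/J parameters to the JDBC URL:

  sqoop export --connect "jdbc:mysql://master:3306/search_query?useUnicode=true&characterEncoding=utf-8" \
    --username root --password root \
    --table date_2016_03_25 \
    --input-fields-terminated-by '\t' \
    --export-dir /user/hive/warehouse/data/date=2016-03-25/data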

Hive Partitions and Buckets

Partition: splits a table into directories by the value of one or more declared columns (PARTITIONED BY); queries that filter on those columns only scan the matching directories, as shown above.

Bucket: further divides the data of a table or partition into a fixed number of files by hashing a column (CLUSTERED BY ... INTO n BUCKETS), which helps with sampling and bucketed map-side joins (see the sketch after the DDL below).

  hive> create external table log_in(time varchar(8), userid varchar(30), query string, pagerank int, clickrank int, site string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' location 'hdfs://192.168.53.230:8020/user/root/hive/external';
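The external table above illustrates a custom LOCATION rather than bucketing. For completeness, a minimal bucketing sketch following the CLUSTERED BY clause from the DDL template earlier (the table name and bucket count are illustrative, not from the original cluster):

  -- Bucket the search logs by userid into 32 files per partition
  CREATE TABLE data_bucketed(time varchar(8), userid varchar(30), query string, pagerank int, clickrank int, site string)
  PARTITIONED BY (date string)
  CLUSTERED BY (userid) INTO 32 BUCKETS
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

  -- Make inserts honour the bucket definition, then populate one partition from the data table
  SET hive.enforce.bucketing = true;
  INSERT OVERWRITE TABLE data_bucketed PARTITION (date='2016-03-25')
  SELECT time, userid, query, pagerank, clickrank, site FROM data WHERE date='2016-03-25';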

References

Kafka learning notes: http://www.cnblogs.com/cxzdy/p/4980350.html

Kafka in practice: from Flume to Kafka: http://www.open-open.com/lib/view/open1435884136903.html

Highly available Hadoop platform: Flume NG in practice, illustrated: http://www.cnblogs.com/smartloli/p/4468708.html

http://www.cnblogs.com/xfly/p/3825804.html

http://blog.csdn.net/xiao_jun_0820/article/details/41576999

Fixing character type conversion for Kafka producer messages:
http://blog.csdn.net/u014373825/article/details/42711191
