Hive is a data warehouse tool built on top of Hadoop. It maps structured data files onto database tables and provides a simple SQL query capability, translating SQL statements into MapReduce jobs for execution. Its big advantage is the low learning curve: simple MapReduce-style statistics can be expressed quickly in SQL-like statements, with no need to develop a dedicated MapReduce application, which makes Hive a natural fit for statistical analysis over a data warehouse. Hive is an open-source project contributed to Apache by Facebook, and it was arguably built for DBAs: its goal is to keep engineers who are fluent in SQL but not in Java programming productive in the Hadoop era, letting them carry on analyzing data on HDFS even if they know no Java at all. So what does Hive actually do? One way to see it is as an interpreter for SQL: it takes the SQL a DBA submits and compiles it into MapReduce jobs that run on Hadoop, so DBAs and front-end users can query and analyze data at scale with the ease of SQL instead of spending their effort writing MapReduce code.
Like Hadoop, Hive has three running modes:
1. Embedded mode: the metastore is kept in a local embedded Derby database. This is the simplest way to run Hive, but the drawback is obvious: only one session can access the embedded Derby data files at a time, so multi-session connections are not supported. That is barely adequate even for local testing, and is really only for beginners getting familiar with Hive.
2. Local mode: the metastore is kept in a separate standalone database (MySQL, for example), which supports multiple sessions and multiple users. The MySQL instance can be local or remote.
3. Remote mode: the Hive service and the metastore run in separate processes.
1. Before installing:
Have a working Hadoop cluster in place (this walkthrough uses Hive 1.2.1 on Hadoop 2.5.2).
2. Install Hive
1) Download the Hive release:
http://apache.fayea.com/hive/
2) Extract it:
tar -xvzf apache-hive-1.2.1-bin.tar.gz
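The environment variables below assume the extracted directory has been moved to /usr/local/hive (a plain rename; if you keep the versioned directory name, point HIVE_HOME there instead):
# mv apache-hive-1.2.1-bin /usr/local/hive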
3) Set the environment variables:
#HADOOP VARIABLES START
export HADOOP_HOME=/usr/local/hadoop-2.5.2
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native
#HADOOP VARIABLES END
#hbase
export HBASE_HOME=/usr/local/hbase-1.0.0/
#hive
export HIVE_HOME=/usr/local/hive/
export PATH=$PATH:$HBASE_HOME/bin:$HIVE_HOME/bin
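Assuming these exports live in /etc/profile (any shell profile will do), reload them so the current session sees them:
# source /etc/profile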
4) Embedded mode:
No configuration values need to be changed for this mode; just copy the bundled templates into place:
# cp -p hive-default.xml.template hive-default.xml
# cp -p hive-default.xml.template hive-site.xml
Create the /tmp and /user/hive/warehouse directories on HDFS and grant group write permission on them. These are Hive's default data directories, as preset in hive-site.xml.
# su - hadoop
$ hadoop dfs -mkdir /tmp
$ hadoop dfs -mkdir -p /user/hive/warehouse // -p is needed on Hadoop 2.x to create the parent directories
$ hadoop dfs -chmod g+w /tmp
$ hadoop dfs -chmod g+w /user/hive/warehouse
$ hadoop dfs -ls /
drwxrwxr-x - hadoop supergroup 0 2014-06-17 18:57 /tmp
drwxr-xr-x - hadoop supergroup 0 2014-06-17 19:02 /user
drwxr-xr-x - hadoop supergroup 0 2014-06-15 19:31 /usr
$ hadoop dfs -ls /user/hive/
drwxrwxr-x - hadoop supergroup 0 2014-06-17 19:02 /user/hive/warehouse
// start the Hive CLI
$ hive
Logging initialized using configuration in file:/usr/hive/conf/hive-log4j.properties
Hive history file=/tmp/hadoop/hive_job_log_hadoop_201406171916_734435947.txt
hive> show tables;
OK
Time taken: 7.157 seconds
hive> quit;
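One caveat in embedded mode: Derby creates its metastore_db directory (plus a derby.log) under whatever working directory you launch hive from, so starting the CLI from a different directory effectively gives you a fresh, empty metastore:
$ ls // derby.log and metastore_db appear here after the first run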
5) Local standalone mode (configured on top of the embedded setup)
Install and configure MySQL:
# yum install mysql mysql-server // install MySQL
# service mysqld start
# mysql -u root // create the database and user
mysql> create database hive;
Query OK, 1 row affected (0.00 sec)
mysql> grant all on hive.* to 'hive'@'localhost' identified by 'hive';
Query OK, 0 rows affected (0.00 sec)
mysql> flush privileges;
Query OK, 0 rows affected (0.00 sec)
mysql> \q
Bye
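The grant above only allows connections from localhost. If Hive will run on a different host than MySQL, also grant the user access from that host (the wildcard below is just an illustration; restrict it in production):
mysql> grant all on hive.* to 'hive'@'%' identified by 'hive';
mysql> flush privileges;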
Configure hive-site.xml so Hive connects to MySQL:
<!-- the MySQL instance holding the metastore -->
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<!-- the JDBC driver class for MySQL -->
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<!-- the MySQL user name -->
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
<description>username to use against metastore database</description>
</property>
<!-- the MySQL password -->
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hive</value>
<description>password to use against metastore database</description>
</property>
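Hive does not ship with the MySQL JDBC driver, so the connection above will fail with a driver-not-found error until the connector jar is placed in Hive's lib directory (the version number below is only an example; use whichever mysql-connector-java jar you have):
# cp mysql-connector-java-5.1.32-bin.jar /usr/local/hive/lib/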
Start Hive:
$ hive // start the CLI
Logging initialized using configuration in jar:file:/usr/hive/lib/hive-common-0.8.1.jar!/hive-log4j.properties
Hive history file=/tmp/hadoop/hive_job_log_hadoop_201406172021_1374786590.txt
hive> show tables;
OK
Time taken: 5.527 seconds
hive> quit;
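To confirm that the metastore really landed in MySQL, check that Hive created its metadata tables (names such as DBS and TBLS come from Hive's metastore schema) in the hive database:
# mysql -u hive -phive hive -e 'show tables;'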
6) Remote mode:
...
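In outline, remote mode means running the metastore as its own service and pointing clients at it over Thrift (the host name below is a placeholder; 9083 is the conventional metastore port). On the metastore host:
$ hive --service metastore &
and on each client, in hive-site.xml:
<property>
<name>hive.metastore.uris</name>
<value>thrift://metastore-host:9083</value>
</property>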
3. Hive startup error: Found class jline.Terminal, but interface was expected
[ERROR] Terminal initialization failed; falling back to unsupported
java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected
at jline.TerminalFactory.create(TerminalFactory.java:101)
at jline.TerminalFactory.get(TerminalFactory.java:158)
at jline.console.ConsoleReader.<init>(ConsoleReader.java:229)
at jline.console.ConsoleReader.<init>(ConsoleReader.java:221)
at jline.console.ConsoleReader.<init>(ConsoleReader.java:209)
at org.apache.hadoop.hive.cli.CliDriver.getConsoleReader(CliDriver.java:773)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:715)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:615)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Cause:
the Hadoop tree ships an old jline version:
/hadoop-2.5.2/share/hadoop/yarn/lib:
-rw-r--r-- 1 root root 87325 Mar 10 18:10 jline-0.9.94.jar
Fix:
# cp $HIVE_HOME/lib/jline-2.12.jar $HADOOP_HOME/share/hadoop/yarn/lib/
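It may also be worth renaming the old jline-0.9.94.jar out of the way so it cannot shadow the new jar. Alternatively, a commonly documented workaround for this Hive 1.2 / older-Hadoop clash is to make Hadoop put the user (Hive) classpath first before launching the CLI:
$ export HADOOP_USER_CLASSPATH_FIRST=true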
4. Testing:
1) Create a table:
hive> create table test(id INT,str STRING)
> row format delimited
> fields terminated by ','
> stored as textfile;
Time taken: 0.15 seconds
hive> show tables;
OK
test
Time taken: 1.15 seconds
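The load in the next step expects a comma-delimited text file matching the table's (id, str) schema. A small stand-in can be generated like this (the file in the transcript is clearly much larger, given the row count measured later):
$ printf '%s\n' 1,a 2,b 3,c 4,d 5,e 6,f 7,g 8,h 9,i 10,j > /home/hadoop/data_test.txt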
2) Load local data into Hive:
hive> load data local inpath '/home/hadoop/data_test.txt'
> overwrite into table test;
Copying data from file:/home/hadoop/data_test.txt
Copying file: file:/home/hadoop/data_test.txt
Loading data to table default.test
OK
Time taken: 4.322 seconds
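Because test is a managed table, the load copies the file into Hive's warehouse directory on HDFS; you can confirm it arrived:
$ hadoop dfs -ls /user/hive/warehouse/test/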
3) Query the first 10 rows:
hive> select * from test limit 10;
OK
1 a
2 b
3 c
4 d
5 e
6 f
7 g
8 h
9 i
10 j
Time taken: 0.869 seconds
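Note that the plain select ... limit above finished in under a second without launching MapReduce: Hive serves simple projections with a direct fetch from HDFS. The aggregate below is different; Hive compiles it into a full MapReduce job.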
4) Count how many rows the file contains; for this, Hive runs a MapReduce job to compute the value:
hive> select count(1) from test;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201406180238_0001, Tracking URL = http://master:50030/jobdetails.jsp?jobid=job_201406180238_0001
Kill Command = /usr/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=http://192.168.2.101:9001 -kill job_201406180238_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2014-06-18 02:39:43,858 Stage-1 map = 0%, reduce = 0%
2014-06-18 02:39:54,964 Stage-1 map = 100%, reduce = 0%
2014-06-18 02:40:04,078 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201406180238_0001
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 HDFS Read: 33586560 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
4798080
Time taken: 35.687 seconds
Reference: http://hatech.blog.51cto.com/8360868/1427748