1. Data format: each record in rating.json is one JSON string per line, as shown below
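A representative line (reconstructed from the fields extracted in the queries below and the first row of their output; the actual file may format values slightly differently):

{"movie":"1193","rate":"5","time":"978300760","userid":"1"}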
2. Create the table and load the data
hive (test)> create table rating_json(json string);
hive (test)> load data local inpath '/home/hadoop/testdata/json/rating.json' into table rating_json;
Loading data to table test.rating_json
Table test.rating_json stats: [numFiles=1, totalSize=63602280]
OK
Time taken: 0.68 seconds
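After loading, each raw JSON record sits in the table's single json column. A quick sanity check (a sketch; output omitted):

select json from rating_json limit 1;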
3. Use the built-in function json_tuple
hive (test)> desc function json_tuple;
OK
tab_name
json_tuple(jsonStr, p1, p2, ..., pn) - like get_json_object, but it takes multiple names and return a tuple. All the input parameters and output column types are string.
Time taken: 0.004 seconds, Fetched: 1 row(s)
hive (test)> select json_tuple(json,'movie','rate','time','userid') as (movie_id,rate,time,user_id) from rating_json limit 10;
OK
movie_id rate time user_id
1193 5 978300760 1
661 3 978302109 1
914 3 978301968 1
3408 4 978300275 1
2355 5 978824291 1
1197 3 978302268 1
1287 5 978302039 1
2804 5 978300719 1
594 4 978302268 1
919 4 978301368 1
Time taken: 0.026 seconds, Fetched: 10 row(s)
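As the built-in description says, json_tuple works like get_json_object but extracts multiple fields in a single call, so the JSON string is parsed only once per row. For comparison, the equivalent get_json_object version parses the string once per field (a sketch, not from the original session):

select get_json_object(json,'$.movie')  as movie_id,
       get_json_object(json,'$.rate')   as rate,
       get_json_object(json,'$.time')   as time,
       get_json_object(json,'$.userid') as user_id
from rating_json
limit 10;

json_tuple is also commonly combined with LATERAL VIEW when the extracted fields need to be mixed with other columns:

select t.movie_id, t.rate
from rating_json
lateral view json_tuple(json,'movie','rate') t as movie_id, rate
limit 10;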
4. Parse the time field in the JSON data into year, month, day, hour, minute, and a full timestamp (yyyy-MM-dd HH:mm:ss)
This step relies on several Hive built-in functions: cast(time as bigint), from_unixtime(), year(), month(), day(), hour(), minute(), and so on.
hive (test)> select movie_id,rate,time,user_id,
> year(from_unixtime(cast(time as bigint))) as year,
> month(from_unixtime(cast(time as bigint))) as month,
> day(from_unixtime(cast(time as bigint))) as day,
> hour(from_unixtime(cast(time as bigint))) as hour,
> minute(from_unixtime(cast(time as bigint))) as minute,
> from_unixtime(cast(time as bigint)) as tx
> from (
> select json_tuple(json,'movie','rate','time','userid') as (movie_id,rate,time,user_id) from rating_json
> ) tmp
> limit 10;
OK
movie_id rate time user_id year month day hour minute tx
1193 5 978300760 1 2001 1 1 6 12 2001-01-01 06:12:40
661 3 978302109 1 2001 1 1 6 35 2001-01-01 06:35:09
914 3 978301968 1 2001 1 1 6 32 2001-01-01 06:32:48
3408 4 978300275 1 2001 1 1 6 4 2001-01-01 06:04:35
2355 5 978824291 1 2001 1 7 7 38 2001-01-07 07:38:11
1197 3 978302268 1 2001 1 1 6 37 2001-01-01 06:37:48
1287 5 978302039 1 2001 1 1 6 33 2001-01-01 06:33:59
2804 5 978300719 1 2001 1 1 6 11 2001-01-01 06:11:59
594 4 978302268 1 2001 1 1 6 37 2001-01-01 06:37:48
919 4 978301368 1 2001 1 1 6 22 2001-01-01 06:22:48
Time taken: 0.047 seconds, Fetched: 10 row(s)
hive (test)>
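Note that from_unixtime also accepts an explicit format string, which is handy when you only need a formatted date rather than the separate components (a sketch; the dt alias is mine):

select from_unixtime(cast(time as bigint),'yyyy-MM-dd') as dt
from (
  select json_tuple(json,'movie','rate','time','userid') as (movie_id,rate,time,user_id) from rating_json
) tmp
limit 10;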
5. Create the wide table rating_width (note the limit 10 in the CTAS below, which keeps this example table at 10 rows; drop it to materialize the full dataset)
hive (test)> create table rating_width as
> select movie_id,rate,time,user_id,
> year(from_unixtime(cast(time as bigint))) as year,
> month(from_unixtime(cast(time as bigint))) as month,
> day(from_unixtime(cast(time as bigint))) as day,
> hour(from_unixtime(cast(time as bigint))) as hour,
> minute(from_unixtime(cast(time as bigint))) as minute,
> from_unixtime(cast(time as bigint)) as tx
> from (
> select json_tuple(json,'movie','rate','time','userid') as (movie_id,rate,time,user_id) from rating_json
> ) tmp
> limit 10;
Query ID = hadoop_20191016115050_66b482d4-fc51-4c8b-a7f9-7eda6689c15d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1570764617006_0010, Tracking URL = http://hadoop002:18088/proxy/application_1570764617006_0010/
Kill Command = /home/hadoop/app/hadoop/bin/hadoop job -kill job_1570764617006_0010
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-10-16 11:50:37,742 Stage-1 map = 0%, reduce = 0%
2019-10-16 11:50:42,955 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.82 sec
2019-10-16 11:50:48,090 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.1 sec
MapReduce Total cumulative CPU time: 3 seconds 100 msec
Ended Job = job_1570764617006_0010
Moving data to: hdfs://hadoop002:8020/user/hive/warehouse/test.db/rating_width
Table test.rating_width stats: [numFiles=1, numRows=10, totalSize=516, rawDataSize=506]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.1 sec HDFS Read: 15952 HDFS Write: 590 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 100 msec
OK
movie_id rate time user_id year month day hour minute tx
Time taken: 17.194 seconds
hive (test)> select * from rating_width limit 10;
OK
rating_width.movie_id rating_width.rate rating_width.time rating_width.user_id rating_width.year rating_width.month rating_width.day rating_width.hour rating_width.minute rating_width.tx
919 4 978301368 1 2001 1 1 6 22 2001-01-01 06:22:48
594 4 978302268 1 2001 1 1 6 37 2001-01-01 06:37:48
2804 5 978300719 1 2001 1 1 6 11 2001-01-01 06:11:59
1287 5 978302039 1 2001 1 1 6 33 2001-01-01 06:33:59
1197 3 978302268 1 2001 1 1 6 37 2001-01-01 06:37:48
2355 5 978824291 1 2001 1 7 7 38 2001-01-07 07:38:11
3408 4 978300275 1 2001 1 1 6 4 2001-01-01 06:04:35
914 3 978301968 1 2001 1 1 6 32 2001-01-01 06:32:48
661 3 978302109 1 2001 1 1 6 35 2001-01-01 06:35:09
1193 5 978300760 1 2001 1 1 6 12 2001-01-01 06:12:40
Time taken: 0.033 seconds, Fetched: 10 row(s)
hive (test)>
6. Follow-up: run some statistics based on rating_width, as sketched below
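For example, the rating count and average rating per movie (a sketch; assumes rating_width was recreated without the limit 10 from step 5, and note that json_tuple returned strings, so rate needs a cast before aggregating):

select movie_id,
       count(*) as cnt,
       avg(cast(rate as double)) as avg_rate
from rating_width
group by movie_id
order by avg_rate desc
limit 10;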