Hive 的collect_set使用详解
有这么一需求,在Hive中求出一个数据表中在某天内首次登陆的人;可以借助collect_set来处理sql:
select count(a.id) from (select id,collect_set(time) as t from t_action_login where time<='20150906' group by id) as a where size(a.t)=1 and a.t[0]='20150906';
上面中的
1. select id,collect_set(time) as t from t_action_login where time<='20150906' group by id
会按照id分组,因为一个id可能对应一天也可能对应多天,对应多天表示有多天都有登陆,所以一个id会对应多个日期time,通过collect_set会把每个id所对应的日期构建成一个以逗号分隔的数组返回。上述SQL返回:
1. 123@163.com | ["20150620","20150619"] |
2. | abc@163.com | ["20150816"] |
3. | cde@qq.com | ["20150606","20150608","20150607","20150609","20150613","20150610","20150616","20150615"] |
4. | 789@sohu.com | ["20150827","20150623","20150627","20150820","20150823","20150612","20150717"] |
5. | 987@163.com | ["20150701","20150829","20150626","20150625","20150726","20150722","20150629","20150824","20150716","20150 |
6. | ddsf@163.com | ["20150804","20150803","20150801","20150809","20150807","20150806","20150905","20150904","20150730","20150 |
7. | 182@163.com |["20150803","20150801","20150809","20150808","20150805","20150806","20150906","20150904","20150730","20150 |
8. | 22225@163.com | ["20150604","20150609","20150622","20150827","20150625","20150620","20150613","20150610","20150614","20150 |
9. | 18697@qq.com | ["20150902"] |
10. | 1905@qq.com | ["20150709"]
所以我们就可以按照这个返回的数组做文章,即为
where size(a.t)=1 and a.t[0]='20150906';
表示某id所对应的数组长度为1 并且第一个时间为20150906的id表示为该天首次登陆。
总结:
1. Hive不允许直接访问非group by字段;
2. 对于非group by字段,可以用Hive的collect_set函数收集这些字段,返回一个数组;
3.使用数字下标,可以直接访问数组中的元素;
1.case的用法
)格式1
case col
when value then ''
when value then ''
else
''
end
)格式2
case
when col='value' then ''
when col='value' then ''
else
''
end
)举例
2.cast
作用:转换
)格式
cast(col as type)
)举例
create table catstest as select empno,ename,cast(sal as string) salary from emp;
3.unix_timestamp
作用:记录时间格式
统计从1970年开始到现在的秒数
关键字:Hive Join、Hive LEFT|RIGTH|FULL OUTER JOIN、Hive LEFT SEMI JOIN、Hive Cross Join
Hive中除了支持和传统数据库中一样的内关联、左关联、右关联、全关联,还支持LEFT SEMI JOIN和CROSS JOIN,但这两种JOIN类型也可以用前面的代替。
注意:Hive中Join的关联键必须在ON ()中指定,不能在where中指定,否则就会先做笛卡尔积,再过滤。
数据准备:
1. hive> desc lxw1234_a;
2. OK
3. id string
4. name string
5. Time taken: 0.094 seconds, Fetched: 2 row(s)
6. hive> select * from lxw1234_a;
7. OK
8. 1 zhangsan
9. 2 lisi
10. 3 wangwu
11. Time taken: 0.116 seconds, Fetched: 3 row(s)
12. hive> desc lxw1234_b;
13. OK
14. id string
15. age int
16. Time taken: 0.159 seconds, Fetched: 2 row(s)
17. hive> select * from lxw1234_b;
18. OK
19. 1 30
20. 2 29
21. 4 21
22. Time taken: 0.09 seconds, Fetched: 3 row(s)
只返回能关联上的结果。
1. SELECT a.id, a.name, b.age FROM lxw1234_a a join lxw1234_b b ON (a.id = b.id);
2. --执行结果
1 zhangsan 30
2 lisi 29
以LEFT [OUTER] JOIN关键字前面的表作为主表,和其他表进行关联,返回记录和主表的记录数一致,关联不上的字段置为NULL。
是否指定OUTER关键字,貌似对查询结果无影响。
1. SELECT a.id, a.name, b.age FROM lxw1234_a a left join lxw1234_b b ON (a.id = b.id);
2. --执行结果:
1 zhangsan 30
2 lisi 29
3 wangwu NULL
和左外关联相反,以RIGTH [OUTER] JOIN关键词后面的表作为主表,和前面的表做关联,返回记录数和主表一致,关联不上的字段为NULL。
是否指定OUTER关键字,貌似对查询结果无影响。
1. SELECT a.id, a.name, b.age FROM lxw1234_a a RIGHT OUTER JOIN lxw1234_b b ON (a.id = b.id);
2. --执行结果:
1 zhangsan 30
2 lisi 29
NULL NULL 21
以两个表的记录为基准,返回两个表的记录去重之和,关联不上的字段为NULL。
是否指定OUTER关键字,貌似对查询结果无影响。
注意:FULL JOIN时候,Hive不会使用MapJoin来优化。
1. SELECT a.id,
2. a.name,
3. b.age
4. FROM lxw1234_a a
5. FULL OUTER JOIN lxw1234_b b
6. ON (a.id = b.id);
7.
8. --执行结果:
9. 1 zhangsan 30
10. 2 lisi 29
11. 3 wangwu NULL
12. NULL NULL 21
13.
以LEFT SEMI JOIN关键字前面的表为主表,返回主表的KEY也在副表中的记录。
1. SELECT a.id,
2. a.name
3. FROM lxw1234_a a
4. LEFT SEMI JOIN lxw1234_b b
5. ON (a.id = b.id);
6.
7. --执行结果:
8. 1 zhangsan
9. 2 lisi
11. --等价于:
12. SELECT a.id, a.name FROM lxw1234_a a WHERE a.id IN (SELECT id FROM lxw1234_b);
18. --也等价于:
19. SELECT a.id, a.name FROM lxw1234_a a join lxw1234_b b ON (a.id = b.id);
25. --也等价于:
26. SELECT a.id, a.name FROM lxw1234_a a WHERE EXISTS (SELECT 1 FROM lxw1234_b b WHERE a.id = b.id);
返回两个表的笛卡尔积结果,不需要指定关联键。
1. SELECT a.id,
2. a.name,
3. b.age
4. FROM lxw1234_a a
5. CROSS JOIN lxw1234_b b;
6.
7. --执行结果:
8. 1 zhangsan 30
9. 1 zhangsan 29
10. 1 zhangsan 21
11. 2 lisi 30
12. 2 lisi 29
13. 2 lisi 21
14. 3 wangwu 30
15. 3 wangwu 29
16. 3 wangwu 21
17.
Hive中的JOIN类型基本就是上面这些,至于JOIN时候使用哪一种,完全得根据实际的业务需求来定,但起码你要搞清楚这几种关联类型会返回什么样的结果。
除非特殊需求,并且数据量不是特别大的情况下,才可以慎用CROSS JOIN,否则,很难跑出正确的结果,或者JOB压根不能执行完。
经验告诉我,Hive中只要是涉及到两个表关联,首先得了解一下数据,看是否存在多对多的关联。
Hive条件判断 if,coalesce,case…when
· Hive条件判断 if coalesce case when
· 事例及说明
· IF Test Condition True Value False Value
· COALESCE value1value2
· CASE Statement
hive中可能会遇到根据判断不同值,产生对应结果的场景,有三种实现方式:
·
IF( Test Condition, True Value, False Value )
·
·
COALESCE( value1,value2,… )
·
·
CASE Statement
·
该语法只能用来判断单个条件,例如:
select pd,
if(ps_t='常年',1,0) as pt,
ps_t
from dgp limit 100;
该函数用来获取参数列表中的首个非空值,若均为NULL,则返回NULL,例如:
select
coalesce(NULL,NULL,5,NULL,1,0) as pt;
该语法可以与某字段多个比较值的判断,并分别产生不同结果,与其他语言中case语法相似,例如:
select pd,
case pst
when "常年" then 1
when "非常年" then 0
else 0
end
as pt
from dgp limit 100;
或:
select pd,
case
when pst="常年" then 1
when pst="非常年" then 0
else 0
end
as pt
from dgp limit 100;
这几个分析函数通常用于OLAP中,不能累加,而且需要根据不同维度上钻和下钻的指标统计,比如,分小时、天、月的UV数。
数据准备:
[sql] view plain copy 2015-03,2015-03-10,cookie1 2015-03,2015-03-10,cookie5 2015-03,2015-03-12,cookie7 2015-04,2015-04-12,cookie3 2015-04,2015-04-13,cookie2 2015-04,2015-04-13,cookie4 2015-04,2015-04-16,cookie4 2015-03,2015-03-10,cookie2 2015-03,2015-03-10,cookie3 2015-04,2015-04-12,cookie5 2015-04,2015-04-13,cookie6 2015-04,2015-04-15,cookie3 2015-04,2015-04-15,cookie2 2015-04,2015-04-16,cookie1 CREATE EXTERNAL TABLE lxw1234 ( month STRING, day STRING, cookieid STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' stored as textfile location '/tmp/lxw11/'; hive> select * from lxw1234; OK 2015-03 2015-03-10 cookie1 2015-03 2015-03-10 cookie5 2015-03 2015-03-12 cookie7 2015-04 2015-04-12 cookie3 2015-04 2015-04-13 cookie2 2015-04 2015-04-13 cookie4 2015-04 2015-04-16 cookie4 2015-03 2015-03-10 cookie2 2015-03 2015-03-10 cookie3 2015-04 2015-04-12 cookie5 2015-04 2015-04-13 cookie6 2015-04 2015-04-15 cookie3 2015-04 2015-04-15 cookie2 2015-04 2015-04-16 cookie1在一个GROUP BY查询中,根据不同的维度组合进行聚合,等价于将不同维度的GROUP BY结果集进行UNION ALL
SELECT month, day, COUNT(DISTINCT cookieid) AS uv, GROUPING__ID FROM lxw1234 GROUP BY month,day GROUPING SETS (month,day) ORDER BY GROUPING__ID; month day uv GROUPING__ID ------------------------------------------------ 2015-03 NULL 5 1 2015-04 NULL 6 1 NULL 2015-03-10 4 2 NULL 2015-03-12 1 2 NULL 2015-04-12 2 2 NULL 2015-04-13 3 2 NULL 2015-04-15 2 2 NULL 2015-04-16 2 2 等价于 SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM lxw1234 GROUP BY month UNION ALL SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM lxw1234 GROUP BY day 再如: SELECT month, day, COUNT(DISTINCT cookieid) AS uv, GROUPING__ID FROM lxw1234 GROUP BY month,day GROUPING SETS (month,day,(month,day)) ORDER BY GROUPING__ID; month day uv GROUPING__ID ------------------------------------------------ 2015-03 NULL 5 1 2015-04 NULL 6 1 NULL 2015-03-10 4 2 NULL 2015-03-12 1 2 NULL 2015-04-12 2 2 NULL 2015-04-13 3 2 NULL 2015-04-15 2 2 NULL 2015-04-16 2 2 2015-03 2015-03-10 4 3 2015-03 2015-03-12 1 3 2015-04 2015-04-12 2 3 2015-04 2015-04-13 3 3 2015-04 2015-04-15 2 3 2015-04 2015-04-16 2 3 等价于 SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM lxw1234 GROUP BY month UNION ALL SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM lxw1234 GROUP BY day UNION ALL SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM lxw1234 GROUP BY month,day 其中的 GROUPING__ID ,表示结果属于哪一个分组集合。根据GROUP BY的维度的所有组合进行聚合。
SELECT month, day, COUNT(DISTINCT cookieid) AS uv, GROUPING__ID FROM lxw1234 GROUP BY month,day WITH CUBE ORDER BY GROUPING__ID; month day uv GROUPING__ID -------------------------------------------- NULL NULL 7 0 2015-03 NULL 5 1 2015-04 NULL 6 1 NULL 2015-04-12 2 2 NULL 2015-04-13 3 2 NULL 2015-04-15 2 2 NULL 2015-04-16 2 2 NULL 2015-03-10 4 2 NULL 2015-03-12 1 2 2015-03 2015-03-10 4 3 2015-03 2015-03-12 1 3 2015-04 2015-04-16 2 3 2015-04 2015-04-12 2 3 2015-04 2015-04-13 3 3 2015-04 2015-04-15 2 3 等价于 SELECT NULL,NULL,COUNT(DISTINCT cookieid) AS uv,0 AS GROUPING__ID FROM lxw1234 UNION ALL SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM lxw1234 GROUP BY month UNION ALL SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM lxw1234 GROUP BY day UNION ALL SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM lxw1234 GROUP BY month,day是CUBE的子集,以最左侧的维度为主,从该维度进行层级聚合。
比如,以month维度进行层级聚合: SELECT month, day, COUNT(DISTINCT cookieid) AS uv, GROUPING__ID FROM lxw1234 GROUP BY month,day WITH ROLLUP ORDER BY GROUPING__ID; month day uv GROUPING__ID --------------------------------------------------- NULL NULL 7 0 2015-03 NULL 5 1 2015-04 NULL 6 1 2015-03 2015-03-10 4 3 2015-03 2015-03-12 1 3 2015-04 2015-04-12 2 3 2015-04 2015-04-13 3 3 2015-04 2015-04-15 2 3 2015-04 2015-04-16 2 3 可以实现这样的上钻过程: 月天的UV->月的UV->总UV [sql] view plain copy --把month和day调换顺序,则以day维度进行层级聚合: SELECT day, month, COUNT(DISTINCT cookieid) AS uv, GROUPING__ID FROM lxw1234 GROUP BY day,month WITH ROLLUP ORDER BY GROUPING__ID; day month uv GROUPING__ID ------------------------------------------------------- NULL NULL 7 0 2015-04-13 NULL 3 1 2015-03-12 NULL 1 1 2015-04-15 NULL 2 1 2015-03-10 NULL 4 1 2015-04-16 NULL 2 1 2015-04-12 NULL 2 1 2015-04-12 2015-04 2 3 2015-03-10 2015-03 4 3 2015-03-12 2015-03 1 3 2015-04-13 2015-04 3 3 2015-04-15 2015-04 2 3 2015-04-16 2015-04 2 3 可以实现这样的上钻过程: 天月的UV->天的UV->总UV (这里,根据天和月进行聚合,和根据天聚合结果一样,因为有父子关系,如果是其他维度组合的话,就会不一样)这种函数,需要结合实际场景和数据去使用和研究,只看说明的话,很难理解。
官网的介绍: https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup