select rank, isonline, count(*) from city group by rank, isonline;
将GroupBy的字段组合为map的输出key值,利用MapReduce的排序,在reduce阶段保存LastKey区分不同的key。MapReduce的过程如下(当然这里只是说明Reduce端的非Hash聚合过程)
group by 单字段和多字段时的不同在于key上,以如下例子为例(出处太多): SELECT uid, SUM(COUNT) FROM logs GROUP BY uid;
hive> SELECT * FROM logs; a 苹果 5 a 橙子 3 a 苹果 2 b 烧鸡 1 hive> SELECT uid, SUM(COUNT) FROM logs GROUP BY uid; a 10 b 1其过程如下图:
默认设置了hive.map.aggr=true,所以会在mapper端先group by一次,最后再把结果merge起来,为了减少reducer处理的数据量。注意看explain的mode是不一样的。mapper是hash,reducer是mergepartial。如果把hive.map.aggr=false,那将groupby放到reducer才做,他的mode是complete.
在上文中给出的只是在Map Aggr为true时的执行计划,而Group by的执行计划细分的话有很多种。
在介绍之前,首先明白几个术语: - (No) Map Aggr:hive.map.aggr配置变量的缩写,可以设置为true或false,代表了map端是否开启聚合功能。 - (No) Skew: hive.groupby.skewindata配置变量的缩写,可以设置为true或false,表示是否是倾斜表等。 - Rollup: rollup比较神奇,rollup可以实现从右到做递减多级的统计,显示统计某一层次结构的聚合。 如:
select device_id,os_id,app_id,client_version,from_id,count(user_id) from test_xinyan_reg group by device_id,os_id,app_id,client_version,from_id with rollup;等价以下sql语句:
select device_id,os_id,app_id,client_version,from_id,count(user_id) from test_xinyan_reg group by device_id,os_id,app_id,client_version,from_id grouping sets ((device_id,os_id,app_id,client_version,from_id),(device_id,os_id,app_id,client_version),(device_id,os_id,app_id),(device_id,os_id),(device_id),());在此不做Rollup的进一步说明。
基于Map Aggr与Skew的取值,可以产生四种不同的执行计划,再加上Rollup可以拓展到6种,下文对这六种做简要的介绍。
该执行计划可以说就是上文group by 单字段中给出的那种情况,在map端多了一个聚合操作。
执行过程如下(为了原汁原味,此处就给出英文了): - Mapper: *Hash-based group by operator to perform partial aggregations *Reduce sink operator, performs some partial aggregations - Reducer: *MergePartial (list-based) group by operator to perform final aggregations
为了处理一些额外的数据行,该种执行计划中,只在Map端的hash-based aggregation operator有更改。 执行过程如下: - Mapper 1: *Hash-based group by operator to perform partial aggregations *Reduce sink operator to spray by the group by and distinct keys (if there is a distinct key) or a random number otherwise
Reducer 1: *Partials (list-based) group by operator to perform further partial aggregations
Mapper 2: *Reduce sink operator, performs some partial aggregations
Reducer 2: *Final (list-based) group by operator to perform final aggregations
需要注意的是,如果没有Group by keys或者是distinct keys, Reduce 1 和Mapper 2将不会执行,且Mapper 1中的reduce sink operator 也不会spray.
该种执行计划中,有没有No Rollup其实是没有什么影响的,代表的是一个意思,即就是No Map Aggr & No Skew的情况,其执行过程如下:
Mapper: *Reduce sink operator, performs some partial aggregations
Reducer: *Complete (list-based) group by operator to perform all aggregations
说明:尽管此条中是No Map Aggr,在使用With Rollup时, hiveconf:hive.map.aggr要为true。 我的理解是No Map Aggr表示语句中没有需要在Map端聚合的语句,如Count等。
其执行过程如下:
Mapper 1: *Reduce sink operator, does not perform any partial aggregations
Reducer 1: *Hash-based group by operator, much like the one used in the mappers of previous cases
Mapper 2: *Reduce sink operator, performs some partial aggregations
Reducer 2: *MergePartial (list-based) group by operator to perform remaining aggregations
在该执行计划下,list-based group by operator 做了更改,以适应rollup使用时对额外行的处理。
Mapper 1: *Reduce sink operator to spray by the group by and distinct keys (if there is a distinct key) or a random number otherwise
Reducer 1: *Partial1 (list-based) group by operator to perform partial aggregations, it makes use of the new list-based group by operator implementation for rollup if necessary
Mapper 2: *Reduce sink operator, performs some partial aggregations
Reducer 2: *Final (list-based) group by operator to perform remaining aggregations
Mapper 1: *Reduce sink operator to spray by the group by and distinct keys (if there is a distinct key) or a random number otherwise
Reducer 1: *Hash-based group by operator, much like the one used in the mappers of previous cases
Mapper 2: *Reduce sink operator to spray by the group by and distinct keys (if there is a distinct key) or a random number otherwise
Reducer 2: *Partials (list-based) group by operator to perform further partial aggregations
Mapper 3: *Reduce sink operator, performs some partial aggregations
Reducer 3: *Final (list-based) group by operator to perform final aggregations
注意:如果没有Group by keys或者是distinct keys, Reduce 2 和Mapper 3将不会执行,且Mapper 2中的reduce sink operator 也不会spray.
Also, note that the reason for Mapper 2 spraying is that if the skew in the data existed in a column that is not immediately nulled by the rollup (e.g. if we the group by keys are columns g1, g2, g3 in that order, we are concerned with the case where the skew exists in column g1 or g2) the skew may continue to exist after the hash aggregation, so we spray.