Hive Group By的实现原理

xiaoxiao2021-02-28 56

group by多字段

select rank, isonline, count(*) from city group by rank, isonline;

将GroupBy的字段组合为map的输出key值，利用MapReduce的排序，在reduce阶段保存LastKey区分不同的key。MapReduce的过程如下（当然这里只是说明Reduce端的非Hash聚合过程）

group by单字段

group by 单字段和多字段时的不同在于key上，以如下例子为例（出处太多）： SELECT uid, SUM(COUNT) FROM logs GROUP BY uid;

hive> SELECT * FROM logs; a 苹果 5 a 橙子 3 a 苹果 2 b 烧鸡 1 hive> SELECT uid, SUM(COUNT) FROM logs GROUP BY uid; a 10 b 1

其过程如下图：

默认设置了hive.map.aggr=true，所以会在mapper端先group by一次，最后再把结果merge起来，为了减少reducer处理的数据量。注意看explain的mode是不一样的。mapper是hash，reducer是mergepartial。如果把hive.map.aggr=false，那将groupby放到reducer才做，他的mode是complete.

Explain

hive> explain SELECT uid, sum(count) FROM logs group by uid; OK ABSTRACT SYNTAX TREE: (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME logs))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL uid)) (TOK_SELEXPR (TOK_FUNCTION sum (TOK_TABLE_OR_COL count)))) (TOK_GROUPBY (TOK_TABLE_OR_COL uid)))) STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 is a root stage STAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: logs TableScan // 扫描表 alias: logs Select Operator //选择字段 expressions: expr: uid type: string expr: count type: int outputColumnNames: uid, count Group By Operator //这里是因为默认设置了hive.map.aggr=true，会在mapper先做一次聚合，减少reduce需要处理的数据 aggregations: expr: sum(count) //聚集函数 bucketGroup: false keys: //键 expr: uid type: string mode: hash //hash方式，processHashAggr() outputColumnNames: _col0, _col1 Reduce Output Operator //输出key，value给reducer key expressions: expr: _col0 type: string sort order: + Map-reduce partition columns: expr: _col0 type: string tag: -1 value expressions: expr: _col1 type: bigint Reduce Operator Tree: Group By Operator aggregations: expr: sum(VALUE._col0) //聚合 bucketGroup: false keys: expr: KEY._col0 type: string mode: mergepartial //合并值 outputColumnNames: _col0, _col1 Select Operator //选择字段 expressions: expr: _col0 type: string expr: _col1 type: bigint outputColumnNames: _col0, _col1 File Output Operator //输出到文件 compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Stage: Stage-0 Fetch Operator limit: -1

细分Group by 的多种执行计划

在上文中给出的只是在Map Aggr为true时的执行计划,而Group by的执行计划细分的话有很多种。

术语说明

在介绍之前，首先明白几个术语： - (No) Map Aggr：hive.map.aggr配置变量的缩写，可以设置为true或false，代表了map端是否开启聚合功能。 - (No) Skew: hive.groupby.skewindata配置变量的缩写，可以设置为true或false，表示是否是倾斜表等。 - Rollup: rollup比较神奇，rollup可以实现从右到做递减多级的统计，显示统计某一层次结构的聚合。如：

select device_id,os_id,app_id,client_version,from_id,count(user_id) from test_xinyan_reg group by device_id,os_id,app_id,client_version,from_id with rollup;

等价以下sql语句：

select device_id,os_id,app_id,client_version,from_id,count(user_id) from test_xinyan_reg group by device_id,os_id,app_id,client_version,from_id grouping sets ((device_id,os_id,app_id,client_version,from_id),(device_id,os_id,app_id,client_version),(device_id,os_id,app_id),(device_id,os_id),(device_id),());

在此不做Rollup的进一步说明。

基于Map Aggr与Skew的取值，可以产生四种不同的执行计划，再加上Rollup可以拓展到6种，下文对这六种做简要的介绍。

Map Aggr & No Skew:

该执行计划可以说就是上文group by 单字段中给出的那种情况，在map端多了一个聚合操作。

执行过程如下（为了原汁原味，此处就给出英文了）： - Mapper: *Hash-based group by operator to perform partial aggregations *Reduce sink operator, performs some partial aggregations - Reducer: *MergePartial (list-based) group by operator to perform final aggregations

Map Aggr & Skew

为了处理一些额外的数据行，该种执行计划中，只在Map端的hash-based aggregation operator有更改。执行过程如下： - Mapper 1: *Hash-based group by operator to perform partial aggregations *Reduce sink operator to spray by the group by and distinct keys (if there is a distinct key) or a random number otherwise

Reducer 1: *Partials (list-based) group by operator to perform further partial aggregations

Mapper 2: *Reduce sink operator, performs some partial aggregations

Reducer 2: *Final (list-based) group by operator to perform final aggregations

需要注意的是，如果没有Group by keys或者是distinct keys, Reduce 1 和Mapper 2将不会执行，且Mapper 1中的reduce sink operator 也不会spray.

No Map Aggr & No Skew & No Rollup

该种执行计划中，有没有No Rollup其实是没有什么影响的，代表的是一个意思，即就是No Map Aggr & No Skew的情况，其执行过程如下：

Mapper: *Reduce sink operator, performs some partial aggregations

Reducer: *Complete (list-based) group by operator to perform all aggregations

No Map Aggr & No Skew & With Rollup

说明：尽管此条中是No Map Aggr，在使用With Rollup时， hiveconf:hive.map.aggr要为true。我的理解是No Map Aggr表示语句中没有需要在Map端聚合的语句，如Count等。

其执行过程如下：

Mapper 1: *Reduce sink operator, does not perform any partial aggregations

Reducer 1: *Hash-based group by operator, much like the one used in the mappers of previous cases

Mapper 2: *Reduce sink operator, performs some partial aggregations

Reducer 2: *MergePartial (list-based) group by operator to perform remaining aggregations

No Map Aggr & Skew & (No Distinct or No Rollup)

在该执行计划下，list-based group by operator 做了更改，以适应rollup使用时对额外行的处理。

Mapper 1: *Reduce sink operator to spray by the group by and distinct keys (if there is a distinct key) or a random number otherwise

Reducer 1: *Partial1 (list-based) group by operator to perform partial aggregations, it makes use of the new list-based group by operator implementation for rollup if necessary

Mapper 2: *Reduce sink operator, performs some partial aggregations

Reducer 2: *Final (list-based) group by operator to perform remaining aggregations

No Map Aggr & Skew & Distinct & Rollup

Mapper 1: *Reduce sink operator to spray by the group by and distinct keys (if there is a distinct key) or a random number otherwise

Reducer 1: *Hash-based group by operator, much like the one used in the mappers of previous cases

Mapper 2: *Reduce sink operator to spray by the group by and distinct keys (if there is a distinct key) or a random number otherwise

Reducer 2: *Partials (list-based) group by operator to perform further partial aggregations

Mapper 3: *Reduce sink operator, performs some partial aggregations

Reducer 3: *Final (list-based) group by operator to perform final aggregations

注意：如果没有Group by keys或者是distinct keys, Reduce 2 和Mapper 3将不会执行，且Mapper 2中的reduce sink operator 也不会spray.

Also, note that the reason for Mapper 2 spraying is that if the skew in the data existed in a column that is not immediately nulled by the rollup (e.g. if we the group by keys are columns g1, g2, g3 in that order, we are concerned with the case where the skew exists in column g1 or g2) the skew may continue to exist after the hash aggregation, so we spray.

转载请注明原文地址: https://www.6miu.com/read-2619664.html

技术

最新回复(0)