How ORC file splits and reads work: a summary (Hive 0.13)

xiaoxiao, 2021-02-28

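See the official Hive documentation for an introduction to ORC files.

Background

Hive's RCFile format has been in use for years, but it treats every column as an opaque binary blob with no type information attached. Hive 0.11 therefore introduced the ORC file format. ORC files have the following advantages:

- Each task writes a single output file, which reduces the load on HDFS.
- Column types are stored in the file; datetime, decimal, and the complex types (struct, list, map, and union) are supported.
- Lightweight indexes are stored inside the file, which makes it possible to:
  - skip row groups that are not needed
  - seek to a given row
- Compression is chosen by column type:
  - integer columns: run-length encoding
  - string columns: dictionary encoding
- Several RecordReaders can read the same file concurrently.
- Files can be split without scanning for markers.
- The memory used for reading and writing can be bounded.
- Metadata is stored with Protocol Buffers, which allows fields to be added and removed.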

Structure

(Image source: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC)

ORC dump tool

    // Hive version 0.11 through 0.14:
    hive --orcfiledump <location-of-orc-file>

    // Hive version 0.15 and later:
    hive --orcfiledump [-d] [--rowindex <col_ids>] <location-of-orc-file>

    // Hive version 1.2.0 and later:
    hive --orcfiledump [-d] [-t] [--rowindex <col_ids>] <location-of-orc-file>

    // Hive version 1.3.0 and later:
    hive --orcfiledump [-j] [-p] [-d] [-t] [--rowindex <col_ids>] [--recover] [--skip-dump] [--backup-path <new-path>] <location-of-orc-file-or-directory>

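Configuration

KEY                  | Default    | Notes
---------------------|------------|------
orc.compress         | ZLIB       | Compression codec: NONE, ZLIB, or SNAPPY
orc.compress.size    | 262,144    | Size of each compression chunk in bytes; also the size of the buffer used when compressing stripe data
orc.stripe.size      | 67,108,864 | Stripe size in bytes
orc.row.index.stride | 10,000     | Number of rows between index entries (must be >= 1000). With the default, one index entry is built per 10,000 rows; this value also defines the rowGroup boundaries
orc.create.index     | true       | Whether to build row-level indexes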
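As a rough illustration of the orc.row.index.stride note: the number of rowGroups in a stripe is the stripe's row count divided by the stride, rounded up. The sketch below is only arithmetic; the function name row_group_count and the example row count are hypothetical and not part of Hive.

```python
import math


def row_group_count(rows_in_stripe: int, stride: int = 10_000) -> int:
    """Return how many rowGroups a stripe with the given row count is divided into."""
    if stride < 1000:
        # The configuration notes state that the stride must be >= 1000.
        raise ValueError("orc.row.index.stride must be >= 1000")
    return math.ceil(rows_in_stripe / stride)


# A stripe holding 25,000 rows, with the default stride of 10,000.
print(row_group_count(25_000))  # 3
```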

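How splits and reads work

Related settings

hive.optimize.index.filter
- Default: false
- Meaning: whether to use indexes to optimize the physical execution plan, and whether to push the filter condition down to the TableScanOperator (the condition is then used both when reading data and when computing splits).
- For ORC files this must be set to true; otherwise the filter condition is not available and stripes cannot be filtered.

hive.exec.orc.zerocopy
- Default: false
- Meaning: whether to use zero-copy reads for ORC files.

hive.input.format
- Default: CombineHiveInputFormat
- With the combine format, small files are merged, but OrcInputFormat's stripe filtering is not used.
- With org.apache.hadoop.hive.ql.io.HiveInputFormat, OrcInputFormat's getSplits method is called, and stripes that do not satisfy the condition are filtered out.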

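How to enable it, and the trade-offs

Only the non-combine split and read path is discussed here.

Trigger conditions:

    set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
    set hive.optimize.index.filter=true;
    set hive.exec.orc.zerocopy=true;

- hive.input.format (required).
- hive.optimize.index.filter (optional): pushes the condition down to the TableScan (TS) so it can be used for filtering. Recommended.
- hive.exec.orc.zerocopy (optional): uses zero-copy reads for ORC files. Recommended.

With all three settings enabled:

Advantages
- At split time: stripes that do not match the condition are filtered out early, which reduces the number of maps.
- At read time: rowGroups that do not match the condition are skipped directly, so no unnecessary data is read.

Disadvantages
- Files are not combined, so a large number of small files can lead to too many maps.
- It depends on the user's where clause. If the clause filters out little data, stripes may not be eliminated and the number of maps stays high (the extra evaluation also costs some performance).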

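How it works

Split
- Step 1: stripe1. Set offset1 and end1.
- Step 2: stripe2 is eliminated by the filter condition, so stripe1 produces a split.
- Step 3: stripe3. Set offset2 and end2.
- Step 4: stripe4 is in a different block from stripe3, so stripe3 produces a split. offset and end are then reset to the start and end of stripe4.
- Step 5: stripe5. offset is unchanged; end moves to the end of stripe5.
- Step 6: stripe6. Now (end - offset) for the range that started at stripe4 exceeds maxSplitSize, so stripe4/5/6 produce one split.
- Step 7: stripe7. The end of the file is reached, so stripe7 produces a split.

Read
- Read the footer: obtain the column information, index locations, data locations, and so on.
- Read the index data:
  - Divide the stripe into rowGroups according to orc.row.index.stride; each rowGroup covers orc.row.index.stride rows.
  - Use the index statistics (max/min) to decide whether each rowGroup can satisfy the pushed-down where condition; rowGroups that cannot are skipped when the data is actually read.
- Read the actual data:
  - Read the data of each column. When a filtered-out rowGroup is reached it is skipped, which reduces the amount of data read.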
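The seven split steps can be sketched as a small Python model. It is a simplified illustration built only from the steps described here, not Hive's actual OrcInputFormat code; the names Stripe and generate_splits, and the byte sizes in the example, are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Stripe:
    offset: int  # byte offset of the stripe within the file
    length: int  # stripe length in bytes


def generate_splits(
    stripes: List[Stripe],
    include: List[bool],
    block_size: int,
    max_split_size: int,
) -> List[Tuple[int, int]]:
    """Return (offset, length) splits following the steps described in the text."""
    splits: List[Tuple[int, int]] = []
    cur_offset: Optional[int] = None  # start of the split being built, if any
    cur_length = 0

    for stripe, keep in zip(stripes, include):
        if not keep:
            # Stripe eliminated by the filter: close the split in progress.
            if cur_offset is not None:
                splits.append((cur_offset, cur_length))
                cur_offset = None
            continue

        # Stripe starts in a different block than the split in progress: cut here.
        if cur_offset is not None and (
            cur_offset // block_size != stripe.offset // block_size
        ):
            splits.append((cur_offset, cur_length))
            cur_offset = None

        if cur_offset is None:
            cur_offset, cur_length = stripe.offset, stripe.length
        else:
            cur_length += stripe.length  # offset unchanged, end moves forward

        # Accumulated range exceeds the maximum split size: emit it.
        if cur_length > max_split_size:
            splits.append((cur_offset, cur_length))
            cur_offset = None

    # End of file: emit whatever is left.
    if cur_offset is not None:
        splits.append((cur_offset, cur_length))
    return splits


# Seven stripes laid out to reproduce steps 1-7; stripe2 is filtered out.
stripes = [Stripe(0, 40), Stripe(40, 40), Stripe(80, 20), Stripe(100, 40),
           Stripe(140, 40), Stripe(180, 40), Stripe(220, 40)]
include = [True, False, True, True, True, True, True]
print(generate_splits(stripes, include, block_size=100, max_split_size=100))
# [(0, 40), (80, 20), (100, 120), (220, 40)]
```

The four resulting splits correspond to stripe1, stripe3, stripe4/5/6, and stripe7. Note that no step ever merges stripes across a filtered stripe or a block boundary, which is why this path can yield more splits than the combine format.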

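Pros and cons

Advantages
- Unneeded stripes can be filtered out early, which reduces the number of splits.
- At read time, rowGroups that do not satisfy the condition can be skipped, which reduces the amount of data read.

Disadvantages
- At split time stripes are not merged, so the number of splits can be higher than with the combine format.
- A smaller dataset can even end up with more splits than a larger one.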
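The rowGroup filtering at read time can be illustrated with a minimal Python sketch. It assumes a single equality predicate on an integer column and checks it against each rowGroup's min/max statistics; the names RowGroupStats and select_row_groups and the sample statistics are hypothetical, and Hive's real evaluation covers more predicate types than this.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroupStats:
    minimum: int  # smallest column value in the rowGroup
    maximum: int  # largest column value in the rowGroup


def select_row_groups(stats: List[RowGroupStats], value: int) -> List[bool]:
    """For the predicate `col = value`: True means the rowGroup must be read,
    False means it cannot contain a match and can be skipped."""
    return [s.minimum <= value <= s.maximum for s in stats]


# Three rowGroups of one column; only the last two can contain the value 155.
stats = [RowGroupStats(1, 100), RowGroupStats(150, 300), RowGroupStats(90, 160)]
print(select_row_groups(stats, 155))  # [False, True, True]
```

A rowGroup marked True may still contain no matching row, because min/max statistics can only rule rowGroups out, not confirm a match. This is also why a weakly selective where clause eliminates little.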

Test results

Stripe size 128M

sql1

    select log_date,log_time,hh24,area_country,area_prov,area_city from tbl_orc_128M where dt='20161109' and hh24='19' and channel_id=179569143 limit 100;

- combine
  - Number of maps: 1310
  - Unused columns are skipped. Log excerpt:
    Reading ORC rows from hdfs://bipcluster/bip/external_table/xx/tbl_orc_128M/dt=20161109/000856_0 with {include: [true, true, true, true, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, true, true, true, false, false, false, false, false, true, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false], offset: 0, length: 225585161}
- combine + predicate pushdown
  - Number of maps: 1310
  - Unused columns are skipped
  - rowGroups are skipped
- non-combine
  - Number of maps: 1747
  - Unused columns are skipped
- non-combine + predicate pushdown
  - Number of maps: 43
  - Unused columns are skipped
  - rowGroups are skipped

sql2

    select log_date,log_time,hh24,area_country,area_prov,area_city from tbl_orc_128M where dt='20161109' and hh24='19' limit 100;

- combine
  - Number of maps: 1310
  - Unused columns are skipped
- combine + predicate pushdown
  - Number of maps: 1310
  - Unused columns are skipped
  - rowGroups are skipped
- non-combine
  - Number of maps: 1747
  - Unused columns are skipped
- non-combine + predicate pushdown
  - Number of maps: 1747
  - Unused columns are skipped
  - rowGroups are skipped

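Stripe size 64M

sql1

    select log_date,log_time,hh24,area_country,area_prov,area_city from tbl_orc_64M where dt='20161109' and hh24='19' and channel_id=179569143 limit 100;

- combine
  - Number of maps: 1448
  - Unused columns are skipped
- combine + predicate pushdown
  - Number of maps: 1448
  - Unused columns are skipped
  - rowGroups are skipped
- non-combine
  - Number of maps: 3494
  - Unused columns are skipped
- non-combine + predicate pushdown
  - Number of maps: 0

sql2

    select log_date,log_time,hh24,area_country,area_prov,area_city from tbl_orc_64M where dt='20161109' and hh24='19' limit 100;

- combine
  - Number of maps: 1448
  - Unused columns are skipped
- combine + predicate pushdown
  - Number of maps: 1448
  - Unused columns are skipped
  - rowGroups are skipped
- non-combine
  - Number of maps: 3494
  - Unused columns are skipped
- non-combine + predicate pushdown
  - Number of maps: 3494
  - Unused columns are skipped
  - rowGroups are skipped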

