Compressing JSON data into .gz files, uploading to HDFS, mapping it with Hive, and running Spark jobs on it afterwards
1. This is what the formatted data on HDFS looks like; for our needs we only care about the fields inside _source:
{ "_index": "operatelog", "_type": "operatelog", "_id": "5938ca71-6057-40b5-a739-5687ae360640", "_source": { "id": "5938ca71-6057-40b5-a739-5687ae360640", "appId": "1000", "actionId": "1005", "code": "0", "sn": "50631032", "logLevel": "DEBUG" }, "sort": [1500349702325] }2,这就简单了,没有复杂的格式,建表
create external table operatelog(
  id string,
  appId string,
  actionId string,
  code string,
  sn string,
  `logLevel` string
)
partitioned by (etlyear string, etlmonth string, etlday string)
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
stored as textfile;
At this point, if you run into FAILED: ParseException line 1:34 cannot recognize input near 'loglevel' 'string' ',' in column specification (state=42000,code=40000), it is because loglevel collides with a Hive keyword; wrapping the column name in backticks `` (the key at the top left of the keyboard, shared with ~) fixes it.
3. If the CREATE TABLE still fails even after adding the backticks:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Cannot validate serde: org.apache.hive.hcatalog.data.JsonSerDe
This means the environment is missing the JsonSerDe class used to parse JSON, so you need to add the JsonSerDe jar to the Hive session.
Command:
0: jdbc:hive2://192.168.18.14:10000> add jar ${HIVE_HOME}/hcatalog/share/hcatalog/hive-hcatalog-core-0.13.1-cdh5.3.2.jar;
To find out what ${HIVE_HOME} actually is, run echo ${HIVE_HOME} in a shell, then substitute the full path into the command above.
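For example, if echo ${HIVE_HOME} prints /opt/cloudera/parcels/CDH/lib/hive (an assumed path here, use whatever your shell actually prints), the fully expanded command in beeline looks like this, and list jars then shows what got registered for the session:

0: jdbc:hive2://192.168.18.14:10000> add jar /opt/cloudera/parcels/CDH/lib/hive/hcatalog/share/hcatalog/hive-hcatalog-core-0.13.1-cdh5.3.2.jar;
0: jdbc:hive2://192.168.18.14:10000> list jars;

Keep in mind that add jar only lasts for the current session; to make it stick you can, for example, point hive.aux.jars.path at the jar.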
4. Now create the table again; this time it should go through. Map the data in and run a select to check that everything looks right (a sketch of what that can look like follows below).
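Here is a minimal sketch of that mapping step. The HDFS path and date values are made up, and I am assuming each line in the .gz files is just the _source object, which is what the table columns expect; Hive decompresses .gz text files transparently based on the file extension, so nothing extra is needed for the select:

-- point a partition at the directory that holds the uploaded .gz files
alter table operatelog add if not exists partition (etlyear='2017', etlmonth='07', etlday='18')
location '/data/operatelog/2017/07/18';

-- then check the mapping
select id, appId, actionId, code, sn, `logLevel`
from operatelog
where etlyear='2017' and etlmonth='07' and etlday='18'
limit 5;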
But, and this is where I grab the chalk and underline things, if the JSON data has a complex structure, for example the following:
"_source": { "id": "184TCA876817", "sn": "184TCA876817", "param": ["LdTms|38|2.1.7", "富掌柜|42|1.5.9", "消息中心|22|2.0.2", "KunTengConver|1|1.0", ] }表结构如下
create external table apkversion(
  id string,
  sn string,
  `_source` array<string>
)
partitioned by (etlyear string, etlmonth string, etlday string)
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
stored as textfile;
Creating the table and mapping the data both go fine, but when you run select * from apkversion limit 5; you discover yet another pit:
0: jdbc:hive2://192.168.18.14:10000> select * from apkversion limit 1;
org.apache.hadoop.hive.serde2.SerDeException: java.io.IOException: Start of Array expected (state=,code=0)
The reason is that JsonSerDe does not cope with the complex structure here, namely the array; painful, isn't it?
I still have not solved this pit; if some expert out there has cracked it, please leave me a comment, much appreciated.
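For anyone who wants to tinker with it, here is one unverified direction, purely a sketch and not something I have gotten working: bypass JsonSerDe for this table, map each line as a single raw string column, and pull fields out with Hive's built-in get_json_object. The table name, column names and JSON paths below are assumptions based on the sample above.

-- no JsonSerDe involved; each whole line lands in the raw column
create external table apkversion_raw(
  raw string
)
partitioned by (etlyear string, etlmonth string, etlday string)
stored as textfile;

-- get_json_object returns nested values as strings; an array comes back as its JSON text,
-- which could then be split further in Hive or in the downstream Spark job.
-- If each line only contains the _source object itself, drop the "_source." prefix from the paths.
select get_json_object(raw, '$._source.id')    as id,
       get_json_object(raw, '$._source.sn')    as sn,
       get_json_object(raw, '$._source.param') as param_json
from apkversion_raw
limit 5;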