6miu盘搜

hive内部表，外部表，分区表，桶表，函数及自定义函数

xiaoxiao2021-02-28 28

Hive的概念： Hive是大数据领域中的数据仓库的基础框架 Hive是一个SQL解析引擎 Hive操作方式： hive -S -f hive-srcipt.hql > result.csv hive -S -e "" > result.csv Hive表的操作：创建的方式和目前mysql一致数据的加载方式：load data local inpath 'linux_path' into table tableName; 数据类型：符合数据类型： array ------>java中的array|arrayList 调用方式arr[index] index从0开始 map ------>java中的map 调用方式map["key"] struct ------>java中的object 调用方式obj.property 分隔符：行的默认分隔符：\n 列的默认分隔符：\001 ---->ctrl+v ctrl+a array的默认分隔符：\002 ---->ctrl+v ctrl+b map中的k和v之间的默认分隔符：\003 ---->ctrl+v ctrl+c 创建一张完整的包涵了各种数据类型的表，以及有相应自定义分隔符的表的定义： create table emp( empid int, ename string, salary float, subordinate array<int>, tax map<string, float>, address struct<province:string, city:string, zip:int> ) row format delimited fields terminated by '\t' collection items terminated by ',' map keys terminated by ':' lines terminated by '\n'; 常见的分隔符：| , ## \t \001 \002 \003 不要出现：& $ 两种数据加载的模式：读模式写模式 hive采用的是读模式 ----------------------------------------------------------------------------------------------- Hive表的分类分为两种表：内部表和外部表分为两种功能表：分区表和桶表内部表所谓的内部表，指的是表的数据的存在与否受到了表定义(元数据)的影响，当删除表定义的时候，表中的数据随之一并被删除。内部表，MANAGED_TABLE，也罢内部表称之为受控表外部表所谓的外部表，指的是表的数据的存在不受表定义(元数据)的影响，当删除表定义的时候，表中的数据依然存在，表中的数据相当于是表对数据的一个引用，当删除表的时候，删除了表和数据之间的一个引用关系。外部表，EXTERNAL_TABLE 外部表的DDL： create external table t5_external( id int ); 删除外部表 drop table t5_external; 加载数据：（不建议）load data local inpath '/opt/data/hive/hive-t5-external.txt' into table t5_external; 引用的方式： alter table t5_external set location 'hdfs_path'; eg: alter table t5_external set location '/input/hive'; 在创建表的时候指定数据： create external table t5_external_1( id int ) location '/input/hive'; 作用：可以提高数据的安全性通过对数据的引用，达到多部门共同使用同一份数据，而不用做冗余在hive和外部框架进行集成的时候，多用外部表去关联操作相应框架，比如hbase 内部表和外部表之间进行转换：内部表----->外部表 alter table t set tblproperties('EXTERNAL'='TRUE'); 外部表----->内部表 alter table t set tblproperties('EXTERNAL'='FALSE'); 分区表：为了提高单表的处理能力，满足只操作一部分数据的需求，而不全量进行操作，所以产生了分区表。分区需要指定分区字段及其对应的分区值。 DDL 创建： create table t6_partition( id int, name string, birthday date, online boolean ) partitioned by(dt date comment "partition field day time"); 查看分区： show partitions t6_partition; 增加分区： alter table t6_partition add partition(dt="2017-07-20"); 删除分区： alter table t6_partition drop partition(dt="2017-07-20"); DML 加载数据： load data local inpath '/opt/data/hive/hive-t6-partion.txt' into table t6_partition partition(dt='2017-07-19'); DQL 查询操作和之前的一模一样，将分区字段作为表中的一列来进行对待即可。如果有多个统计维度的时候，可以采用多个分区来设置 create table t6_partition_1( id int, name string, birthday date, online boolean ) partitioned by(year int, class string); 桶表：（分区表搞不定分区数据量大小相差不大）（本地模式不起作用）在hive中，将数据进行打散的另一种方式就是分桶，将数据打散到一个个的文件，这一个个文件称之为桶文件，在创建表的时候，需要预先设定需要几个桶来进行存储数据，同时需要明确指定是按照哪一个字段进行分桶操作。本地模式不起作用 DDL： create table t7_bucket(id int) clustered by (id) into 3 buckets; 查看信息：desc extended t7_bucket DML: 加载数据：不要执行load 字段类型要一致 insert into t7_bucket select id from t5_external_1; 作用：（图）提高多表关联的查询效率做数据的抽样采集视图：创建一个视图： create view t8_view as select * from t1; 执行查询操作和操作table一样的复制一张表(复制的表的结构)： create table tt like t1; Hive的数据加载方式总结 []==>可选，<> ==>必须加载 load load data [local] inpath 'path' [overwrite] into table [partition_psc]; local：有==>从linux本地加载数据无==>从hdfs加载数据，相当于h7dfs dfs -mv 移动数据 overwrite 有==>覆盖掉表中原来的数据无==>在原来的基础上追加新的数据从其他表加载 insert <overwrite|into> t_des select [...] from t_src [...]; overwrite 有==>覆盖掉表中原来的数据无==>在原来的基础上追加新的数据 ==>会转化成为MR执行需要注意的地方：t_des中列要和select [...] from t_src这里面的[...]一一对应起来。创建表的时候加载 create table t_des as select [...] from t_src [...]; 这样会创建一张表，表结构为select [...] from t_src中的[...] 动态分区的加载快速复制表结构 create table t_d_partition like t_partition_1; hive (default)> show partitions t_partition_1; OK partition year=2015/class=bigdata year=2015/class=linux year=2016/class=bigdata year=2016/class=linux 要将2016的数据都到入到t_d_partition的相关的分区中 insert into table t_d_partition partition(class, year=2016) select id, name, class from t_partition_1 where year=2016; 要将t_partition_1中所有数据都到入到t_d_partition的相关的分区中 insert overwrite table t_d_partition partition(year, class) select id, name, year, class from t_partition_1; 从hdfs上面删除的数据，并没有删除表结构，我们show partitions t_d_partition;是从metastore中查询出来的内容，如果你之手动删除的hdfs上面数据，它的元数据信息依然在。需要将hive.exec.dynamic.partition.mode设置为nonstrict <property> <name>hive.exec.max.dynamic.partitions</name> <value>1000</value> <description>Maximum number of dynamic partitions allowed to be created in total.</description> </property> import import table stu from '/data/stu'; 导出 1°、在hdfs的直接上操作 hadoop fs -cp src_uri dest_uri 或者 hive> export table tblName to 'hdfs_uri'; 2°、在终端使用directory insert overwrite [local] directory 'linux_fs_path' select ...from... where ...; Hive的函数&自定义函数查看Hive提供的函数 show functions; 查看具体的函数： desc function [extended] func_name; count if case when split explode row_number collect_set 使用case when对结果进行分类输出 select id, case id when 1 then "TEACHER" when 2 then "STUDENT" when 3 then "SALER" when 4 then "OFFICER" else "OTHER" end from t7_bucket; 使用hql编写wordcount的案例(alter table t rename to t_wc;修改表名) 列名line hello you hello me hello xiaomei ----> count(1) group by line ----> hello you hello me hello xiaomei ------------ hello you.split(" ") ---->["hello", "you"] ---->hello you step 1.将表中某一列的每一行数据使用split函数切割成一个个的单词 select split(line, " ") from t_wc; ["hello","you"] ["hello","xiao","mei"] ["mei","i","hate","you"] step 2.需要将这个数组中的数据，立起来，相当于行列的转置 select explode(split(line, " ")) as word from t_wc; hello you hello xiao mei mei i hate you step 3.在step2的基础之上统计，每一个单词出现的次数 select w.word, count(w.word) as count from (select explode(split(line, " ")) as word from t_wc) w group by w.word order by count desc; 在工作过程中，尽量少用order by进行排序， sort by column ---->按照column进行局部排序 distribute by column 按照某一列进行分开 distribute by column sort by column;这个就和order by是等效的 -----上述的简写方式 cluster by column; 说明这种排序默认就是asc，不支持在最后面跟asc和desc 对表中的数据进行分组排序 row_number 根据员工、部分、薪资，这三张表， 1、分组显示每一个部分员工的信息（启动显示部分名称，员工姓名，员工性别[男|女]，员工薪资），同时分组按照员工薪资降序排序 select e.name, if(sex == 0, '女', '男') as gender, d.name, s.salary, row_number() over(partition by e.deptid order by s.salary desc) rank from t_dept d left join t_employee e on d.id = e.deptid left join t_salary s on e.id = s.empid where s.salary is not null; 2、获取显示部门薪资top2的员工信息 select tmp.* from (select e.name, if(sex == 0, '女', '男') as gender, d.name, s.salary, row_number() over(partition by e.deptid order by s.salary desc) rank from t_dept d left join t_employee e on d.id = e.deptid left join t_salary s on e.id = s.empid where s.salary is not null) tmp where tmp.rank < 3; 当Hive提供的函数满足不了业务需求的时候，就需要自定义函数来满足业务需要(UDF User Define Function, UDTF User Define Table Function, UDAF User Define Aggregation Function)。 UDF:一个输入，一个输出，比如year，date_sub UDAF：多个输入，一个输出，比如count，sum，avg UDTF：操作行列，比如explode 步骤： 1、编写一个类继承UDF 2、覆盖其中的evaluate()方法 3、打成jar包，上传到部署hive的及其上面 4、添加进hive的运行时classpath hive> add jar /opt/jars/hive/zc.jar 5、创建一个临时函数并调用之 create temporary function z_c as 'com.uplooking.bigdata.udf.ZodiacAndConstellationUDF'; 6、销毁临时函数(只在当前会话hive客户端有效)---》一般创建永久函数：extends UDFE ，在相应地方进行注册 hive项目重新编译打包。 maven依赖： <properties> <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding> <hive-api.version>2.1.0</hive-api.version> <hadoop-api.version>2.6.4</hadoop-api.version> <hadoop-core.version>1.2.1</hadoop-core.version> </properties> <dependencies> <dependency> <groupId>junit</groupId> <artifactId>junit</artifactId> <version>4.12</version> <scope>test</scope> </dependency> <dependency> <groupId>org.apache.hadoop</groupId> <artifactId>hadoop-common</artifactId> <version>${hadoop-api.version}</version> </dependency> <dependency> <groupId>org.apache.hadoop</groupId> <artifactId>hadoop-mapreduce-client-core</artifactId> <version>${hadoop-api.version}</version> </dependency> <dependency> <groupId>org.apache.hadoop</groupId> <artifactId>hadoop-core</artifactId> <version>${hadoop-core.version}</version> </dependency> <dependency> <groupId>org.apache.hive</groupId> <artifactId>hive-exec</artifactId> <version>${hive-api.version}</version> </dependency> <dependency> <groupId>org.apache.hive</groupId> <artifactId>hive-serde</artifactId> <version>${hive-api.version}</version> </dependency> <dependency> <groupId>org.apache.hive</groupId> <artifactId>hive-service</artifactId> <version>${hive-api.version}</version> </dependency> <dependency> <groupId>org.apache.hive</groupId> <artifactId>hive-metastore</artifactId> <version>${hive-api.version}</version> </dependency> <dependency> <groupId>org.apache.hive</groupId> <artifactId>hive-common</artifactId> <version>${hive-api.version}</version> </dependency> <dependency> <groupId>org.apache.hive</groupId> <artifactId>hive-cli</artifactId> <version>${hive-api.version}</version> </dependency> <dependency> <groupId>org.apache.hive</groupId> <artifactId>hive-jdbc</artifactId> <version>${hive-api.version}</version> </dependency> <dependency> <groupId>org.apache.thrift</groupId> <artifactId>libfb303</artifactId> <version>0.9.0</version> </dependency> </dependencies> Hive的JDBC 6个步骤： 1、注册驱动 2、获取Connection 3、获得执行器Statment 4、执行SQL 5、封装结果集ResultSet 6、关闭资源(rs.close(), st.close(), con.close()) 小的操作技巧：在hive终端可以执行hdfs shell 和linux的命令： hdfs shell操作方式： hive> dfs xxxx; eg. hive> dfs -ls /input; linux命令： hive> ! linux_cmd; eg.hive> !ls;

转载请注明原文地址: https://www.6miu.com/read-1100277.html

技术

最新回复(0)