Hive Study Notes 02

Category: IT Articles • 2024-09-15 12:31:25

1. Basic Hive Operations

　　a. DML Operations

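load: loading simply moves the data file to the location that backs the Hive table.
- local: if specified, the path refers to the local file system, and the file is copied from the local file system to the target file system. Without it, the path refers to HDFS and the file is moved.
- overwrite: if specified, the existing contents of the target table or partition are replaced.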

load data local inpath 'path' into table tb_load1;
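
The two options above can be sketched as follows. This is only an illustration: the paths, the table `tb_load_part`, and the partition column `dt` are hypothetical and do not come from the original notes.

```sql
-- Without LOCAL the path is resolved on HDFS and the file is moved
-- into the table directory. The HDFS path is a hypothetical example.
load data inpath '/user/hadoop/input/data.txt' into table tb_load1;

-- With OVERWRITE the existing contents of the table are replaced.
load data local inpath '/root/hivedata/data.txt' overwrite into table tb_load1;

-- Loading into one partition of a partitioned table
-- (tb_load_part and dt are assumptions made for this sketch).
load data local inpath '/root/hivedata/data.txt'
overwrite into table tb_load_part partition (dt='2024-09-15');
```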


insert

insert overwrite table stu_buck 
select * from student cluster by(Sno);


select
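- order by: global sort.
- sort by: local sort. Rows are sorted before they are fed to each reducer, so with more than one reduce task the overall output is not guaranteed to be globally ordered.
- distribute by (column): sends rows to different reducers based on the given column; the distribution uses hashing.
- cluster by (column): does what distribute by does and also sorts. If distribute by and sort by use the same column, the pair is equivalent to cluster by (column).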
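
A rough side-by-side sketch of the four clauses, reusing the `student` table and `Sno` column from the insert example above. The columns `Sdept` and `Sage` and the reducer count are assumptions made for illustration.

```sql
-- Use several reducers so the difference between global and local sort is visible.
set mapreduce.job.reduces=3;

-- Global sort: the whole result is ordered by Sno.
select * from student order by Sno;

-- Local sort: each reducer's output is ordered by Sno, the overall result is not.
select * from student sort by Sno;

-- Rows with the same Sno go to the same reducer and are sorted there.
select * from student distribute by Sno sort by Sno;

-- Shorthand for the previous query (cluster by sorts in ascending order only).
select * from student cluster by Sno;

-- Different columns for distribution and sorting cannot be written as cluster by.
-- Sdept and Sage are hypothetical columns.
select * from student distribute by Sdept sort by Sage desc;
```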

2. Hive Join

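Older Hive versions support only equi-joins. Non-equi join conditions are not supported there, because they are hard to convert into MapReduce jobs.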
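
A minimal sketch of what this means in practice. The tables `tb_a` and `tb_b` and all of their columns are hypothetical.

```sql
-- Equi-join: the ON clause compares columns with '='.
select a.id, a.name, b.city
from tb_a a
join tb_b b on a.id = b.id;

-- A non-equality condition can be applied as a filter after the equi-join
-- instead of being placed in the ON clause.
select a.id, a.name, b.city
from tb_a a
join tb_b b on a.id = b.id
where a.age > b.min_age;
```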

https://www.cnblogs.com/yiwanfan/p/9628235.html

3. Introduction to Hive Functions

　　a. Ordinary Functions

　　https://www.cnblogs.com/kimbo/p/6288516.html

　　b. User-Defined Functions

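When the built-in functions cannot meet a business requirement, consider writing a user-defined function.

There are three kinds of user-defined functions:

UDF: ordinary function (one row in, one value out)
UDAF: aggregate function (many rows in, one value out)
UDTF: table-generating function (one row in, many rows out)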
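
The three shapes can be seen with built-in functions that behave the same way. This sketch assumes a table `t_test(name, age)`; a select without a from clause needs a reasonably recent Hive version.

```sql
-- UDF shape (one row in, one value out): lower() is applied to each row.
select lower(name) from t_test;

-- UDAF shape (many rows in, one value out): max() collapses all rows into one value.
select max(age) from t_test;

-- UDTF shape (one row in, many rows out): explode() emits one row per array element.
select explode(array('usa', 'china', 'japan'));
```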

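UDF development example:

Create a Java project and add the hive-exec-1.2.1.jar and hadoop-common-2.7.4.jar dependencies.

Create a class that extends UDF and overload evaluate to implement the business logic.

Package it as a jar.

Add the jar to Hive's classpath: hive>add jar /home/hadoop/udf.jar;

Create a temporary function bound to the Java class you wrote (the name after as must be that class's fully qualified name): create temporary function tolowercase as 'cn.itcast.bigdata.udf.ToProvince';

The function can now be used in SQL: select tolowercase(name),age from t_test;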
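
Put together, a session might look roughly like the sketch below. The jar path, function name, and class name are copied from the steps above; the `list jars`, `describe function`, and `drop` statements are additions for checking and cleanup, not part of the original notes.

```sql
-- Register the jar for the current session.
add jar /home/hadoop/udf.jar;

-- List the jars added to this session to confirm the registration.
list jars;

-- Bind a function name to the UDF class.
create temporary function tolowercase as 'cn.itcast.bigdata.udf.ToProvince';

-- Show the function's description, if the class provides one.
describe function tolowercase;

-- Call the function like any built-in.
select tolowercase(name), age from t_test;

-- A temporary function only lives for the session; it can also be dropped explicitly.
drop temporary function if exists tolowercase;
```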

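4. Advanced Hive Function Features

a. UDTF: the explode function

explode is a built-in Hive UDTF that expands a field of type map or array. An array produces one row per element; a map produces one row per key-value pair, with the key in one column and the value in another.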

-- data
001,allen,usa|china|japan,1|3|7
002,kobe,usa|england|japan,2|3|5
-- create the table
create table test_message(id int,name string,location array<string>,city array<int>) row format delimited fields terminated by ","
collection items terminated by '|';
-- load the data
load data local inpath "/root/hivedata/test_message.txt" into table test_message;
-- explode
select explode(location) from test_message;
select name,explode(location) from test_message; -- fails
When a UDTF is used in the select list, Hive only allows access to the exploded field.
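
The map case described above can be sketched with a literal map, since `test_message` has no map column. The aliases `k`, `v`, `pos`, and `lc` are arbitrary names chosen for this example.

```sql
-- Exploding a map yields two columns per entry; AS (k, v) names them.
select explode(map('usa', 1, 'china', 3)) as (k, v);

-- posexplode is a related built-in that also returns each element's
-- position in the array, starting at 0.
select posexplode(location) as (pos, lc) from test_message;
```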


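b. lateral view

A lateral view is meant to be used together with a UDTF to split one row into multiple rows. Without lateral view, a UDTF can only extract and split a single field, and the result cannot be put back alongside the original table's columns. With lateral view, the rows produced from the split field are joined back to the source row they came from.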

select subview.* from test_message lateral view explode(location) subview;
-- lateral view explode acts like a virtual table that splits the location field and is then joined with the original table.
select name,subview.* from test_message lateral view explode(location) subview as lc;
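
Two further variations, sketched against the same `test_message` table. The aliases `lv1`, `lv2`, `lv`, `lc`, and `ct` are arbitrary.

```sql
-- Two lateral views in a row: every location is paired with every city
-- of the same source row.
select name, lc, ct
from test_message
lateral view explode(location) lv1 as lc
lateral view explode(city) lv2 as ct;

-- OUTER keeps source rows whose array is empty or NULL,
-- emitting NULL for lc instead of dropping the row.
select name, lc
from test_message
lateral view outer explode(location) lv as lc;
```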


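5. Row/Column Transformations

a. Multiple rows to a single column

concat_ws(arg1, arg2): concatenates strings
- arg1: the separator
- arg2: the content to concatenate
collect_set(col3): aggregates the values of a column with duplicates removed and produces an array-typed field; use collect_list() if you do not want deduplication.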

+-----------------+-----------------+-----------------+--+
| row2col_1.col1  | row2col_1.col2  | row2col_1.col3  |
+-----------------+-----------------+-----------------+--+
| a               | b               | 1               |
| a               | b               | 2               |
| a               | b               | 3               |
| c               | d               | 4               |
| c               | d               | 5               |
| c               | d               | 6               |
+-----------------+-----------------+-----------------+--+
6 rows selected (0.096 seconds)
0: jdbc:hive2://hadoop01:10000> select col1, col2, concat_ws('|', collect_set(cast(col3 as string))) as col3
. . . . . . . . . . . . . . . > from row2col_1
. . . . . . . . . . . . . . . > group by col1, col2;
+-------+-------+--------+--+
| col1  | col2  |  col3  |
+-------+-------+--------+--+
| a     | b     | 1|2|3  |
| c     | d     | 4|5|6  |
+-------+-------+--------+--+
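
Two variations on the query above, as a sketch. The element order inside the collected array should not be relied on, so the second query sorts it explicitly; note that `sort_array` compares the values as strings here.

```sql
-- collect_list keeps duplicate values instead of removing them.
select col1, col2, concat_ws('|', collect_list(cast(col3 as string))) as col3
from row2col_1
group by col1, col2;

-- sort_array makes the order of the concatenated values deterministic.
select col1, col2, concat_ws('|', sort_array(collect_set(cast(col3 as string)))) as col3
from row2col_1
group by col1, col2;
```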


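b. Single column to multiple rows

This requires the UDTF (table-generating function) explode(), which takes an array-typed argument and does the opposite of collect_set: it expands an array into one row per element. Combining explode with lateral view splits a column's data into multiple rows.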

+-----------------+-----------------+-----------------+--+
| col2row_2.col1  | col2row_2.col2  | col2row_2.col3  |
+-----------------+-----------------+-----------------+--+
| a               | b               | ["1","2","3"]   |
| c               | d               | ["4","5","6"]   |
+-----------------+-----------------+-----------------+--+
2 rows selected (0.075 seconds)
0: jdbc:hive2://hadoop01:10000> select col1, col2, lv.col3 as col3
. . . . . . . . . . . . . . . > from col2row_2
. . . . . . . . . . . . . . . > lateral view explode(col3) lv as col3;
+-------+-------+-------+--+
| col1  | col2  | col3  |
+-------+-------+-------+--+
| a     | b     | 1     |
| a     | b     | 2     |
| a     | b     | 3     |
| c     | d     | 4     |
| c     | d     | 5     |
| c     | d     | 6     |
+-------+-------+-------+--+
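
If the column holds a delimited string such as 1|2|3 rather than an array, one common approach is to turn it into an array with split() first. The table `col2row_str` and the alias `item` are hypothetical.

```sql
-- split() takes a regular expression, so the pipe character must be escaped.
select col1, col2, lv.item as col3
from col2row_str
lateral view explode(split(col3, '\\|')) lv as item;
```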


c. Multiple rows to multiple columns

+---------------+---------------+---------------+--+
| row2col.col1  | row2col.col2  | row2col.col3  |
+---------------+---------------+---------------+--+
| a             | c             | 1             |
| a             | d             | 2             |
| a             | e             | 3             |
| b             | c             | 4             |
| b             | d             | 5             |
| b             | e             | 6             |
+---------------+---------------+---------------+--+
6 rows selected (0.092 seconds)
0: jdbc:hive2://hadoop01:10000> select col1,
. . . . . . . . . . . . . . . > max(case col2 when 'c' then col3 else 0 end) as c,
. . . . . . . . . . . . . . . > max(case col2 when 'd' then col3 else 0 end) as d,
. . . . . . . . . . . . . . . > max(case col2 when 'e' then col3 else 0 end) as e
. . . . . . . . . . . . . . . > from row2col
. . . . . . . . . . . . . . . > group by col1;
+-------+----+----+----+--+
| col1  | c  | d  | e  |
+-------+----+----+----+--+
| a     | 1  | 2  | 3  |
| b     | 4  | 5  | 6  |
+-------+----+----+----+--+
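
A variant of the same pivot, offered as a sketch: with `else 0`, a combination of col1 and col2 that has no row shows up as 0, which cannot be told apart from a real value of 0. Dropping the else branch yields NULL for missing combinations instead.

```sql
-- A case expression without ELSE returns NULL when no branch matches,
-- and max() ignores NULLs, so missing combinations stay NULL.
select col1,
       max(case when col2 = 'c' then col3 end) as c,
       max(case when col2 = 'd' then col3 end) as d,
       max(case when col2 = 'e' then col3 end) as e
from row2col
group by col1;
```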


6. The reflect Function

The reflect function lets SQL call methods that ship with Java directly, which removes the need to write a UDF in many simple cases.

+----------------+----------------+--+
| test_udf.col1  | test_udf.col2  |
+----------------+----------------+--+
| 1              | 2              |
| 4              | 3              |
| 6              | 4              |
| 7              | 5              |
| 5              | 6              |
+----------------+----------------+--+
5 rows selected (0.061 seconds)
0: jdbc:hive2://hadoop01:10000> select reflect("java.lang.Math","max",col1,col2) from test_udf;
+------+--+
| _c0  |
+------+--+
| 2    |
| 4    |
| 6    |
| 7    |
| 6    |
+------+--+
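
A few more calls in the same style, as a sketch against the same `test_udf` table. The choice of methods is only illustrative.

```sql
-- Another static method from java.lang.Math.
select reflect("java.lang.Math", "min", col1, col2) from test_udf;

-- A static method without arguments: generates a random UUID.
select reflect("java.util.UUID", "randomUUID");

-- java_method is a synonym for reflect.
select java_method("java.lang.Math", "max", col1, col2) from test_udf;
```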
