Hive Basics: Study Notes


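Why replace Hive's default metastore database:

The embedded Derby database supports only one session at a time. If Hive keeps the default Derby metastore, opening a second Hive command line on the Linux client fails with an error. MySQL, by contrast, supports multiple concurrent sessions.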

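Hive features:

Extract, transform, load (ETL).
Besides SQL, it supports UDFs.
Users familiar with MapReduce can also plug in custom mappers and reducers for complex analysis that the built-in mappers and reducers cannot handle.

Hive is a parsing engine: it translates SQL into MapReduce jobs that run on Hadoop.
When a query goes through the JobTracker: a select on specific columns (not *), which runs as a MapReduce job.
When a query goes straight to the NameNode: select *, which reads the table's files from HDFS without a MapReduce job.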

Changing the Hive warehouse location:
hive-default.xml ---> hive.metastore.warehouse.dir

Two ways to enter the Hive command line:
1 #hive
2 #hive --service cli

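0 Using Hive parameters and variables: (----> in an HQL statement, ${} references the value of a variable, ${system:} references a Java system property,

and ${env:} references a shell environment variable)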

Method 1: hivevar namespace, defines a local variable
#hive -d columeage=age
hive>create table t(name string, ${columeage} string);
Method 2: hiveconf namespace, defines a global setting
#hive --hiveconf hive.cli.print.current.db=true;
#hive --hiveconf hive.cli.print.header=true;
Method 3: read a Java system property
hive>create table t(name string, ${system:user.name} string);
Method 4: read a shell environment variable (run env in the shell to list all environment variables)
hive>create table t(name string, ${env:HOSTNAME} string);

Partial output of env in the shell:

[root@chinadaas109 ~]# env
HOSTNAME=chinadaas109
TERM=vt100
SHELL=/bin/bash
HISTSIZE=1000
.....

An example of referencing a shell environment variable in Hive HQL (the first attempt fails because the column has no type):

hive (default)> create table ttttt(id string,${env:HOSTNAME});
FAILED: Parse Error: line 1:41 cannot recognize input near ')' '<EOF>' '<EOF>' in column type

hive (default)> create table ttttt(id string,${env:HOSTNAME} string);
OK
Time taken: 0.256 seconds
hive (default)> desc ttttt;
OK
col_name        data_type       comment
id      string
chinadaas109    string
Time taken: 0.254 seconds
hive (default)>
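
The substitution seen above can be pictured with a small Python sketch. It only mimics the textual replacement of ${name}, ${hivevar:name}, ${system:name} and ${env:NAME}; it is not Hive's own implementation. The function name substitute and its parameters are made up for illustration, and leaving unknown references untouched is an assumption of this sketch.

```python
import os
import re

def substitute(hql, hivevars=None, system_props=None, env=None):
    """Expand ${name}, ${hivevar:name}, ${system:name} and ${env:NAME} in an HQL string."""
    hivevars = hivevars or {}
    system_props = system_props or {}
    env = os.environ if env is None else env
    namespaces = {"hivevar": hivevars, "system": system_props, "env": env}

    def expand(match):
        prefix, name = match.group(1), match.group(2)
        # A bare ${name} is looked up in the hivevar namespace in this sketch.
        table = namespaces.get(prefix or "hivevar")
        if table is None or name not in table:
            return match.group(0)  # assumption: unknown references stay as written
        return table[name]

    return re.sub(r"\$\{(?:(\w+):)?([\w.]+)\}", expand, hql)

print(substitute("create table ttttt(id string,${env:HOSTNAME} string);",
                 env={"HOSTNAME": "chinadaas109"}))
# create table ttttt(id string,chinadaas109 string);
```

Because the replacement is purely textual, the statement still needs a column type after the substituted name, which is why the first attempt in the transcript fails.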

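1 Running Hive scripts:

Directly from the Linux shell:
$>hive -e "hql" 

$>hive -e "hql">aaa  writes the result to the file aaa in the current Linux directory, overwriting it

$>hive -S -e "hql">aaa    -S = --silent: runs quietly, without interactive messages such as OK, Time taken: 1.398 seconds, Fetched: 4 row(s). The order of -S and -e cannot be swapped.
   Use -S when you want to save the result without the interactive messages.
$>hive -f filename        -f = file: runs the HQL in the file and then returns to the Linux shell. An example from ETL work: hive -f data_process_$DATE/02_all_step.sql

$>hive -i /home/my/hive-init.sql  -i runs the file as initialization when Hive starts, then enters the Hive shell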
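
When these invocations are driven from an ETL script, the argument list can be assembled in one place. The sketch below is a hypothetical helper (hive_command is not part of Hive); it only builds the argv for the -S, -e and -f options described above and does not run anything.

```python
import shlex

def hive_command(query=None, script=None, silent=False):
    """Build the argv for a non-interactive hive call; nothing is executed here."""
    if (query is None) == (script is None):
        raise ValueError("pass exactly one of query or script")
    argv = ["hive"]
    if silent:
        argv.append("-S")  # placed before -e / -f, as in the notes above
    if query is not None:
        argv += ["-e", query]
    else:
        argv += ["-f", script]
    return argv

cmd = hive_command(query="show tables;", silent=True)
print(shlex.join(cmd))  # hive -S -e 'show tables;'
```

On a machine with Hive installed, the list could be passed to subprocess.run with stdout redirected to a file, which corresponds to the >aaa redirection shown above.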

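Inside the Hive command line:
hive>source file;    runs a file of HQL located in the current Linux directory. If the file is in another directory, give a relative or absolute path: hive>source path/file;
e.g. the file showtable is in the test folder under the current Linux directory and contains show tables;  In the Hive command line it is run as:
hive (default)> source test/showtable;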

2 Hive interaction with external resources:

From inside Hive to Linux commands ---> !  Not every Linux command can be run through ! inside Hive, though.
hive>!ls;
hive>!pwd;

From inside Hive to HDFS commands:     in Hadoop the command format is hadoop dfs -ls; in Hive, drop the leading hadoop. That is an easy way to remember how to use Hadoop commands in Hive.
hive>dfs -ls /;
hive>dfs -mkdir /hive;

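3 Hive JDBC mode

The bin/ directory of the Hive installation contains the two commands hive and hiveserver2; the latter is the startup script for Hive's remote service. 

The Hive side must have the remote service running (port 10000). 
HiveServer2 was introduced in Hive 0.11 and the original HiveServer (hiveserver1) was removed in Hive 1.0, so check which one your Hive version provides.
Start the remote service by running hiveserver2 directly or with hive --service hiveserver2. Once it has started, check that it is listening with: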
[root@chinadaas109 ~]# netstat -anp | grep 10000
tcp        0      0 :::10000                    :::*                        LISTEN      10643/java  
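
The same check can be made from a client machine without netstat. This is a minimal sketch using only the Python standard library; the function name is_listening is made up for illustration, and localhost with port 10000 is an assumption taken from the port mentioned above.

```python
import socket

def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False

# Assumption: the remote service runs on this machine on port 10000.
print(is_listening("localhost", 10000))
```

A True result only shows that something accepts TCP connections on that port; it does not prove the listener is Hive's remote service.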


Calling Hive's JDBC driver from Java code to open a connection: code to follow.....

4 Hive web interface mode (taken from online material; I have not installed it myself, kept here for reference)

Installing the web interface:
Download apache-hive-0.14.0-src.tar.gz
Build a war file and put it in HIVE_HOME/lib/: package all files under hwi/web/* into a war ---> zip them, then rename the zip to .war
Copy tools.jar (from the JDK's lib directory) to hive/lib
Edit hive-site.xml
<property>
    <name>hive.hwi.listen.host</name>
    <value>0.0.0.0</value>
</property>
<property>
    <name>hive.hwi.listen.port</name>
    <value>9999</value>
</property>
<property>
    <name>hive.hwi.war.file</name>
    <value>lib/hive-hwi-0.14.0.war</value>
</property>
Starting the Hive web interface (port 9999):
	#hive --service hwi &
Then access Hive through a browser:
http://hadoop0:9999/hwi/

5 hive set, .hiverc, .hivehistory

The set command in the Hive console:  equivalent to the #hive --hiveconf hive.cli.print.current.db=true; form used on the Linux command line above
hive>set hive.cli.print.current.db=true;
hive>set hive.metastore.warehouse.dir=/hive;



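set commands can also initialize parameters when Hive starts: put them in the .hiverc file in your home directory. If .hiverc does not exist, create it by hand.
Why the home directory (~)? Hive records configuration and command history in the home directory of whichever user runs it.
e.g.: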
[root@chinadaas109 ~]# cat .hiverc 
set hive.cli.print.current.db=true; 
set hive.cli.print.header=true;
add jar /home/new_load_data/lib/hive-udf.jar; 
create temporary function hivenvl as 'org.apache.hadoop.hive.ql.udf.hivenvl'; 


Hive command history:
~/.hivehistory

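6 Hive data types and complex types: an overview

Why data types matter:
Once column types are set, Hive functions can compute results such as maximum, minimum and average.

Hive row and column delimiters:

ROW FORMAT DELIMITED declares that each row is one line of text (rows end with \n by default) and that fields are split by the delimiters listed after it.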

Data types:

Primitive types:
  integer
  boolean
  floating point
  string

Complex types:
  Struct
  Array
  Map


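When creating a table with complex types, the delimiter between ordinary fields and the delimiter between items inside a complex field must differ. In the examples below one is , and the other is : (otherwise Hive cannot tell how to parse the row).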


Array usage:
All elements of an array have the same type. For example, if array A holds ['a','b','c'], then A[1] is 'b' (indexes start at 0).
hive>create table table1(name string, student_id_list array<INT>) ROW FORMAT DELIMITED  FIELDS TERMINATED BY ','  COLLECTION ITEMS TERMINATED BY ':';


  Contents of the data file file1:
   class1,1001:1002
   class2,1001:1002
load data local inpath 'file1' into table table1;
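
To see how the two delimiters divide a row of file1, here is a small Python sketch. It mimics the splitting described by the table1 definition and is not Hive's actual parser; parse_array_row is a hypothetical name.

```python
def parse_array_row(line, field_sep=",", item_sep=":"):
    """Split one text row into (name, student_id_list) using table1's delimiters."""
    # FIELDS TERMINATED BY ',' separates the two columns.
    name, raw_items = line.strip().split(field_sep, 1)
    # COLLECTION ITEMS TERMINATED BY ':' separates the array elements.
    student_id_list = [int(item) for item in raw_items.split(item_sep)]
    return name, student_id_list

name, ids = parse_array_row("class1,1001:1002")
print(name, ids[0], ids[1])  # class1 1001 1002
```

The zero-based ids[0] here corresponds to student_id_list[0] in a Hive query against table1.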


Struct usage:
Fields inside a struct are accessed with a dot (.). For example, if column c has type STRUCT{a INT; b INT}, field a is accessed as c.a
hive> create table student_test(id INT, info struct<name:STRING, age:INT>)   ROW FORMAT DELIMITED FIELDS TERMINATED BY ','  COLLECTION ITEMS TERMINATED BY ':';  
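
The notes give no sample data for student_test, so the row below ("1,alice:20") is a made-up example. The Python sketch mimics how the delimiters in the definition would divide such a row; it is an illustration, not Hive's parser, and parse_struct_row and Info are hypothetical names.

```python
from collections import namedtuple

# Mirrors struct<name:STRING, age:INT> from the student_test definition.
Info = namedtuple("Info", ["name", "age"])

def parse_struct_row(line, field_sep=",", item_sep=":"):
    """Split one text row into (id, info) using student_test's delimiters."""
    raw_id, raw_info = line.strip().split(field_sep, 1)
    # Struct members appear in declaration order, separated by the item delimiter.
    name, age = raw_info.split(item_sep)
    return int(raw_id), Info(name=name, age=int(age))

row_id, info = parse_struct_row("1,alice:20")
print(row_id, info.name, info.age)  # 1 alice 20
```

Reading info.name in the sketch plays the role of info.name in a Hive select on student_test.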

Map usage:

Values are accessed with ["key"]. For example, if map M contains a key-value pair group -> gid, the gid value is retrieved with M['group']
create table employee(id string, perf map<string, int>) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' COLLECTION ITEMS TERMINATED BY ',' MAP KEYS TERMINATED BY ':';
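
The employee table uses three delimiters: a tab between columns, a comma between map entries, and a colon between each key and its value. The Python sketch below mimics that splitting on a made-up row (the keys job and team and their values are hypothetical); it is an illustration rather than Hive's parser.

```python
def parse_map_row(line, field_sep="\t", item_sep=",", key_sep=":"):
    """Split one text row into (id, perf) using the employee table's delimiters."""
    emp_id, raw_perf = line.rstrip("\n").split(field_sep, 1)
    perf = {}
    for pair in raw_perf.split(item_sep):      # COLLECTION ITEMS TERMINATED BY ','
        key, value = pair.split(key_sep, 1)    # MAP KEYS TERMINATED BY ':'
        perf[key] = int(value)
    return emp_id, perf

emp_id, perf = parse_map_row("1\tjob:80,team:60")
print(emp_id, perf["job"])  # 1 80
```

Looking up perf["job"] in the sketch corresponds to perf['job'] in a Hive query on employee.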
