Hive SQL Tips

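Basic Data Types

TINYINT: 1-byte signed integer (1)
SMALLINT: 2-byte signed integer (1)
INT: 4-byte signed integer (1)
BIGINT: 8-byte signed integer (1)
FLOAT: 4-byte single-precision floating-point number (1.0)
DOUBLE: 8-byte double-precision floating-point number (1.0)
BOOLEAN: Boolean type, true or false (false)
STRING: character string; a character set can be specified ('hive')
TIMESTAMP: can be given as an integer, a floating-point number, or a string (1321123121)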
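
As a quick way to see these types in action, the following sketch casts a few literals to each type. It assumes a Hive version that allows SELECT without a FROM clause; the literal values and column aliases are illustrative only.

```sql
-- Cast sample literals to the basic types listed above.
SELECT CAST(1 AS TINYINT)                         AS tiny_val,
       CAST(1 AS SMALLINT)                        AS small_val,
       CAST(1 AS BIGINT)                          AS big_val,
       CAST('1.0' AS DOUBLE)                      AS double_val,
       CAST('2011-11-12 18:38:41' AS TIMESTAMP)   AS ts_from_string,
       from_unixtime(1321123121)                  AS string_from_epoch_seconds;
```

Note that from_unixtime returns a formatted string in the session time zone, so the exact text depends on the cluster configuration.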

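Collection Data Types

ARRAY: an ordered group of elements, all of which must have the same type.
MAP: an unordered group of key/value pairs. Keys must be of a primitive type and values can be of any type; within one map, all keys must share one type and all values must share one type.
STRUCT: a group of named fields whose types may differ.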
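
A minimal sketch of how the three collection types can be declared and read. The table student_profile, its columns, and the delimiters are hypothetical and chosen only for illustration.

```sql
-- Hypothetical table used only for illustration.
CREATE TABLE IF NOT EXISTS student_profile (
    id      INT,
    name    STRING,
    courses ARRAY<STRING>,                    -- ordered, every element is a STRING
    scores  MAP<STRING, INT>,                 -- primitive keys, INT values
    address STRUCT<city:STRING, zip:STRING>   -- named fields, types may differ
)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    COLLECTION ITEMS TERMINATED BY ','
    MAP KEYS TERMINATED BY ':';

SELECT id,
       courses[0]     AS first_course,   -- array elements are indexed from 0
       scores['math'] AS math_score,     -- map lookup by key
       address.city   AS city            -- struct field access with dot notation
FROM student_profile;
```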

Deduplication

row_number()

SELECT id
    ,month
    ,flag
FROM (
    SELECT id
        ,month
        ,flag
        ,row_number() OVER (
            PARTITION BY id ORDER BY month DESC
            ) AS rn
    FROM view1
    ) t
WHERE t.rn = 1;

Creating Databases, Tables, and Views

-- Create a database: create database [if not exists] database_name;
hive> create database if not exists college;

-- Create a table (see the appendix for managed, external, and partitioned tables):
create [external] table [if not exists] table_name
[(col_1 dt [comment c_com_1],col_2 dt [comment c_com_2],...)] 
[partitioned by (col dt,...)];

hive> use college;
hive> create table if not exists student(id int,name string);

-- Create a view
hive> create view stu as select id,name from student where id<10;
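
The table template above also covers external and partitioned tables. The following is a sketch under assumptions: the table name student_ext, the partition column enroll_year, the delimiter, and the LOCATION path are all hypothetical, and the LOCATION clause goes beyond the template shown above.

```sql
-- Hypothetical external, partitioned variant of the student table.
CREATE EXTERNAL TABLE IF NOT EXISTS student_ext (
    id   INT    COMMENT 'student id',
    name STRING COMMENT 'student name'
)
PARTITIONED BY (enroll_year INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hadoop/student_ext';   -- assumed HDFS path

-- Register one partition so that queries can see its data.
ALTER TABLE student_ext ADD IF NOT EXISTS PARTITION (enroll_year = 2020);
```

Dropping an external table removes only its metadata; the files under its location stay in HDFS.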

Viewing Databases and Tables

-- List all databases
hive> show databases;

-- List all databases whose names start with col
hive> show databases like 'col.*';

-- Show the location and other details of the database named hive
describe database hive;
desc database hive;
desc database extended hive;

-- List tables
hive> show tables;
hive> show tables in college like 's.*';

Viewing the CREATE TABLE Statement

hive> show create table tablename;

Dropping Databases and Tables

-- Drop a database
drop database db [cascade]; -- a non-empty database requires cascade; otherwise the statement fails
-- Drop a table
drop table table_name;

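Loading Data

-- Load a local file
hive> load data local inpath '/home/hadoop/stu.txt' overwrite into table student; -- overwrite replaces the table's existing data

# Roughly equivalent to the following shell command (loading local data essentially uploads the local file to the storage path of the Hive table)
hadoop fs -put /home/hadoop/stu.txt /hive/warehouse/college.db/student

-- Load a file that is already in HDFS
hive> load data inpath '/user/hadoop/stu.txt' overwrite into table student; -- overwrite replaces the table's existing data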
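
For a partitioned table, the load statement also names the target partition. A minimal sketch, in which the table student_by_year, the partition column enroll_year, and the file path are hypothetical:

```sql
-- Hypothetical partitioned table; the name, columns, and file path are assumptions.
CREATE TABLE IF NOT EXISTS student_by_year (id INT, name STRING)
PARTITIONED BY (enroll_year INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Replace the contents of a single partition with the local file.
LOAD DATA LOCAL INPATH '/home/hadoop/stu_2020.txt'
OVERWRITE INTO TABLE student_by_year PARTITION (enroll_year = 2020);
```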

Inserting Data

-- use "insert into table" instead to append rather than overwrite
insert overwrite table student_copy select * from student where id<10;
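
The difference between overwriting and appending can be sketched as follows. This assumes student_copy has the same schema as student; the id ranges are illustrative.

```sql
-- student_copy is assumed to share the schema of student.
CREATE TABLE IF NOT EXISTS student_copy LIKE student;

-- Replace whatever student_copy currently holds.
INSERT OVERWRITE TABLE student_copy
SELECT * FROM student WHERE id < 10;

-- Append to the existing rows instead of replacing them.
INSERT INTO TABLE student_copy
SELECT * FROM student WHERE id >= 10 AND id < 20;
```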

Querying Data

select id,name,
  case
  when id=1 then 'first'
  when id=2 then 'second'
  else 'other' end as id_label from student;

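Order By Optimization

Caution

order by in HiveQL behaves as it does in other SQL dialects: it sorts the entire result globally by the given columns. This forces all map output through a single reducer, so on large data sets the job may run for a very long time without finishing and can easily bring the machine down.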

Solutions

Method 1: enable strict mode to disallow order by without limit

set hive.mapred.mode = strict;

In strict mode, a query that uses order by must also use limit to cap the number of rows returned.

select * from ods_xxx_xxx order by create_date desc limit 100;

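Method 2: use sort by instead of order by

order by sorts globally, so it runs with a single reducer; the default order is ascending (asc).
sort by sorts within each reducer, so it does not produce a global ordering; the default order is also ascending. Because each reducer sorts only its own rows, sort by is usually combined with distribute by, and distribute by must be written before sort by.
If mapred.reduce.tasks=1, the result is the same as with order by. If it is greater than 1, the output is split into several files, each sorted by the specified columns, with no guarantee of global order.
sort by is not affected by whether hive.mapred.mode is set to strict or nonstrict.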
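
A minimal sketch of combining distribute by with sort by, reusing view1 and its columns from the deduplication example. The reducer count of 4 is an assumption chosen only for illustration.

```sql
-- Assumed reducer count, for illustration only.
SET mapred.reduce.tasks = 4;

SELECT id, month, flag
FROM view1
DISTRIBUTE BY id          -- rows with the same id go to the same reducer
SORT BY id, month DESC;   -- each reducer sorts only the rows it received
```

Each output file is then sorted by id and month, but the files taken together are not globally ordered. When the distribute and sort columns are identical and ascending, cluster by is a shorthand for the pair.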
