共计 6140 个字符,预计需要花费 16 分钟才能阅读完成。
日常使用的数据查询工具都是hive,了解一下hive sql执行顺序,其实与mysql之间的差异不是太大,参考mysql的执行顺序,hive的执行顺序如下
- from
- on
- join
- where
- group by
- having
- select
- distinct
- union
- order by
举例
explainselectcity,ad_type,device,sum(cnt)ascntfromtb_pmp_raw_log_basic_analysiswhereday=’2016-05-28’andtype =0andmedia =’sohu’and(deal_id =”ordeal_id =’-‘ordeal_idisNULL)groupbycity,ad_type,device
对应的stage任务如下
STAGE DEPENDENCIES:
Stage-1is a root stage
Stage-0is a root stage
STAGE PLANS:
Stage: Stage-1
MapReduce
MapOperator Tree:
TableScan
alias: tb_pmp_raw_log_basic_analysis
Statistics: Numrows:8195357Datasize:580058024Basic stats: COMPLETE Column stats:NONE
Filter Operator
predicate: (((deal_id=”)or(deal_id=’-‘))ordeal_id isnull) (type: boolean)
Statistics: Numrows:8195357Datasize:580058024Basic stats: COMPLETE Column stats:NONE
SelectOperator
expressions: city (type:string), ad_type (type:string), device (type:string), cnt (type: bigint)
outputColumnNames: city, ad_type, device, cnt
Statistics: Numrows:8195357Datasize:580058024Basic stats: COMPLETE Column stats:NONE
GroupByOperator
aggregations:sum(cnt)
keys: city (type:string), ad_type (type:string), device (type:string)
mode: hash
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Numrows:8195357Datasize:580058024Basic stats: COMPLETE Column stats:NONE
Reduce Output Operator
key expressions: _col0 (type:string), _col1 (type:string), _col2 (type:string)
sortorder:+++
Map-reducepartition columns: _col0 (type:string), _col1 (type:string), _col2 (type:string)
Statistics: Numrows:8195357Datasize:580058024Basic stats: COMPLETE Column stats:NONE
value expressions: _col3 (type: bigint)
Reduce Operator Tree:
GroupByOperator
aggregations:sum(VALUE._col0)
keys: KEY._col0 (type:string), KEY._col1 (type:string), KEY._col2 (type:string)
mode: mergepartial
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Numrows:4097678Datasize:290028976Basic stats: COMPLETE Column stats:NONE
SelectOperator
expressions: _col0 (type:string), _col1 (type:string), _col2 (type:string), _col3 (type: bigint)
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Numrows:4097678Datasize:290028976Basic stats: COMPLETE Column stats:NONE
File Output Operator
compressed:false
Statistics: Numrows:4097678Datasize:290028976Basic stats: COMPLETE Column stats:NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit:-1
具体介绍如下
**stage1的map阶段**
TableScan:from加载表,描述中有行数和大小等
Filter Operator:where过滤条件筛选数据,描述有具体筛选条件和行数、大小等
Select Operator:筛选列,描述中有列名、类型,输出类型、大小等。
Group By Operator:分组,描述了分组后需要计算的函数,keys描述用于分组的列,outputColumnNames为输出的列名,可以看出列默认使用固定的别名_col0,以及其他信息
Reduce Output Operator:map端本地的reduce,进行本地的计算,然后按列映射到对应的reduce
**stage1的reduce阶段Reduce Operator Tree**
Group By Operator:总体分组,并按函数计算。map计算后的结果在reduce端的合并。描述类似。mode: mergepartial是说合并map的计算结果。map端是hash映射分组
Select Operator:最后过滤列用于输出结果
File Output Operator:输出结果到临时文件中,描述介绍了压缩格式、输出文件格式。
stage0第二阶段没有,这里可以实现limit 100的操作
一些填坑点相关的记录
(1)join on where的使用
- 对于inner join 不会有影响
- 对于外连接有影响
a. WHERE 操作是对在join之后操作
b. ON操作是在join之前
举例说明: 下面给出相应的表数据
1.documents:
|id |name |
——–|————-|
|1 |Document1 |
|2 |Document2 |
|3 |Document3 |
|4 |Document4 |
|5 |Document5 |
2.downloads:
|id |document_id |username|
|——|—————|———-|
|1 |1 |sandeep |
|2 |1 |simi |
|3 |2 |sandeep |
|4 |2 |reya |
|5 |3 |simi |
a) 使用where操作
由于是在jion之后才进行筛选操作,使用left join操作实际上会产生7条数据,由于ID在downloads表中没有4,5,所以对应的数据都是NULL
此时where的条件筛选的作用出来了,只有username=‘sandeep’的数据才会被保留,因此符合这个条件的只有两条数据
SELECTdocuments.name,downloads.id
FROMdocuments
LEFTOUTERJOINdownloads
ONdocuments.id=downloads.document_id
WHEREusername=’sandeep’
Forabove query the intermediatejointablewill looklikethis.
|id(fromdocuments)|name |id(fromdownloads)|document_id|username|
|——————–|————–|———————|————-|———-|
|1 |Document1 |1 |1 |sandeep |
|1 |Document1 |2 |1 |simi |
|2 |Document2 |3 |2 |sandeep |
|2 |Document2 |4 |2 |reya |
|3 |Document3 |5 |3 |simi |
|4 |Document4 |NULL |NULL |NULL |
|5 |Document5 |NULL |NULL |NULL |
After applying the`WHERE`clauseandselecting the listed attributes,the result will be:
|name |id|
|————–|—-|
|Document1 |1 |
|Document2 |3 |
b) 使用on,是在join之前就参与条件筛选
SELECTdocuments.name,downloads.id
FROMdocuments
LEFTOUTERJOINdownloads
ONdocuments.id=downloads.document_id
ANDusername=’sandeep’
在没有select之前进行获取的左连接的数据如下所示:
|id(fromdocuments)|name |id(fromdownloads)|document_id|username|
|——————–|————–|———————|————-|———-|
|1 |Document1 |1 |1 |sandeep |
|2 |Document2 |3 |2 |sandeep |
|3 |Document3 |NULL |NULL |NULL |
|4 |Document4 |NULL |NULL |NULL |
|5 |Document5 |NULL |NULL |NULL |
上面数据表中没有满足两个条件的输出的数据都是null,只有满足两个条件的数据才会输出实际的数据结果,比如前两个数据
最终输出的结果数据如下所示:
|name |id |
|————|——|
| Document1|1 |
| Document2|3 |
| Document3|NULL|
| Document4|NULL|
| Document5|NULL|
正文完
请博主喝杯咖啡吧!