hive sql执行顺序以及相关例子

5,260次阅读

共计 6140 个字符，预计需要花费 16 分钟才能阅读完成。

日常使用的数据查询工具都是hive，了解一下hive sql执行顺序，其实与mysql之间的差异不是太大，参考mysql的执行顺序，hive的执行顺序如下

from
on
join
where
group by
having
select
distinct
union
order by

举例

explainselectcity,ad_type,device,sum(cnt)ascntfromtb_pmp_raw_log_basic_analysiswhereday=’2016-05-28’andtype =0andmedia =’sohu’and(deal_id =”ordeal_id =’-‘ordeal_idisNULL)groupbycity,ad_type,device

对应的stage任务如下

STAGE DEPENDENCIES:

Stage-1is a root stage

Stage-0is a root stage

STAGE PLANS:

Stage: Stage-1

MapReduce

MapOperator Tree:

TableScan

alias: tb_pmp_raw_log_basic_analysis

Statistics: Numrows:8195357Datasize:580058024Basic stats: COMPLETE Column stats:NONE

Filter Operator

predicate: (((deal_id=”)or(deal_id=’-‘))ordeal_id isnull) (type: boolean)

Statistics: Numrows:8195357Datasize:580058024Basic stats: COMPLETE Column stats:NONE

SelectOperator

expressions: city (type:string), ad_type (type:string), device (type:string), cnt (type: bigint)

outputColumnNames: city, ad_type, device, cnt

Statistics: Numrows:8195357Datasize:580058024Basic stats: COMPLETE Column stats:NONE

GroupByOperator

aggregations:sum(cnt)

keys: city (type:string), ad_type (type:string), device (type:string)

mode: hash

outputColumnNames: _col0, _col1, _col2, _col3

Statistics: Numrows:8195357Datasize:580058024Basic stats: COMPLETE Column stats:NONE

Reduce Output Operator

key expressions: _col0 (type:string), _col1 (type:string), _col2 (type:string)

sortorder:+++

Map-reducepartition columns: _col0 (type:string), _col1 (type:string), _col2 (type:string)

Statistics: Numrows:8195357Datasize:580058024Basic stats: COMPLETE Column stats:NONE

value expressions: _col3 (type: bigint)

Reduce Operator Tree:

GroupByOperator

aggregations:sum(VALUE._col0)

keys: KEY._col0 (type:string), KEY._col1 (type:string), KEY._col2 (type:string)

mode: mergepartial

outputColumnNames: _col0, _col1, _col2, _col3

Statistics: Numrows:4097678Datasize:290028976Basic stats: COMPLETE Column stats:NONE

SelectOperator

expressions: _col0 (type:string), _col1 (type:string), _col2 (type:string), _col3 (type: bigint)

outputColumnNames: _col0, _col1, _col2, _col3

Statistics: Numrows:4097678Datasize:290028976Basic stats: COMPLETE Column stats:NONE

File Output Operator

compressed:false

Statistics: Numrows:4097678Datasize:290028976Basic stats: COMPLETE Column stats:NONE

table:

input format: org.apache.hadoop.mapred.TextInputFormat

output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

Stage: Stage-0

Fetch Operator

limit:-1

具体介绍如下

**stage1的map阶段**

TableScan：from加载表，描述中有行数和大小等

Filter Operator：where过滤条件筛选数据，描述有具体筛选条件和行数、大小等

Select Operator：筛选列，描述中有列名、类型，输出类型、大小等。

Group By Operator：分组，描述了分组后需要计算的函数，keys描述用于分组的列，outputColumnNames为输出的列名，可以看出列默认使用固定的别名_col0，以及其他信息

Reduce Output Operator：map端本地的reduce，进行本地的计算，然后按列映射到对应的reduce

**stage1的reduce阶段Reduce Operator Tree**

Group By Operator：总体分组，并按函数计算。map计算后的结果在reduce端的合并。描述类似。mode: mergepartial是说合并map的计算结果。map端是hash映射分组

Select Operator：最后过滤列用于输出结果

File Output Operator：输出结果到临时文件中，描述介绍了压缩格式、输出文件格式。

stage0第二阶段没有，这里可以实现limit 100的操作

一些填坑点相关的记录

（1）join on where的使用

对于inner join 不会有影响
对于外连接有影响
a. WHERE 操作是对在join之后操作
b. ON操作是在join之前

举例说明: 下面给出相应的表数据

1.documents:

|id |name |

——–|————-|

|1 |Document1 |

|2 |Document2 |

|3 |Document3 |

|4 |Document4 |

|5 |Document5 |

2.downloads:

|id |document_id |username|

|——|—————|———-|

|1 |1 |sandeep |

|2 |1 |simi |

|3 |2 |sandeep |

|4 |2 |reya |

|5 |3 |simi |

a) 使用where操作

由于是在jion之后才进行筛选操作，使用left join操作实际上会产生7条数据，由于ID在downloads表中没有4，5，所以对应的数据都是NULL

此时where的条件筛选的作用出来了，只有username=‘sandeep’的数据才会被保留，因此符合这个条件的只有两条数据

SELECTdocuments.name,downloads.id

FROMdocuments

LEFTOUTERJOINdownloads

ONdocuments.id=downloads.document_id

WHEREusername=’sandeep’

Forabove query the intermediatejointablewill looklikethis.

|id(fromdocuments)|name |id(fromdownloads)|document_id|username|

|——————–|————–|———————|————-|———-|

|1 |Document1 |1 |1 |sandeep |

|1 |Document1 |2 |1 |simi |

|2 |Document2 |3 |2 |sandeep |

|2 |Document2 |4 |2 |reya |

|3 |Document3 |5 |3 |simi |

|4 |Document4 |NULL |NULL |NULL |

|5 |Document5 |NULL |NULL |NULL |

After applying the`WHERE`clauseandselecting the listed attributes,the result will be:

|name |id|

|————–|—-|

|Document1 |1 |

|Document2 |3 |

b) 使用on，是在join之前就参与条件筛选

SELECTdocuments.name,downloads.id

FROMdocuments

LEFTOUTERJOINdownloads

ONdocuments.id=downloads.document_id

ANDusername=’sandeep’

在没有select之前进行获取的左连接的数据如下所示：

|id(fromdocuments)|name |id(fromdownloads)|document_id|username|

|——————–|————–|———————|————-|———-|

|1 |Document1 |1 |1 |sandeep |

|2 |Document2 |3 |2 |sandeep |

|3 |Document3 |NULL |NULL |NULL |

|4 |Document4 |NULL |NULL |NULL |

|5 |Document5 |NULL |NULL |NULL |

上面数据表中没有满足两个条件的输出的数据都是null，只有满足两个条件的数据才会输出实际的数据结果，比如前两个数据

最终输出的结果数据如下所示：

|name |id |

|————|——|

| Document1|1 |

| Document2|3 |

| Document3|NULL|

| Document4|NULL|

| Document5|NULL|

正文完

请博主喝杯咖啡吧！

post-qrcode

发表至： Sql

2017-12-27

版权声明：本站原创文章，由 admin 2017-12-27发表，共计6140字。

转载说明：除特殊说明外本站文章皆由CC-4.0协议发布，转载请注明出处。

hive传递参数变量方法

mongodb误操作查数据表操作日志

MySQL必会基础命令

一次小失误记录sql count

cart树ccp剪枝详细介绍

评论（没有评论）

验证码