On this page
Hive 外部联接行为
本文档基于DB2 外部联接行为的撰写。原始 HTML 文档附在Hive 设计文档上,可以是downloaded here。
Definitions
保留的行表 | 外部联接中的表必须返回所有行。 |
对于左外部联接,这是* Left 表;对于右外部联接,这是 Right 表;对于完全外部联接,两个表都是 Preserved Row 表。
| Null Supplying table |这是在不匹配的行中为其列填充空值的表。
在非完全外部联接的情况下,这是联接中的另一个表。对于完全外部联接,两个表也都是 Null Supplying 表。
| 在连接谓词期间 |在 JOIN ON 子句中的谓词。
例如,在“ R1 join R2 on R1.x = 5
”中,谓词“ R1.x = 5
”是 During Join 谓词*。
| 连接谓词之后 | WHERE 子句中的谓词。
谓词下推规则
可以通过以下两个规则来总结逻辑:
在连接谓词期间 不能被推过 保留行表 。
联接谓词之后 不能被推过 空供应表 。
下表中捕获了这一点:
保留行表 | 空供应表 | |
---|---|---|
Join | ||
Predicate |
Case J1: 未推送 |
案例 J2: Pushed |
Where Predicate |
Case W1: Pushed |
Case W2: Not Pushed |
有关案例 J1,J2,W1 和 W2 的图示,请参见下面的Examples。
Hive Implementation
Hive 通过 SemanticAnalyzer 和 JoinPPD 类中的以下方法来执行规则:
规则 1:在 Plan Gen 中进行 QBJoinTree 构造期间,parseJoinCondition()
逻辑将应用此规则。
规则 2:在 JoinPPD **(加入谓词下推)期间,getQualifiedAliases()
逻辑将应用此规则。
Examples
给定 Src(Key String,Value String),下面的 Left Outer Join 示例显示 Hive 具有正确的行为。
情况 J1:保留行表上的联接谓词
explain
select s1.key, s2.key
from src s1 left join src s2 on s1.key > '2';
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
s1
TableScan
alias: s1
Reduce Output Operator
sort order:
tag: 0
value expressions:
expr: key
type: string
s2
TableScan
alias: s2
Reduce Output Operator
sort order:
tag: 1
value expressions:
expr: key
type: string
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {VALUE._col0}
1 {VALUE._col0}
filter predicates:
0 {(VALUE._col0 > '2')}
1
handleSkewJoin: false
outputColumnNames: _col0, _col4
Select Operator
expressions:
expr: _col0
type: string
expr: _col4
type: string
outputColumnNames: _col0, _col1
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
情况 J2:空供应表上的联接谓词
explain
select s1.key, s2.key
from src s1 left join src s2 on s2.key > '2';
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
s1
TableScan
alias: s1
Reduce Output Operator
sort order:
tag: 0
value expressions:
expr: key
type: string
s2
TableScan
alias: s2
Filter Operator
predicate:
expr: (key > '2')
type: boolean
Reduce Output Operator
sort order:
tag: 1
value expressions:
expr: key
type: string
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {VALUE._col0}
1 {VALUE._col0}
handleSkewJoin: false
outputColumnNames: _col0, _col4
Select Operator
expressions:
expr: _col0
type: string
expr: _col4
type: string
outputColumnNames: _col0, _col1
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
案例 W1:在保留行表上的谓词
explain
select s1.key, s2.key
from src s1 left join src s2
where s1.key > '2';
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
s1
TableScan
alias: s1
Filter Operator
predicate:
expr: (key > '2')
type: boolean
Reduce Output Operator
sort order:
tag: 0
value expressions:
expr: key
type: string
s2
TableScan
alias: s2
Reduce Output Operator
sort order:
tag: 1
value expressions:
expr: key
type: string
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {VALUE._col0}
1 {VALUE._col0}
handleSkewJoin: false
outputColumnNames: _col0, _col4
Select Operator
expressions:
expr: _col0
type: string
expr: _col4
type: string
outputColumnNames: _col0, _col1
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
案例 W2:空供应表上的谓词在哪里
explain
select s1.key, s2.key
from src s1 left join src s2
where s2.key > '2';
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
s1
TableScan
alias: s1
Reduce Output Operator
sort order:
tag: 0
value expressions:
expr: key
type: string
s2
TableScan
alias: s2
Reduce Output Operator
sort order:
tag: 1
value expressions:
expr: key
type: string
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {VALUE._col0}
1 {VALUE._col0}
handleSkewJoin: false
outputColumnNames: _col0, _col4
Filter Operator
predicate:
expr: (_col4 > '2')
type: boolean
Select Operator
expressions:
expr: _col0
type: string
expr: _col4
type: string
outputColumnNames: _col0, _col1
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1