Hive 外部联接行为

本文档基于DB2 外部联接行为的撰写。原始 HTML 文档附在Hive 设计文档上,可以是downloaded here

Definitions

保留的行表外部联接中的表必须返回所有行。


对于左外部联接,这是* Left 表;对于右外部联接,这是 Right 表;对于完全外部联接,两个表都是 Preserved Row 表。
| Null Supplying table |这是在不匹配的行中为其列填充空值的表。
在非完全外部联接的情况下,这是联接中的另一个表。对于完全外部联接,两个表也都是
Null Supplying 表。
| 在连接谓词期间 |在 JOIN ON 子句中的谓词。
例如,在“ R1 join R2 on R1.x = 5”中,谓词“ R1.x = 5”是
During Join 谓词*。
| 连接谓词之后 | WHERE 子句中的谓词。

谓词下推规则

可以通过以下两个规则来总结逻辑:

  • 在连接谓词期间 不能被推过 保留行表

  • 联接谓词之后 不能被推过 空供应表

下表中捕获了这一点:

保留行表空供应表
Join

Predicate
Case J1:
未推送
案例 J2:
Pushed
Where
Predicate
Case W1:
Pushed
Case W2:
Not Pushed

有关案例 J1,J2,W1 和 W2 的图示,请参见下面的Examples

Hive Implementation

Hive 通过 SemanticAnalyzer 和 JoinPPD 类中的以下方法来执行规则:

规则 1:在 Plan Gen 中进行 QBJoinTree 构造期间,parseJoinCondition()逻辑将应用此规则。
规则 2:在 JoinPPD **(加入谓词下推)期间,getQualifiedAliases()逻辑将应用此规则。

Examples

给定 Src(Key String,Value String),下面的 Left Outer Join 示例显示 Hive 具有正确的行为。

情况 J1:保留行表上的联接谓词

explain 
select s1.key, s2.key 
from src s1 left join src s2 on s1.key > '2';

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
	s1
	  TableScan
	    alias: s1
	    Reduce Output Operator
	      sort order:
	      tag: 0
	      value expressions:
		    expr: key
		    type: string
	s2
	  TableScan
	    alias: s2
	    Reduce Output Operator
	      sort order:
	      tag: 1
	      value expressions:
		    expr: key
		    type: string
      Reduce Operator Tree:
	Join Operator
	  condition map:
	       Left Outer Join0 to 1
	  condition expressions:
	    0 {VALUE._col0}
	    1 {VALUE._col0}
	  filter predicates:
	    0 {(VALUE._col0 > '2')}
	    1
	  handleSkewJoin: false
	  outputColumnNames: _col0, _col4
	  Select Operator
	    expressions:
		  expr: _col0
		  type: string
		  expr: _col4
		  type: string
	    outputColumnNames: _col0, _col1
	    File Output Operator
	      compressed: false
	      GlobalTableId: 0
	      table:
		  input format: org.apache.hadoop.mapred.TextInputFormat
		  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
		  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1

情况 J2:空供应表上的联接谓词

explain 
select s1.key, s2.key 
from src s1 left join src s2 on s2.key > '2';

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
	s1
	  TableScan
	    alias: s1
	    Reduce Output Operator
	      sort order:
	      tag: 0
	      value expressions:
		    expr: key
		    type: string
	s2
	  TableScan
	    alias: s2
	    Filter Operator
	      predicate:
		  expr: (key > '2')
		  type: boolean
	      Reduce Output Operator
		sort order:
		tag: 1
		value expressions:
		      expr: key
		      type: string
      Reduce Operator Tree:
	Join Operator
	  condition map:
	       Left Outer Join0 to 1
	  condition expressions:
	    0 {VALUE._col0}
	    1 {VALUE._col0}
	  handleSkewJoin: false
	  outputColumnNames: _col0, _col4
	  Select Operator
	    expressions:
		  expr: _col0
		  type: string
		  expr: _col4
		  type: string
	    outputColumnNames: _col0, _col1
	    File Output Operator
	      compressed: false
	      GlobalTableId: 0
	      table:
		  input format: org.apache.hadoop.mapred.TextInputFormat
		  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
		  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1

案例 W1:在保留行表上的谓词

explain 
select s1.key, s2.key 
from src s1 left join src s2 
where s1.key > '2';

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
	s1
	  TableScan
	    alias: s1
	    Filter Operator
	      predicate:
		  expr: (key > '2')
		  type: boolean
	      Reduce Output Operator
		sort order:
		tag: 0
		value expressions:
		      expr: key
		      type: string
	s2
	  TableScan
	    alias: s2
	    Reduce Output Operator
	      sort order:
	      tag: 1
	      value expressions:
		    expr: key
		    type: string
      Reduce Operator Tree:
	Join Operator
	  condition map:
	       Left Outer Join0 to 1
	  condition expressions:
	    0 {VALUE._col0}
	    1 {VALUE._col0}
	  handleSkewJoin: false
	  outputColumnNames: _col0, _col4
	  Select Operator
	    expressions:
		  expr: _col0
		  type: string
		  expr: _col4
		  type: string
	    outputColumnNames: _col0, _col1
	    File Output Operator
	      compressed: false
	      GlobalTableId: 0
	      table:
		  input format: org.apache.hadoop.mapred.TextInputFormat
		  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
		  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1

案例 W2:空供应表上的谓词在哪里

explain
select s1.key, s2.key 
from src s1 left join src s2 
where s2.key > '2';

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
	s1
	  TableScan
	    alias: s1
	    Reduce Output Operator
	      sort order:
	      tag: 0
	      value expressions:
		    expr: key
		    type: string
	s2
	  TableScan
	    alias: s2
	    Reduce Output Operator
	      sort order:
	      tag: 1
	      value expressions:
		    expr: key
		    type: string
      Reduce Operator Tree:
	Join Operator
	  condition map:
	       Left Outer Join0 to 1
	  condition expressions:
	    0 {VALUE._col0}
	    1 {VALUE._col0}
	  handleSkewJoin: false
	  outputColumnNames: _col0, _col4
	  Filter Operator
	    predicate:
		expr: (_col4 > '2')
		type: boolean
	    Select Operator
	      expressions:
		    expr: _col0
		    type: string
		    expr: _col4
		    type: string
	      outputColumnNames: _col0, _col1
	      File Output Operator
		compressed: false
		GlobalTableId: 0
		table:
		    input format: org.apache.hadoop.mapred.TextInputFormat
		    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
		    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1