Hive 外部联接行为

本文档基于DB2 外部联接行为的撰写。原始 HTML 文档附在Hive 设计文档上，可以是downloaded here。

Definitions

保留的行表	外部联接中的表必须返回所有行。

对于左外部联接，这是* Left 表；对于右外部联接，这是 Right 表；对于完全外部联接，两个表都是 Preserved Row 表。
| Null Supplying table |这是在不匹配的行中为其列填充空值的表。
在非完全外部联接的情况下，这是联接中的另一个表。对于完全外部联接，两个表也都是 Null Supplying 表。
| 在连接谓词期间 |在 JOIN ON 子句中的谓词。
例如，在“ R1 join R2 on R1.x = 5”中，谓词“ R1.x = 5”是 During Join 谓词*。
| 连接谓词之后 | WHERE 子句中的谓词。

谓词下推规则

可以通过以下两个规则来总结逻辑：

在连接谓词期间 不能被推过 保留行表 。
联接谓词之后 不能被推过 空供应表 。

下表中捕获了这一点：

	保留行表	空供应表
Join
Predicate	Case J1: 未推送	案例 J2： Pushed
Where Predicate	Case W1: Pushed	Case W2: Not Pushed

有关案例 J1，J2，W1 和 W2 的图示，请参见下面的Examples。

Hive Implementation

Hive 通过 SemanticAnalyzer 和 JoinPPD 类中的以下方法来执行规则：

规则 1：在 Plan Gen 中进行 QBJoinTree 构造期间，parseJoinCondition()逻辑将应用此规则。
规则 2：在 JoinPPD **(加入谓词下推)期间，getQualifiedAliases()逻辑将应用此规则。

Examples

给定 Src(Key String，Value String)，下面的 Left Outer Join 示例显示 Hive 具有正确的行为。

情况 J1：保留行表上的联接谓词

explain 
select s1.key, s2.key 
from src s1 left join src s2 on s1.key > '2';

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
	s1
	  TableScan
	    alias: s1
	    Reduce Output Operator
	      sort order:
	      tag: 0
	      value expressions:
		    expr: key
		    type: string
	s2
	  TableScan
	    alias: s2
	    Reduce Output Operator
	      sort order:
	      tag: 1
	      value expressions:
		    expr: key
		    type: string
      Reduce Operator Tree:
	Join Operator
	  condition map:
	       Left Outer Join0 to 1
	  condition expressions:
	    0 {VALUE._col0}
	    1 {VALUE._col0}
	  filter predicates:
	    0 {(VALUE._col0 > '2')}
	    1
	  handleSkewJoin: false
	  outputColumnNames: _col0, _col4
	  Select Operator
	    expressions:
		  expr: _col0
		  type: string
		  expr: _col4
		  type: string
	    outputColumnNames: _col0, _col1
	    File Output Operator
	      compressed: false
	      GlobalTableId: 0
	      table:
		  input format: org.apache.hadoop.mapred.TextInputFormat
		  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
		  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1

情况 J2：空供应表上的联接谓词

explain 
select s1.key, s2.key 
from src s1 left join src s2 on s2.key > '2';

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
	s1
	  TableScan
	    alias: s1
	    Reduce Output Operator
	      sort order:
	      tag: 0
	      value expressions:
		    expr: key
		    type: string
	s2
	  TableScan
	    alias: s2
	    Filter Operator
	      predicate:
		  expr: (key > '2')
		  type: boolean
	      Reduce Output Operator
		sort order:
		tag: 1
		value expressions:
		      expr: key
		      type: string
      Reduce Operator Tree:
	Join Operator
	  condition map:
	       Left Outer Join0 to 1
	  condition expressions:
	    0 {VALUE._col0}
	    1 {VALUE._col0}
	  handleSkewJoin: false
	  outputColumnNames: _col0, _col4
	  Select Operator
	    expressions:
		  expr: _col0
		  type: string
		  expr: _col4
		  type: string
	    outputColumnNames: _col0, _col1
	    File Output Operator
	      compressed: false
	      GlobalTableId: 0
	      table:
		  input format: org.apache.hadoop.mapred.TextInputFormat
		  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
		  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1

案例 W1：在保留行表上的谓词

explain 
select s1.key, s2.key 
from src s1 left join src s2 
where s1.key > '2';

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
	s1
	  TableScan
	    alias: s1
	    Filter Operator
	      predicate:
		  expr: (key > '2')
		  type: boolean
	      Reduce Output Operator
		sort order:
		tag: 0
		value expressions:
		      expr: key
		      type: string
	s2
	  TableScan
	    alias: s2
	    Reduce Output Operator
	      sort order:
	      tag: 1
	      value expressions:
		    expr: key
		    type: string
      Reduce Operator Tree:
	Join Operator
	  condition map:
	       Left Outer Join0 to 1
	  condition expressions:
	    0 {VALUE._col0}
	    1 {VALUE._col0}
	  handleSkewJoin: false
	  outputColumnNames: _col0, _col4
	  Select Operator
	    expressions:
		  expr: _col0
		  type: string
		  expr: _col4
		  type: string
	    outputColumnNames: _col0, _col1
	    File Output Operator
	      compressed: false
	      GlobalTableId: 0
	      table:
		  input format: org.apache.hadoop.mapred.TextInputFormat
		  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
		  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1

案例 W2：空供应表上的谓词在哪里

explain
select s1.key, s2.key 
from src s1 left join src s2 
where s2.key > '2';

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
	s1
	  TableScan
	    alias: s1
	    Reduce Output Operator
	      sort order:
	      tag: 0
	      value expressions:
		    expr: key
		    type: string
	s2
	  TableScan
	    alias: s2
	    Reduce Output Operator
	      sort order:
	      tag: 1
	      value expressions:
		    expr: key
		    type: string
      Reduce Operator Tree:
	Join Operator
	  condition map:
	       Left Outer Join0 to 1
	  condition expressions:
	    0 {VALUE._col0}
	    1 {VALUE._col0}
	  handleSkewJoin: false
	  outputColumnNames: _col0, _col4
	  Filter Operator
	    predicate:
		expr: (_col4 > '2')
		type: boolean
	    Select Operator
	      expressions:
		    expr: _col0
		    type: string
		    expr: _col4
		    type: string
	      outputColumnNames: _col0, _col1
	      File Output Operator
		compressed: false
		GlobalTableId: 0
		table:
		    input format: org.apache.hadoop.mapred.TextInputFormat
		    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
		    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1

Docs

Docs4dev

Title here