Dynamic Partitioning

Overview

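When writing data in HCatalog, it is possible to write all records to a single partition. In this case the partition columns need not be in the output data.

The following Pig script illustrates this: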

A = load 'raw' using HCatLoader(); 
... 
split Z into for_us if region == 'us', for_eu if region == 'eu', for_asia if region == 'asia';
store for_us into 'processed' using HCatStorer("ds=20110110, region=us"); 
store for_eu into 'processed' using HCatStorer("ds=20110110, region=eu"); 
store for_asia into 'processed' using HCatStorer("ds=20110110, region=asia");

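To write data to multiple partitions simultaneously, place the partition columns in the data and do not specify partition values when storing the data.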

A = load 'raw' using HCatLoader(); 
... 
store Z into 'processed' using HCatStorer();

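Dynamic partitioning works as follows: HCatalog locates the partition columns in the data passed to it and uses the values in these columns to split the rows across multiple partitions. (The data passed to HCatalog must have a schema that matches the schema of the destination table, and hence should always contain the partition columns.) Note that partition columns cannot contain null values; if they do, the whole process fails.

Note also that all partitions created during a single run are part of one transaction. Therefore, if any part of the process fails, none of the partitions are added to the table.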
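
The behavior described above can be sketched in plain Java. This is only an illustration of the idea, not HCatalog's implementation: the class name DynamicPartitionSketch, the helper row(), and the representation of rows as maps are all assumptions made for the example. Rows are grouped by the values in their partition columns, and a null partition value aborts the whole call, so no partial result is returned.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: this is NOT HCatalog code. All names are hypothetical.
final class DynamicPartitionSketch {

    // Groups rows by the values found in their partition columns.
    // Throws before returning anything if a partition column is missing or null,
    // so the caller never receives a partial set of partitions.
    static Map<List<String>, List<Map<String, String>>> splitByPartition(
            List<Map<String, String>> rows, List<String> partitionColumns) {
        Map<List<String>, List<Map<String, String>>> partitions = new LinkedHashMap<>();
        for (Map<String, String> row : rows) {
            List<String> key = new ArrayList<>();
            for (String column : partitionColumns) {
                String value = row.get(column);
                if (value == null) {
                    throw new IllegalArgumentException(
                            "Partition column '" + column + "' is null in row " + row);
                }
                key.add(value);
            }
            partitions.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
        }
        return partitions;
    }

    // Hypothetical helper that builds a row with two partition columns and one data column.
    static Map<String, String> row(String ds, String region, String payload) {
        Map<String, String> r = new LinkedHashMap<>();
        r.put("ds", ds);
        r.put("region", region);
        r.put("payload", payload);
        return r;
    }

    public static void main(String[] args) {
        List<Map<String, String>> rows = Arrays.asList(
                row("20110110", "us", "x"),
                row("20110110", "eu", "y"),
                row("20110110", "us", "z"));
        Map<List<String>, List<Map<String, String>>> result =
                splitByPartition(rows, Arrays.asList("ds", "region"));
        // Prints one line per partition, e.g. "[20110110, us] -> 2 row(s)".
        for (Map.Entry<List<String>, List<Map<String, String>>> e : result.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().size() + " row(s)");
        }
    }
}
```

In the sketch, three input rows yield two partitions, (ds=20110110, region=us) and (ds=20110110, region=eu), from a single call, which mirrors the single store statement shown earlier.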

External Tables

Version

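This section describes changes to dynamic partitioning for external tables that occurred in HCatalog 0.5, 0.12, and 0.13.

Starting in HCatalog 0.5, dynamic partitioning on external tables was broken (HCATALOG-500). This was fixed in Hive 0.12.0 by creating the dynamic partitions of external tables in locations based on metadata rather than on user specifications (HIVE-5011). Starting in Hive 0.13.0, users can customize the locations by specifying a path pattern in the job configuration property hcat.dynamic.partitioning.custom.pattern (HIVE-6109). Static partitions of external tables can have user-specified locations in all Hive releases.

For example, in Hive 0.12.0, if a table named user_logs is partitioned by (year, month, day, hour, minute, country) and stored in the external location "hdfs://hcat/data/user_logs", then the locations of its dynamic partitions have the standard Hive format, which includes keys as well as values, such as: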

hdfs://hcat/data/user_logs/year=2013/month=12/day=21/hour=06/minute=10/country=US
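
As a rough illustration of that key=value layout, the following Java sketch builds such a path from an ordered map of partition values. The class DefaultLayoutSketch and its method are hypothetical names chosen for this example, and the sketch ignores details such as escaping of special characters in values; it is not the code Hive uses.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; the class and method names are hypothetical.
final class DefaultLayoutSketch {

    // Appends one "key=value" directory per partition column, in column order.
    static String partitionPath(String tableLocation, Map<String, String> partitionValues) {
        StringBuilder path = new StringBuilder(tableLocation);
        for (Map.Entry<String, String> e : partitionValues.entrySet()) {
            path.append('/').append(e.getKey()).append('=').append(e.getValue());
        }
        return path.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the partition columns in declaration order.
        Map<String, String> values = new LinkedHashMap<>();
        values.put("year", "2013");
        values.put("month", "12");
        values.put("day", "21");
        values.put("hour", "06");
        values.put("minute", "10");
        values.put("country", "US");
        System.out.println(partitionPath("hdfs://hcat/data/user_logs", values));
    }
}
```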

In Hive 0.13.0 and later, hcat.dynamic.partitioning.custom.pattern can be configured to customize the path pattern. For example, the pattern "${year}/${month}/${day}/${hour}/${minute}/${country}" omits the keys from the path:

hdfs://hcat/data/user_logs/2013/12/21/06/10/US

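Each dynamic partition column must be present in the custom location path in the format ${column_name}, and the custom location path must include all dynamic partition columns. Other valid custom path strings include: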

data/${year}/${month}/${day}/${country}
${year}-${month}-${day}/country=${country}
output/yr=${year}/mon=${month}/day=${day}/geo=${country}
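
The placeholder rule above can be illustrated with a small Java sketch. It is written under stated assumptions: CustomPatternSketch, missingColumns, and expand are hypothetical names, and the regular expression is just one simple way to find ${column_name} placeholders. It does not reproduce how HCatalog itself parses the property.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: NOT HCatalog's implementation. All names are hypothetical.
final class CustomPatternSketch {

    // Matches placeholders of the form ${column_name}.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    // Returns the dynamic partition columns that have no placeholder in the pattern.
    static List<String> missingColumns(String pattern, List<String> dynamicColumns) {
        List<String> found = new ArrayList<>();
        Matcher m = PLACEHOLDER.matcher(pattern);
        while (m.find()) {
            found.add(m.group(1));
        }
        List<String> missing = new ArrayList<>(dynamicColumns);
        missing.removeAll(found);
        return missing;
    }

    // Substitutes each ${column_name} placeholder with that column's value.
    static String expand(String pattern, Map<String, String> values) {
        Matcher m = PLACEHOLDER.matcher(pattern);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = values.get(m.group(1));
            if (value == null) {
                throw new IllegalArgumentException("No value for column " + m.group(1));
            }
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> values = new LinkedHashMap<>();
        values.put("year", "2013");
        values.put("month", "12");
        values.put("day", "21");
        values.put("country", "US");
        String pattern = "${year}-${month}-${day}/country=${country}";
        // An empty list means every dynamic partition column appears in the pattern.
        System.out.println(missingColumns(pattern, Arrays.asList("year", "month", "day", "country")));
        System.out.println(expand(pattern, values));
    }
}
```

A pattern such as "data/${year}/${month}" checked against the columns (year, month, day) would report day as missing, which corresponds to a pattern the rule above does not allow.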

See also HCatalog Configuration Properties, and the PDF attached to HIVE-6109 for details of the implementation.

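Hive Dynamic Partitions

For information about Hive dynamic partitions, see the Hive design document, tutorial, and DML references listed at the end of this page.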

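Usage with Pig

Usage from Pig is simple. Instead of specifying all partition keys, as a store normally requires, users can specify only the keys that are actually needed. HCatOutputFormat switches to dynamic partitioning when necessary (that is, when a key value is not specified) and inspects the data to write it out appropriately.

So this statement...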

store A into 'mytable' using HCatStorer("a=1, b=1");

...is equivalent to any of the following statements, provided the data contains only rows where a=1 and b=1:

store A into 'mytable' using HCatStorer();

store A into 'mytable' using HCatStorer("a=1");

store A into 'mytable' using HCatStorer("b=1");

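On the other hand, if the data spans more than one partition, HCatOutputFormat automatically determines how to distribute the rows across partitions.

For example, suppose a=1 for all rows in the dataset and b takes the values 1 and 2. Then the following statement...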

store A into 'mytable' using HCatStorer();

...is equivalent to either of the following:

store A into 'mytable' using HCatStorer("a=1");

split A into A1 if b == '1', A2 if b == '2';
store A1 into 'mytable' using HCatStorer("a=1, b=1");
store A2 into 'mytable' using HCatStorer("a=1, b=2");
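
One way to picture the equivalences above is to separate the keys given in the store specification (static) from the table's remaining partition keys (dynamic, resolved from the data). The Java sketch below does this for a table partitioned by (a, b). The names PartitionSpecSketch, parseSpec, and dynamicKeys are hypothetical, and the string parsing is a simplification for illustration rather than HCatStorer's actual parser.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: NOT HCatStorer's parser. All names are hypothetical.
final class PartitionSpecSketch {

    // Parses a specification such as "a=1, b=1" into an ordered map.
    // An empty specification yields an empty map.
    static Map<String, String> parseSpec(String spec) {
        Map<String, String> result = new LinkedHashMap<>();
        if (spec == null || spec.trim().isEmpty()) {
            return result;
        }
        for (String pair : spec.split(",")) {
            String[] kv = pair.trim().split("=", 2);
            if (kv.length != 2 || kv[0].trim().isEmpty()) {
                throw new IllegalArgumentException("Malformed key=value pair: " + pair);
            }
            result.put(kv[0].trim(), kv[1].trim());
        }
        return result;
    }

    // Returns the table partition keys that the specification leaves unspecified;
    // values for these keys would have to come from the data itself.
    static List<String> dynamicKeys(List<String> tableKeys, Map<String, String> staticSpec) {
        List<String> dynamic = new ArrayList<>();
        for (String key : tableKeys) {
            if (!staticSpec.containsKey(key)) {
                dynamic.add(key);
            }
        }
        return dynamic;
    }

    public static void main(String[] args) {
        List<String> tableKeys = Arrays.asList("a", "b");
        for (String spec : Arrays.asList("", "a=1", "b=1", "a=1, b=1")) {
            System.out.println("\"" + spec + "\" -> dynamic keys "
                    + dynamicKeys(tableKeys, parseSpec(spec)));
        }
    }
}
```

With an empty specification both a and b are dynamic, with "a=1" only b is, and with "a=1, b=1" no key is dynamic, which is the fully static case.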

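Usage from MapReduce

As with Pig, the only change that dynamic partitioning brings for a MapReduce programmer is that not all partition key/value combinations have to be specified.

Existing code that writes out a specific partition, here (a=1, b=1), looks something like this: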

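// Static partition specification: every partition key is given a value.
Map<String, String> partitionValues = new HashMap<String, String>();
partitionValues.put("a", "1");
partitionValues.put("b", "1");
// Describe the output table and partition, then register it with the job.
HCatTableInfo info = HCatTableInfo.getOutputTableInfo(dbName, tblName, partitionValues);
HCatOutputFormat.setOutput(job, info);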

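With this approach, writing to multiple partitions requires a separate job for each partition, each set up as above.

With dynamic partitioning, we specify only as many keys as we know about or as are required. HCatalog works out the remaining keys from the data and writes the necessary partitions, so a single job can create multiple partitions.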

Navigation Links

Previous: Storage Formats
Next: Notification

Hive design document: Dynamic Partitions
Hive tutorial: Dynamic-Partition Insert
Hive DML: Dynamic Partition Inserts

General: HCatalog Manual – WebHCat Manual – Hive Wiki Home – Hive Project Site
