Goal

The top level problem is as follows:

There are many tables of the following format:

and the following queries need to be performed efficiently:

The cardinality of 'x' is in 1000's per partition of T. Moreover, there is a skew for the values of 'x'. In general, there are ~10 values of 'x' which have a very large skew, and the remaining values of 'x' have a small cardinality. Also, note that this mapping (values of 'x' with a high cardinality) can change daily.

The above requirement can be solved in the following ways:

Basic Partitioning

Create a partition per value of 'x'.

List Bucketing

The basic idea here is as follows: Identify the keys with a high skew. Have one directory per skewed key, and the remaining keys go into a separate directory. This mapping is maintained in the metastore at a table or partition level, and is used by the Hive compiler to do input pruning. The list of skewed keys is stored at the table level. (Note that initially this list can be supplied by the client periodically, and eventually it can be updated when a new partition is being loaded.)

For example, the table maintains the list of skewed keys for 'x': 6, 20, 30, 40. When a new partition is being loaded, it will create 5 directories (4 directories for skewed keys + 1 default directory for all the remaining keys). The table/partition that got loaded will have the following mapping: 6,20,30,40,others. This is similar to hash bucketing currently, where the bucket number determines the file number. Since the skewed keys need not be consecutive, the entire list of skewed keys need be stored in each table/partition.

When a query of the form

is issued, the Hive compiler will only use the directory corresponding to x=30 for the map-reduce job.

For a query of the form

the Hive compiler will only use the file corresponding to x=others for the map-reduce job.

This approach is good under the following assumptions:

This approach can be extended to the scenario when there are more than one clustered key. Say we want to optimize the queries of the form

A query with all the clustering keys specified can be optimized easily. However, queries with some of the clustering keys specified:

can only be used to prune very few directories. It does not really matter if the prefix of the clustering keys is specified or not. For example for x=10, the Hive compiler can prune the file corresponding to (20, 'c'). And for y='b', the files corresponding to (10, 'a') and (20, 'c') can be pruned. Hashing for others does not really help, when the complete key is not specified.

This approach does not scale in the following scenarios:

Skewed Table vs. List Bucketing Table

A normal skewed table can be used for skewed join, etc. (See the Skewed Join Optimization design document.) You don't need to define it as a list bucketing table if you don't use the list bucketing feature.

List Bucketing Validation

Mainly due to its sub-directory nature, list bucketing can't coexist with some features.

DDL

Compilation error will be thrown if list bucketing table coexists with

DML

Compilation error will be thrown if list bucketing table coexists with

Partitioning value should not be the same as a default list bucketing directory name.

Alter Table Concatenate

Compilation error will be thrown if list bucketing table coexists with

Hive Enhancements

Hive needs to be extended to support the following:

Create Table

CREATE TABLE <T> (SCHEMA) SKEWED BY (keys) ON ('c1', 'c2') [STORED AS DIRECTORIES];

The table will be a skewed table. Skewed information will be created for all partitions.

For example:

'STORED AS DIRECTORIES' is an optional parameter. It tells Hive that it is not only a skewed table but also the list bucketing feature should apply: create sub-directories for skewed values.

Alter Table

Alter Table Skewed

ALTER TABLE <T> (SCHEMA) SKEWED BY  (keys) ON ('c1', 'c2') [STORED AS DIRECTORIES];

The above is supported in table level only and not partition level.

It will

It will impact

Alter Table Not Skewed

ALTER TABLE <T> (SCHEMA) NOT SKEWED;

The above will

It will impact

Alter Table Not Stored as Directories

ALTER TABLE <T> (SCHEMA) NOT STORED AS DIRECTORIES;

The above will

Alter Table Set Skewed Location

ALTER TABLE <T> (SCHEMA) SET SKEWED LOCATION (key1="loc1", key2="loc2");

The above will change the list bucketing location map.

Design

When such a table is being loaded, it would be good to create a sub-directory per skewed key. The infrastructure similar to dynamic partitions can be used.

Alter table <T> partition <P> concatenate; needs to be changed to merge files per directory.

Implementation

Version information

List bucketing was added in Hive 0.10.0 and 0.11.0.

HIVE-3026  is the root JIRA ticket for the list bucketing feature.  It has links to additional JIRA tickets which implement list bucketing in Hive, including: 

For more information, see Skewed Tables in the DDL document.

首页