LZO Compression

General LZO Concepts

LZO is a lossless data compression library that favors speed over compression ratio. See http://www.oberhumer.com/opensource/lzo and http://www.lzop.org for general information about LZO and see Compressed Data Storage for information about compression in Hive.

Imagine a simple data file that has three columns

  • id

  • first name

  • last name

Let's populate a data file containing 4 records:

19630001     john          lennon
19630002     paul          mccartney
19630003     george        harrison
19630004     ringo         starr

Let's call the data file /path/to/dir/names.txt .

In order to make it into an LZO file, we can use the lzop utility and it will create a names.txt.lzo file.

Now copy the file names.txt.lzo to HDFS.

Prerequisites

Lzo/Lzop Installations

lzo and lzop need to be installed on every node in the Hadoop cluster. The details of these installations are beyond the scope of this document.

core-site.xml

Add the following to your core-site.xml :

  • com.hadoop.compression.lzo.LzoCodec

  • com.hadoop.compression.lzo.LzopCodec

For example:

<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec, com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec </value>
</property>

<property>
<name>io.compression.codec.lzo.class</name>
<value> com.hadoop.compression.lzo.LzoCodec </value>
</property>

Next we run the command to create an LZO index file:

hadoop jar /path/to/jar/hadoop-lzo-cdh4-0.4.15-gplextras.jar com.hadoop.compression.lzo.LzoIndexer  /path/to/HDFS/dir/containing/lzo/files

This creates names.txt.lzo on HDFS.

Table Definition

The following hive -e command creates an LZO-compressed external table:

hive -e "CREATE EXTERNAL TABLE IF NOT EXISTS hive_table_name (column_1  datatype_1......column_N datatype_N)
         PARTITIONED BY (partition_col_1 datatype_1 ....col_P  datatype_P)
         ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
         STORED AS INPUTFORMAT  \"com.hadoop.mapred.DeprecatedLzoTextInputFormat\"
                   OUTPUTFORMAT \"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat\";

Note: The double quotes have to be escaped so that the ' hive -e ' command works correctly.

See CREATE TABLE and Hive CLI for information about command syntax.

Hive Queries

Option 1: Directly Create LZO Files

  • Directly create LZO files as the output of the Hive query.

  • Use lzop command utility or your custom Java to generate .lzo.index for the .lzo files.

Hive Query Parameters

SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzoCodec
SET hive.exec.compress.output=true
SET mapreduce.output.fileoutputformat.compress=true

For example:

hive -e "SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzoCodec; SET hive.exec.compress.output=true;SET mapreduce.output.fileoutputformat.compress=true; <query-string>"

** Note: If the data sets are large or number of output files are large , then this option does not work.

Option 2: Write Custom Java to Create LZO Files

  • Create text files as the output of the Hive query.

  • Write custom Java code to

  • convert Hive query generated text files to .lzo files

    • generate .lzo.index files for the .lzo files generated above

Hive Query Parameters

Prefix the query string with these parameters:

SET hive.exec.compress.output=false
SET mapreduce.output.fileoutputformat.compress=false

For example:

hive -e "SET hive.exec.compress.output=false;SET mapreduce.output.fileoutputformat.compress=false;<query-string>"