
Hive compression and column storage

Hadoop Compression configuration

Compression codecs supported by MapReduce

| Compression format | Tool | Algorithm | File extension | Splittable | Codec | Original file size | Compressed file size | Compression speed |
|---|---|---|---|---|---|---|---|---|
| DEFLATE | none | DEFLATE | .deflate | no | org.apache.hadoop.io.compress.DefaultCodec | | | |
| Gzip | gzip | DEFLATE | .gz | no | org.apache.hadoop.io.compress.GzipCodec | 8.3 GB | 1.8 GB | 17.5 MB/s |
| bzip2 | bzip2 | bzip2 | .bz2 | yes | org.apache.hadoop.io.compress.BZip2Codec | 8.3 GB | 1.1 GB | 2.4 MB/s |
| LZO | lzop | LZO | .lzo | yes | com.hadoop.compression.lzo.LzopCodec | 8.3 GB | 2.9 GB | 49.3 MB/s |
| Snappy | none | Snappy | .snappy | no | org.apache.hadoop.io.compress.SnappyCodec | | | |
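One quick way to see which codecs are actually registered on your cluster is to print the property from the Hive CLI (the value shown depends on your core-site.xml):

hive (default)> set io.compression.codecs;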

Compression parameter configuration

To enable compression in Hadoop, the following parameters can be configured (in mapred-site.xml):

| Parameter | Default value | Stage | Recommendation |
|---|---|---|---|
| io.compression.codecs (configured in core-site.xml) | org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec, org.apache.hadoop.io.compress.Lz4Codec | input compression | Hadoop uses the file extension to determine whether a codec is supported |
| mapreduce.map.output.compress | false | mapper output | Set this parameter to true to enable compression |
| mapreduce.map.output.compress.codec | org.apache.hadoop.io.compress.DefaultCodec | mapper output | Use LZO, LZ4, or Snappy to compress data at this stage |
| mapreduce.output.fileoutputformat.compress | false | reducer output | Set this parameter to true to enable compression |
| mapreduce.output.fileoutputformat.compress.codec | org.apache.hadoop.io.compress.DefaultCodec | reducer output | Use a standard tool or codec, such as gzip or bzip2 |
| mapreduce.output.fileoutputformat.compress.type | RECORD | reducer output | The compression type used for SequenceFile output: NONE or BLOCK |
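As a sketch, enabling map output compression cluster-wide might look like this inside the <configuration> element of mapred-site.xml (the property names come from the table above; Snappy is just one possible codec choice):

<property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
</property>
<property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>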

Enabling Map output stage compression

Enabling compression for the map output stage reduces the volume of data transferred between the map and reduce tasks of a job. The configuration steps are as follows:

Hands-on example:

1. Enable Hive intermediate data compression:
hive (default)> set hive.exec.compress.intermediate=true;
2. Enable map output compression in MapReduce:
hive (default)> set mapreduce.map.output.compress=true;
3. Set the compression codec for map output in MapReduce:
hive (default)> set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
4. Run a query:
hive (default)> select count(ename) name from emp;

Enabling Reduce output stage compression

When Hive writes output to a table, that output can also be compressed. The property hive.exec.compress.output controls this feature. Users may want to keep the default value of false in their default settings file, so that the default output is an uncompressed plain-text file; setting the value to true enables output compression.

Hands-on example:

1. Enable Hive final output compression:
hive (default)> set hive.exec.compress.output=true;
2. Enable MapReduce final output compression:
hive (default)> set mapreduce.output.fileoutputformat.compress=true;
3. Set the compression codec for the final MapReduce output:
hive (default)> set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
4. Set the final MapReduce output to use block compression:
hive (default)> set mapreduce.output.fileoutputformat.compress.type=BLOCK;
5. Check whether the output is a compressed file:
hive (default)> insert overwrite local directory '/opt/module/datas/distribute-result' select * from emp distribute by deptno sort by empno desc;
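If compression took effect, the files written to the output directory should end with the codec's extension (.snappy here). One way to check without leaving the Hive CLI is the ! shell escape (exact file names will vary):

hive (default)> !ls /opt/module/datas/distribute-result;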

File storage format


The main file storage formats supported by Hive are TEXTFILE, SEQUENCEFILE, ORC, and PARQUET.

Column and row storage

[Figure: row-oriented vs. column-oriented storage]
On the left is the logical table; on the right, the first layout is row storage and the second is column storage.

1. Characteristics of row storage

When a query needs an entire row that satisfies some condition, a column store has to visit each clustered field to fetch that row's value for every column, while a row store only needs to locate one value: the remaining values sit in adjacent positions. For this kind of query, row storage is therefore faster.

2. Characteristics of column storage

Because the data for each field is stored together, a query that needs only a few fields can greatly reduce the amount of data read. And since all values in a field share the same data type, column storage allows compression algorithms tailored to each column.

TEXTFILE and SEQUENCEFILE are row-oriented formats; ORC and PARQUET are column-oriented.

TextFile Format

This is the default format. Data is stored uncompressed, so disk overhead is high and data parsing is costly. It can be combined with Gzip or Bzip2, but with Gzip the file cannot be split, so Hive cannot operate on the data in parallel.
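A minimal sketch of creating a TextFile table (the table name and schema are hypothetical; STORED AS TEXTFILE is also Hive's default and could be omitted):

hive (default)> create table log_text(
                    track_time string,
                    url string,
                    session_id string
                )
                row format delimited fields terminated by '\t'
                stored as textfile;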

Orc Format

ORC (Optimized Row Columnar) is a storage format introduced in Hive 0.11.

As the figure below shows, each ORC file consists of one or more stripes, each about 250 MB in size. A stripe is effectively the same concept as a RowGroup, but its size grows from 4 MB to 250 MB, which should improve the throughput of sequential reads. Each stripe is made up of three parts: Index Data, Row Data, and Stripe Footer:
[Figure: ORC file format]

  • Index Data: a lightweight index; by default an index entry is created every 10,000 rows. The index only records each field's offset within the Row Data for the indexed row.
  • Row Data: stores the actual data. It takes a batch of rows and stores them by column; each column is encoded and divided into multiple Streams for storage.
  • Stripe Footer: stores the type and length information of each Stream.

Each file has a File Footer that records the number of rows in each stripe, the data type of each column, and so on. The end of each file is a PostScript, which records the compression type of the whole file, the length of the File Footer, and other details. When reading a file, the reader seeks to the end and reads the PostScript, parses the File Footer length from it, reads the File Footer, parses the per-stripe information from it, and then reads each stripe; that is, the file is read from back to front.
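A minimal sketch of creating an ORC table (hypothetical table name and schema; by default ORC applies its own ZLIB compression, controlled by the orc.compress table property):

hive (default)> create table log_orc(
                    track_time string,
                    url string,
                    session_id string
                )
                stored as orc;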

Parquet Format

Parquet is a columnar storage format designed for analytical workloads. It was developed jointly by Twitter and Cloudera, and in May 2015 it graduated from the Apache incubator to become a top-level Apache project.
Parquet files are stored in binary form and cannot be read directly. Each file contains both its data and its metadata, so Parquet files are self-describing.
Usually, when writing Parquet data, the row group size is set according to the HDFS block size. Since the smallest unit of data a Mapper task processes is generally one block, each row group can then be handled by a single Mapper task, increasing the parallelism of task execution. The Parquet file format is shown in the figure.

[Figure: Parquet file format]

The figure above shows the contents of a Parquet file; a single file can store multiple row groups. The file begins with its magic code, used to verify whether it is a Parquet file. The footer length records the size of the file metadata, and from this value plus the file length the offset of the metadata can be computed. The file metadata contains the metadata of every row group and the schema of the stored data. Besides the per-row-group metadata, the beginning of each page stores that page's own metadata. Parquet defines three types of pages: data pages, dictionary pages, and index pages. A data page stores the values of a column in the current row group; a dictionary page stores the encoding dictionary for that column's values (each column chunk contains at most one dictionary page); an index page stores the index of the column within the current row group. Index pages are not yet supported in Parquet.
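A minimal sketch of creating a Parquet table (hypothetical table name and schema):

hive (default)> create table log_parquet(
                    track_time string,
                    url string,
                    session_id string
                )
                stored as parquet;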

Storage format comparison summary:

Compression ratio: ORC > Parquet > TextFile.
Query speed: the three formats are roughly similar.

Combining storage and compression


Modify the Hadoop cluster to support the Snappy compression method.

Storage and compression summary

In real-world project development, the storage format of Hive tables is generally ORC or Parquet, and the compression method is generally Snappy or LZO.
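For example, the commonly chosen combination of ORC storage with Snappy compression can be declared when the table is created (a sketch with a hypothetical table name and schema; orc.compress accepts NONE, ZLIB, or SNAPPY):

hive (default)> create table log_orc_snappy(
                    track_time string,
                    url string,
                    session_id string
                )
                stored as orc
                tblproperties("orc.compress"="SNAPPY");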

Copyright notice
This article was created by [MosesDon]. Please include a link to the original when reposting. Thanks.
https://cdmana.com/2020/12/20201213174243691x.html
