I’ve been working with Greenplum for years now, and of all the minor releases I’ve seen, I have to say 4.3.3 is one of my favorites.
This minor release included things like:
- Netbackup integration
- PL/R update to 3.1
- Fuzzy String Match module
What has me most excited about the release though is Delta Compression.
For those that don’t know Greenplum, it is an MPP (Massively Parallel Processing) database which spreads data over multiple nodes to harness the compute and I/O power of a cluster to process petabyte-scale data sets.
Greenplum Database (GPDB) also embraces a concept we call polymorphic storage: the ability to store data in multiple formats within one logical table. A table can have row-oriented partitions existing side by side with column-oriented partitions. In addition to this, various compression algorithms can be applied at the table and column level. Thus the latest 3 months of data in the table could be row oriented, the next 3 months columnar and uncompressed, and the following 3+ months columnar with compression. As far as the end user is concerned it all queries the same, and no changes need to be made to queries to interact with data in different parts of the lifecycle.
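To make that concrete, here is a rough sketch of what a table definition taking advantage of polymorphic storage might look like (the table, partition, and column names are made up for the example):

```sql
CREATE TABLE orders (
    order_id   bigint,
    order_date date,
    item       text
)
DISTRIBUTED BY (order_id)
PARTITION BY RANGE (order_date)
(
    -- Oldest data: column oriented with zlib compression
    PARTITION q1_2014 START (date '2014-01-01') END (date '2014-04-01')
        WITH (appendonly=true, orientation=column, compresstype=zlib, compresslevel=5),
    -- Middle of the lifecycle: column oriented, uncompressed
    PARTITION q2_2014 START (date '2014-04-01') END (date '2014-07-01')
        WITH (appendonly=true, orientation=column),
    -- Most recent data: plain row-oriented storage (no storage options needed)
    PARTITION q3_2014 START (date '2014-07-01') END (date '2014-10-01')
);
```

Each partition carries its own storage options, yet a SELECT against orders looks the same no matter which partition the rows live in.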
What was added in 4.3.3 is an additional way to compress data in a column to save space: Delta Compression. In addition to standard lzo and zlib column compression, GPDB has supported RLE (Run-Length Encoding) compression for a while now. Imagine you had a table of dinner orders. One of the columns is what was ordered, and when you change from row-based to columnar storage, the data stored for that column looks like this
Fish, Fish, Fish, Fish, Fish, Fish
What RLE compression does is store that same data like this
Fish x 6
For data sets with a large number of repeating values this can save a large amount of space.
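In GPDB, RLE is just another column-level storage option, so asking for it is one line of DDL. A minimal sketch of the dinner-orders idea, again with made-up names:

```sql
CREATE TABLE dinner_orders (
    order_id   bigint,
    order_time timestamp,
    -- RLE works best on columns with long runs of repeated values
    item       text ENCODING (compresstype=rle_type, compresslevel=1)
)
WITH (appendonly=true, orientation=column)
DISTRIBUTED BY (order_id);
```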
What has been added with Delta Compression is that values of data types such as integers and times are lined up and expressed as the offset (delta) from the previous value. So, for example, if you had
2014-01-02, 2014-01-02, 2014-01-02, 2014-01-02, 2014-01-03, 2014-01-03, 2014-01-03, 2014-01-03, 2014-01-03, 2014-01-04, 2014-01-04
With Delta Compression and RLE it would be stored as
2014-01-02(4), +1(5), +1(2)
After this fills up a block we compress the result with zlib to get amazing results.
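The nice part is there is no new syntax to learn here: as the documentation describes it, delta compression is applied automatically when rle_type is used on the supported types (integers, bigints, dates, times, and timestamps). A sketch, with made-up names, of a column it would apply to:

```sql
CREATE TABLE order_events (
    order_id   bigint,
    -- rle_type on a date column: runs are run-length encoded, changes between
    -- runs are stored as deltas, and higher compresslevels layer zlib on top
    event_date date ENCODING (compresstype=rle_type, compresslevel=3),
    item       text
)
WITH (appendonly=true, orientation=column)
DISTRIBUTED BY (order_id);
```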
We have taken customer data and in testing are seeing well over 100x compression on 10GB worth of TIME values from a customer's dataset. Even more impressive is the 5,000x compression on a similar 10GB sequence column.
It's those kinds of numbers I find exciting, and why 4.3.3 is such a great release.