
Greenplum Database 4.3.3 Adds Delta Compression

Posted by skahler on

I’ve been working with Greenplum for years now, and of all the minor releases I’ve seen I have to say 4.3.3 is one of my favorites.

This minor release included things like:

  • NetBackup integration
  • PL/R update to 3.1
  • Fuzzy String Match module

What has me most excited about the release though is Delta Compression.

For those who don’t know Greenplum, it is an MPP (Massively Parallel Processing) database that spreads data across multiple nodes to harness the compute and IO power of a cluster to process petabyte-scale data sets.

Greenplum Database (GPDB) also embraces a concept we call polymorphic storage: the ability to store data in multiple formats within one logical table. A table can have row-oriented partitions existing side by side with column-oriented partitions. In addition, various compression algorithms can be applied at the table and column level. Thus the latest 3 months of data in a table could be row oriented, the next 3 months columnar and uncompressed, and the following 3+ months columnar with compression. As far as the end user is concerned it all queries the same, and no changes need to be made to queries to interact with data in different parts of the lifecycle.
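To make that concrete, here is a rough sketch of what such a table definition can look like. The table, columns, partition names and date ranges are all made up for illustration; the point is simply that each partition carries its own storage options while the table queries as one object:

    -- Hypothetical example: one logical table, three storage formats.
    CREATE TABLE dinner_orders (
        order_id   bigint,
        order_date date,
        item       text
    )
    DISTRIBUTED BY (order_id)
    PARTITION BY RANGE (order_date)
    (
        -- oldest data: column-oriented and compressed
        PARTITION archive START (date '2013-07-01') END (date '2014-01-01')
            WITH (appendonly=true, orientation=column, compresstype=zlib, compresslevel=5),
        -- middle-aged data: column-oriented, uncompressed
        PARTITION warm START (date '2014-01-01') END (date '2014-04-01')
            WITH (appendonly=true, orientation=column),
        -- latest data: plain row-oriented storage
        PARTITION hot START (date '2014-04-01') END (date '2014-07-01')
    );

A SELECT against dinner_orders doesn’t change at all as partitions age from one storage format to the next.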

[Image: polymorphic storage]

What was added in 4.3.3 is an additional way to compress data in a column to save space: Delta Compression. In addition to standard lzo and zlib column compression, GPDB has been able to apply RLE (Run Length Encoding) compression for a while now. Imagine you had a table of dinner orders, and one of the columns is what was ordered. When you change from row-based to columnar storage, the data stored for that column looks like this:

Fish, Fish, Fish, Fish, Fish, Fish

What RLE compression does is store that same data like this:

Fish x 6

(Ernie explains it here.)

For data sets with a large number of repeating values this can save a large amount of space.
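In GPDB terms, RLE is just another column encoding you can request on an append-only, column-oriented table. A minimal sketch, with hypothetical table and column names:

    -- Hypothetical table using RLE encoding on the repetitive column.
    CREATE TABLE dinner_orders_rle (
        order_id bigint,
        item     text ENCODING (compresstype=rle_type)
    )
    WITH (appendonly=true, orientation=column)
    DISTRIBUTED BY (order_id);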

What has been added with Delta Compression is that values of data types such as integers and time can be lined up and expressed as their offset from the previous value, so for example if you had:

2014-01-02, 2014-01-02, 2014-01-02, 2014-01-02, 2014-01-03, 2014-01-03, 2014-01-03, 2014-01-03, 2014-01-03, 2014-01-04, 2014-01-04

With Delta Compression and RLE it would be stored as:

2014-01-02(4), +1(5), +1(2)

After this fills up a block, we compress the result with zlib to get amazing results.
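As I read the 4.3.3 release notes, you don’t switch Delta Compression on explicitly: it is layered on top of rle_type encoding for column types such as INTEGER, BIGINT, DATE, TIME and TIMESTAMP, and the higher RLE compress levels add the zlib pass described above. A sketch along those lines, with hypothetical names:

    -- Hypothetical table: rle_type on a date column. In 4.3.3 delta
    -- compression should apply automatically for integer/date/time
    -- types (my reading of the release notes), and a higher
    -- compresslevel adds zlib on top of the RLE/delta output.
    CREATE TABLE dinner_orders_delta (
        order_id   bigint,
        order_date date ENCODING (compresstype=rle_type, compresslevel=3),
        item       text ENCODING (compresstype=rle_type)
    )
    WITH (appendonly=true, orientation=column)
    DISTRIBUTED BY (order_id);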

We have taken customer data and in tests are seeing well over 100x compression on 10G worth of TIME values from a customer’s dataset. Even more impressive is the 5000x compression on a similar 10G sequence column.

It’s those kinds of numbers I find exciting, and they’re why 4.3.3 is such a great release.

Will Hadoop knock out the MPP DB?

Posted by skahler on

I was reading an excellent post (When should I use Greenplum Database versus HAWQ?) by my colleague Jon Roberts, which got me thinking I should drop a post on why I still think the MPP database, and specifically Greenplum DB, is relevant in a Hadoop-crazy world. It should be noted that I am absolutely a Hadoop fanboy. I have managed to make my way to the last three Strata conferences and received my CDH certification before they had a pretty manager to take care of everything, so I’m no hater.

When considering Hadoop, there are a few key things to keep in mind:

  • Big Data ≠ Hadoop
  • Hadoop is a platform

I’ll tackle the last one first: Hadoop is a platform. Hadoop is made up of a variety of components, which various sources define in different ways. What all of them boil down to is a distributed datastore (most often HDFS) with a distributed processing implementation on top of it (most often MapReduce). This is a changing landscape, and multiple processing implementations are often offered in order to bring different capabilities forward. The net of this is that, much like implementing virtualization, you are going to need someone skilled in Hadoop to bring out its value and find the right use cases. You also need someone who understands Hadoop infrastructure; it is fundamentally different from the standard infrastructure most companies have been moving to. Probably the biggest hurdle will be convincing business units to align with doing things in a potentially new way on this new platform. The way in which they query, store and even load data will most likely change. While there are a variety of tools and vendors out there poised to help ease and support this transition, realize this is something new on the scale of implementing virtualization, SAN storage or shifting to cloud hosting. Nobody will debate that value can be found in that list of technologies, as long as you are willing to find the right use case and then bite the bullet and implement them. Much the same with Hadoop.

There is also a common misconception out there that Hadoop is the only path to working with Big Data. While a majority of the press revolves around what Google, Yahoo and Facebook are doing, the fact of the matter is that most companies’ Big Data is not at that scale. Many products exist that scale in a distributed fashion to handle datasets that grow well beyond a single rack, without needing the Hadoop platform beneath them. Standard infrastructure skills can be used, with a much smaller learning curve, to get results from these technologies, and often they offer greater analytic ability and/or faster data processing. What has caused Hadoop to thrive is not necessarily the kind of processing it can do; in many cases it’s that Hadoop is perceived as a cheaper alternative for a subset of what is currently being done, and people are hacking on it to add the missing functionality. While cost is an extremely important factor, if not the most important factor, this shows that Big Data processing is not exclusive to Hadoop; it’s just that Hadoop has made Big Data look more palatable to many companies.

This leads me to my view on the relevance of MPP databases. I believe that over the next year, as products like Hive and Impala permeate the market, things like HAWQ and Presto become more widely known, and pieces like Stinger and Tez drive the speed of SQL on Hadoop, this will push down the cost of the MPP database and, with it, the main reason many companies are looking to Hadoop as an alternative. In addition, people considering the move are going to get a more realistic look at what other companies have run into as they move Hadoop to production and what the true cost of bringing that infrastructure up is. As the delta between the MPP database and the true cost of Hadoop shrinks, you will see Hadoop positioned less as a rip-and-replace for the MPP database and more as an augmenter that sits alongside it and can be leveraged to bring value to the right use cases.

In short, Hadoop is the new kid on the block and it won’t replace the MPP database. You can plan on there being a few fights though before they become good friends.

4.3 out and about

Posted by skahler on

Greenplum Database 4.3 rolled out earlier this month and there are some big changes.

  • Append-Optimized Tables: UPDATE and DELETE operations are enabled on what were previously Append-Only tables (see the quick sketch after this list).
  • Master Mirroring Enhanced: The way master/standby master replication is implemented has changed. This should deliver faster master failover, and many more of the related functions can now be performed online.
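For what the first item means in practice, here is a quick sketch with a hypothetical table; prior to 4.3, an append-only table would have rejected the UPDATE and DELETE statements below:

    -- Hypothetical append-optimized table; UPDATE/DELETE now work in 4.3.
    CREATE TABLE events_ao (
        event_id   bigint,
        event_type text
    )
    WITH (appendonly=true)
    DISTRIBUTED BY (event_id);

    UPDATE events_ao SET event_type = 'corrected' WHERE event_id = 42;
    DELETE FROM events_ao WHERE event_type = 'obsolete';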

Docs have been added to the docs page here.