Greenplum OSS

Greenplum Utility testing with behave

The management utilities have a set of tests with them used to make sure things like gpinitsystem, gpcheck and analyzedb are functioning properly.

There is a README file in the source code that talks about how to make this happen using the behave framework, but I wanted to additionally break it down in a blog post.

Let’s say for instance I would like to make a change to gpinitsystem as a developer, then running the gpinitsystem behave tests would make sense in order to validate the way the program is executing in the way we expect it.

It should be no surprise that the first step is you need to have the behave python module installed, this is pretty simple:

sudo pip install behave

additionally you’ll want to have sourced the environment variables for the files you have compiled (note your path may be different)

source /usr/local/gpdb/

At that point in order to execute the tests for something like gpinitsystem, you want to make sure you are in the proper directory

cd ~/gpdb_build/gpdb/gpMgmt

and then use behave to execute the test

behave test/behave/mgmt_utils/gpinitsystem.feature

and it will run through the tests

if I want to execute a specific test I can pass a tag to just run tests that match the tag

behave test/behave/mgmt_utils/gpinitsystem.feature --tags gpinitsystem_verify_timezone_setting


Turn Analyze up to 10

mcfly_speakerIf you are a Greenplum admin you have probably written more than a few scripts to help you keep your tables analyzed. That or you frequently find yourself harassing users to keep their table stats fresh, since having accurate stats is one of the best ways to keep you cluster running in tip top shape. I don’t know about you, but for me this was not one of my favorite things spend my time on, so I love one of the tools Pivotal rolled out recently: analyzedb

analyzedb is a command line tool which analyzes tables concurrently and will capture current metadata for AO tables. Thus on subsequent runs it will refer to this data and only analyze those tables that have been modified since the last analysis. It can be run at a DB, schema and table level and when targeting tables you can even include or exclude specific columns in columnar tables. Fantastic. The parallelism level can be cranked from 1 to 10 depending on how crazy you want to get with you concurrent analyze sessions. I wouldn’t recommend turning it up past the default of 5 unless you are confident your system can handle it.

This is a major time saving tool that replaces a whole list of batch scripts I used previously.


Greenplum is Open

openGreenplum is now open, as in open source. is the site with all of the info and you can checkout the source yourself on github. Pivotal has been working to this internally for awhile and it is a huge step forward. I can’t think of any software product that was pulling in such significant revenue that was moved to the open source space. Industry wide we are in the middle of a shift similar to when music went from the phonograph to radio and it’s good to see Pivotal embracing the future.


GPDB on VM, it is a different beast

I get a large number of asks about virtualizing GPDB. Yes you can, that is the short answer. GPDB is software and all it really cares about is having the proper OS under it and it will 6938280971_326960b48a_zrun. Will it run well in that environment, that is an entirely different thing.

What you need to realize first is that virtualization was targeted for two things. One was to abstract a tie to physical hardware away from the idea of a server. This aspect we don’t really care about and are happy with. As long as the virtual resources provide enough horsepower to feed the beast, GPDB is okay. There is a little bit of loss in performance, but this gets less and less each year and depending on your workload may be of little concern.

The second aspect of virtualization was to allow servers to share resources and consolidate workload and make use of unused cycles. An example would be that if you have 6 servers and chart out their workloads it looks somewhat like a sawtooth pattern. Looking at that chart you realize if the you can line the teeth up correctly and combine the utilization that instead of 6 servers you could virtualize them and reduce the footprint. At that point you should be able to run the same 6 servers on 3 VM hosts that can handle that same load. BAM, cost savings, time savings and a promotion.

imageimage (1)

Taking this to the next level you start to place bets that if you have 4 VMs running on one host might use 4 cores though they wouldn’t all ever do it at the same time. So even though the host only has 8 cores you can tell each of the 4 VMs they have 4 cores under the assumption they all won’t be utilizing them at the same time, this is call over subscription.  Oversubscription is pretty standard practice in CPU, memory, network and IO in a VM infrastructure. It is this second virtualization practice where Greenplum Database and virtualization start to have an issue. GPDB is a parallel system and it wants to push a workload down to all of it segments at the same time and expects them all to do work at exactly the same time. A workload graph for Greenplum does not show a sawtooth pattern and resource utilization will happen across the cluster at the same time. So instead of filling gaps and balancing utilization out when you lay GPDB graphs on top of each other you get huge spikes.

image (2)image (3)

This doesn’t have to be a problem in a virtualized environment if you realize this is the pattern and are planning for that kind of utilization. If you are expecting to treat the cluster like your standard web farm you will be in for a rude awakening. When you hand out resources to GPDB VMs if you oversubscribe nodes and put them on the same host those VMs will all try to use all the resources they have available at the same time. You either need to do a static allocation of the resources or place your VMs in such a way that they will be able to chew on all of resources at any given point in time at the same time. It isn’t just CPU you need to think of in this manner, be sure you are also considering network, io and memory.  If you aren’t doing this then you are taking away GPDB’s ability to do what makes it such a powerful processing platform, that is it’s ability to align large amounts of processing and data access in parallel to solve big data problem quickly.

To be clear my goal here is not to discourage thoughts of virtualization, quite the contrary I’d like to see more of it. Just make sure to consider the type of workload this brings to the table as you virtualize it and you line up how the platform works with the resources you provide.

Version Release

Greenplum Database 4.3.3 Adds Delta Compression

I’ve been working with Greenplum for years now and of all the minor releases I’ve seen I have to say 4.3.3 is one of my favorites.

This minor release included things like:

  • Netbackup integration
  • PL/R update to 3.1
  • Fuzzy String Match module

What has me most excited about the release though is Delta Compression.

For those that don’t know Greenplum it is a MPP (Massive Parallel Processing) database which spreads data over multiple nodes to harness the compute and IO power of a cluster to process petabyte scale data sets.

Greenplum Database (GPDB) also embraces a concept we call polymorphic storage, the ability to store data in multiple formats within one logical table.  A table can have partitions that are  row oriented existing side by side with partitions are column oriented. In addition to this various, compression algorithms can be applied at the table and column level. Thus the latest 3 months of data in the table could be row oriented, the next 3 months columnar and uncompressed with the following 3+ months columnar with compression. As far as the end user is concerned it all queries the same and no changes need to be made in queries to interact with data in different parts of the lifecycle.



What was added in 4.3.3 was an additional way to compress data in a column to save space, Delta Compression. In addition to standard lzo and zlib column compression GPDB has been able to RLE ( Run Length Encoding ) compression for awhile now. Imagine if you had a table with dinner orders. One of those columns is what the order is and when you change from row based to columnar the data stored for the column look like this

Fish, Fish, Fish, Fish, Fish, Fish

What RLE compression does is store that same type of data like this

Fish x 6

( Ernie explain it here  )

For data sets with a large number of repeating values this can save large amount of space.

What has been added with Delta Compression is data types such as integers and time we line up and express as their offset, so for example if you had

2014-01-02, 2014-01-02, 2014-01-02, 2014-01-02, 2014-01-03, 2014-01-03, 2014-01-03, 2014-01-03, 2014-01-03, 2014-01-04, 2014-01-04

With Delta Compression and RLE it would be stored as

2014-01-02(4), +1(5), +1(2)

After this fills up a bock we compress the result with zlib to get amazing results.

We have taken customer data and in test are seeing well over 100x compression on 10G worth of TIME values from a customers dataset. Even more impressive is the5000x compression on a similar 10G sequence column.

It those kinds of number I find exciting and why 4.3.3 is such a great release.


Will Hadoop knock out the MPP DB?

boxingI was reading an excellent post ( When should I use Greenplum Database versus HAWQ? ) by my colleague Jon Roberts. Which got me thinking I should drop a post on why I still think MPP database and specifically Greenplum DB are relevant in a Hadoop crazy world. It should be noted that I am absolutely a Hadoop fanboy. I have managed to make my way to the last three Strata conferences and received my CDH certification before they had a pretty manager to take care of everything, so I’m no hater.

In considering Hadoop there are a few key items you should consider:

  • Big Data ≠ Hadoop
  • Hadoop is a platform

I’ll tackle the last first, Hadoop is a platform. Hadoop is made up a variety of components of which various sources will define in different ways. What all of them boil down to is a distributed datastore (most often HDFS ) with distributed processing implementation on top of it (most often MapReduce). This is changing landscape and often multiple implementations of processing are being offered in order to bring different capabilities forward. The net of this is much like implementing virtualization, you are going to need someone skilled in Hadoop to bring out it’s value and find the right use cases. In addition you also need someone who understands Hadoop infrastructure, it is fundamentally different than the standard infrastructure most companies have been moving to. Probably the biggest hurdle will be that you need to convince business units to align to doing things in a potentially new way on this new platform. The way in which they query, store and even load data will most likely change.  While there are a variety of tools and vendors out there poised to help ease and support this transition, realize this is something new on the scale of implementing virtualization, SAN storage or shifting to cloud hosting. Nobody will debate value can be found in that list of technologies, as long as you are willing to find the right use case and then bite the bullet and implement them. Much the same with Hadoop.

There is also a common misconception out there that Hadoop is the only path to work with Big Data. While a majority of the press revolves around what Google, Yahoo and Facebook are doing, the fact of the matter is that most companies Big Data is not at that scale. Many products exist out there that scale in a distributed fashion to handle datasets that span well beyond one rack or multiple racks without needing to implement the Hadoop platform beneath it. Standard infrastructure skills can be utilized with a much smaller learning curve to get results from the implementation of these technologies. Often these products exhibit greater analytics ability and/or faster data processing capabilities. What has caused Hadoop to thrive is not necessarily what kind of processing Hadoop can do in many cases. It’s that it is perceived as an alternative to do a subset of what is currently being done at a cheaper cost and people are hacking to add that missing functionality. While cost is an extremely important factor, if not the one important factor, it shows that Big Data processing is not exclusive to Hadoop. Just that what Hadoop has done for many companies is make Big Data look more palatable.

This is leads me to my view on the relevance of MPP databases. I believe over the next year as products like Hive and Impala permeate the market, things like HAWQ and Presto become more widely known and pieces like Stinger and Tez drive the speed of SQL on Hadoop it will push down the cost of the MPP database and thus the reason many companies are looking to Hadoop as an alternative. In addition people considering the move are going to get a more realistic look at what other companies have run into as they move Hadoop to production and what the true cost of bringing that infrastructure up are. As the delta between MPP and the true cost of Hadoop shrink you will see less positioning of Hadoop as the rip and replace for the MPP database but as an augmenter that sits along side of it and can be leveraged to do bring value to the right use cases.

In short, Hadoop is the new kid on the block and it won’t replace the MPP database. You can plan on there being a few fights though before they become good friends.


4.3 out and about

TFCGreenplum Database 4.3 rolled out earlier this month and there are some big changes.

  • Append-Optimized Tables: UPDATE and DELETE operations are enabled on what were previously Append-Only tables.
  • Master Mirroring Enhanced: The way Master/StandbyMaster replication is implemented has changed. This should deliver faster master failover and many more of the functions related to it can now be done online.

Docs added to the docs page here


Database Landscape

Data nerd goodness from the 451 group



Stability and finding insight

ninjahackerSince the roll out of Greenplum 4.2 in combination with Greenplum on-site for a little training our clusters stability over the last year has improved. The main issues we run into at this point in time is lack of space as people want to throw more and more on the clusters and the occasional bad query that gets into the system and causes issues. To help with some of the space issues some simple jobs have been moved to a Hadoop cluster, this allows us to use Greenplum for the more complex data analytics functions. By removing some of the rollup functionality we have cut a decent number of table scans out of the system which resulted in a definite responsiveness increase.

The other problem we run into is queries getting into the system that cause resource contention. We have a liberal access policy and as it’s a startup environment, change is constant. Tools thus far to track down a rogue query haven’t been exactly outstanding. With the acquisition of MoreVRP I’m hoping a blessed and bottled better solution for this will come out. Currently I gather process stats and drop them to a local filesystem and drag that into a local share to do pivot tables on it and look for issues. There are some in database Greenplum queries that do this, of course when the DB is having issues using it to troubleshoot why it is having issues doesn’t work very well. The past couple of weeks we’ve been doing some work with OpenTSDB and it looks very promising. The tcollector system is easy to work with and the amount of data we have been able to throw at a single TSDB instance is impressive. The display ability of OpenTSDB is it’s weak point, we haven’t found a good display layer. Using the current tools though I am able to product the following graph, which shows memory utilization by query on each segment. Stats are collected every minute so a lot of queries don’t show, but our sub-minute queries aren’t really our problem children. As I perfect the filters I think this will be extremely helpful in tracking down skewing queries and the memory hogs the get dropped on the system in realtime.



LSI 9260-8i fails

One of our clusters we tried out used C2100s and these LSI 9260-8i controllers. They ran fantastically out of the gate, but we started to run into some issues. I can attest to the fact that our rack of servers ran hot. In pulling an over nighter it would get cold in the facility and I’d stand behind that rack because the heat coming out of over a dozen servers with a dozen disks running and processors running full bore kept my fingers from cramping up. This wasn’t heat senor warning hot, but it definitely put out some warmth. Later on when we started to have issues we’d pull the controller out and find the heat sink sitting on top of the card not attached in any way. What I think the problem was is that these controllers have the heat sink held on by plastic clips and springs and as our servers would run warm, not to the sensor heat warning level, eventually the clips would melt leaving the heatsink sitting there on top of the controller.