May 2011 – gpadmin.me

Reading the news on the Greenplum HD announcement. I find it especially interesting because one of the main reasons I had an initial flurry of posts here and then trailed off was that I got heavily involved in our Hadoop installation and restructuring it. We’re currently using Cloudera‘s Hadoop packages and the way they handle distribution their software is about as good as you can get. I’m interested to see how Greenplum’s version of the software works. I’d heard talk of a couple of Map Reduce implementations at startups that were seeing impressive performance improvements. In a large enterprise there is definitely a place for both Hadoop and an MPP database and the trick is getting them to share data easily, which is why I was very impressed to see the 4.1 version of Greenplum with the ability to read from HDFS.

The big question is how well is Greenplum going to be able to support the release going forward. Greenplum is based off an older version of Postgres and I get a monthly question from someone about some feature that is in a later version of Postgres that doesn’t seem to be in Greenplum. Is their Hadoop implementation going to get the same treatment? Will Greenplum be able to keep up with the frequent changes to the Hadoop codebase and keep their internal product up to date, will it even really matter?

One extremely interesting thing we should see over the next year is a push on how to integrate EMC SAN architecture into both Greenplum and Hadoop. The old pre-EMC Greenplum sounded much like what I’ve heard from my Cloudera interactions, “Begrudgingly we see a use case for SAN storage, but might I suggest instead you cut off your left arm and beat yourself to death with it first.” I realize we’re working with really big data here so looking at SAN storage seems insane at first. Once you get into managing site to site interactions, non-interruptive backups and attempting to keep consistent IO through put across dozens if not hundreds of not only servers but different generations of servers, you can see the play.

I’m looking really long look at Flume right now and that’s a key feature that Cloudera implementation will have over what I’ve seen from Greenplum. The fault tolerant Name Node and Job Tracker look interesting but I don’t see these as very high risks in the current Hadoop system and as I understand they are already in the process of being addressed in core Hadoop. Performance promises are “meh”, I don’t take any bullet point that says X times speedup seriously. It could be true, but you really need a good whitepaper to backup a speed improvement boast.

So for me the jury is still out. It looks cool and I can’t wait to actually use it, but I said the same thing about Chorus a year ago and we still haven’t been approached with a production ready version of it.