I was reading an excellent post (When should I use Greenplum Database versus HAWQ?) by my colleague Jon Roberts, which got me thinking I should drop a post on why I still think MPP databases, and specifically Greenplum DB, are relevant in a Hadoop-crazy world. It should be noted that I am absolutely a Hadoop fanboy. I have managed to make my way to the last three Strata conferences and received my CDH certification before they had a pretty manager to take care of everything, so I’m no hater.
When evaluating Hadoop, there are a couple of key points to keep in mind:
- Big Data ≠ Hadoop
- Hadoop is a platform
I’ll tackle the last first: Hadoop is a platform. Hadoop is made up of a variety of components, which various sources define in different ways. What all of them boil down to is a distributed datastore (most often HDFS) with a distributed processing implementation on top of it (most often MapReduce). This is a changing landscape, and multiple processing implementations are often offered in order to bring different capabilities forward. The net of this is that, much like implementing virtualization, you are going to need someone skilled in Hadoop to bring out its value and find the right use cases. In addition, you need someone who understands Hadoop infrastructure, which is fundamentally different from the standard infrastructure most companies have been moving to. Probably the biggest hurdle is that you need to convince business units to align to doing things in a potentially new way on this new platform. The way in which they query, store and even load data will most likely change. While there are a variety of tools and vendors out there poised to help ease and support this transition, realize this is something new on the scale of implementing virtualization, SAN storage or shifting to cloud hosting. Nobody will debate that value can be found in that list of technologies, as long as you are willing to find the right use case and then bite the bullet and implement them. Much the same goes for Hadoop.
There is also a common misconception out there that Hadoop is the only path to working with Big Data. While a majority of the press revolves around what Google, Yahoo and Facebook are doing, the fact of the matter is that most companies’ Big Data is not at that scale. Many products exist that scale in a distributed fashion to handle datasets spanning one rack or multiple racks without needing the Hadoop platform beneath them. Standard infrastructure skills can be used, with a much smaller learning curve, to get results from these technologies. Often these products exhibit greater analytics ability and/or faster data processing capabilities. In many cases, what has caused Hadoop to thrive is not necessarily the kind of processing it can do. It’s that Hadoop is perceived as a cheaper alternative for a subset of what is currently being done, and people are hacking to add the missing functionality. While cost is an extremely important factor, if not the most important factor, this shows that Big Data processing is not exclusive to Hadoop. What Hadoop has done for many companies is make Big Data look more palatable.
This leads me to my view on the relevance of MPP databases. I believe that over the next year, as products like Hive and Impala permeate the market, things like HAWQ and Presto become more widely known, and pieces like Stinger and Tez drive the speed of SQL on Hadoop, the cost of the MPP database will be pushed down. That in turn undercuts the reason many companies are looking to Hadoop as an alternative. In addition, people considering the move are going to get a more realistic look at what other companies have run into as they move Hadoop to production, and at what the true cost of bringing that infrastructure up is. As the delta between the cost of MPP and the true cost of Hadoop shrinks, you will see Hadoop positioned less as the rip-and-replace for the MPP database and more as an augmenter that sits alongside it and can be leveraged to bring value to the right use cases.
In short, Hadoop is the new kid on the block and it won’t replace the MPP database. You can plan on a few fights, though, before they become good friends.