Turn Analyze up to 10

mcfly_speakerIf you are a Greenplum admin you have probably written more than a few scripts to help you keep your tables analyzed. That or you frequently find yourself harassing users to keep their table stats fresh, since having accurate stats is one of the best ways to keep you cluster running in tip top shape. I don’t know about you, but for me this was not one of my favorite things spend my time on, so I love one of the tools Pivotal rolled out recently: analyzedb

analyzedb is a command line tool which analyzes tables concurrently and will capture current metadata for AO tables. Thus on subsequent runs it will refer to this data and only analyze those tables that have been modified since the last analysis. It can be run at a DB, schema and table level and when targeting tables you can even include or exclude specific columns in columnar tables. Fantastic. The parallelism level can be cranked from 1 to 10 depending on how crazy you want to get with you concurrent analyze sessions. I wouldn’t recommend turning it up past the default of 5 unless you are confident your system can handle it.

This is a major time saving tool that replaces a whole list of batch scripts I used previously.


Greenplum is Open

openGreenplum is now open, as in open source. is the site with all of the info and you can checkout the source yourself on github. Pivotal has been working to this internally for awhile and it is a huge step forward. I can’t think of any software product that was pulling in such significant revenue that was moved to the open source space. Industry wide we are in the middle of a shift similar to when music went from the phonograph to radio and it’s good to see Pivotal embracing the future.


GPDB on VM, it is a different beast

I get a large number of asks about virtualizing GPDB. Yes you can, that is the short answer. GPDB is software and all it really cares about is having the proper OS under it and it will 6938280971_326960b48a_zrun. Will it run well in that environment, that is an entirely different thing.

What you need to realize first is that virtualization was targeted for two things. One was to abstract a tie to physical hardware away from the idea of a server. This aspect we don’t really care about and are happy with. As long as the virtual resources provide enough horsepower to feed the beast, GPDB is okay. There is a little bit of loss in performance, but this gets less and less each year and depending on your workload may be of little concern.

The second aspect of virtualization was to allow servers to share resources and consolidate workload and make use of unused cycles. An example would be that if you have 6 servers and chart out their workloads it looks somewhat like a sawtooth pattern. Looking at that chart you realize if the you can line the teeth up correctly and combine the utilization that instead of 6 servers you could virtualize them and reduce the footprint. At that point you should be able to run the same 6 servers on 3 VM hosts that can handle that same load. BAM, cost savings, time savings and a promotion.

imageimage (1)

Taking this to the next level you start to place bets that if you have 4 VMs running on one host might use 4 cores though they wouldn’t all ever do it at the same time. So even though the host only has 8 cores you can tell each of the 4 VMs they have 4 cores under the assumption they all won’t be utilizing them at the same time, this is call over subscription.  Oversubscription is pretty standard practice in CPU, memory, network and IO in a VM infrastructure. It is this second virtualization practice where Greenplum Database and virtualization start to have an issue. GPDB is a parallel system and it wants to push a workload down to all of it segments at the same time and expects them all to do work at exactly the same time. A workload graph for Greenplum does not show a sawtooth pattern and resource utilization will happen across the cluster at the same time. So instead of filling gaps and balancing utilization out when you lay GPDB graphs on top of each other you get huge spikes.

image (2)image (3)

This doesn’t have to be a problem in a virtualized environment if you realize this is the pattern and are planning for that kind of utilization. If you are expecting to treat the cluster like your standard web farm you will be in for a rude awakening. When you hand out resources to GPDB VMs if you oversubscribe nodes and put them on the same host those VMs will all try to use all the resources they have available at the same time. You either need to do a static allocation of the resources or place your VMs in such a way that they will be able to chew on all of resources at any given point in time at the same time. It isn’t just CPU you need to think of in this manner, be sure you are also considering network, io and memory.  If you aren’t doing this then you are taking away GPDB’s ability to do what makes it such a powerful processing platform, that is it’s ability to align large amounts of processing and data access in parallel to solve big data problem quickly.

To be clear my goal here is not to discourage thoughts of virtualization, quite the contrary I’d like to see more of it. Just make sure to consider the type of workload this brings to the table as you virtualize it and you line up how the platform works with the resources you provide.


Will Hadoop knock out the MPP DB?

boxingI was reading an excellent post ( When should I use Greenplum Database versus HAWQ? ) by my colleague Jon Roberts. Which got me thinking I should drop a post on why I still think MPP database and specifically Greenplum DB are relevant in a Hadoop crazy world. It should be noted that I am absolutely a Hadoop fanboy. I have managed to make my way to the last three Strata conferences and received my CDH certification before they had a pretty manager to take care of everything, so I’m no hater.

In considering Hadoop there are a few key items you should consider:

  • Big Data ≠ Hadoop
  • Hadoop is a platform

I’ll tackle the last first, Hadoop is a platform. Hadoop is made up a variety of components of which various sources will define in different ways. What all of them boil down to is a distributed datastore (most often HDFS ) with distributed processing implementation on top of it (most often MapReduce). This is changing landscape and often multiple implementations of processing are being offered in order to bring different capabilities forward. The net of this is much like implementing virtualization, you are going to need someone skilled in Hadoop to bring out it’s value and find the right use cases. In addition you also need someone who understands Hadoop infrastructure, it is fundamentally different than the standard infrastructure most companies have been moving to. Probably the biggest hurdle will be that you need to convince business units to align to doing things in a potentially new way on this new platform. The way in which they query, store and even load data will most likely change.  While there are a variety of tools and vendors out there poised to help ease and support this transition, realize this is something new on the scale of implementing virtualization, SAN storage or shifting to cloud hosting. Nobody will debate value can be found in that list of technologies, as long as you are willing to find the right use case and then bite the bullet and implement them. Much the same with Hadoop.

There is also a common misconception out there that Hadoop is the only path to work with Big Data. While a majority of the press revolves around what Google, Yahoo and Facebook are doing, the fact of the matter is that most companies Big Data is not at that scale. Many products exist out there that scale in a distributed fashion to handle datasets that span well beyond one rack or multiple racks without needing to implement the Hadoop platform beneath it. Standard infrastructure skills can be utilized with a much smaller learning curve to get results from the implementation of these technologies. Often these products exhibit greater analytics ability and/or faster data processing capabilities. What has caused Hadoop to thrive is not necessarily what kind of processing Hadoop can do in many cases. It’s that it is perceived as an alternative to do a subset of what is currently being done at a cheaper cost and people are hacking to add that missing functionality. While cost is an extremely important factor, if not the one important factor, it shows that Big Data processing is not exclusive to Hadoop. Just that what Hadoop has done for many companies is make Big Data look more palatable.

This is leads me to my view on the relevance of MPP databases. I believe over the next year as products like Hive and Impala permeate the market, things like HAWQ and Presto become more widely known and pieces like Stinger and Tez drive the speed of SQL on Hadoop it will push down the cost of the MPP database and thus the reason many companies are looking to Hadoop as an alternative. In addition people considering the move are going to get a more realistic look at what other companies have run into as they move Hadoop to production and what the true cost of bringing that infrastructure up are. As the delta between MPP and the true cost of Hadoop shrink you will see less positioning of Hadoop as the rip and replace for the MPP database but as an augmenter that sits along side of it and can be leveraged to do bring value to the right use cases.

In short, Hadoop is the new kid on the block and it won’t replace the MPP database. You can plan on there being a few fights though before they become good friends.


4.3 out and about

TFCGreenplum Database 4.3 rolled out earlier this month and there are some big changes.

  • Append-Optimized Tables: UPDATE and DELETE operations are enabled on what were previously Append-Only tables.
  • Master Mirroring Enhanced: The way Master/StandbyMaster replication is implemented has changed. This should deliver faster master failover and many more of the functions related to it can now be done online.

Docs added to the docs page here


Database Landscape

Data nerd goodness from the 451 group



Stability and finding insight

ninjahackerSince the roll out of Greenplum 4.2 in combination with Greenplum on-site for a little training our clusters stability over the last year has improved. The main issues we run into at this point in time is lack of space as people want to throw more and more on the clusters and the occasional bad query that gets into the system and causes issues. To help with some of the space issues some simple jobs have been moved to a Hadoop cluster, this allows us to use Greenplum for the more complex data analytics functions. By removing some of the rollup functionality we have cut a decent number of table scans out of the system which resulted in a definite responsiveness increase.

The other problem we run into is queries getting into the system that cause resource contention. We have a liberal access policy and as it’s a startup environment, change is constant. Tools thus far to track down a rogue query haven’t been exactly outstanding. With the acquisition of MoreVRP I’m hoping a blessed and bottled better solution for this will come out. Currently I gather process stats and drop them to a local filesystem and drag that into a local share to do pivot tables on it and look for issues. There are some in database Greenplum queries that do this, of course when the DB is having issues using it to troubleshoot why it is having issues doesn’t work very well. The past couple of weeks we’ve been doing some work with OpenTSDB and it looks very promising. The tcollector system is easy to work with and the amount of data we have been able to throw at a single TSDB instance is impressive. The display ability of OpenTSDB is it’s weak point, we haven’t found a good display layer. Using the current tools though I am able to product the following graph, which shows memory utilization by query on each segment. Stats are collected every minute so a lot of queries don’t show, but our sub-minute queries aren’t really our problem children. As I perfect the filters I think this will be extremely helpful in tracking down skewing queries and the memory hogs the get dropped on the system in realtime.



LSI 9260-8i fails

One of our clusters we tried out used C2100s and these LSI 9260-8i controllers. They ran fantastically out of the gate, but we started to run into some issues. I can attest to the fact that our rack of servers ran hot. In pulling an over nighter it would get cold in the facility and I’d stand behind that rack because the heat coming out of over a dozen servers with a dozen disks running and processors running full bore kept my fingers from cramping up. This wasn’t heat senor warning hot, but it definitely put out some warmth. Later on when we started to have issues we’d pull the controller out and find the heat sink sitting on top of the card not attached in any way. What I think the problem was is that these controllers have the heat sink held on by plastic clips and springs and as our servers would run warm, not to the sensor heat warning level, eventually the clips would melt leaving the heatsink sitting there on top of the controller.


Testing out UAC

I haven’t been a big fan of Greenplum’s performance monitoring tools. It’s been a couple years though so it’s time to try them again and see what we’ve got. So just got UAC up and running today and it’s looking fairly nice.


Finding trouble spots

I’ve been fighting some database performance issues recently and started to use the following query to look for tables that are being used that are showing that they have 0 rows. These are likely to be unanalyzed tables being used in queries. We have queries to look through the entire database for potential unanalyzed tables, but it takes much less resources just to look at what’s currently in flight and try to address what we are currently hammering on. There are a lot of other table in there that I don’t need to join on for the visible data set, but I’ve got them there in case I need to start ungrouping things and pulling in more specific data.

SELECT n.nspname AS "schema_name",                                       
       c.relname AS "object_name",                                       
FROM pg_locks l,                                                       
     pg_class c,                                                       
     pg_database d,                                                    
     pg_stat_activity s,                                               
     pg_namespace n                                               
WHERE l.relation = c.oid                                          
AND l.database = d.oid                                            
AND = s.procpid                                             
AND c.relnamespace = n.oid                                        
AND c.relkind = 'r'                                               
AND n.nspname !~ '^pg_'::text                                     
AND c.reltuples = 0                                               
GROUP BY 1, 2, 3, 4                                               
ORDER BY 1, 2;