2015 – gpadmin.me

If you are a Greenplum admin you have probably written more than a few scripts to help you keep your tables analyzed. That or you frequently find yourself harassing users to keep their table stats fresh, since having accurate stats is one of the best ways to keep you cluster running in tip top shape. I don’t know about you, but for me this was not one of my favorite things spend my time on, so I love one of the tools Pivotal rolled out recently: analyzedb

analyzedb is a command line tool which analyzes tables concurrently and will capture current metadata for AO tables. Thus on subsequent runs it will refer to this data and only analyze those tables that have been modified since the last analysis. It can be run at a DB, schema and table level and when targeting tables you can even include or exclude specific columns in columnar tables. Fantastic. The parallelism level can be cranked from 1 to 10 depending on how crazy you want to get with you concurrent analyze sessions. I wouldn’t recommend turning it up past the default of 5 unless you are confident your system can handle it.

This is a major time saving tool that replaces a whole list of batch scripts I used previously.

I get a large number of asks about virtualizing GPDB. Yes you can, that is the short answer. GPDB is software and all it really cares about is having the proper OS under it and it will run. Will it run well in that environment, that is an entirely different thing.

What you need to realize first is that virtualization was targeted for two things. One was to abstract a tie to physical hardware away from the idea of a server. This aspect we don’t really care about and are happy with. As long as the virtual resources provide enough horsepower to feed the beast, GPDB is okay. There is a little bit of loss in performance, but this gets less and less each year and depending on your workload may be of little concern.

The second aspect of virtualization was to allow servers to share resources and consolidate workload and make use of unused cycles. An example would be that if you have 6 servers and chart out their workloads it looks somewhat like a sawtooth pattern. Looking at that chart you realize if the you can line the teeth up correctly and combine the utilization that instead of 6 servers you could virtualize them and reduce the footprint. At that point you should be able to run the same 6 servers on 3 VM hosts that can handle that same load. BAM, cost savings, time savings and a promotion.

Taking this to the next level you start to place bets that if you have 4 VMs running on one host might use 4 cores though they wouldn’t all ever do it at the same time. So even though the host only has 8 cores you can tell each of the 4 VMs they have 4 cores under the assumption they all won’t be utilizing them at the same time, this is call over subscription. Oversubscription is pretty standard practice in CPU, memory, network and IO in a VM infrastructure. It is this second virtualization practice where Greenplum Database and virtualization start to have an issue. GPDB is a parallel system and it wants to push a workload down to all of it segments at the same time and expects them all to do work at exactly the same time. A workload graph for Greenplum does not show a sawtooth pattern and resource utilization will happen across the cluster at the same time. So instead of filling gaps and balancing utilization out when you lay GPDB graphs on top of each other you get huge spikes.

This doesn’t have to be a problem in a virtualized environment if you realize this is the pattern and are planning for that kind of utilization. If you are expecting to treat the cluster like your standard web farm you will be in for a rude awakening. When you hand out resources to GPDB VMs if you oversubscribe nodes and put them on the same host those VMs will all try to use all the resources they have available at the same time. You either need to do a static allocation of the resources or place your VMs in such a way that they will be able to chew on all of resources at any given point in time at the same time. It isn’t just CPU you need to think of in this manner, be sure you are also considering network, io and memory. If you aren’t doing this then you are taking away GPDB’s ability to do what makes it such a powerful processing platform, that is it’s ability to align large amounts of processing and data access in parallel to solve big data problem quickly.

To be clear my goal here is not to discourage thoughts of virtualization, quite the contrary I’d like to see more of it. Just make sure to consider the type of workload this brings to the table as you virtualize it and you line up how the platform works with the resources you provide.