Turn Analyze up to 10

mcfly_speakerIf you are a Greenplum admin you have probably written more than a few scripts to help you keep your tables analyzed. That or you frequently find yourself harassing users to keep their table stats fresh, since having accurate stats is one of the best ways to keep you cluster running in tip top shape. I don’t know about you, but for me this was not one of my favorite things spend my time on, so I love one of the tools Pivotal rolled out recently: analyzedb

analyzedb is a command line tool which analyzes tables concurrently and will capture current metadata for AO tables. Thus on subsequent runs it will refer to this data and only analyze those tables that have been modified since the last analysis. It can be run at a DB, schema and table level and when targeting tables you can even include or exclude specific columns in columnar tables. Fantastic. The parallelism level can be cranked from 1 to 10 depending on how crazy you want to get with you concurrent analyze sessions. I wouldn’t recommend turning it up past the default of 5 unless you are confident your system can handle it.

This is a major time saving tool that replaces a whole list of batch scripts I used previously.


Greenplum is Open

openGreenplum is now open, as in open source. is the site with all of the info and you can checkout the source yourself on github. Pivotal has been working to this internally for awhile and it is a huge step forward. I can’t think of any software product that was pulling in such significant revenue that was moved to the open source space. Industry wide we are in the middle of a shift similar to when music went from the phonograph to radio and it’s good to see Pivotal embracing the future.


GPDB on VM, it is a different beast

I get a large number of asks about virtualizing GPDB. Yes you can, that is the short answer. GPDB is software and all it really cares about is having the proper OS under it and it will 6938280971_326960b48a_zrun. Will it run well in that environment, that is an entirely different thing.

What you need to realize first is that virtualization was targeted for two things. One was to abstract a tie to physical hardware away from the idea of a server. This aspect we don’t really care about and are happy with. As long as the virtual resources provide enough horsepower to feed the beast, GPDB is okay. There is a little bit of loss in performance, but this gets less and less each year and depending on your workload may be of little concern.

The second aspect of virtualization was to allow servers to share resources and consolidate workload and make use of unused cycles. An example would be that if you have 6 servers and chart out their workloads it looks somewhat like a sawtooth pattern. Looking at that chart you realize if the you can line the teeth up correctly and combine the utilization that instead of 6 servers you could virtualize them and reduce the footprint. At that point you should be able to run the same 6 servers on 3 VM hosts that can handle that same load. BAM, cost savings, time savings and a promotion.

imageimage (1)

Taking this to the next level you start to place bets that if you have 4 VMs running on one host might use 4 cores though they wouldn’t all ever do it at the same time. So even though the host only has 8 cores you can tell each of the 4 VMs they have 4 cores under the assumption they all won’t be utilizing them at the same time, this is call over subscription.  Oversubscription is pretty standard practice in CPU, memory, network and IO in a VM infrastructure. It is this second virtualization practice where Greenplum Database and virtualization start to have an issue. GPDB is a parallel system and it wants to push a workload down to all of it segments at the same time and expects them all to do work at exactly the same time. A workload graph for Greenplum does not show a sawtooth pattern and resource utilization will happen across the cluster at the same time. So instead of filling gaps and balancing utilization out when you lay GPDB graphs on top of each other you get huge spikes.

image (2)image (3)

This doesn’t have to be a problem in a virtualized environment if you realize this is the pattern and are planning for that kind of utilization. If you are expecting to treat the cluster like your standard web farm you will be in for a rude awakening. When you hand out resources to GPDB VMs if you oversubscribe nodes and put them on the same host those VMs will all try to use all the resources they have available at the same time. You either need to do a static allocation of the resources or place your VMs in such a way that they will be able to chew on all of resources at any given point in time at the same time. It isn’t just CPU you need to think of in this manner, be sure you are also considering network, io and memory.  If you aren’t doing this then you are taking away GPDB’s ability to do what makes it such a powerful processing platform, that is it’s ability to align large amounts of processing and data access in parallel to solve big data problem quickly.

To be clear my goal here is not to discourage thoughts of virtualization, quite the contrary I’d like to see more of it. Just make sure to consider the type of workload this brings to the table as you virtualize it and you line up how the platform works with the resources you provide.


What kind of disk performance does your GP see?

During our regular maintenance widows I run a gpcheckperf to see where our disk speeds in the Greenplum cluster are coming in. This is a result from an C2100 with a single LSI 9260-8i  controller. There are two virtual disk composed of 6 disks each arranged in a RAID5. For the file system I’m using xfs with the mount options: logbufs=8, logbsize=256k, noatime, attr2, nobarrier and seeing these results.

/usr/local/greenplum-db/./bin/gpcheckperf -f /data/gpadmin/hosts.seg -d /data/gpdb_p1 -d /data/gpdb_p2 -r d -D

disk write min bandwidth (MB/s): 945.25 [sdw15]
disk write max bandwidth (MB/s): 1007.74 [sdw13]

disk read min bandwidth (MB/s): 1239.10 [sdw15]
disk read max bandwidth (MB/s): 1691.65 [sdw12]

Are these similar number to what you are getting in your clusters?