Controller Setting for Greenplum

I brought another node into one of our clusters yesterday and it made me think of the controller settings I put on the system. In various systems we’ve used the PERC6/E, H700 and LSI-9260-8i controllers, and on all of them I’ve found I see the best disk performance and reliability if I set:

  • Read Policy: No Read Ahead – Having the controller do read ahead dramatically increases the I/O done on my servers, and I’ve seen no benefit from it.
  • Write Policy: Force Write Back – This is playing with a little bit of fire, because I’m telling the controller to keep using the battery-backed write cache even if the battery isn’t full or is going through a charging cycle. The fact that Greenplum data is duplicated on another server gets me past the few edge cases where the server is without power long enough for the lack of juice in the battery to come into play. Without this setting, when the controller goes to charge the battery it stops using the cache and forces everything to write straight to disk. That has a huge impact on I/O speed and causes the whole cluster to grind to a halt while the one server struggles with I/O.

What started the need for me to bring this other node into our cluster is that during every outage I run an I/O check on the clusters using gpcheckperf, and I saw one array underperforming all the others:

disk write bandwidth (MB/s): 620.81 [sdw11]
disk write bandwidth (MB/s): 365.38 [sdw09]
disk write bandwidth (MB/s): 621.01 [sdw08]

It’s an issue we’ve had before, where one disk in the array starts to underperform but doesn’t fail out. At this point I’ll need to go in, break the RAID5 array into direct access for each disk individually, and run benchmarks against them to see if I can figure out who the bad boy is and eject him from class.
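Spotting the slow array by eye works for three hosts but gets tedious on a bigger cluster. The following is a minimal Python sketch, not part of gpcheckperf: it assumes the summary lines look exactly like the ones shown above, and the function name `slow_hosts` and the 80%-of-median cutoff are illustrative choices rather than anything Greenplum defines.

```python
import re
from statistics import median

# Matches summary lines in the form shown above, e.g.
# "disk write bandwidth (MB/s): 365.38 [sdw09]"
LINE_RE = re.compile(r"disk write bandwidth \(MB/s\):\s*([\d.]+)\s*\[([^\]]+)\]")


def slow_hosts(report, threshold=0.8):
    """Return {host: MB/s} for hosts below threshold * median bandwidth.

    threshold is a hypothetical tuning knob: 0.8 flags any host that
    writes at less than 80% of the median host's rate.
    """
    rates = {host: float(rate) for rate, host in LINE_RE.findall(report)}
    if not rates:
        return {}
    cutoff = threshold * median(rates.values())
    return {host: rate for host, rate in rates.items() if rate < cutoff}


sample = """\
disk write bandwidth (MB/s): 620.81 [sdw11]
disk write bandwidth (MB/s): 365.38 [sdw09]
disk write bandwidth (MB/s): 621.01 [sdw08]
"""

print(slow_hosts(sample))  # {'sdw09': 365.38}
```

Comparing against the median rather than the mean keeps one very slow array from dragging the cutoff down with it.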

Version Release

4.1 Looks Sweet

Just downloaded the 4.1 Release Notes and looking them over. There are a ton of things in there we could make immediate use of:

    Enhanced Workload Management: Statement-Level Memory

    Tight Hadoop Integration

    Simpler, Faster Installation: Parallel Installation Utility

    Better In Database Analytics

    Enhanced Support of Joins for UPDATE and DELETE

    Ordered Aggregates and Ordered Reduce Specifications in Greenplum MapReduce

    Transaction Performance Enhancement

    Enhanced gpload Performance to Avoid Catalog Bloat

    Online Segment Rebalancing Following a Recovery

    Performance Monitor Enhancements

    PostgreSQL 9.0 Client Compatibility

    Oracle Unbreakable Linux 5.5 Support


GP is out

Greenplum rolled out this last week and fixes many network communication and Append Only table issues that we’ve been running into. The caveat here is that GP has had some additional things crop up so I would recommend waiting until is released before making any updates.

GPDB Release Notes

Disaster Recovery

standby masters are good

We had a meeting last week where we were talking about Greenplum and we got on the topic of the standby masters, which are really just sitting out there pulling transaction logs from the active master and hanging around. The consensus was that we would need the standby masters very rarely, and we hadn’t needed them in the entire time we’d been running GP. That was last week. This week we’ve had two cases where we needed the standby masters. One was a hardware issue with the 10G NIC in the master of one cluster and the other was a table issue during recovery of another cluster. So yeah, the standby masters are needed.


Greenplum plugin accepted

My Greenplum plugin was accepted; it can be found here. It’s not anything all that special yet. Written in Perl, it relies on DBD::Pg to do most of the heavy lifting. Currently it has four basic functions:

1) DB login check. This is a very simple check that just sees if it can make an actual connection to the GPDB. There are instances where a TCP port check would show that the database is up but the login for your user is prevented. A good example of this would be removing remote access from pg_hba.conf in order to do maintenance and then forgetting to re-enable remote users and/or access from specific network locations.

2) SELECT test. This test goes in and does a SELECT COUNT(1), gp_segment_id FROM schema.table GROUP BY gp_segment_id. The idea behind this check is to make sure a table is responding on all segments. This could also be used as an SLA check to make sure you aren’t exceeding certain time constraints when pulling results from tables. Currently I run this against a small 1000-row table I generated in our systems.

3) WRITE test. Here the plugin logs into GPDB and attempts to create a temp table. We’ve had instances where GPDB has been up and you are able to log in and do select queries, yet any query that requires a write just hangs. This check is there to make sure that issue does not come up unnoticed. I have yet to test on a 3.x system whether this check will fail should the system go into “read-only” mode.

4) A very simple segment status check. This is the base for a more extensive check that I will build up. Currently it goes into gp_configuration or gp_segment_configuration, depending upon your choice of 3.x or 4.x, and looks to make sure all the segments are online. It sends back a crit status if any segments show offline. I plan to do a lot of tweaking to this test in the near future, such as being able to specify warn and crit thresholds for the number of segments offline, as well as checking to see if any two segments containing the same content are down.
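The planned threshold logic for the segment check could look something like the following. This is a Python sketch of the idea only, not code from the Perl plugin: the function name, the `(content_id, is_up)` input shape and the default thresholds are all hypothetical, and the only fixed facts are the standard Nagios exit codes (0 OK, 1 WARNING, 2 CRITICAL).

```python
from collections import defaultdict

OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes


def segment_status(segments, warn_down=1, crit_down=2):
    """Classify cluster health from (content_id, is_up) pairs.

    segments: one pair per segment instance, as might be derived from
    the segment configuration table (input shape is an assumption).
    warn_down / crit_down: hypothetical thresholds on segments down.
    """
    down_by_content = defaultdict(int)
    for content_id, is_up in segments:
        if not is_up:
            down_by_content[content_id] += 1

    total_down = sum(down_by_content.values())
    # Two segments holding the same content being down is always critical,
    # regardless of the numeric thresholds.
    paired = sorted(c for c, n in down_by_content.items() if n >= 2)

    if paired:
        return CRITICAL, f"content {paired} down on two or more segments"
    if total_down >= crit_down:
        return CRITICAL, f"{total_down} segments down"
    if total_down >= warn_down:
        return WARNING, f"{total_down} segments down"
    return OK, "all segments up"


# One segment down for content 0, its other copy still up.
code, message = segment_status([(0, False), (0, True), (1, True), (1, True)])
print(code, message)  # 1 1 segments down
```

Checking the same-content case before the numeric thresholds means a single lost content id cannot be masked by a generous crit threshold.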

The timeouts on all the tests are configurable and default to 300 seconds. I wouldn’t suggest setting up any of these tests to repeat more often than every 5 minutes, except for possibly the login test. Doing something like forcing your GP cluster to do a select test against a multi-TB table every minute would probably be a bad idea.

On our system our current check setup looks like this:

This being the first plugin I’ve submitted, feedback on code cleanup or different additions that should be incorporated into a Greenplum plugin would be appreciated.


A nagios check to go in and check various Greenplum availability pieces


-t, --timeout Plugin timeout in seconds [default=300]
-U, --username Username to connect with (mandatory)
-P, --password Password to connect with (mandatory)
-H, --dbhost Database Hostname to connect to (mandatory)
-D, --db Database to connect to (mandatory)

Check to see if Greenplum accepts a connection
This is the default check


Check to see if getting data from a table works
The check executes a SELECT count by segment_id query
for the specified table. This could also be used to
setup SLA checks for getting data from the db

--select-schema Schema for Select check (mandatory)
--select-table Table for Select check (mandatory)

Check to see if table creation works
The check creates a temp table with an id(int) and
vlas(char) columns and sets the distributed by to id.
This helps to monitor if the catalog queries and table
creation are happening in a reasonable amount of time

--create-table Table for Create check (mandatory)

Check to see if GP considers any segments offline
Query the gp_configuration (3.x) and gp_segment_configuration (4.x)
tables to see if any segments are marked down at the master level.
Currently this will crit if any show down.



plugin action

Just uploaded a Greenplum plugin to the Nagios plugin exchange. It’s my first go at an official Nagios plugin. I’ve created checks for a simple login, reading from a table, writing to a temp table and checking for segment status in 3.x and 4.x installs. Once it’s accepted I’ll publish links here.


move data from db to db

We had two Greenplum instances running and we needed to copy one big table to the other. There’s a variety of ways to dump and import but we were moving a large amount of data, so the thought of dropping it to a local file was not a very good option.

One way to move a table from one instance to the other is to use COPY commands. If you are on the destination database you would use:

psql -h remotegreenplumtopulldatafrom -U me -c "COPY myschema.mytable to STDOUT" | psql -c "COPY myschema.mytable FROM STDIN"

We were going to move a lot of data though. So I came up with a quick script to do it a partition at a time

# database_dump
# Simple program to go out and copy dated partitions of
# a greenplum db to another db where the table exists
# it starts at start_time and goes incrementally backwards
# a day at a time until it gets to end_time
# 2010-03-16 SHK

use strict;
use warnings;
use POSIX qw(strftime);
use Time::Local;

# Fields are S,M,H,D,M,Y - Note month is 0-11
my $start_time = timelocal(0,0,0,1,2,2011);
my $end_time = timelocal(0,0,0,1,2,2010);
my $cmd;

my $running_date = $start_time;
until ( $running_date < $end_time ) {
    # Partition suffix for the day being copied, e.g. 20110301
    my $part_date = strftime("%Y%m%d",localtime($running_date));
    $running_date -= 86400;    # step back one day

    # Stream the partition from the remote db straight into the local table
    $cmd = qq~psql -h remotegreenplumtopullfrom -U skahler -c "COPY myschema.mytable_1_prt_$part_date to STDOUT" | psql -c "COPY myschema.mytable FROM STDIN"~;
    print "$cmd\n";
    system($cmd);
}
exit;

In our case the new db didn't have much action going on while I was moving the data in. If it were an active table that I was moving things into, I think I'd create a table on the destination, load the data into it and then exchange that partition into my target table.
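For comparison, here is the same day-by-day walk sketched in Python as a dry run that only builds the command strings. It is an illustration under assumptions rather than a replacement for the script above: the function name is made up, the host and user defaults simply reuse the placeholder names from the one-line example, and the `_1_prt_YYYYMMDD` partition naming is taken from the script above.

```python
from datetime import date, timedelta


def partition_copy_commands(start, end,
                            host="remotegreenplumtopulldatafrom", user="me"):
    """Yield one psql pipeline per daily partition, newest first.

    start and end are datetime.date values; both ends are inclusive.
    Nothing is executed here, the commands are only generated.
    """
    day = start
    while day >= end:
        part = f"myschema.mytable_1_prt_{day:%Y%m%d}"
        yield (f'psql -h {host} -U {user} -c "COPY {part} TO STDOUT" '
               f'| psql -c "COPY myschema.mytable FROM STDIN"')
        day -= timedelta(days=1)  # calendar arithmetic, no 86400-second steps


commands = list(partition_copy_commands(date(2010, 3, 3), date(2010, 3, 1)))
for command in commands:
    print(command)
```

Stepping with `timedelta(days=1)` on dates sidesteps any daylight-saving surprises that second-based arithmetic on local timestamps can run into.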


gp_segment_id helps find skew

Finding table skew in Greenplum is very important. If you end up with a bad distribution of records across segments, you end up with one node doing much more work than the others. Unfortunately Greenplum firmly supports No Child Left Behind: a query will only be as fast as its slowest segment. Thus it is extremely important to have an even distribution across all the segments. A good way to check this is to use the hidden column gp_segment_id. A simple count query grouped by this column will let you know how well your data is spread across segments.

So I’m pulling a sample of 500 tuples from an example advertising impression data set.

First I think I’ll distribute it on the business unit. The results are:

db=# select gp_segment_id, count(1) from skahler.demo group by gp_segment_id;
gp_segment_id | count
5 | 500
(1 row)

Ouch, that was not the right choice. Obviously the set of data I got is all from one unit. So let’s try something different. How about distributing by the campaign they are associated with?

db=# select gp_segment_id, count(1) from skahler.demo group by gp_segment_id;
gp_segment_id | count
1 | 43
7 | 33
0 | 58
5 | 54
4 | 39
6 | 46
3 | 50
2 | 177
(8 rows)

A little bit better, but that doesn’t work so well either. The data is spread across the segments but segment 2 is holding much more data than everybody else. This will make for some hot spotting when I query the data. Next up maybe I can try by ip.

db=# select gp_segment_id, count(1) from skahler.demo group by gp_segment_id;
gp_segment_id | count
7 | 62
4 | 68
1 | 56
6 | 63
3 | 64
0 | 66
5 | 58
2 | 63
(8 rows)

Ah, that looks much better. Of course, this could still end up skewed if our traffic came heavily from a certain country or data came much more often from a specific network segment. So I would need to continue to watch it and see if skew develops over time. It looks like this will work for now.
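To put a number on "much better", one option is to compare the busiest segment against the average. The snippet below is a small Python sketch using the three sets of counts from the queries above; the `skew_ratio` name and the choice of max-over-average as the metric are mine, not something Greenplum reports. The segment count has to be passed in because segments holding no rows do not appear in the GROUP BY output.

```python
def skew_ratio(counts, num_segments):
    """Rows on the fullest segment divided by the ideal per-segment average.

    counts: {gp_segment_id: row_count} as returned by the GROUP BY query.
    num_segments: total segments in the cluster (empty ones are missing
    from counts, so they must be accounted for here).
    A value of 1.0 means a perfectly even spread.
    """
    average = sum(counts.values()) / num_segments
    return max(counts.values()) / average


by_business_unit = {5: 500}
by_campaign = {1: 43, 7: 33, 0: 58, 5: 54, 4: 39, 6: 46, 3: 50, 2: 177}
by_ip = {7: 62, 4: 68, 1: 56, 6: 63, 3: 64, 0: 66, 5: 58, 2: 63}

for name, counts in [("business unit", by_business_unit),
                     ("campaign", by_campaign),
                     ("ip", by_ip)]:
    print(f"{name}: {skew_ratio(counts, 8):.3f}")
```

On these samples the ratio works out to 8.0 for business unit, about 2.83 for campaign and about 1.09 for ip, which matches the eyeball verdict.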


Game over try again

Tonight’s Greenplum error was

ERROR: could not find segment file to use for inserting into relation table (64749). (appendonlywriter.c:569) SQL state: XX000

Which essentially means game over. Dump your table and recreate it, because you won’t be able to put any more data into it. Luckily you can still pull it out. This seems to only happen to append only compressed tables in both 3.x and 4.x. It’s supposed to be fixed in an upcoming patch release. It’s still enough to make George Bush sad.


Getting Distinct counts on keys

It seems that someone always wants to know how many distinct values of X we have in the db.

Good Idea:

SELECT COUNT(1) from ( SELECT distribkeyvalue FROM db.really_big_table GROUP BY distribkeyvalue) a

Bad Idea:

SELECT COUNT(DISTINCT distribkeyvalue ) FROM db.really_big_table

In the first case the Greenplum optimizer will realize that it can do all of the work on the nodes and just forward the final counts to be aggregated. In the second case it’s going to try to bring all the data back to a central location in order to establish uniqueness across the dataset. Ouch.
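The reason the first form can stay local is that every row with a given distribution key value lives on the same segment, so per-segment distinct counts never overlap and can simply be summed. Here is a toy Python simulation of that property. It is not how Greenplum actually hashes or plans anything: the CRC32-mod-N placement and the function name are stand-ins chosen for illustration.

```python
from zlib import crc32


def distinct_count_by_segment(keys, num_segments=8):
    """Count distinct keys by summing independent per-segment counts.

    Each key is placed on a segment by a deterministic hash (CRC32 here,
    purely as a stand-in), so equal keys always land together.
    """
    segments = [set() for _ in range(num_segments)]
    for key in keys:
        segment_id = crc32(str(key).encode()) % num_segments
        segments[segment_id].add(key)  # local de-duplication only

    local_counts = [len(segment) for segment in segments]
    return sum(local_counts)  # the only step that needs a central node


keys = ["a", "b", "a", "c", "b", "a"]
print(distinct_count_by_segment(keys))  # 3
```

Only one small integer per segment has to travel to the master, instead of every key value.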