Year: 2011

gp_segment_id helps find skew

Post author By skahler
Post date 2011/02/20
1 Comment on gp_segment_id helps find skew

Finding table skew in Greenplum is very important. If you end up with a bad distribution of records across segments you end up with one node doing much more work than the others. Unfortunately Greenplum firmly supports No Child Left Behind, it will only be as fast as it’s slowest member. Thus it is extremely important to have an even distribution across all the segments. A good way to check this is to use the hidden column gp_segment_id. A simple count query with this parameter will let you know how well your data is spread across nodes.

So I’m pulling a sample of 500 tuples from an example advertising impression data set.

First I think I’ll distribute it on the business unit. The results are:

db=# select gp_segment_id, count(1) from skahler.demo group by gp_segment_id; gp_segment_id | count ---------------+------- 5 | 500 (1 row)

Ouch, that was not the right choice. Obviously the set of data I got is all from one unit. So let’s try something different. How about distributing by the campaign they are associate with.

db=# select gp_segment_id, count(1) from skahler.demo group by gp_segment_id; gp_segment_id | count ---------------+------- 1 | 43 7 | 33 0 | 58 5 | 54 4 | 39 6 | 46 3 | 50 2 | 177 (8 rows)

A little bit better, but that doesn’t work so well either. The data is spread across the segments but segment 2 is holding much more data than everybody else. This will make for some hot spotting when I query the data. Next up maybe I can try by ip.

db=# select gp_segment_id, count(1) from skahler.demo group by gp_segment_id; gp_segment_id | count ---------------+------- 7 | 62 4 | 68 1 | 56 6 | 63 3 | 64 0 | 66 5 | 58 2 | 63 (8 rows)

Ah, that looks much better. Of course that this could be skewed in this dataset if we were heavily from a certain country or data came much more often from specific network segment. So I would need to continue to watch it and see if skew develops over time. It looks like this will work for now.

Errors

Game over try again

Tonights Greenplum error was

ERROR: could not find segment file to use for inserting into relation table (64749). (appendonlywriter.c:569) SQL state: XX000

Which essentially means game over. Dump your table and recreate it, because you won’t be able to put any more data into it. Luckily you can still pull it out. This seems to only happen to append only compressed tables in both 3.x and 4.x. It’s supposed to be fixed in an upcoming patch release. It’s still enough to make George Bush sad.

Tags Append-only, compressed, error

Uncategorized

Getting Distinct counts on keys

Post author By skahler
Post date 2011/02/17
1 Comment on Getting Distinct counts on keys

It seems that someone always wants to know how much of X do I have in the db

Good Idea:

SELECT COUNT(1) from ( SELECT distribkeyvalue FROM db.really_big_table GROUP BY distribkeyvalue) a

Bad Idea:

SELECT COUNT(DISTINCT distribkeyvalue ) FROM db.really_big_table

In the first case the Greenplum optimizer will realize that it can do all of the work on the nodes and just forward the final counts to be aggregated where in the second case it’s going to try to bring all the data back to a central location in order to attempt establish uniqueness in the dataset, ouch.

Benchmarking

Disk performance and disk fragmentation

Post author By skahler
Post date 2011/02/17
No Comments on Disk performance and disk fragmentation

My last post had some statistics for a C2100 cluster we were running. Last night I did maintenance on a cluster that is running on R710 attached via PERC6/E controllers to a MD1120 array filled with 24 300GB disks (10k 2.5″). These are split into 4 arrays with 6 disks in each setup RAID5. The gpcheckperf at the start of my recent maintenance

gpadmin@mdw:~> gpcheckperf -f hosts.seg -d /data/vol1 -d /data/vol2 -d /data/vol3 -d /data/vol4 -r d -D

disk write min bandwidth (MB/s): 888.01 [sdw14-1]
disk write max bandwidth (MB/s): 968.73 [ sdw4-1]

disk read min bandwidth (MB/s): 1592.66 [ sdw7-1]
disk read max bandwidth (MB/s): 1941.55 [sdw13-1]

one of the next things I do is take a look at disk defragmentation using “xfs_db -c frag -r /dev/X” where X is one of my four arrays. In this case I came up with about 35% fragmentation across all of our arrays.

to clean this up I do a run of xfs_fsr across the disks which got them all down to less than 1% fragmentation.

the next disk test produced similar write speeds but increased read speed

disk write min bandwidth (MB/s): 872.72 [ sdw8-1]
disk write max bandwidth (MB/s): 960.32 [sdw15-1]

disk read min bandwidth (MB/s): 1975.79 [ sdw8-1]
disk read max bandwidth (MB/s): 2052.40 [ sdw2-1]

Up until the last couple of months it was not uncommon for us to hit 80%+ fragmentation on all of our nodes in the Greenplum cluster. Our recent switch from Suse to Redhat should help fix this, there was apparently a bug fix that RHEL implements in a recent kernel release to clean this up. I’ve noticed that in this cluster fragmentation can have a significant impact on our reported speeds. Oddly on clusters with a single controller running 12 600GB disks ( 15k 3.5″ ) split into two arrays that I see very little change in these io reports, even when stepping down from 95% fragmentation to 1%.

Tags R710, xfs

Benchmarking

What kind of disk performance does your GP see?

Post author By skahler
Post date 2011/02/15
No Comments on What kind of disk performance does your GP see?

During our regular maintenance widows I run a gpcheckperf to see where our disk speeds in the Greenplum cluster are coming in. This is a result from an C2100 with a single LSI 9260-8i controller. There are two virtual disk composed of 6 disks each arranged in a RAID5. For the file system I’m using xfs with the mount options: logbufs=8, logbsize=256k, noatime, attr2, nobarrier and seeing these results.

/usr/local/greenplum-db/./bin/gpcheckperf -f /data/gpadmin/hosts.seg -d /data/gpdb_p1 -d /data/gpdb_p2 -r d -D

disk write min bandwidth (MB/s): 945.25 [sdw15]
disk write max bandwidth (MB/s): 1007.74 [sdw13]

disk read min bandwidth (MB/s): 1239.10 [sdw15]
disk read max bandwidth (MB/s): 1691.65 [sdw12]

Are these similar number to what you are getting in your clusters?

Tags C2100, Greenplum, LSI 9260-8i

Documentation

Adding plperlu Language to Greenplum on RHEL5

Post author By skahler
Post date 2011/02/14
1 Comment on Adding plperlu Language to Greenplum on RHEL5

In order to get plperlu added as a language on our Greenplum 4.0RHEL5 cluster I had to take a couple additional steps. My first unhappy message was

db=# CREATE LANGUAGE plperlu;
ERROR: could not load library “/usr/local/greenplum-db-4.0.4.3/lib/postgresql/plperl.so”:
libperl.so: cannot open shared object file: No such file or directory
db=# CREATE LANGUAGE plperlu;ERROR: could not load library “/usr/local/greenplum-db-4.0.4.3/lib/postgresql/plperl.so”:
libperl.so: cannot open shared object file: No such file or directory

So greenplum can’t find libperl.so

[gpadmin@mdw ~]$ locate libperl.so

/usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi/CORE/libperl.so

Looks like it’s on the system so I just need to make it available

[root@mdw ~]# echo “/usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi/CORE/” > /etc/ld.so.conf.d/perl.conf

[root@mdw ~]# ldconfig

Next I went out and did the same thing on all the nodes using gpssh. Then it’s back to the master and we try to create the language again

db=# CREATE LANGUAGE plperlu;

CREATE LANGUAGE

Time: 892.420 ms

I should note that I tried to install plperl on the SLES11 servers we had and ran into an issue because the version of perl on SLES11 is 5.10 and unfortunately the plperl.so within Greenplum’s distro is looking for 5.8. I didn’t look for a fix because I knew we were going to jump to Redhat in the near future and it would work there.

Alpine Miner

Alpine Miner First Look

Downloaded the new Alpine Miner off of the the Greenplum Community site. Unfortunately for me there is only a Mac and Windows version so I had to fire up a VM to try it out.

Documentation

Adding Documentation

Doing the most exciting of all tasks, I’m adding documentation to the site.

Greenplum Command references at : GP COMMANDS

Greenplum Documentation at: DOCS

I find myself constantly sharing my latest GP Admin guide with other people so I figured this would be a great spot to drop the pdf documentation that you’ll find in the release packages. It would also be handy to keep a running log the various documentation attached with each release.

Additionally I’m using <command> -? constantly to get the help from the program I want to run. So I put these up on the site too in order to make it easier to get at those outputs.

Uncategorized

More Info Soon

Gearing up and planning to put this site into action. The goal to be a repository for Greenplum ideas and tips submitted and maintained by people using the system.