Categories
Uncategorized

gp_segment_id helps find skew

Finding table skew in Greenplum is very important. If you end up with a bad distribution of records across segments you end up with one node doing much more work than the others. Unfortunately Greenplum firmly supports No Child Left Behind, it will only be as fast as it’s slowest member. Thus it is extremely important to have an even distribution across all the segments. A good way to check this is to use the hidden column gp_segment_id. A simple count query with this parameter will let you know how well your data is spread across nodes.

So I’m pulling a sample of 500 tuples from an example advertising impression data set.

First I think I’ll distribute it on the business unit. The results are:

db=# select gp_segment_id, count(1) from skahler.demo group by gp_segment_id;
gp_segment_id | count
---------------+-------
5 | 500
(1 row)

Ouch, that was not the right choice. Obviously the set of data I got is all from one unit. So let’s try something different. How about distributing by the campaign they are associate with.

db=# select gp_segment_id, count(1) from skahler.demo group by gp_segment_id;
gp_segment_id | count
---------------+-------
1 | 43
7 | 33
0 | 58
5 | 54
4 | 39
6 | 46
3 | 50
2 | 177
(8 rows)

A little bit better, but that doesn’t work so well either. The data is spread across the segments but segment 2 is holding much more data than everybody else. This will make for some hot spotting when I query the data. Next up maybe I can try by ip.

db=# select gp_segment_id, count(1) from skahler.demo group by gp_segment_id;
gp_segment_id | count
---------------+-------
7 | 62
4 | 68
1 | 56
6 | 63
3 | 64
0 | 66
5 | 58
2 | 63
(8 rows)

Ah, that looks much better. Of course that this could be skewed in this dataset if we were heavily from a certain country or data came much more often from specific network segment. So I would need to continue to watch it and see if skew develops over time. It looks like this will work for now.

Categories
Uncategorized

Getting Distinct counts on keys

It seems that someone always wants to know how much of X do I have in the db

Good Idea:

SELECT COUNT(1) from ( SELECT distribkeyvalue FROM db.really_big_table GROUP BY distribkeyvalue) a

Bad Idea:

SELECT COUNT(DISTINCT distribkeyvalue ) FROM db.really_big_table

In the first case the Greenplum optimizer will realize that it can do all of the work on the nodes and just forward the final counts to be aggregated where in the second case it’s going to try to bring all the data back to a central location in order to attempt establish uniqueness in the dataset, ouch.

Categories
Uncategorized

More Info Soon

Gearing up and planning to put this site into action. The goal to be a repository for Greenplum ideas and tips submitted and maintained by people using the system.