Finding table skew in Greenplum is very important. If you end up with a bad distribution of records across segments you end up with one node doing much more work than the others. Unfortunately Greenplum firmly supports No Child Left Behind, it will only be as fast as it’s slowest member. Thus it is extremely important to have an even distribution across all the segments. A good way to check this is to use the hidden column gp_segment_id. A simple count query with this parameter will let you know how well your data is spread across nodes.
So I’m pulling a sample of 500 tuples from an example advertising impression data set.
First I think I’ll distribute it on the business unit. The results are:
db=# select gp_segment_id, count(1) from skahler.demo group by gp_segment_id;
gp_segment_id | count
---------------+-------
5 | 500
(1 row)
Ouch, that was not the right choice. Obviously the set of data I got is all from one unit. So let’s try something different. How about distributing by the campaign they are associate with.
db=# select gp_segment_id, count(1) from skahler.demo group by gp_segment_id;
gp_segment_id | count
---------------+-------
1 | 43
7 | 33
0 | 58
5 | 54
4 | 39
6 | 46
3 | 50
2 | 177
(8 rows)
A little bit better, but that doesn’t work so well either. The data is spread across the segments but segment 2 is holding much more data than everybody else. This will make for some hot spotting when I query the data. Next up maybe I can try by ip.
db=# select gp_segment_id, count(1) from skahler.demo group by gp_segment_id;
gp_segment_id | count
---------------+-------
7 | 62
4 | 68
1 | 56
6 | 63
3 | 64
0 | 66
5 | 58
2 | 63
(8 rows)
Ah, that looks much better. Of course that this could be skewed in this dataset if we were heavily from a certain country or data came much more often from specific network segment. So I would need to continue to watch it and see if skew develops over time. It looks like this will work for now.