Getting Distinct counts on keys
It seems that someone always wants to know how much of X do I have in the db
SELECT COUNT(1) from ( SELECT distribkeyvalue FROM db.really_big_table GROUP BY distribkeyvalue) a
SELECT COUNT(DISTINCT distribkeyvalue ) FROM db.really_big_table
In the first case the Greenplum optimizer will realize that it can do all of the work on the nodes and just forward the final counts to be aggregated where in the second case it’s going to try to bring all the data back to a central location in order to attempt establish uniqueness in the dataset, ouch.