Getting Distinct counts on keys

It seems that someone always wants to know how much of X do I have in the db

Good Idea:

SELECT COUNT(1) from ( SELECT distribkeyvalue FROM db.really_big_table GROUP BY distribkeyvalue) a

Bad Idea:

SELECT COUNT(DISTINCT distribkeyvalue ) FROM db.really_big_table

In the first case the Greenplum optimizer will realize that it can do all of the work on the nodes and just forward the final counts to be aggregated where in the second case it’s going to try to bring all the data back to a central location in order to attempt establish uniqueness in the dataset, ouch.

One reply on “Getting Distinct counts on keys”

In 4.0 and higher versions of the database you should be able to use the second form and it should do as well as the first.

Leave a Reply

Your email address will not be published. Required fields are marked *