Nagios – gpadmin.me

My Greenplum plugin was accepted; it can be found here. It’s nothing all that special yet. Written in Perl, it relies on DBD::Pg to do most of the heavy lifting. Currently it has four basic functions:

1) DB login check. This is a very simple check that just sees whether it can make an actual connection to GPDB. There are cases where a TCP port check shows the database as up but logins for your user are blocked. A good example would be removing remote access from pg_hba.conf in order to do maintenance and then forgetting to re-enable remote users and/or access from specific network locations.

2) SELECT test. This test runs SELECT COUNT(1), gp_segment_id FROM schema.table GROUP BY gp_segment_id. The idea behind this check is to make sure a table is responding on all segments. It could also be used as an SLA check to make sure you aren’t exceeding certain time limits when pulling results from tables. Currently I run this against a small 1000-row table I generated on our systems.

3) WRITE test. Here the plugin logs into GPDB and attempts to create a temp table. We’ve had instances where GPDB was up and you could log in and run SELECT queries, yet any query that required a write just hung. This check is there to catch that problem. I have yet to test whether this check fails on a 3.x system when it goes into “read-only” mode.

4) A very simple segment status check. This is the base for a more extensive check that I will build up. Currently it queries gp_configuration or gp_segment_configuration, depending on whether you choose 3.x or 4.x, and makes sure all the segments are online. It returns a crit status if any segments show offline. I plan to do a lot of tweaking to this test in the near future, such as being able to specify warn and crit thresholds for the number of segments online, as well as checking whether any two segments containing the same content are down.
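The segment status check in item 4, including the planned thresholds and the same-content check, could be sketched roughly as follows. This is a Python illustration, not the plugin's Perl code. The function name, the (content, is_up) row shape, and the warn_down/crit_down parameters are hypothetical, and the thresholds count offline segments rather than online ones. The column names mentioned in the comments are assumptions about the catalog tables.

```python
from collections import defaultdict

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def evaluate_segments(rows, warn_down=1, crit_down=1):
    """Turn segment rows into a (status, message) pair.

    rows: iterable of (content, is_up) tuples, one per segment instance.
    Assumption: on 4.x these could come from gp_segment_configuration
    (content, status = 'u'), on 3.x from gp_configuration (content, valid).
    With the default thresholds any offline segment is CRITICAL, which
    matches the behaviour described in the post.
    """
    down_by_content = defaultdict(int)
    total = 0
    for content, is_up in rows:
        total += 1
        if not is_up:
            down_by_content[content] += 1

    down = sum(down_by_content.values())
    # Two or more offline segments holding the same content.
    doubled = sorted(c for c, n in down_by_content.items() if n >= 2)

    if doubled:
        return CRITICAL, "CRITICAL: multiple segments down for content %s" % doubled
    if down >= crit_down:
        return CRITICAL, "CRITICAL: %d of %d segments offline" % (down, total)
    if down >= warn_down:
        return WARNING, "WARNING: %d of %d segments offline" % (down, total)
    return OK, "OK: all %d segments online" % total
```

Keeping the evaluation separate from the database query means the same function can serve both the 3.x and 4.x variants and can be tested without a cluster.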

The timeout on all the tests is configurable and defaults to 300 seconds. I wouldn’t suggest setting up any of these tests to repeat more often than every 5 minutes, except possibly the login test. Forcing your GP cluster to run a select test against a multi-TB table every minute would probably be a bad idea.
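As an illustration of how a configurable timeout with a 300-second default might wrap any of the tests, here is a minimal Python sketch. It is not taken from the plugin. The helper name run_with_timeout is hypothetical, and reporting a timeout as CRITICAL (rather than UNKNOWN) is an assumption.

```python
import threading

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def run_with_timeout(check, timeout=300):
    """Run check() in a worker thread and give up after `timeout` seconds.

    check is any callable returning a (status, message) pair.
    Assumption: a timed-out check is reported as CRITICAL.
    """
    result = {}

    def worker():
        try:
            result["value"] = check()
        except Exception as exc:
            # A crashing check is reported rather than silently lost.
            result["value"] = (UNKNOWN, "UNKNOWN: check raised %r" % (exc,))

    # Daemon thread: a hung query must not keep the plugin process alive.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    if thread.is_alive():
        return CRITICAL, "CRITICAL: check timed out after %s seconds" % timeout
    return result["value"]
```

Note that this only stops waiting for the check; it does not cancel a query that is still running on the database side.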

On our system our current check setup looks like this:

This being the first plugin I’ve submitted, feedback on code cleanup or on additions that should be incorporated into a Greenplum plugin would be appreciated.

check_greenplum

A nagios check to go in and check various Greenplum availability pieces

GENERAL OPTIONS:
  -t, --timeout           Plugin timeout in seconds       [default=300]
  -U, --username          Username to connect with        (mandatory)
  -P, --password          Password to connect with        (mandatory)
  -H, --dbhost            Database Hostname to connect to (mandatory)
  -D, --db                Database to connect to          (mandatory)

TESTS:
  --do-connect-test
      Check to see if Greenplum accepts a connection
      This is the default check

  --do-select-test
      Check to see if getting data from a table works
      The check executes a SELECT count by segment_id query
      for the specified table. This could also be used to
      set up SLA checks for getting data from the db
  --select-schema         Schema for Select check         (mandatory)
  --select-table          Table for Select check          (mandatory)

  --do-create-test
      Check to see if table creation works
      The check creates a temp table with id(int) and
      vlas(char) columns and sets the distributed by to id.
      This helps to monitor if the catalog queries and table
      creation are happening in a reasonable amount of time
  --create-table          Table for Create check          (mandatory)

  --do-3x-all-segments-valid
  --do-4x-all-segments-valid
      Check to see if GP considers any segments offline
      Query the gp_configuration (3.x) and gp_segment_configuration (4.x)
      tables to see if any segments are marked down at the master level.
      Currently this will crit if any show down.
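
The statements behind the select and create tests described in the help text might be built along these lines. This is a hedged Python sketch, not the plugin's Perl code. The function names and the identifier check are hypothetical, and the exact DDL the plugin issues is an assumption based on the help text (columns id int and vlas char, distributed by id).

```python
import re

# Conservative whitelist for unquoted SQL identifiers (an assumption,
# chosen so user-supplied names cannot inject SQL).
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def _ident(name):
    """Return name unchanged if it is a plain identifier, else raise."""
    if not _IDENT.fullmatch(name):
        raise ValueError("unsafe identifier: %r" % (name,))
    return name


def select_test_sql(schema, table):
    """Row count per segment, as in the SELECT test described in the post."""
    return "SELECT COUNT(1), gp_segment_id FROM %s.%s GROUP BY gp_segment_id" % (
        _ident(schema),
        _ident(table),
    )


def create_test_sql(table):
    """Temp table with id(int) and vlas(char), distributed by id."""
    return "CREATE TEMP TABLE %s (id int, vlas char) DISTRIBUTED BY (id)" % _ident(table)
```

For example, select_test_sql("public", "nagios_check") yields the per-segment count query for a hypothetical table public.nagios_check, corresponding to the values passed via --select-schema and --select-table.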