Benchmarks for R510 Greenplum Nodes

gpcheckperf results from hammering against a couple of our R510s. The servers are set up with twelve 3.5″ 600GB 15k SAS6 disks split into four virtual disks. The first six disks form one group: 50GB is split off for an OS partition and the rest is dropped into a data partition. The second set of six disks is set up in a similar fashion, with 50GB going to a swap partition and the rest going to another big data partition. The controller is set to No Read Ahead, Force Write Back and a Stripe Element Size of 128KB. The partitions are formatted with XFS and the servers run RHEL 5.6.

[gpadmin@mdw ~]$ /usr/local/greenplum-db/bin/gpcheckperf -h sdw13 -h sdw15  -d /data/vol1 -d /data/vol2 -r dsN -D -v

  disk write avg time (sec): 85.04
  disk write tot bytes: 202537697280
  disk write tot bandwidth (MB/s): 2275.84
  disk write min bandwidth (MB/s): 1087.34 [sdw15]
  disk write max bandwidth (MB/s): 1188.50 [sdw13]
  -- per host bandwidth --
     disk write bandwidth (MB/s): 1087.34 [sdw15]
     disk write bandwidth (MB/s): 1188.50 [sdw13]

  disk read avg time (sec): 64.67
  disk read tot bytes: 202537697280
  disk read tot bandwidth (MB/s): 2987.98
  disk read min bandwidth (MB/s): 1461.30 [sdw15]
  disk read max bandwidth (MB/s): 1526.68 [sdw13]
  -- per host bandwidth --
     disk read bandwidth (MB/s): 1461.30 [sdw15]
     disk read bandwidth (MB/s): 1526.68 [sdw13]

  stream tot bandwidth (MB/s): 8853.81
  stream min bandwidth (MB/s): 4250.22 [sdw13]
  stream max bandwidth (MB/s): 4603.59 [sdw15]
  -- per host bandwidth --
     stream bandwidth (MB/s): 4603.59 [sdw15]
     stream bandwidth (MB/s): 4250.22 [sdw13]

 Netperf bisection bandwidth test
 sdw13 -> sdw15 = 1131.840000
 sdw15 -> sdw13 = 1131.820000

 sum = 2263.66 MB/sec
 min = 1131.82 MB/sec
 max = 1131.84 MB/sec
 avg = 1131.83 MB/sec
 median = 1131.84 MB/sec
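
The "tot bandwidth" figures above are just the per-host figures added together, which is worth remembering when comparing clusters of different sizes. A minimal Python sketch that pulls the per-host lines out of a saved report and re-adds them (the regular expression and the function name `per_host_bandwidth` are my own, written against the output format shown above, and are not part of gpcheckperf):

```python
import re

# Matches per-host lines such as "disk write bandwidth (MB/s): 1087.34 [sdw15]".
# The min/max/tot summary lines do not match because of the extra word.
HOST_LINE = re.compile(
    r"^\s*(disk write|disk read|stream) bandwidth \(MB/s\):\s*([\d.]+)\s*\[\s*([^\]\s]+)\s*\]"
)

def per_host_bandwidth(report):
    """Return {test name: {host: MB/s}} from gpcheckperf-style text."""
    results = {}
    for line in report.splitlines():
        m = HOST_LINE.match(line)
        if m:
            results.setdefault(m.group(1), {})[m.group(3)] = float(m.group(2))
    return results

# Per-host lines copied from the run above.
report = """
  disk write tot bandwidth (MB/s): 2275.84
     disk write bandwidth (MB/s): 1087.34 [sdw15]
     disk write bandwidth (MB/s): 1188.50 [sdw13]
  disk read tot bandwidth (MB/s): 2987.98
     disk read bandwidth (MB/s): 1461.30 [sdw15]
     disk read bandwidth (MB/s): 1526.68 [sdw13]
     stream bandwidth (MB/s): 4603.59 [sdw15]
     stream bandwidth (MB/s): 4250.22 [sdw13]
"""

by_test = per_host_bandwidth(report)
totals = {test: round(sum(hosts.values()), 2) for test, hosts in by_test.items()}
for test, total in totals.items():
    print(f"{test}: {total:.2f} MB/s across {len(by_test[test])} hosts")
```

Dividing each total by the number of hosts gives a per-node figure that is easier to compare between clusters.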

Controller Setting for Greenplum

I brought another node into one of our clusters yesterday and it made me think of the controller settings I put on the system. In various systems we’ve used the PERC6/E, H700 and LSI 9260-8i controllers, and I’ve found that on all of them I see the best disk performance and reliability if I set:

  • Read Policy: No Read Ahead. Having the controller do read ahead dramatically increases the I/O done on my servers and I’ve seen no benefit from it.
  • Write Policy: Force Write Back. This is playing with a little bit of fire, because I’m telling the controller to go ahead and use the battery-backed write cache even if the battery isn’t full or is going through a charging cycle. The fact that Greenplum data is duplicated on another server gets me past the small number of edge cases where the server is without power long enough for the lack of juice in the battery to come into play. The reason I accept that risk is that otherwise, when the controller goes to charge the battery, it stops using the cache and forces everything to write straight to disk. This has a huge impact on I/O speed and causes the whole cluster to grind to a halt while the one server struggles with I/O.

What started the need for me to bring this other node into our cluster is that at every outage I do an I/O check on the clusters using gpcheckperf, and this time I saw one array underperforming all the others:

disk write bandwidth (MB/s): 620.81 [sdw11]
disk write bandwidth (MB/s): 365.38 [sdw09]
disk write bandwidth (MB/s): 621.01 [sdw08]

It’s an issue we’ve had before, where one disk in the array starts to underperform but doesn’t fail out. At this point I’ll need to go in and break the RAID5 array into direct access for each disk individually and run benchmarks against them to see if I can figure out who the bad boy is and eject him from class.
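
Spotting the slow host by eye works with three nodes but gets tedious with more. A rough Python sketch of the same check, flagging any host whose write bandwidth falls well below the median of the group (the function name `slow_hosts` and the 80% cutoff are my own assumptions, so tune the threshold to your hardware):

```python
from statistics import median

def slow_hosts(bandwidth_by_host, threshold=0.8):
    """Return hosts whose MB/s is below `threshold` times the group median."""
    mid = median(bandwidth_by_host.values())
    return sorted(
        host for host, mbps in bandwidth_by_host.items() if mbps < threshold * mid
    )

# Per-host write bandwidth from the gpcheckperf output above.
write_mbps = {"sdw11": 620.81, "sdw09": 365.38, "sdw08": 621.01}
print(slow_hosts(write_mbps))  # ['sdw09']
```

The median is used instead of the mean so that one very slow array does not drag the baseline down with it.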


Disk performance and disk fragmentation

My last post had some statistics for a C2100 cluster we were running. Last night I did maintenance on a cluster that runs on R710s attached via PERC6/E controllers to an MD1120 array filled with 24 300GB disks (10k 2.5″). These are split into 4 arrays of 6 disks each, set up as RAID5. The gpcheckperf at the start of my recent maintenance showed:

gpadmin@mdw:~> gpcheckperf -f hosts.seg -d /data/vol1 -d /data/vol2 -d /data/vol3 -d /data/vol4 -r d -D

disk write min bandwidth (MB/s): 888.01 [sdw14-1]
disk write max bandwidth (MB/s): 968.73 [ sdw4-1]

disk read min bandwidth (MB/s): 1592.66 [ sdw7-1]
disk read max bandwidth (MB/s): 1941.55 [sdw13-1]

One of the next things I do is take a look at disk fragmentation using “xfs_db -c frag -r /dev/X”, where X is one of my four arrays. In this case I came up with about 35% fragmentation across all of our arrays.

To clean this up I do a run of xfs_fsr across the disks, which got them all down to less than 1% fragmentation.
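
For tracking fragmentation over time it helps to turn the xfs_db report into numbers. A small Python sketch, assuming the frag command prints a line of the form "actual N, ideal M, fragmentation factor P%" and that the factor is (actual - ideal) / actual; both are my understanding of xfs_db and should be checked against the version on your hosts. The sample line is made up for illustration and is not output from our arrays:

```python
import re

# Assumed shape of the xfs_db frag report line.
FRAG_LINE = re.compile(r"actual (\d+), ideal (\d+), fragmentation factor ([\d.]+)%")

def parse_frag(output):
    """Return (actual extents, ideal extents, reported %, recomputed %)."""
    m = FRAG_LINE.search(output)
    if m is None:
        raise ValueError("unrecognised xfs_db frag output")
    actual, ideal = int(m.group(1)), int(m.group(2))
    # Recompute the factor from the extent counts as a sanity check.
    computed = 100.0 * (actual - ideal) / actual if actual else 0.0
    return actual, ideal, float(m.group(3)), computed

# Hypothetical sample line, chosen to land on the 35% figure mentioned above.
sample = "actual 1000, ideal 650, fragmentation factor 35.00%"
actual, ideal, reported, computed = parse_frag(sample)
print(f"{actual} extents where {ideal} would do: {reported:.2f}% fragmented")
```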

The next disk test produced similar write speeds but increased read speeds:

disk write min bandwidth (MB/s): 872.72 [ sdw8-1]
disk write max bandwidth (MB/s): 960.32 [sdw15-1]

disk read min bandwidth (MB/s): 1975.79 [ sdw8-1]
disk read max bandwidth (MB/s): 2052.40 [ sdw2-1]
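
To put numbers on the change, here is the before and after comparison as a short Python calculation. The figures are the min and max lines from the two runs above; the dictionary keys and the helper name `pct_change` are my own. Note that the slowest and fastest hosts differ between the runs, so this compares the cluster's range rather than any single node:

```python
def pct_change(before, after):
    """Percentage change from `before` to `after`."""
    return 100.0 * (after - before) / before

# MB/s figures from the gpcheckperf runs before and after xfs_fsr.
before = {"write min": 888.01, "write max": 968.73,
          "read min": 1592.66, "read max": 1941.55}
after = {"write min": 872.72, "write max": 960.32,
         "read min": 1975.79, "read max": 2052.40}

changes = {name: pct_change(before[name], after[name]) for name in before}
for name, change in changes.items():
    print(f"{name}: {before[name]:.2f} -> {after[name]:.2f} MB/s ({change:+.1f}%)")
```

The write figures move by under 2% either way, while the slowest read figure improves by roughly a quarter.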

Up until the last couple of months it was not uncommon for us to hit 80%+ fragmentation on all of the nodes in the Greenplum cluster. Our recent switch from SUSE to Red Hat should help fix this; there was apparently a bug fix in a recent RHEL kernel release that cleans this up. I’ve noticed that in this cluster fragmentation can have a significant impact on our reported speeds. Oddly, on clusters with a single controller running 12 600GB disks (15k 3.5″) split into two arrays, I see very little change in these I/O reports, even when stepping down from 95% fragmentation to 1%.


What kind of disk performance does your GP see?

During our regular maintenance windows I run gpcheckperf to see where the disk speeds in our Greenplum cluster are coming in. This is a result from a C2100 with a single LSI 9260-8i controller. There are two virtual disks composed of 6 disks each, arranged in RAID5. For the file system I’m using XFS with the mount options logbufs=8, logbsize=256k, noatime, attr2 and nobarrier, and I’m seeing these results.

/usr/local/greenplum-db/./bin/gpcheckperf -f /data/gpadmin/hosts.seg -d /data/gpdb_p1 -d /data/gpdb_p2 -r d -D

disk write min bandwidth (MB/s): 945.25 [sdw15]
disk write max bandwidth (MB/s): 1007.74 [sdw13]

disk read min bandwidth (MB/s): 1239.10 [sdw15]
disk read max bandwidth (MB/s): 1691.65 [sdw12]

Are these similar numbers to what you are getting in your clusters?
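
If you want to compare like with like, it is worth confirming that every segment host really has the mount options listed above. A minimal Python sketch that checks one line in /proc/mounts format against that list (the device name in the sample is hypothetical, and the assumption that the kernel reports the options with exactly this spelling should be verified on your own hosts):

```python
# Mount options listed in the post above.
REQUIRED = {"logbufs=8", "logbsize=256k", "noatime", "attr2", "nobarrier"}

def missing_options(mounts_line, required=REQUIRED):
    """Return the required options that are absent from a /proc/mounts line."""
    # Fields: device, mount point, fs type, options, dump, pass.
    fields = mounts_line.split()
    options = set(fields[3].split(","))
    return sorted(required - options)

# Hypothetical sample line; /dev/sdb1 is made up for illustration.
sample = "/dev/sdb1 /data/gpdb_p1 xfs rw,noatime,attr2,nobarrier,logbufs=8,logbsize=256k,noquota 0 0"
print(missing_options(sample))  # []
```

On a real host, each line of /proc/mounts whose mount point is a Greenplum data directory could be passed through the same function.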