High Avg. Disk Queue Length and Finding the Cause
Avg. Disk Queue Length is one of the main counters in the perfmon application. It is an estimate of the number of requests on the physical or logical disk that are either in service or waiting for service. The value is calculated as the product of Disk Transfers/sec (the I/O rate) and Avg. Disk sec/Transfer (the response time per I/O).
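As a rough illustration of that product, here is a minimal Python sketch. The function name and the sample figures (400 transfers per second at 10 ms each) are hypothetical and chosen only to make the arithmetic easy to follow; they are not measurements from this article.

```python
def avg_disk_queue_length(transfers_per_sec: float, avg_sec_per_transfer: float) -> float:
    """Estimate Avg. Disk Queue Length as I/O rate multiplied by response time."""
    return transfers_per_sec * avg_sec_per_transfer


# Hypothetical sample: 400 transfers/sec, each taking 0.010 s (10 ms) on average.
estimate = avg_disk_queue_length(transfers_per_sec=400, avg_sec_per_transfer=0.010)
print(f"Estimated Avg. Disk Queue Length: {estimate:.1f}")
```

With these assumed inputs the estimate works out to about 4 outstanding requests, which shows how either a higher I/O rate or a slower response time pushes the counter up.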
What does it all mean? Many people find this counter confusing, and in many cases a high Avg. Disk Queue Length does not indicate a bottleneck. To see whether Avg. Disk Queue Length is a true representation of your disk’s performance, you need to compare the Current Disk Queue Length at the start and at the end of a sampling interval. Add the Current Disk Queue Length counter to the graph in perfmon.
If the Current Disk Queue Length for the previous interval matches the Current Disk Queue Length for the current interval, then indeed the Avg. Disk Queue Length can be used as a general representation of the condition of your storage system.
Say your Avg. Disk Queue Length shows a value of 4, the Current Disk Queue Length for the current interval is 3, and the value for the previous interval was 0. This means that I/O arrivals outnumbered I/O completions during the interval. When that happens, the calculated Avg. Disk Queue Length is not a reliable figure, often to the horror of system administrators.
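The comparison described above can be sketched in a few lines of Python. The function names and the `tolerance` parameter are assumptions made for illustration; perfmon has no such setting, and a strict match (tolerance of 0) is what the text describes.

```python
def queue_delta(previous_queue: int, current_queue: int) -> int:
    """Net change in Current Disk Queue Length: arrivals minus completions."""
    return current_queue - previous_queue


def average_is_trustworthy(previous_queue: int, current_queue: int, tolerance: int = 0) -> bool:
    """True when the queue length at both ends of the interval matches.

    `tolerance` is a hypothetical allowance for small differences.
    """
    return abs(queue_delta(previous_queue, current_queue)) <= tolerance


# Values from the example: previous interval 0, current interval 3.
print(queue_delta(0, 3))             # 3 more arrivals than completions
print(average_is_trustworthy(0, 3))  # False: treat the average with suspicion
```

A check like this only flags intervals where the average should be doubted; it does not correct the value.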
Suppose you have determined that the value of Avg. Disk Queue Length is indeed accurate and useful. How much is too much? As a general rule for hard disks, an Avg. Disk Queue Length greater than 2 per hard disk for extended periods of time is considered undesirable. If you have a RAID system with 8 disks, you do not want an Avg. Disk Queue Length greater than 16. Faster hard disks with quicker access times (and therefore faster I/O) allow greater flexibility with these numbers. Avg. Disk sec/Read and Avg. Disk sec/Write should be under 10 ms; values over 20 ms may indicate a bottleneck. If Avg. Disk Queue Length is over 2 while % Disk Time is hovering at 60% or above, you may want to look into a possible I/O bottleneck.
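These rules of thumb can be collected into a small check. The sketch below is only one way to encode them: the function name, the constant names, and the sample readings are hypothetical, and the thresholds are the general guidelines from the paragraph above rather than hard limits. It also assumes the "over 2" queue threshold is applied per disk.

```python
# Rule-of-thumb thresholds taken from the paragraph above.
MAX_QUEUE_PER_DISK = 2.0   # Avg. Disk Queue Length per hard disk
LATENCY_GOOD_SEC = 0.010   # under 10 ms is the target
LATENCY_BAD_SEC = 0.020    # over 20 ms may indicate a bottleneck
DISK_TIME_PCT = 60.0       # % Disk Time level worth investigating


def assess_disk(avg_queue_length, disk_count, sec_per_read, sec_per_write, pct_disk_time):
    """Return a list of findings for one set of sustained counter readings."""
    findings = []
    queue_per_disk = avg_queue_length / disk_count
    if queue_per_disk > MAX_QUEUE_PER_DISK:
        findings.append(f"queue length per disk is {queue_per_disk:.1f} (guideline: 2 or less)")
    for name, latency in (("read", sec_per_read), ("write", sec_per_write)):
        if latency > LATENCY_BAD_SEC:
            findings.append(f"{name} latency of {latency * 1000:.0f} ms is over 20 ms")
        elif latency >= LATENCY_GOOD_SEC:
            findings.append(f"{name} latency of {latency * 1000:.0f} ms is not under 10 ms")
    if queue_per_disk > MAX_QUEUE_PER_DISK and pct_disk_time >= DISK_TIME_PCT:
        findings.append("high queue length with % Disk Time at 60% or above: possible I/O bottleneck")
    return findings


# Hypothetical readings for an 8-disk RAID array.
for finding in assess_disk(20, 8, sec_per_read=0.025, sec_per_write=0.008, pct_disk_time=75.0):
    print(finding)
```

Because the guideline refers to extended periods, the inputs should be averages over a sustained window rather than a single sample.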
Below is a perfmon graph taken on a test machine. Avg. Disk Queue Length reaches 36 on a 2-disk RAID 1 configuration.
Using Process Explorer (http://technet.microsoft.com/en-us/sysinternals/bb896653.aspx) we are able to see which applications have the highest I/O reads and writes. The following screenshot shows over 9 million I/O reads and 260,000 I/O writes in a little over 4 hours of uptime for a DBServer application.
Using another program called FileMon (http://technet.microsoft.com/en-us/sysinternals/bb896642.aspx) we are able to see file system activity on the machine in real time, including which process accessed which file. The small screenshot shows a section of DBServer operations all within the same second. As it turns out, there were well over 300 such operations during a one-second interval, correlating to the spike that sent the Avg. Disk Queue Length to 36.
This particular situation was a stress test consisting of 12 users performing typical operations at the same time on a networked database server. Clearly, a 2-disk RAID 1 system (10K SAS) was not up to the task.