Disk Storage: How Much is that Doggie in the Window? The traditional approach of purchasing storage is broken. Comparing the purchase cost per terabyte of one storage array to another misses over 70% of the potential costs and leads to two classic storage management pitfalls. Managers who purchase new storage often focus too much of their attention on the upfront cost of buying disk and inherit a cost structure that requires ever-increasing expenditures on advanced functions, storage software, and support and maintenance. Perhaps even more at risk are those managers who elect to stay with their existing environment, unaware that a more cost-effective storage solution is available to them. This second predicament is exacerbated further by the fact that the financial savings from consolidation or migration do not show up in traditional storage expenditure categories such as disk drives, disk expansion units, and disk arrays.


Disk Storage Utilization Rates: The utilization rate is simply the percentage of usable storage that ends up being used for data. Older systems that do not have thin provisioning, for example, end up wasting space because storage is allocated in larger blocks than can be managed effectively. Also, to achieve the required performance, older systems were often either short-stroked (only the fast outer portion of each disk was used) or over-provisioned with extra capacity and drives. As a result, utilization tends to run as low as 40% in the case of NAS solutions and 50-55% for older block arrays. Utilization in today's most advanced systems can run as high as 82% for block and even as high as 85% for unified options, depending on performance needs.
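To see how utilization changes the economics, the purchase price per usable terabyte can be divided by the utilization rate to get a cost per terabyte that actually holds data. The sketch below is illustrative only: the function name and the $1,000 per usable TB price are assumptions, while the 40% and 82% rates are the figures quoted above.

```python
def cost_per_used_tb(price_per_usable_tb: float, utilization: float) -> float:
    """Cost of each terabyte that actually holds data.

    utilization is the fraction of usable capacity in use, in (0, 1].
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the interval (0, 1]")
    return price_per_usable_tb / utilization


if __name__ == "__main__":
    price = 1000.0  # hypothetical purchase price per usable TB
    for label, rate in [("older NAS", 0.40), ("modern block", 0.82)]:
        print(f"{label}: ${cost_per_used_tb(price, rate):,.2f} per used TB")
```

At the same sticker price, the array running at 40% utilization costs roughly twice as much per terabyte of stored data as the one running at 82%.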

For example, in the case of a database, physically positioning files on the outer edges of the disks is difficult if not impossible in modern RAID systems, and therefore short-stroking seems to be the common response. However, one has to take into account the operating system and the RAID selected. You must tune your disk architecture to support the expected IO profile and must tune the database system to take advantage of the disk architecture. For example, some databases have different IO characteristics depending on whether they are reading or writing data and what type of read or write is being done, while some databases have fixed read/write sizes. You must determine the IO profile for your database and then use that IO profile to determine the maximum and minimum IO size. The IO profile will tell you what percentage of IO is large IO and what percentage is small IO, and it will also give you the expected IO rate in IO/second (IOPS). Once you have the IO per second, you can determine the IO capacity (number of drives) needed to support your database and then you can determine a reasonable utilization rate.
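The sizing steps above (IO rate, then drive count, then utilization) can be sketched as follows. This is a minimal sketch under stated assumptions: the function name, the 180 IOPS per drive figure, and the capacity numbers are hypothetical inputs, not vendor data, and the calculation ignores RAID write penalties and cache effects.

```python
import math


def size_array(required_iops: float, iops_per_drive: float,
               data_tb: float, usable_tb_per_drive: float):
    """Return (drive_count, utilization) for a simple sizing exercise.

    The drive count is the larger of what the IO rate needs and what the
    data volume needs; utilization is data stored / usable capacity bought.
    """
    drives_for_iops = math.ceil(required_iops / iops_per_drive)
    drives_for_capacity = math.ceil(data_tb / usable_tb_per_drive)
    drives = max(drives_for_iops, drives_for_capacity)
    utilization = data_tb / (drives * usable_tb_per_drive)
    return drives, utilization


if __name__ == "__main__":
    # Hypothetical workload: 6,000 IOPS and 10 TB of data on 0.6 TB drives.
    drives, utilization = size_array(6000, 180, 10.0, 0.6)
    print(f"{drives} drives, {utilization:.0%} utilization")
```

When the IO rate rather than the data volume dictates the drive count, as in this example, the resulting utilization falls well below 100%, which is the effect described above for performance-driven purchases.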


What is the best RAID for Random Write Performance? For the same number of drives, RAID 1+0 random write performance is about one half that of RAID 0. This is consistent with the fact that RAID 1+0 requires two low-level drive writes for each high-level array write but does not require any extra reads or parity calculations by the controller. For a RAID 5 random write, in addition to writing the data, new parity must be calculated and written. This requires two disk reads (data and parity) and two disk writes (data and parity), or four drive operations in total. Therefore, RAID 5 random write performance can be as little as 25% of the rate of RAID 0 if full-stripe writes are not attained. A RAID 6 random write requires three disk reads and three disk writes (data plus two parity blocks), or six drive operations in total. Therefore, RAID 6 random write performance is approximately 1/6, or 16.7%, of that of RAID 0 if attention is not paid to selecting the optimal stripe unit size. It is important to note that while relative random write performance is affected more significantly by the RAID level than random read performance is, the write cache does help increase random write performance overall. This is best exemplified by RAID 0, which has no write penalty. A twelve-drive RAID 0 logical drive performs about 8,350 random writes per second while achieving only 4,850 random reads per second. This difference is primarily attributable to the benefits of the write cache.
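The write penalties described above can be turned into a small calculator. This is a sketch, not a performance model: the penalty table restates the drive-operation counts from the text, the function names are illustrative, and the mixed-workload formula is the common back-of-the-envelope estimate that ignores write cache and full-stripe writes.

```python
# Drive operations needed per host random write, as counted in the text.
WRITE_PENALTY = {"RAID 0": 1, "RAID 1+0": 2, "RAID 5": 4, "RAID 6": 6}


def relative_random_write(level: str) -> float:
    """Random write rate relative to RAID 0 on the same number of drives."""
    return 1.0 / WRITE_PENALTY[level]


def host_iops(raw_drive_iops: float, write_fraction: float, level: str) -> float:
    """Estimated host-visible IOPS for a mixed random workload.

    raw_drive_iops is the combined IOPS of all drives in the array;
    each read costs one drive operation, each write costs the penalty.
    """
    read_fraction = 1.0 - write_fraction
    return raw_drive_iops / (read_fraction + write_fraction * WRITE_PENALTY[level])


if __name__ == "__main__":
    for level in WRITE_PENALTY:
        print(f"{level}: {relative_random_write(level):.1%} of RAID 0 random writes")
```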

As with random read, random write performance with hard drives (spinning media) is limited by the performance of the drives. SSDs eliminate the drive latency issue, so random write performance increases when using SSDs. When comparing the SSD gains for random reads and random writes, the physical limitations of flash memory result in a smaller improvement for random writes. However, compared to hard drives, SSD write performance is still very high. In the case of sequential reads, the maximum throughput capability of a drive determines the performance upper limit, and therefore the read performance of an array tends to scale directly with the number of drives in the array.


Why is Read Cache hit ratio important? The default configuration on most controllers assigns 10% of the available cache space to read cache. Read cache can only increase performance if the requested data has previously been stored in the cache. Since the size of the disk array is many orders of magnitude larger than the size of the cache, the probability that a random read would already be in the cache is very small. For this reason, most controllers do not store random read data in the cache. Read cache is most effective in increasing performance for sequential small-block read workloads and, in particular, read workloads at low queue depths. The controller differentiates between sequential and random workloads and then uses read cache in a predictive capacity to pre-fetch data when it detects sequential workloads. It identifies the pattern of the read commands and then reads ahead on the drives. After reading the data, the controller puts that data into the cache, so it is available if the upcoming read commands call for it.

For writes, the default configuration in many cases reserves 90% of the available cache space. Through a process known as “posted writes” or “write-back caching,” the controller uses the write cache as an output buffer. Applications post write commands to the controller and then continue without waiting for completion of the write operation to the drive. The application sees the write as completed in a matter of microseconds instead of milliseconds. In high-workload environments, the write cache typically fills up and remains full most of the time. The controller writes the data to the drive as it works through the list of write commands in its write cache. It analyzes the pending write commands and then determines how to handle them most efficiently.
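The read-ahead behavior described above can be illustrated with a toy sequential-stream detector. This is a simplified sketch, not how any particular controller works: the class name, the threshold of three consecutive reads, and the eight-block prefetch size are all assumptions chosen for illustration.

```python
from typing import Optional, Tuple


class SequentialDetector:
    """Flags a read stream as sequential and proposes a read-ahead range."""

    def __init__(self, threshold: int = 3, prefetch_blocks: int = 8) -> None:
        self.threshold = threshold              # consecutive reads needed (assumed)
        self.prefetch_blocks = prefetch_blocks  # blocks to read ahead (assumed)
        self._next_expected: Optional[int] = None
        self._run_length = 0

    def on_read(self, lba: int, blocks: int) -> Optional[Tuple[int, int]]:
        """Record a read; return (start_lba, blocks) to prefetch, or None."""
        if lba == self._next_expected:
            self._run_length += 1   # continues the previous read
        else:
            self._run_length = 1    # random jump: start a new run
        self._next_expected = lba + blocks
        if self._run_length >= self.threshold:
            return (self._next_expected, self.prefetch_blocks)
        return None
```

A random read resets the run, so nothing is prefetched for it, which mirrors the point that random read data is generally not worth caching.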


What is Locality of Reference? Locality of reference (LOR) simply means that the average physical distance the drive head has to travel to service a read is reduced, and therefore the average read access time improves over what is commonly quoted by the drive manufacturer. There are two types of locality of reference. The first is spatial: data on the disk is closer together than one would expect if all data were randomly distributed across the disk. The second is temporal: data has been accessed recently and is therefore still available from cache. LOR can have a substantial impact on read service times from disk.
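The effect of spatial locality on head travel can be illustrated with a small simulation. This is a sketch under simplifying assumptions: positions are drawn uniformly within a band of the disk, the function name and the 20% band are hypothetical, and real seek time is not a linear function of distance.

```python
import random


def mean_seek_distance(span: float, samples: int = 100_000, seed: int = 0) -> float:
    """Average head travel between two random positions within `span`.

    span is the fraction of the full stroke that the data occupies (0 to 1).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        total += abs(rng.uniform(0.0, span) - rng.uniform(0.0, span))
    return total / samples


if __name__ == "__main__":
    print(f"full stroke: {mean_seek_distance(1.0):.3f} of the stroke")
    print(f"data in a 20% band: {mean_seek_distance(0.2):.3f} of the stroke")
```

For uniformly random positions the expected travel is one third of the span, so data confined to a narrow band of the disk needs proportionally less head movement per access.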


Why care about concurrency and stripe unit size? Concurrency and stripe unit size are two of the most important variables outside of the scope of the workload details. It is important to recognize that the optimal size of the stripe unit is directly related to the degree of concurrency. A great deal of research has been completed around this idea. Selecting the optimal stripe unit size based on concurrency is a key performance design factor. In general, the best rule is: stripe unit size = (concurrency slope coefficient) * (average disk positioning time) * (disk transfer rate) * (workload concurrency - 1) + 1 sector. This formula is based on the fairly simple notion that large striping units maximize the amount of data a disk transfers during each access, but this requires high concurrency in the workload to make use of all the disks. In contrast, small striping units can make use of all the disks even with low workload concurrency, but they cause the disks to transfer less data per access. Because RAID handles reads and writes differently, the optimal striping unit size for reads differs from the optimal striping unit size for writes. Therefore, in practical terms, the optimal striping unit size is generally 1/2 * (average disk positioning time) * (disk transfer rate).
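Both rules above are easy to evaluate numerically. The sketch below is illustrative only: the function names, the 512-byte sector, the slope coefficient of 0.25, and the drive figures (8 ms positioning time, 150 MB/s transfer rate) are assumptions chosen for the example, not values from the text.

```python
SECTOR_BYTES = 512  # assumed sector size


def stripe_unit_bytes(slope: float, positioning_time_s: float,
                      transfer_rate_bytes_per_s: float, concurrency: int,
                      sector_bytes: int = SECTOR_BYTES) -> float:
    """Concurrency-based rule:
    slope * positioning time * transfer rate * (concurrency - 1) + 1 sector.
    """
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    return (slope * positioning_time_s * transfer_rate_bytes_per_s
            * (concurrency - 1) + sector_bytes)


def stripe_unit_rule_of_thumb(positioning_time_s: float,
                              transfer_rate_bytes_per_s: float) -> float:
    """Practical rule: 1/2 * positioning time * transfer rate."""
    return 0.5 * positioning_time_s * transfer_rate_bytes_per_s


if __name__ == "__main__":
    t_pos, rate = 0.008, 150e6  # hypothetical drive: 8 ms, 150 MB/s
    for c in (1, 2, 5):
        kib = stripe_unit_bytes(0.25, t_pos, rate, c) / 1024
        print(f"concurrency {c}: {kib:,.1f} KiB")
    print(f"rule of thumb: {stripe_unit_rule_of_thumb(t_pos, rate) / 1024:,.1f} KiB")
```

With a concurrency of 1 the first rule collapses to a single sector, which matches the intuition that a lone request is served fastest when it is spread across every disk.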


Did you know that how you load your data impacts its sequentiality? The demand for, as well as the type of, disk I/O is naturally a key factor in estimating the performance a disk subsystem will be able to provide. One way to come up with a preliminary estimate of your likely disk I/O is to have a simple model for each of the different possible workloads. A Business Intelligence (BI) workload is a highly sequential write and read environment. However, if you are using a conventional array, sequential writing of the data can be ensured only when the final load partitioning runs on a single core, because concurrent requests for new database pages cause the data to be placed randomly. Therefore, the sequentiality of writes, and ultimately of reads, will depend on how the data is loaded.


OLTP and storage: Disk I/O requirements in an OLTP environment are largely a function of log writes, dirty page write-backs, and I/Os due to cache misses. However, for a new OLTP disk sizing exercise, this kind of data is typically not available. In addition, designing storage for a new OLTP application involves a number of additional key items that ideally should be taken into account. For example, the trend towards large database page sizes such as 16 KB is not optimal for exploiting Flash technology, whereas an 8 KB block size, and in some cases a 4 KB block size, will deliver substantially better performance. A smaller block size also places less demand on buffer sizes, and this space can be better used to improve performance by buffering more data or by allowing maintenance procedures, such as index rebuilding, to take place. A smaller block size also contributes to a reduction in the database catalog. Finally, CPU demands can be more easily addressed by a multi-core processor design, since this architecture is better suited to delivering low response times for small-block operations.

A TPC-C study completed by Gottstein, Petrov and Ivanov et al. showed a 20% performance improvement when using a 4 KB page size compared to 16 KB. Because Flash provides higher random performance and lower latency, one obtains better CPU utilization. This CPU utilization improvement can yield very significant software savings. Finally, IO parallelism and command queuing now allow several IO requests to be executed in parallel, which means OLTP databases can successfully use asynchronous page.

While noting all of these factors, historically TPC-C has been the gold standard for modeling OLTP workloads. More recently, TPC-E has been introduced to move towards an arguably more realistic and sequential workload (accounting for greater data skews). Interestingly, because TPC-E can have as much memory as 8% of the database, this ends up being large enough to capture the skewed accesses when the data are cached in the main-memory buffer pool. As a result, the I/O pattern of TPC-E appears to the disk to be just as random as that of TPC-C. This means that any performance study done with TPC-C is just as valid for TPC-E. Therefore, whether you use TPC-C or TPC-E data for this disk performance analysis exercise is irrelevant.