A formula for the optimal storage mix

Having done your homework, you may well know that a hybrid solution can offer the best price/performance. The question then becomes: what percentage of your data storage capacity should be flash, and what percentage should be traditional disk? In terms of IOPS density, the following ratios can be used when empirical data is not available: 69% of IOPS tend to occur on the hottest 20% of storage capacity, 20% occur on the next 20%, and 11% occur on the remaining 60%. As a result, placing 40% of your data on flash suggests you will capture about 89% of your IOPS.
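As a quick sanity check, here is a minimal Python sketch of that calculation. The bands come straight from the ratios above; interpolating linearly within a band, for flash fractions that do not land exactly on a band boundary, is an assumption.

```python
# IOPS skew bands from the text, hottest capacity first:
# (fraction of capacity, fraction of IOPS served from that capacity)
IOPS_BANDS = [(0.20, 0.69), (0.20, 0.20), (0.60, 0.11)]

def iops_coverage(flash_fraction):
    """Estimate the fraction of IOPS served from flash for a given
    flash share of capacity (linear interpolation within a band is
    an assumption, not part of the published ratios)."""
    covered, remaining = 0.0, flash_fraction
    for capacity_share, iops_share in IOPS_BANDS:
        if remaining <= 0:
            break
        used = min(remaining, capacity_share)
        covered += iops_share * (used / capacity_share)
        remaining -= used
    return covered

print(f"{iops_coverage(0.40):.0%} of IOPS land on 40% flash")  # -> 89%
```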


Genome sequencing needs a hybrid data storage solution

The science of genome sequencing is evolving much faster than the refresh cycle of data center equipment. As a result, it is incumbent on IT to purchase equipment with sufficient flexibility to anticipate new sequencing demands. What exactly are the factors that drive those demands in terms of compute and data storage resources?

Genome sequencing usually means creating an enormous database of reads and then mapping it against the variant in question. This requires a sort step within the read mapping. While this process can arguably be done more efficiently in some cases by using third-party software, it is still a very compute-intensive undertaking. For example, a recent study of 440 patients required 300,000 core hours of compute time.

Currently, each genome read mapping uses approximately 500 core hours of processing. Given that the cost of a core hour is between $0.05 and $0.12, add in the variant calling cost of about 200 core hours per genome, and one can quickly get an idea of the compute resources required: roughly 700 core hours, or $35 to $84, per genome. As for storage performance, a reasonable heuristic is 10 sustained IOPS per study participant and at least 1 GB/s of sustained reads. However, the latency requirements are much more modest than the required data transfer rate would suggest. Therefore, genome sequencing is arguably an optimal workload for a hybrid data storage solution.
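To make that concrete, here is a small sketch that multiplies the per-genome core hours out against the quoted price range; the 440-genome cohort size is simply borrowed from the study mentioned above as an illustration.

```python
# Per-genome compute figures from the text.
CORE_HOURS_MAPPING = 500           # read mapping
CORE_HOURS_VARIANT_CALLING = 200   # variant calling
COST_PER_CORE_HOUR = (0.05, 0.12)  # USD, low and high estimates

def genome_compute_cost(genomes):
    """Return (core_hours, low_usd, high_usd) for a cohort of genomes."""
    core_hours = genomes * (CORE_HOURS_MAPPING + CORE_HOURS_VARIANT_CALLING)
    low, high = COST_PER_CORE_HOUR
    return core_hours, core_hours * low, core_hours * high

hours, low, high = genome_compute_cost(440)  # illustrative cohort size
print(f"440 genomes: {hours:,} core hours, roughly ${low:,.0f} to ${high:,.0f}")
```

Note that 440 genomes works out to about 308,000 core hours, which lines up reasonably well with the 300,000 core hours reported for the 440-patient study.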

There are two distinct methods of genome sequencing: whole genome sequencing and exome sequencing. Whole genome sequencing, as the name implies, covers the entire genome, while exome sequencing covers only the exonic regions, which typically amount to 1-2% of the whole genome. As a result, the amount of data storage and disk IOPS required will depend on the sequencing approach (whole or exome), the number of base reads, and the read length.

In the case of exome sequencing, the file size can be derived from a coverage factor of 40x, 110,000,000 reads, and a read length of 75. This will give you a 5.7 GB BAM file. Add in metadata for future analysis and visualization, and the expanded file will be almost 7 GB.

In the case of whole genome sequencing, if we use fairly standard measures such as coverage of 37.7x, a read length of 115, and 975,000,000 reads, the BAM file would be 82 GB. Adding metadata, you can expect the file size to grow to about 105 GB.
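If you want to plug in your own read counts, the sketch below approximates BAM size from the number of reads and the read length. The bytes-per-base compression factor and the metadata multiplier are assumptions back-fitted to the figures above, not published constants, so treat the output as a rough estimate only.

```python
BYTES_PER_BASE = 0.7    # assumed compressed BAM footprint per base
METADATA_FACTOR = 1.25  # assumed growth from analysis/visualization metadata

def bam_size_gb(reads, read_length, with_metadata=False):
    """Rough BAM file size in GB for a given read count and read length."""
    size = reads * read_length * BYTES_PER_BASE / 1e9
    return size * METADATA_FACTOR if with_metadata else size

print(f"exome:        {bam_size_gb(110_000_000, 75):.1f} GB")   # text quotes 5.7 GB
print(f"whole genome: {bam_size_gb(975_000_000, 115):.1f} GB")  # text quotes 82 GB
```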

Processing resource requirements, in combination with disk I/O, are slated to increase as HiSeq X Ten analysis becomes increasingly common. For rough estimating purposes, each read mapping requires about 500 core hours, while each variant calling needs about 200 core hours. These numbers will trend upward fairly dramatically if HiSeq X Ten is fully exploited.

Finally, one of the constraining factors in designing a storage solution for genome sequencing is not only the high IOPS requirement, but also the need to move data blocks of varying sizes at transfer rates of 1 GB/s or more.


Oracle and flash: why?

One of the more interesting questions concerning the use of flash storage is its value for Oracle redo logs. It turns out that the answer is not as straightforward as we might like. Instead, it depends on the characteristics of the workload under consideration and the design of your selected flash system.

For example, a flash product may impose a performance penalty on writes that are not aligned on 4K boundaries. If this is the case, the penalties for misaligned random writes are much greater than for misaligned sequential writes of the same size. Furthermore, as write sizes increase, the misalignment penalties (in percentage terms) decrease dramatically, especially in the sequential case. Since all database writes in Oracle are aligned on 4K boundaries (as long as the default block size is at least 4K), using flash for database tablespaces should never result in slower performance due to misaligned writes. On the other hand, redo writes are only guaranteed to fall on 512-byte boundaries. To complicate things a little further, redo writes also have a quasi-random access pattern when async I/O is employed. These two properties together will contribute to performance degradation for some workloads.
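A small sketch makes the alignment point concrete. The 4K page boundary and 512-byte sector constants below describe a hypothetical flash device with the penalty behavior described above, not any specific product.

```python
FLASH_PAGE = 4096   # assumed alignment boundary of the hypothetical flash device
REDO_SECTOR = 512   # minimum alignment Oracle guarantees for redo writes

def classify_write(offset, length):
    """Classify a write by alignment to show where penalties may appear."""
    if offset % FLASH_PAGE == 0 and length % FLASH_PAGE == 0:
        return "4K-aligned: no misalignment penalty expected"
    if offset % REDO_SECTOR == 0 and length % REDO_SECTOR == 0:
        return "512-byte aligned only: possible penalty on this device"
    return "unaligned"

print(classify_write(offset=32768, length=8192))  # e.g. an 8K data block write
print(classify_write(offset=1536, length=2560))   # e.g. a small redo write
```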


When deduplication grows your data storage capacity

Reading a deduplicated file out of storage causes it to be reconstituted, so any secondary copy will be larger than the deduplicated primary. If a file is snapshotted before deduplication, the snapshot will preserve the entire, undeduplicated file. This means that while deduplication may shrink the storage space consumed by primary files, the capacity required for snapshots may expand dramatically.
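A back-of-the-envelope example shows how quickly this can add up. Every number below is an illustrative assumption, not a figure from a real deployment.

```python
primary_tb = 100     # logical primary data (assumed)
dedup_ratio = 4.0    # 4:1 reduction on primary data (assumed)
full_snapshots = 3   # snapshots taken of the pre-deduplication file set (assumed)

deduped_primary_tb = primary_tb / dedup_ratio
snapshot_tb = full_snapshots * primary_tb  # reconstituted, full-size copies

print(f"primary after dedup: {deduped_primary_tb:.0f} TB")
print(f"snapshot capacity:   {snapshot_tb:.0f} TB")
print(f"total:               {deduped_primary_tb + snapshot_tb:.0f} TB "
      f"for {primary_tb} TB of logical primary data")
```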

Scaling can be a bit of a challenge for systems that use deduplication, because ideally the scope of deduplication needs to be shared across storage devices. If you have multiple backup devices in your infrastructure, each performing discrete deduplication, then space efficiency may be adversely affected.


Do you use SAS for analytics?

In the case of an analytic workload, setting up file systems to support the typical amount of data, as well as providing sufficient disk space for temporary files, can be very challenging. So can ensuring there is sufficient sustained I/O bandwidth to complete jobs within the time allotted. In general, file systems striped across many smaller disks work more effectively than a few larger disks, so the focus shifts away from IOPS toward sustaining a high level of throughput. Generally speaking, analytics tends to be highly sequential; however, when each user has their own session and more than one session tries to access the same data file, the workload may appear to the storage system to have a random component. If you are using SAS, your performance may also be limited by the file system cache: maximum file system cache throughput is about 2 GB/s on UNIX and Linux and 1.4 GB/s on a 64-bit Windows machine.
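As a rough planning aid, the sketch below estimates how many striped spindles a target sustained throughput implies, and flags the file system cache ceilings quoted above. The per-disk throughput and striping efficiency are assumptions you would replace with your own measurements.

```python
import math

PER_DISK_MBPS = 150        # assumed sequential throughput per spindle
STRIPE_EFFICIENCY = 0.70   # assumed loss to contention and striping overhead
FS_CACHE_CEILING_MBPS = {"unix_linux": 2000, "windows_64bit": 1400}  # from the text

def disks_needed(target_mbps):
    """Spindles required to sustain target_mbps across a striped file system."""
    return math.ceil(target_mbps / (PER_DISK_MBPS * STRIPE_EFFICIENCY))

target = 1600  # e.g. 1.6 GB/s of sustained reads
print(f"{disks_needed(target)} disks to sustain {target} MB/s")
print(f"file system cache ceiling: {FS_CACHE_CEILING_MBPS['unix_linux']} MB/s on UNIX/Linux")
```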