A collection of informal posts created by the minds at CIOview
Having done your homework, you may well know that a hybrid solution in many cases offers the best price/performance option. The question then becomes: what percentage of your data storage capacity should be Flash, and what percentage traditional disk? In terms of an IOPS density function, the following ratios can be used when empirical data is not available: 69% of IOPS tend to occur on the first (hottest) 20% of storage, 20% occur on the next 20%, and 11% of IOPS relate to the last 60% of the storage capacity. As a result, placing 40% of your data on Flash suggests you will capture 89% of your IOPS.
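The tiered ratios above can be turned into a quick estimator. This is a minimal sketch using only the heuristic figures from the text; the assumption that IOPS are spread uniformly within each tier is mine, added to handle flash fractions that fall partway through a tier.

```python
# Cumulative IOPS captured as a function of the fraction of capacity
# placed on flash, using the heuristic ratios from the text:
# the hottest 20% of capacity serves 69% of IOPS, the next 20%
# serves 20%, and the remaining 60% serves the last 11%.

# (tier capacity fraction, IOPS fraction served by that tier)
TIERS = [(0.20, 0.69), (0.20, 0.20), (0.60, 0.11)]

def iops_captured(flash_fraction):
    """Estimate the fraction of total IOPS landing on flash when the
    hottest `flash_fraction` of capacity is placed on it."""
    captured = 0.0
    remaining = flash_fraction
    for cap, iops in TIERS:
        take = min(remaining, cap)
        # Assumption: IOPS are uniformly distributed within a tier.
        captured += iops * (take / cap)
        remaining -= take
        if remaining <= 0:
            break
    return captured

print(round(iops_captured(0.40), 2))  # 0.89 -- matches the 89% figure above
```

Running it with a 40% flash split reproduces the 89% figure quoted above; other splits can be explored the same way.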
The basic challenge is that the science of genome sequencing is evolving much faster than the refresh cycle of data center equipment. As a result, it is incumbent on IT to purchase equipment with sufficient flexibility to anticipate new sequencing demands. What exactly are the factors that drive those demands in terms of compute and data storage resources?
Genome sequencing usually means creating an enormous database and then mapping it against the variant in question. This requires a sort step of the read mapping output. While this process can arguably be done more efficiently in some cases by using third-party software, it is still a very compute-intensive undertaking. For example, a recent study of 440 patients required 300,000 core hours of computer time.
Currently, each genome read mapping uses approximately 500 core hours of processing, and the cost of a core hour is between $0.05 and $0.12. Add in the variant-calling cost of about 200 core hours per genome, and one can quickly get an idea of the compute resources required. As far as storage performance, a reasonable heuristic is 10 sustained IOPS per study participant and at least 1 GB/s of sustained reads. However, the latency requirements are much more modest than the required data transfer rate would suggest. Genome sequencing is therefore arguably an optimal workload for a hybrid data storage solution.
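As a sanity check, the per-genome figures above can be rolled up for a whole cohort; notice that 440 participants at roughly 700 core hours each lands very close to the ~300,000 core hours reported for the 440-patient study. The sketch below uses only the numbers from the text; the helper name and structure are illustrative.

```python
# Rough compute-cost roll-up per the figures above: ~500 core hours
# per read mapping plus ~200 per variant calling, at $0.05 to $0.12
# per core hour.

MAPPING_HOURS = 500   # core hours per genome read mapping
CALLING_HOURS = 200   # core hours per genome variant calling

def study_compute(participants, rate_low=0.05, rate_high=0.12):
    """Total core hours and a cost range for a sequencing study."""
    hours = participants * (MAPPING_HOURS + CALLING_HOURS)
    return hours, hours * rate_low, hours * rate_high

hours, low, high = study_compute(440)
print(hours)                        # 308000 -- close to the ~300,000 reported
print(round(low), round(high))      # roughly 15400 to 36960 dollars
```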
There are two distinct methods of genome sequencing: whole genome sequencing and exome sequencing. Whole, as the name implies, sequences the entire human genome, including its non-coding regions, while exome sequencing covers only the exonic regions, which typically amount to 1-2% of the whole genome. As a result, the amount of data storage and disk IOPS required will depend on the sequencing approach (whole or exome), the number of base reads, and the read length.
In the case of exome sequencing, the file size can be derived from a coverage factor of 40x, 110,000,000 reads, and a read length of 75. This yields a 5.7 GB BAM file. Add metadata for future analysis and visualization, and the expanded file will be almost 7 GB.
In the case of whole genome sequencing, using fairly standard measures (coverage of 37.7x, a read length of 115, and 975,000,000 reads), the BAM file would be 82 GB. Adding metadata, you can expect the BAM file size to be about 105 GB.
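The two examples above imply a simple sizing rule: reads times read length gives total bases, and the BAM files come out at roughly 0.7 bytes per base. That factor, and the ~25% metadata uplift, are assumptions fitted to the two examples in the text, not intrinsic BAM constants; real compression ratios vary with the data.

```python
# Back-of-envelope BAM sizing from read count and read length.
# BYTES_PER_BASE (~0.7) and METADATA_FACTOR (~1.25) are assumptions
# inferred from the exome and whole-genome examples above.

BYTES_PER_BASE = 0.7
METADATA_FACTOR = 1.25

def bam_size_gb(reads, read_length, with_metadata=False):
    """Estimate BAM file size in GB for a given read count and length."""
    gb = reads * read_length * BYTES_PER_BASE / 1e9
    return gb * METADATA_FACTOR if with_metadata else gb

print(round(bam_size_gb(110_000_000, 75), 1))   # exome: ~5.8 GB (text: 5.7 GB)
print(round(bam_size_gb(975_000_000, 115), 1))  # whole genome: ~78.5 GB (text: 82 GB)
```

The estimates land within a few percent of the quoted figures, which is about as much precision as a capacity-planning heuristic warrants.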
Processing resource requirements, in combination with disk I/O, are slated to increase as HiSeq X Ten analysis becomes increasingly common. For rough estimating purposes, each read mapping requires about 500 core hours while each variant calling needs 200 core hours. These numbers will trend upward fairly dramatically if HiSeq X Ten is fully exploited.
Finally, one of the constraining factors in designing a storage solution for genome sequencing is not only the high IOPS requirement, but also the ability to move data blocks of varying sizes at transfer rates of 1 GB/s or more.
One of the more interesting questions concerning the use of flash storage is its value for Oracle redo logs. It turns out that the answer is not as straightforward as we might like. Instead, it depends upon the characteristics of the workload under consideration and the design of your selected flash system.

For example, a flash product may have an inherent feature whereby there is a performance penalty for writes that are not aligned on 4K boundaries. If this is the case, then the penalties for misaligned random writes are much greater than for misaligned sequential writes of the same size. Second, as write sizes increase, the misalignment penalties (in percentage terms) decrease dramatically, especially in the sequential case.

Since all database writes in Oracle are aligned on 4K boundaries (as long as the default block size is at least 4K), using flash for database tablespaces should never result in slower performance due to misaligned writes. Redo writes, on the other hand, are only guaranteed to fall on 512-byte boundaries. To complicate things a little further, redo writes also have a quasi-random access pattern when async I/O is employed. These two properties together will contribute to performance degradation for some workloads.
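The alignment distinction above can be made concrete with a small check. This is a sketch, not an Oracle internal: the 4K flash boundary is the assumed penalty threshold from the example, tablespace writes are taken to be block-size aligned (4K or larger), and redo writes only 512-byte aligned, as the text states.

```python
# Illustrates why tablespace writes avoid the misalignment penalty
# while redo writes may not: only writes whose offset and size are
# both multiples of the flash boundary (assumed 4K) are "safe".

FLASH_PAGE = 4096  # assumed internal flash alignment from the example above

def misaligned(offset_bytes, size_bytes, boundary=FLASH_PAGE):
    """True if a write does not start and end on the flash boundary."""
    return offset_bytes % boundary != 0 or size_bytes % boundary != 0

# A tablespace write: 4K-aligned offset and size -> no penalty.
print(misaligned(8 * 4096, 8192))   # False
# A redo write: 512-byte aligned but not 4K-aligned -> possible penalty.
print(misaligned(3 * 512, 1024))    # True
```

Any 4K-aligned write is automatically 512-byte aligned, but not vice versa, which is exactly why redo logs are the problem case here.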