Genomic data storage – the dilemma between FASTQ and BAM: which format is more likely to preserve the long-term integrity of data?

by Guillaume Rizk, PhD, CTO of Enancio

posted on July 10, 2019

The volume of genomic sequencing data is increasing at an incredible pace, thanks to the affordability of high-throughput DNA sequencing and the many promises the field holds for human health. Whole genome analysis (WGA) is moving out of research labs and into hospitals, paving the way for precision medicine for everyone. In the near future, genomics could very well dwarf other traditional big data domains, such as astronomy or YouTube videos, according to a study by Stephens et al. [1], if only because of the sheer size of the files.

However, the move into diagnostics also comes with rigorous constraints on data management. Medicine demands reliable and reproducible analysis. It may be necessary to re-run an analysis in order to double-check diagnosis results or in the event of litigation. Access to the original, unaltered data is therefore crucial. Moreover, the fast evolution of bioinformatics tools might also prompt repeat analyses that lead to new discoveries. Finally, deep learning methods, which could lead to significant breakthroughs, are only unlocked by huge amounts of data. All these facts, among many others, lead to the same conclusion: genomic data are highly valuable and should be stored for the long term.

This raises the question of the most suitable format for long-term storage. As WGA data comes in different formats, this is not a straightforward issue. The data coming from the sequencer is in FASTQ format; after the first analysis step it is turned into BAM format¹, which additionally contains mapping information. Finally, variant discovery tools produce VCF files containing only the list of variants. As a side note, and to make things a little more confusing, the FASTQ data may be stored as an unmapped BAM file (uBAM), in which all reads are unmapped. A uBAM is more like a FASTQ than a BAM: it is nothing more than FASTQ data plus some metadata.

Storing only VCF files has the appeal of a much smaller data footprint and still allows for deep learning studies, but it does not make it possible to run mapping or variant calling again. A lot of information would definitely be lost.

Then comes the choice between BAM and FASTQ. A priori, BAM contains everything in the FASTQ, plus the mapping information. Storing BAM makes it easier to run many different variant callers without re-running the time-consuming alignment step. Moreover, some tools can regenerate FASTQ from BAM if it is ever needed. So it seems a reasonable choice.

BAM files are not 100% lossless

However, if we delve into the details, this BAM to FASTQ conversion is not completely lossless. The FASTQ file regenerated from the BAM will not be byte-identical to the original FASTQ file that was used to generate the BAM. Some differences are subtle, and some are significant enough to compromise the repeatability of the analysis. Even subtle differences may be a huge issue in a strict environment where data integrity checks are conducted with checksums.
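To illustrate why even a cosmetic difference matters for checksum-based integrity checks, here is a minimal Python sketch. The two FASTQ records, their read names and the helper `sha256_hex` are invented for this example; the only point is that a file-level digest changes as soon as a single byte differs, even when sequences and qualities are untouched.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a byte string as a hex string."""
    return hashlib.sha256(data).hexdigest()


# Two hypothetical FASTQ records (names, sequences and qualities are made up).
original = (
    b"@read1_with_a_long_suffix\nACGTACGT\n+\nIIIIHHHH\n"
    b"@read2_with_a_long_suffix\nTTGCAAGC\n+\nIIIIFFFF\n"
)

# Same sequences and qualities, but the read names were shortened,
# as could happen on a round trip through another format.
regenerated = (
    b"@read1\nACGTACGT\n+\nIIIIHHHH\n"
    b"@read2\nTTGCAAGC\n+\nIIIIFFFF\n"
)

digest_original = sha256_hex(original)
digest_regenerated = sha256_hex(regenerated)

# The digests differ although no base or quality value changed.
checksums_match = digest_original == digest_regenerated
print("checksums match:", checksums_match)
```

A plain checksum cannot say which kind of difference occurred, only that the two files are not byte-identical.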

Here is a list of issues when converting BAM to FASTQ:

- Read names. The read names may be truncated, since the SAM specification defines the name by the regexp [!-?A-~]{1,254} in its current versions (older versions allowed up to 255 characters), whereas read names in FASTQ files may come in any length and format. Although this will not impact variant calling results, it will produce a different checksum, making it difficult to tell whether the difference is due only to read names or to something more important.

- Hard clipping. For hard-clipped alignments (clipping occurs in local alignment, i.e. when some part of the read is not aligned to the genome and does not contribute to the alignment score), the hard-clipped part of the read and its quality scores are simply deleted and not recorded in the BAM file. Some mappers (such as BWA-MEM) play nicely and only hard clip non-primary (secondary or supplementary) alignments, meaning the full read will always be somewhere in the BAM file. With other software, however, hard clipping may incur a data loss.

- Unmapped reads. Reads that cannot be mapped to the genome may be omitted from the BAM file, incurring an obvious data loss. Again, some mappers are nice and do store the unmapped reads in the BAM file. But other mappers or configurations drop them, and this does lead to data loss.

- Read ordering. This is probably the biggest issue, and the least obvious one. After the mapping phase, reads are sorted according to their alignment position on the genome. This is required for variant calling analysis, and also has the side effect of improving gzip compression of the DNA sequences. Thus, when converting back to FASTQ, the read order will be the one present in the BAM file, not the original one. Keeping the original read order may seem irrelevant, since the read order coming out of the sequencer is completely arbitrary anyway. However, changing the read order leads to several issues. First, it makes it impossible to check data integrity by means of common checksums. More importantly, it breaks repeatability, because many tools have heuristics that make their results dependent on the read order. A study by Firtina and Alkan [2] specifically addressed this question. They performed the same analysis twice, the first time with the original FASTQ and the second time after shuffling the reads. They measured the effect on the number of SNVs and indels detected and found that the difference is typically around 1-2% and can go as high as 25% in the most serious cases.
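Two of the issues listed above, hard clipping and missing reads, can be spotted by inspecting the alignment records themselves. The following Python sketch works on SAM text lines and is only an illustration: the function `audit_sam`, the sample records and the read names are hypothetical, and it assumes single-end reads so that one read name corresponds to one read. The flag bits (0x4 unmapped, 0x100 secondary, 0x800 supplementary) and the `H` CIGAR operation come from the SAM specification.

```python
import re

# Flag bits defined by the SAM specification.
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100
FLAG_SUPPLEMENTARY = 0x800

# A hard clip appears in the CIGAR string as "<length>H".
HARD_CLIP = re.compile(r"\d+H")


def audit_sam(lines, fastq_names):
    """Report reads whose full sequence may not be recoverable.

    Assumes single-end reads (one primary record per read name).
    """
    seen, hard_clipped_primary, unmapped_kept = set(), set(), set()
    for line in lines:
        if line.startswith("@"):  # header line, not an alignment record
            continue
        fields = line.rstrip("\n").split("\t")
        qname, flag, cigar = fields[0], int(fields[1]), fields[5]
        if flag & (FLAG_SECONDARY | FLAG_SUPPLEMENTARY):
            continue  # only the primary record is expected to hold the full read
        seen.add(qname)
        if flag & FLAG_UNMAPPED:
            unmapped_kept.add(qname)
        if HARD_CLIP.search(cigar):
            hard_clipped_primary.add(qname)
    return {
        "missing": sorted(set(fastq_names) - seen),
        "hard_clipped_primary": sorted(hard_clipped_primary),
        "unmapped_kept": sorted(unmapped_kept),
    }


# Hypothetical records, invented for this example.
sam_lines = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIHHHH",
    "r2\t0\tchr1\t200\t60\t3H5M\t*\t0\t0\tCAAGC\tIFFFF",
    "r3\t0\tchr1\t300\t60\t5M3S\t*\t0\t0\tGGGTTTAA\tIIIIIIII",
    "r3\t2048\tchr2\t50\t60\t5H3M\t*\t0\t0\tTAA\tIII",
    "r5\t4\t*\t0\t0\t*\t*\t0\t0\tNNNNACGT\t!!!!IIII",
]
fastq_names = ["r1", "r2", "r3", "r4", "r5"]

report = audit_sam(sam_lines, fastq_names)
print(report)
```

In this made-up example, r2 has lost three bases because its primary record is hard clipped, r4 was dropped entirely, r3 is safe because only its supplementary record is hard clipped, and r5 was kept despite being unmapped.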

In a nutshell, although you can regenerate a FASTQ file from the corresponding BAM file, you have no guarantee that you will be able to get back to the exact same original FASTQ, and you take the risk of losing valuable information in the process. If you have full control over the pipeline that generates the BAM, then things are a little better, but you still cannot avoid the issue of a different read order compromising repeatability.
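The difference between "same content" and "same file" can be made concrete with two kinds of digest. The sketch below is an illustration only and is not the algorithm of any particular tool; the function names and the three sample records are invented. An order-sensitive digest changes as soon as reads are reordered, while an order-independent digest (here, the sum of per-record hashes modulo 2^256) only tells you that the same set of reads is present, not that the original order was kept.

```python
import hashlib


def parse_fastq(text):
    """Split FASTQ text into (name, sequence, quality) tuples.

    Assumes simple four-line records with no line wrapping.
    """
    lines = text.strip().split("\n")
    records = []
    for i in range(0, len(lines), 4):
        name, seq, _plus, qual = lines[i:i + 4]
        records.append((name[1:], seq, qual))  # drop the leading "@"
    return records


def record_digest(record):
    """SHA-256 of a single record."""
    return hashlib.sha256("\n".join(record).encode()).digest()


def ordered_digest(records):
    """Digest that depends on both the content and the order of the records."""
    h = hashlib.sha256()
    for record in records:
        h.update(record_digest(record))
    return h.hexdigest()


def unordered_digest(records):
    """Sum of per-record digests modulo 2**256: unchanged by any reordering."""
    total = 0
    for record in records:
        total = (total + int.from_bytes(record_digest(record), "big")) % (1 << 256)
    return f"{total:064x}"


# Hypothetical reads, invented for this example.
original = parse_fastq(
    "@r1\nACGTACGT\n+\nIIIIHHHH\n"
    "@r2\nTTGCAAGC\n+\nIIIIFFFF\n"
    "@r3\nGGGTTTAA\n+\nIIIIIIII\n"
)
reordered = [original[2], original[0], original[1]]  # same reads, new order
truncated = original[:2]                             # one read lost

print(ordered_digest(original) == ordered_digest(reordered))      # False
print(unordered_digest(original) == unordered_digest(reordered))  # True
print(unordered_digest(original) == unordered_digest(truncated))  # False
```

An order-independent check is useful to confirm that no read was lost, but it cannot restore or verify the original read order, which is the part that affects repeatability.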

Make no compromise between cost and 100% lossless

The 100% fail-safe choice is therefore to store FASTQ files, but this comes at a cost, since the BAM format results in smaller files. The solution to avoid a compromise between safety and cost of storage is to use Lena, Enancio’s compression software. It is designed so that you can archive FASTQ files with a 75% lower footprint than the corresponding BAM file, and it is fully lossless. A free trial is available here.

1. We use BAM to designate the SAM/BAM/CRAM family of formats.

References and additional links:

  • [1] Stephens, Z. D., Lee, S. Y., Faghri, F., Campbell, R. H., Zhai, C., Efron, M. J., ... & Robinson, G. E. (2015). Big data: astronomical or genomical?. PLoS biology, 13(7), e1002195.
  • [2] Firtina, C., & Alkan, C. (2016). On genomic repeats and reproducibility. Bioinformatics, 32(15), 2243-2247.
  • BamHash software, designed to check if the content of a BAM file is consistent with the FASTQ file. However, by design read order is not taken into account.
    Óskarsdóttir, A., Másson, G., & Melsted, P. (2015). BamHash: a checksum program for verifying the integrity of sequence data. Bioinformatics, 32(1), 140-141.
  • An interesting biostar discussion on the subject: