2012-02-14

Reference based SAM/BAM compression

In some respects the SAM/BAM specification is quite loose, in that there is more than one way to represent a given piece of information. We can take advantage of this to reduce the size on disk of mapped reads which match the reference sequence, while still maintaining conformance within the spec. I've written a SAM/BAM reference based compression script in Python - put this in your pipeline and smoke it!

As an aside, you might also be interested in The Pistoia Alliance Sequence Squeeze Competition which is also looking at sequence compression.

What about CRAM?

The CRAM ToolKit aims to provide a space efficient alternative to BAM using reference-based sequence compression, and optional lossy storage of the sequence base qualities. Please see Ewan Birney's May 2011 introductionengineering and future trio of blog posts as well as the open access paper Fritz et al. 2011.

Just prior to the release of CRAM v0.6, I attended a NextGenBUG meeting where Guy Cochrane and Vadim Zalunin from the EBI gave talk about CRAM. Vadim explained that they are currently looking at transparent integration into Picard (the Java library for samtools), but that support in the samtools C library was a lower priority. Also round-trip conversion of BAM to CRAM and back isn't always possible yet, with tags a particular problem.

My impression was that CRAM is still very much in its infancy, and widespread adoption is still a long way off. I may be biased because I don't directly use Java and Picard - while the samtools C API has wrappers for most mainstream languages. However, the more advanced work they are doing with CRAM should pay dividends down the line.

Size optimizing SAM/BAM

While CRAM is looking very promising, I believe there is mileage in encouraging adoption of smaller changes (which are within the SAM/BAM format) right away.

The first low hanging fruit is the read's sequence, the storage for which can be reduced considerably for mapped reads while still adhering to the SAM/BAM file format. That's what I'm looking at in this blog post.

Next there is the read qualities - and if you are willing to give up information here, again size optimization is possible within BAM. Trivially the quality scores can be omitted, for example for perfectly mapping reads. Also, a reduced quality range ("vertical compression") is possible, for instance round down PHRED scores to give just 0, 10, 20, 30, 40 etc. That doesn't save any space immediately, but should make the data easier for BGZF/GZIP to compress.

When we come to the tags, which are often a large proportion of the data held in SAM/BAM files, there are less easy choices. I think read groups in particular are a special case where perhaps it might pay off to change the SAM/BAM format itself - related to this indexing by read group seems useful.

But back to practicalities - making SAM/BAM files smaller by reference based compression of the read sequences...

Plan A: Omit sequence of near-perfect reads

A read's sequence can be omitted entirely in BAM, or replaced by an asterisk placeholder '*' in SAM. Why might you do that? Well if the read matched perfectly, you wouldn't need to store the read sequence. How could you record the fact it matched perfectly? Using the CIGAR string with the equals ('=') operator.

In fact, you could do this for near-perfect reads where the CIGAR string (after any 'M' operators are clarified into 'X' or '=') contains 'D', 'N', 'H', 'P' or '=' only. In such case the read's sequence is redundant, whereas any CIGAR 'S', 'I', or 'X' operators requires some of the read sequence.

That looks good in principle, but there are some significant downsides. Most importantly, it would require added complexity in SAM/BAM reading code to know that if the SEQ is missing that it might be implicit via the CIGAR string. That could be quite a hurdle.

Additionally, you can't use this trick for any reads with soft trimming, or indeed any read with a discrepancy versus the reference. That rules out a lot of real world data.

Plan B: Store sequence differences only

Fortunately, there is an alternative which is ideal for the typical case of imperfectly mapped reads. The read sequence is usually held as the letters ACGT in SAM (IUPAC ambiguity codes like N are allowed too, case is not preserved), or in a bit packed field in BAM using half a byte per base (4 bits) pre-compression. Additionally bases matching there references can be represented in SAM with an equals sign instead, or in BAM with no bits set.

So for example, a read 'ACGTACGTACGT' with one mismatch in the sixth base (CIGAR string '12M' or '5=1X6='), could be stored as '=====C======' instead. That doesn't save you any disk space until we try and compress it, and BAM files are normally compressed (using BGZF, a form of GZIP). Using this trick, both the BAM file and a gzipped version of the SAM file shrink - runs of single characters like this are ideal for efficient compression using run length encoding.

Test Code

This can probably be done with Python and PySAM to access the samtools API. However I've had trouble editing BAM records (e.g. altering CIGAR strings) like this, so opted for a much simpler approach.

This Python script (sam_seq_equals.py) takes SAM format as input (together with a reference FASTA file) and returns SAM format where matching bases are converted into equal signs. The idea is this can then be piped from/to BAM using samtools.

Initially I cheated - in the first iteration for any mapped reads I replaced all the bases with equal signs. That gave a lower bound for the benefits this strategy should show - at a fraction of the programming effort, and convinced me to do it properly.

Test Case

Hand waving arguments and toy datasets are all very well, but to make the point we need some actual numbers ideally from real data. So, I decided to use some BAM files used in the CRAM evaluations, from the 1000 (Human) Genomes Project (g1k),  Solexa/Illumina data for the exome. See the CRAM development mailing list archive for details. In particular this data is from Ms NA12878 (female since there is no Y chromosome). There was no particular reason for picking this individual over another.

The right most column is bits per base (bpb), a metric that the EBI are using in developing CRAM and planning for the sequence archive. Averaging less than one bit per base is quite possible if you can cut down on the metadata.

NA12878.mapped.illumina.mosaik.CEU.exome.20110411.bam
This file has lots of tags, all the original quality scores, and all the chromosomes.
In total 12,397,549,884 bases in all 163,310,370 mapped reads, none unmapped.
CompressionSize (bytes)Saving on BAMbpb
BAM, normal17,512,742,373-11.30
BAM, seq equals14,545,219,58416.9%9.39
CRAM???

NA12878.mapped.illumina.mosaik.CEU.exome.20110411.chr20.bam
This file has lots tags, and all the original quality scores, but just chromosome 20.
In total 298,406,433 bases in 3,931,328 mapped reads, none unmapped.
CompressionSize (bytes)Saving on BAMbpb
BAM, normal424,968,956-11.39
BAM, seq equals384,405,0629.5%10.31
CRAM???

NA12878.mapped.illumina.mosaik.CEU.exome.20110411.chr20.all.cram(.bam)
No tags, quality scores for substitutions and insertions only, just chromosome 20.
In total 298,406,433 bases in 3,931,328 mapped reads, none unmapped.
CompressionSize (bytes)Saving on BAMbpb
BAM, normal138,216,326-3.71
BAM, seq equals98,977,30228.4%2.65
CRAM, pre 0.549,304,65264.3%1.32

NA12878.mapped.illumina.mosaik.CEU.exome.20110411.chr20.cram(.bam)
No tags, quality scores as above plus 'pileup3' and 'cutoff5', just chromsome 20.
In total 298,406,433 bases in 3,931,328 mapped reads, none unmapped.
CompressionSize (bytes)Saving on BAMbpb
BAM, normal112,757,708-3.02
BAM, seq equals72,630,83335.6%1.95
CRAM, pre 0.517,033,84884.9%0.46

Each of these four sub-tables should be a fair loss-less comparison. The first two tables have no CRAM entry as currently CRAM doesn't support a loss-less mode which preserves tags etc (update - CRAM 0.7 should). Here using equals signs in the BAM file shows a good saving, about ten percent.

The later two sub-tables are from files after CRAM has discarded things like read tags. I compared the pre-0.5 CRAM file (downloaded), the BAM file recovered from the CRAM file (also downloaded), and my conversion of that BAM file. By discarding the tags and many of the quality scores, the sequence makes up a larger fraction of the data, so the relative benefit of reference based compression rshould increase. I would stress that discarding all the tags in a BAM file is not a viable option for most use cases, so take these numbers with a pinch of salt.

The final example shows almost an order of magnitude size reduction with CRAM (and this was prior to v0.5) and judicious quality selection.

Conclusions

This aimed to demonstrate that we can save significant amounts of space by the adoption of small changes to BAM like this (which are within the specification) right away. Also, as Ewan and others have noted, reference based compression gets better as read lengths and accuracy increase - so the benefit of using the equals sign for reference based compression in BAM should improve with updated sequencing technology.

I think this particular idea could be used for ArchiveBAM, although as of ArchiveBAM v1.0 format they explicitly forbid using equals signs in the SEQ field. Perhaps we can encourage them to change their mind?

There may be problems with some downstream analysis tools like SNP finders which fail to handle the use of equals signs in the SAM/BAM read sequence. I'll file bug reports if I find any, those should be easily addressed.

In terms of using this script right now, it ran in a few minutes for the single chromosomes, which is quite usable. The whole human genome took a few hours - but I want to stress this has had no optimisation at all, and has quite a few debugging assertions in place.

It would make sense to integrate this as an option for samtools view (adding or removing equals signs in read sequences given a FASTA reference), and later the output of assemblers and alignment tools like BWA.

2 comments:

  1. Nice work Peter!
    I am looking to compress BAMs as much as possible and I was decided to include this in my pipeline until I reach: "There may be problems with some downstream analysis tools like SNP finders which fail to handle the use of equals signs..."
    Did you find any bug since the writing?

    Pablo.


    ReplyDelete
    Replies
    1. Try it and see? This is entirely within spec, so please report any bugs in tools which can't cope. I can tell you the SAM/BAM viewer Tablet should cope nicely. You may also want to look at trying the CRAM format as well/instead.

      Delete