Notes on Rastair, a variant and methylation caller
In the last year, I’ve had the pleasure of working on Rastair, a bioinformatics project by Benjamin Schuster-Böckler from the Ludwig Institute for Cancer Research at the University of Oxford. It is a command-line application written in Rust that has excellent performance and a rich feature set. Now, in April 2026, we have just released versions 2.0 and 2.1, so I wanted to take some time to write down some of my thoughts.
Intro
Benjamin was involved in the development of TAPS, an alternative way of processing DNA samples. It has different properties than the typical (bisulfite) method and thus needs different tooling. Extremely simplified: I wrote a new version of an existing tool that takes in TAPS genome sequencing data and finds positions with variants as well as methylation (basically, switches for genes). The complexity of the implementation comes down to two factors: the size of the data being processed (it’s a lot, so performance optimizations can be measured in hours saved), and the fact that we need to deal with a lot of uncertainty and statistics to make the right calls.
Even a year in, I’m not claiming to be an expert in statistics or bioinformatics, so I won’t take credit for the ideas there. I have however put a lot of thought and effort into making sure Rastair outperforms similar tools in runtime speed and hopefully also user experience. In this article, I’d like to describe some of the challenges and some neat Rust implementation details. If you want to read more about the science, check out our paper!
What Rastair does
Rastair’s main command is “call”. It basically reads a BAM file (that contains short snippets of DNA that have been aligned to a reference genome) and outputs a VCF file with our analysis results by their position in the genome, i.e., the “calls” we make with the evidence from the input files.
We are looking for two things:
- Variants: Does the sample have a different base than the reference at this position?
- Methylation: Is this position part of a CpG site, and does it have an additional methyl group on it?
This allows people to analyze the output further, to find patterns and understand properties of the DNA sample.
Putting all CPU cores to work
Rastair will run on very large input data: my test file is 120 GB, and I want to be able to run it on my laptop. To get anywhere with that, we need to make sure we’re using all the hardware we have, starting with CPU cores.
Our main input format is a BAM file. The DNA snippets it contains have been aligned in a previous step so we know where they start and end and what the position relative to the reference is. Variants (SNV, single nucleotide variants) can then be found by looking at just one position and all the reads we have for it. Methylation looks at two positions.
Rastair expects this BAM file to be sorted and to have an index file next to it, which allows random access to fetch specific positions. BAM files also have a header which lists all the “contigs” (e.g. chromosomes) and their length. With this, we can split up the work and parallelize!
Rastair first builds a list of “segments” (with a length of 10,000 bases by default) that overlap a little at the edges to capture enough context for methylation calls and some statistics on the surrounding context. This segment list becomes our task list: using Rayon, we process it with one thread per CPU core (or a user-specified number of threads).
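The segment-building step can be sketched roughly like this. This is a simplified illustration, not Rastair’s actual code: the `Segment` type, the `segments` function, and the overlap handling are my own assumptions.

```rust
// Illustrative sketch: turn a contig from the BAM header into
// overlapping fixed-size windows that become the task list.

#[derive(Debug, PartialEq)]
struct Segment {
    contig: usize, // index into the header's contig list
    start: u64,    // inclusive, 0-based
    end: u64,      // exclusive
}

/// Split a contig of `len` bases into windows of `size` bases that
/// overlap by `overlap` bases on each side, clamped to the contig.
fn segments(contig: usize, len: u64, size: u64, overlap: u64) -> Vec<Segment> {
    let mut out = Vec::new();
    let mut start = 0;
    while start < len {
        out.push(Segment {
            contig,
            start: start.saturating_sub(overlap),
            end: (start + size + overlap).min(len),
        });
        start += size;
    }
    out
}

fn main() {
    // A 25 kb contig with 10 kb segments and 100 bases of overlap.
    let segs = segments(0, 25_000, 10_000, 100);
    assert_eq!(segs.len(), 3);
    assert_eq!(segs[0].end, 10_100);
    assert_eq!(segs[1].start, 9_900);
    println!("{segs:?}");
}
```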
Each “worker” thread is (almost) lock-free. On start, it opens all files it needs and puts them in a thread-local static. For each segment, it then fetches the actual data from the files, does its processing, and sends the results along with the segment index to a channel. If there are no results, it sends an empty list alongside its index.
Rastair uses an “ordered channel” with an internal buffer so that results can arrive out of order (some segments might have no interesting data or less coverage, so they are faster to process) but the subscriber will always get them in order.
A special “writer” thread is the only subscriber to this channel: It takes the results, converts them to the output format(s) the user requested and then writes to the target files (or stdout).
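The worker/ordered-channel/writer pipeline can be sketched with just the standard library. The real implementation uses Rayon and its own ordered channel; everything below (the `run_pipeline` function, the simulated delays) is illustrative only.

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Spawn "workers" that finish out of order and return the results
/// in segment order, as the writer thread would see them.
fn run_pipeline() -> Vec<String> {
    let (tx, rx) = mpsc::channel::<(usize, String)>();

    for (idx, delay_ms) in [(0usize, 30u64), (1, 10), (2, 20)] {
        let tx = tx.clone();
        let _ = thread::spawn(move || {
            thread::sleep(Duration::from_millis(delay_ms)); // simulate work
            tx.send((idx, format!("segment {idx}"))).unwrap();
        });
    }
    drop(tx); // the loop below ends once all worker clones are dropped

    // "Writer" side: buffer out-of-order results, emit them in order.
    let mut buffer = HashMap::new();
    let mut next = 0;
    let mut output = Vec::new();
    for (idx, result) in rx {
        buffer.insert(idx, result);
        while let Some(result) = buffer.remove(&next) {
            output.push(result); // in the real tool: write to the output file
            next += 1;
        }
    }
    output
}

fn main() {
    let output = run_pipeline();
    assert_eq!(output, ["segment 0", "segment 1", "segment 2"]);
    println!("{output:?}");
}
```

Segment 1 and 2 finish before segment 0 here, but the writer still emits them in order because it holds them in the buffer until their predecessor has arrived.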
Reducing allocations
Multithreading alone was not enough to make this really fast. The next culprit: allocating memory, which (as so often) is one of the main performance bottlenecks.
Whenever we can, we try to use “iterator chains” that combine our processing steps and filter out items before we ever allocate/collect them. This means we usually have a very clean and pure mapping function in each step, and are only looking at one position at a time. When we need more context, we have a map_surrounding helper that buffers the positions before/after.
Another trick is to accumulate in place: similar to .collect(), we have our own types that can be built from iterators. In some places, however, we go over the same list many times and just use a loop. Why make it complicated?
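To make the “accumulate in place” idea concrete, here is a sketch of an iterator chain that filters before allocating, collecting into a custom type via `FromIterator`. The `Pileup` and `Coverage` types are hypothetical stand-ins, not Rastair’s real types.

```rust
// Illustrative sketch: filter early in the chain, then collect
// directly into an accumulator type instead of an intermediate Vec.

#[allow(dead_code)]
struct Pileup {
    pos: u64,
    depth: u32,
}

/// Per-position summary that we accumulate "in place".
#[derive(Default, Debug)]
struct Coverage {
    positions: u64,
    total_depth: u64,
}

impl FromIterator<Pileup> for Coverage {
    fn from_iter<I: IntoIterator<Item = Pileup>>(iter: I) -> Self {
        let mut cov = Coverage::default();
        for p in iter {
            cov.positions += 1;
            cov.total_depth += p.depth as u64;
        }
        cov
    }
}

fn main() {
    let raw = [(100u64, 0u32), (101, 12), (102, 7), (103, 0)];
    // Drop uncovered positions before we ever build a Pileup.
    let cov: Coverage = raw
        .iter()
        .filter(|(_, depth)| *depth > 0)
        .map(|&(pos, depth)| Pileup { pos, depth })
        .collect();
    assert_eq!(cov.positions, 2);
    assert_eq!(cov.total_depth, 19);
    println!("{cov:?}");
}
```

Because `Coverage` implements `FromIterator`, the familiar `.collect()` call drives the accumulation without allocating a list of intermediate items.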
Why make it complicated?
In many places we know that a list will most likely contain only a few items. In those cases, we use SmallVec to avoid allocating a Vec. This is used heavily when building the structures we pass to the VCF writer, where we have lots of arrays with 0–4 numbers. Another benefit is better cache locality, since structs can contain most of their content directly without having to chase a pointer. We could have used a fixed-capacity library as well to reduce the number of checks (at the cost of extra error handling), but it hasn’t been an issue yet.
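The idea behind SmallVec is to store up to N elements inline in the struct itself and only spill to the heap beyond that. Here is a deliberately minimal std-only sketch of that idea (the real smallvec crate is far more complete and efficient; `SmallList` is my own toy type):

```rust
// Toy small-vector: inline storage for up to N Copy elements,
// falling back to a heap Vec when the capacity is exceeded.

enum SmallList<T, const N: usize> {
    Inline { buf: [Option<T>; N], len: usize },
    Heap(Vec<T>),
}

impl<T: Copy, const N: usize> SmallList<T, N> {
    fn new() -> Self {
        SmallList::Inline { buf: [None; N], len: 0 }
    }

    fn push(&mut self, item: T) {
        match self {
            SmallList::Inline { buf, len } if *len < N => {
                buf[*len] = Some(item);
                *len += 1;
            }
            SmallList::Inline { buf, len } => {
                // Spill to the heap once the inline capacity is full.
                let mut v: Vec<T> = buf[..*len].iter().map(|o| o.unwrap()).collect();
                v.push(item);
                *self = SmallList::Heap(v);
            }
            SmallList::Heap(v) => v.push(item),
        }
    }

    fn len(&self) -> usize {
        match self {
            SmallList::Inline { len, .. } => *len,
            SmallList::Heap(v) => v.len(),
        }
    }

    fn is_inline(&self) -> bool {
        matches!(self, SmallList::Inline { .. })
    }
}

fn main() {
    // Like the 0-4 element arrays passed to the VCF writer.
    let mut counts: SmallList<u32, 4> = SmallList::new();
    for c in [3, 1, 4, 1] {
        counts.push(c);
    }
    assert!(counts.is_inline()); // 4 items still fit inline
    counts.push(5);
    assert!(!counts.is_inline()); // the 5th item spills to the heap
    assert_eq!(counts.len(), 5);
}
```

As long as the common case stays within the inline capacity, no allocation happens at all, and the data sits right inside the containing struct.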
Aside from short lists, we also have a lot of short strings that we never change. I often use SmolStr in my Rust projects, and this is no exception. It’s a pretty neat little library that also comes with free cloning.
In general, I opted for pretty generic tools here to not have to think about too many edge cases. What if a user has a contig name that is very, very long? It should just work.
It should also be noted that I switched to the mimalloc allocator quite early in the project, which gave a lot of allocation performance benefits (at least on macOS). I’m also using it in “override” mode, which means it’s also used for non-Rust code like htslib. (I quickly tried jemalloc but saw no performance difference. Since this is Rust, switching allocators is pretty simple.)
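For readers who haven’t done this before, swapping the global allocator in Rust is a two-line change (assuming the mimalloc crate is in the dependencies; the “override” behaviour for non-Rust code is a build-time option of the underlying C library, enabled via crate feature flags):

```rust
// Cargo.toml (assumed): mimalloc = "0.1"
use mimalloc::MiMalloc;

// Route every Rust allocation through mimalloc.
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;
```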
There’s a big C library
The main library for dealing with many of the different file formats in this ecosystem is htslib. It’s written for and used by the samtools suite, and we’re using it through the rust-htslib wrapper. This made it quite simple to get started and gave us great support for all the typical formats from the get-go.
While building our VCF/BCF output, however, I noticed some performance issues. From Rust, we pass a &[u8] byte slice for field names, but the C function expects a null-terminated string, so rust-htslib allocates a std::ffi::CString for every call. Furthermore, in the error case it always calls str::from_utf8 (and to_owned) to print the field name. I fixed this in my fork of the library by using CStr8 from the cstr8 crate, which is both null-terminated and valid UTF-8; the caller can trivially uphold this.
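The idea behind CStr8 can be shown with std alone: a string that is guaranteed to be both NUL-terminated (for C) and valid UTF-8 (for error messages), so neither a per-call CString allocation nor a UTF-8 check is needed. The `FieldName` type below is my own sketch; the real code uses the cstr8 crate.

```rust
use std::ffi::CStr;

/// Invariant: the inner string contains exactly one NUL, at the end.
struct FieldName(&'static str);

impl FieldName {
    const fn new(s: &'static str) -> Self {
        // The caller can trivially uphold the invariant for literals.
        let b = s.as_bytes();
        assert!(!b.is_empty() && b[b.len() - 1] == 0);
        FieldName(s)
    }

    /// For the C side: NUL-terminated, no allocation.
    fn as_c_str(&self) -> &CStr {
        CStr::from_bytes_with_nul(self.0.as_bytes()).unwrap()
    }

    /// For error messages: valid UTF-8, no from_utf8 check needed.
    fn as_str(&self) -> &str {
        &self.0[..self.0.len() - 1]
    }
}

fn main() {
    const DP: FieldName = FieldName::new("DP\0");
    assert_eq!(DP.as_str(), "DP");
    assert_eq!(DP.as_c_str().to_bytes(), b"DP");
}
```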
In a similar vein, I later added a bam::RecordView<'a> type, a borrowed version of the existing Record structure. This further reduced the allocations needed in our hottest loop.
Small Bonus Challenge
When going over the reads in a BAM file, we sometimes come across the same read twice. This is due to the sequencing: Two reads are generated from opposite ends of the same DNA fragment but when the fragment is too short, both reads cover the same positions in their middle. If we count both reads, we basically count one piece of evidence twice.
This means we need to de-duplicate reads by their ID (qname). In many cases we have low coverage (fewer than 50 reads for a position), but some regions have much higher coverage (e.g. due to misaligned reads with low quality). Rastair needs to handle both cases in a reasonable fashion.
For this, I designed a simple but effective two-way lookup system. Since read names often share a common prefix (a sequencing ID), I first look only at the last 4 characters of a read name. These are very quick to compare and fit nicely in the CPU cache. I build this suffix list up while going over the reads. For each read, I first check whether we have already seen its suffix (at least once); if so, we compare the full name. If it is a real duplicate, we add the read to a “skip” list; otherwise, we add it to the list and continue processing. At the end, we remove all reads in the skip list with swap_remove, going over it in reverse. The suffix lists are cleared and reused between positions.
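Here is a std-only sketch of that scheme. The function names, the linear suffix scan, and the details are illustrative, not Rastair’s exact code, but the structure matches the description above: cheap 4-byte suffix comparison first, full qname comparison only on a suffix hit, and removal via swap_remove in reverse index order.

```rust
/// Last 4 bytes of a read name, zero-padded for short names.
fn suffix4(name: &str) -> [u8; 4] {
    let b = name.as_bytes();
    let mut s = [0u8; 4];
    let n = b.len().min(4);
    s[4 - n..].copy_from_slice(&b[b.len() - n..]);
    s
}

fn dedup_reads(mut reads: Vec<String>) -> Vec<String> {
    let mut suffixes: Vec<[u8; 4]> = Vec::new();
    let mut kept: Vec<usize> = Vec::new(); // indices of kept reads
    let mut skip: Vec<usize> = Vec::new();

    for (i, name) in reads.iter().enumerate() {
        let suf = suffix4(name);
        let dup = suffixes.iter().zip(&kept).any(|(s, &j)| {
            // Full comparison only when the cheap suffix matches.
            *s == suf && reads[j] == *name
        });
        if dup {
            skip.push(i);
        } else {
            suffixes.push(suf);
            kept.push(i);
        }
    }

    // Remove in reverse so earlier indices stay valid for swap_remove.
    for &i in skip.iter().rev() {
        reads.swap_remove(i);
    }
    reads
}

fn main() {
    let reads = vec![
        "A01:1:1000".to_string(),
        "A01:1:1001".to_string(),
        "A01:1:1000".to_string(), // mate overlapping the same positions
    ];
    let kept = dedup_reads(reads);
    assert_eq!(kept.len(), 2);
}
```

For low-coverage positions the suffix list stays tiny and the linear scan is effectively free; only high-coverage regions pay for the extra full-name comparisons.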
Nice documentation
Being the fastest tool is good marketing, but we also want people to enjoy using it, so we spent some time making sure we have nice documentation. You can see the result on rastair.com. A big chunk of this was written by Benjamin who knows what our users actually want to read, while I filled in the technical details.
Rastair is a CLI tool written in Rust, so to nobody’s surprise it uses clap for defining its command-line arguments and subcommands. Using clap-markdown, we also show the CLI help on the website.
One of our output formats is VCF (Variant Call Format), which basically has one position per line and lists a bunch of field values for it. Some of these fields are standardized but many are not, and each VCF starts with a header that defines all fields. In Rust, I define these fields with macros, and then have one more macro that defines the entire structure with all possible fields. This same macro also adds some helper methods which an internal subcommand calls to render it all as a Markdown file, generating one more page of documentation that cannot be out of date.
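A hypothetical, much simplified version of such a macro could look like this: each field declares its VCF name, type, and description exactly once, and the same list drives both the header lines and a Markdown docs page. The field names, macro shape, and generated functions here are all my own invention for illustration.

```rust
// Sketch: one field list, two outputs (VCF header + Markdown docs),
// so the documentation can never drift away from the code.

macro_rules! vcf_fields {
    ($( ($name:expr, $ty:expr, $desc:expr) ),* $(,)?) => {
        /// VCF header lines for all defined fields.
        fn header_lines() -> Vec<String> {
            vec![$( format!(
                "##INFO=<ID={},Type={},Description=\"{}\">",
                $name, $ty, $desc
            ) ),*]
        }

        /// Markdown documentation table for the same fields.
        fn markdown_docs() -> String {
            let mut out = String::from("| Field | Type | Description |\n|---|---|---|\n");
            $( out.push_str(&format!("| {} | {} | {} |\n", $name, $ty, $desc)); )*
            out
        }
    };
}

vcf_fields! {
    ("DP", "Integer", "Total read depth at this position"),
    ("MF", "Float", "Fraction of methylated reads"),
}

fn main() {
    assert_eq!(header_lines().len(), 2);
    assert!(header_lines()[0].contains("ID=DP"));
    assert!(markdown_docs().contains("| MF | Float |"));
    println!("{}", markdown_docs());
}
```

An internal subcommand can then simply print the Markdown output and feed it into the website build.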
Further notes
This is but the beginning! I’ll try to write more posts on this as I develop it and as I find the time to sit down and write them. Let me know what you’d like to read more about.
There’s at least one more topic I want to get into: Rastair uses a Random Forest ML model to evaluate positions and achieve great accuracy. But this is gonna be slow, right? Yes, and this is why we also ported this part to run in a compute shader on a GPU! Stay tuned, because this makes my laptop competitive with a 32-core server.