# Seqair, my Rust-native take on htslib
This post introduces seqair, a new Rust library for bioinformatics formats I’ve been working on. We’ll look in detail at one of my main design choices here, a columnar store for iterating BAM data, and talk about how I tried to create nice Rust APIs.
Rastair, the project I’ve been working on, reads and writes a lot of file formats: SAM/BAM/CRAM input, VCF/BCF output, reference FASTA, all through rust-htslib, a wrapper around htslib, the C library behind the samtools suite. After a year of using (and tweaking) it, I wanted to see what a Rust-native approach could look like. Seqair is the result of that experiment.
## An experiment
The week before Easter we released Rastair 2.1 with GPU support.
The next big task is local read realignment1, which requires changing the order and annotations of BAM records.
I’m very intrigued by the challenge
and also by trying to understand the formats and their uses on a deeper level.
Trying to optimize BCF writing in rust-htslib was how I got started2,
and I didn’t want to stop there.
For realignment, I could either fiddle with the raw pointers to htslib structs, or write intermediate BAM records only to parse them again. Neither is a very satisfying option.
For what Rastair needs, there are several possible implementations, and htslib and noodles each offer only one of them. I had a vague idea of how I would want to do it, given a clean slate.
So, I decided to do an experiment. Can I write my ideas as specifications using tracey (I’ve heard good things), combine this with the extensive specifications for the formats and use Claude Code to get a prototype for this going?3 The goal is to prove whether my ideas are feasible and also have some fun with trying out new tools and see how they deal with some of my coding style choices.
This prototype is done when I have
a library that can replace rust-htslib for the pileup4 generation in Rastair
and show an example of how realignment would be handled.
## A Rustic API
One of my goals was to make the developer experience very Rust-like, because that is what I enjoy (and what I’m good at). There are many aspects to this and I want to name only a few of them here.
I wanted strong types for everything, types that enforce logical constraints and use type parameters to distinguish variants.
For example:
Some formats have zero-based positions
and others (like VCF) count from 1.
Naturally, I added a type Pos<T>
that exists as both Pos<Zero> and Pos<One>
(aliased to Pos0 and Pos1 for convenience).
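As a sketch of how that can work (the method names and exact definition here are illustrative; seqair’s actual `Pos` may differ), a phantom type parameter keeps the two coordinate systems apart at compile time:

```rust
use std::marker::PhantomData;

// Hypothetical sketch of a coordinate-base-tagged position type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Zero;
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct One;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Pos<B>(u64, PhantomData<B>);

pub type Pos0 = Pos<Zero>;
pub type Pos1 = Pos<One>;

impl Pos<Zero> {
    pub fn new(p: u64) -> Self { Pos(p, PhantomData) }
    // Conversions are explicit, so 0-based BAM coordinates can never be
    // mixed up with 1-based VCF coordinates silently.
    pub fn to_one_based(self) -> Pos1 { Pos(self.0 + 1, PhantomData) }
}

impl Pos<One> {
    pub fn new(p: u64) -> Self { Pos(p, PhantomData) }
    pub fn to_zero_based(self) -> Pos0 { Pos(self.0 - 1, PhantomData) }
}

impl<B> Pos<B> {
    pub fn get(self) -> u64 { self.0 }
}

fn main() {
    let bam_pos: Pos0 = Pos::<Zero>::new(999);
    let vcf_pos: Pos1 = bam_pos.to_one_based();
    assert_eq!(vcf_pos.get(), 1000);
    // let bad: Pos1 = bam_pos; // compile error: Pos<Zero> is not Pos<One>
    println!("ok");
}
```

An accidental `Pos<Zero>`-to-`Pos<One>` assignment becomes a type error instead of an off-by-one bug.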
Another example is that there is one Reader entrypoint,
which automatically detects the file format(s)
and provides detailed errors.
This is not something I’ve seen noodles provide,
and the errors from htslib were sadly not very helpful.
Also, htslib has no way of plugging its logs into something other than stderr.
Seqair uses tracing
for logs and instrumentation,
which is Rust-native and can be enabled or disabled by the end user.
When writing VCF/BCF files5,
we first need to write a header,
which defines all the fields.
In the binary version, we will refer to them by their ID.
This header needs to be written in a specific order
(contigs, then filters, then info fields, then format fields, then samples).
In the same way,
records (lines) need to be written in a specific order,
so that we can stream them directly to an output buffer.
All of this is enforced using the type-state pattern6,
where methods like VcfHeaderBuilder::formats or RecordEncoder::begin_samples
transition from one state to another,
which then includes different methods to add different data.
More on this in a future post!
Seqair leans harder into type-state builders than either noodles or rust-htslib. I’m not sure this is necessarily better for the average user, who might not enjoy these Rust features as much as I do.
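As an illustration of the pattern (the state names here are invented for this sketch, not seqair’s actual builder), each method consumes the current state and returns the next, so the section order is fixed at compile time:

```rust
// Minimal type-state sketch: the builder can only move forward through the
// required header sections, so writing a filter after the info fields
// simply does not compile.
struct NeedsFilters { out: String }
struct NeedsInfos { out: String }
struct NeedsFormats { out: String }

struct VcfHeaderBuilder;

impl VcfHeaderBuilder {
    fn new() -> NeedsFilters {
        NeedsFilters { out: String::new() }
    }
}

impl NeedsFilters {
    fn filter(mut self, id: &str) -> Self {
        self.out.push_str(&format!("##FILTER=<ID={}>\n", id));
        self
    }
    // Consuming `self` and returning the next state is the transition.
    fn infos(self) -> NeedsInfos { NeedsInfos { out: self.out } }
}

impl NeedsInfos {
    fn info(mut self, id: &str) -> Self {
        self.out.push_str(&format!("##INFO=<ID={}>\n", id));
        self
    }
    fn formats(self) -> NeedsFormats { NeedsFormats { out: self.out } }
}

impl NeedsFormats {
    fn format(mut self, id: &str) -> Self {
        self.out.push_str(&format!("##FORMAT=<ID={}>\n", id));
        self
    }
    fn finish(self) -> String { self.out }
}

fn main() {
    let header = VcfHeaderBuilder::new()
        .filter("PASS")
        .infos()
        .info("DP")
        .formats()
        .format("GT")
        .finish();
    print!("{header}");
    // Calling .filter(..) after .infos() is a compile error, not a runtime check.
}
```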
## A columnar record store for BAM
Let me pick some pieces that I think came out well, and talk about how the design comes together and how it differs from the other implementations (or not).
When a reader decodes a segment7, it produces hundreds to thousands of BAM records. Each record has a handful of fixed-size fields (position, flags, mapping quality) and several variable-length ones, namely the read name, the CIGAR8, the sequence, the per-base qualities, the aux tags.
The obvious design is one struct per record,
with the variable-length fields as Box<[u8]> or Vec<u8>.
That’s six heap allocations per record9; a typical segment has thousands of records, and we throw them all away and start again for the next segment.
Seqair replaces this with a RecordStore
with vectors that act like columns of a table.
A compact SlimRecord holds the fixed fields and offsets into these slabs,
one each for names, bases, CIGAR bytes, quality, and auxiliary tags.
```rust
pub struct RecordStore {
    records: Vec<SlimRecord>,
    names: Vec<u8>,   // [qname₀|qname₁|qname₂|...]
    bases: Vec<Base>, // [bases₀|bases₁|bases₂|...]
    cigar: Vec<u8>,   // [cigar₀|cigar₁|cigar₂|...]
    qual: Vec<u8>,    // [qual₀|qual₁|qual₂|...]
    aux: Vec<u8>,     // [aux₀|aux₁|aux₂|...]
}

pub struct SlimRecord {
    // ...
    bases_off: u32,
    pub seq_len: u32,
    name_off: u32,
    name_len: u16,
    cigar_off: u32,
    pub n_cigar_ops: u16,
    qual_off: u32, // qual_len = seq_len
    aux_off: u32,
    aux_len: u32,
}
```
People call it “columnar”, but I always draw it like rows in my head: each slab is one long row, and a record is a vertical slice through them.
Decoding a BAM record is now a handful of extend_from_slice calls
into slabs that were pre-sized from the compressed byte count of the segment.
After warm-up there are no allocations in the hot loop at all:
clear() resets lengths without releasing capacity,
and the next region reuses the same memory10.
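A stripped-down sketch of that push path (two slabs instead of five, no error handling, method names invented for the sketch) shows why the hot loop stays allocation-free after warm-up:

```rust
// Simplified columnar store: offsets into shared slabs instead of
// per-record heap allocations.
#[derive(Debug, Clone, Copy)]
struct SlimRecord {
    name_off: u32,
    name_len: u16,
    qual_off: u32,
    seq_len: u32,
}

#[derive(Default)]
struct RecordStore {
    records: Vec<SlimRecord>,
    names: Vec<u8>,
    qual: Vec<u8>,
}

impl RecordStore {
    fn push(&mut self, name: &[u8], qual: &[u8]) {
        let rec = SlimRecord {
            name_off: self.names.len() as u32,
            name_len: name.len() as u16,
            qual_off: self.qual.len() as u32,
            seq_len: qual.len() as u32,
        };
        // One extend_from_slice per slab: no per-record allocation once
        // the Vecs have grown to their working size.
        self.names.extend_from_slice(name);
        self.qual.extend_from_slice(qual);
        self.records.push(rec);
    }

    fn name(&self, i: usize) -> &[u8] {
        let r = self.records[i];
        &self.names[r.name_off as usize..r.name_off as usize + r.name_len as usize]
    }

    // clear() resets lengths but keeps capacity, so the next segment
    // reuses the same memory.
    fn clear(&mut self) {
        self.records.clear();
        self.names.clear();
        self.qual.clear();
    }
}

fn main() {
    let mut store = RecordStore::default();
    store.push(b"read1", b"IIII");
    store.push(b"read2", b"JJJJ");
    assert_eq!(store.name(1), b"read2");
    let cap = store.names.capacity();
    store.clear();
    assert_eq!(store.names.capacity(), cap); // capacity survives clear()
    println!("ok");
}
```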
Splitting this into slabs isn’t just because I think it would be cool but also because the access pattern matches11. For example, read names are (right now) only used during overlapping-pair dedup, so they sit by themselves as a compact contiguous buffer that cache-prefetches nicely during a linear scan.
CIGAR lives in its own slab
because it is the one thing that changes during local realignment.
store.set_alignment(idx, new_pos, new_cigar) appends the new ops to the end of the CIGAR slab
and rewrites the record’s cigar_off, n_cigar_ops, pos, and end_pos.
The old bytes become dead data in the slab,
but since we’re about to call store.clear() at the end of the region it doesn’t matter.
Append-only mutation means the sequence, quality, and aux slabs
never have to be touched when realignment moves a read around.
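In miniature, with a toy store that packs CIGAR ops as plain `u32`s (a sketch, not seqair’s actual code), the append-only rewrite looks like this:

```rust
// Append-only CIGAR rewrite: new ops go to the end of the slab, the
// record's offsets are redirected, and the old bytes become dead data
// that clear() will reclaim at the end of the region.
#[derive(Clone, Copy)]
struct Rec {
    cigar_off: u32,
    n_cigar_ops: u16,
    pos: u32,
}

struct Store {
    recs: Vec<Rec>,
    cigar: Vec<u32>, // one packed u32 per CIGAR op
}

impl Store {
    fn set_alignment(&mut self, idx: usize, new_pos: u32, new_cigar: &[u32]) {
        let off = self.cigar.len() as u32;
        self.cigar.extend_from_slice(new_cigar);
        let r = &mut self.recs[idx];
        r.cigar_off = off;
        r.n_cigar_ops = new_cigar.len() as u16;
        r.pos = new_pos;
    }

    fn cigar(&self, idx: usize) -> &[u32] {
        let r = self.recs[idx];
        &self.cigar[r.cigar_off as usize..r.cigar_off as usize + r.n_cigar_ops as usize]
    }
}

fn main() {
    let mut s = Store {
        recs: vec![Rec { cigar_off: 0, n_cigar_ops: 2, pos: 100 }],
        cigar: vec![0x10, 0x21],
    };
    s.set_alignment(0, 98, &[0x30]);
    assert_eq!(s.cigar(0), &[0x30]);
    assert_eq!(s.cigar.len(), 3); // old ops still in the slab, just unreferenced
    println!("ok");
}
```

No other slab moves, which is exactly why sequence, quality, and aux stay untouched during realignment.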
The pileup engine sits on top of this.
A PileupAlignment carries pre-extracted flat fields
(the base at this position, its Phred score, the read’s MAPQ, the flags)
and a record index into the store.
The store is borrowed for the duration of the iteration,
and Rust’s lifetimes do the rest.
## Customization
The same append-only shape pays off twice more.
The reader actually takes a “customizer”
(a user-defined type that implements seqair’s CustomizeRecordStore trait)
that lets the user control some of its behavior.
First, it lets users decide whether to keep a record.
The customizer gets a parsed record, can analyze it,
and decide whether it is of interest.
If not,
each slab gets truncated back to the length it had before the push,
which discards the record.
The just-pushed SlimRecord already records where each slab’s tail was,
so its offsets are used to call truncate.
This is a really cheap and easy-to-implement filter method12.
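The mechanism is small enough to sketch in full with toy slabs and a toy filter (seqair’s actual trait hands the customizer a parsed `SlimRecord` instead):

```rust
// Push-then-truncate filtering: if the record is rejected, each slab is
// cut back to the length it had before the push. No allocation, no shifting.
struct Store {
    names: Vec<u8>,
    qual: Vec<u8>,
    count: usize,
}

impl Store {
    fn push_filtered(&mut self, name: &[u8], qual: &[u8], keep: impl Fn(&[u8]) -> bool) {
        // Remember where each slab's tail was before the push.
        let (n0, q0) = (self.names.len(), self.qual.len());
        self.names.extend_from_slice(name);
        self.qual.extend_from_slice(qual);
        if keep(qual) {
            self.count += 1;
        } else {
            // Rejecting is just two truncates back to the saved offsets.
            self.names.truncate(n0);
            self.qual.truncate(q0);
        }
    }
}

fn main() {
    let mut s = Store { names: vec![], qual: vec![], count: 0 };
    // Toy filter: keep only reads whose qualities are all at least '5'.
    let keep = |q: &[u8]| q.iter().all(|&b| b >= b'5');
    s.push_filtered(b"good", b"IIII", keep);
    s.push_filtered(b"bad", b"!!!!", keep);
    assert_eq!(s.count, 1);
    assert_eq!(s.names, b"good".to_vec()); // rejected record left no trace
    println!("ok");
}
```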
Second, the RecordStore is actually generic over a type U (“user data”),
with a bonus Vec<U> slab growing alongside the others.
The default U = () is free13.
The customizer can produce a U = CustomizeRecordStore::Extra type per record,
and the SlimRecord carries an extras_idx into that slab.
The pileup engine’s AlignmentView exposes aln.extra() -> &U
alongside the other accessors,
so the user’s data behaves just like CIGAR, AUX, etc.
This means a user can add data per-record
without the need for any parallel data structure that has to stay in sync.
## Code example
Here’s a “full” pileup example, to illustrate what the API actually looks like:
```rust
use clap::Parser as _;
use seqair::{
    Readers, SegmentOptions,
    bam::{
        RecordStore,
        record_store::{CustomizeRecordStore, SlimRecord},
    },
};

#[derive(Debug, clap::Parser)]
struct Cli {
    input: std::path::PathBuf,     // BAM file
    reference: std::path::PathBuf, // FASTA file
    #[clap(long)]
    region: seqair_types::RegionString,
    #[clap(long, default_value_t = 20)]
    min_mapq: u8,
}

fn main() -> anyhow::Result<()> {
    let args = Cli::parse();
    let mut readers = Readers::open_customized(
        &args.input,
        &args.reference,
        ReadInfoBuilder { min_mapq: args.min_mapq },
    )?;
    let max_len = 100_000.try_into()?;
    let segments = readers.segments(&args.region, SegmentOptions::new(max_len))?;
    for segment in segments {
        let mut engine = readers.pileup(&segment)?;
        while let Some(column) = engine.pileups() {
            for aln in column.alignments() {
                let info = aln.extra();
                // ...
            }
        }
    } // pileup engine drop returns store to readers for reuse in next iteration
    Ok(())
}

struct ReadInfo {
    read_group: Option<String>,
}

#[derive(Debug, Clone)]
struct ReadInfoBuilder {
    min_mapq: u8,
}

impl CustomizeRecordStore for ReadInfoBuilder {
    type Extra = ReadInfo;

    fn keep_record(&mut self, rec: &SlimRecord, _: &RecordStore<ReadInfo>) -> bool {
        !rec.flags.is_unmapped()
    }

    fn compute(&mut self, rec: &SlimRecord, store: &RecordStore<ReadInfo>) -> ReadInfo {
        ReadInfo { read_group: rec.aux(store).ok().and_then(|aux| aux.get("RG").ok()) }
    }
}
```
## BGZF and the shape of cluster I/O
The columnar store was the big idea, but along the way I had some more. One was about how we read our files.
BAM files are compressed with BGZF, which is just gzip with a stricter block layout. Every block is an independent gzip member of up to 64 KiB uncompressed, and the compressed block size is encoded in the gzip extra field. This is what makes random access possible: a BAI index can point at a specific block offset, and the reader can decompress just that block without touching the rest of the file.
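Finding that size means walking the gzip extra subfields until the one tagged `B`,`C`; a minimal parser following the BGZF spec (a sketch, not seqair’s code) looks like this:

```rust
// Returns the total compressed length of the BGZF block whose header is
// at the start of `header`, per the SAM/BGZF spec: the 'BC' extra
// subfield stores (block length - 1) as a little-endian u16.
fn bgzf_block_size(header: &[u8]) -> Option<usize> {
    // gzip magic, deflate method, FEXTRA flag set
    if header.len() < 12
        || header[0] != 0x1f
        || header[1] != 0x8b
        || header[2] != 8
        || header[3] & 4 == 0
    {
        return None;
    }
    let xlen = u16::from_le_bytes([header[10], header[11]]) as usize;
    let mut extra = header.get(12..12 + xlen)?;
    // Walk the extra subfields until we find the 'B','C' one.
    while extra.len() >= 4 {
        let slen = u16::from_le_bytes([extra[2], extra[3]]) as usize;
        if extra[0] == b'B' && extra[1] == b'C' && slen == 2 {
            let bsize = u16::from_le_bytes([*extra.get(4)?, *extra.get(5)?]);
            return Some(bsize as usize + 1); // stored as total length minus one
        }
        extra = extra.get(4 + slen..)?;
    }
    None
}

fn main() {
    // Minimal BGZF header for a block whose total compressed length is 0x2000.
    let hdr = [
        0x1f, 0x8b, 8, 4, 0, 0, 0, 0, 0, 0xff, // gzip header with FEXTRA
        6, 0, // XLEN = 6
        b'B', b'C', 2, 0, 0xff, 0x1f, // BC subfield, BSIZE = 0x1fff
    ];
    assert_eq!(bgzf_block_size(&hdr), Some(0x2000));
    println!("ok");
}
```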
But the interesting part is the I/O pattern on top of it. People want to run Rastair on HPC nodes where the BAM files live on something like NFS, and probably not on a fast local SSD. Each I/O operation there costs a network round-trip, and the standard access pattern14 turns into hundreds of small reads per region. On my local SSD this is invisible, but on a shared cluster filesystem I think there is a lot to gain from reading bigger chunks rather than just hoping the OS helps us out.
RegionBuf changes the access pattern.
For each segment, it queries the index file,
merges overlapping and adjacent chunks into the smallest possible set of byte ranges,
and does one seek plus read_exact per merged range into a Vec<u8>.
Typical regions then come down to a single multi-MB read
followed by entirely in-memory decompression.
The BGZF reader doesn’t know or care that it’s reading from a buffer rather than a file,
it’s the same Read + Seek implementation,
just backed by bytes that are already resident.
Two smaller details make this work in practice, and getting them to play nicely together took some time.
First, chunk merging has to extend each merged range’s end by at least one maximum BGZF block size past the last chunk’s end, because the final record in a chunk can span into the next block. Without the extension, you’d get a truncated read on the very last record of a region. This is the kind of bug you only see with real BAM files and not just ideal unit tests.
Second, bin 0 is split out.
BAI’s binning scheme places reads that straddle a 64 Mbp boundary into bin 0,
which has chunks scattered across the file at positions that may be gigabytes away
from the rest of the query’s chunks.
Blindly merging bin 0 into the region read means possibly a multi-GB read per region.
Seqair’s index provides query_split
which returns nearby (levels 3–5, close together) and distant (levels 0–2) chunks separately.
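Ignoring bin 0 (which `query_split` keeps separate), the merge-and-extend idea can be sketched with a hypothetical helper (not seqair’s actual `RegionBuf` code):

```rust
// One maximum uncompressed BGZF block; padding each merged range by this
// much guarantees the last record of a chunk can finish its block.
const MAX_BGZF_BLOCK: u64 = 64 * 1024;

// Merge byte ranges that overlap or sit within `gap` of each other,
// then extend every merged range's end by one max block size.
fn merge_ranges(mut chunks: Vec<(u64, u64)>, gap: u64) -> Vec<(u64, u64)> {
    chunks.sort_unstable();
    let mut merged: Vec<(u64, u64)> = Vec::new();
    for (start, end) in chunks {
        match merged.last_mut() {
            Some(last) if start <= last.1 + gap => last.1 = last.1.max(end),
            _ => merged.push((start, end)),
        }
    }
    for r in &mut merged {
        r.1 += MAX_BGZF_BLOCK;
    }
    merged
}

fn main() {
    let chunks = vec![(100, 200), (50_000, 60_000), (180, 400)];
    let merged = merge_ranges(chunks, 4096);
    // (100,200) and (180,400) overlap; (50_000,..) is too far away to merge.
    assert_eq!(
        merged,
        vec![(100, 400 + MAX_BGZF_BLOCK), (50_000, 60_000 + MAX_BGZF_BLOCK)]
    );
    println!("{merged:?}");
}
```

Each merged range then becomes one seek plus one `read_exact`, which is the whole point on a network filesystem.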
## Forkable readers
Rastair processes regions in parallel via rayon. Each worker thread needs its own BAM reader because file handles have mutable seek state, but nothing else needs to be thread-local. Re-reading and re-parsing the BAM index on every worker thread would waste both memory and NFS bandwidth for no reason.
Seqair’s reader splits into shared immutable state
and per-thread mutable state
(the file handle and the decompression buffers).
A prototype reader is opened once on the main thread;
each worker calls fork() to get a lightweight copy
that shares the index and header via Arc and owns a fresh File.
There is no lock contention and no re-parsing.
It’s a pattern that only makes sense for me
because the Rust type system can enforce that the shared state is genuinely read-only after construction.
If any of it were &mut-accessible (or I couldn’t prove that it wasn’t),
I’d need either heroic discipline or mutexes.
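The shape of that split can be sketched like this (illustrative names: a `String` stands in for the shared header and index, a `Vec<u8>` for the per-thread file handle and buffers):

```rust
use std::sync::Arc;

// Immutable after construction, shared by all forks.
struct Shared {
    header: String, // plus the parsed index in the real thing
}

// One per thread: shared state behind Arc, mutable I/O state owned.
struct Reader {
    shared: Arc<Shared>,
    buf: Vec<u8>, // stand-in for the File and decompression buffers
}

impl Reader {
    fn open(header: String) -> Self {
        Reader { shared: Arc::new(Shared { header }), buf: Vec::new() }
    }

    // fork() is cheap: bump one refcount, allocate fresh mutable state.
    fn fork(&self) -> Self {
        Reader { shared: Arc::clone(&self.shared), buf: Vec::new() }
    }
}

fn main() {
    let main_reader = Reader::open("@HD\tVN:1.6".to_string());
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let mut r = main_reader.fork();
            std::thread::spawn(move || {
                r.buf.extend_from_slice(b"scratch"); // each thread mutates only its own state
                r.shared.header.len() // and reads the shared header lock-free
            })
        })
        .collect();
    for h in handles {
        assert_eq!(h.join().unwrap(), 10);
    }
    println!("ok");
}
```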
This means the code example from above can just be rewritten like this:
```rust
use rayon::prelude::*;

segments.par_bridge().try_for_each(|segment| -> anyhow::Result<()> {
    let mut readers = readers.fork()?;
    let mut engine = readers.pileup(&segment)?;
    // ...
    Ok(())
})?;
```
Each rayon worker could also have the forked readers as thread-local so that the files don’t need to be opened/closed more than once.
## A note on CRAM and testing
Seqair also reads CRAM v3.0 and v3.1, a column-oriented, reference-based compression format that most large-scale sequencing projects now use. CRAM is a genuinely hairy format to decode: records are stored as differences against the reference, split across “data series” that each pick their own codec, and the codecs themselves include rANS 4x8, rANS Nx16, tok3, Huffman, plus bzip2 and LZMA for the easy cases.
I want to be honest about this part: the CRAM implementation was entirely written by Claude Code from the CRAM codecs spec with test vectors borrowed from noodles. I read the spec well enough to prompt intelligently and to review what came out, but I did not write a single rANS decoder by hand and I would not trust myself to spot a subtle bug in the arithmetic coder without another tool to compare against.
The tests seqair has for CRAM are cross-validation against htslib on a small number of files we happen to have plus a property test that round-trips small synthetic inputs through noodles. That is not a lot of coverage for a format this complex. In the Rastair pipeline we mostly read BAM, so the CRAM path is load-bearing for exactly one workflow where users pull public datasets stored in CRAM. It works for what we’ve thrown at it; I would not recommend anyone else use it for anything serious yet, and this is a feature I want to improve soon.
## Performance
Correctness is one thing,
but I also wanted to make sure it can compete with htslib’s performance.
I wrote some benchmarks using criterion. The microbenchmarks show seqair sits in the same ballpark as htslib:
| Benchmark | rust-htslib | seqair | noodles | Comment |
|---|---|---|---|---|
| `bam_record_decode` | 1.8ms | 2.4ms | 5.7ms | seqair decodes sequence |
| `fasta_fetch` | 313µs | 162µs | 140µs | Solid win for noodles! |
| `pileup_e2e` | 8.9ms | 7.8ms | 7.9ms | noodles test simplified |
| `vcf_1k` | 1.1ms | 0.7ms | 2.1ms | streaming to /dev/null |
| `bcf_10k` | 16.8ms | 6.7ms | 30.8ms | rust-htslib allocates a lot |
That said, microbenchmarks only go so far. The real test is profiling Rastair runs on representative inputs and optimizing. That’s what I did for the htslib path and it would surprise me if I get away with not doing it again here. I’ll get back to this topic after some more work has been done.
## What I got out of it
Was it worth it to spend this time on it? I’d say so! It was fun to get this working and to experience building it the way I did. Not as much hands-on programming as I’m used to, but the resulting code is not horrible. There’s a lot of compiler-enforced correctness15 as well as comparison tests that were fun to set up.
It works on the BAM files I’ve thrown at it. The initial integration produces zero diff lines against existing Rastair output, even on quite big files (after a small tweak to float precision).
I could show that my columnar storage design makes sense. Preliminary results show it’s at least as fast as htslib in Rastair (on my laptop), it does practically zero allocations while reading records (amortized), and the design is nice to work with. I added an example of a nice realignment API, and it fits right in.
The push-time filtering and per-record extras as opt-in extensions of the same slab design are nice. Both cost nothing when unused, they lean on the existing offsets rather than adding some parallel system, and they solve a real problem we have in Rastair.
I also got some “bonus features” in, like a nice VCF/BCF writer API that similarly does zero (amortized) allocations when writing records. I might write another post on this.
Next up: A whole lot of testing of this in Rastair with real-world files!
1. With our knowledge of how TAPS methylation looks, we think we can fix some alignment issues in regions with a lot of insertions or deletions. ↩︎
2. I forked `rust-htslib` and `hts-sys` to update the bindings and add some features. My PRs aren’t merged yet and there is more work I did that I could upstream… My time is limited. ↩︎
3. I’ll probably write about the LLM aspect of this separately, but I want to keep this post about the architecture and API. ↩︎
4. We have many reads that are short overlapping snippets of DNA. You can imagine this as a 2D grid with the position on the x axis and the reads as a stack going up (or down). A “pileup” is the column view, where each column shows all the bases for a position. In practice, there is also a lot of metadata. ↩︎
5. Yes, I also added this, after complaining about how inefficient the Rust wrapper was in my Rastair post. ↩︎
6. This is one of my favorite features in Rust that I don’t get to use so often. I wrote about it all the way back in 2016! ↩︎
7. Rastair splits everything up into smaller chunks for parallel processing. Seqair assumes this is the use case as well. ↩︎
8. “Compact Idiosyncratic Gapped Alignment Report”, a compact string describing how a read aligns to the reference. Something like `76M2D24M` means “76 matches, 2 deletions, 24 matches”. ↩︎
9. Five slices for the variable-length fields plus the `Record` itself, maybe even as an `Rc`. ↩︎
10. This looks and feels like arena allocation; all that’s missing is me calling it that. Using just `Vec`s is quite simple and they do the job. ↩︎
11. Bonus: Seqair decodes the 4-bit packed sequence to `[Base]` using SIMD at push time so that the pileup engine never has to unpack bytes again and all bases are of a known structure. ↩︎
12. I was also thinking of adding multiple filter “hooks” so that users can discard records earlier in the parsing process, but that wasn’t necessary yet. ↩︎
13. A `Vec<()>` is a zero-sized vector: it tracks a length without ever allocating, and iterates as a range yielding `()`. A nice quirk of the type system. ↩︎
14. Seek to a chunk, read the header (~18 bytes), read the compressed body (~20–40 KiB), decompress, repeat. ↩︎
15. This isn’t something you can benchmark, but it noticeably changed how much of the code I had to hold in my head at once. ↩︎