The Secret Life of Cows
A lot of people at RustFest Paris mentioned Cows
– which may be surprising if you’ve never seen std::borrow::Cow
!
Cow
in this context stands for “Clone on Write” and
is a type that allows you to reuse data if it is not modified.
Somehow, these bovine super powers of Rust’s standard library
appear to be a well-kept secret
even though they are not new.
This post will dig into this very useful pointer type by
explaining why in systems programming languages you need such fine control,
explain Cows in detail,
and compare them to other ways of organizing your data.
Contents
Organizing Data
This is what it all comes down to: People want to have a good, precise way to organize their data. And they want their programming language to support them. That’s why a lot of newer languages include a bunch of data structures optimized for different use cases, and that is also why software developers are dealing with API documentation so often. To ensure that your code has the performance characteristics you expect, it is essential to know which piece of data is represented in which way.
In systems programming languages, this is in some regards even more important. You want to know:
- exactly where your data lives,
- that it is efficiently stored,
- that it is removed as soon as you stop using it,
- and that you don’t copy it around needlessly.
Ensuring all these properties is a great way to write fast programs. Let’s look at how we can do this in Rust.
Where Does Our Data Live
It is quite explicit where your data lives.
By default, primitive types and structs containing primitive types are allocated on the stack,
without any dynamic memory allocation.
If you want to store data of a size only known at runtime
(say the text content of a file),
you need to use a type that dynamically allocates memory (on the heap),
for example String
, or Vec
.
You can explicitly allocate a data type on the heap by wrapping it in a Box
.
(If you’re unfamiliar with the notion of “Stack and Heap”, you can find a good explanation in this chapter of the official Rust book.)
Note: Creating a new (not-empty) String
means allocating memory,
which is a somewhat costly operation.
A language like Rust gives you quite a few options to
skip some allocations,
and doing so can speed up performance-critical parts of your code significantly.
(Spoiler: Cow
is one of these options.)
Structuring Data
If you know what you will do with your data,
you can probably figure out how to best store it.
If you for example always iterate through a known list of values, an array (or a Vec
) is the way to go.
If you need to look up values by known keys, and don’t care about the order they are stored in, a hash map sounds good.
If you need a stack to put data onto from different threads, you can use crossbeam-deque.
This is just to give you a few examples – there are books on this topic and you should read them.
A Cow
doesn’t really help you here per-se, but you can use it inside your data structures.
Dropping Data
Luckily, in Rust it is easy to make sure our data gets removed from memory as soon as possible (so we don’t use up too much memory and slow down the system). Rust uses the ownership model of automatically dropping resources when they go out of scope, so it doesn’t need to periodically run a garbage collector to free memory. You can still waste memory, of course, by allocating too much of it manually, or by building reference cycles and leaking it.
No Needless Copying
One important step towards being a responsible citizen in regard to memory usage is to not copy data more than necessary. If you for example have a function that removes whitespace at the beginning of a string, you could create a new string that just contains the characters after the leading whitespace. (Remember: A new string means a new memory allocation.) Or, you could return a slice of the original string, that starts after the leading whitespace. The second options requires that we keep the original data around, because our new slice is just referencing it internally. This means that instead of copying however many bytes your string contains, we just write two numbers: A pointer to the point in the original string after the leading whitespace, and the length of the remaining string that we care about. (Carrying the length with us is a convention in Rust.)
But what about a more complicated function?
Let’s imagine we want to replace some characters in a string.
Do we always need to copy it over with the characters swapped out?
Or can we be clever and return some pointer to the original string if there was no replacement needed?
Indeed, in Rust we can! This is what Cow
is all about.
What is a Cow Anyway
In Rust, the abbreviation “Cow” stands for “clone on write”1.
It is an enum with two states: Borrowed
and Owned
.
This means you can use it to abstract over
whether you own the data or just have a reference to it.
This is especially useful when you want to return a type
from a function that may or may not need to allocate.
A std Example
Let’s look at an example.
Say you have a Path
and want to convert it to a string.
Sadly, not every filesystem path is valid UTF-8
(Rust strings are guaranteed to be UTF-8 encoded).
Rust has a handy function to get a string regardless:
Path::to_string_lossy
.
When the path is valid UTF-8 already,
it will return a reference to the original data,
otherwise it will create a new string
where invalid characters are replaced with the �
character.
use std::borrow::Cow;
use std::path::Path;
let path = Path::new("foo.txt");
match path.to_string_lossy() {
Cow::Borrowed(_str_ref) => println!("path was valid UTF-8"),
Cow::Owned(_new_string) => println!("path was not valid UTF-8"),
}
A Beefy Definition
With that in mind, let’s look at the actual definition of Cow
:
enum Cow<'a, B: ToOwned + ?Sized + 'a> {
/// Borrowed data.
Borrowed(&'a B),
/// Owned data.
Owned(<B as ToOwned>::Owned),
}
As you can see, it takes some convincing to have Rust accept this type in a way we can work with it. Let’s go through it one by one.
'a
is the lifetime that we need our data to be valid for. For theOwned
case it’s not very interesting (to Cow own the data – it’s valid until the Cow goes out of scope), but in case the Cow containsBorrowed
data, this lifetime is a restriction set by the data we refer to. We cannot have a Cow that refers to already freed memory, and rustc will let us know when that is possible by mentioning that the Cow outlives its'a
.-
ToOwned
is a trait that defines a method to convert borrowed data into owned data (by cloning it and giving us ownership of the new allocation, most likely). The type we receive from this method is an associated type on the trait, and its name isOwned
(yep, the same name as the Cow variant, sorry). This allows us to refer to it inOwned(<B as ToOwned>::Owned)
.To make this a bit more concrete, let’s assume we have a Cow that’s storing a
&str
(in theBorrowed
case). TheToOwned
implementation ofstr
hastype Owned = String
, so<&str as ToOwned>::Owned == String
. -
?Sized
is a funny one. By default, Rust expects all types to be of a known size, which it expresses by having an implicit constraint on theSized
marker trait. You can explicitly opt-out of this by adding a “constraint” on?Sized
.The thing is: Not all possible types have a known size. For example,
[u8]
is an array of bytes somewhere in memory, but we don’t know its length. In your application code you won’t see a type like this directly, you’ll see it behind references instead. And note: In Rust, the reference itself can contain the length. (See what I wrote above about slices!)But how does that relate to Cow again? You see, the
B
in Cow’s definition is behind a reference: Once directly visible in theBorrowed
variant, and the second type hidden in theToOwned::Owned
(which is of typeBorrow<Self>
). Since a Cow should be able to contain a&[u8]
, its definition needs to work for&'a B
whereB = [u8]
. That in turn means need to say: “we don’t require this to beSized
, we know it’s behind a reference anyway” – which is exactly what the?Sized
syntax does.
Alright, so far so good!
Let me just point out one thing though:
If you want to store a &'input str
in a Cow (Using Cow::Borrowed(&'input str)
for example), what is the concrete type of the Cow?
(The generic one is Cow<'a, T>
.)
Right! Cow<'input, str>
!
The type definition for the Borrowed
variant contains &'a T
,
so our generic type is the type we refer to.
This also means that ToOwned
doesn’t need to be implemented for references,
but for concrete types, like str
and Path
.
Let me note something about that lifetime the Cow carries with it real quick:
If you want to replace the type of bar
in
struct Foo { bar: String }
with a Cow
,
you’ll have to specify the lifetime of the reference the Cow
can include:
struct Foo<'a> { bar: Cow<'a, str> }
.
This means that every time you now use Foo
that lifetime will be tracked,
and every time you take or return Foo
you might just need to annotate it.
One easy way around this is to use 'static'
:
You can omit the lifetime annotation on your struct,
but your Cow can only contain references to static memory.
This might sound less useful than a generic lifetime
– that’s because it is –
but in case of functions and types that either contain or return
new data or static defaults known at compile-time
it can be enough.
Cows in the Wild
Knowing Cows in theory is fine and dandy,
but the examples we’ve seen so far
only give you a small glance at when they can be used in practice.
Sadly, as it turns out, not many Rust APIs expose Cows.
Maybe, because they are seen as a thing you can introduce when you have a performance bottleneck,
or maybe it’s because people don’t want to add lifetime annotations to their struct
s
(and they don’t want to or can’t use Cow<'static, T>
).
Mixed Static and Dynamic Strings
One very cool use-case for Cows is when dealing with functions that either return static strings (i.e., strings you literally write in your source code) or dynamic strings that get put together at run time. The Programming Rust book by Jim Blandy and Jason Orendorff contains an example like this:
use std::borrow::Cow;
fn describe(error: &Error) -> Cow<'static, str> {
match *error {
Error::NotFound => "Error: Not found".into(),
Error::Custom(e) => format!("Error: {}", e).into(),
}
}
Small aside:
See how we are using the Into
trait here
to make constructing cows super concise?
Into
is the inverse of From
and is implemented for all types that implement From
.
So, the compiler knows that we want a Cow<'static, str>
,
and gave it a String
or a &'static str
.
Lucky for us,
impl<'a> From<&'a str> for Cow<'a, str>
and impl<'a> From<String> for Cow<'a, str>
are in the standard library,
so rustc can find and call these!
Why is this a very cool example? Reddit user 0x7CFE put it like this:
The most important thing is that since
Cow<'static, str>
derefs2 to&str
it may act as a drop-in replacement everywhere, where&str
is expected. So, if last error variant was added to an already existing code base, all would just work without any major refactoring.In other languages like C++ you’d probably have to decide, [either] to return allocating version like
std::string
everywhere or get rid of the details and suffer from poor ergonomics, where you’d need to use such [a] method. Even worse, error entry with extra details may be very rare and yet, you’d need to make everything allocate just to stick it all in.Rust provides a solution that is zero cost for cases where extra details are not needed. It’s a brilliant example of “pay only for what you use” principle in action.
Benchmarks
One example for improving program performance by using a Cow is this part of the Regex Redux micro-benchmark. The trick is to store a reference to the data at first and replace it with owned data during the loop’s iterations.
Serde
A great example for how you can use the super powers of Cows
in your own structs
to refer to input data instead of copying it over
is Serde’s #[serde(borrow)]
attribute.
If you have a struct like
#[derive(Debug, Deserialize)]
struct Foo<'input> {
bar: Cow<'input, str>,
}
Serde will by default fill that bar
Cow with an owned String
(playground).
If you however write it like
#[derive(Debug, Deserialize)]
struct Foo<'input> {
#[serde(borrow)]
bar: Cow<'input, str>,
}
Serde will try to create a borrowed version of the Cow (playground).
This will only work, however, when the input string doesn’t need to be adjusted.
So, for example,
when you deserialize a JSON string that has escaped quotes in it3
Serde will have to allocate a new string to store the unescaped representation,
and will thus give you a Cow::Owned
(playground).
Thanks to Robert Balicki, Alex Kitchens, and Matt Brubeck for reviewing this post!
And also thanks to Brad Gibson for asking about a better explanation on the ?Sized
business
– which took me less than two years to resolve!
-
Yes, that’s right: Clone on write, not copy on write. That’s because in Rust, the
Copy
trait is guaranteed to be a simplememcpy
operation, whileClone
can also do custom logic (like recursively clone aHashMap<String, String>
. ↩ -
Thanks to an implementation of the
Deref
trait, you can use a reference to aCow<'static, str>
in place of a&str
. That means, aCow<'static, str>
can be seen a reference to a string without having to convert it. ↩ -
"\"Escaped strings contain backslashes\", he said."
↩