Nine Rules for Creating Fast, Safe, and Compatible Data Structures in Rust (Part 1)

This year, I developed a new Rust crate called [range-set-blaze](https://crates.io/crates/range-set-blaze)
that implements the range set data structure. A range set is a useful (if obscure) data structure that stores sets of integers as sorted and disjoint ranges. For example, it stores these three ranges:
100..=2_393, 20_303..=30_239_000, 501_000_013..=501_000_016
rather than 30,220,996 individual integers. In addition to potential memory savings, range-set-blaze
provides efficient set operations such as union, intersection, complement, difference, and symmetric difference.
While creating range-set-blaze
, I learned nine rules that can help you create Data Structures in Rust. Beyond data structures, many of the rules can help you improve the performance and compatibility of any Rust code.
The rules are:
- Plagiarize your API, documentation, and even code – from the standard library.
- Design constructors for ease-of-use, compatibility, and speed.
- Create more Rust iterators than you expect.
- Make illegal values unrepresentable with traits.
- Define generic iterators with guaranteed properties and useful methods.
Covered in Part 2:
6. Define operators and fast operations.
7. Follow the "Nine Rules of Good API Design" especially "write good documentation".
8. Optimize performance using representative data, Criterion Benchmarking, and profiling.
9. Test coverage, docs, traits, compiler errors, and correctness.
Before looking at the first five rules, let's see when range-set-blaze
might be useful, how its set operations work, and how it compares to other range set crates.
Usefulness: Imagine running ten billion statistical experiments on an unreliable cluster. Each task on the cluster runs a few experiments. Each experiment produces one line of output with an experiment number. So, one task might put this into a file:

What data structure would you use to find which experiments are missing and need to be re-submitted? One option: store the outputted experiment numbers in a [BTreeSet](https://doc.rust-lang.org/std/collections/struct.BTreeSet.html)
and then do a linear scan for gaps.
A faster and more memory-efficient option: use a range set. Eight years ago, I created [IntRangeSet](https://fastlmm.github.io/PySnpTools/#util-intrangeset)
, a range set in Python to solve this problem. Now, I would use range-set-blaze
in Rust (example code).
Set Operations: Here is a simple example of the union operator (|
):

use range_set_blaze::RangeSetBlaze;
// a is the set of integers from 100 to 499 (inclusive) and 501 to 1000 (inclusive)
let a = RangeSetBlaze::from_iter([100..=499, 501..=999]);
// b is the set of integers -20 and the range 400 to 599 (inclusive)
let b = RangeSetBlaze::from_iter([-20..=-20, 400..=599]);
// c is the union of a and b, namely -20 and 100 to 999 (inclusive)
let c = a | b;
assert_eq!(c, RangeSetBlaze::from_iter([-20..=-20, 100..=999]));
Aside: See the project's
[README.md](https://github.com/CarlKCarlK/range-set-blaze)
for another set operator example from biology. That example uses theRangeSetBlaze
struct to find the intron regions of a gene from the transcription region and the exon regions.
Comparison with other range-related crates
Benefits: Although Rust's crates.io already contains several range set crates, I hoped my version could offer full set operations while being performant. Through optimizations both fowl and fair, I believe it achieves these goals (see the benchmark report). For example, it can intake individual integers 75 times faster than the most popular range set crate (because the other crate doesn't special-case individual intake – but it could easily add this optimization). On another benchmark, range-set-blaze
– using a hybrid algorithm – is 30% to 600% faster at unioning two sets.
Deficits: Compared to other range-related crates, range-set-blaze
has two important deficits. First, it works only with Rust integer types. Most other crates handle any element that can be sorted (dates, floats, IP addresses, strings, etc.). Second, it only offers sets. Many other crates also handle mappings. With interest (and perhaps help) these deficits might be addressed in the future.
Creating a data structure requires many decisions. Based on my experience with range-set-blaze
, here are the decisions I recommend. To avoid wishy-washiness, I'll express these recommendations as rules. Of course, every data structure is different, so not every rule will apply to every data structure.
This article covers rules 1 to 5. Part 2 covers rules 6 to 9.
Rule 1: Plagiarize the API, Documentation, and even Code – from the Standard Library
Find a similar data structure in the standard library and study its documentation line-by-line. I picked [BTreeSet](https://doc.rust-lang.org/std/collections/struct.BTreeSet.html)
as my model. It can store sets of integers in a cache-efficient balanced tree.
Aside: Later, in benchmarking (Rule 8), we'll see that
range_set_blaze::_RangeSetBlaze
can be 1000's of times faster thanBTreeSet
_ on some ‘clumpy' integer sets.
BTreeSet
offers 28 methods, for example, [clear](https://doc.rust-lang.org/std/collections/struct.BTreeSet.html#method.clear)
and [is_subset](https://doc.rust-lang.org/std/collections/struct.BTreeSet.html#method.is_subset)
. It also implements 18 traits, for example, [FromIterator
. Here is the BTreeSet
documentation for clear
and the RangeSetBlaze
documentation for clear
:

You can see that I mostly just copied. I changed "elements" to "integer elements" to remind users what RangeSetBlaze
supports. I removed where A: Clone
because all integers are necessarily cloneable. Notice that Rust documentation includes a "source" link. This makes copying easy.
Copying offers these advantages:
- It tells you what methods to provide. In other words, it gives you a starting point for your API (application Programming interface). This saves design time. Moreover, users will understand and expect these methods. You may even be able to make your data structure a drop-in replacement for a standard data structure.
- You get documentation text and documentation tests for almost free.
- You may even be able to copy code. For example, here is the code for
is_superset
forBTreeSet
and, then,RangeSetBlaze
:
Rust">#[must_use]
#[stable(feature = "rust1", since = "1.0.0")]
pub fn is_superset(&self, other: &BTreeSet) -> bool
where
T: Ord,
{
other.is_subset(self)
}
#[must_use]
pub fn is_superset(&self, other: &RangeSetBlaze) -> bool {
other.is_subset(self)
}
The BTreeSet
code reminded me that superset can be defined in terms of subset and that #[must use]
is a thing and appropriate here.
You may decide not to support everything in the standard data structure. For example, I skipped new_in
, an experimental feature. Likewise, the standard library supports maps (not just sets), any sortable element (not just integers), and Serde serialization. For me, these are possible future features.
You may also decide to support something differently. For example, BTreeSet::first
returns an Option<&T>
but RangeSetBlaze::first
returns an Option
. I know T
is a cheap-to-clone integer and so it doesn't need to be a reference.
Aside: Doesn't Rust have a generic
Set
trait, that tells what methods all sets should implement and even provides some default implementations (for example, _is_superset
in terms ofis_subset
_?) No, but it is being discussed.
You will likely also decide to support more methods than the standard data structure. For example, RangeSetBlaze::len
, like BTreeSet::len
, returns the number of elements in the set. However, RangeSetBlaze
also offers ranges_len
, which returns the number of sorted, disjoint ranges in the set.
Rule 2: Design Constructors for Ease-of-Use, Compatibility, and Speed
If it makes sense to have an empty version of your data structure, you'll want to define a new
method and a [Default::default](https://doc.rust-lang.org/std/default/trait.Default.html)
method.
Similarly, if it makes sense to fill your data structure from an iterator, you'll want to define [FromIterator::from_iter](https://doc.rust-lang.org/std/iter/trait.FromIterator.html)
methods. These automatically define collect
methods, too. Like BTreeSet
, RangeSetBlaze
accepts iterators of integers and references to integers. (Accepting references is important because many Rust iterators provide references.) Here is an example of from_iter
and collect
in use:
let a0 = RangeSetBlaze::from_iter([3, 2, 1, 100, 1]);
let a1: RangeSetBlaze = [3, 2, 1, 100, 1].into_iter().collect();
assert!(a0 == a1 && a0.to_string() == "1..=3, 100..=100");
RangeSetBlaze
also accepts iterators of inclusive ranges and references to such ranges. It places no constraints on the input ranges. They can be out-of-order, overlapping, empty, or repeated.
#[allow(clippy::reversed_empty_ranges)]
let a0 = RangeSetBlaze::from_iter([1..=2, 2..=2, -10..=-5, 1..=0]);
#[allow(clippy::reversed_empty_ranges)]
let a1: RangeSetBlaze = [1..=2, 2..=2, -10..=-5, 1..=0].into_iter().collect();
assert!(a0 == a1 && a0.to_string() == "-10..=-5, 1..=2");
Finally, consider defining additional From::from
methods. These automatically define an into
methods. For example, for compatibility with BTreeSet
, RangeSetBlaze
defines a From::from
method on arrays.
let a0 = RangeSetBlaze::from([3, 2, 1, 100, 1]);
let a1: RangeSetBlaze = [3, 2, 1, 100, 1].into();
assert!(a0 == a1 && a0.to_string() == "1..=3, 100..=100")
RangeSetBlaze
also defines from_sorted_disjoint/into_range_set_blaze
for iterators of ranges that are guaranteed to be sorted and disjoint. (We'll see in Rule 5, how we enforce this guarantee with special traits and the Rust compiler.)
let a0 = RangeSetBlaze::from_sorted_disjoint(CheckSortedDisjoint::from([-10..=-5, 1..=2]));
let a1: RangeSetBlaze = CheckSortedDisjoint::from([-10..=-5, 1..=2]).into_range_set_blaze();
assert!(a0 == a1 && a0.to_string() == "-10..=-5, 1..=2");
Aside: Why _
from_sorted_disjoint
/into_range_set_blaze
instead offrom_iter /collect
orfrom/into
_? See [this discussion](https://stackoverflow.com/questions/37347311/how-is-there-a-conflicting-implementation-of-from-when-using-a-generic-type) and this discussion.
For each of your constructors, think about possible speed ups and optimizations. RangeSetBlaze
implements this optimization in from_iter
:
- collects adjacent (possibly unsorted) integers/ranges into disjoint ranges, O(n₁)
- sort the disjoint ranges by their start, O(n₂ log n₂)
- merge adjacent ranges, O(n₂)
- create a
BTreeMap
from the now sorted & disjoint ranges, O(n₃ log n₃)
where _n_₁ is the number of input integers/ranges, n₂ is the number of disjoint & unsorted ranges, and _n_₃ is the final number of sorted & disjoint ranges.
What is the effect of ‘clumpy' integers? If n₂ ≈ sqrt(n₁), then construction is O(n₁). (Indeed, as long as n₂ ≤ n₁/ln(n₁), construction is O(n₁).) In benchmarks, this turns into a 700 time speed up over HashSet
and BTreeSet
on clumpy integer iterators.
Rule 3: Create More Rust Iterators than You Expect
How many different iterator types would you guess the standard BTreeSet
defines?
The answer is eight: Iter
, IntoIter
, DrainFilter
, Range
, Difference
, SymmetricDifference
, Intersection
, and Union
. Many non-Rust programming languages can turn any method into an iterator/generator, with a few "yield" statements. Rust, however, doesn't offer this (but it is under discussion). So, pretty much every method related to iteration will require you to define a new iterator struct type. These structs will, at a minimum, implement a next
method that returns either Some(
value)
or None
.
RangeSetBlaze
and its related types define 13 iterators structs. Let's look at two of them.
First, a user can call ranges
and iterate through the integers as a sequence of sorted, disjoint ranges. (Recall that RangeSetBlaze
accepts unsorted, overlapping ranges but stores sorted, disjoint ranges.)
use range_set_blaze::RangeSetBlaze;
let set = RangeSetBlaze::from_iter([30..=40, 15..=25, 10..=20]);
let mut ranges = set.ranges();
assert_eq!(ranges.next(), Some(10..=25));
assert_eq!(ranges.next(), Some(30..=40));
assert_eq!(ranges.next(), None);
Internally, RangeSetBlaze
uses a standard BTreeMap
to hold the range information. So, the RangeSetBlaze::ranges
method constructs a RangesIter
struct containing a BTreeMap::Iter
. We then have the RangesIter::next
method call BTreeMap::Iter
‘s next
method and translates the result into the desired type. Here is the code:
impl RangeSetBlaze {
pub fn ranges(&self) -> RangesIter<'_, T> {
RangesIter {
iter: self.btree_map.iter(),
}
}
}
#[derive(Clone, Debug)]
#[must_use = "iterators are lazy and do nothing unless consumed"]
pub struct RangesIter<'a, T: Integer> {
pub(crate) iter: btree_map::Iter<'a, T, T>,
}
impl<'a, T: Integer> Iterator for RangesIter<'a, T> {
type Item = RangeInclusive;
fn next(&mut self) -> Option {
self.iter.next().map(|(start, end)| *start..=*end)
}
fn size_hint(&self) -> (usize, Option) {
self.iter.size_hint()
}
}
Second, a user may want to call iter
and iterate through the integers one at a time, in sorted order. In that case, we'll return a struct called Iter
that contains a RangeIter
and then iterates the integers inside the ranges one at a time. Here is the original code for Iter::next
. It is followed by a discussion of points of interest.
impl Iterator for Iter
where
I: Iterator- > + SortedDisjoint,
{
type Item = T;
fn next(&mut self) -> Option
{
loop {
if let Some(range) = self.option_range.clone() {
let (start, end) = range.into_inner();
debug_assert!(start <= end && end <= T::safe_max_value());
if start < end {
self.option_range = Some(start + T::one()..=end);
} else {
self.option_range = None;
}
return Some(start);
} else if let Some(range) = self.iter.next() {
self.option_range = Some(range);
continue;
} else {
return None;
}
}
}
The SortedDisjoint
trait relates to guaranteeing that the inside iterator provides sorted, disjoint ranges. We'll discuss it in Rule 5.
The option_range
field holds the range (if any) from which we are currently returning integers. We use loop
and continue
to fill option_range
when it is empty. This loop never loops more than twice, so I could have used recursion. However, some of the other iterators recurse enough to cause a stack overflow. So, …
Tail recursion optimization is not guaranteed in Rust. My policy: Never use recursion in next
functions.
Aside: Thanks to Michael Roth, the current version of
Iter::next
is now shorter. His pull request is here.
Both BTreeSet
and RangeSetBlaze
define an into_iter
iterator method in addition to the iter
method. Similarly, RangeSetBlaze
defines an into_ranges
iterator method in addition to its ranges
method. These into
_whatever methods take ownership of the underlying RangeSetBlaze
which is sometimes useful.
Rule 4: Make Illegal Values Unrepresentable with Traits
I said that RangeSetBlaze
only works on integers, but what stops you applying it to characters like so?
use range_set_blaze::RangeSetBlaze;
fn _some_fn() {
let _char_set = RangeSetBlaze::from_iter(['a', 'b', 'c', 'd']);
}
The answer? The compiler stops you. It returns this error message:
let _char_set = RangeSetBlaze::from_iter(['a', 'b', 'c', 'd']);
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| the trait `Integer` is not implemented for `char`
|
= help: the following other types implement trait `Integer`:
i128
i16
i32
i64
i8
isize
u128
u16
and $N others
To make this happen, RangeSetBlaze
defines a trait it calls Integer
. Here is that definition (with all the super traits that I found that I needed):
pub trait Integer:
num_integer::Integer
+ FromStr
+ fmt::Display
+ fmt::Debug
+ std::iter::Sum
+ num_traits::NumAssignOps
+ FromStr
+ Copy
+ num_traits::Bounded
+ num_traits::NumCast
+ Send
+ Sync
+ OverflowingSub
+ SampleUniform
{
// associated type SafeLen definition not shown ...
fn safe_len(range: &RangeInclusive) -> ::SafeLen;
fn safe_max_value() -> Self { Self::max_value() }
fn f64_to_safe_len(f: f64) -> Self::SafeLen;
fn safe_len_to_f64(len: Self::SafeLen) -> f64;
fn add_len_less_one(a: Self, b: Self::SafeLen) -> Self;
fn sub_len_less_one(a: Self, b: Self::SafeLen) -> Self;
}
I, next, implemented trait Integer
on all the integer types of interest (u8
to u128
including usize
, i8
to i128
including isize)
. For example,
impl Integer for i32 {
#[cfg(target_pointer_width = "64")]
type SafeLen = usize;
fn safe_len(r: &RangeInclusive) -> ::SafeLen {
r.end().overflowing_sub(*r.start()).0 as u32 as ::SafeLen + 1
}
fn safe_len_to_f64(len: Self::SafeLen) -> f64 {len as f64}
fn f64_to_safe_len(f: f64) -> Self::SafeLen {f as Self::SafeLen}
fn add_len_less_one(a: Self, b: Self::SafeLen) -> Self {a + (b - 1) as Self}
fn sub_len_less_one(a: Self, b: Self::SafeLen) -> Self {a - (b - 1) as Self}
}
With this, I could make the code generic on
, as seen in the code sample in Rule 3.
Aside: Why doesn't Rust offer a single standard ‘integer' trait that does everything? Here is a discussion.
Rule 5: Define Generic Iterators with Guaranteed Properties and Useful Methods
RangeSetBlaze
‘s from_sorted_disjoint
constructor assumes an input of sorted and disjoint ranges. This lets RangeSetBlaze
avoid work. But what if the assumption is wrong? For example, what happens if we give it unsorted ranges, like so?
use range_set_blaze::RangeSetBlaze;
fn _some_fn() {
let not_guaranteed = [5..=6, 1..=3, 3..=4].into_iter();
let _range_set_int = RangeSetBlaze::from_sorted_disjoint(not_guaranteed);
}
As in Rule 4, the compiler catches the error and returns a useful message:
7 | let _range_set_int = RangeSetBlaze::from_sorted_disjoint(not_guaranteed);
| ----------------------------------- ^^^^^^^^^^^^^^
the trait `SortedDisjoint<_>` is not implemented for `std::array::IntoIter, 3>`
| |
| required by a bound introduced by this call
|
= help: the following other types implement trait `SortedDisjoint`:
CheckSortedDisjoint ...
To make this happen, RangeSetBlaze
defines trait SortedDisjoint
. Here is the relevant definition:
pub trait SortedStarts: Iterator- > {}
pub trait SortedDisjoint
: SortedStarts {
// methods not shown, yet
}
This says SortedDisjoint
is generic across integers and that every SortedDisjoint
must be a SortedStarts
. Moreover, all SortedStarts
are iterators of ranges of integers.
Aside: My project needs two new traits because I need to guarantee two different properties. A project that needs to guarantee only one property will need only one new trait.
So, what's the point? Why introduce new traits when we could just use Iterator
directly? As I learned from Rüdiger Klaehn's wonderful sorted-iter crate, we can use these new traits to enforce guarantees. For example, this constructor function uses a where
clause to accept only guaranteed (sorted and disjoint) iterators of integers:
impl RangeSetBlaze {
pub fn from_sorted_disjoint(iter: I) -> Self
where
I: SortedDisjoint,
{
// ... code omitted ...
}
}
And how do guaranteed iterators get the required SortedDisjoint
trait? They implement it! For example, we know the RangeSetBlaze::ranges
method returns a iterator RangesIter
of sorted and disjoint ranges, so we have RangesIter
implement the SortedDisjoint
trait like so:
impl SortedStarts for RangesIter<'_, T> {}
impl SortedDisjoint for RangesIter<'_, T> {}
That's it. We have just marked RangesIter
as being SortedDisjoint
. The compiler will do the rest.
Not guaranteed to guaranteed: I also marked an iterator called CheckSortedDisjoint
as being SortedDisjoint
. Interestingly, it iterates over a not guaranteed internal iterator. How can that be OK? Well, while it iterates, it also checks – panicking if it finds any unsorted or overlapping ranges. The result is a guaranteed iterator.
Sometimes guaranteed outside iterators: What about an iterator that is sometimes SortedDisjoint
and sometimes not? The popular [Itertools::tee](https://docs.rs/itertools/latest/itertools/trait.Itertools.html#method.tee)
method, for example, turns any iterator into two iterators with the same contents. Here is how we say that if its input iterator is SortedDisjoint
, its output iterators will be, too:
impl> SortedDisjoint for Tee {}
Defining methods: Almost as a bonus, we can define methods on the generic SortedDisjoint
iterator. For example, here we define complement
, a method that produces every range of sorted & disjoint integers not in the current iterator.
pub trait SortedDisjoint: SortedStarts {
fn complement(self) -> NotIter
where
Self: Sized,
{
NotIter::new(self)
}
}
And here is a use example from the complement
documentation:
use range_set_blaze::prelude::*;
let a = CheckSortedDisjoint::from([-10i16..=0, 1000..=2000]);
let complement = a.complement();
assert_eq!(complement.to_string(), "-32768..=-11, 1..=999, 2001..=32767");
The complement
method uses NotIter
, yet another iterator (see Rule 3). NotIter
also implements SortedDisjoint
. The example additionally uses to_string
, another SortedDisjoint
method.
To use complement
and to_string
, the user must either
use range_set_blaze::SortedDisjoint;
or use range_set_blaze::prelude::*;
.`The prelude works because the project's l
ib.rssays p
ub mod prelude;and p
relude.rsfile contains this p
ub usethat includes S
ortedDisjoint:`
pub use crate::{
intersection_dyn, union_dyn, CheckSortedDisjoint, DynSortedDisjoint, MultiwayRangeSetBlaze,
MultiwayRangeSetBlazeRef, MultiwaySortedDisjoint, RangeSetBlaze, SortedDisjoint,
};
Those are the first five rules for creating a data structure in Rust. See Part 2 for rules 6 to 9.
Aside: If you're interested in future articles, please follow me on Medium. I write on scientific programming in Rust and Python, machine learning, and statistics. I tend to write about one article per month.