Nine Rules for Accessing Cloud Files from Your Rust Code
Would you like your Rust program to seamlessly access data from files in the cloud? When I refer to "files in the cloud," I mean data housed on web servers or within cloud storage solutions such as AWS S3, Azure Blob Storage, or Google Cloud Storage. The term "read," here, covers both the sequential retrieval of file contents (text or binary, from beginning to end) and the ability to pinpoint and extract specific sections of the file as needed.
Upgrading your program to access cloud files can reduce annoyance and complication: the annoyance of downloading to local storage and the complication of periodically checking that a local copy is up to date.
Sadly, upgrading your program to access cloud files can also increase annoyance and complication: the annoyance of URLs and credential information, and the complication of asynchronous programming.
Bed-Reader is a Python package and Rust crate for reading PLINK Bed files, a binary format used in bioinformatics to store genotype (DNA) data. At a user's request, I recently updated Bed-Reader to optionally read data directly from cloud storage. Along the way, I learned nine rules that can help you add cloud-file support to your programs. The rules are:
- Use crate [object_store](https://crates.io/crates/object_store) (and, perhaps, [cloud-file](https://crates.io/crates/cloud-file)) to sequentially read the bytes of a cloud file.
- Sequentially read text lines from cloud files via two nested loops.
- Randomly access cloud files, even giant ones, with "range" methods, while respecting server-imposed limits.
- Use URL strings and option strings to access HTTP, local files, AWS S3, Azure, and Google Cloud.
- Test via [tokio](https://crates.io/crates/tokio)`::test` on http and local files.
If other programs call your program (in other words, if your program offers an API, an application programming interface), four additional rules apply:
- For maximum performance, add cloud-file support to your Rust library via an async API.
- Alternatively, for maximum convenience, add cloud-file support to your Rust library via a traditional ("synchronous") API.
- Follow the rules of good API design in part by using hidden lines in your doc tests.
- Include a runtime, but optionally.
Aside: To avoid wishy-washiness, I call these "rules", but they are, of course, just suggestions.
Rule 1: Use crate object_store (and, perhaps, cloud-file) to sequentially read the bytes of a cloud file.
The powerful [object_store](https://crates.io/crates/object_store) crate provides full content access to files stored on http, AWS S3, Azure, Google Cloud, and local files. It is part of the Apache Arrow project and has over 2.4 million downloads.
For this article, I also created a new crate called [cloud-file](https://crates.io/crates/cloud-file). It simplifies the use of the `object_store` crate by wrapping and focusing on a useful subset of its features. You can either use it directly or pull out its code for your own use.
Let's look at an example. We'll count the lines of a cloud file by counting the number of newline characters it contains.
```rust
use cloud_file::{CloudFile, CloudFileError};
use futures_util::StreamExt; // Enables `.next()` on streams.

async fn count_lines(cloud_file: &CloudFile) -> Result<usize, CloudFileError> {
    let mut chunks = cloud_file.stream_chunks().await?;
    let mut newline_count: usize = 0;
    while let Some(chunk) = chunks.next().await {
        let chunk = chunk?;
        newline_count += bytecount::count(&chunk, b'\n');
    }
    Ok(newline_count)
}

#[tokio::main]
async fn main() -> Result<(), CloudFileError> {
    let url = "https://raw.githubusercontent.com/fastlmm/bed-sample-files/main/toydata.5chrom.fam";
    let options = [("timeout", "10s")];
    let cloud_file = CloudFile::new_with_options(url, options)?;
    let line_count = count_lines(&cloud_file).await?;
    println!("line_count: {line_count}");
    Ok(())
}
```
When we run this code, it returns:
line_count: 500
Some points of interest:
- We use `async` (and, here, [tokio](https://docs.rs/tokio/latest/tokio/)). We'll discuss this choice more in Rules 6 and 7.
- We turn a URL string and string options into a `CloudFile` instance with `CloudFile::new_with_options(url, options)?`. We use `?` to catch malformed URLs.
- We create a stream of binary chunks with `cloud_file.stream_chunks().await?`. This is the first place that the code tries to access the cloud file. If the file doesn't exist or we can't open it, the `?` will return an error.
- We use `chunks.next().await` to retrieve the file's next binary chunk. (Note the `use futures_util::StreamExt;`.) The `next` method returns `None` after all chunks have been retrieved.
- What if there is a next chunk but also a problem retrieving it? We'll catch any problem with `let chunk = chunk?;`.
- Finally, we use the fast [bytecount](https://docs.rs/bytecount/latest/bytecount/) crate to count newline characters.
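As a quick variation on the same pattern, here is a sketch of mine (not from the `cloud-file` docs) that tallies total bytes as well as newlines in a single pass. It uses only the `CloudFile` methods already shown above.

```rust
use cloud_file::{CloudFile, CloudFileError};
use futures_util::StreamExt; // Enables `.next()` on streams.

// Count both bytes and newlines in one pass over the chunk stream.
async fn count_bytes_and_lines(
    cloud_file: &CloudFile,
) -> Result<(usize, usize), CloudFileError> {
    let mut chunks = cloud_file.stream_chunks().await?;
    let (mut byte_count, mut newline_count) = (0usize, 0usize);
    while let Some(chunk) = chunks.next().await {
        let chunk = chunk?;
        byte_count += chunk.len();
        newline_count += bytecount::count(&chunk, b'\n');
    }
    Ok((byte_count, newline_count))
}
```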
In contrast with this cloud solution, think about how you would write a simple line counter for a local file. You might write this:
```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader};

fn main() -> io::Result<()> {
    let path = "examples/line_counts_local.rs";
    let reader = BufReader::new(File::open(path)?);
    let mut line_count = 0;
    for line in reader.lines() {
        let _line = line?;
        line_count += 1;
    }
    println!("line_count: {line_count}");
    Ok(())
}
```
Between the cloud-file version and the local-file version, three differences stand out. First, we can easily read local files as text; by default, we read cloud files as binary (but see Rule 2). Second, by default, we read local files synchronously, blocking program execution until completion. Cloud files, on the other hand, are usually accessed asynchronously, allowing other parts of the program to continue running while waiting for the relatively slow network access to complete. Third, iterators such as `lines()` support `for` loops, but streams such as `stream_chunks()` do not, so we use `while let`.
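To make that last difference concrete, here is a tiny standalone sketch of mine (not from the article) contrasting the two loop forms; it assumes the `futures_util` and `tokio` crates already used above.

```rust
use futures_util::{stream, StreamExt}; // `stream::iter` and `.next()` on streams

#[tokio::main]
async fn main() {
    // An iterator: `for` calls `next()` for us behind the scenes.
    for x in [1, 2, 3] {
        println!("iterator item: {x}");
    }

    // A stream: there is no `for` support, so we poll it ourselves
    // with `.next().await` inside a `while let` loop.
    let mut items = stream::iter([1, 2, 3]);
    while let Some(x) = items.next().await {
        println!("stream item: {x}");
    }
}
```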
I mentioned earlier that you didn't need to use the `cloud-file` wrapper and that you could use the `object_store` crate directly. Let's see what it looks like when we count the newlines in a cloud file using only `object_store` methods:
```rust
use futures_util::StreamExt; // Enables `.next()` on streams.
pub use object_store::path::Path as StorePath;
use object_store::{parse_url_opts, ObjectStore};
use std::sync::Arc;
use url::Url;

async fn count_lines(
    object_store: &Arc<Box<dyn ObjectStore>>,
    store_path: StorePath,
) -> Result<usize, anyhow::Error> {
    let mut chunks = object_store.get(&store_path).await?.into_stream();
    let mut newline_count: usize = 0;
    while let Some(chunk) = chunks.next().await {
        let chunk = chunk?;
        newline_count += bytecount::count(&chunk, b'\n');
    }
    Ok(newline_count)
}

#[tokio::main]
async fn main() -> Result<(), anyhow::Error> {
    let url = "https://raw.githubusercontent.com/fastlmm/bed-sample-files/main/toydata.5chrom.fam";
    let options = [("timeout", "10s")];
    let url = Url::parse(url)?;
    let (object_store, store_path) = parse_url_opts(&url, options)?;
    let object_store = Arc::new(object_store); // Enables cloning and borrowing.
    let line_count = count_lines(&object_store, store_path).await?;
    println!("line_count: {line_count}");
    Ok(())
}
```
You'll see the code is very similar to the `cloud-file` code. The differences are:
- Instead of one `CloudFile` input, most methods take two inputs: an `ObjectStore` and a `StorePath`. Because `ObjectStore` is a non-cloneable trait, here the `count_lines` function specifically takes `&Arc<Box<dyn ObjectStore>>`. Alternatively, we could make the function generic and use `&Arc<impl ObjectStore>` (see the sketch after this list).
- Creating the `ObjectStore` instance, the `StorePath` instance, and the stream requires a few extra steps compared to creating a `CloudFile` instance and a stream.
- Instead of dealing with one error type (namely, `CloudFileError`), multiple error types are possible, so we fall back to using the [anyhow](https://crates.io/crates/anyhow) crate.
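Here is a minimal sketch of that generic alternative (my reconstruction, not code from the `object_store` docs): the function accepts any concrete `ObjectStore` implementation behind an `Arc`, rather than a boxed trait object.

```rust
use futures_util::StreamExt; // Enables `.next()` on streams.
use object_store::path::Path as StorePath;
use object_store::ObjectStore;
use std::sync::Arc;

// Generic over the concrete store type, so no `Box<dyn ObjectStore>` is needed.
async fn count_lines_generic(
    object_store: &Arc<impl ObjectStore>,
    store_path: StorePath,
) -> Result<usize, anyhow::Error> {
    let mut chunks = object_store.get(&store_path).await?.into_stream();
    let mut newline_count: usize = 0;
    while let Some(chunk) = chunks.next().await {
        newline_count += bytecount::count(&chunk?, b'\n');
    }
    Ok(newline_count)
}
```

Calling it requires an `Arc` around a concrete store type, for example one created with one of the `object_store` builders.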
Whether you use `object_store` (with 2.4 million downloads) directly or indirectly via `cloud-file` (currently, with 124 downloads