It is Fast with Rust.

Sudipta Joardar
Jun 23, 2023

Johannes Köster, once known as a ‘full-time Python guy’ and the author of the popular workflow manager Snakemake, ran into a problem that demanded a level of computational performance Python could no longer deliver.

The tool he turned to is the still relatively little-known programming language Rust. Rust combines the performance of C++ with friendlier syntax and a focus on code safety, which simplifies development. Parts of Mozilla’s Firefox, for instance, are written in Rust, and in 2019 the code-sharing site GitHub named Rust the second-fastest-growing language on its platform.

Python, R and Matlab are the popular languages for tackling biological problems. They are good for data exploration, but they are not a good fit when speed matters. Libraries written in Rust are now used in bioinformatics (Köster’s Rust-Bio), geosciences (the GeoRust project) and mathematics (nalgebra). Undoubtedly, Rust is fast. According to Heng Li, a bioinformatician at the Dana-Farber Cancer Institute in Boston, Rust is the best choice. “The beauty of Rust is, it makes the task of debugging very easy, because memory management is much, much better,” says Avi Srivastava, a postdoctoral researcher at the New York Genome Center.

The community of Rustaceans (Rust programmers) is excellent, offering careful documentation and online support. An online reference known as the Book and a ‘Cookbook’ of solutions to common problems are also available. With a single tool, Cargo, one can compile Rust code, run tests, auto-generate documentation, upload a package to a repository and more. Rust plug-ins are available for development environments such as Microsoft’s Visual Studio Code and JetBrains’ IntelliJ, and an online Rust ‘playground’ lets you try code in the browser.

Bioinformatics libraries have been published for many programming languages, e.g. SeqAn for C++, as well as Biopython, BioPerl and BioRuby. Low-level systems programming languages such as C and C++ deliver optimized performance at the cost of a higher degree of complexity. Higher-level languages such as Python or Perl, on the other hand, provide a more concise syntax but give up speed. A language that combines high-level syntax with carefully engineered implementations of a bioinformatics library is therefore a promising choice for computational puzzles.
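To make the Cargo workflow mentioned above a little more concrete, here is a minimal sketch of a library file in a Cargo project. The function `gc_content` is a hypothetical example written for this article, not part of any existing crate. The `///` comments are what `cargo doc` would turn into documentation, and the `tests` module is what `cargo test` would run.

```rust
/// Fraction of G and C bases in an ASCII DNA sequence (case-insensitive).
/// Returns 0.0 for an empty sequence.
/// NOTE: hypothetical helper for illustration only.
pub fn gc_content(seq: &[u8]) -> f64 {
    if seq.is_empty() {
        return 0.0;
    }
    // Count bytes that are 'G' or 'C' after upper-casing.
    let gc = seq
        .iter()
        .filter(|b| matches!(b.to_ascii_uppercase(), b'G' | b'C'))
        .count();
    gc as f64 / seq.len() as f64
}

#[cfg(test)]
mod tests {
    use super::*;

    // Picked up automatically by `cargo test`.
    #[test]
    fn half_of_acgt_is_gc() {
        assert_eq!(gc_content(b"ACGT"), 0.5);
    }
}
```

Placed in `src/lib.rs` of a project created with `cargo new --lib`, this compiles with `cargo build`, is tested with `cargo test` and is documented with `cargo doc`.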

Rust-Bio is mainly concerned with algorithms and data structures for biological sequences, which it represents as ASCII-encoded vectors or slices of bytes. Take Varlociraptor, for instance: created by Köster, it compares millions of sequence reads against billions of genetic bases to identify genomic variants. A central part of Rust-Bio is its alphabets, which make it possible to check in linear time whether a given sequence is a word over the alphabet. They can also transform symbols into their lexicographical ranks, and those ranks can be bit-encoded to save memory or to iterate over q-grams. Rust-Bio can read and write common file formats such as FASTA, FASTQ and BED. For SAM/BAM, CRAM and VCF/BCF support, it is complemented by Rust-HTSlib. For a fuller picture, see the example in reference [1] below, which shows how to build a simple read mapper with Rust-Bio.
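The alphabet ideas described above can be sketched with the Rust standard library alone. What follows is a minimal illustration under stated assumptions, not Rust-Bio's actual API: the `Alphabet` type and its `is_word`, `rank` and `qgrams` methods are hypothetical names chosen for this sketch. It uses a 256-entry lookup table so that the membership check is a single linear pass, and it packs the ranks of each q-gram into one integer.

```rust
/// A set of byte symbols with their lexicographical ranks.
/// NOTE: hypothetical type for illustration; not the Rust-Bio API.
pub struct Alphabet {
    ranks: [Option<u8>; 256], // rank of each byte, or None if not in the alphabet
    size: usize,              // number of distinct symbols
}

impl Alphabet {
    /// Build an alphabet from the given symbols (duplicates are ignored).
    pub fn new(symbols: &[u8]) -> Self {
        let mut sorted = symbols.to_vec();
        sorted.sort_unstable();
        sorted.dedup();
        let mut ranks = [None; 256];
        for (r, &s) in sorted.iter().enumerate() {
            ranks[s as usize] = Some(r as u8);
        }
        Alphabet { ranks, size: sorted.len() }
    }

    /// True if every byte of `text` is in the alphabet (one linear pass).
    pub fn is_word(&self, text: &[u8]) -> bool {
        text.iter().all(|&b| self.ranks[b as usize].is_some())
    }

    /// Lexicographical rank of `symbol`, or None if it is not in the alphabet.
    pub fn rank(&self, symbol: u8) -> Option<u8> {
        self.ranks[symbol as usize]
    }

    /// Bits needed to store one rank (at least 1).
    fn bits_per_symbol(&self) -> u32 {
        let max_rank = self.size.saturating_sub(1);
        (usize::BITS - max_rank.leading_zeros()).max(1)
    }

    /// Encode every overlapping q-gram of `text` as one integer built from
    /// the bit-packed ranks of its symbols. Returns None if q is zero, if a
    /// q-gram would not fit into 64 bits, or if `text` is not a word over
    /// the alphabet within any q-gram.
    pub fn qgrams(&self, q: usize, text: &[u8]) -> Option<Vec<u64>> {
        let bits = self.bits_per_symbol();
        if q == 0 || q > (64 / bits) as usize {
            return None;
        }
        text.windows(q)
            .map(|window| {
                window.iter().try_fold(0u64, |acc, &b| {
                    self.rank(b).map(|r| (acc << bits) | u64::from(r))
                })
            })
            .collect()
    }
}
```

With `Alphabet::new(b"ACGT")`, the call `is_word(b"GATTACA")` returns `true`, the ranks run from `A` = 0 to `T` = 3, and each base costs two bits, so `qgrams(2, b"ACGT")` yields `Some(vec![1, 6, 11])`. This sketch recomputes each window from scratch for clarity; a rolling update would avoid that repeated work.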

Selected References:

[1]. Köster, J., 2016. Rust-Bio: a fast and safe bioinformatics library. Bioinformatics, 32(3), pp.444–446.

[2]. Perkel, J.M., 2020. Why scientists are turning to Rust. Nature, 588, pp.185–186.

Connect with me at https://www.biopryx.com/ and linkedin.com/in/sudipta-joardar-3a578b124.
