Recursion Releases Open-Source Data from Largest Ever Dataset of Biological Images, Inviting Data Science Community to Develop New and Improved Machine Learning Algorithms for the Life Sciences Industry

Clinical-stage biotech offering unique access to cell biology images
with the goal of driving more effective artificial intelligence methods
in drug discovery and development

SALT LAKE CITY–(BUSINESS WIRE)–#AI–Recursion,
a Fast Company “Most Innovative Company” and leader in the artificial
intelligence for drug discovery movement, today announced it will
open-source a glimpse of the massive biological dataset the company has
been building for more than five years. At more than two petabytes, and
across more than 10 million different biological contexts, Recursion’s
data is the world’s largest image-based dataset designed specifically
for the development of machine learning algorithms in experimental
biology and drug discovery.

The announcement was made at the global machine learning conference ICLR
and will be accompanied by a competition available through the
NeurIPS 2019 Competition Track and co-sponsored by NVIDIA and Google
Cloud. The goal of the competition is to inspire the development of
effective machine learning methods that can identify representations of
biology from the complex experimental dataset, called RxRx1.

“To answer fundamental questions facing biology and disease, and
reimagine the drug discovery paradigm, we’re building the world’s
largest, relatable, empirical biological dataset,” said Chris Gibson,
Ph.D., CEO, Recursion. “The RxRx1 dataset we’re announcing today
represents an important resource for the machine learning community,
with more than 100,000 images and 300-plus gigabytes of data
representing diverse biological contexts. Yet despite the massive scale
of this dataset, it represents just 0.4 percent of what we generate at
Recursion on a weekly basis. We expect that the richness of this
dataset, combined with the context surrounding the scale of our efforts,
will inspire the world’s machine learning and AI community to help us in
our mission to decode biology to radically improve lives.”

Added Gibson, “If we are successful in our collective efforts, not only
will new treatments make it to market faster, but more companies will be
incentivized to develop new drugs for smaller markets, such as rare
diseases, where many patients still face a major unmet need.”

The RxRx1 dataset is composed of images of human cells from more than
1,000 experimental conditions with dozens of biological replicates
produced weeks and months apart in a variety of human cell types. These
data were generated at multiple Recursion sites under the highly
controlled experimental procedures characteristic of Recursion’s
process. However, each batch of experimental data contains unique
experimental variations, giving data scientists a rich proving ground to
experiment with methods to tackle the noise inherent in even the most
well-run empirical studies.

Experimental complexity and variability are major challenges in the
application of machine learning to biological datasets, particularly in
drug discovery. While machine learning approaches have the potential to
accelerate drug discovery, fundamental challenges remain in combating
the complexity and variability in biological datasets and in ensuring
that algorithms are tuned to fundamental biology rather than to
experimental heterogeneity in the data.

“This dataset provides a great playground for those working in multiple
areas of machine learning research, such as domain adaptation and k-shot
learning,” said Berton Earnshaw, Vice President of Data Science,
Recursion. “Developing methods to account for the non-random
experimental noise is something that should be of interest to those
beyond just the life science community.”

New methods – including those derived from the NeurIPS competition –
that effectively control for experimental heterogeneity in machine
learning datasets will revolutionize large-scale biological data
analysis, and lead to greatly improved drug discovery applications.

“Advances in machine learning methods outside of the life sciences have
already been accelerated through the availability of large-scale public
datasets, such as ImageNet and COCO, among many others,” said Mason
Victors, Chief Technology Officer and Chief Product Officer, Recursion.
“Like these initiatives, we aim to create resources that will enable the
community to collectively identify and adopt new machine learning
methods that benefit the entire life sciences industry. We are excited
to provide the data science community with the first
longitudinally-generated, human cell biology image dataset to facilitate
new machine learning applications. Best of luck to those in the
competition; we’re rooting for you.”

For more information on Recursion please visit
To see if you are eligible to participate in the NeurIPS 2019
Competition as part of ICLR, please visit

About Recursion

Recursion is a clinical-stage biotechnology company combining
experimental biology and automation with artificial intelligence in a
massively parallel system to efficiently discover potential drugs for
diverse indications, including genetic disease, inflammation,
immunology, and infectious disease. Recursion applies causative
perturbations to human cells to generate disease models and associated
biological image data. Recursion’s rich, relatable database of more than
two petabytes of biological images generated in-house on the company’s
robotics platform enables advanced machine learning approaches to reveal
drug candidates, mechanisms of action, and potential toxicity, with the
eventual goal of decoding biology and advancing new therapeutics to
radically improve lives. Recursion is headquartered in Salt Lake City
and in 2019 was designated a Fast Company “Most Innovative Company.”
Learn more at,
or connect on Twitter,
and LinkedIn.


Jessica Yingling, Ph.D.
President, Little Dog Communications Inc.
