single-cell


WormBase has developed two tools for exploring published C. elegans single cell RNA sequencing (scRNAseq) data: scdefg for interactive differential expression on integrated datasets and wormcells-viz for visualization of gene expression. These tools have been deployed at WormBase with public C. elegans datasets and will continue to be updated as new datasets are published. Source code is available at github.com/WormBase/scdefg and github.com/WormBase/wormcells-viz, together with instructions on how to deploy these tools with any scRNAseq dataset.

For a detailed overview, see the Single cell tools for WormBase preprint (July 2021).

For additional discussion see this 45 min talk from May 2021: [talk, slides].

Integrated Differential Expression: scdefg.textpressolab.com

Three datasets (CeNGEN, Packer 2019, Ben-David 2021) have been integrated and can be compared with differential expression. More information about each dataset is at the bottom of this page. You can also visualize gene expression on the annotated cell types of each dataset using the links below.

Visualize CeNGEN L4 neuron dataset: cengen.textpressolab.com

Visualize Packer 2019 embryogenesis dataset: packer2019.textpressolab.com

Visualize Ben-David 2021 L2 larvae dataset: bendavid2021.textpressolab.com

About the apps

The scdefg app is written in Python using Flask, and provides a single web page with an interface for selecting two groups of cells according to the existing annotations in the data. For example, the user can select a group according to a combination of cell type, sample, tissue and experimental group. Results are displayed as an interactive volcano plot (log fold change vs p-value) and an MA plot (log fold change vs mean expression), both of which show gene descriptions on mouseover, and as two sortable tables of p-values and log fold changes listing enriched and depleted genes. The tabular results can be downloaded in CSV and Excel format or copied to the clipboard. The app is launched from the command line by specifying the path to a trained scVI model, and the user may specify the data annotations by which the groups can be stratified (e.g. cell type, experiment). Differential expression is performed on the fly and can be done in reasonable time without using GPUs. We have deployed the app on a cloud instance with only 8 GB RAM and 2 vCPUs and observed that this configuration is sufficient for handling a few concurrent users, with results returned in about 15 s.
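To illustrate how the two plots and the two tables relate to the same per-gene results, here is a minimal sketch. It is not the scdefg implementation: the function names, the input layout (`mean1`, `mean2`, `p_value`) and the numbers are hypothetical, and it assumes strictly positive mean expression values.

```python
import math


def de_plot_points(results):
    """Convert per-gene results into coordinates for a volcano and an MA plot.

    `results` maps a gene name to a dict with the hypothetical keys
    'mean1' and 'mean2' (mean expression in each selected group, > 0)
    and 'p_value'.
    """
    points = []
    for gene, r in results.items():
        # Log fold change of group 1 over group 2 (y axis of the MA plot,
        # x axis of the volcano plot).
        lfc = math.log2(r["mean1"] / r["mean2"])
        # Clamp the p-value so that log10 is defined when it is reported as 0.
        neg_log10_p = -math.log10(max(r["p_value"], 1e-300))
        points.append({
            "gene": gene,
            "lfc": lfc,
            "neg_log10_p": neg_log10_p,  # volcano plot y axis
            "mean_expression": (r["mean1"] + r["mean2"]) / 2,  # MA plot x axis
        })
    return points


def split_enriched_depleted(points):
    """Split the points into two tables, each sorted by strongest effect first."""
    enriched = sorted((p for p in points if p["lfc"] > 0),
                      key=lambda p: p["lfc"], reverse=True)
    depleted = sorted((p for p in points if p["lfc"] < 0),
                      key=lambda p: p["lfc"])
    return enriched, depleted


if __name__ == "__main__":
    toy = {
        "gene-a": {"mean1": 8.0, "mean2": 2.0, "p_value": 0.001},
        "gene-b": {"mean1": 1.0, "mean2": 4.0, "p_value": 0.5},
    }
    enriched, depleted = split_enriched_depleted(de_plot_points(toy))
    print("enriched:", [p["gene"] for p in enriched])
    print("depleted:", [p["gene"] for p in depleted])
```

In the deployed app the per-gene values come from the trained scVI model rather than from raw group means; the sketch only shows how one result set feeds both plots and both tables.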

The wormcells-viz app is written in JavaScript and Python and uses React.js and D3.js to provide interactive and responsive visualizations of heatmaps, gene expression histograms and swarm plots (see below). Deploying the app requires having the pre-computed gene expression values stored in three custom anndata files, as described in the wormcells-viz repository. The following visualizations are currently implemented.

Heatmap

Visualization of scVI inferred expression rates for a selection of cell types and genes. The expression rates can be shown as either a traditional heatmap, or as a monochrome dotplot.

Gene expression histogram

Histograms of the scVI inferred expression rates for a given gene across all cell types in the data. The histogram bin counts are computed from the scVI inferred expression rates for each cell.
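The binning step can be sketched as follows. This is an illustration only, not the wormcells-viz code: the function names, the number of bins, the value range and the per-cell rates are all hypothetical toy choices.

```python
def histogram_counts(rates, n_bins, lo, hi):
    """Count how many values fall into each of n_bins equal-width bins on [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for rate in rates:
        if rate < lo or rate > hi:
            continue  # ignore values outside the plotted range
        # A value equal to `hi` belongs to the last bin.
        index = min(int((rate - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts


def histograms_by_cell_type(cells, n_bins=4, lo=0.0, hi=1.0):
    """Build one histogram per cell type for a single gene.

    `cells` is a list of (cell_type, expression_rate) pairs, one per cell.
    """
    rates_by_type = {}
    for cell_type, rate in cells:
        rates_by_type.setdefault(cell_type, []).append(rate)
    return {
        cell_type: histogram_counts(rates, n_bins, lo, hi)
        for cell_type, rates in rates_by_type.items()
    }


if __name__ == "__main__":
    # Toy per-cell values for one gene; the rates are made up for illustration.
    toy_cells = [("ADL", 0.1), ("ADL", 0.3), ("ADL", 0.9),
                 ("AFD", 1.0), ("AFD", 0.55)]
    for cell_type, counts in histograms_by_cell_type(toy_cells).items():
        print(cell_type, counts)
```

Pre-computing the counts this way means the browser only needs one small vector per gene and cell type instead of one value per cell.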

Swarm plot

For a given cell type, swarm plots visualize the relative expression of a set of genes across all cells annotated in a dataset. These plots are useful for identifying candidate marker genes.

The Y axis displays the set of selected genes, and the X axis displays the log fold change in gene expression between the cell type of interest and all other cell types. This is computed by doing pairwise differential expression of each annotated cell type vs the cell type of interest.

Y axis: the set of selected genes, evenly spaced.
X axis: the log fold change in expression of each gene in every other cell type, relative to the cell type of interest (the reference).
0 means baseline expression, i.e. the same as in the reference cell type.
Below 0 means lower expression in that cell type relative to the reference.
Above 0 means higher expression in that cell type relative to the reference.
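The axis convention above can be sketched in a few lines. As a simplifying assumption, the sketch takes the log2 ratio of mean expression values, whereas the app computes pairwise differential expression from the scVI model; the function names, gene names and numbers are hypothetical.

```python
import math


def swarm_points(mean_expr, reference, genes):
    """Compute the X positions of a swarm plot.

    `mean_expr` maps cell type -> {gene: mean expression (> 0)}.
    For each gene, returns {other_cell_type: log2(other / reference)},
    so 0 is the baseline set by the reference cell type.
    """
    rows = {}
    for gene in genes:
        ref_value = mean_expr[reference][gene]
        rows[gene] = {
            cell_type: math.log2(values[gene] / ref_value)
            for cell_type, values in mean_expr.items()
            if cell_type != reference
        }
    return rows


def candidate_markers(rows):
    """Genes for which every other cell type lies below 0, i.e. the gene is
    expressed more highly in the reference cell type than anywhere else."""
    return [gene for gene, row in rows.items()
            if all(lfc < 0 for lfc in row.values())]


if __name__ == "__main__":
    toy = {
        "ADL": {"gene-1": 4.0, "gene-2": 1.0},
        "AFD": {"gene-1": 1.0, "gene-2": 2.0},
        "ASH": {"gene-1": 2.0, "gene-2": 0.5},
    }
    rows = swarm_points(toy, reference="ADL", genes=["gene-1", "gene-2"])
    print(rows)
    print("candidate markers:", candidate_markers(rows))
```

A gene whose points all sit to the left of 0 is a candidate marker for the reference cell type, which is why these plots are convenient for marker screening.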

A Colab tutorial on how to make swarm plots is available here

How WormBase processes single cell RNA data: scvi-tools

There are currently hundreds of software tools and pipelines developed for scRNAseq data (see https://www.scrna-tools.org). For processing single cell data at WormBase we have chosen to use the scvi-tools.org framework. scvi-tools differs from most other scRNAseq tools in that it uses variational autoencoders to learn the distribution underlying the input data and create a generative model. Interested readers can learn more about the framework in the scvi-tools documentation. Here we briefly highlight a few considerations that led to our choice of this framework for driving scRNAseq analysis.

Scalability: Using a GPU, scvi-tools can scale to datasets with millions of cells. These large models can be trained in about an hour.
Consistent Development and Contributors: The scvi-tools codebase https://github.com/YosefLab/scvi-tools was first introduced in 2017, and published in 2018 by Lopez et al. It currently boasts 35 unique contributors and 52 releases.
Extensible Framework for Analysis: Because the generative model of the data (formally, a hierarchical Bayesian network) can be modified to reflect our assumptions about underlying processes, the framework can be extended to model other aspects of scRNAseq data. Currently, extensions include cell type classification and label transfer across batches, modelling single cell protein measurements, performing gene imputation in spatial data, and using a linear decoder to allow for interpretation of the learned latent space. Several peer-reviewed articles have been published describing these extensions (see https://scvi-tools.org/press).

WormBase deployment philosophy for single cell tools

At the moment, the majority of scRNAseq data is generated using the 10X Genomics Chromium technology, with v2 and v3 chemistry. This is also true for C. elegans scRNAseq data. For the time being, WormBase will focus its scRNAseq tool development on 10X Genomics data. Two considerations drive this:

Data integration of different batches with scvi-tools is more robust when there is more data, and when the technology and biological system of each batch is the same or similar. Attempting to integrate a small number of cells from unique technologies and unique biological systems can make it impossible to discern biological differences from technical artifacts.
The 10X Genomics data has a widely used, validated and commercially supported data pre-processing workflow, from FASTQ files to gene count matrices. This can enable WormBase to uniformly reprocess the FASTQ files in a single pipeline in the future.

List of all C. elegans single cell datasets in anndata format

The anndata format (extension .h5ad) was published in 2018 as a generic class for handling annotated data matrices, with a focus on scRNA-seq data and Python support for machine learning, and with integration with the SCANPY analysis framework. Anndata is an efficient storage format because it uses HDF5 compression, and it has become the standard format for manipulating scRNAseq data in Python. It is also supported in R (see also zellkonverter).

Owing to the advantages of anndata and its popularity, WormBase adopted a convention for structuring published C. elegans scRNAseq data into anndata files with standard field names, to streamline their reuse in code pipelines. The guidelines used when wrangling data into the WormBase anndata convention are described in the supplemental tables and maintained at github.com/WormBase/anndata-wrangling.

Here we provide a curated collection of all high-throughput C. elegans scRNAseq data wrangled into the WormBase anndata standard fields. For completeness, we also list other low-throughput single cell datasets that were not wrangled.


Short Name	Total cells	Method	h5ad	Summary	Article/preprint	Original Data	Notes
Taylor 2020	100,955	10x v2/v3	Download at Caltech Data	L4 larvae neurons selected via flow cytometry	Molecular topography of an entire nervous system.	GSE136049	CeNGEN website Shiny R app to explore the data
Ben-David 2021	55,508	10x v2	Download at Caltech Data	L2 larvae	Whole-organism mapping of the genetics of gene expression at cellular resolution bioRxiv 2020.	PRJNA658829	Gene count matrix was kindly provided by the authors on request
Packer 2019	89,701	10x v2	Download at Caltech Data	Several timepoints of embryo development	A lineage-resolved molecular atlas of C. elegans embryogenesis at single-cell resolution Science 2019.	GSE126954	VisCello app for data exploration
Cao 2017	35,987	sci-RNA-seq	Download at Caltech Data	L2 larvae	Comprehensive single-cell transcriptional profiling of a multicellular organism Science 2017.	GSE98561 and GSM4318946 (reprocessed)	GSM4318946 release was a reannotation of the data
Tintori 2016	216	SMARTer kit	Not wrangled	Embryo through the 16-cell stage	A Transcriptional Lineage of the Early C. elegans Embryo Dev Cell 2016.	GSE77944	They made a custom visualizer at tintori.bio.unc.edu.
Hashimshony 2012	96	CEL-Seq	Not wrangled	Blastomere cells	CEL-Seq: single-cell RNA-Seq by multiplexed linear amplification Cell Rep. 2012	SRP014672	This was one of the pioneering works in scRNAseq and introduced the CEL-Seq technique.