WormBase has developed two tools for exploring published C. elegans single cell RNA sequencing (scRNAseq) data: scdefg for interactive differential expression on integrated datasets and wormcells-viz for visualization of gene expression. These tools have been deployed at WormBase with public C. elegans datasets and will continue to be updated as new datasets are published. Source code is available at github.com/WormBase/scdefg and github.com/WormBase/wormcells-viz, together with instructions on how to deploy these tools with any scRNAseq dataset.
For a detailed overview, see the Single cell tools for WormBase preprint (July 2021).
For additional discussion see this 45 min talk from May 2021: [talk, slides].
Three datasets (CeNGEN, Packer 2019, Ben-David 2021) have been integrated and can be compared with differential expression. More information about each dataset is at the bottom of this page. You can also visualize gene expression on the annotated cell types of each dataset using the links below.
The scdefg app is written in Python using Flask, and provides a single web page with an interface for selecting two groups of cells according to the existing annotations in the data. For example, the user can select a group according to a combination of cell type, sample, tissue and experimental group. Results are displayed as an interactive volcano plot (log fold change vs p-value) and an MA plot (log fold change vs mean expression) that show gene descriptions on mouseover, and as two sortable tables of the p-values and log fold changes of expression levels, one for enriched and one for depleted genes. The tabular results can be downloaded in CSV and Excel format or copied to the clipboard. The app is launched from the command line by specifying the path to a trained scVI model, and the user may specify the data annotations by which the groups can be stratified (e.g. cell type, experiment). Differential expression is performed on the fly and can be done in reasonable time without using GPUs. We have deployed the app on a cloud instance with only 8 GB RAM and 2 vCPUs and observed that this configuration is sufficient for handling a few concurrent users, with results returned in about 15 seconds.
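The group selection and MA plot coordinates described above can be illustrated with a minimal sketch. This is not the scdefg implementation: scdefg runs differential expression through a trained scVI model, whereas the sketch below uses plain group means on made-up counts. The cell records, gene names, the `select` and `ma_point` helpers, and the pseudocount are all hypothetical.

```python
import math

# Hypothetical per-cell records: annotations plus counts for two made-up genes.
cells = [
    {"cell_type": "neuron", "experiment": "A", "counts": {"gene1": 8, "gene2": 0}},
    {"cell_type": "neuron", "experiment": "A", "counts": {"gene1": 6, "gene2": 2}},
    {"cell_type": "neuron", "experiment": "B", "counts": {"gene1": 100, "gene2": 100}},
    {"cell_type": "muscle", "experiment": "A", "counts": {"gene1": 1, "gene2": 3}},
    {"cell_type": "muscle", "experiment": "A", "counts": {"gene1": 1, "gene2": 3}},
]


def select(cells, **criteria):
    """Return the cells whose annotations match every criterion.

    Each criterion maps an annotation name to a set of allowed values,
    so a group is defined by a combination of annotations.
    """
    return [c for c in cells if all(c[key] in allowed for key, allowed in criteria.items())]


def mean_expression(group, gene):
    """Average count of one gene over a group of cells."""
    return sum(c["counts"][gene] for c in group) / len(group)


def ma_point(group1, group2, gene, pseudocount=1.0):
    """Return (log2 fold change, mean expression) for one gene.

    Naive stand-in for model-based differential expression: the fold change
    is a ratio of group means, with a pseudocount to avoid division by zero.
    """
    m1 = mean_expression(group1, gene)
    m2 = mean_expression(group2, gene)
    lfc = math.log2((m1 + pseudocount) / (m2 + pseudocount))
    return lfc, (m1 + m2) / 2


group1 = select(cells, cell_type={"neuron"}, experiment={"A"})
group2 = select(cells, cell_type={"muscle"}, experiment={"A"})
for gene in ("gene1", "gene2"):
    print(gene, ma_point(group1, group2, gene))
```

In this toy data, the neuron cell from experiment B is excluded from both groups because each group is defined by the combination of cell type and experiment.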
The wormcells-viz app is written in JavaScript and Python and uses React.js and D3.js to provide interactive and responsive visualizations of heatmaps, gene expression histograms and swarm plots (see below). Deploying the app requires the pre-computed gene expression values to be stored in three custom anndata files, as described in the wormcells-viz repository. The following visualizations are currently implemented.
Visualization of scVI inferred expression rates for a selection of cell types and genes. The expression rates can be shown as either a traditional heatmap, or as a monochrome dotplot.
Histograms of the scVI inferred expression rates for a given gene across all cell types in the data. The histogram bin counts are computed from the scVI inferred expression rates for each cell.
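As a rough illustration of how per-cell expression rates can be turned into histogram bin counts, here is a minimal sketch. The log10-spaced bins, the bin range, and the `histogram_counts` name are assumptions made for this example; the binning actually used by wormcells-viz is described in its repository.

```python
import math


def histogram_counts(rates, n_bins=4, log_min=-4.0, log_max=0.0):
    """Count cells per bin of log10 expression rate.

    rates: positive per-cell expression rates for one gene in one cell type.
    Bins are equally spaced between log_min and log_max on a log10 scale
    (an assumption of this sketch); values outside the range are placed in
    the first or last bin.
    """
    counts = [0] * n_bins
    width = (log_max - log_min) / n_bins
    for rate in rates:
        index = int((math.log10(rate) - log_min) / width)
        index = min(max(index, 0), n_bins - 1)  # clamp out-of-range values
        counts[index] += 1
    return counts


# Hypothetical rates for one gene in the cells of one cell type.
example_rates = [0.0005, 0.005, 0.006, 0.05, 0.5, 0.5]
print(histogram_counts(example_rates))
```

Repeating this for every cell type gives one row of bin counts per cell type, which is the shape of data a per-gene histogram view needs.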
For a given cell type, swarm plots visualize the relative expression of a set of genes across all cells annotated in a dataset. These plots are useful for identifying candidate marker genes.
The Y axis displays the set of selected genes, and the X axis displays the log fold change in gene expression between the cell type of interest and all other cell types. This is computed by doing pairwise differential expression of each annotated cell type vs the cell type of interest.
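The layout of the swarm plot data can be sketched as follows. This is a simplified illustration, not the wormcells-viz code: it takes the log2 ratio of hypothetical mean expression values rather than running model-based differential expression, and the sign convention (cell type of interest over the other cell type) is an assumption. The `swarm_points` helper, gene names and values are made up.

```python
import math

# Hypothetical mean expression per cell type and gene.
mean_expr = {
    "neuron": {"geneA": 8.0, "geneB": 1.0},
    "muscle": {"geneA": 2.0, "geneB": 1.0},
    "gut": {"geneA": 1.0, "geneB": 4.0},
}


def swarm_points(mean_expr, cell_type, genes):
    """For each gene, return one log2 fold change per other cell type.

    Each value compares the cell type of interest against one other cell
    type, so every gene (a row on the Y axis) gets a swarm of points
    along the X axis.
    """
    target = mean_expr[cell_type]
    return {
        gene: {
            other: math.log2(target[gene] / profile[gene])
            for other, profile in mean_expr.items()
            if other != cell_type
        }
        for gene in genes
    }


points = swarm_points(mean_expr, "neuron", ["geneA", "geneB"])
for gene, swarm in points.items():
    print(gene, swarm)
```

Under this convention, a gene whose points all sit far to the right is higher in the cell type of interest than in every other cell type, which is what makes it a candidate marker.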
A Colab tutorial on how to make swarm plots is available here
There are currently hundreds of software tools and pipelines developed for scRNAseq data (see https://www.scrna-tools.org). For processing single cell data at WormBase we have chosen to use the scvi-tools.org framework. scvi-tools differs from most other scRNAseq tools in that it uses variational autoencoders to learn the distribution underlying the input data and create a generative model. Interested readers can learn more about the framework in the scvi-tools documentation. Here we briefly highlight a few considerations that led to our choice of this framework for driving scRNAseq analysis.
Scalability: Using a GPU, scvi-tools can scale to datasets with millions of cells. These large models can be trained in about an hour.
Consistent Development and Contributors: The scvi-tools codebase https://github.com/YosefLab/scvi-tools was first introduced in 2017, and published in 2018 by Lopez et al. It currently boasts 35 unique contributors and 52 releases.
Extensible Framework for Analysis: Because the generative model of the data (formally, a hierarchical Bayesian network) can be modified to reflect our assumptions about underlying processes, the framework can be extended to model other aspects of scRNAseq data. Currently, extensions include cell type classification and label transfer across batches, modelling single cell protein measurements, performing gene imputation in spatial data, and using a linear decoder to allow for interpretation of the learned latent space. Several peer reviewed articles have been published describing these extensions (see https://scvi-tools.org/press).
At the moment, the majority of scRNAseq data is generated using the 10X Genomics Chromium technology, with v2 and v3 chemistry. This is also true for C. elegans scRNAseq data. For the time being WormBase will focus its scRNAseq tool development on 10X Genomics data. Two considerations drive this:
Data integration of different batches with scvi-tools is more robust when there is more data, and when the technology and biological system of each batch are the same or similar. Attempting to integrate a small number of cells from unique technologies and unique biological systems can make it impossible to discern biological differences from technical artifacts.
The 10X Genomics data has a widely used, validated and commercially supported data pre-processing workflow, from FASTQ files to gene count matrices. This can enable WormBase to uniformly reprocess the FASTQ files in a single pipeline in the future.
The anndata format (extension .h5ad) was published in 2018 as a generic class for handling annotated data matrices, with a focus on scRNA-seq data and Python support for machine learning, and with integration with the SCANPY analysis framework. Anndata is an efficient storage format because it uses HDF5 compression. It has become the standard format for manipulating scRNAseq data in Python, and is also supported in R (see also zellkonverter).
Owing to the advantages of anndata and its popularity, WormBase adopted a convention for structuring published C. elegans scRNAseq data into anndata files with standard field names, to streamline their reuse in code pipelines. The guidelines used when wrangling data into the WormBase anndata convention are described in the supplemental tables and maintained at github.com/WormBase/anndata-wrangling.
Here we provide a curated collection of all C. elegans high throughput scRNAseq data wrangled into WormBase anndata standard fields. For completeness, we also list other low throughput single cell datasets that were not wrangled.
Short Name | Total cells | Method | h5ad | Summary | Article/preprint | Original Data | Notes |
---|---|---|---|---|---|---|---|
Taylor 2020 | 100,955 | 10x v2/v3 | Download at Caltech Data | L4 larvae neurons selected via flow cytometry | Molecular topography of an entire nervous system. | GSE136049 | CeNGEN website Shiny R app to explore the data |
Ben-David 2021 | 55,508 | 10x v2 | Download at Caltech Data | L2 larvae | Whole-organism mapping of the genetics of gene expression at cellular resolution biorxiv 2020. | PRJNA658829 | Gene count matrix was kindly provided by the authors on request |
Packer 2019 | 89,701 | 10x v2 | Download at Caltech Data | Several timepoints of embryo development | A lineage-resolved molecular atlas of C. elegans embryogenesis at single-cell resolution Science 2019. | GSE126954 | VisCello app for data exploration |
Cao 2017 | 35,987 | sci-RNA-seq | Download at Caltech Data | L2 larvae | Comprehensive single-cell transcriptional profiling of a multicellular organism Science 2017. | GSE98561 and GSM4318946 (reprocessed) | GSM4318946 release was a reannotation of the data |
Tintori 2016 | 216 | SMARTer kit | Not wrangled | Embryo through the 16-cell stage | A Transcriptional Lineage of the Early C. elegans Embryo Dev Cell 2016. | GSE77944 | They made a custom visualizer at tintori.bio.unc.edu. |
Hashimshony 2012 | 96 | CEL-Seq | Not wrangled | Blastomere cells | CEL-Seq: single-cell RNA-Seq by multiplexed linear amplification Cell Rep. 2012 | SRP014672 | This was one of the pioneering works in scRNAseq and introduced the CEL-Seq technique. |