Seurat to Anndata conversions

Many single-cell analysis packages are written in R, and the standard data formats of commonly used packages such as Seurat and SingleCellExperiment bundle count data together with metadata in a single object. When moving the data over to Python, we can preserve this structure using the Anndata format.

Seurat formatting (as of Seurat v4.04)

In particular, there is an excellent and well-maintained R package, SeuratDisk, for doing the conversion from Seurat format to Anndata.

However, the Seurat object itself is a complicated data structure, so the conversion needs to be done carefully to get the correct information ported over. The best documentation I’ve found of the parts of the Seurat object is within their wiki.

The Seurat object contains multiple ‘assays’ (by default, only the ‘RNA’ assay), which are meant to store different types of data. Within each assay there are ‘slots’, which contain the raw data and/or different transformations of that assay’s data. Note that the ‘assays’ and ‘slots’ can be overwritten by the user, so their content is not necessarily fixed. Also, certain data transformations on the original Seurat object will auto-populate other slots without prompting. In summary,

  • When initialized in R, the raw data is usually stored in the ‘counts’ slot of the default assay, which is by default called ‘RNA’.
  • ‘RNA’ is the active assay when working with this object in R, unless the user sets it manually using the command DefaultAssay.
  • When transformations are run in R on a Seurat object with only ‘counts’ populated, Seurat automatically adds the transformed data to other slots. For example, the most common normalization operations are LogNormalize, which populates the ‘data’ slot of the active assay, and sctransform, which populates ‘scale.data’ (in the example below, inside a new ‘SCT’ assay that then becomes the active assay).

Converting between Anndata and Seurat

Recently the wonderful SeuratDisk package was released. It provides the h5Seurat format, which acts as an intermediary between Anndata and Seurat objects. The documentation is quite complete, but there are some subtle points.

As an example, I’ll use the pbmc_small dataset to show what happens, following their SeuratDisk conversion manual.

Note that conversion from Seurat to Anndata using SeuratDisk populates the fields as follows (as documented at the bottom of their manual):

  • The anndata .X slot will be filled with the ‘scale.data’ slot corresponding to the active assay in R, and if the latter is not present it will be filled with the ‘data’ slot.
  • ‘var will be filled with meta.features only for the features present in X; for example, if X is filled with scale.data, then var will contain only features that have been scaled’. This is why the Seurat object will sometimes appear to change size when converted.
  • ‘raw.X will be filled with data if X is filled with scale.data; otherwise, it will be filled with counts. If counts is not present, then raw will not be filled’. This refers to the separate ‘raw’ slot of the Anndata object.
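The filling rules quoted above can be written out as a small decision function. This is only a sketch of the documented behaviour, not SeuratDisk's actual code: the function name `pick_anndata_sources` is made up for illustration, and the handling of cases the quoted rules do not mention (no ‘data’ slot at all) is my own assumption.

```python
from typing import Optional, Set, Tuple


def pick_anndata_sources(slots_present: Set[str]) -> Tuple[Optional[str], Optional[str]]:
    """Return (slot used for X, slot used for raw.X) for the active assay.

    Hypothetical helper that mirrors the rules quoted from the manual.
    """
    if "scale.data" in slots_present:
        # X takes scale.data, so raw takes data.
        # Assumption: if 'data' is missing, raw is left unfilled.
        raw_source = "data" if "data" in slots_present else None
        return "scale.data", raw_source
    if "data" in slots_present:
        # X falls back to data, so raw takes counts when they exist.
        raw_source = "counts" if "counts" in slots_present else None
        return "data", raw_source
    # Assumption: the quoted rules do not cover this case.
    return None, None


# Active assay with all three slots, as in an SCT assay
print(pick_anndata_sources({"counts", "data", "scale.data"}))  # ('scale.data', 'data')
# Only counts and normalized data present
print(pick_anndata_sources({"counts", "data"}))  # ('data', 'counts')
# No counts: raw is not filled
print(pick_anndata_sources({"data"}))  # ('data', None)
```

The first case is the one that shrinks the object: X and var only cover the scaled features, while the fuller normalized matrix ends up in raw.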

# R session: load the packages and record the Seurat version
library(Seurat)
packageVersion("Seurat")
library(SeuratDisk)
library(SeuratWrappers)

data("pbmc_small")
pbmc_small
# An object of class Seurat
# 230 features across 80 samples within 1 assay
# Active assay: RNA (230 features, 20 variable features)
#  2 dimensional reductions calculated: pca, tsne

# NormalizeData fills the 'data' slot of the active assay 'RNA'
pbmc_small <- NormalizeData(object = pbmc_small)
# SCTransform adds a new 'SCT' assay and makes it the active assay
pbmc_small <- SCTransform(pbmc_small, verbose = FALSE)
pbmc_small
# An object of class Seurat
# 450 features across 80 samples within 2 assays
# Active assay: SCT (220 features, 220 variable features)
#  1 other assay present: RNA
#  2 dimensional reductions calculated: pca, tsne
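# 'file' must be a path ending in .h5Seurat; the name here is a placeholder
file <- "pbmc_small.h5Seurat"
SaveH5Seurat(pbmc_small, filename = file, overwrite = TRUE)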
# Creating h5Seurat file for version 3.1.5.9900
# Adding counts for RNA
# Adding data for RNA
# Adding scale.data for RNA
# Adding variable features for RNA
# Adding feature-level metadata for RNA
# Adding counts for SCT
# Adding data for SCT
# Adding scale.data for SCT
# Adding variable features for SCT
# No feature-level metadata found for SCT
# Writing out SCTModel.list for SCT
# Adding cell embeddings for pca
# Adding loadings for pca
# Adding projected loadings for pca
# Adding standard deviations for pca
# Adding JackStraw information for pca
# Adding cell embeddings for tsne
# No loadings for tsne
# No projected loadings for tsne
# No standard deviations for tsne
# No JackStraw data for tsne
Convert(file, dest = "h5ad")
# Validating h5Seurat file
# Adding scale.data from SCT as X
# Adding data from SCT as raw
# Transfering meta.data to obs
# Adding dimensional reduction information for tsne (global)

Here is the same object after conversion to Anndata, viewed in Python. Since ‘SCT’ is the active assay, it is the one carried over to the Anndata object.


import numpy as np
import pandas as pd
import anndata as ad
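# placeholder path: point this at the .h5ad file written by Convert() above
file = "pbmc_small.h5ad"
pyfile = ad.read_h5ad(file)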
pyfile
# AnnData object with n_obs × n_vars = 80 × 220
#     obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'RNA_snn_res.0.8', 'letter.idents', 'groups', 'RNA_snn_res.1', 'nCount_SCT', 'nFeature_SCT'
#     var: '_index', 'features'
#     obsm: 'X_tsne'
pyfile.raw.shape
# (80, 220): this is the same shape as the SCT assay (220 features) instead of the RNA assay (230 features)

Other differences between Anndata and Seurat
  • Similar to numpy arrays, AnnData objects can either hold actual data or reference another AnnData object. In the latter case, they are referred to as a “view”.
  • When going from Anndata to Seurat, column names in the ‘.obs’ slot need to be free of certain symbols such as dashes (in my experience), or the conversion to h5Seurat will not work. This issue is documented here.
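One way to guard against the column-name problem is to clean the ‘.obs’ column names before writing the file. The helper below is a minimal sketch: the name `sanitize_obs_columns` is hypothetical, and the set of characters it keeps (letters, digits, underscores and dots) is my assumption, since I only know from experience that dashes cause trouble.

```python
import re
from typing import Iterable, List


def sanitize_obs_columns(columns: Iterable[str]) -> List[str]:
    """Replace characters outside [0-9A-Za-z_.] with '_' and keep names unique.

    Hypothetical helper; the allowed character set is an assumption.
    """
    cleaned: List[str] = []
    seen = set()
    for name in columns:
        base = re.sub(r"[^0-9A-Za-z_.]", "_", str(name))
        candidate = base
        suffix = 1
        # Two different names can collapse to the same cleaned name,
        # so append a counter until the result is unique.
        while candidate in seen:
            candidate = f"{base}_{suffix}"
            suffix += 1
        seen.add(candidate)
        cleaned.append(candidate)
    return cleaned


print(sanitize_obs_columns(["cell-type", "RNA_snn_res.0.8", "cell type"]))
# ['cell_type', 'RNA_snn_res.0.8', 'cell_type_1']
```

With an AnnData object called `adata`, the cleaned names could be assigned back with `adata.obs.columns = sanitize_obs_columns(adata.obs.columns)` before saving.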
Written on October 10, 2021