About ENIGMA

ENIGMA's general strategy for deconvolving bulk gene expression to infer sample-level cell type-specific expression (CSE) borrows an idea from recommender systems. We treat the task as a matrix completion problem in which the bulk gene expression matrix is represented as a linear combination of several cell type-specific expression matrices. The coefficients of this linear combination are the cell type fractions (θ_i), which can be estimated with a cell type fraction estimation algorithm or measured experimentally through cell sorting. Our goal is to minimize the distance between the observed bulk gene expression matrix and the bulk expression matrix reconstituted as a linear combination of the inferred CSE.
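As a rough illustration of this reconstruction objective, the following Python sketch rebuilds a bulk matrix from per-cell-type CSE matrices and per-sample fractions, then measures the squared distance to the observed bulk. The array layout (cell types × genes × samples), the function names, and the squared Frobenius distance are assumptions made for this example; they are not taken from the ENIGMA code.

```python
import numpy as np

def reconstruct_bulk(cse, theta):
    """Reconstitute bulk expression from cell type-specific expression.

    cse   : array (k, genes, samples), one CSE matrix per cell type (assumed layout)
    theta : array (samples, k), cell type fractions per sample
    """
    # For every gene g and sample s: sum_i theta[s, i] * cse[i, g, s]
    return np.einsum("kgs,sk->gs", cse, theta)

def reconstruction_loss(bulk, cse, theta):
    """Squared Frobenius distance between observed and reconstituted bulk."""
    residual = bulk - reconstruct_bulk(cse, theta)
    return float(np.sum(residual ** 2))

# Toy data: 3 cell types, 5 genes, 4 samples (all values are synthetic).
rng = np.random.default_rng(0)
k, n_genes, n_samples = 3, 5, 4
cse = rng.random((k, n_genes, n_samples))
theta = rng.dirichlet(np.ones(k), size=n_samples)  # each row sums to 1

bulk = reconstruct_bulk(cse, theta)                # noiseless bulk for the demo
print(reconstruction_loss(bulk, cse, theta))       # loss at the generating CSE
print(reconstruction_loss(bulk, cse + 0.1, theta)) # loss after perturbing the CSE
```

In the real problem the CSE array is the unknown: many different CSE arrays reproduce the same bulk matrix, which is why additional constraints are needed.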

In addition to minimizing the main objective function, we add two constraint terms to make the inference more robust. First, we require each aggregated CSE to be as close as possible to the corresponding reference, which can also be regarded as prior expression information derived from the reference matrix. Second, we use a regularization term to constrain the rank of the inferred CSE. In matrix completion problems, the low-rank property of the inferred signal is ubiquitous in real-world applications. In addition, previous analysis of the spectrum of bulk DNA methylation profiles suggests that the mixed profile has latent variation (eigenvectors) associated with the variation in cell type fractions among samples. Assuming a relatively low rank for the CSE is therefore consistent with existing knowledge.
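The two constraint terms can be sketched in Python as follows. This is a minimal sketch under stated assumptions: "aggregated CSE" is taken to mean the per-gene average over samples, and the rank constraint is expressed through the nuclear norm (sum of singular values), a common convex surrogate for rank. The function names and array layout are hypothetical, and the exact penalties used by ENIGMA may differ.

```python
import numpy as np

def reference_penalty(cse, reference):
    """Squared distance between sample-averaged CSE and the reference profiles.

    cse       : array (k, genes, samples)  (assumed layout)
    reference : array (genes, k), one reference profile per cell type
    """
    aggregated = cse.mean(axis=2).T  # (genes, k): average each CSE over samples
    return float(np.sum((aggregated - reference) ** 2))

def nuclear_norm_penalty(cse):
    """Sum of nuclear norms of the per-cell-type CSE matrices.

    The nuclear norm is used here as an assumed stand-in for a rank penalty.
    """
    return float(sum(np.linalg.norm(x, ord="nuc") for x in cse))

# Toy example: 2 cell types, 3 genes, 4 samples (synthetic values).
reference = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
# CSE that is constant across samples and equal to the reference:
# it matches the reference exactly and each matrix has rank 1.
cse = np.repeat(reference.T[:, :, None], 4, axis=2)

print(reference_penalty(cse, reference))
print(nuclear_norm_penalty(cse))
```

In a full objective these two terms would be added to the reconstruction loss with tuning weights, which are left out here because the source does not specify them.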

ENIGMA has three main steps. First, ENIGMA requires a cell type reference expression matrix (signature matrix), which can be derived from either FACS RNA-seq or scRNA-seq datasets by averaging the expression of each gene within each cell type. Previous studies have widely used reference matrices curated from different platforms; for instance, Newman et al. used the LM22 immune signature matrix, which was derived from a microarray platform, to deconvolute bulk RNA-seq datasets. However, using references from different platforms can introduce unwanted batch effects between the reference and the bulk RNA-seq matrix, especially when the reference is derived from a low-coverage scRNA-seq dataset. To overcome this challenge, we use a previously presented method specifically designed to correct batch effects between the bulk RNA-seq matrix and the reference matrix. Second, ENIGMA applies a robust linear regression model to estimate the fraction of each cell type in each sample, based on the reference matrix from the first step. Third, using the reference matrix and the cell type fraction matrix from steps 1 and 2, ENIGMA applies a constrained matrix completion algorithm to deconvolute the bulk RNA-seq matrix into sample-level CSE. To limit model complexity and prevent overfitting, we propose two different norm penalty functions to regularize the resulting CSE. Finally, the returned CSE can be used to identify cell type-specific differentially expressed genes (DEGs), visualize each gene’s expression pattern on a cell type-specific manifold (e.g. t-SNE, UMAP), and build cell type-specific co-expression networks to identify modules relevant to phenotypes of interest.
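The first two steps can be sketched in Python as follows. This is an illustrative sketch, not the ENIGMA implementation: the reference is built by plain per-cell-type averaging, and the robust regression is a Huber-weighted iteratively reweighted least squares fit written with NumPy. The function names, the Huber tuning constant, and the final clipping and renormalisation of the coefficients are all assumptions made for this example. Batch effect correction is omitted.

```python
import numpy as np

def build_reference(expr, labels):
    """Average expression of each gene within each cell type.

    expr   : array (genes, cells), e.g. from scRNA-seq
    labels : cell type label for each cell
    Returns the reference matrix (genes, k) and the ordered cell type names.
    """
    labels = np.asarray(labels)
    cell_types = sorted(set(labels.tolist()))
    ref = np.column_stack([expr[:, labels == ct].mean(axis=1) for ct in cell_types])
    return ref, cell_types

def huber_fractions(bulk_sample, reference, delta=1.345, n_iter=50):
    """Estimate cell type fractions for one bulk sample with Huber-weighted IRLS."""
    coef = np.linalg.lstsq(reference, bulk_sample, rcond=None)[0]  # OLS start
    for _ in range(n_iter):
        resid = bulk_sample - reference @ coef
        # Robust residual scale from the median absolute deviation
        scale = np.median(np.abs(resid - np.median(resid))) / 0.6745 + 1e-12
        u = np.abs(resid) / scale
        # Huber weights: 1 for small residuals, down-weighted for large ones
        w = np.where(u <= delta, 1.0, delta / np.maximum(u, 1e-12))
        sw = np.sqrt(w)
        coef = np.linalg.lstsq(reference * sw[:, None], bulk_sample * sw, rcond=None)[0]
    # Assumption: clip negatives and renormalise so the fractions sum to 1
    coef = np.clip(coef, 0.0, None)
    total = coef.sum()
    return coef / total if total > 0 else coef

# Step 1: toy single-cell matrix with 2 genes and 4 cells of types A and B.
expr = np.array([[1.0, 3.0, 10.0, 20.0],
                 [2.0, 4.0, 0.0, 0.0]])
ref_toy, names = build_reference(expr, ["A", "A", "B", "B"])
print(names, ref_toy)

# Step 2: synthetic reference (20 genes, 3 cell types) and one bulk sample
# with a single grossly corrupted gene.
rng = np.random.default_rng(1)
reference = rng.random((20, 3))
true_theta = np.array([0.2, 0.5, 0.3])
bulk_sample = reference @ true_theta
bulk_sample[0] += 100.0  # outlier gene
print(huber_fractions(bulk_sample, reference))
```

Running `huber_fractions` once per column of the bulk matrix yields the cell type fraction matrix that the third step takes as input.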

See bioRxiv for a detailed exposition of the methods.