Can cellular imaging predict gene/protein expression?
Today, let's take a look at an interesting method for predicting proteomic and transcriptomic expression from high-content cellular images. Is it possible to learn the underlying gene/protein abundance from cellular imaging?
Multi-omics Prediction from High-content Cellular Imaging with Deep Learning
Link: https://arxiv.org/abs/2306.09391
Introduction
This paper from GSK.ai describes a novel technique, Image2Omics, that can predict transcriptomic and proteomic expression from high-content cellular imaging. Evaluation was conducted on hiPSC-derived macrophages under two stimulation conditions (M1 and M2) and used 152 CRISPR-based perturbations, covering a wide expression profile. The results presented in the introduction point to a promising outcome for proteomic abundance.
My opening thought: this technique is very interesting and novel. The authors hypothesize that cellular imaging can provide the necessary variability in the data to derive proteomic and transcriptomic expression. Here the authors refer to bulk proteomics and bulk transcriptomics. Anyone who has worked with these datasets, like myself, will instantly wonder how it could be possible to capture the variability over thousands of features in the underlying system with cell-level imaging. And what about low-abundance proteins: will there be cellular changes, capturable in imaging, that reflect high/low expression of a low-abundance protein? What about isoforms? And finally, can we scale this to the sample or tissue level? Let's dive into the methods to understand more and see if we can get some answers.
In this article, I will focus on the modeling and computational work only, mostly ignoring the data acquisition methods, assuming they were well thought out with adequate QC.
Two details to add:
Lowly expressed genes/proteins were filtered out based on a threshold on the normalized counts. I believe the authors also filter out the corresponding protein for a gene using this method. We can safely say that low-abundance proteins are not going to be predicted, since the dataset will most likely lack them after filtering.
Cellular images were preprocessed for shading correction and the creation of cell-centered patches by identifying the nucleus center of each cell in a well. Each well is one of the 152 CRISPR experiments under M1 or M2. Note that a well contains multiple nuclei, all of which are captured, providing the dataset size necessary to make predictions.
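To make the patch-creation step concrete, here is a minimal sketch of cell-centered cropping from detected nucleus centers. The 64-pixel patch size, the border handling, and the single-channel image are all my assumptions for illustration, not details taken from the paper:

```python
import numpy as np

def extract_patches(image, centers, size=64):
    """Crop a square patch around each nucleus center.
    Patch size and border handling are hypothetical choices."""
    half = size // 2
    patches = []
    for (y, x) in centers:
        # skip nuclei too close to the border for a full patch
        if half <= y <= image.shape[0] - half and half <= x <= image.shape[1] - half:
            patches.append(image[y - half:y + half, x - half:x + half])
    return np.stack(patches)

well = np.random.rand(512, 512)              # one channel of a well image
nuclei = [(100, 200), (300, 310), (10, 10)]  # third is too close to the edge
patches = extract_patches(well, nuclei)
print(patches.shape)  # (2, 64, 64)
```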
Image embedding
Image2Omics is a ResNet18 model with two modifications.
What’s a ResNet? ResNet is a convolutional network developed at Microsoft and published in 2015 [1]. Its residual connections allowed it to win 1st place in the ILSVRC 2015 classification task. Below is the architecture block diagram from the fast.ai forums.
![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0be0e9e8-8b9f-49c1-95bc-341532931388_794x717.png)
The two (there should be three) modifications were:
the “average pool” layers replaced with a “global average pooling” layer. This flattens all feature maps using a global average, as opposed to local averages over each feature map.
the 1000-d fully connected layer replaced with two layers: 1024-d and 128-d.
the final softmax layer removed, as this is not a classification task. This modification was not listed, but it may not be obvious (at least it wasn't to me!)
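As a sketch of what the modified head computes, here is a pure-NumPy version of the three changes applied on top of ResNet18's final 512 feature maps. The 7×7 spatial size, the ReLU between the two dense layers, and the random weights are illustrative assumptions, not confirmed details:

```python
import numpy as np

rng = np.random.default_rng(0)

# 512 feature maps from the ResNet18 trunk for one cell-centered patch
# (the 7x7 spatial size is illustrative)
feature_maps = rng.standard_normal((512, 7, 7))

# 1) global average pooling: one scalar per feature map
pooled = feature_maps.mean(axis=(1, 2))   # shape (512,)

# 2) two fully connected layers (1024-d then 128-d) replacing the
#    original 1000-d classification layer; weights here are random
W1 = rng.standard_normal((512, 1024)) * 0.01
W2 = rng.standard_normal((1024, 128)) * 0.01
hidden = np.maximum(pooled @ W1, 0)       # ReLU is an assumption
embedding = hidden @ W2                   # shape (128,)

# 3) no softmax: the 128-d vector is used directly as an embedding
print(embedding.shape)  # (128,)
```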
The output of this modified ResNet18 model is a 128-d embedding for each cell-centered image in a well. Additionally, mean aggregation is performed over all cell-centered images in a well to produce a 128-d well embedding. Here, a well is a CRISPR-perturbed experiment under condition M1 or M2.
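The well-level aggregation is just a mean over the per-cell embeddings; a short sketch with made-up numbers:

```python
import numpy as np

# hypothetical 128-d embeddings for 300 cell-centered patches in one well
cell_embeddings = np.random.rand(300, 128)

# mean aggregation over cells gives a single 128-d well embedding
well_embedding = cell_embeddings.mean(axis=0)
print(well_embedding.shape)  # (128,)
```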
Finally, training is performed using a multiple triplet loss function. This is a loss function I am not very familiar with, but it seems to provide good results in facial recognition [2].
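For readers (like me) less familiar with the loss, here is a minimal single-triplet version of the FaceNet-style triplet loss [2]. The paper's multiple triplet loss extends this idea; the 0.2 margin and the toy 2-d embeddings are illustrative choices:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """FaceNet-style triplet loss: pull the anchor toward the positive,
    push it away from the negative by at least `margin`."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])   # same perturbation -> should be close
n = np.array([1.0, 1.0])   # different perturbation -> should be far
print(triplet_loss(a, p, n))  # 0.0 (margin already satisfied)
```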
Multi-omic prediction
Once embeddings are produced for each well of the CRISPR experiments, the authors fine-tune individual models for transcriptomics and proteomics, for each of the two conditions separately (giving us a total of 2×2 = 4 models). Using the ResNet embeddings obtained above, a linear model is trained to produce the transcript or protein expression for each well.
This secondary linear model is an interesting approach that I am curious to understand better. The authors appear to use a linear model to predict the abundance of each gene/protein from the well embedding.
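A minimal sketch of what such a linear readout could look like, using ordinary least squares on synthetic embeddings. The dimensions and the use of `np.linalg.lstsq` are my assumptions, not the authors' exact fitting procedure:

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical data: 152 wells with 128-d embeddings, predicting 50 genes
X = rng.standard_normal((152, 128))
true_W = rng.standard_normal((128, 50))
Y = X @ true_W + 0.01 * rng.standard_normal((152, 50))

# least-squares linear readout: one weight column per gene/protein
W, *_ = np.linalg.lstsq(X, Y, rcond=None)
Y_pred = X @ W
print(Y_pred.shape)  # (152, 50)
```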
Model evaluation and results
The authors present several evaluations of the four models. I discuss two that I believe provide an overview of the technique: the correlation of predicted with observed expression, and a comparison with a mean predictor. A few other evaluations include predictability of subcellular localisation, pathway membership, and image-embedding evaluation using UMAP. Each provides a unique perspective on the use cases for this technique. I leave the last three for the reader to explore further.
Below are the two images that present the big picture.
First, the authors examine the squared correlation (r²) between the observed and the predicted omic expression from Image2Omics on a held-out test set for each state. Each dot is a gene/protein. Protein prediction appears significantly better than transcript prediction, with a high median r² consistent across the two states.
![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff13cd96c-b94b-49ea-b5aa-c783a1432081_916x689.png)
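The per-gene metric in this figure can be sketched as the squared Pearson correlation across held-out wells, computed column by column. The data here is synthetic, with predictions deliberately constructed to be good:

```python
import numpy as np

rng = np.random.default_rng(3)

def per_gene_r2(observed, predicted):
    """Squared Pearson correlation between observed and predicted
    values, computed independently for each gene (columns)."""
    return np.array([
        np.corrcoef(observed[:, g], predicted[:, g])[0, 1] ** 2
        for g in range(observed.shape[1])
    ])

obs = rng.random((30, 5))                 # 30 held-out wells, 5 genes
pred = obs + 0.05 * rng.random((30, 5))   # deliberately good predictions
print(per_gene_r2(obs, pred).shape)  # (5,)
```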
Second, the authors employ a one-sided Mann–Whitney–Wilcoxon test at the 0.05 level to check whether Image2Omics significantly improves the predictability of a gene/protein over a mean prediction. Here too, they observe that protein predictions are significantly better than the mean for a larger fraction of proteins than transcripts. This points to the possibility of the technique being a better imputation method for missing data compared to a well-practiced method: mean imputation. I would also have liked to see a comparison with a median predictor.
![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8626223-2e25-48fb-92aa-a172671a27b2_1120x879.png)
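A sketch of this comparison for a single gene, using SciPy's `mannwhitneyu`. The absolute-error formulation and the synthetic data are my assumptions about the setup, not the paper's exact procedure:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(2)

# hypothetical held-out data for one gene across 40 test wells:
# a model whose predictions are close, versus a constant mean predictor
observed = rng.standard_normal(40)
model_error = np.abs(0.1 * rng.standard_normal(40))  # small residuals
mean_error = np.abs(observed - observed.mean())      # mean-predictor residuals

# one-sided test: are the model's errors stochastically smaller
# than the mean predictor's errors?
stat, p = mannwhitneyu(model_error, mean_error, alternative="less")
print(p < 0.05)  # True
```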
Closing remarks
The results in the paper are promising! I am definitely excited by the possibility of obtaining protein expression from cellular imaging. I believe the overall scores for transcripts might improve if the model focused only on the 1000 landmark genes as opposed to the ~26k features targeted. More data generation experiments at a larger scale would be needed for better predictability. Low-abundance proteins will remain a challenge if they are not consistently recorded in the data acquisition phase.
Use cases
In addition to predicting multi-omics, which could become feasible with more large-scale experiments, this technique could be employed as a QC tool. To expand, a model could be used for:
Imputation in cases where some proteins are not recorded for a sample by the mass spectrometer in proteomics.
QC for the proteomic data acquisition process, for example across runs or TMT labels, to ensure the highest-quality data handling and sample preparation.
With sufficient testing, it could be possible to use predictions for error correction of poorly handled samples that may be problematic and would otherwise be discarded.
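The imputation use case above could be as simple as filling only the missing entries with model predictions; a sketch with made-up values:

```python
import numpy as np

# hypothetical measured protein matrix with missing values (NaN)
measured = np.array([[1.0, np.nan, 3.0],
                     [np.nan, 2.0, 1.5]])

# predictions for the same wells from an image-based model (illustrative)
predicted = np.array([[1.1, 2.2, 2.9],
                      [0.9, 2.1, 1.4]])

# impute only the missing entries, keeping measured values where available
imputed = np.where(np.isnan(measured), predicted, measured)
print(imputed)  # measured values kept; the NaNs replaced by 2.2 and 0.9
```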
If you liked this review, please do subscribe to receive my next review directly in your mailbox! Thanks for reading.
[1] Deep Residual Learning for Image Recognition: https://arxiv.org/pdf/1512.03385.pdf
[2] FaceNet: A Unified Embedding for Face Recognition and Clustering: https://arxiv.org/abs/1503.03832