The model I used has 2 hidden layers, each containing four times fewer nodes than the one before it, and a bottleneck layer of 64 nodes. Because an autoencoder is trained without labels (it is an unsupervised learning task), train_test_split() was not needed and the entire dataset was used for training.
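The architecture described above could be sketched in Keras roughly as follows. This is a minimal sketch under stated assumptions, not the exact model used: the hidden widths of 1024 and 256 are inferred from the "four times fewer" rule and the 64-node bottleneck, and the ReLU activations, mean-squared-error loss, Adam optimizer, random placeholder matrix and the helper names `encoder_widths` and `build_autoencoder` are all illustrative choices.

```python
import numpy as np
import tensorflow as tf


def encoder_widths(bottleneck=64, n_hidden=2, factor=4):
    """Encoder layer widths, each `factor` times narrower than the previous one.

    The defaults give [1024, 256, 64]: two hidden layers, then the bottleneck.
    (Assumed reading of the description; the real widths may differ.)
    """
    return [bottleneck * factor ** i for i in range(n_hidden, -1, -1)]


def build_autoencoder(n_genes, widths):
    """Return (autoencoder, encoder); both share the same encoder weights."""
    inputs = tf.keras.Input(shape=(n_genes,))
    x = inputs
    # Encoder: progressively narrower dense layers.
    for w in widths[:-1]:
        x = tf.keras.layers.Dense(w, activation="relu")(x)
    bottleneck = tf.keras.layers.Dense(
        widths[-1], activation="relu", name="bottleneck"
    )(x)
    # Decoder: mirror image of the encoder.
    x = bottleneck
    for w in reversed(widths[:-1]):
        x = tf.keras.layers.Dense(w, activation="relu")(x)
    outputs = tf.keras.layers.Dense(n_genes, activation="linear")(x)

    autoencoder = tf.keras.Model(inputs, outputs)
    encoder = tf.keras.Model(inputs, bottleneck)
    autoencoder.compile(optimizer="adam", loss="mse")
    return autoencoder, encoder


# Placeholder data standing in for a normalized cells x genes matrix.
rng = np.random.default_rng(0)
X = rng.random((200, 500)).astype("float32")

autoencoder, encoder = build_autoencoder(X.shape[1], encoder_widths())
# The input is also the target, so no labels and no train/test split are used.
autoencoder.fit(X, X, epochs=1, batch_size=32, verbose=0)

# Lower-dimensional representation taken from the bottleneck layer.
latent = encoder.predict(X, verbose=0)
print(latent.shape)
```

Keeping a separate `encoder` model that ends at the bottleneck makes it easy to extract the 64-dimensional representation after training.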
To validate the model, the lower-dimensional representation extracted from the bottleneck layer was further reduced to two dimensions with UMAP for visualization. A logistic regression model was then fit on the 2D representation, with the cell types taken from the metadata as targets. The score of this model (its accuracy) was used as a gauge of how well the cell types separate in the embedding.
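The scoring step might look like the sketch below. To keep it self-contained, a synthetic 2D point cloud stands in for the UMAP output and the three cell type labels are placeholders. The commented-out lines show where umap-learn would normally be applied to the bottleneck representation (here called `latent`, a hypothetical name). Because the model is scored on the same points it was fit on, the number is a training accuracy.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# In the real workflow the 2D embedding would come from UMAP, e.g.:
#   import umap
#   embedding_2d = umap.UMAP(n_components=2).fit_transform(latent)
# Here well-separated synthetic blobs are used as a stand-in (assumption).
embedding_2d, labels = make_blobs(
    n_samples=300,
    centers=[(-5, -5), (0, 5), (5, -5)],
    cluster_std=0.5,
    random_state=0,
)
# Placeholder cell type names; the real ones would come from the metadata.
cell_type = np.array(["NK", "B cells", "Erythrocytes"])[labels]

# Fit a classifier that predicts cell type from the 2D coordinates.
clf = LogisticRegression(max_iter=1000)
clf.fit(embedding_2d, cell_type)

# score() returns mean accuracy; here it is computed on the training points.
accuracy = clf.score(embedding_2d, cell_type)
print(f"accuracy on the 2D embedding: {accuracy:.3f}")
```

A high score means the cell types occupy regions of the embedding that straight-line boundaries can separate. Mixed clusters pull the score down.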
The 2D representation was also plotted as a scatter plot with the cell type identities as different colours.
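A scatter plot of this kind can be produced with matplotlib, as in the sketch below. The coordinates and labels are synthetic placeholders rather than the project's data, and the axis names "UMAP 1" and "UMAP 2" are only a common convention.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for the 2D embedding and the metadata cell types.
rng = np.random.default_rng(0)
names = ["NK", "B cells", "Erythrocytes"]
centers = np.array([[-5.0, -5.0], [0.0, 5.0], [5.0, -5.0]])
labels = rng.integers(0, len(names), size=300)
embedding_2d = centers[labels] + rng.normal(scale=0.5, size=(300, 2))
cell_type = np.array(names)[labels]

fig, ax = plt.subplots(figsize=(6, 5))
# One scatter call per cell type gives each type its own colour and legend entry.
for name in names:
    mask = cell_type == name
    ax.scatter(embedding_2d[mask, 0], embedding_2d[mask, 1], s=5, label=name)
ax.set_xlabel("UMAP 1")
ax.set_ylabel("UMAP 2")
ax.legend(title="Cell type", markerscale=3)
```

Calling `fig.savefig(...)` with a file name of your choice then writes the figure to disk.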
From the plot, it looks like the autoencoder managed to separate clusters, with NK cells, erythrocytes and B cells forming distinct clusters. However, there are still some limitations. Several clusters contain more than one cell type, such as Memory CD4 T cells, Naive CD4 T cells and CD8 T cells. This might be because these cell types have similar transcriptomes, so the autoencoder placed them together.
As a result, in the future, additional modalities such as protein expression and ATAC-seq could be incorporated into the model to better separate cells with similar transcriptomes. However, with current technologies, sequencing multiple modalities without falling prey to the curse of dimensionality is not yet feasible. Perhaps, as sequencing technologies become more advanced and cheaper, integrating di-, tri- or even quad-modal analysis into single-cell profiling will become possible.
This short project shows a TensorFlow autoencoder applied to scRNA-seq data and suggests that it is adequate for grouping cells of the same type together.