{ "cells": [ { "cell_type": "markdown", "id": "b3e79fa9", "metadata": {}, "source": [ "# Getting started with Starling (ST)\n" ] }, { "cell_type": "code", "execution_count": null, "id": "2a06e71b", "metadata": { "tags": [ "hide-output" ] }, "outputs": [], "source": [ "%pip install biostarling\n", "%pip install lightning_lite\n", "\n", "import anndata as ad\n", "import pandas as pd\n", "import torch\n", "from starling import starling, utility\n", "from lightning_lite import seed_everything\n", "import pytorch_lightning as pl\n", "\n" ] }, { "cell_type": "markdown", "id": "b615eb39", "metadata": {}, "source": [ "## Setting seed for everything\n" ] }, { "cell_type": "code", "execution_count": null, "id": "e83f4cce", "metadata": { "tags": [ "hide-output" ] }, "outputs": [], "source": [ "seed_everything(10, workers=True)" ] }, { "cell_type": "markdown", "id": "d1f5142d", "metadata": {}, "source": [ "## Loading AnnData objects\n" ] }, { "cell_type": "markdown", "id": "69ef8c1d", "metadata": {}, "source": [ "The example below reads the \"sample_input.h5ad\" object and runs k-means (KM) with 10 clusters to obtain the initial clustering." ] }, { "cell_type": "code", "execution_count": null, "id": "8c4a203f", "metadata": { "tags": [ "hide-output" ] }, "outputs": [], "source": [ "!wget https://github.com/camlab-bioml/starling/raw/main/docs/source/tutorial/sample_input.h5ad\n", "\n", "adata = utility.init_clustering(\"KM\", ad.read_h5ad(\"sample_input.h5ad\"), k=10)\n" ] }, { "cell_type": "markdown", "id": "52d3d9fb", "metadata": {}, "source": [ "- The input anndata object should contain a cell-by-protein matrix of segmented single-cell expression profiles in the `.X` position. Optionally, cell size information can also be provided as a column of the `.obs` DataFrame. 
In this case, `model_cell_size` should be set to `True` and the column name passed via the `cell_size_col_name` argument.\n", "- Users might want to arcsinh-transform the protein expression values in \\*.h5ad (for example, `sample_input.h5ad`).\n", "- `utility.py` provides an easy setup of the initial clustering with GMM, KM (k-means) or PG (PhenoGraph).\n", "- Default settings are applied to each method.\n", "- `k` can be omitted when PG is used.\n" ] }, { "cell_type": "markdown", "id": "7fd11c15", "metadata": {}, "source": [ "## Setting initializations\n" ] }, { "cell_type": "markdown", "id": "6effd2b9", "metadata": {}, "source": [ "The example below uses default parameter settings based on benchmarking results (more details in the manuscript).\n" ] }, { "cell_type": "code", "execution_count": null, "id": "eff9a063", "metadata": {}, "outputs": [], "source": [ "st = starling.ST(adata)" ] }, { "cell_type": "markdown", "id": "923d2e71", "metadata": {}, "source": [ "The parameters are listed below:\n", "\n", "- adata: AnnData object of the sample\n", "- dist_option (default: 'T'): T for Student-T (df=2) and N for Normal (Gaussian)\n", "- singlet_prop (default: 0.6): the anticipated proportion of cells free of segmentation errors\n", "- model_cell_size (default: 'Y'): Y for incorporating cell size in the model and N otherwise\n", "- cell_size_col_name (default: 'area'): the name of the cell size column in the anndata.obs dataframe\n", "- model_zplane_overlap (default: 'Y'): Y for modeling z-plane overlap when cell size is modelled and N otherwise\n", " Note: if the user sets model_cell_size = 'N', then model_zplane_overlap is ignored\n", "- model_regularizer (default: 1): regularizer term imposed on the synthetic doublet loss (BCE)\n", "- learning_rate (default: 1e-3): the learning rate of the ADAM optimizer for STARLING\n", "\n", "Equivalent to the above example:\n", "```python\n", "st = starling.ST(adata, dist_option='T', singlet_prop=0.6, model_cell_size='Y', cell_size_col_name='area', model_zplane_overlap='Y', model_regularizer=1, learning_rate=1e-3)\n", "```\n" ] }, { "cell_type": "markdown", "id": "63939215", "metadata": {}, "source": [ "## Setting training 
log\n" ] }, { "cell_type": "markdown", "id": "d721258f", "metadata": {}, "source": [ "Once training starts, a new directory 'log' will be created." ] }, { "cell_type": "code", "execution_count": null, "id": "a217070c", "metadata": {}, "outputs": [], "source": [ "## log training results via tensorboard\n", "log_tb = pl.loggers.TensorBoardLogger(save_dir=\"log\")" ] }, { "cell_type": "markdown", "id": "ae8e46ea", "metadata": {}, "source": [ "One can view the training information via TensorBoard. Please refer to PyTorch Lightning (https://lightning.ai/docs/pytorch/stable/api_references.html#profiler) for other possible loggers.\n" ] }, { "cell_type": "markdown", "id": "914bcd5c", "metadata": {}, "source": [ "## Setting early stopping criterion\n" ] }, { "cell_type": "code", "execution_count": null, "id": "90877a9c", "metadata": {}, "outputs": [], "source": [ "## set early stopping criterion\n", "cb_early_stopping = pl.callbacks.EarlyStopping(monitor=\"train_loss\", mode=\"min\", verbose=False)" ] }, { "cell_type": "markdown", "id": "ac4c7459", "metadata": {}, "source": [ "Training loss is monitored; training stops once it no longer improves.\n" ] }, { "cell_type": "markdown", "id": "bb32a46b", "metadata": {}, "source": [ "## Training Starling\n" ] }, { "cell_type": "code", "execution_count": null, "id": "8f49c63c", "metadata": { "tags": [ "hide-output" ] }, "outputs": [], "source": [ "## train ST\n", "st.train_and_fit(\n", "    callbacks=[cb_early_stopping],\n", "    logger=[log_tb],\n", ")" ] }, { "cell_type": "markdown", "id": "3ba887b2", "metadata": {}, "source": [ "## Appending STARLING results to the AnnData object\n" ] }, { "cell_type": "code", "execution_count": null, "id": "3082c69a", "metadata": { "scrolled": true }, "outputs": [], "source": [ "## retrieve starling results\n", "result = st.result()" ] }, { "cell_type": "markdown", "id": "a705d895", "metadata": {}, "source": [ "## The following information can be retrieved from the AnnData object:\n", "\n", "- st.adata.varm['init_exp_centroids'] -- initial 
expression cluster centroids (P x C matrix)\n", "- st.adata.varm['st_exp_centroids'] -- ST expression cluster centroids (P x C matrix)\n", "- st.adata.uns['init_cell_size_centroids'] -- initial cell size centroids if STARLING models cell size\n", "- st.adata.uns['st_cell_size_centroids'] -- ST cell size centroids if STARLING models cell size\n", "- st.adata.obsm['assignment_prob_matrix'] -- cell assignment probability (N x C matrix)\n", "- st.adata.obsm['gamma_prob_matrix'] -- gamma probability of two cells (N x C x C matrix)\n", "- st.adata.obs['doublet'] -- doublet indicator\n", "- st.adata.obs['doublet_prob'] -- doublet probabilities\n", "- st.adata.obs['init_label'] -- initial assignments\n", "- st.adata.obs['st_label'] -- ST assignments\n", "- st.adata.obs['max_assign_prob'] -- ST max probabilities of assignments\n", "\n", "_N: # of cells; C: # of clusters; P: # of proteins_\n" ] }, { "cell_type": "markdown", "id": "4ab8cb0a", "metadata": {}, "source": [ "## Saving the model\n" ] }, { "cell_type": "code", "execution_count": null, "id": "204cad47", "metadata": {}, "outputs": [], "source": [ "## st object can be saved\n", "torch.save(st, \"model.pt\")" ] }, { "cell_type": "markdown", "id": "980dad28", "metadata": {}, "source": [ "model.pt will be saved in the same directory as this notebook.\n" ] }, { "cell_type": "markdown", "id": "ad7e5fc0", "metadata": {}, "source": [ "## Showing STARLING results\n" ] }, { "cell_type": "code", "execution_count": null, "id": "c7e67d1d", "metadata": { "tags": [ "scroll-output" ] }, "outputs": [], "source": [ "display(result)" ] }, { "cell_type": "markdown", "id": "53e32d26", "metadata": {}, "source": [ "From here, one can perform further analyses such as co-occurrence or enrichment analysis.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "b601be72", "metadata": {}, "outputs": [], "source": [ "result.obs" ] }, { "cell_type": "markdown", "id": "af541283", "metadata": {}, "source": [ "Starling provides 
doublet probabilities and, for each cell, the cluster assignment it would receive if it were a singlet.\n" ] }, { "cell_type": "markdown", "id": "80e61208", "metadata": {}, "source": [ "## Showing initial expression centroids:\n" ] }, { "cell_type": "code", "execution_count": null, "id": "a2be0fcc", "metadata": {}, "outputs": [], "source": [ "## initial expression centroids (p x c) matrix\n", "pd.DataFrame(result.varm[\"init_exp_centroids\"], index=result.var_names)" ] }, { "cell_type": "markdown", "id": "03424211", "metadata": {}, "source": [ "There are 10 centroids because we ran k-means (KM) with k = 10 earlier.\n" ] }, { "cell_type": "markdown", "id": "f0bc41a8", "metadata": {}, "source": [ "## Showing Starling expression centroids:\n" ] }, { "cell_type": "code", "execution_count": null, "id": "a11a5334", "metadata": {}, "outputs": [], "source": [ "## starling expression centroids (p x c) matrix\n", "pd.DataFrame(result.varm[\"st_exp_centroids\"], index=result.var_names)" ] }, { "cell_type": "markdown", "id": "a2cccf9d", "metadata": {}, "source": [ "From here, one can annotate the cluster centroids with cell types.\n" ] }, { "cell_type": "markdown", "id": "993eb08b", "metadata": {}, "source": [ "## Showing Assignment Distributions:\n" ] }, { "cell_type": "code", "execution_count": null, "id": "75f8b562", "metadata": {}, "outputs": [], "source": [ "## assignment distributions (n x c matrix)\n", "pd.DataFrame(result.obsm[\"assignment_prob_matrix\"], index=result.obs.index)" ] }, { "cell_type": "markdown", "id": "b203933c", "metadata": {}, "source": [ "Currently, we assign each cell the label of the cluster with the maximum probability. However, a cell can be mislabeled because its highest and second-highest probabilities can be very close." 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.15" } }, "nbformat": 4, "nbformat_minor": 5 }