{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "b3e79fa9",
   "metadata": {},
   "source": [
    "# Getting started with Starling (ST)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2a06e71b",
   "metadata": {
    "tags": [
     "hide-output"
    ]
   },
   "outputs": [],
   "source": [
    "%pip install biostarling\n",
    "%pip install lightning_lite\n",
    "\n",
    "import anndata as ad\n",
    "import pandas as pd\n",
    "import torch\n",
    "from starling import starling, utility\n",
    "from lightning_lite import seed_everything\n",
    "import pytorch_lightning as pl\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b615eb39",
   "metadata": {},
   "source": [
    "## Setting seed for everything\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e83f4cce",
   "metadata": {
    "tags": [
     "hide-output"
    ]
   },
   "outputs": [],
   "source": [
    "seed_everything(10, workers=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d1f5142d",
   "metadata": {},
   "source": [
    "## Loading annData objects\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "69ef8c1d",
   "metadata": {},
   "source": [
    "The example below runs Kmeans with 10 clusters read from \"sample_input.h5ad\" object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8c4a203f",
   "metadata": {
    "tags": [
     "hide-output"
    ]
   },
   "outputs": [],
   "source": [
    "!wget https://github.com/camlab-bioml/starling/raw/main/docs/source/tutorial/sample_input.h5ad\n",
    "\n",
    "adata = utility.init_clustering(\"KM\", ad.read_h5ad(\"sample_input.h5ad\"), k=10)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "52d3d9fb",
   "metadata": {},
   "source": [
    "- The input anndata object should contain a cell-by-protein matrix of segmented single-cell expression profiles in the `.X` position. Optionally, cell size information can also be provided as a column of the `.obs` DataFrame. In this case `model_cell_size` should be set to `True` and the column specified in the `cell_size_col_name`argument.\n",
    "- Users might want to arcsinh protein expressions in \\*.h5ad (for example, `sample_input.h5ad`).\n",
    "- The `utility.py` provides an easy setup of GMM, KM (Kmeans) or PG (PhenoGraph).\n",
    "- Default settings are applied to each method.\n",
    "- k can be omitted when PG is used.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7fd11c15",
   "metadata": {},
   "source": [
    "## Setting initializations\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6effd2b9",
   "metadata": {},
   "source": [
    "The example below uses defualt parameter settings based on benchmarking results (more details in manuscript).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eff9a063",
   "metadata": {},
   "outputs": [],
   "source": [
    "st = starling.ST(adata)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "923d2e71",
   "metadata": {},
   "source": [
    "A list of parameters are shown:\n",
    "\n",
    "- adata: annDATA object of the sample\n",
    "- dist_option (default: 'T'): T for Student-T (df=2) and N for Normal (Gaussian)\n",
    "- singlet_prop (default: 0.6): the proportion of anticipated segmentation error free cells \n",
    "- model_cell_size (default: 'Y'): Y for incoporating cell size in the model and N otherwise\n",
    "- cell_size_col_name (default: 'area'): area is the column name in anndata.obs dataframe\n",
    "- model_zplane_overlap (default: 'Y'): Y for modeling z-plane overlap when cell size is modelled and N otherwise\n",
    "  Note: if the user sets model_cell_size = 'N', then model_zplane_overlap is ignored\n",
    "- model_regularizer (default: 1): Regularizier term impose on synthetic doublet loss (BCE)\n",
    "- learning_rate (default: 1e-3): The learning rate of ADAM optimizer for STARLING\n",
    "\n",
    "Equivalent to the above example:\n",
    "```python\n",
    "st = starling.ST(adata, 'T', 'Y', 'area', 'Y', 1, 1e-3)\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "63939215",
   "metadata": {},
   "source": [
    "## Setting training log\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d721258f",
   "metadata": {},
   "source": [
    "Once training starts, a new directory 'log' will be created."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a217070c",
   "metadata": {},
   "outputs": [],
   "source": [
    "## log training results via tensorboard\n",
    "log_tb = pl.loggers.TensorBoardLogger(save_dir=\"log\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ae8e46ea",
   "metadata": {},
   "source": [
    "One could view the training information via tensorboard. Please refer to torch lightning (https://lightning.ai/docs/pytorch/stable/api_references.html#profiler) for other possible loggers.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "914bcd5c",
   "metadata": {},
   "source": [
    "## Setting early stopping criterion\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "90877a9c",
   "metadata": {},
   "outputs": [],
   "source": [
    "## set early stopping criterion\n",
    "cb_early_stopping = pl.callbacks.EarlyStopping(monitor=\"train_loss\", mode=\"min\", verbose=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ac4c7459",
   "metadata": {},
   "source": [
    "Training loss is monitored.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bb32a46b",
   "metadata": {},
   "source": [
    "## Training Starling\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8f49c63c",
   "metadata": {
    "tags": [
     "hide-output"
    ]
   },
   "outputs": [],
   "source": [
    "## train ST\n",
    "st.train_and_fit(\n",
    "    callbacks=[cb_early_stopping],\n",
    "    logger=[log_tb],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3ba887b2",
   "metadata": {},
   "source": [
    "## Appending STARLING results to the annData object\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3082c69a",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "## retrive starling results\n",
    "result = st.result()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a705d895",
   "metadata": {},
   "source": [
    "## The following information can be retrived from the annData object:\n",
    "\n",
    "- st.adata.varm['init_exp_centroids'] -- initial expression cluster centroids (P x C matrix)\n",
    "- st.adata.varm['st_exp_centroids'] -- ST expression cluster centroids (P x C matrix)\n",
    "- st.adata.uns['init_cell_size_centroids'] -- initial cell size centroids if STARLING models cell size\n",
    "- st.adata.uns['st_cell_size_centroids'] -- initial & ST cell size centroids if ST models cell size\n",
    "- st.adata.obsm['assignment_prob_matrix'] -- cell assignment probability (N x C maxtrix)\n",
    "- st.adata.obsm['gamma_prob_matrix'] -- gamma probabilitiy of two cells (N x C x C maxtrix)\n",
    "- st.adata.obs['doublet'] -- doublet indicator\n",
    "- st.adata.obs['doublet_prob'] -- doublet probabilities\n",
    "- st.adata.obs['init_label'] -- initial assignments\n",
    "- st.adata.obs['st_label'] -- ST assignments\n",
    "- st.adata.obs['max_assign_prob'] -- ST max probabilites of assignments\n",
    "\n",
    "_N: # of cells; C: # of clusters; P: # of proteins_\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4ab8cb0a",
   "metadata": {},
   "source": [
    "## Saving the model\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "204cad47",
   "metadata": {},
   "outputs": [],
   "source": [
    "## st object can be saved\n",
    "torch.save(st, \"model.pt\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "980dad28",
   "metadata": {},
   "source": [
    "model.pt will be saved in the same directory as this notebook.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ad7e5fc0",
   "metadata": {},
   "source": [
    "## Showing STARLING results\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c7e67d1d",
   "metadata": {
    "tags": [
     "scroll-output"
    ]
   },
   "outputs": [],
   "source": [
    "display(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "53e32d26",
   "metadata": {},
   "source": [
    "One could easily perform further analysis such as co-occurance, enrichment analysis and etc.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b601be72",
   "metadata": {},
   "outputs": [],
   "source": [
    "result.obs"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "af541283",
   "metadata": {},
   "source": [
    "Starling provides doublet probabilities and cell assignment if it were a singlet for each cell.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "80e61208",
   "metadata": {},
   "source": [
    "## Showing initial expression centroids:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a2be0fcc",
   "metadata": {},
   "outputs": [],
   "source": [
    "## initial expression centroids (p x c) matrix\n",
    "pd.DataFrame(result.varm[\"init_exp_centroids\"], index=result.var_names)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "03424211",
   "metadata": {},
   "source": [
    "There are 10 centroids since we set Kmeans (KM) as k = 10 earlier.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f0bc41a8",
   "metadata": {},
   "source": [
    "## Showing Starling expression centroids:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a11a5334",
   "metadata": {},
   "outputs": [],
   "source": [
    "## starling expression centroids (p x c) matrix\n",
    "pd.DataFrame(result.varm[\"st_exp_centroids\"], index=result.var_names)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a2cccf9d",
   "metadata": {},
   "source": [
    "From here one could easily annotate cluster centroids to cell type.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "993eb08b",
   "metadata": {},
   "source": [
    "## Showing Assignment Distributions:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "75f8b562",
   "metadata": {},
   "outputs": [],
   "source": [
    "## assignment distributions (n x c maxtrix)\n",
    "pd.DataFrame(result.obsm[\"assignment_prob_matrix\"], index=result.obs.index)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b203933c",
   "metadata": {},
   "source": [
    "Currently, we assign a cell label based on the maximum probability among all possible clusters. However, these could be mislabeled because maximum and second highest probabilies can be very close."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.15"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}