# Getting started with Starling (ST)


In [None]:
%pip install biostarling
%pip install lightning_lite

import anndata as ad
import pandas as pd
import torch
from starling import starling, utility
from lightning_lite import seed_everything
import pytorch_lightning as pl



## Setting seed for everything


In [None]:
seed_everything(10, workers=True)

## Loading annData objects


The example below reads `sample_input.h5ad` and initializes the clustering with Kmeans (KM) using 10 clusters.

In [None]:
!wget https://github.com/camlab-bioml/starling/raw/main/docs/source/tutorial/sample_input.h5ad

adata = utility.init_clustering("KM", ad.read_h5ad("sample_input.h5ad"), k=10)


- The input anndata object should contain a cell-by-protein matrix of segmented single-cell expression profiles in `.X`. Optionally, cell size information can be provided as a column of the `.obs` DataFrame. In this case, cell size modelling should be enabled via `model_cell_size` and the column name passed in the `cell_size_col_name` argument.
- Users might want to arcsinh-transform the protein expression values stored in the \*.h5ad file (for example, `sample_input.h5ad`).
- The `utility` module provides an easy setup of GMM, KM (Kmeans) or PG (PhenoGraph) for the initial clustering.
- Default settings are applied to each method.
- `k` can be omitted when PG is used.
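
As an illustration of the arcsinh transformation mentioned in the list above, here is a minimal sketch on a synthetic intensity matrix. The cofactor of 5 is an assumption (a value often used for mass cytometry data), not something Starling prescribes, and the array `raw` is made up for the example.

```python
import numpy as np

# Hypothetical raw cell-by-protein intensities (3 cells x 2 proteins).
raw = np.array([[0.0, 5.0], [10.0, 50.0], [100.0, 500.0]])

# Assumed cofactor; choose a value appropriate for your imaging platform.
cofactor = 5.0

# arcsinh compresses large intensities while staying roughly linear near zero.
transformed = np.arcsinh(raw / cofactor)

# The transform is invertible, so the raw values can be recovered if needed.
recovered = np.sinh(transformed) * cofactor
```

With a real dataset, the same operation would be applied to `adata.X` before calling `utility.init_clustering`, assuming `.X` is stored as a dense array.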


## Setting initializations


The example below uses default parameter settings based on benchmarking results (more details in the manuscript).


In [None]:
st = starling.ST(adata)

The parameters are listed below:

- adata: AnnData object of the sample
- dist_option (default: 'T'): T for Student-T (df=2) and N for Normal (Gaussian)
- singlet_prop (default: 0.6): the anticipated proportion of cells free of segmentation errors
- model_cell_size (default: 'Y'): Y for incorporating cell size in the model and N otherwise
- cell_size_col_name (default: 'area'): the name of the cell size column in the anndata.obs dataframe
- model_zplane_overlap (default: 'Y'): Y for modeling z-plane overlap when cell size is modelled and N otherwise
  Note: if the user sets model_cell_size = 'N', then model_zplane_overlap is ignored
- model_regularizer (default: 1): regularizer term imposed on the synthetic doublet loss (BCE)
- learning_rate (default: 1e-3): the learning rate of the ADAM optimizer for STARLING

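Equivalent to the above example, with every default from the list written out as a keyword argument:
```python
st = starling.ST(
    adata,
    dist_option='T',
    singlet_prop=0.6,
    model_cell_size='Y',
    cell_size_col_name='area',
    model_zplane_overlap='Y',
    model_regularizer=1,
    learning_rate=1e-3,
)
```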


## Setting training log


Once training starts, a new directory 'log' will be created.

In [None]:
## log training results via tensorboard
log_tb = pl.loggers.TensorBoardLogger(save_dir="log")

The training information can be viewed with TensorBoard. Please refer to the PyTorch Lightning documentation (https://lightning.ai/docs/pytorch/stable/api_references.html#profiler) for other possible loggers.


## Setting early stopping criterion


In [None]:
## set early stopping criterion
cb_early_stopping = pl.callbacks.EarlyStopping(monitor="train_loss", mode="min", verbose=False)

Early stopping monitors the training loss (`train_loss`) and halts training once it stops decreasing.
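
To make the stopping rule concrete, the following is a simplified, pure-Python sketch of patience-based early stopping in "min" mode. It is an illustration, not PyTorch Lightning's implementation; the function name `first_stopping_epoch` and the loss values are hypothetical, and the defaults `patience=3` and `min_delta=0.0` mirror the documented defaults of `EarlyStopping`.

```python
def first_stopping_epoch(losses, patience=3, min_delta=0.0):
    """Return the epoch index at which training would stop, or None."""
    best = float("inf")
    wait = 0  # number of consecutive epochs without improvement
    for epoch, loss in enumerate(losses):
        if loss < best - min_delta:
            # The monitored loss improved: remember it and reset the counter.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical per-epoch training losses.
losses = [1.0, 0.8, 0.7, 0.71, 0.70, 0.72, 0.6]
stop_epoch = first_stopping_epoch(losses)
```

Here the loss last improves at epoch 2, so three epochs without improvement trigger the stop at epoch 5 and the later value of 0.6 is never reached. Passing a larger `patience` to `EarlyStopping` makes training more tolerant of such plateaus.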


## Training Starling


In [None]:
## train ST
st.train_and_fit(
 callbacks=[cb_early_stopping],
 logger=[log_tb],
)

## Appending STARLING results to the annData object


In [None]:
## retrieve starling results
result = st.result()

## The following information can be retrieved from the annData object:

- st.adata.varm['init_exp_centroids'] -- initial expression cluster centroids (P x C matrix)
- st.adata.varm['st_exp_centroids'] -- ST expression cluster centroids (P x C matrix)
- st.adata.uns['init_cell_size_centroids'] -- initial cell size centroids if ST models cell size
- st.adata.uns['st_cell_size_centroids'] -- ST cell size centroids if ST models cell size
- st.adata.obsm['assignment_prob_matrix'] -- cell assignment probability (N x C matrix)
- st.adata.obsm['gamma_prob_matrix'] -- gamma probability of two cells (N x C x C matrix)
- st.adata.obs['doublet'] -- doublet indicator
- st.adata.obs['doublet_prob'] -- doublet probabilities
- st.adata.obs['init_label'] -- initial assignments
- st.adata.obs['st_label'] -- ST assignments
- st.adata.obs['max_assign_prob'] -- ST max probabilities of assignments

_N: # of cells; C: # of clusters; P: # of proteins_


## Saving the model


In [None]:
## st object can be saved
torch.save(st, "model.pt")

model.pt will be saved in the same directory as this notebook.


## Showing STARLING results


In [None]:
display(result)

One could easily perform further analyses such as co-occurrence or enrichment analysis.


In [None]:
result.obs

For each cell, Starling provides a doublet probability and the cluster assignment the cell would receive if it were a singlet.
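
As a sketch of how these per-cell columns might be used downstream, the example below filters likely singlets and lists the cells whose label changed between initialization and Starling. The DataFrame `obs` is a synthetic stand-in for `result.obs`, and the 0.5 cutoff is a hypothetical choice rather than a recommended value.

```python
import pandas as pd

# Synthetic stand-in for result.obs, using the column names listed earlier.
obs = pd.DataFrame(
    {
        "init_label": [0, 1, 1, 2],
        "st_label": [0, 1, 2, 2],
        "doublet_prob": [0.05, 0.90, 0.30, 0.60],
    },
    index=["cell_1", "cell_2", "cell_3", "cell_4"],
)

# Hypothetical cutoff: keep cells that are more likely singlets than doublets.
threshold = 0.5
likely_singlets = obs[obs["doublet_prob"] < threshold]

# Among the retained cells, find those that Starling assigned to a new cluster.
relabelled = likely_singlets[
    likely_singlets["init_label"] != likely_singlets["st_label"]
]
```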


## Showing initial expression centroids:


In [None]:
## initial expression centroids (p x c) matrix
pd.DataFrame(result.varm["init_exp_centroids"], index=result.var_names)

There are 10 centroids because we initialized with Kmeans (KM) and k = 10 earlier.


## Showing Starling expression centroids:


In [None]:
## starling expression centroids (p x c) matrix
pd.DataFrame(result.varm["st_exp_centroids"], index=result.var_names)

From here, one could annotate each cluster centroid with a cell type.
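
One possible way to support such an annotation is to list the most highly expressed markers per cluster. The sketch below uses a synthetic P x C table in place of the centroid DataFrame; the marker names, values, and the `cell_types` mapping are all made up for illustration.

```python
import pandas as pd

# Synthetic P x C centroid table (rows: proteins, columns: clusters).
centroids = pd.DataFrame(
    [[2.5, 0.1, 0.3], [0.2, 3.0, 0.4], [0.1, 0.2, 1.8], [1.9, 0.3, 0.2]],
    index=["CD3", "CD20", "panCK", "CD8"],
    columns=[0, 1, 2],
)

# For each cluster, list the two markers with the highest centroid expression.
top_markers = {
    cluster: centroids[cluster].nlargest(2).index.tolist()
    for cluster in centroids.columns
}

# Hypothetical manual annotation informed by the top markers.
cell_types = {0: "CD8 T cell", 1: "B cell", 2: "Epithelial"}
```

The resulting mapping could then be applied to the `st_label` column, for example with `result.obs["st_label"].map(cell_types)`, assuming the labels are stored as the same integer cluster indices.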


## Showing Assignment Distributions:


In [None]:
## assignment distributions (n x c matrix)
pd.DataFrame(result.obsm["assignment_prob_matrix"], index=result.obs.index)

Currently, we assign a cell label based on the maximum probability among all possible clusters. However, a cell could be mislabeled because the maximum and second-highest probabilities can be very close.
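
A simple way to spot such cases is to compare the two largest probabilities in each row. The sketch below does this on a synthetic N x C matrix standing in for `assignment_prob_matrix`; the margin cutoff of 0.1 is a hypothetical value, not one suggested by Starling.

```python
import numpy as np

# Synthetic N x C assignment probabilities (each row sums to 1).
probs = np.array(
    [
        [0.90, 0.05, 0.05],
        [0.48, 0.47, 0.05],
        [0.10, 0.30, 0.60],
    ]
)

# Label each cell with its most probable cluster.
labels = probs.argmax(axis=1)

# Margin between the highest and second-highest probability per cell.
sorted_probs = np.sort(probs, axis=1)
margin = sorted_probs[:, -1] - sorted_probs[:, -2]

# Hypothetical cutoff: flag cells whose top two clusters are nearly tied.
min_margin = 0.1
ambiguous = margin < min_margin
```

In this example the second cell is flagged, since its top two clusters differ by only 0.01. Flagged cells could be inspected manually or excluded from analyses that are sensitive to mislabeling.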