intNMF examples

Load required libraries. int_nmf_model must be in the same directory. If it is not it can be added to pythons path

[10]:

import anndata as ad
import scanpy as sc
import numpy as np
import scipy
from scripts.int_nmf_model import intNMF, log_tf_idf
import anndata as ad
from matplotlib import pyplot as plt
import scipy.io
import pandas as pd

This notebook contains some examples of what intNMF can do including how to load multiome data stored as three different file types.

Example 1 - anndata

Probably the simplest case where the multiome data is stored as anndata object. https://anndata.readthedocs.io/en/latest/. Dataset from recent neurips competition

Add file locations and load the data

[11]:

atac_loc = 'data/anndata/openproblems_bmmc_multiome_phase2.censor_dataset.output_mod1.h5ad'  #comp data multi
rna_loc = 'data/anndata/openproblems_bmmc_multiome_phase2.censor_dataset.output_mod2.h5ad'
res_loc = 'data/anndata/openproblems_bmmc_multiome_phase2.censor_dataset.output_solution.h5ad'

[12]:

res

[12]:

'data/anndata/openproblems_bmmc_multiome_phase2.censor_dataset.output_solution.h5ad'

[13]:

atac_mat = ad.read_h5ad(atac_loc)
rna_mat = ad.read_h5ad(rna_loc)

[14]:

res_mat =  ad.read_h5ad(res_loc)

[15]:

res_mat

[15]:

AnnData object with n_obs × n_vars = 42492 × 13431
    obs: 'batch', 'cell_type', 'pseudotime_order_GEX', 'pseudotime_order_ATAC'
    var: 'gene_ids', 'feature_types'
    uns: 'dataset_id', 'organism'

[16]:

atac_mat.obs['cell_type'] = res_mat.obs['cell_type']

In this case the atac data is binary and the rna data are real numbers. This data has already been preprocessed and filtered. Before carrying out the matrix factorisation log_tf_idf transform the data

[17]:

atac_mat_tfidf_log = log_tf_idf(atac_mat.X)
rna_mat_tfidf_log = log_tf_idf(rna_mat.X)

initialise the nmf model then fit the model. This minimises the error X - WH for the atac and rna matrices respectively

[18]:

nmf_model = intNMF(20, epochs = 20, init = 'random')
nmf_model.fit(rna_mat_tfidf_log, atac_mat_tfidf_log)

Theta holds the topic activities in each cell. Numpy array of shape cell x topic

[19]:

nmf_model.theta.shape

[19]:

(42492, 20)

phi rna and phi atac hold the topic loadings or ‘definitions’ which facilitate recontruction of X from the latent topics. Numpy array of shape topic x feature (e.g. accessible chromatin or gene expression)

[20]:

print(nmf_model.phi_rna.shape)
nmf_model.phi_atac.shape

(20, 116490)

[20]:

(20, 13431)

Example plots to view the NMF matrices

View topic x gene matrix. These matrices have been thresholded at one to improve visualisation

[21]:

plt.imshow(np.where(nmf_model.phi_rna > 1, 1, nmf_model.phi_rna), aspect=nmf_model.phi_rna.shape[1]/nmf_model.phi_rna.shape[0], interpolation=None)
cbar = plt.colorbar()

../_images/notebooks_example_nmf_21_0.png

View topic x peak matrix.

[22]:

plt.imshow(np.where(nmf_model.phi_atac > 1, 1, nmf_model.phi_atac), aspect=nmf_model.phi_atac.shape[1]/nmf_model.phi_atac.shape[0], interpolation=None)
cbar = plt.colorbar()

../_images/notebooks_example_nmf_23_0.png

View cell x topic matrix

[23]:

plt.imshow(np.where(nmf_model.theta > 1, 1, nmf_model.theta), aspect=nmf_model.theta.shape[1]/nmf_model.theta.shape[0], interpolation=None)
cbar = plt.colorbar()

../_images/notebooks_example_nmf_25_0.png

Plots to view histograms of individual topics or whole matrices.

Whole matrix

[24]:

plt.hist(nmf_model.theta.ravel())

[24]:

(array([8.14293e+05, 2.89050e+04, 3.71900e+03, 1.12200e+03, 7.99000e+02,
        6.88000e+02, 2.22000e+02, 5.70000e+01, 2.20000e+01, 1.30000e+01]),
 array([1.25888058e-15, 1.25888058e+00, 2.51776115e+00, 3.77664173e+00,
        5.03552231e+00, 6.29440288e+00, 7.55328346e+00, 8.81216404e+00,
        1.00710446e+01, 1.13299252e+01, 1.25888058e+01]),
 <BarContainer object of 10 artists>)

../_images/notebooks_example_nmf_28_1.png

Just the first topic

[25]:

plt.hist(nmf_model.theta[:, 0].ravel())

[25]:

(array([3.8342e+04, 1.9830e+03, 4.9800e+02, 2.7700e+02, 2.8300e+02,
        3.8400e+02, 3.9600e+02, 2.4500e+02, 7.3000e+01, 1.1000e+01]),
 array([1.27215510e-15, 2.27702815e-01, 4.55405629e-01, 6.83108444e-01,
        9.10811259e-01, 1.13851407e+00, 1.36621689e+00, 1.59391970e+00,
        1.82162252e+00, 2.04932533e+00, 2.27702815e+00]),
 <BarContainer object of 10 artists>)

../_images/notebooks_example_nmf_30_1.png

Or the gene contributions to the first topic

[26]:

plt.hist(nmf_model.phi_rna[0, :].ravel())

[26]:

(array([8.5861e+04, 1.9619e+04, 7.0380e+03, 2.5430e+03, 9.1300e+02,
        3.3200e+02, 1.2300e+02, 3.4000e+01, 2.2000e+01, 5.0000e+00]),
 array([7.91979667e-16, 5.09246303e-01, 1.01849261e+00, 1.52773891e+00,
        2.03698521e+00, 2.54623152e+00, 3.05547782e+00, 3.56472412e+00,
        4.07397042e+00, 4.58321673e+00, 5.09246303e+00]),
 <BarContainer object of 10 artists>)

../_images/notebooks_example_nmf_32_1.png

Or the chromtin peak contributions to the first topic

[27]:

plt.hist(nmf_model.phi_atac[0, :].ravel())

[27]:

(array([1.2585e+04, 4.5900e+02, 1.5500e+02, 8.7000e+01, 4.1000e+01,
        4.4000e+01, 3.1000e+01, 1.7000e+01, 7.0000e+00, 5.0000e+00]),
 array([1.07812024e-15, 3.34391016e-01, 6.68782032e-01, 1.00317305e+00,
        1.33756406e+00, 1.67195508e+00, 2.00634610e+00, 2.34073711e+00,
        2.67512813e+00, 3.00951914e+00, 3.34391016e+00]),
 <BarContainer object of 10 artists>)

../_images/notebooks_example_nmf_34_1.png

Plot to view loss over time

[28]:

fig, ax = plt.subplots()

ax.plot(np.cumsum(nmf_model.epoch_times), nmf_model.loss)

[28]:

[<matplotlib.lines.Line2D at 0x7f8140541970>]

../_images/notebooks_example_nmf_36_1.png

Using Scanpys plotting methods

For this dataset there are ground truth labels (although I think I’ve got a version without cell labels here). Below is an example of how to visualise the NMF embedding of the data using UMAP. This closely mirrors this tutorial https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html.

[29]:

atac_mat.obsm['nmf'] = nmf_model.theta
rna_mat.obsm['nmf'] = nmf_model.theta

[30]:

atac_mat

[30]:

AnnData object with n_obs × n_vars = 42492 × 13431
    obs: 'batch', 'size_factors', 'cell_type'
    var: 'gene_ids', 'feature_types'
    uns: 'dataset_id', 'organism'
    obsm: 'nmf'
    layers: 'counts'

[31]:

sc.pp.neighbors(atac_mat, use_rep='nmf', n_pcs=100)
sc.tl.umap(atac_mat)
sc.tl.leiden(atac_mat)

[32]:

atac_mat.obs['max_topic'] = np.argmax(atac_mat.obsm['nmf'], axis=1)
atac_mat.obs['max_topic']  = atac_mat.obs['max_topic'].astype("category")

[33]:

sc.pl.embedding(atac_mat, 'X_umap', color=['batch', 'leiden', 'max_topic', 'cell_type'])

../_images/notebooks_example_nmf_42_0.png

[34]:

for i in range(atac_mat.obsm['nmf'].shape[1]):
    plt.scatter(atac_mat.obsm['X_umap'][:, 0], atac_mat.obsm['X_umap'][:, 1], c=atac_mat.obsm['nmf'][:, i], s=0.5)
    plt.colorbar()
    plt.show()

../_images/notebooks_example_nmf_43_0.png

../_images/notebooks_example_nmf_43_1.png

../_images/notebooks_example_nmf_43_2.png

../_images/notebooks_example_nmf_43_3.png

../_images/notebooks_example_nmf_43_4.png

../_images/notebooks_example_nmf_43_5.png

../_images/notebooks_example_nmf_43_6.png

../_images/notebooks_example_nmf_43_7.png

../_images/notebooks_example_nmf_43_8.png

../_images/notebooks_example_nmf_43_9.png

../_images/notebooks_example_nmf_43_10.png

../_images/notebooks_example_nmf_43_11.png

../_images/notebooks_example_nmf_43_12.png

../_images/notebooks_example_nmf_43_13.png

../_images/notebooks_example_nmf_43_14.png

../_images/notebooks_example_nmf_43_15.png

../_images/notebooks_example_nmf_43_16.png

../_images/notebooks_example_nmf_43_17.png

../_images/notebooks_example_nmf_43_18.png

../_images/notebooks_example_nmf_43_19.png

Or use PCA embedding to visualise topic scores

[35]:

sc.tl.pca(atac_mat, svd_solver='arpack')
sc.pp.neighbors(atac_mat, n_neighbors=50, n_pcs=100, use_rep='X_pca', key_added='pca_nb')
sc.tl.umap(atac_mat, neighbors_key='pca_nb')

[36]:

sc.pl.embedding(atac_mat, 'X_umap', color=['batch', 'leiden', 'max_topic', 'cell_type'])

../_images/notebooks_example_nmf_46_0.png

Example 2 - 10x HDF5 data

10x data pbmc 10k sorted. HDF5 https://support.10xgenomics.com/single-cell-multiome-atac-gex/software/pipelines/latest/advanced/h5_matrices

[37]:

import collections
import scipy.sparse as sp_sparse
import tables
import scanpy as sc
CountMatrix = collections.namedtuple('CountMatrix', ['feature_ref', 'barcodes', 'matrix'])

function from above link

[38]:

def get_matrix_from_h5(filename):
    with tables.open_file(filename, 'r') as f:
        mat_group = f.get_node(f.root, 'matrix')
        barcodes = f.get_node(mat_group, 'barcodes').read()
        data = getattr(mat_group, 'data').read()
        indices = getattr(mat_group, 'indices').read()
        indptr = getattr(mat_group, 'indptr').read()
        shape = getattr(mat_group, 'shape').read()
        matrix = sp_sparse.csc_matrix((data, indices, indptr), shape=shape)

        feature_ref = {}
        feature_group = f.get_node(mat_group, 'features')
        feature_ids = getattr(feature_group, 'id').read()
        feature_names = getattr(feature_group, 'name').read()
        feature_types = getattr(feature_group, 'feature_type').read()
        feature_ref['id'] = feature_ids
        feature_ref['name'] = feature_names
        feature_ref['feature_type'] = feature_types
        tag_keys = [key.decode() for key in getattr(feature_group, '_all_tag_keys').read()]
        for key in tag_keys:
            feature_ref[key] = getattr(feature_group, key).read()

        return CountMatrix(feature_ref, barcodes, matrix)

[39]:

filtered_matrix_h5 = "data/10x/hdf5/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5"
filtered_feature_bc_matrix = get_matrix_from_h5(filtered_matrix_h5)

[40]:

filtered_feature_bc_matrix[0]

[40]:

{'id': array([b'ENSG00000243485', b'ENSG00000237613', b'ENSG00000186092', ...,
        b'KI270713.1:29624-30457', b'KI270713.1:31340-32243',
        b'KI270713.1:36927-37836'], dtype='|S26'),
 'name': array([b'MIR1302-2HG', b'FAM138A', b'OR4F5', ...,
        b'KI270713.1:29624-30457', b'KI270713.1:31340-32243',
        b'KI270713.1:36927-37836'], dtype='|S26'),
 'feature_type': array([b'Gene Expression', b'Gene Expression', b'Gene Expression', ...,
        b'Peaks', b'Peaks', b'Peaks'], dtype='|S15'),
 'genome': array([b'GRCh38', b'GRCh38', b'GRCh38', ..., b'GRCh38', b'GRCh38',
        b'GRCh38'], dtype='|S6'),
 'interval': array([b'chr1:29553-30267', b'chr1:36080-36081', b'chr1:65418-69055', ...,
        b'KI270713.1:29624-30457', b'KI270713.1:31340-32243',
        b'KI270713.1:36927-37836'], dtype='|S26')}

[41]:

filtered_feature_bc_matrix[1]

[41]:

array([b'AAACAGCCAAGGAATC-1', b'AAACAGCCAATCCCTT-1',
       b'AAACAGCCAATGCGCT-1', ..., b'TTTGTTGGTTGCAGTA-1',
       b'TTTGTTGGTTGGTTAG-1', b'TTTGTTGGTTTGCAGA-1'], dtype='|S18')

[42]:

filtered_feature_bc_matrix[2]

[42]:

<180488x11898 sparse matrix of type '<class 'numpy.int32'>'
        with 127525374 stored elements in Compressed Sparse Column format>

[43]:

multi_10x = ad.AnnData( filtered_feature_bc_matrix[2].T,
                        filtered_feature_bc_matrix[1],
                        filtered_feature_bc_matrix[0])

[44]:

multi_10x.var

[44]:

	id	name	feature_type	genome	interval
0	b'ENSG00000243485'	b'MIR1302-2HG'	b'Gene Expression'	b'GRCh38'	b'chr1:29553-30267'
1	b'ENSG00000237613'	b'FAM138A'	b'Gene Expression'	b'GRCh38'	b'chr1:36080-36081'
2	b'ENSG00000186092'	b'OR4F5'	b'Gene Expression'	b'GRCh38'	b'chr1:65418-69055'
3	b'ENSG00000238009'	b'AL627309.1'	b'Gene Expression'	b'GRCh38'	b'chr1:120931-133723'
4	b'ENSG00000239945'	b'AL627309.3'	b'Gene Expression'	b'GRCh38'	b'chr1:91104-91105'
...	...	...	...	...	...
180483	b'KI270713.1:21434-22340'	b'KI270713.1:21434-22340'	b'Peaks'	b'GRCh38'	b'KI270713.1:21434-22340'
180484	b'KI270713.1:26201-26826'	b'KI270713.1:26201-26826'	b'Peaks'	b'GRCh38'	b'KI270713.1:26201-26826'
180485	b'KI270713.1:29624-30457'	b'KI270713.1:29624-30457'	b'Peaks'	b'GRCh38'	b'KI270713.1:29624-30457'
180486	b'KI270713.1:31340-32243'	b'KI270713.1:31340-32243'	b'Peaks'	b'GRCh38'	b'KI270713.1:31340-32243'
180487	b'KI270713.1:36927-37836'	b'KI270713.1:36927-37836'	b'Peaks'	b'GRCh38'	b'KI270713.1:36927-37836'

180488 rows × 5 columns

[45]:

rna_mat_10x = multi_10x[:, multi_10x.var['feature_type'] == b'Gene Expression']

[46]:

atac_mat_10x = multi_10x[:, multi_10x.var['feature_type'] == b'Peaks']

[47]:

rna_mat_10x

[47]:

View of AnnData object with n_obs × n_vars = 11898 × 36601
    obs: 0
    var: 'id', 'name', 'feature_type', 'genome', 'interval'

[48]:

atac_mat_10x

[48]:

View of AnnData object with n_obs × n_vars = 11898 × 143887
    obs: 0
    var: 'id', 'name', 'feature_type', 'genome', 'interval'

scanpys read_10x_h5 function does not appear to work. Only loads atac features.

[49]:

h5_10x= sc.read_10x_h5(filtered_matrix_h5)
h5_10x.var

Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.

[49]:

	gene_ids	feature_types	genome
MIR1302-2HG	ENSG00000243485	Gene Expression	GRCh38
FAM138A	ENSG00000237613	Gene Expression	GRCh38
OR4F5	ENSG00000186092	Gene Expression	GRCh38
AL627309.1	ENSG00000238009	Gene Expression	GRCh38
AL627309.3	ENSG00000239945	Gene Expression	GRCh38
...	...	...	...
AC141272.1	ENSG00000277836	Gene Expression	GRCh38
AC023491.2	ENSG00000278633	Gene Expression	GRCh38
AC007325.1	ENSG00000276017	Gene Expression	GRCh38
AC007325.4	ENSG00000278817	Gene Expression	GRCh38
AC007325.2	ENSG00000277196	Gene Expression	GRCh38

36601 rows × 3 columns

[50]:

atac_mat_tfidf_log = log_tf_idf(atac_mat_10x.X)
rna_mat_tfidf_log = log_tf_idf(rna_mat_10x.X)

[51]:

nmf_model_10x = intNMF(20, epochs = 20)
nmf_model_10x.fit(rna_mat_tfidf_log, atac_mat_tfidf_log)

[52]:

fig, ax = plt.subplots()

ax.plot(np.cumsum(nmf_model_10x.epoch_times), nmf_model_10x.loss)

[52]:

[<matplotlib.lines.Line2D at 0x7f806c063490>]

../_images/notebooks_example_nmf_65_1.png

[53]:

nmf_model_10x.theta.shape

[53]:

(11898, 20)

[54]:

nmf_model_10x.phi_rna.shape
nmf_model_10x.phi_atac.shape

[54]:

(20, 143887)

split into two matrices

Example 3 - 10x ‘dir’ data

10x data pbmc 10k sorted. Dir

[55]:

mat_dir = "data/10x/dir/"

[56]:

mat = mat = scipy.io.mmread(mat_dir + 'matrix.mtx.gz' )

[57]:

obs = pd.read_csv(mat_dir + 'barcodes.tsv.gz', header=None, sep='\t')
var = pd.read_csv(mat_dir + 'features.tsv.gz', names=['feature', 'name', 'feature_type', 'chr', 'str', 'stop'], sep='\t')

[58]:

mat

[58]:

<180488x11898 sparse matrix of type '<class 'numpy.int64'>'
        with 127525374 stored elements in COOrdinate format>

[59]:

obs

[59]:

	0
0	AAACAGCCAAGGAATC-1
1	AAACAGCCAATCCCTT-1
2	AAACAGCCAATGCGCT-1
3	AAACAGCCACACTAAT-1
4	AAACAGCCACCAACCG-1
...	...
11893	TTTGTTGGTGTTAAAC-1
11894	TTTGTTGGTTAGGATT-1
11895	TTTGTTGGTTGCAGTA-1
11896	TTTGTTGGTTGGTTAG-1
11897	TTTGTTGGTTTGCAGA-1

11898 rows × 1 columns

[60]:

var

[60]:

	feature	name	feature_type	chr	str	stop
0	ENSG00000243485	MIR1302-2HG	Gene Expression	chr1	29553	30267
1	ENSG00000237613	FAM138A	Gene Expression	chr1	36080	36081
2	ENSG00000186092	OR4F5	Gene Expression	chr1	65418	69055
3	ENSG00000238009	AL627309.1	Gene Expression	chr1	120931	133723
4	ENSG00000239945	AL627309.3	Gene Expression	chr1	91104	91105
...	...	...	...	...	...	...
180483	KI270713.1:21434-22340	KI270713.1:21434-22340	Peaks	KI270713.1	21434	22340
180484	KI270713.1:26201-26826	KI270713.1:26201-26826	Peaks	KI270713.1	26201	26826
180485	KI270713.1:29624-30457	KI270713.1:29624-30457	Peaks	KI270713.1	29624	30457
180486	KI270713.1:31340-32243	KI270713.1:31340-32243	Peaks	KI270713.1	31340	32243
180487	KI270713.1:36927-37836	KI270713.1:36927-37836	Peaks	KI270713.1	36927	37836

180488 rows × 6 columns

[61]:

multi_10x_dir = ad.AnnData(mat.T,
                           obs,
                           var)

/srv/conda/envs/saturn/lib/python3.9/site-packages/anndata/_core/anndata.py:120: ImplicitModificationWarning: Transforming to str index.
  warnings.warn("Transforming to str index.", ImplicitModificationWarning)

[62]:

multi_10x_dir

[62]:

AnnData object with n_obs × n_vars = 11898 × 180488
    obs: 0
    var: 'feature', 'name', 'feature_type', 'chr', 'str', 'stop'

[63]:

rna_dir = multi_10x[:, multi_10x.var['feature_type'] == b'Gene Expression']

[64]:

atac_dir = multi_10x[:, multi_10x.var['feature_type'] == b'Peaks']

[65]:

atac_mat_tfidf_log = log_tf_idf(atac_dir.X)
rna_mat_tfidf_log = log_tf_idf(rna_dir.X)

[66]:

nmf_model_10x_dir = intNMF(20, epochs = 20, init = 'random')
nmf_model_10x_dir.fit(rna_mat_tfidf_log, atac_mat_tfidf_log)

[67]:

fig, ax = plt.subplots()

ax.plot(np.cumsum(nmf_model_10x.epoch_times), nmf_model_10x.loss)

[67]:

[<matplotlib.lines.Line2D at 0x7f800e618790>]

../_images/notebooks_example_nmf_82_1.png

[68]:

nmf_model_10x.theta.shape

[68]:

(11898, 20)

[69]:

print(nmf_model_10x.phi_rna.shape)
nmf_model_10x.phi_atac.shape

(20, 36601)

[69]:

(20, 143887)

[ ]: