intNMF examples

Load required libraries. int_nmf_model must be in the same directory. If it is not it can be added to pythons path

[10]:
import anndata as ad
import scanpy as sc
import numpy as np
import scipy
from scripts.int_nmf_model import intNMF, log_tf_idf
import anndata as ad
from matplotlib import pyplot as plt
import scipy.io
import pandas as pd

This notebook contains some examples of what intNMF can do including how to load multiome data stored as three different file types.

Example 1 - anndata

Probably the simplest case where the multiome data is stored as anndata object. https://anndata.readthedocs.io/en/latest/. Dataset from recent neurips competition

Add file locations and load the data

[11]:
atac_loc = 'data/anndata/openproblems_bmmc_multiome_phase2.censor_dataset.output_mod1.h5ad'  #comp data multi
rna_loc = 'data/anndata/openproblems_bmmc_multiome_phase2.censor_dataset.output_mod2.h5ad'
res_loc = 'data/anndata/openproblems_bmmc_multiome_phase2.censor_dataset.output_solution.h5ad'
[12]:
res
[12]:
'data/anndata/openproblems_bmmc_multiome_phase2.censor_dataset.output_solution.h5ad'
[13]:
atac_mat = ad.read_h5ad(atac_loc)
rna_mat = ad.read_h5ad(rna_loc)
[14]:
res_mat =  ad.read_h5ad(res_loc)
[15]:
res_mat
[15]:
AnnData object with n_obs × n_vars = 42492 × 13431
    obs: 'batch', 'cell_type', 'pseudotime_order_GEX', 'pseudotime_order_ATAC'
    var: 'gene_ids', 'feature_types'
    uns: 'dataset_id', 'organism'
[16]:
atac_mat.obs['cell_type'] = res_mat.obs['cell_type']

In this case the atac data is binary and the rna data are real numbers. This data has already been preprocessed and filtered. Before carrying out the matrix factorisation log_tf_idf transform the data

[17]:
atac_mat_tfidf_log = log_tf_idf(atac_mat.X)
rna_mat_tfidf_log = log_tf_idf(rna_mat.X)

initialise the nmf model then fit the model. This minimises the error X - WH for the atac and rna matrices respectively

[18]:
nmf_model = intNMF(20, epochs = 20, init = 'random')
nmf_model.fit(rna_mat_tfidf_log, atac_mat_tfidf_log)

Theta holds the topic activities in each cell. Numpy array of shape cell x topic

[19]:
nmf_model.theta.shape
[19]:
(42492, 20)

phi rna and phi atac hold the topic loadings or ‘definitions’ which facilitate recontruction of X from the latent topics. Numpy array of shape topic x feature (e.g. accessible chromatin or gene expression)

[20]:
print(nmf_model.phi_rna.shape)
nmf_model.phi_atac.shape
(20, 116490)
[20]:
(20, 13431)

Example plots to view the NMF matrices

View topic x gene matrix. These matrices have been thresholded at one to improve visualisation

[21]:
plt.imshow(np.where(nmf_model.phi_rna > 1, 1, nmf_model.phi_rna), aspect=nmf_model.phi_rna.shape[1]/nmf_model.phi_rna.shape[0], interpolation=None)
cbar = plt.colorbar()
../_images/notebooks_example_nmf_21_0.png

View topic x peak matrix.

[22]:
plt.imshow(np.where(nmf_model.phi_atac > 1, 1, nmf_model.phi_atac), aspect=nmf_model.phi_atac.shape[1]/nmf_model.phi_atac.shape[0], interpolation=None)
cbar = plt.colorbar()
../_images/notebooks_example_nmf_23_0.png

View cell x topic matrix

[23]:
plt.imshow(np.where(nmf_model.theta > 1, 1, nmf_model.theta), aspect=nmf_model.theta.shape[1]/nmf_model.theta.shape[0], interpolation=None)
cbar = plt.colorbar()
../_images/notebooks_example_nmf_25_0.png

Plots to view histograms of individual topics or whole matrices.

Whole matrix

[24]:
plt.hist(nmf_model.theta.ravel())
[24]:
(array([8.14293e+05, 2.89050e+04, 3.71900e+03, 1.12200e+03, 7.99000e+02,
        6.88000e+02, 2.22000e+02, 5.70000e+01, 2.20000e+01, 1.30000e+01]),
 array([1.25888058e-15, 1.25888058e+00, 2.51776115e+00, 3.77664173e+00,
        5.03552231e+00, 6.29440288e+00, 7.55328346e+00, 8.81216404e+00,
        1.00710446e+01, 1.13299252e+01, 1.25888058e+01]),
 <BarContainer object of 10 artists>)
../_images/notebooks_example_nmf_28_1.png

Just the first topic

[25]:
plt.hist(nmf_model.theta[:, 0].ravel())
[25]:
(array([3.8342e+04, 1.9830e+03, 4.9800e+02, 2.7700e+02, 2.8300e+02,
        3.8400e+02, 3.9600e+02, 2.4500e+02, 7.3000e+01, 1.1000e+01]),
 array([1.27215510e-15, 2.27702815e-01, 4.55405629e-01, 6.83108444e-01,
        9.10811259e-01, 1.13851407e+00, 1.36621689e+00, 1.59391970e+00,
        1.82162252e+00, 2.04932533e+00, 2.27702815e+00]),
 <BarContainer object of 10 artists>)
../_images/notebooks_example_nmf_30_1.png

Or the gene contributions to the first topic

[26]:
plt.hist(nmf_model.phi_rna[0, :].ravel())
[26]:
(array([8.5861e+04, 1.9619e+04, 7.0380e+03, 2.5430e+03, 9.1300e+02,
        3.3200e+02, 1.2300e+02, 3.4000e+01, 2.2000e+01, 5.0000e+00]),
 array([7.91979667e-16, 5.09246303e-01, 1.01849261e+00, 1.52773891e+00,
        2.03698521e+00, 2.54623152e+00, 3.05547782e+00, 3.56472412e+00,
        4.07397042e+00, 4.58321673e+00, 5.09246303e+00]),
 <BarContainer object of 10 artists>)
../_images/notebooks_example_nmf_32_1.png

Or the chromtin peak contributions to the first topic

[27]:
plt.hist(nmf_model.phi_atac[0, :].ravel())
[27]:
(array([1.2585e+04, 4.5900e+02, 1.5500e+02, 8.7000e+01, 4.1000e+01,
        4.4000e+01, 3.1000e+01, 1.7000e+01, 7.0000e+00, 5.0000e+00]),
 array([1.07812024e-15, 3.34391016e-01, 6.68782032e-01, 1.00317305e+00,
        1.33756406e+00, 1.67195508e+00, 2.00634610e+00, 2.34073711e+00,
        2.67512813e+00, 3.00951914e+00, 3.34391016e+00]),
 <BarContainer object of 10 artists>)
../_images/notebooks_example_nmf_34_1.png

Plot to view loss over time

[28]:
fig, ax = plt.subplots()

ax.plot(np.cumsum(nmf_model.epoch_times), nmf_model.loss)
[28]:
[<matplotlib.lines.Line2D at 0x7f8140541970>]
../_images/notebooks_example_nmf_36_1.png

Using Scanpys plotting methods

For this dataset there are ground truth labels (although I think I’ve got a version without cell labels here). Below is an example of how to visualise the NMF embedding of the data using UMAP. This closely mirrors this tutorial https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html.

[29]:
atac_mat.obsm['nmf'] = nmf_model.theta
rna_mat.obsm['nmf'] = nmf_model.theta
[30]:
atac_mat
[30]:
AnnData object with n_obs × n_vars = 42492 × 13431
    obs: 'batch', 'size_factors', 'cell_type'
    var: 'gene_ids', 'feature_types'
    uns: 'dataset_id', 'organism'
    obsm: 'nmf'
    layers: 'counts'
[31]:
sc.pp.neighbors(atac_mat, use_rep='nmf', n_pcs=100)
sc.tl.umap(atac_mat)
sc.tl.leiden(atac_mat)
[32]:
atac_mat.obs['max_topic'] = np.argmax(atac_mat.obsm['nmf'], axis=1)
atac_mat.obs['max_topic']  = atac_mat.obs['max_topic'].astype("category")
[33]:
sc.pl.embedding(atac_mat, 'X_umap', color=['batch', 'leiden', 'max_topic', 'cell_type'])
../_images/notebooks_example_nmf_42_0.png
[34]:
for i in range(atac_mat.obsm['nmf'].shape[1]):
    plt.scatter(atac_mat.obsm['X_umap'][:, 0], atac_mat.obsm['X_umap'][:, 1], c=atac_mat.obsm['nmf'][:, i], s=0.5)
    plt.colorbar()
    plt.show()
../_images/notebooks_example_nmf_43_0.png
../_images/notebooks_example_nmf_43_1.png
../_images/notebooks_example_nmf_43_2.png
../_images/notebooks_example_nmf_43_3.png
../_images/notebooks_example_nmf_43_4.png
../_images/notebooks_example_nmf_43_5.png
../_images/notebooks_example_nmf_43_6.png
../_images/notebooks_example_nmf_43_7.png
../_images/notebooks_example_nmf_43_8.png
../_images/notebooks_example_nmf_43_9.png
../_images/notebooks_example_nmf_43_10.png
../_images/notebooks_example_nmf_43_11.png
../_images/notebooks_example_nmf_43_12.png
../_images/notebooks_example_nmf_43_13.png
../_images/notebooks_example_nmf_43_14.png
../_images/notebooks_example_nmf_43_15.png
../_images/notebooks_example_nmf_43_16.png
../_images/notebooks_example_nmf_43_17.png
../_images/notebooks_example_nmf_43_18.png
../_images/notebooks_example_nmf_43_19.png

Or use PCA embedding to visualise topic scores

[35]:
sc.tl.pca(atac_mat, svd_solver='arpack')
sc.pp.neighbors(atac_mat, n_neighbors=50, n_pcs=100, use_rep='X_pca', key_added='pca_nb')
sc.tl.umap(atac_mat, neighbors_key='pca_nb')

[36]:
sc.pl.embedding(atac_mat, 'X_umap', color=['batch', 'leiden', 'max_topic', 'cell_type'])
../_images/notebooks_example_nmf_46_0.png

Example 2 - 10x HDF5 data

10x data pbmc 10k sorted. HDF5 https://support.10xgenomics.com/single-cell-multiome-atac-gex/software/pipelines/latest/advanced/h5_matrices

[37]:
import collections
import scipy.sparse as sp_sparse
import tables
import scanpy as sc
CountMatrix = collections.namedtuple('CountMatrix', ['feature_ref', 'barcodes', 'matrix'])

function from above link

[38]:
def get_matrix_from_h5(filename):
    with tables.open_file(filename, 'r') as f:
        mat_group = f.get_node(f.root, 'matrix')
        barcodes = f.get_node(mat_group, 'barcodes').read()
        data = getattr(mat_group, 'data').read()
        indices = getattr(mat_group, 'indices').read()
        indptr = getattr(mat_group, 'indptr').read()
        shape = getattr(mat_group, 'shape').read()
        matrix = sp_sparse.csc_matrix((data, indices, indptr), shape=shape)

        feature_ref = {}
        feature_group = f.get_node(mat_group, 'features')
        feature_ids = getattr(feature_group, 'id').read()
        feature_names = getattr(feature_group, 'name').read()
        feature_types = getattr(feature_group, 'feature_type').read()
        feature_ref['id'] = feature_ids
        feature_ref['name'] = feature_names
        feature_ref['feature_type'] = feature_types
        tag_keys = [key.decode() for key in getattr(feature_group, '_all_tag_keys').read()]
        for key in tag_keys:
            feature_ref[key] = getattr(feature_group, key).read()

        return CountMatrix(feature_ref, barcodes, matrix)
[39]:
filtered_matrix_h5 = "data/10x/hdf5/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5"
filtered_feature_bc_matrix = get_matrix_from_h5(filtered_matrix_h5)
[40]:
filtered_feature_bc_matrix[0]
[40]:
{'id': array([b'ENSG00000243485', b'ENSG00000237613', b'ENSG00000186092', ...,
        b'KI270713.1:29624-30457', b'KI270713.1:31340-32243',
        b'KI270713.1:36927-37836'], dtype='|S26'),
 'name': array([b'MIR1302-2HG', b'FAM138A', b'OR4F5', ...,
        b'KI270713.1:29624-30457', b'KI270713.1:31340-32243',
        b'KI270713.1:36927-37836'], dtype='|S26'),
 'feature_type': array([b'Gene Expression', b'Gene Expression', b'Gene Expression', ...,
        b'Peaks', b'Peaks', b'Peaks'], dtype='|S15'),
 'genome': array([b'GRCh38', b'GRCh38', b'GRCh38', ..., b'GRCh38', b'GRCh38',
        b'GRCh38'], dtype='|S6'),
 'interval': array([b'chr1:29553-30267', b'chr1:36080-36081', b'chr1:65418-69055', ...,
        b'KI270713.1:29624-30457', b'KI270713.1:31340-32243',
        b'KI270713.1:36927-37836'], dtype='|S26')}
[41]:
filtered_feature_bc_matrix[1]
[41]:
array([b'AAACAGCCAAGGAATC-1', b'AAACAGCCAATCCCTT-1',
       b'AAACAGCCAATGCGCT-1', ..., b'TTTGTTGGTTGCAGTA-1',
       b'TTTGTTGGTTGGTTAG-1', b'TTTGTTGGTTTGCAGA-1'], dtype='|S18')
[42]:
filtered_feature_bc_matrix[2]
[42]:
<180488x11898 sparse matrix of type '<class 'numpy.int32'>'
        with 127525374 stored elements in Compressed Sparse Column format>
[43]:
multi_10x = ad.AnnData( filtered_feature_bc_matrix[2].T,
                        filtered_feature_bc_matrix[1],
                        filtered_feature_bc_matrix[0])

[44]:
multi_10x.var
[44]:
id name feature_type genome interval
0 b'ENSG00000243485' b'MIR1302-2HG' b'Gene Expression' b'GRCh38' b'chr1:29553-30267'
1 b'ENSG00000237613' b'FAM138A' b'Gene Expression' b'GRCh38' b'chr1:36080-36081'
2 b'ENSG00000186092' b'OR4F5' b'Gene Expression' b'GRCh38' b'chr1:65418-69055'
3 b'ENSG00000238009' b'AL627309.1' b'Gene Expression' b'GRCh38' b'chr1:120931-133723'
4 b'ENSG00000239945' b'AL627309.3' b'Gene Expression' b'GRCh38' b'chr1:91104-91105'
... ... ... ... ... ...
180483 b'KI270713.1:21434-22340' b'KI270713.1:21434-22340' b'Peaks' b'GRCh38' b'KI270713.1:21434-22340'
180484 b'KI270713.1:26201-26826' b'KI270713.1:26201-26826' b'Peaks' b'GRCh38' b'KI270713.1:26201-26826'
180485 b'KI270713.1:29624-30457' b'KI270713.1:29624-30457' b'Peaks' b'GRCh38' b'KI270713.1:29624-30457'
180486 b'KI270713.1:31340-32243' b'KI270713.1:31340-32243' b'Peaks' b'GRCh38' b'KI270713.1:31340-32243'
180487 b'KI270713.1:36927-37836' b'KI270713.1:36927-37836' b'Peaks' b'GRCh38' b'KI270713.1:36927-37836'

180488 rows × 5 columns

[45]:
rna_mat_10x = multi_10x[:, multi_10x.var['feature_type'] == b'Gene Expression']
[46]:
atac_mat_10x = multi_10x[:, multi_10x.var['feature_type'] == b'Peaks']
[47]:
rna_mat_10x
[47]:
View of AnnData object with n_obs × n_vars = 11898 × 36601
    obs: 0
    var: 'id', 'name', 'feature_type', 'genome', 'interval'
[48]:
atac_mat_10x
[48]:
View of AnnData object with n_obs × n_vars = 11898 × 143887
    obs: 0
    var: 'id', 'name', 'feature_type', 'genome', 'interval'

scanpys read_10x_h5 function does not appear to work. Only loads atac features.

[49]:
h5_10x= sc.read_10x_h5(filtered_matrix_h5)
h5_10x.var
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
[49]:
gene_ids feature_types genome
MIR1302-2HG ENSG00000243485 Gene Expression GRCh38
FAM138A ENSG00000237613 Gene Expression GRCh38
OR4F5 ENSG00000186092 Gene Expression GRCh38
AL627309.1 ENSG00000238009 Gene Expression GRCh38
AL627309.3 ENSG00000239945 Gene Expression GRCh38
... ... ... ...
AC141272.1 ENSG00000277836 Gene Expression GRCh38
AC023491.2 ENSG00000278633 Gene Expression GRCh38
AC007325.1 ENSG00000276017 Gene Expression GRCh38
AC007325.4 ENSG00000278817 Gene Expression GRCh38
AC007325.2 ENSG00000277196 Gene Expression GRCh38

36601 rows × 3 columns

[50]:
atac_mat_tfidf_log = log_tf_idf(atac_mat_10x.X)
rna_mat_tfidf_log = log_tf_idf(rna_mat_10x.X)
[51]:
nmf_model_10x = intNMF(20, epochs = 20)
nmf_model_10x.fit(rna_mat_tfidf_log, atac_mat_tfidf_log)
[52]:
fig, ax = plt.subplots()

ax.plot(np.cumsum(nmf_model_10x.epoch_times), nmf_model_10x.loss)
[52]:
[<matplotlib.lines.Line2D at 0x7f806c063490>]
../_images/notebooks_example_nmf_65_1.png
[53]:
nmf_model_10x.theta.shape
[53]:
(11898, 20)
[54]:
nmf_model_10x.phi_rna.shape
nmf_model_10x.phi_atac.shape
[54]:
(20, 143887)

split into two matrices

Example 3 - 10x ‘dir’ data

10x data pbmc 10k sorted. Dir

[55]:
mat_dir = "data/10x/dir/"
[56]:
mat = mat = scipy.io.mmread(mat_dir + 'matrix.mtx.gz' )
[57]:
obs = pd.read_csv(mat_dir + 'barcodes.tsv.gz', header=None, sep='\t')
var = pd.read_csv(mat_dir + 'features.tsv.gz', names=['feature', 'name', 'feature_type', 'chr', 'str', 'stop'], sep='\t')
[58]:
mat
[58]:
<180488x11898 sparse matrix of type '<class 'numpy.int64'>'
        with 127525374 stored elements in COOrdinate format>
[59]:
obs
[59]:
0
0 AAACAGCCAAGGAATC-1
1 AAACAGCCAATCCCTT-1
2 AAACAGCCAATGCGCT-1
3 AAACAGCCACACTAAT-1
4 AAACAGCCACCAACCG-1
... ...
11893 TTTGTTGGTGTTAAAC-1
11894 TTTGTTGGTTAGGATT-1
11895 TTTGTTGGTTGCAGTA-1
11896 TTTGTTGGTTGGTTAG-1
11897 TTTGTTGGTTTGCAGA-1

11898 rows × 1 columns

[60]:
var
[60]:
feature name feature_type chr str stop
0 ENSG00000243485 MIR1302-2HG Gene Expression chr1 29553 30267
1 ENSG00000237613 FAM138A Gene Expression chr1 36080 36081
2 ENSG00000186092 OR4F5 Gene Expression chr1 65418 69055
3 ENSG00000238009 AL627309.1 Gene Expression chr1 120931 133723
4 ENSG00000239945 AL627309.3 Gene Expression chr1 91104 91105
... ... ... ... ... ... ...
180483 KI270713.1:21434-22340 KI270713.1:21434-22340 Peaks KI270713.1 21434 22340
180484 KI270713.1:26201-26826 KI270713.1:26201-26826 Peaks KI270713.1 26201 26826
180485 KI270713.1:29624-30457 KI270713.1:29624-30457 Peaks KI270713.1 29624 30457
180486 KI270713.1:31340-32243 KI270713.1:31340-32243 Peaks KI270713.1 31340 32243
180487 KI270713.1:36927-37836 KI270713.1:36927-37836 Peaks KI270713.1 36927 37836

180488 rows × 6 columns

[61]:
multi_10x_dir = ad.AnnData(mat.T,
                           obs,
                           var)

/srv/conda/envs/saturn/lib/python3.9/site-packages/anndata/_core/anndata.py:120: ImplicitModificationWarning: Transforming to str index.
  warnings.warn("Transforming to str index.", ImplicitModificationWarning)
[62]:
multi_10x_dir
[62]:
AnnData object with n_obs × n_vars = 11898 × 180488
    obs: 0
    var: 'feature', 'name', 'feature_type', 'chr', 'str', 'stop'
[63]:
rna_dir = multi_10x[:, multi_10x.var['feature_type'] == b'Gene Expression']
[64]:
atac_dir = multi_10x[:, multi_10x.var['feature_type'] == b'Peaks']
[65]:
atac_mat_tfidf_log = log_tf_idf(atac_dir.X)
rna_mat_tfidf_log = log_tf_idf(rna_dir.X)
[66]:
nmf_model_10x_dir = intNMF(20, epochs = 20, init = 'random')
nmf_model_10x_dir.fit(rna_mat_tfidf_log, atac_mat_tfidf_log)
[67]:
fig, ax = plt.subplots()

ax.plot(np.cumsum(nmf_model_10x.epoch_times), nmf_model_10x.loss)
[67]:
[<matplotlib.lines.Line2D at 0x7f800e618790>]
../_images/notebooks_example_nmf_82_1.png
[68]:
nmf_model_10x.theta.shape
[68]:
(11898, 20)
[69]:
print(nmf_model_10x.phi_rna.shape)
nmf_model_10x.phi_atac.shape
(20, 36601)
[69]:
(20, 143887)
[ ]: