Large-scale graph analysis for genomic data

Biological networks encode the results of high-throughput experiments in massive, heterogeneous structures called knowledge graphs. Navigating and analyzing these graphs is valuable because it can lead to new knowledge and biological insights. One such graph is KnowEng, which encodes gene-gene and gene-function relationships. An important task on this graph is predicting gene set membership: given a set of genes known to be related to one another through, say, co-occurrence in a disease, is it possible to predict which other genes in the KnowEng graph are likely to belong to the set? We navigate the graph to identify topological similarity among the genes of the given set, which can then be used as a signature to predict other genes in the set.
This work was done in collaboration with Mayo Clinic, as part of the NIH-sponsored BD2K (Big Data to Knowledge) center, KnowEng, at UIUC.

Geneset MAPR scales to very large heterogeneous knowledge graphs.

The methods employed in Geneset MAPR can be applied to knowledge graphs other than the KnowEng knowledge graph for which it was designed. They can be used for community detection, anomaly/outlier detection, and similar analyses in knowledge graphs. Geneset MAPR has been used to find new genes that contribute to breast cancer, using the breast cancer data set provided by Mayo Clinic.

We have invented Geneset MAPR, a software methodology and tool for predicting gene set membership in a heterogeneous knowledge graph. Techniques previously used for similar tasks, such as random walk with restart, ignore the heterogeneity of the graph: they treat all paths as the same. With Geneset MAPR, we have shown that different edge types contribute to different, better notions of similarity. This can be used more generally to find similarity among the nodes of very large graphs.
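To make the contrast concrete, the following is a minimal sketch of random walk with restart on a toy typed graph, run once with all edge types treated alike and once with per-type weights. It is an illustration only and not the Geneset MAPR implementation: the node names, the edge types `coexpression` and `text_mining`, the weight values, and the functions `rwr_scores` and `rank_candidates` are all hypothetical, and the weights are set by hand rather than learned.

```python
def rwr_scores(edges, seeds, type_weights=None, restart=0.3, iters=100):
    """Random walk with restart over an undirected edge list with typed edges.

    edges:        iterable of (node_u, node_v, edge_type)
    seeds:        set of known member nodes; the walk restarts at these
    type_weights: optional {edge_type: weight}; None treats all types alike
    """
    edges = list(edges)
    seeds = set(seeds)
    nodes = seeds | {u for u, _, _ in edges} | {v for _, v, _ in edges}

    # Weighted adjacency: adj[u][v] is the summed weight of edges between u and v.
    adj = {n: {} for n in nodes}
    for u, v, etype in edges:
        w = 1.0 if type_weights is None else type_weights.get(etype, 0.0)
        if w > 0:
            adj[u][v] = adj[u].get(v, 0.0) + w
            adj[v][u] = adj[v].get(u, 0.0) + w

    # Restart distribution: uniform over the seed set.
    restart_vec = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(restart_vec)

    for _ in range(iters):
        nxt = {n: restart * restart_vec[n] for n in nodes}
        for u in nodes:
            out = sum(adj[u].values())
            if out == 0:
                # Node with no usable edges: send its walk mass back to the seeds.
                for s in seeds:
                    nxt[s] += (1 - restart) * p[u] / len(seeds)
                continue
            for v, w in adj[u].items():
                nxt[v] += (1 - restart) * p[u] * w / out
        p = nxt
    return p


def rank_candidates(scores, seeds):
    """Non-seed nodes ordered from highest to lowest score."""
    return sorted((n for n in scores if n not in seeds),
                  key=lambda n: scores[n], reverse=True)


# Toy graph: two seed genes, two candidates reached through different edge types.
toy_edges = [
    ("S1", "X", "coexpression"),
    ("S2", "X", "coexpression"),
    ("S1", "Y", "text_mining"),
    ("S2", "Y", "text_mining"),
]
toy_seeds = {"S1", "S2"}

# Type-blind walk: every edge counts the same.
uniform = rwr_scores(toy_edges, toy_seeds)

# Type-aware walk: hand-picked (assumed) weights favor one edge type.
typed = rwr_scores(toy_edges, toy_seeds,
                   type_weights={"coexpression": 1.0, "text_mining": 0.1})

print("type-blind:", uniform["X"], uniform["Y"])
print("type-aware ranking:", rank_candidates(typed, toy_seeds))
```

In the type-blind run the two candidates are indistinguishable, because they are connected to the seeds in the same way once edge types are ignored. Weighting the edge types separates them, which is the kind of signal a heterogeneity-aware method can exploit.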