In cancer research, we typically have a large volume of data, for example, from genome-wide expression, copy number, or methylation measurements, but a relatively small number of samples. This imbalance prevents the use of many data-mining algorithms, and we end up in a familiar situation in which genome-wide measurements result in the identification of a single gene or another feature, such as a mutation, instead of giving a deeper understanding of what is occurring in the gene regulatory network of cancerous cells. Thus, we argue that domain knowledge must be included to limit the degrees of freedom to a level at which machine learning and big data can be fully used in cancer research.
One particularly successful way of introducing domain knowledge to genome-wide data is to use gene regulatory, signaling, metabolic, or context-specific pathways. These pathways describe how genes relate to each other and often work in concert towards a phenotypic action, such as cell movement, or something more intangible, such as growth signaling, which gives the interactions within the cell a biological context. To return to the previous example, measurements of genes are much like measurements of the radio’s components or other properties: they are ambiguous without context and do not offer much help in understanding how the entire system works or why it does not work the way it should. Continuing with the analogy, pathways give genes a context in which they influence the system, much as circuit diagrams show the connections between the components in the radio and are an inarguably valuable tool for understanding and fixing the device.
Pathway activity or inactivity is often more interesting than individual genes or mutations. This is not only because different aberrations can produce similar cancer phenotypes, as in the case of the mouse double minute 2 homolog (MDM2)-p53 interaction, but also because pathways are closer to the phenotype that is often the context of interest in cancer research, such as “are the cells dividing after the treatment or not?” Computationally, the appealing feature of pathways is that they can reduce the dimensionality of gene expression data from tens of thousands of genes to hundreds of pathways. This reduction also makes genome-wide data much more digestible for researchers. Therefore, in many applications, researchers prefer to work on the pathway level rather than on the gene level, and many different pathway analyses have been implemented to date.
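As a minimal sketch of this reduction, the following Python example collapses a gene-level expression table into pathway-level scores. The scoring rule (the mean of per-gene z-scores over the members of a pathway), the function names, the toy expression values, and the gene-to-pathway assignments are all illustrative assumptions rather than a method described in the text; published pathway analyses use a variety of more elaborate statistics.

```python
from statistics import mean, pstdev


def zscore_genes(expression):
    """Standardize each gene across samples.

    expression: dict mapping gene name -> list of per-sample values.
    """
    z = {}
    for gene, values in expression.items():
        mu, sigma = mean(values), pstdev(values)
        # A gene with no variation carries no information; score it as 0.
        z[gene] = [0.0 if sigma == 0 else (v - mu) / sigma for v in values]
    return z


def pathway_scores(expression, pathways):
    """Return dict mapping pathway name -> list of per-sample scores.

    Each score is the mean z-score of the pathway's measured member genes
    (an assumed, deliberately simple summary statistic).
    """
    z = zscore_genes(expression)
    n_samples = len(next(iter(expression.values())))
    scores = {}
    for name, genes in pathways.items():
        measured = [g for g in genes if g in z]
        if not measured:
            continue  # skip pathways with no measured genes
        scores[name] = [
            mean(z[g][i] for g in measured) for i in range(n_samples)
        ]
    return scores


# Toy data: 4 genes x 3 samples (hypothetical values and gene sets).
expression = {
    "TP53": [1.0, 2.0, 3.0],
    "MDM2": [3.0, 2.0, 1.0],
    "CDK4": [2.0, 4.0, 6.0],
    "CCND1": [1.0, 3.0, 5.0],
}
pathways = {
    "p53_signaling": ["TP53", "MDM2"],
    "cell_cycle": ["CDK4", "CCND1", "RB1"],  # RB1 is unmeasured here
}
scores = pathway_scores(expression, pathways)
```

With real data, the same mapping would turn tens of thousands of gene rows into a few hundred pathway rows. Note that averaging lets opposing changes within a pathway cancel out, which is one reason more refined scoring schemes exist.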
Recent genomic and molecular characterizations of cancer, especially the findings reported by the TCGA project, have shed light on cancer heterogeneity and potential targeted therapeutics, for example by recognizing new subtypes of gastric cancer [6]. In general, targeted computational methods, which make effective use of the available multimodal biological information, can significantly improve our ability to identify candidate biomarkers and targets and to conduct functional analyses [7]. For example, reducing the redundancy in enrichment analysis helps to reveal gene ontology modules efficiently and systematically [8].
Example 2. Using pathway-level information connected with TCGA data
To become malignant cancer cells, normal cells must acquire a set of mutations that confer “hallmark” traits, such as increased proliferation, immortality, and invasiveness [9]. Usually, a single mutation is not enough to result in malignant growth; instead, several genes contributing to the process need to be “hit” before a pivotal phenotypic change takes place. The acquisition of these traits can be conveniently described and understood as alterations in pathways. In many settings, cancers are classified based on the status of a single (often actionable) gene, as in HER2 amplification-positive breast cancers and KIT mutation-positive gastrointestinal tumors, but almost all cancers have a characteristic set of somatic mutations that can be used to identify and classify the tumors and even to learn something about their clinical behavior.
After profiling a cohort of tumors, such as in the TCGA project, the common follow-up analysis is to cluster the tumors into subgroups based on genomic features, such as gene expression. Sometimes the projection of clinical data on top of the clusters reveals different clinical courses for the patients in each group, but more often, there is no clear difference in physiology. A similar deduction can also start from the gene expression differences that distinguish each group, but we do not currently understand the function of many genes, and even when we do, we do not know how to connect their molecular functions to the tumor phenotype. Furthermore, gene expression data can be quite different for many similar tumors because aberrations in different genes can cause similar phenotypic effects.
Statistical tools for investigating set enrichment can reveal hidden trends in gene lists that differ between tumor subgroups; for example, surprisingly many radiation resistance-related genes might differentiate one subgroup from the others. The set enrichment tools also have the advantage of not being very sensitive to noisy gene expression data. Pathway-level results might hint at why the clinical course for these patients differed from that of others. If enrichment analysis is performed systematically for hundreds of pathways, those data can be used to profile each subgroup. Alternatively, if we compute the enriched pathways for each sample prior to clustering, we can use the enrichment data to cluster tumors into subgroups that might be easier to interpret and understand. Pathway analysis offers intriguing opportunities; for example, if we know that the pathway activation profiles of two subgroups of different cancers are similar, we might hypothesize that both can be treated effectively by the same drug [10].
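One common way to test a gene list for set enrichment is an over-representation test based on the hypergeometric distribution. The sketch below illustrates that approach under stated assumptions and is not necessarily the tool the authors have in mind; the function name and the gene counts in the usage example are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_pathway, n_selected, n_overlap):
    """Upper-tail hypergeometric p-value for pathway over-representation.

    Returns P(X >= n_overlap), where X is the number of pathway genes seen
    when n_selected genes are drawn without replacement from a background
    of n_background genes, n_pathway of which belong to the pathway.
    """
    total = comb(n_background, n_selected)
    upper = min(n_pathway, n_selected)  # largest possible overlap
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 background genes, a 50-gene pathway, and a
# 200-gene list that distinguishes one tumor subgroup. Eight pathway genes
# in the list is far more than the 0.5 expected by chance (200 * 50 / 20000).
p_value = hypergeom_enrichment_p(20000, 50, 200, 8)
```

Repeating this test for hundreds of pathways yields the kind of subgroup profile described above, but the resulting p-values would then need a multiple-testing correction before interpretation.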