Cancer3D Tutorial

Guidelines to interpret the results in Cancer3D

Both our algorithms, e-Driver and e-Drug, provide two-sided P values, derived from binomial and Wilcoxon tests respectively. As many others have noted, a P value below a predetermined threshold does not necessarily mean that there is a true effect. Moreover, one has to account for multiple testing, as Cancer3D provides information from multiple projects on 14,700 proteins (or over 100,000 when including all available isoforms).

We believe that simply applying a multiple-testing correction would not really help you interpret the results obtained with our algorithms. We also think that relying on a single predetermined P value threshold is not the best use of this website, and that you should take multiple testing into account in your analysis. This is why we created this section, which discusses the thresholds we think you should apply in each scenario.

a) Interpreting results from e-Driver

As explained in the paper describing e-Driver, this algorithm uses a binomial test to determine whether a protein functional region is enriched or depleted in mutations. The method assumes that all regions of a protein are equally likely to be mutated and that, therefore, any observed bias in the mutation rate within a protein is likely to be caused by the region being biologically relevant in the cancer samples being analyzed. We are aware that mutation rates in cancer genomes are not constant and that they correlate with various features, such as a gene's replication time or expression level. While important at genome-wide scale, these effects have never been shown to affect mutation rates at scales as small as the length of a single gene. In fact, methods that take these effects into account when looking for cancer-driver genes, such as MutSigCV, explicitly assume that mutation rates within a gene remain constant.
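To make the test above concrete, here is a minimal sketch of a two-sided binomial test for one region. It is not the e-Driver implementation. It assumes that the expected fraction of mutations in a region is simply the region length divided by the protein length, which is one reading of "all regions are equally likely to be mutated". The function names and the example numbers are hypothetical.

```python
from math import exp, lgamma, log


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials (computed in log space)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    log_coeff = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coeff + k * log(p) + (n - k) * log(1 - p))


def binomial_two_sided_p(k, n, p):
    """Exact two-sided P value: total probability of every outcome that is
    no more likely than the observed one."""
    pmf = [binomial_pmf(i, n, p) for i in range(n + 1)]
    observed = pmf[k]
    # Small relative tolerance so that floating-point ties are counted.
    total = sum(q for q in pmf if q <= observed * (1 + 1e-7))
    return min(1.0, total)


def region_p_value(region_mutations, protein_mutations,
                   region_length, protein_length):
    """Assumption: the expected share of mutations equals the share of
    the protein length covered by the region."""
    expected_fraction = region_length / protein_length
    return binomial_two_sided_p(region_mutations, protein_mutations,
                                expected_fraction)


# Hypothetical example: a 50-residue region in a 500-residue protein
# carrying 30 of the protein's 100 mutations (10 would be expected).
p_enriched = region_p_value(30, 100, 50, 500)
```

Because the test is two-sided, a region with far fewer mutations than expected also gets a low P value, so you need to check the direction of the effect separately.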

Another feature that could affect the results obtained by e-Driver is the GC content of the analyzed region. Since different cancer types have different mutation patterns (lung cancers are enriched in C -> A mutations, for example), it could be that e-Driver identifies a region because it has a bias in its GC content and, hence, is more "mutable" in a specific cancer type. However, we have shown that regions identified by e-Driver (at least in the Pancancer analysis) have the same GC content as the rest of the protein (Figure 1).

Figure 1 - GC content of the regions identified by e-Driver in the Pancancer analysis

Regarding multiple-testing corrections, we applied the Benjamini-Hochberg method to all the P values obtained by analyzing the longest isoform of each protein with mutations from the Pancancer dataset. The correlation between the P values and the False Discovery Rate (FDR) is shown in Figure 2. As shown there, any region with a P value below 1e-4 is likely to be a true positive. This corresponds, approximately, to an FDR < 0.1 (green area in Figure 2). This does not mean that all results with P values above 1e-4 are true negatives; there are likely true positives in this group. For example, any region with a P value below 1e-3 has an FDR below 0.3 in the Pancancer dataset (yellow area in Figure 2). If a region you are studying falls in this range, there certainly is a chance that it is important in cancer, but we encourage you to use other resources or datasets to confirm it. Finally, if you are analyzing data from an individual cancer project, the statistical power will be lower, as these projects have fewer mutations. In this context we recommend loosening the thresholds a little and extending the list of candidate regions to those with a P value below 0.01 (orange region in Figure 2), although, again, you should proceed with more caution the closer the P value is to that threshold. We do not think that you should consider any region with a P value above 0.01 as a likely positive unless you have other evidence supporting it.

Figure 2 - Correlation between P values and FDR in the Pancancer analysis
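The correction and thresholds discussed for Figure 2 can be sketched as follows. This is a generic, textbook Benjamini-Hochberg adjustment, not the code used to build Cancer3D. The tier names simply reuse the colors of Figure 2, and the function names are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values (FDR estimates),
    in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def figure2_tier(p_value):
    """Map an e-Driver P value to the colored areas described in the text."""
    if p_value < 1e-4:
        return "green"   # likely true positive (FDR < 0.1 in Pancancer)
    if p_value < 1e-3:
        return "yellow"  # FDR < 0.3 in Pancancer; confirm with other data
    if p_value < 1e-2:
        return "orange"  # suggested only for individual cancer projects
    return "none"        # not a likely positive without further evidence


# Toy input, for illustration only.
example_fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Note that the FDR values quoted in the text are specific to the Pancancer dataset. Running the same adjustment on a different set of P values will give a different mapping between P value and FDR.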

b) Interpreting results from e-Drug

We believe that in order to better understand the results from e-Drug you should take into account the following:

1) We currently only have data from the Cancer Cell Line Encyclopedia. This is an excellent dataset, but it "only" contains pharmacological data for around 500 cancer cell lines. This limits our statistical power, as most (81%) regions have three or fewer mutations (Figure 3). In that scenario we think that it does not make sense to apply multiple-testing corrections, as most true positives would likely be discarded simply because of lack of statistical power. In this context we recommend considering as potential positives all the regions with a P value < 0.01.

Figure 3 - Histogram showing the number of mutations in each region-drug analysis. Note that since we analyzed 24 different drugs and 8,434 regions, there are a total of 202,416 possible combinations, which explains why the y-axis goes up to 60,000.

2) This, obviously, does not mean that all the regions with P values below 0.01 are true positives. To help you interpret the results, e-Drug gives you the P values for two different comparisons (Figure 4). The first P value that you should check is the one from the comparison between cell lines with mutations in the region being analyzed (left box) and cell lines with no mutations in the protein (or "WT", right box). This first comparison will tell you whether mutations in the region correlate with changes in the drug's activity.

The second P value that you should check is the one from the comparison between the region being analyzed (left box) and the rest of the protein (middle box). This second P value will tell you whether the effect that you have seen in the first comparison is specific to the region or, instead, can be attributed to the protein as a whole. If there are no differences between the region and the rest of the protein, it is likely that what correlates with the activity of the drug is the mutation of the gene, regardless of the region, rather than the analyzed region itself (such a gene-level effect is probably also interesting!). If, on the other hand, you also see differences in this second comparison, it is more likely that the region is a true positive.

Figure 4 - Screenshot showing the e-Drug results for PIK3CA and AEW541
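The two points above can be sketched in a few lines of code. The first function illustrates the power limitation from point 1: for an exact two-sided Wilcoxon rank-sum test without ties, the smallest achievable P value depends only on the two group sizes. The second function encodes the reading of the two e-Drug comparisons from point 2. Several things here are assumptions for illustration and are not taken from Cancer3D: the group sizes, the use of Bonferroni at 0.05 as the example correction, the reuse of the 0.01 threshold for the second comparison, and all function names.

```python
from math import comb

N_DRUGS = 24
N_REGIONS = 8434
N_TESTS = N_DRUGS * N_REGIONS  # 202,416 region-drug combinations


def min_two_sided_rank_sum_p(n_group_a, n_group_b):
    """Smallest possible two-sided P value of an exact rank-sum test
    (no ties): the two groups are perfectly separated."""
    return min(1.0, 2 / comb(n_group_a + n_group_b, n_group_a))


def interpret_edrug(p_region_vs_wt, p_region_vs_rest, threshold=0.01):
    """Combine the two e-Drug P values as described in point 2."""
    if p_region_vs_wt >= threshold:
        return "no clear association with drug activity"
    if p_region_vs_rest < threshold:
        return "region-specific candidate"
    return "gene-level association (not specific to the region)"


# Assumed example correction: Bonferroni at a family-wise level of 0.05.
bonferroni_cutoff = 0.05 / N_TESTS

# Hypothetical group sizes: 2 mutated lines against 498 WT lines.
floor_two_mutants = min_two_sided_rank_sum_p(2, 498)

# Hypothetical region vs rest-of-protein comparison: 3 lines against 5.
floor_region_vs_rest = min_two_sided_rank_sum_p(3, 5)
```

Under these assumptions, a region mutated in only two cell lines could never pass the Bonferroni cutoff, however strong the effect. A comparison of three lines against five could not even reach 0.01, so a non-significant second comparison with very few mutated lines should be read as inconclusive rather than as evidence against the region.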