Guidelines to interpret the results in Cancer3D
Both of our algorithms, e-Driver and e-Drug, provide two-sided P values, derived from binomial and Wilcoxon tests respectively. As many others have noted, a P value below a predetermined threshold does not necessarily mean that there is a true effect. Moreover, one has to account for multiple testing, as Cancer3D provides information from multiple projects on 14,700 proteins (or over 100,000 when including all available isoforms).
We believe that mechanically applying a multiple-testing correction would not really help you interpret the results obtained with our algorithms. At the same time, we think that relying on a single predetermined P value threshold is not the best use of this website, and that you should take multiple testing into account in your analysis. This is why we created this section discussing which thresholds we think you should apply in each scenario.
As explained in the paper describing e-Driver, this algorithm uses a binomial test to determine whether a protein functional region is enriched or depleted in mutations. The method assumes that all regions of a protein are equally likely to be mutated and that, therefore, any observed bias in the mutation rate of a protein is likely to be caused by the region being biologically relevant in the cancer samples being analyzed. We are aware that mutation rates in cancer genomes are not constant and that they correlate with various features, such as a gene's replication time or expression levels. While important at genome-wide scale, these effects have never been shown to affect mutation rates at scales as small as the length of a single gene. In fact, methods that take these effects into account to look for cancer-driver genes, such as MutSigCV, explicitly assume that mutation rates within a gene remain constant.
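As a rough illustration of the test described above, the following Python sketch computes a two-sided binomial P value for a single region. It is not the e-Driver implementation. The function names, the use of region length divided by protein length as the expected fraction of mutations, and the definition of "two-sided" (summing all outcomes that are no more likely than the observed one) are assumptions made for this example.

```python
from math import comb


def binomial_pmf(successes, trials, probability):
    """Probability of exactly `successes` hits in `trials` draws."""
    return (comb(trials, successes)
            * probability ** successes
            * (1 - probability) ** (trials - successes))


def two_sided_binomial_p(successes, trials, probability):
    """Two-sided P value: total probability of every outcome that is
    no more likely than the observed one (an assumed convention)."""
    observed = binomial_pmf(successes, trials, probability)
    # Small relative tolerance so floating-point ties are counted.
    cutoff = observed * (1 + 1e-7)
    total = 0.0
    for count in range(trials + 1):
        pmf = binomial_pmf(count, trials, probability)
        if pmf <= cutoff:
            total += pmf
    return min(1.0, total)


def region_enrichment_p(region_length, protein_length,
                        region_mutations, protein_mutations):
    """Hypothetical helper: under the 'all regions equally mutable'
    assumption, a mutation lands in the region with probability
    region_length / protein_length."""
    expected_fraction = region_length / protein_length
    return two_sided_binomial_p(region_mutations, protein_mutations,
                                expected_fraction)


# A region covering 10% of the protein that holds 10 of its 20 mutations.
print(region_enrichment_p(100, 1000, 10, 20))
```

This direct summation is only meant for small mutation counts; for large counts a statistics library would be the more sensible choice.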
Another feature that could affect the results obtained by e-Driver is the GC content of the analyzed region. Since different cancer types have different mutation patterns (lung cancers are enriched in C -> A mutations, for example), it could be that e-Driver identifies a region because it has a bias in its GC content and, hence, is more "mutable" in a specific cancer type. However, we have shown that regions identified by e-Driver (at least in the Pancancer analysis) have the same GC content as the rest of the protein (Figure 1).
Figure 1 - GC content of the regions identified by e-Driver in the Pancancer analysis
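For readers who want to run a similar check on a region of their own, the comparison between a region and the rest of the protein can be sketched as follows. This is only an illustration, not the code used to produce Figure 1: the function names are hypothetical, and the sketch assumes that the region is given as 1-based, inclusive residue positions and that the input is the protein's coding sequence, so that residue i corresponds to nucleotides 3(i-1)+1 to 3i.

```python
def gc_fraction(sequence):
    """Fraction of G and C bases in a nucleotide sequence."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("cannot compute GC content of an empty sequence")
    return sum(base in "GC" for base in sequence) / len(sequence)


def region_vs_rest_gc(coding_sequence, first_residue, last_residue):
    """GC content of a region and of the rest of the protein.

    Assumes 1-based, inclusive residue coordinates mapped onto the
    coding sequence (three nucleotides per residue). Raises ValueError
    if the region covers the whole protein, since no 'rest' remains.
    """
    start = 3 * (first_residue - 1)
    end = 3 * last_residue
    region = coding_sequence[start:end]
    rest = coding_sequence[:start] + coding_sequence[end:]
    return gc_fraction(region), gc_fraction(rest)


# Toy five-codon sequence; the region spans residues 2 to 3.
region_gc, rest_gc = region_vs_rest_gc("ATGGCCGCGTTTAAA", 2, 3)
print(region_gc, rest_gc)
```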
Figure 2 - Correlation between P values and FDR in the Pancancer analysis
We believe that in order to better understand the results from e-Drug you should take into account the following:
1) We currently only have data from the Cancer Cell Line Encyclopedia. This is an excellent dataset, but it "only" contains pharmacological data for around 500 cancer cell lines. This limits our statistical power, as most (81%) regions have 3 or fewer mutations (Figure 3). In that scenario we think that it does not make sense to apply multiple-testing corrections, as most true positives would likely be discarded simply because of a lack of statistical power. In this context we recommend considering as potential positives all the regions with a P value < 0.01.
Figure 3 - Histogram showing the number of mutations in each region-drug analysis. Note that since we analyzed 24 different drugs and 8434 regions, there are a total of 202,416 possible combinations, which explains why the y-axis goes up to 60,000
2) This, obviously, does not mean that all the regions with P values below 0.01 are true positives. In order to help you interpret the results, e-Drug gives you the P values for two different comparisons (Figure 4). The first P value that you should check is the one from the comparison between cell lines with mutations in the region being analyzed (left box) and cell lines with no mutations in the protein (or "WT", right box). This first comparison will tell you whether mutations in the region correlate with changes in the drug's activity.
The second P value that you should check is the one from the comparison between the region being analyzed (left box) and the rest of the protein (middle box). This second P value will tell you whether the effect that you have seen in the first comparison is specific to the region or, instead, can be attributed to the protein as a whole. If there are no differences between the region and the rest of the protein, it is likely that what correlates with the activity of the drug is the mutation of the gene, regardless of the region (which is probably also interesting!), rather than the region analyzed itself. If, on the other hand, you also see differences in this second comparison, it is more likely that the region is a true positive.
Figure 4 - Screenshot showing the e-Drug results for PIK3CA and AEW451
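The two-step reading of the e-Drug P values described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the e-Drug code. The rank-sum test uses a normal approximation with tie and continuity corrections, which is crude for the very small groups that are common in this dataset. The function names, the labels, and the reuse of the 0.01 threshold for the second comparison are choices made for the example, and the activity values are invented.

```python
from collections import Counter
from math import erfc, sqrt


def rank_sum_p(sample_a, sample_b):
    """Two-sided Wilcoxon rank-sum P value (normal approximation)."""
    sample_a, sample_b = list(sample_a), list(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a == 0 or n_b == 0:
        raise ValueError("both samples must be non-empty")
    pooled = sorted(sample_a + sample_b)
    # Assign midranks so tied values share the average of their ranks.
    midrank = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        midrank[pooled[i]] = (i + 1 + j) / 2
        i = j
    u_a = sum(midrank[v] for v in sample_a) - n_a * (n_a + 1) / 2
    n = n_a + n_b
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    variance = n_a * n_b / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return 1.0
    # Continuity-corrected z score, floored at zero.
    z = max(0.0, (abs(u_a - n_a * n_b / 2) - 0.5) / sqrt(variance))
    return erfc(z / sqrt(2))


def interpret_region(region, rest_of_protein, wild_type, threshold=0.01):
    """Apply the two comparisons in the order suggested in the text."""
    p_vs_wt = rank_sum_p(region, wild_type)
    p_vs_rest = rank_sum_p(region, rest_of_protein)
    if p_vs_wt >= threshold:
        label = "no association with drug activity"
    elif p_vs_rest < threshold:
        label = "candidate region-specific effect"
    else:
        label = "effect attributable to the gene as a whole"
    return p_vs_wt, p_vs_rest, label


# Invented activity values for three groups of cell lines.
region = [1.0, 1.2, 0.9, 1.1, 1.3, 0.8]
rest = [3.0, 3.2, 2.9, 3.1, 3.3, 2.8]
wild_type = [3.05, 3.15, 2.95, 3.25, 2.85, 3.35, 2.75, 3.45, 2.65, 3.55]
print(interpret_region(region, rest, wild_type))
```

Swapping the invented `rest` values for ones similar to `region` moves the label to the gene-level case, which mirrors the situation described above where the region and the rest of the protein do not differ.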