Handling dependence or not in SNP-set testing approaches of Genome-Wide Association Studies

David Causeur (Agrocampus Ouest)
Thursday, January 17, 2019 - 10:30 to 12:00
Room Aurigny
Talk abstract: 

The proper way to handle dependence across features in high-throughput genomic data has raised fundamental discussions with unclear general conclusions or final recommendations. One of the most obvious illustrations of this point is the tremendous effort of the statistics research community to address the impact of dependence on the False Discovery Rate (FDR)-controlling method by Benjamini and Hochberg (1995), which was initially designed under an independence assumption. Another famous example that raises questions is the strikingly good performance of a naïve Bayes procedure ignoring dependence in a comparative study of machine learning methods by Dudoit et al. (2002) to predict classes from gene expression data.

Addressing the dependence issue has often consisted in assessing its detrimental impact on the performance of standard methods designed to be optimal under independence, and deducing ad hoc improvements. Because they must remain valid for arbitrarily complex dependence patterns, such approaches, in which dependence is viewed as a curse, can lead to procedures with low power. Therefore, both for machine learning and testing issues, a new generation of methods has emerged, advocating an ad hoc handling of dependence that consists of a preliminary whitening of the data (see Ahdesmäki and Strimmer, 2010; Hall and Jin, 2010). However, disentangling the dependent noise from the true association signal is very challenging, and decorrelation can then alter the true association signal.

For the purpose of global testing, where the objective is to test for the significance of an association signal between a set of features and a covariate, Arias-Castro et al. (2011) suggest that the optimal handling of dependence should be specific to the pattern of the true association signal, especially through its sparsity rate. This global testing framework covers a wide scope of applications, such as functional Analysis of Variance (fANOVA) and association tests between a region of the genome formed by contiguous Single Nucleotide Polymorphisms (SNPs) and a case/control response variable in Genome-Wide Association Studies (GWAS). Interestingly, in these two fields of application, many popular methods are based on a simple aggregation of pointwise test statistics that ignores their dependence.

In SNP-set approaches to GWAS, both the dependence pattern and the association signal can be very different between regions of the genome. After a general discussion on the performance of testing methods that either ignore dependence or whiten the pointwise test statistics, the presentation will show that neither of these two extreme choices can be uniformly powerful across the variety of dependence and association patterns. We therefore introduce a new class of aggregation methods spanning the range between ignorance of dependence and complete decorrelation, and propose a method that minimizes a distance between the null and non-null moment generating functions of the test statistics within this class to choose the most appropriate handling of dependence. We also discuss the application of these general principles to prediction in high dimension.

Keywords: Dependence, Genome-Wide Association Studies, Global Testing, Functional Analysis of Variance, High dimension, Statistical learning.
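
The two extreme choices discussed in the abstract, aggregating pointwise test statistics while ignoring their dependence versus aggregating them after whitening, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration only: the function names, the AR(1) correlation matrix standing in for the dependence among contiguous SNPs, and in particular the one-parameter family `aggregate_intermediate`, which is a hypothetical interpolation and not the class of aggregation methods introduced in the talk.

```python
import numpy as np


def ar1_correlation(p, rho):
    """Hypothetical AR(1) correlation matrix, a toy stand-in for the
    dependence between p contiguous SNPs."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])


def aggregate_ignoring_dependence(z):
    """Sum of squared pointwise z-scores; the correlation is ignored."""
    return float(np.sum(z ** 2))


def aggregate_after_whitening(z, corr):
    """Quadratic form z' corr^{-1} z, i.e. the sum of squares of the
    fully decorrelated (whitened) z-scores."""
    return float(z @ np.linalg.solve(corr, z))


def aggregate_intermediate(z, corr, alpha):
    """Illustrative family z' corr^{-alpha} z (an assumption, not the
    speaker's method): alpha = 0 ignores dependence, alpha = 1 whitens."""
    eigval, eigvec = np.linalg.eigh(corr)
    scores = eigvec.T @ z  # coordinates of z in the eigenbasis of corr
    return float(np.sum(scores ** 2 / eigval ** alpha))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    corr = ar1_correlation(10, 0.6)
    # Simulate correlated null z-scores with the chosen correlation.
    z = np.linalg.cholesky(corr) @ rng.standard_normal(10)
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, aggregate_intermediate(z, corr, alpha))
```

With `alpha = 0` the statistic reduces to the plain sum of squares, and with `alpha = 1` it coincides with the whitened quadratic form. Choosing the degree of decorrelation in a data-driven way is the topic of the talk and is not attempted here.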