Chaichoompu K., Abegaz F., Tongsima S., Shaw P.J., Sakuntabhai A., Pereira L. and Van Steen K.
GIGA-R Medical Genomics - BIO3, University of Liege, Liege, Belgium
SNP-based information is used in several existing clustering methods to detect shared genetic ancestry or to identify population substructure (Price et al. 2006, Raj et al. 2016). Here, we present an unsupervised clustering algorithm called the iterative pruning method to capture population structure (IPCAPS). Our method supports ordinal data, so it can be applied directly to SNP data to identify fine-level population structure. It builds on the iterative pruning Principal Component Analysis (ipPCA) algorithm (Intarapanich et al. 2009). IPCAPS is an iterative process that uses multiple splits based on multivariate Gaussian mixture modeling of principal components and Clustering EM estimation, as in Lebret et al. (2015). In each iteration, rough clusters and outliers are also identified using our own method, RubikClust.
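The iterative splitting described above can be sketched roughly as follows. This is not the IPCAPS implementation: 2-means on the leading principal components stands in for the Gaussian mixture and Clustering EM step, RubikClust and outlier handling are left out, and the stopping rule (a Hudson-type FST threshold `fst_min` together with a minimum group size `min_size`) is an assumption made purely for illustration. All function and parameter names are hypothetical.

```python
import numpy as np

def hudson_fst(G1, G2):
    """Hudson-type FST (ratio of averages) between two groups of 0/1/2 genotypes."""
    n1, n2 = 2 * len(G1), 2 * len(G2)                  # sampled alleles per group
    p1, p2 = G1.mean(axis=0) / 2, G2.mean(axis=0) / 2  # allele frequencies
    num = (p1 - p2) ** 2 - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)
    den = p1 * (1 - p2) + p2 * (1 - p1)
    return num.sum() / den.sum()

def pc_scores(G, n_pc=3):
    """Scores of the individuals on the leading principal components."""
    X = G - G.mean(axis=0)                             # centre each SNP
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :n_pc] * S[:n_pc]

def two_means(Z, n_iter=100):
    """Lloyd's 2-means, seeded at the two extremes of PC1 (stand-in for a mixture model)."""
    centers = Z[[Z[:, 0].argmin(), Z[:, 0].argmax()]].astype(float)
    labels = np.zeros(len(Z), dtype=int)
    for _ in range(n_iter):
        dist = ((Z[:, None, :] - centers[None]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        if labels.min() == labels.max():               # everything in one cluster
            break
        new_centers = np.array([Z[labels == k].mean(axis=0) for k in (0, 1)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def iterative_pruning(G, idx=None, fst_min=0.0008, min_size=10):
    """Recursively bisect individuals; return a list of index arrays (one per cluster)."""
    if idx is None:
        idx = np.arange(len(G))
    if len(idx) < 2 * min_size:                        # too small to split further
        return [idx]
    labels = two_means(pc_scores(G[idx]))
    left, right = idx[labels == 0], idx[labels == 1]
    if min(len(left), len(right)) < min_size:          # split is too unbalanced
        return [idx]
    if hudson_fst(G[left], G[right]) < fst_min:        # assumed stopping rule
        return [idx]
    return (iterative_pruning(G, left, fst_min, min_size)
            + iterative_pruning(G, right, fst_min, min_size))
```

Here `G` is an individuals-by-SNPs matrix of additive genotype codes. In small samples a split of a homogeneous group may exceed a very low FST by chance, so `fst_min` would likely need tuning to the number of individuals and SNPs.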
The fixation index (FST) measures the genetic distance between populations, and populations separated by FST = 0.001 may be considered genetically distinct among European populations (Tian et al. 2008, Huckins et al. 2014). To study fine-level population structure using FST, we examined simulated scenarios of a single population with 500-8,000 individuals and 5,000-10,000 independent SNPs in Hardy-Weinberg equilibrium (HWE) (Balding and Nichols 1995), with 100 replicates per scenario. The simulated SNPs were additively coded, and no missing genotypes were generated. As a negative control, we split the individuals of each simulated population into two groups using k-means. The FST values between the resulting groups were below 0.0008, which can therefore be taken as the minimum FST for detecting fine-level population structure.
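A minimal sketch of this negative control is given below. The abstract does not state the allele-frequency distribution or the FST estimator, so the sketch assumes frequencies drawn uniformly from [0.1, 0.9] and a Hudson-type ratio-of-averages estimator. The k-means step is a plain Lloyd's algorithm with two centres. All function names are hypothetical.

```python
import numpy as np

def simulate_one_population(n_ind, n_snp, rng):
    """Additively coded (0/1/2) genotypes of independent SNPs in HWE, no missing data."""
    p = rng.uniform(0.1, 0.9, n_snp)          # assumed allele-frequency range
    return rng.binomial(2, p, size=(n_ind, n_snp))

def two_means(X, rng, n_iter=50):
    """Lloyd's algorithm with k = 2, started from two randomly chosen individuals."""
    X = X.astype(float)
    centers = X[rng.choice(len(X), size=2, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        dist = np.stack([((X - c) ** 2).sum(axis=1) for c in centers], axis=1)
        new_labels = dist.argmin(axis=1)
        if len(np.unique(new_labels)) < 2:    # keep the last non-degenerate split
            break
        converged = np.array_equal(new_labels, labels)
        labels = new_labels
        if converged:
            break
        centers = np.stack([X[labels == k].mean(axis=0) for k in (0, 1)])
    return labels

def hudson_fst(G1, G2):
    """Hudson-type FST (ratio of averages) between two groups of 0/1/2 genotypes."""
    n1, n2 = 2 * len(G1), 2 * len(G2)         # sampled alleles per group
    p1, p2 = G1.mean(axis=0) / 2, G2.mean(axis=0) / 2
    num = (p1 - p2) ** 2 - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)
    den = p1 * (1 - p2) + p2 * (1 - p1)
    return num.sum() / den.sum()

def negative_control(n_ind, n_snp, seed=0):
    """FST between the two k-means groups of a single simulated population."""
    rng = np.random.default_rng(seed)
    G = simulate_one_population(n_ind, n_snp, rng)
    labels = two_means(G, rng)
    return hudson_fst(G[labels == 0], G[labels == 1])
```

Calling `negative_control(500, 5000)` over many seeds would give a rough analogue of the replicate distribution described above, although the values need not match the reported ones because the estimator and frequency model are assumed.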
To evaluate the performance of our method, we tested simulated data sets of 2-3 populations, with 250 individuals per population, 10,000 independent SNPs in HWE, and FST=[0.0008,0.005], using 100 replicates for each data set. For real-life data, we applied IPCAPS to Thai (Wangkumhang et al. 2013) and HapMap populations. In simulated scenarios of extremely subtle structure (FST=[0.0009,0.005]), IPCAPS achieved higher population classification accuracy than ipPCA. Fine-level structure was also detected in both the Thai and the HapMap populations. We are convinced that IPCAPS has the potential to detect fine-level structure and that it will be important in molecular reclassification studies of patients once underlying population structure has been removed.
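The evaluation set-up can be illustrated with a short sketch, assuming the Balding-Nichols model cited earlier is also used to generate the differentiated populations and that ancestral allele frequencies are uniform on [0.1, 0.9]. The classifier shown (the sign of the first principal component) is only a placeholder for two populations; it is neither IPCAPS nor ipPCA. Accuracy is computed after matching cluster labels to population labels by the best permutation. All names are hypothetical.

```python
import itertools
import numpy as np

def balding_nichols(n_pop, n_per_pop, n_snp, fst, rng):
    """Simulate n_pop populations diverged from one ancestral population at a given FST."""
    p_anc = rng.uniform(0.1, 0.9, n_snp)               # assumed ancestral frequencies
    a = p_anc * (1 - fst) / fst                        # Beta parameters of the
    b = (1 - p_anc) * (1 - fst) / fst                  # Balding-Nichols model
    genotypes, truth = [], []
    for k in range(n_pop):
        p_k = rng.beta(a, b)                           # population-specific frequencies
        genotypes.append(rng.binomial(2, p_k, size=(n_per_pop, n_snp)))
        truth += [k] * n_per_pop
    return np.vstack(genotypes), np.array(truth)

def pc1_sign_labels(G):
    """Placeholder two-group classifier: sign of the first principal component."""
    X = G - G.mean(axis=0)
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return (U[:, 0] > 0).astype(int)

def clustering_accuracy(truth, pred):
    """Fraction of correctly assigned individuals under the best label permutation."""
    k = int(max(truth.max(), pred.max())) + 1
    best = 0.0
    for perm in itertools.permutations(range(k)):
        mapped = np.array(perm)[pred]                  # relabel predicted clusters
        best = max(best, float((mapped == truth).mean()))
    return best
```

For example, `balding_nichols(2, 250, 10000, 0.005, np.random.default_rng(0))` would produce one replicate at the upper end of the FST range quoted above, and `clustering_accuracy` could then score any clustering of it. The permutation search is exhaustive, so it is only practical for a handful of clusters.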