Genome-wide association studies: Where we are heading?
We have witnessed tremendous success in genome-wide association studies (GWAS) in recent years. Since the identification of variants in the complement factor H gene associated with the risk of age-related macular degeneration, GWAS have become ubiquitous in genetic studies and have led to the identification of genetic variants that are associated with a variety of complex human diseases and traits. These discoveries have changed our understanding of the biological architecture of common, complex diseases and have also provided new hypotheses to test. New tools, such as next-generation sequencing, will be an important part of the future of genetics research; however, GWAS will continue to play an important role in disease gene discovery. Many traits have yet to be explored by GWAS, especially in minority populations, and large collaborative studies are currently being conducted to maximize the return from existing GWAS data. In addition, GWAS technology continues to improve, increasing genomic coverage for major global populations and decreasing the cost of experiments. Although much of the variance attributable to genetic factors for many important traits is still unexplained, GWAS technology has been instrumental in mapping over a thousand genes to hundreds of traits. More discoveries are made each month, and the scale, quality and quantity of current work are trending steadily upward. We briefly review the current key trends in GWAS, which can be summarized with three goals: increase power, increase collaborations and increase populations.

Key Words: Genome-wide association studies; Single nucleotide polymorphisms; Sequencing; Genotype imputation; Meta-analysis; Genetic consortium


Genome-wide association studies (GWAS) were motivated by new thinking about approaches for mapping traits to genomic regions and by several developments in large scientific projects, such as the completion of the Homo sapiens reference sequence by the Human Genome Project[1] and the cataloging of common genetic variants by the International HapMap Project[2-5]. GWAS are based on the premise that densely genotyped common, or high-frequency, alleles will have statistical power to detect causal associations with traits at nearby, ungenotyped common polymorphisms through short-range linkage disequilibrium (LD). LD is the nonrandom association between pairs of alleles[6]. The basis for this strategy is the common disease common variant (CDCV) hypothesis[7], which proposes that high-prevalence traits are most likely determined by high-frequency genetic variants. This approach has proven effective in many scenarios for mapping small genomic regions to traits (see the National Human Genome Research Institute Catalog of Published Genome-Wide Association Studies[8,9]). Many of these newly associated regions would not have been considered good candidates for targeted genotyping studies based on biological knowledge or previous linkage evidence, illustrating the difficulty of improvising a hypothesis based on the molecular biology of a gene and its products.
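To make the notion of LD concrete, the following minimal sketch computes two commonly used pairwise measures, the coefficient D and the squared correlation r², from haplotype and allele frequencies. The function name and the example frequencies are hypothetical and chosen only for illustration; they are not taken from any study cited here.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, r2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1 and allele B at locus 2
    p_a:  frequency of allele A at locus 1
    p_b:  frequency of allele B at locus 2
    All frequencies are illustrative inputs and must lie strictly between 0 and 1.
    """
    # D measures how far the haplotype frequency departs from independence.
    d = p_ab - p_a * p_b
    # r2 rescales D to a squared correlation between the two loci (0 to 1).
    r2 = d * d / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))
    return d, r2


# Two loci whose alleles are independent: haplotype frequency equals the product.
d_indep, r2_indep = ld_stats(p_ab=0.12, p_a=0.3, p_b=0.4)

# Two loci in complete LD: allele A is always found together with allele B.
d_full, r2_full = ld_stats(p_ab=0.3, p_a=0.3, p_b=0.3)
```

An r² close to 1 means that a genotyped SNP carries nearly the same statistical information as its ungenotyped neighbor, which is the property that indirect association mapping relies on.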

Since the identification of variants in the complement factor H (CFH) gene associated with the risk of age-related macular degeneration (AMD)[10], GWAS have become ubiquitous in genetic epidemiology and have led to the identification of genetic variants that are associated with a variety of human diseases and traits, such as type 1[11,12] and type 2 diabetes[13-15], inflammatory bowel disease[16], Crohn’s disease[17,18], breast cancer[19], human height[20] and body mass index[21], to name a few. GWAS have revolutionized the search for genetic contributions to complex traits[22,23].

In GWAS, tests of association with traits are conducted at hundreds of thousands to millions of densely spaced single nucleotide polymorphisms (SNPs). GWAS require no a priori biological knowledge and are therefore an agnostic method for localizing the genetic effects of complex human diseases. These study designs rely on genotyping platforms designed by assay manufacturers and on genotyping of cases and controls, of families that contain multiple affected individuals, or of random subjects from the population if a quantitative trait is the focus of the investigation. These platforms come primarily from two manufacturers, Affymetrix and Illumina, and the rationale for the SNPs assayed differs between these companies. The Illumina approach to GWAS design employs haplotype tagging to select SNPs based on local correlation with other nearby SNPs, such that redundant genetic variation containing very similar statistical information is not assayed. The Affymetrix platforms use a different design, in which the human genome is saturated with SNPs that are selected based on their location between two restriction enzyme sites. Regardless of platform, the goal of GWAS is to evaluate the majority of common alleles for association with traits through pairwise correlation with assayed SNPs.

Despite the large size of GWAS data, computational tools make GWAS feasible to analyze on standard desktop computer hardware[24]. However, the large number of hypothesis tests in GWAS creates a challenge for statistical testing. An often-cited genome-wide significance level is 5 × 10⁻⁸, based on the assumption of one million independent pieces of genetic information in the human genome[25,26], and less stringent thresholds have also been verified[27,28]. Few studies have adequate sample size to maintain the power needed to detect the small to moderate effect sizes that predominate in GWAS. The current approach for elucidating genes that influence complex disease is to increase the power in GWAS through increased sample sizes assembled by collaboration among research groups[29,30].
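The genome-wide threshold quoted above can be recovered with a Bonferroni-style correction. The short sketch below assumes a family-wise error rate of 0.05, which the text does not state explicitly, divided across one million independent tests; the function name is hypothetical.

```python
def bonferroni_threshold(family_wise_alpha, n_independent_tests):
    """Per-test significance level that keeps the family-wise error rate
    at or below family_wise_alpha across n_independent_tests tests."""
    return family_wise_alpha / n_independent_tests


# Assumed family-wise error rate of 0.05 and one million independent tests.
genome_wide_alpha = bonferroni_threshold(0.05, 1_000_000)
print(f"{genome_wide_alpha:.1e}")
```

Because the correction scales with the number of independent tests rather than the number of assayed SNPs, denser arrays do not by themselves require a proportionally stricter threshold.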

As of June 1, 2011, 906 publications have been documented and 4514 SNPs have been associated with human disease and traits at a significance level of 10⁻⁵ in the Catalog of Published GWAS[31]. Given the flood of GWAS publications in recent years, this review is not all-inclusive but highlights the key trends in current approaches to GWAS.

Increase sample size

The often-cited first success in GWAS (defined as at least 100K SNPs), the discovery of CFH in AMD, used a small data set (by current standards) of only 96 cases and 50 controls genotyped using the Affymetrix GeneChip Mapping 100K set of microarrays[10]. This study proved the concept of a “brute force” approach that scans the entire human genome for associations with human diseases. Soon after, researchers started using larger sample sizes to augment power in GWAS. In 2007, the Wellcome Trust Case Control Consortium carried out GWAS of seven common diseases using 14 000 cases and 3000 shared controls[29]. The need for statistical power (through the incorporation of larger sample sizes) and the requirement for independent replication of association signals also motivated researchers to employ meta-analysis, often with the aid of genotype imputation, to overcome the limitations associated with each individual GWAS analysis.

Early meta-analyses in GWAS reported success in Parkinson’s disease[32] and type 2 diabetes[33,34]. A meta-analysis combines results from multiple independent studies with similar data to address related research hypotheses. It is a more powerful approach to estimating the true effect size than analysis of data from a single study. In recent genetic studies, meta-analysis has led to many successful discoveries of genetic variants associated with different phenotypes, including type 1 diabetes[35], type 2 diabetes[34], chronic kidney disease[36], retinal microcirculation[37], serum lipid concentrations[38], glucose and insulin response[39], fasting glucose homeostasis[40], blood pressure and hypertension[41,46], atrial fibrillation[42], Crohn’s disease[43], metabolic syndrome[44], human height[20,45] and body mass index[21]. Meta-analyses of several thousands of samples for human diseases[36,47], and even a quarter-million individuals for common human traits[21], are becoming more common. In addition to increasing sample size, meta-analysis allows researchers to bypass the potential Institutional Review Board (IRB) issues of individual-level data sharing, as meta-data do not increase the risk of study subjects being re-identified and their personal information made public.
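As an illustration of how summary results can be pooled without sharing individual-level data, the sketch below implements a fixed-effect, inverse-variance weighted meta-analysis for a single SNP. This is one common weighting scheme and is shown as an assumption; the cited studies may have used other methods. The function name and the per-study effect estimates and standard errors are hypothetical.

```python
import math
from statistics import NormalDist


def fixed_effect_meta(betas, standard_errors):
    """Pool per-study effect estimates for one SNP by inverse-variance weighting.

    Returns (pooled_beta, pooled_se, z, two_sided_p).
    """
    # Each study is weighted by the inverse of its sampling variance,
    # so more precise (typically larger) studies contribute more.
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled_beta / pooled_se
    # Two-sided p-value from the standard normal distribution.
    p_value = 2.0 * NormalDist().cdf(-abs(z))
    return pooled_beta, pooled_se, z, p_value


# Hypothetical summary statistics for the same SNP from three studies.
study_betas = [0.10, 0.12, 0.08]
study_ses = [0.05, 0.04, 0.06]
pooled_beta, pooled_se, z, p_value = fixed_effect_meta(study_betas, study_ses)
```

The pooled standard error is always smaller than that of the most precise contributing study, which is the source of the power gain described above.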

Increase genomic coverage

The density and number of assayed SNPs in GWAS products have improved rapidly, from the Affymetrix 100K array used in the AMD GWAS to the currently popular Affymetrix 6.0 (> 1 million markers) and Illumina Human 1M (> 1 million markers) arrays. Leveraging the advances in the HapMap project[2,4,5] and the 1000 Genomes Project (1KGP)[48], the Illumina HumanOmni 2.5 (about 2.5 million markers) is also available and the Illumina HumanOmni 5M (about 5 million markers) will soon become a reality. For estimates of genomic coverage for various platforms, see Barrett et al[8] and Li et al[49].

Genotype imputation, a recent development, has become a cost-effective approach to increasing genomic coverage in large genomic scans. It not only enables the pooling of GWAS results from different genotyping chips with different SNPs, from which meta-analyses have benefited significantly, but also increases the power of genome scans[50]. Genotype imputation methods utilize haplotypes inferred from a densely genotyped reference panel of subjects with known ethnicity to infer the conditional probabilities of missing genotypes in a study sample genotyped at a subset of SNPs[50,51]. Imputation of genotypes also leverages publicly available resources such as the International HapMap Project data[2,4,5] and resequencing data from the 1KGP[48].

Most of the meta-analyses to date have used the HapMap Phase II reference panels (about 3 million markers). The 1KGP reference panel, with about 16 million variant sites[48], will most likely become the reference panel of choice for future GWAS. Imputation allows researchers to evaluate many more SNPs than are provided by GWAS manufacturers, or to fill in SNPs that are genotyped in only one study of a meta-analysis, without increasing the genotyping cost of the study.

The dilemma, that significant GWAS hits so far only explain a small proportion of heritability, has shifted researchers’ attention from GWAS genotyping chips to sequencing, with the belief that rare variants might be the culprit for the missing heritability. It was also predicted that DNA sequencing would become a routine tool in genetic research[52].

The costs of data generation, storage, processing and bioinformatics analysis add another level of difficulty to whole-genome sequencing experiments in large samples. The per-subject cost of generating individual-level genotype data from GWAS is still much less than the cost of resequencing at a depth that is sufficient for making genotype calls throughout the genome. As a result, particularly for traits for which GWAS have not yet been conducted on a large scale, we believe that array-based GWAS assays will continue to be important, especially with the aid of genotype imputation and new designs of high-density GWAS chips.

Some recent research has shown that association testing from sequence data may provide slightly more statistical power than variant-based genotyping on a per-subject basis[53] using two recently developed tests of association[54,55]. However, we note that because of the large difference between the cost of resequencing and the cost of variant-based genotyping, many more subjects could be genotyped with variant-based methods than could be resequenced for the same resources. As a result, the statistical power to detect an association might be better in a large sample of variant-based genotypes than in a small sample of sequence-based genotypes obtained with the same resources.

Furthermore, once GWAS have elucidated novel regions, targeted resequencing for direct association of alleles in implicated regions can be performed at a fraction of the cost of whole-genome or whole-exome resequencing. Therefore, the use of GWAS can offer benefits at subsequent stages of an investigation and reduce the overall costs of novel locus discovery compared to an approach that relied exclusively on resequencing. The marriage between GWAS arrays and sequencing is likely to be the future, e.g. GWAS arrays followed by targeted sequencing or whole genome sequencing followed by GWAS custom arrays. Regardless of the strategy taken, good coverage of important loci and sufficient sample size to detect associations with rare alleles are indispensable.


The requirement for large sample sizes and replication has motivated massive scientific collaborations. Many new genetic consortia have arisen due to the challenges of conducting successful investigations with GWAS, e.g. the Diabetes Genetics Replication And Meta-analysis Consortium[34], the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium[30], the Meta-Analyses of Glucose and Insulin-related traits Consortium[40], the Genetic Investigation of ANthropometric Traits consortium[20], the Genetics of Obesity-related Liver Disease consortium[56], the Chronic Kidney Disease consortium[36], the Global Blood Pressure Genetics consortium[57], the Candidate-gene Association Resource consortium[58] and the Coronary Artery Disease (C4D) Genetics Consortium[59]. Genetic consortia targeting Asian populations have also been formed, e.g. the Asian Genetic Epidemiology Network consortium[46], which includes 12 GWAS of Asian participants. By using prospective cohort studies, the CHARGE consortium has been very successful in producing numerous high-impact publications on a variety of phenotypes. Publications from these consortia are sometimes co-authored by over a hundred researchers, illustrating the collaborative nature of modern genetic epidemiology. This trend is unprecedented in the field and is likely to continue as technology matures and the cost of experiments using the latest tools increases beyond the ability of any single research group to afford highly powered studies.


A rare trait allele may not be annotated in the databases of common variants maintained by the HapMap project or in the National Center for Biotechnology Information database dbSNP, thereby excluding the possibility of detecting that SNP through imputation and subsequent association analysis. The genetic determinants of a trait may also be unique to each population of human subjects, where sensitive functional gene or regulatory regions are perturbed by independent sets of rare mutations that occurred after geographic or cultural barriers led to increased genetic distance[60]. Thus, the same associated allele from GWAS across multiple ethnic groups does not necessarily imply the same underlying architecture of causal alleles in LD, and it should not be expected that a causal allele in one population will have the same association in another population with a distinct demographic history.

Recent studies show that multi-ethnic GWAS can improve the power for novel locus discovery[61]. A recent example is the association of variants in KCNQ1 with type 2 diabetes in East Asian population samples[62,63], which was not identified in earlier GWAS in European samples[64]. The associated SNP, rs2283228, has a minor allele frequency (MAF) of about 40% in East Asian samples. However, the MAF in European samples is only about 5%. At this level of MAF, there was simply not enough power at the GWAS significance level of 5 × 10⁻⁸ to detect the association in the European studies conducted before the two East Asian studies[13-15,33,65]. Moreover, some risk alleles may be population-specific, which also highlights the importance of conducting GWAS in samples of non-European ancestry[46].
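The effect of allele frequency on power can be illustrated with a rough calculation. The sketch below uses a standard normal approximation for a one-degree-of-freedom additive test on a quantitative trait with unit variance, which is a simplification and not the case-control design of the studies cited above. The sample size and per-allele effect size are hypothetical; only the two MAF values and the significance level come from the text.

```python
from statistics import NormalDist


def approx_gwas_power(n_subjects, maf, beta, alpha=5e-8):
    """Approximate power to detect an additive SNP effect on a quantitative trait.

    n_subjects: sample size (hypothetical in the example below)
    maf:        minor allele frequency of the tested SNP
    beta:       per-allele effect in trait standard deviations (hypothetical)
    alpha:      two-sided significance level
    """
    norm = NormalDist()
    # Share of trait variance explained under Hardy-Weinberg equilibrium.
    variance_explained = 2.0 * maf * (1.0 - maf) * beta ** 2
    # Expected magnitude of the test statistic (square root of the non-centrality).
    expected_z = (n_subjects * variance_explained) ** 0.5
    z_critical = norm.inv_cdf(1.0 - alpha / 2.0)
    # Probability that the observed statistic exceeds the critical value in either tail.
    return norm.cdf(expected_z - z_critical) + norm.cdf(-expected_z - z_critical)


# Same hypothetical sample size and effect size, differing only in MAF.
power_maf_40 = approx_gwas_power(n_subjects=10_000, maf=0.40, beta=0.10)
power_maf_05 = approx_gwas_power(n_subjects=10_000, maf=0.05, beta=0.10)
```

Under these assumed inputs, the same per-allele effect explains roughly five times more trait variance at a MAF of 40% than at 5%, so a sample that is well powered in one population can be badly underpowered in another.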

Early GWAS conducted in Parkinson’s disease (PD) did not yield results that reached genome-wide significance[66-68]. Associations with PD have been replicated in the candidate gene and GWAS contexts, including those described early in PD association studies, such as α-synuclein (SNCA)[69-75] and the microtubule-associated protein tau (MAPT) inversion region on chromosome 17 in European-ancestry subjects[76-89], as well as ubiquitin-specific protease 24[90-92], ELAV-like 4[90,93,94], monoamine oxidase B[95], Apolipoprotein E[96] and the mitochondrial haplogroups[97-104]. The consistency of results, particularly for SNCA and MAPT, suggests that the failure to reach genome-wide significance in previous studies was due to the relatively small GWAS datasets. More recently, GWAS-based investigations into the genetic determinants of PD have been more fruitful, definitively identifying several associated regions in the genes MAPT, SNCA, HLA-DRB5, BST1, GAK, LRRK2, ACMSD, STK39, MCCC1/LAMP3, SYT11 and CCDC62/HIP1R in both Caucasian and Asian patients, although the MAPT association seems to be the result of a chromosomal inversion only present in Europeans[105-110].

The public health impact and economic burden of obesity are substantial, as obesity is associated with increased risks for type 2 diabetes mellitus, cardiovascular disease, dyslipidemia, hypertension, sleep apnea and several forms of cancer[111,112]. In the US, the obesity epidemic disproportionately affects certain ethnic minorities, including Mexican Americans and African Americans[113]. Mexican Americans are the fastest growing minority group in the US and are expected to represent 18% of the US population by 2025. Obesity and comorbid conditions such as diabetic retinopathy have higher prevalence in Mexican Americans than in European Americans[114-116], which will introduce significant social and economic costs if the corresponding genetic research is left far behind.

The PAGE network (Population Architecture using Genomics and Epidemiology) is a National Human Genome Research Institute funded initiative designed to characterize GWAS-identified variants in cohorts, including individuals of ancestral groups other than European descent, to determine if the variants identified are globally associated with various complex traits[117]. Investigators in PAGE are exploring traits that have undergone extensive evaluation in GWAS, including lipids, obesity, type 2 diabetes, stroke and various cancers. More information is available on the PAGE network website.


The study of epidemics of heritable diseases and knowledge about the genetic architecture of complex human traits have developed rapidly in the last two decades. These advances have been primarily due to improvements in genotyping technology and a commensurate increase in the amount and availability of data with which to describe and understand the nature of genetic variation in human populations. During this period, genetic studies of human traits have moved away from assaying a small number of loci to identify regions of linkage to traits in family studies, and toward samples of hundreds of thousands of study subjects in which millions of SNPs are assayed for statistical association with traits using a variety of study designs. There is perhaps no better example of this than GWAS, a fundamental tool that has reshaped how studies are designed, how collaborations are forged and how the architecture of complex human traits is understood.

Because of the rapid pace of discoveries resulting from GWAS and the promise of many more from newer technologies, it seems reasonable to look forward to a time when patients have their genomes genotyped or sequenced and analyzed to provide a personal profile of disease susceptibilities, drug compatibilities and other heritable traits. Approaches to employing GWAS continue to evolve rapidly, but GWAS are likely to remain a viable way to discover the connections between inter-individual genetic variation and phenotypes for the foreseeable future.


