“In a period of less than five years, the GWAS experimental design in human populations has led to new discoveries about genes and pathways involved in common diseases and other complex traits, has provided a wealth of new biological insights, has led to discoveries with direct clinical utility, and has facilitated basic research in human genetics and genomics”.
Visscher et al. 2012
- Am J Hum Genet. 2012 Jan 13; 90(1): 7–24.
Concepts
-
Quantitative trait loci.
-
Linkage disequilibrium (LD).
-
Single marker regression.
-
Multiple testing.
-
Missing heritability.
-
Population structure.
-
Mixed linear model (MLM).
-
Multiple genetic association studies.
-
DNA Imputation
Slides
Software
Papers for discussion (by Marcio Resende)
Please find attached two papers for discussion next Tuesday (October, 3th). I believe these will re-emphasize and complement nicely a lot of the topics that were discussed so far. Please bring questions to get a nice discussion going and come prepared to explain some of the sections as well. Each student must bring three topic from the papers to be discussed during class. So, in a piece of paper write your name and the three bullet point you want to discuss (related to the papers) and bring that to the class.
PCA analysis (by Felipe Ferrão)
During class, it was brought an interesting discussion about PCA analysis and how we could use this approach to analyze the interrelationships among a large number of variables and, consequently, to correct a possible pattern of population structure. Discussion on PCA analysis, eventually, appears in forums on the internet. A nice (and funny) explanation about PCA and its importance was suggested in the following scenario:
Imagine a big family dinner, where everybody starts asking you about PCA. First you explain it to your great-grandmother; then to you grandmother; then to your mother; then to your spouse; finally, to your daughter (who is a mathematician). Each time the next person is less of a layman.
Here is how the conversation might go. The author uses graphs to explain his thoughts. I hope this example can give you an intuitive idea, without entering in many mathematical formalities.
Hands-on
Single Marker Regression (by Felipe Ferrão and Marcio Resende)
This hands-on aims to present basic concepts underlying the use of statistical genetic methods in genome wide association studies (GWAS). The data set presented during the class is available in the BGLR
package. More information can be accessed using the following command in R
:
library(BGLR)
?BGLR
citation("BGLR")
Mixed Linear Model using the GAPIT package (by Marcio Resende)
This hands-on aims to present basic concepts underlying the use of Mixed Linear Model in genome wide association studies (GWAS). Below is presented the codes discussed during the class using the GAPIT
package.
######################### R CODES #####################################
# -- Packages
library(BGLR)
library(multtest)
library(gplots)
require(LDheatmap)
require(genetics)
library(ape)
library(EMMREML)
library(compiler)
library(scatterplot3d)
source("http://zzlab.net/GAPIT/gapit_functions.txt")
source("http://zzlab.net/GAPIT/emma.txt")
# -- Dataset
setwd("~/Downloads/GAPIT")
myGD <- read.table("mdp_numeric.txt", head = TRUE)
myGM <- read.table("mdp_SNP_information.txt" , head = TRUE)
myY <- read.table("mdp_traits.txt", head = TRUE)
overlap = intersect(myGD[,1],myY[,1])
myGD_BGLR = read.table("mdp_numeric.txt", head = TRUE,row.names = 1)
# -- Original data set
X = scale(myGD_BGLR) # Molecular data
X = X[match(overlap,myGD[,1]),]
X[which(is.na(X)==TRUE)]=0
n.alleles = apply(X,2,function(x) length(table(x)))
X = X[,-which(n.alleles ==1)]
# -- Overlap phenotypic and SNP data
myY <- myY[match(overlap,myGD[,1]),]
myGD <- myGD[match(overlap,myGD[,1]),]
myGD = myGD[,-(which(n.alleles ==1)+1)]
myGM = myGM[-which(n.alleles ==1),]
# ------ POP.STRUCTURE -------- # -------- # -----
# -- Distance matrix (Gmatrix and Diss)
Y = myY[,2]
G = tcrossprod(X) # Gmatrix
D = dist(X, method = "minkowski", diag = TRUE, upper = TRUE) #distance
Dissimilarity = 1-as.matrix(D)
# --- PCA - Gmatrix
par(mfrow=c(1,2))
EVD<-eigen(G) # PCA
plot(EVD$vectors[,1:2])
# --- PCA - Dissimilarity
EVD<-eigen(Dissimilarity) # PCA
plot(EVD$vectors[,1:2])
# --- K-means
set.seed(1)
new_df = cbind(EVD$vectors[,1],EVD$vectors[,2])
fit = kmeans(new_df,3)
color = c("blue", "red", "green", "yellow")
plot(EVD$vectors[,1:2],col=color[fit$cluster])
col = rep(c("gray10", "gray60"),max(myGM[,2]))
# ------ GWAS -------- # -------- # -----
par(mfrow=c(1,3))
# -- GWAS without pop. structure (lm)
p.value = vector()
for (i in 1:ncol(X)){
print(paste0("Marker ----> ", i))
regression = (lm(Y~X[,i]))
p.value[i] = summary(regression)$coef[2,4]
}
plot(-log(p.value,base=10), pch =19, cex = 0.6,
ylab = expression(paste("-log"[10], "(p)", sep = "")),
xlab = "Markers", main = "GWAS - SMR - unadjusted",
ylim=c(0,8), col = col[as.numeric(myGM[,2])])
abline(h= 3,col="red", lty=2)
# -- GWAS Correcting for population structure
pop.cofactor = factor(fit$cluster)
p.value = vector()
for (i in 1:ncol(X)){
print(paste0("Marker ----> ", i))
regression = (lm(Y~pop.cofactor+X[,i]))
p.value[i] = summary(regression)$coef[3,4]
}
plot(-log(p.value,base=10), pch =19, cex = 0.6,
ylab = expression(paste("-log"[10], "(p)", sep = "")),
xlab = "Markers",main = "GWAS - SMR - adjusted",
ylim=c(0,8),col = col[as.numeric(myGM[,2])])
abline(h= 3,col="red", lty=2)
# -- GWAS: Mixed Linear Model (GAPIT)
myGAPIT <- GAPIT(Y=myY[,1:2],GD=myGD,GM=myGM,PCA.total=3,)
plot(-log(myGAPIT$P,base=10), pch =19, cex = 0.6,
ylab = expression(paste("-log"[10], "(p)", sep = "")),
xlab = "Markers",main = "GWAS - MLM", ylim=c(0,8),
col = col[as.numeric(myGM[,2])])
abline(h= 3,col="red", lty=2)
# -- GWAS (lm) using PCA from GAPIT
PCA = as.matrix(myGAPIT$PCA[,-1])
p.value = vector()
for (i in 1:ncol(X)){
#print(paste0("Marker ----> ", i))
regression = (lm(Y~ PCA +X[,i]))
p.value[i] = summary(regression)$coef[3,4]
}
plot(-log(p.value,base=10), pch =19, cex = 0.6,
ylab = expression(paste("-log"[10], "(p)", sep = "")),
xlab = "Markers",main = "GWAS",ylim=c(0,8),
col = col[as.numeric(myGM[,2])])
abline(h= 3,col="red", lty=2)
GWAS project
- Deadline: Tuesday, October 17 at 5 pm.
The primary goal of this project is to identify which regions of the genome harbor SNPs that affect the phenotype, considering the concepts discussed during the class. The data set is the same as that we have used during the hands-on. As in the QTL project, this data set was extensively discussed in the literature. You should take advantage of this.
In the BGLR
package, the environments were labeled as 1
,2
,4
and 5
. You can check it using the command: head (wheat.Y)
. During the lesson, we used the environment 5
(last column in the phenotype matrix) for the association analysis. Each student has been assigned to an environment (1,2 or 4) that should be analyzed for his/her project . In this link, you can see the environment assigned to you (the number in the column “environments” is associated with the column in the data wheat.Y
). For example:
Student | Environment |
---|---|
AVCI Muhsin | 2 |
Brewer Sarah E | 3 |
Cappai Francesco | 3 |
Chitwood Jessica L | 3 |
- AVCI Muhsin:
Y = wheat.Y[,2]
- Brewer Sarah E:
Y = wheat.Y[,3]
- Cappai Francesco:
Y = wheat.Y[,3]
- Chitwood Jessica L:
Y = wheat.Y[,3]
If you are planning to use the codes shown in the tutorial, be sure to change the command Y = wheat.Y [, 4]
. This command defines the environment considered for GWAS analysis.
Contents
Introduction (1pt)
- An introduction about GWAS - 300 to 600 words
GWAS analysis (6 pt)
-
Material used. Describe the population and markers used. Add a discussion considering the differences compared to our last project (QTL Mapping). Include comments about the power and resolution of the analysis. (2 pts)
-
Perform the GWAS analysis considering single marker analysis with and without accounting for population structure. Use graphics to represent your results. (2 pts).
-
Discuss the importance to consider population structure in GWAS analysis. Based on the literature, indicate different methods used to account for population structure. (2pt)
Conclusion (2pt)
- Interpretation about how GWAS and these results could be used in plant breeding scenario. You can use this section to expand your knowledge, discuss advantages and disadvantages performing single marker regression, methods to improve the analysis, etc. - 100 to 300 words
Please send the project to Ivone (idebem.oliveira@ufl.edu).