Dealing with population stratification, using 1000 Genomes as a reference.
Association analyses of GWAS data.
Polygenic risk score (PRS) analyses.
In genetics, a genome-wide association study (GWA study, or GWAS), also known as whole genome association study (WGA study, or WGAS), is an observational study of a genome-wide set of genetic variants in different individuals to see if any variant is associated with a trait.
grep "PROBLEM" plink.sexcheck | awk '{print$1,$2}' > sex_discrepancy.txt
# This command generates a list of individuals with the status PROBLEM.
plink --bfile HapMap_3_r3_5 --remove sex_discrepancy.txt --make-bed --out HapMap_3_r3_6
# This command removes the list of individuals with the status PROBLEM.
3: Generate a bfile with autosomal SNPs only and delete SNPs with a low minor allele frequency (MAF)
plink --bfile HapMap_3_r3_7 --maf 0.05 --make-bed --out HapMap_3_r3_8
# A conventional MAF threshold for a regular GWAS is between 0.01 and 0.05, depending on sample size.
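The heading above mentions keeping autosomal SNPs only, but the source shows just the MAF command. A minimal sketch of the selection step: the input/output file names in the comments are assumptions inferred from the surrounding numbering, and the awk filter is demonstrated on a tiny inline sample .bim (chromosome codes 23 = X and 26 = MT are dropped):

```shell
# In a .bim file, column 1 is the chromosome code and column 2 the SNP ID.
cat > sample.bim <<'EOF'
1 rs100 0 1000 A G
22 rs200 0 2000 C T
23 rs300 0 3000 A C
26 rs400 0 4000 G T
EOF
# Keep only SNPs on autosomes (chromosomes 1 through 22).
awk '{ if ($1 >= 1 && $1 <= 22) print $2 }' sample.bim > snp_1_22.txt
# On the real data the same filter feeds plink (file names are assumptions):
#   awk '{ if ($1 >= 1 && $1 <= 22) print $2 }' HapMap_3_r3_6.bim > snp_1_22.txt
#   plink --bfile HapMap_3_r3_6 --extract snp_1_22.txt --make-bed --out HapMap_3_r3_7
```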
After frequency and genotyping pruning, 1,073,226 SNPs remain.
4: Delete SNPs which are not in Hardy-Weinberg equilibrium (HWE) & Check the distribution of HWE p-values of all SNPs.
plink --bfile HapMap_3_r3_8 --hardy
Selecting SNPs with an HWE p-value below 0.00001 zooms in on strongly deviating SNPs.
awk '{ if ($9 < 0.00001) print $0 }' plink.hwe > plinkzoomhwe.hwe
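The command above only extracts strongly deviating SNPs for inspection; the deletion promised in the step heading is done with plink's `--hwe` filter. A sketch under assumptions (the 1e-6 threshold and the file names are not stated in the source), with the awk step demonstrated on a toy plink.hwe:

```shell
# Toy plink.hwe: column 9 holds the Hardy-Weinberg exact-test p-value.
cat > toy.hwe <<'EOF'
1 rs1 ALL A G 10/20/10 0.5 0.5 0.80
1 rs2 ALL C T 40/0/40 0.0 0.5 0.0000001
EOF
awk '{ if ($9 < 0.00001) print $0 }' toy.hwe > toy_zoom.hwe
# Removal on the real data (threshold and file names are assumptions; note that
# with case/control phenotypes plink applies --hwe to controls only by default):
#   plink --bfile HapMap_3_r3_8 --hwe 1e-6 --make-bed --out HapMap_3_r3_9
```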
5: Generate a plot of the distribution of the heterozygosity rate of your subjects & remove individuals with a heterozygosity rate deviating more than 3 SD from the mean
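No commands accompany this step in the source. A sketch under assumed file names: plink's `--het` writes per-individual observed and expected homozygote counts, the heterozygosity rate is (N(NM) - O(HOM)) / N(NM), and the ±3 SD filter is demonstrated here in awk on generated toy data (one planted outlier among 21 individuals):

```shell
# Real-data step (file names are assumptions):
#   plink --bfile HapMap_3_r3_9 --extract indepSNP.prune.in --het --out R_check
# Toy .het table: 20 typical individuals plus one extreme outlier.
{ echo "FID IID O.HOM. E.HOM. N.NM. F"
  for i in $(seq 1 20); do echo "$i $i 800 790 1000 0.01"; done
  echo "21 21 100 790 1000 0.90"; } > R_check.het
# Heterozygosity rate = (N(NM) - O(HOM)) / N(NM); individuals outside
# mean +/- 3 SD are written to fail-het-qc.txt for removal.
awk 'NR > 1 { h = ($5 - $3) / $5; s += h; ss += h * h; n++; line[n] = $0; het[n] = h }
     END { m = s / n; sd = sqrt(ss / n - m * m);
           for (i = 1; i <= n; i++)
             if (het[i] < m - 3 * sd || het[i] > m + 3 * sd) print line[i] }' \
    R_check.het > fail-het-qc.txt
```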
# Check for potential strand issues.
awk '{print$2,$5,$6}' 1kG_MDS7.bim > 1kGMDS7_tmp
awk '{print$2,$5,$6}' HapMap-adj.bim > HapMap-adj_tmp
sort 1kGMDS7_tmp HapMap-adj_tmp | uniq -u > all_differences.txt
# 1624 differences between the files; some of these might be due to strand issues.
awk '{print$2,$5,$6}' corrected_hapmap.bim > corrected_hapmap_tmp
sort 1kGMDS7_tmp corrected_hapmap_tmp | uniq -u > uncorresponding_SNPs.txt
# This file demonstrates that there are 84 differences between the files.
Remove problematic SNPs from HapMap and 1000 Genomes
awk '{print$1}' uncorresponding_SNPs.txt | sort -u > SNPs_for_exlusion.txt
# The command above generates a list of the 42 SNPs which caused the 84 differences between
# the HapMap and the 1000 Genomes data sets after flipping and setting of the reference genome.
# Remove the 42 problematic SNPs from both datasets.
plink --bfile corrected_hapmap --exclude SNPs_for_exlusion.txt --make-bed --out HapMap_MDS2
plink --bfile 1kG_MDS7 --exclude SNPs_for_exlusion.txt --make-bed --out 1kG_MDS8
# Merge HapMap with 1000 Genomes data.
plink --bfile HapMap_MDS2 --bmerge 1kG_MDS8.bed 1kG_MDS8.bim 1kG_MDS8.fam --allow-no-sex --make-bed --out MDS_merge2
# Using a set of pruned SNPs
plink --bfile MDS_merge2 --extract indepSNP.prune.in --genome --out MDS_merge2
plink --bfile MDS_merge2 --read-genome MDS_merge2.genome --cluster --mds-plot 10 --out MDS_merge2
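The .mds file written above (columns FID, IID, SOL, then components C1, C2, …) is typically reused as covariates in the association step. A sketch of the extraction on a toy .mds; the covar_mds.txt name and the choice of two components are assumptions:

```shell
# Toy .mds file in the layout plink's --mds-plot produces.
cat > MDS_merge2.mds <<'EOF'
    FID  IID  SOL  C1      C2      C3
    1    A    0    0.011  -0.020   0.003
    2    B    0    0.020   0.010  -0.004
EOF
# Keep FID, IID and the first components as covariates (extend $4, $5, ...
# to as many components as you want to correct for).
awk '{ print $1, $2, $4, $5 }' MDS_merge2.mds > covar_mds.txt
```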
MDS-plot
Download the file containing the population information of the 1000 Genomes dataset.
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/20100804.ALL.panel
# The file 20100804.ALL.panel contains population codes of the individuals of 1000 Genomes.
Convert the population codes into superpopulation codes (i.e., AFR, AMR, ASN, and EUR).
awk '{print$1,$1,$2}' 20100804.ALL.panel > race_1kG.txt
sed 's/JPT/ASN/g' race_1kG.txt > race_1kG2.txt
sed 's/ASW/AFR/g' race_1kG2.txt > race_1kG3.txt
sed 's/CEU/EUR/g' race_1kG3.txt > race_1kG4.txt
sed 's/CHB/ASN/g' race_1kG4.txt > race_1kG5.txt
sed 's/CHD/ASN/g' race_1kG5.txt > race_1kG6.txt
sed 's/YRI/AFR/g' race_1kG6.txt > race_1kG7.txt
sed 's/LWK/AFR/g' race_1kG7.txt > race_1kG8.txt
sed 's/TSI/EUR/g' race_1kG8.txt > race_1kG9.txt
sed 's/MXL/AMR/g' race_1kG9.txt > race_1kG10.txt
sed 's/GBR/EUR/g' race_1kG10.txt > race_1kG11.txt
sed 's/FIN/EUR/g' race_1kG11.txt > race_1kG12.txt
sed 's/CHS/ASN/g' race_1kG12.txt > race_1kG13.txt
sed 's/PUR/AMR/g' race_1kG13.txt > race_1kG14.txt
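The fourteen chained sed calls (and thirteen intermediate files) can be collapsed into a single awk pass with the same mappings, demonstrated here on a three-line sample of the panel file. Mapping only the population column also avoids sed's `s///g` accidentally rewriting a matching substring inside a sample ID:

```shell
cat > panel.sample <<'EOF'
HG00096 GBR ILLUMINA
NA18939 JPT ILLUMINA
NA19700 ASW ILLUMINA
EOF
# Same population -> superpopulation mappings as the sed chain above.
awk 'BEGIN {
       m["JPT"] = m["CHB"] = m["CHD"] = m["CHS"] = "ASN";
       m["ASW"] = m["YRI"] = m["LWK"] = "AFR";
       m["CEU"] = m["TSI"] = m["GBR"] = m["FIN"] = "EUR";
       m["MXL"] = m["PUR"] = "AMR";
     }
     { print $1, $1, ($2 in m ? m[$2] : $2) }' panel.sample > race_1kG.sample.txt
```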
# adjust
plink --bfile HapMap_3_r3_13 --assoc --adjust --out adjusted_assoc_results
# This file gives a Bonferroni-corrected p-value, along with FDR and others.
## Permutation
# This is a computationally intensive step.
# To reduce computational time we only perform this test on a subset of the SNPs from chromosome 22.
# The EMP2 column provides the multiple-testing-corrected p-value.
# Generate subset of SNPs
awk '{ if ($4 >= 21595000 && $4 <= 21605000) print $2 }' HapMap_3_r3_13.bim > subset_snp_chr_22.txt
# Filter your bfile based on the subset of SNPs generated in the step above.
plink --bfile HapMap_3_r3_13 --extract subset_snp_chr_22.txt --make-bed --out HapMap_subset_for_perm
# Perform 1000000 permutations.
plink --bfile HapMap_subset_for_perm --assoc --mperm 1000000 --out subset_1M_perm_result
# Order your data, from lowest to highest p-value.
sort -gk 4 subset_1M_perm_result.assoc.mperm > sorted_subset.txt
# Check ordered permutation results.
head sorted_subset.txt
Generate Manhattan and QQ plots
install.packages("qqman", repos = "http://cran.cnr.berkeley.edu/", lib = "~")
# The location of installation can be changed but has to correspond with the library location.
library("qqman", lib.loc = "~")
results_log <- read.table("logistic_results.assoc_2.logistic", head = TRUE)
jpeg("Logistic_manhattan.jpeg")
manhattan(results_log, chr = "CHR", bp = "BP", p = "P", snp = "SNP", main = "Manhattan plot: logistic")
dev.off()
install.packages("qqman", repos = "http://cran.cnr.berkeley.edu/", lib = "~")
# The location of installation can be changed but has to correspond with the library location.
library("qqman", lib.loc = "~")
results_log <- read.table("logistic_results.assoc_2.logistic", head = TRUE)
jpeg("QQ-Plot_logistic.jpeg")
qq(results_log$P, main = "Q-Q plot of GWAS p-values : log")
dev.off()
results_as <- read.table("assoc_results.assoc", head = TRUE)
jpeg("QQ-Plot_assoc.jpeg")
qq(results_as$P, main = "Q-Q plot of GWAS p-values : assoc")
dev.off()