GWAS Pipeline
The GWAS (Genome-Wide Association Study) pipeline in IDEAL-GENOM provides comprehensive tools for performing association analysis between genetic variants and phenotypes. The pipeline includes preparatory steps, statistical analysis using both fixed and mixed effects models, and gene annotation of significant findings.
Overview
The GWAS pipeline consists of three main components:
Preparatory Analysis: LD pruning and PCA decomposition to prepare data
GLM Analysis: Fixed effects association testing using generalized linear models
GLMM Analysis: Mixed model analysis accounting for relatedness and population structure
Key Features:
Automated LD pruning with high-LD region filtering
Principal component analysis for population stratification
Fixed effects (GLM) and mixed effects (GLMM) association testing
Independent signal identification using GCTA COJO
Automatic gene annotation with Ensembl or RefSeq
Support for binary (case-control) phenotypes
Resource-aware parallel processing
Prerequisites
Required Software:
PLINK 2.0 (for GLM analysis)
GCTA (for GLMM analysis and COJO)
Python 3.11+ with required packages
Required Data:
Quality-controlled PLINK binary files (.bed, .bim, .fam)
Phenotype data in .fam file (case=2, control=1, missing=0/-9)
High-LD regions file (auto-downloaded if not provided)
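Before running the pipeline, it can help to confirm that the phenotype column of the .fam file uses the coding listed above. The following is a minimal sketch, not part of IDEAL-GENOM: the function name count_fam_phenotypes is hypothetical, and it assumes the standard six-column, whitespace-delimited PLINK .fam layout with the phenotype in the sixth column.

```python
from collections import Counter


def count_fam_phenotypes(fam_path):
    """Tally phenotype codes found in the sixth column of a PLINK .fam file."""
    # Coding described in this guide: case=2, control=1, missing=0 or -9.
    labels = {"2": "case", "1": "control", "0": "missing", "-9": "missing"}
    counts = Counter()
    with open(fam_path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 6:
                continue  # skip blank or malformed lines
            counts[labels.get(fields[5], "other")] += 1
    return counts
```

A non-zero "other" count suggests the phenotype is quantitative or uses a different coding (for example 0/1), which is worth resolving before association testing.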
Recommended Preprocessing:
Run the QC pipeline first to ensure data quality:
# Run QC pipeline first
ideal-genom run --config qc_pipeline.yaml
# Then run GWAS on clean data
ideal-genom run --config gwas_pipeline.yaml
Quick Start
1. Get the GWAS Configuration Template
# Copy the GWAS template
cp yaml_configs/gwas_config_template.yaml my_gwas.yaml
2. Edit Configuration
Update paths to your QC-cleaned data:
pipeline:
  name: "my_gwas"
  base_output_dir: "/data/gwas_output"
  steps:
    - name: "preparatory"
      enabled: true
      init_params:
        input_path: "/data/qc_output/variant_qc/clean_files"
        input_name: "mydata_variantQCed"
        output_path: "${base_output_dir}"
        build: "38"
3. Run the Pipeline
ideal-genom run --config my_gwas.yaml
Pipeline Steps
Step 1: Preparatory Analysis
Prepares genetic data for association testing by performing LD pruning and PCA decomposition.
What it does:
Filters high-LD regions
Applies QC filters (MAF, missingness, HWE)
Performs LD-based pruning to identify independent SNPs
Computes principal components for population structure
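To make the LD-based pruning step concrete, here is a simplified sketch of windowed pairwise pruning. It is an illustration under stated assumptions, not the pipeline's implementation (the pipeline delegates this work to external tools): the function names are hypothetical, genotypes are plain lists of 0/1/2 allele counts per variant, and the later variant of each correlated pair is always dropped, which is simpler than the rule real tools apply.

```python
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # monomorphic variant: correlation is undefined
    return cov * cov / (vx * vy)


def ld_prune(genotypes, window=50, step=5, r2_threshold=0.2):
    """Return indices of variants kept after sliding-window pruning."""
    keep = [True] * len(genotypes)
    start = 0
    while start < len(genotypes):
        end = min(start + window, len(genotypes))
        for i in range(start, end):
            if not keep[i]:
                continue
            for j in range(i + 1, end):
                # Drop the later variant of any pair above the threshold.
                if keep[j] and r_squared(genotypes[i], genotypes[j]) > r2_threshold:
                    keep[j] = False
        start += step  # slide the window forward by `step` variants
    return [i for i, kept in enumerate(keep) if kept]
```

With three variants where the second duplicates the first and the third is uncorrelated with it, ld_prune keeps the first and third.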
Configuration:
- name: "preparatory"
  enabled: true
  module: "ideal_genom.gwas.preparatory"
  class: "Preparatory"
  init_params:
    input_path: "/data/qc/clean_files"
    input_name: "mydata_clean"
    output_path: "${base_output_dir}"
    output_name: "gwas_prep"
    high_ld_regions_file: "auto"
    build: "38"
  execute_params:
    # QC filters
    mind: 0.1                 # Max missing per individual
    maf: 0.01                 # Minor allele frequency
    geno: 0.1                 # Max missing per SNP
    hwe: 5.0e-6               # Hardy-Weinberg p-value
    # LD pruning
    ind_pair: [50, 5, 0.2]    # Window, step, r² threshold
    # PCA
    pca: 10                   # Number of PCs to compute
    # Resources
    memory: null              # Auto-detect
    threads: null             # Auto-detect
Parameters Explained:
mind (float, 0-1): Maximum missing genotype rate per individual
maf (float, 0-0.5): Minor allele frequency threshold for filtering
geno (float, 0-1): Maximum missing genotype rate per SNP
hwe (float, 0-1): Hardy-Weinberg equilibrium p-value threshold
ind_pair (list): LD pruning [window_size, step_size, r²_threshold]
window_size: Number of variants in sliding window (default: 50)
step_size: Window shift size in variants (default: 5)
r²_threshold: Prune variants with r² > threshold (default: 0.2)
pca (int): Number of principal components to compute (typically 10-20)
memory (int or null): Memory in MB (null = auto-detect, using 2/3 of available RAM)
threads (int or null): CPU threads (null = auto-detect, using the number of cores minus 2)
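The auto-detect rules for memory and threads can be written out as a small helper. This is a sketch of the rule as described in this guide, not the pipeline's code: resolve_resources, available_mb, and cores are hypothetical names, and the floor of one thread is an added assumption for machines with two or fewer cores.

```python
import os


def resolve_resources(memory=None, threads=None, available_mb=None, cores=None):
    """Fill in null (None) resource settings using the documented defaults."""
    if cores is None:
        cores = os.cpu_count() or 1
    if threads is None:
        # Documented default: number of cores minus 2 (assumed floor of 1).
        threads = max(1, cores - 2)
    if memory is None:
        if available_mb is None:
            raise ValueError("available_mb is required when memory is None")
        # Documented default: two thirds of available RAM, in MB.
        memory = available_mb * 2 // 3
    return memory, threads
```

For example, with 48000 MB available and 8 cores, the defaults resolve to 32000 MB and 6 threads.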
Output Files:
preparatory/
├── gwas_prep-prunning.bed/bim/fam # After QC filters
├── gwas_prep-prunning.prune.in # SNPs passing LD pruning
├── gwas_prep-prunning.prune.out # SNPs removed by LD pruning
├── gwas_prep-pruned.bed/bim/fam # Pruned dataset
├── mydata_clean.eigenvec # PC scores
└── mydata_clean.eigenval # Eigenvalues (variance explained)
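The .eigenval file can be used to judge how many PCs are worth carrying forward as covariates. The sketch below assumes the file holds one eigenvalue per line; the function name pc_variance_shares is hypothetical. The shares are relative to the computed PCs only, not to the total genetic variance.

```python
def pc_variance_shares(eigenval_path):
    """Return each eigenvalue as a fraction of the sum of all listed eigenvalues."""
    with open(eigenval_path) as handle:
        eigenvalues = [float(line) for line in handle if line.strip()]
    total = sum(eigenvalues)
    return [value / total for value in eigenvalues]
```

A sharp drop in share after the first few components is a common, informal cue that later PCs capture little additional structure.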
Step 2: GLM Analysis
Performs fixed effects association testing using generalized linear models with PCA covariates.
What it does:
Runs logistic regression with PCA covariates
Identifies genome-wide significant variants (p < 5×10⁻⁸)
Uses GCTA COJO to find independent signals
Annotates significant variants with gene information
Configuration:
- name: "gwas_glm"
  enabled: true
  module: "ideal_genom.gwas.gen_linear_model"
  class: "GWAS_GLM"
  init_params:
    input_path: "${steps.preparatory.input_path}"
    input_name: "${steps.preparatory.input_name}"
    output_path: "${base_output_dir}"
    output_name: "gwas_glm_results"
    recompute: true
  execute_params:
    # Association testing filters
    maf: 0.01                 # MAF threshold
    mind: 0.1                 # Missing per individual
    hwe: 5.0e-6               # HWE p-value
    ci: 0.95                  # Confidence interval
    # Annotation
    gtf_path: null            # Custom GTF (null = auto-download)
    build: "38"               # Genome build
    anno_source: "ensembl"    # "ensembl" or "refseq"
Parameters Explained:
maf (float): MAF threshold for variants included in analysis
mind (float): Maximum missing rate per individual
hwe (float): Hardy-Weinberg equilibrium p-value threshold
ci (float): Confidence level for the effect size confidence intervals, between 0 and 1 (e.g. 0.95 for 95% intervals)
gtf_path (string or null): Path to custom GTF annotation file
build (string): “37” for GRCh37/hg19, “38” for GRCh38/hg38
anno_source (string): Annotation source - “ensembl” or “refseq”
recompute (bool): If false, skip if results already exist
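As an illustration of what the ci parameter controls, the sketch below computes a Wald confidence interval for an odds ratio from a log odds ratio and its standard error. This is the standard textbook formula, shown as an assumption about how such intervals are typically derived rather than a description of the pipeline's internals; odds_ratio_ci is a hypothetical name.

```python
import math
from statistics import NormalDist


def odds_ratio_ci(log_or, se, ci=0.95):
    """Return (odds ratio, lower bound, upper bound) for a Wald interval."""
    # Two-sided critical value, e.g. about 1.96 for ci=0.95.
    z = NormalDist().inv_cdf(0.5 + ci / 2)
    lower = math.exp(log_or - z * se)
    upper = math.exp(log_or + z * se)
    return math.exp(log_or), lower, upper
```

Raising ci widens the interval; it does not change the point estimate or the p-value.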
Output Files:
gwas_glm/
├── gwas_glm_results_glm.PHENO1.glm.logistic.hybrid
│ # Full GWAS results with all tested variants
├── gwas_glm_results_glm.PHENO1.glm.logistic.hybrid.adjusted
│ # Results with adjusted p-values
├── cojo_file.ma
│ # Prepared for COJO analysis
├── gwas_glm_results-cojo.jma.cojo
│ # Independent genome-wide significant variants
└── top_hits_annotated.tsv
# Annotated top hits with gene names, positions, effects
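To inspect the full results file directly, a short script can pull out variants below the genome-wide threshold. This is a sketch under assumptions: it expects a tab-delimited file with ID, TEST, and P columns (as in typical PLINK 2 association output) and keeps only rows for the additive test, since covariate rows may also be reported. The function name significant_hits is hypothetical.

```python
import csv


def significant_hits(results_path, threshold=5e-8, test="ADD"):
    """Return (variant ID, p-value) pairs below the threshold, smallest first."""
    hits = []
    with open(results_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            # Skip covariate rows; if there is no TEST column, keep every row.
            if row.get("TEST", test) != test:
                continue
            try:
                p_value = float(row["P"])
            except ValueError:
                continue  # non-numeric entries such as NA
            if p_value < threshold:
                hits.append((row["ID"], p_value))
    return sorted(hits, key=lambda item: item[1])
```

Passing threshold=1e-5 lists suggestive hits instead.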
Step 3: GLMM Analysis
Performs mixed model association testing accounting for population structure and cryptic relatedness.
What it does:
Computes genetic relationship matrix (GRM) from pruned SNPs
Creates sparse GRM for computational efficiency
Runs fastGWA mixed model with PCA and sex covariates
Identifies independent signals using GCTA COJO
Annotates significant variants with gene information
Configuration:
- name: "gwas_glmm"
  enabled: true
  module: "ideal_genom.gwas.gen_linear_mix_model"
  class: "GWAS_GLMM"
  init_params:
    input_path: "${steps.preparatory.input_path}"
    input_name: "${steps.preparatory.input_name}"
    output_path: "${base_output_dir}"
    output_name: "gwas_glmm_results"
    recompute: true
  execute_params:
    # Association parameters
    maf: 0.01
    # GRM computation
    pruned_file: "${steps.preparatory.pruned_file}"
    max_threads: null         # Auto-detect
    # Annotation
    gtf_path: null
    build: "38"
    anno_source: "ensembl"
Parameters Explained:
maf (float): MAF threshold for variants in analysis
pruned_file (string): Path to pruned PLINK files (without extension)
max_threads (int or null): Threads for GRM computation
Other parameters same as GLM
Output Files:
gwas_glmm/
├── mydata_clean_pheno.phen
│ # Phenotype file (FID, IID, phenotype)
├── mydata_clean_sex.covar
│ # Sex covariate file
├── mydata_clean_grm.grm.bin
│ # Genetic relationship matrix (binary)
├── mydata_clean_grm.grm.id
│ # Sample IDs for GRM
├── mydata_clean_grm.grm.N.bin
│ # Number of variants used per pair
├── mydata_clean_sparse.grm.sp
│ # Sparse GRM (efficient storage)
├── mydata_clean_sparse.grm.id
│ # Sample IDs for sparse GRM
├── gwas_glmm_results_assocSparseCovar_pca_sex-mlm-binary.fastGWA
│ # Full GWAS results
├── cojo_file.ma
│ # Prepared for COJO analysis
├── gwas_glmm_results-cojo.jma.cojo
│ # Independent genome-wide significant variants
└── top_hits_annotated.tsv
# Annotated top hits
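The cojo_file.ma listed above is the summary-statistics input for GCTA COJO, which expects the columns SNP, A1, A2, freq, b, se, p, and N. How the pipeline builds this file is not described here; the following is only a sketch of the format, with a hypothetical helper that converts an odds ratio to the log-odds effect size COJO expects.

```python
import math

# Column order expected by GCTA COJO summary-statistics (.ma) files.
MA_HEADER = ["SNP", "A1", "A2", "freq", "b", "se", "p", "N"]


def to_ma_fields(snp, a1, a2, freq, odds_ratio, log_or_se, p, n):
    """Build one .ma row, converting the odds ratio to a log-odds effect."""
    beta = math.log(odds_ratio)
    return [snp, a1, a2, f"{freq:.6g}", f"{beta:.6g}", f"{log_or_se:.6g}", f"{p:.6g}", str(n)]


def write_ma(path, rows):
    """Write a header plus whitespace-delimited rows to a .ma file."""
    with open(path, "w") as handle:
        handle.write(" ".join(MA_HEADER) + "\n")
        for row in rows:
            handle.write(" ".join(to_ma_fields(*row)) + "\n")
```

Results that already report an effect on the log-odds scale would skip the logarithm and pass the effect through unchanged.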
Complete Workflow Example
Full GWAS Configuration:
pipeline:
  name: "my_gwas_study"
  base_output_dir: "/data/gwas_results"
  steps:
    # Step 1: Prepare data
    - name: "preparatory"
      enabled: true
      module: "ideal_genom.gwas.preparatory"
      class: "Preparatory"
      init_params:
        input_path: "/data/qc/clean_files"
        input_name: "study_clean"
        output_path: "${base_output_dir}"
        output_name: "gwas_prep"
        high_ld_regions_file: "auto"
        build: "38"
      execute_params:
        mind: 0.02
        maf: 0.01
        geno: 0.02
        hwe: 1.0e-6
        ind_pair: [50, 5, 0.2]
        pca: 10
    # Step 2: Fixed effects analysis
    - name: "gwas_glm"
      enabled: true
      module: "ideal_genom.gwas.gen_linear_model"
      class: "GWAS_GLM"
      init_params:
        input_path: "${steps.preparatory.input_path}"
        input_name: "${steps.preparatory.input_name}"
        output_path: "${base_output_dir}"
        output_name: "gwas_glm"
        recompute: false
      execute_params:
        maf: 0.01
        mind: 0.02
        hwe: 1.0e-6
        ci: 0.95
        build: "38"
        anno_source: "ensembl"
    # Step 3: Mixed model analysis (optional)
    - name: "gwas_glmm"
      enabled: false  # Enable if needed
      module: "ideal_genom.gwas.gen_linear_mix_model"
      class: "GWAS_GLMM"
      init_params:
        input_path: "${steps.preparatory.input_path}"
        input_name: "${steps.preparatory.input_name}"
        output_path: "${base_output_dir}"
        output_name: "gwas_glmm"
        recompute: false
      execute_params:
        maf: 0.01
        pruned_file: "${steps.preparatory.pruned_file}"
        build: "38"
        anno_source: "ensembl"
settings:
  logging:
    level: "INFO"
    file_logging: true
  resources:
    max_memory: null
    max_threads: null
  files:
    keep_intermediate: true
Running the Analysis:
# Validate configuration
ideal-genom validate --config my_gwas.yaml
# Dry run to preview
ideal-genom run --config my_gwas.yaml --dry-run
# Execute pipeline
ideal-genom run --config my_gwas.yaml
Parameter Recommendations
Conservative Analysis (reduce false positives):
execute_params:
  maf: 0.05       # Higher MAF
  hwe: 1.0e-10    # Stricter HWE
  mind: 0.01      # Stricter missingness
  geno: 0.01
Liberal Analysis (increase power):
execute_params:
  maf: 0.01       # Lower MAF
  hwe: 1.0e-4     # Relaxed HWE
  mind: 0.05
  geno: 0.05
Rare Variant Analysis:
execute_params:
  maf: 0.001      # Include rare variants
  geno: 0.02      # Stricter missingness for rare variants
  hwe: 1.0e-6
Resource Management
For Large Datasets (> 500K SNPs, > 10K samples):
settings:
  resources:
    max_memory: 64000           # 64GB
    max_threads: 16             # Use many cores
  files:
    keep_intermediate: false    # Save disk space
For Small Datasets:
settings:
  resources:
    max_memory: 16000           # 16GB sufficient
    max_threads: 4
  files:
    keep_intermediate: true     # Keep for inspection
Troubleshooting
Common Issues and Solutions
Issue: “No genome-wide significant hits”
Solutions:
Check sample size (need sufficient power)
Verify phenotype coding (cases=2, controls=1)
Review QC stringency (may be too strict)
Check for population stratification
Consider suggestive hits (p < 1×10⁻⁵)
Issue: “Inflation of test statistics (λ > 1.1)”
Solutions:
Check for population stratification
Increase number of PCs used as covariates
Consider using GLMM instead of GLM
Review sample quality (duplicates, relatedness)
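The inflation factor λ can be computed directly from the p-values in a results file. The sketch below uses the usual definition (median observed 1-df chi-square statistic divided by its expected median, about 0.4549) and only the Python standard library; genomic_inflation is a hypothetical name, and the p-values are assumed to come from 1-df tests.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(p_values):
    """Estimate lambda as median(chi-square) / expected median under the null."""
    normal = NormalDist()
    # A two-sided p-value p corresponds to z = inv_cdf(p / 2); chi-square = z**2.
    chi2 = [normal.inv_cdf(p / 2) ** 2 for p in p_values if 0 < p <= 1]
    return median(chi2) / CHI2_1DF_MEDIAN
```

Values close to 1.0 indicate little inflation; recomputing λ after adding PCs or switching to GLMM shows whether the change helped.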
Issue: “GRM computation fails or runs out of memory”
Solutions:
Reduce max_threads (more memory per thread)
Increase max_memory setting
Ensure pruned file has reasonable number of SNPs (50-100K)
Check available system memory
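A rough size estimate helps decide whether the dense GRM will fit in memory. The sketch below assumes GCTA's dense binary format stores the lower triangle, including the diagonal, as 4-byte floats; it estimates the size of one such array and ignores working memory used during computation. grm_bin_bytes is a hypothetical name.

```python
def grm_bin_bytes(n_samples):
    """Approximate size in bytes of a dense lower-triangle GRM in single precision."""
    pairs = n_samples * (n_samples + 1) // 2  # lower triangle plus diagonal
    return pairs * 4                          # 4 bytes per value


def grm_bin_megabytes(n_samples):
    """Same estimate expressed in megabytes (10**6 bytes)."""
    return grm_bin_bytes(n_samples) / 1e6
```

Because the size grows with the square of the sample count, doubling the number of samples roughly quadruples the GRM.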
Issue: “COJO analysis produces no results”
Solutions:
Verify you have genome-wide significant hits
Check reference LD panel is appropriate
Ensure sufficient sample size
Review MAF threshold (not too high)
Issue: “Annotation fails”
Solutions:
Check internet connection (for auto-download)
Provide custom GTF file with gtf_path
Verify genome build matches your data
Check gene database is accessible
Performance Optimization
Speed up analysis:
Use GLM instead of GLMM when possible
Set recompute: false for completed steps
Reduce number of PCs if population homogeneous
Use keep_intermediate: false to save I/O
Allocate more threads for parallel processing
Reduce memory usage:
Reduce number of threads (more memory per thread)
Use sparse GRM for GLMM
Process chromosomes separately if needed
Close other applications during analysis
See Also
Configuration Guide - Full parameter reference
Quality Control Pipeline - Run QC before GWAS
Examples - Complete workflow examples
Troubleshooting Guide - Detailed problem-solving guide
Additional Resources
PLINK Documentation:
PLINK 2.0: https://www.cog-genomics.org/plink/2.0/
Logistic regression: https://www.cog-genomics.org/plink/2.0/assoc
GCTA Documentation:
GCTA overview: https://yanglab.westlake.edu.cn/software/gcta/
fastGWA: https://yanglab.westlake.edu.cn/software/gcta/#fastGWA
COJO: https://yanglab.westlake.edu.cn/software/gcta/#COJO