GWAS Pipeline

The GWAS (Genome-Wide Association Study) pipeline in IDEAL-GENOM provides comprehensive tools for performing association analysis between genetic variants and phenotypes. The pipeline includes preparatory steps, statistical analysis using both fixed and mixed effects models, and gene annotation of significant findings.

Overview

The GWAS pipeline consists of three main components:

  1. Preparatory Analysis: LD pruning and PCA decomposition to prepare data

  2. GLM Analysis: Fixed effects association testing using generalized linear models

  3. GLMM Analysis: Mixed model analysis accounting for relatedness and population structure

Key Features:

  • Automated LD pruning with high-LD region filtering

  • Principal component analysis for population stratification

  • Fixed effects (GLM) and random effects (GLMM) association testing

  • Independent signal identification using GCTA COJO

  • Automatic gene annotation with Ensembl or RefSeq

  • Support for binary (case-control) phenotypes

  • Resource-aware parallel processing

Prerequisites

Required Software:

  • PLINK 2.0 (for GLM analysis)

  • GCTA (for GLMM analysis and COJO)

  • Python 3.11+ with required packages

Required Data:

  • Quality-controlled PLINK binary files (.bed, .bim, .fam)

  • Phenotype data in .fam file (case=2, control=1, missing=0/-9)

  • High-LD regions file (auto-downloaded if not provided)

Recommended Preprocessing:

Run the QC pipeline first to ensure data quality:

# Run QC pipeline first
ideal-genom run --config qc_pipeline.yaml

# Then run GWAS on clean data
ideal-genom run --config gwas_pipeline.yaml

Quick Start

1. Get the GWAS Configuration Template

# Copy the GWAS template
cp yaml_configs/gwas_config_template.yaml my_gwas.yaml

2. Edit Configuration

Update paths to your QC-cleaned data:

pipeline:
  name: "my_gwas"
  base_output_dir: "/data/gwas_output"
  steps:
    - name: "preparatory"
      enabled: true
      init_params:
        input_path: "/data/qc_output/variant_qc/clean_files"
        input_name: "mydata_variantQCed"
        output_path: "${base_output_dir}"
        build: "38"

3. Run the Pipeline

ideal-genom run --config my_gwas.yaml

Pipeline Steps

Step 1: Preparatory Analysis

Prepares genetic data for association testing by performing LD pruning and PCA decomposition.

What it does:

  1. Filters high-LD regions

  2. Applies QC filters (MAF, missingness, HWE)

  3. Performs LD-based pruning to identify independent SNPs

  4. Computes principal components for population structure

Configuration:

- name: "preparatory"
  enabled: true
  module: "ideal_genom.gwas.preparatory"
  class: "Preparatory"
  init_params:
    input_path: "/data/qc/clean_files"
    input_name: "mydata_clean"
    output_path: "${base_output_dir}"
    output_name: "gwas_prep"
    high_ld_regions_file: "auto"
    build: "38"
  execute_params:
    # QC filters
    mind: 0.1                   # Max missing per individual
    maf: 0.01                   # Minor allele frequency
    geno: 0.1                   # Max missing per SNP
    hwe: 5.0e-6                 # Hardy-Weinberg p-value

    # LD pruning
    ind_pair: [50, 5, 0.2]      # Window, step, r² threshold

    # PCA
    pca: 10                     # Number of PCs to compute

    # Resources
    memory: null                # Auto-detect
    threads: null               # Auto-detect

Parameters Explained:

  • mind (float, 0-1): Maximum missing genotype rate per individual

  • maf (float, 0-0.5): Minor allele frequency threshold for filtering

  • geno (float, 0-1): Maximum missing genotype rate per SNP

  • hwe (float, 0-1): Hardy-Weinberg equilibrium p-value threshold

  • ind_pair (list): LD pruning [window_size, step_size, r²_threshold]

    • window_size: Number of variants in sliding window (default: 50)

    • step_size: Window shift size in variants (default: 5)

    • r² threshold: Prune variants with r² > threshold (default: 0.2)

  • pca (int): Number of principal components to compute (typically 10-20)

  • memory (int or null): Memory in MB (null = auto-detect 2/3 available RAM)

  • threads (int or null): CPU threads (null = auto-detect cores - 2)

Output Files:

preparatory/
├── gwas_prep-prunning.bed/bim/fam    # After QC filters
├── gwas_prep-prunning.prune.in       # SNPs passing LD pruning
├── gwas_prep-prunning.prune.out      # SNPs removed by LD pruning
├── gwas_prep-pruned.bed/bim/fam      # Pruned dataset
├── mydata_clean.eigenvec             # PC scores
└── mydata_clean.eigenval             # Eigenvalues (variance explained)

Step 2: GLM Analysis

Performs fixed effects association testing using generalized linear models with PCA covariates.

What it does:

  1. Runs logistic regression with PCA covariates

  2. Identifies genome-wide significant variants (p < 5×10⁻⁸)

  3. Uses GCTA COJO to find independent signals

  4. Annotates significant variants with gene information

Configuration:

- name: "gwas_glm"
  enabled: true
  module: "ideal_genom.gwas.gen_linear_model"
  class: "GWAS_GLM"
  init_params:
    input_path: "${steps.preparatory.input_path}"
    input_name: "${steps.preparatory.input_name}"
    output_path: "${base_output_dir}"
    output_name: "gwas_glm_results"
    recompute: true
  execute_params:
    # Association testing filters
    maf: 0.01                   # MAF threshold
    mind: 0.1                   # Missing per individual
    hwe: 5.0e-6                 # HWE p-value
    ci: 0.95                    # Confidence interval

    # Annotation
    gtf_path: null              # Custom GTF (null = auto-download)
    build: "38"                 # Genome build
    anno_source: "ensembl"      # "ensembl" or "refseq"

Parameters Explained:

  • maf (float): MAF threshold for variants included in analysis

  • mind (float): Maximum missing rate per individual

  • hwe (float): Hardy-Weinberg equilibrium p-value threshold

  • ci (float): Confidence interval for effect size estimates (0-1)

  • gtf_path (string or null): Path to custom GTF annotation file

  • build (string): “37” for GRCh37/hg19, “38” for GRCh38/hg38

  • anno_source (string): Annotation source - “ensembl” or “refseq”

  • recompute (bool): If false, skip if results already exist

Output Files:

gwas_glm/
├── gwas_glm_results_glm.PHENO1.glm.logistic.hybrid
│   # Full GWAS results with all tested variants
├── gwas_glm_results_glm.PHENO1.glm.logistic.hybrid.adjusted
│   # Results with adjusted p-values
├── cojo_file.ma
│   # Prepared for COJO analysis
├── gwas_glm_results-cojo.jma.cojo
│   # Independent genome-wide significant variants
└── top_hits_annotated.tsv
    # Annotated top hits with gene names, positions, effects

Step 3: GLMM Analysis

Performs mixed model association testing accounting for population structure and cryptic relatedness.

What it does:

  1. Computes genetic relationship matrix (GRM) from pruned SNPs

  2. Creates sparse GRM for computational efficiency

  3. Runs fastGWA mixed model with PCA and sex covariates

  4. Identifies independent signals using GCTA COJO

  5. Annotates significant variants with gene information

Configuration:

- name: "gwas_glmm"
  enabled: true
  module: "ideal_genom.gwas.gen_linear_mix_model"
  class: "GWAS_GLMM"
  init_params:
    input_path: "${steps.preparatory.input_path}"
    input_name: "${steps.preparatory.input_name}"
    output_path: "${base_output_dir}"
    output_name: "gwas_glmm_results"
    recompute: true
  execute_params:
    # Association parameters
    maf: 0.01

    # GRM computation
    pruned_file: "${steps.preparatory.pruned_file}"
    max_threads: null           # Auto-detect

    # Annotation
    gtf_path: null
    build: "38"
    anno_source: "ensembl"

Parameters Explained:

  • maf (float): MAF threshold for variants in analysis

  • pruned_file (string): Path to pruned PLINK files (without extension)

  • max_threads (int or null): Threads for GRM computation

  • Other parameters same as GLM

Output Files:

gwas_glmm/
├── mydata_clean_pheno.phen
│   # Phenotype file (FID, IID, phenotype)
├── mydata_clean_sex.covar
│   # Sex covariate file
├── mydata_clean_grm.grm.bin
│   # Genetic relationship matrix (binary)
├── mydata_clean_grm.grm.id
│   # Sample IDs for GRM
├── mydata_clean_grm.grm.N.bin
│   # Number of variants used per pair
├── mydata_clean_sparse.grm.sp
│   # Sparse GRM (efficient storage)
├── mydata_clean_sparse.grm.id
│   # Sample IDs for sparse GRM
├── gwas_glmm_results_assocSparseCovar_pca_sex-mlm-binary.fastGWA
│   # Full GWAS results
├── cojo_file.ma
│   # Prepared for COJO analysis
├── gwas_glmm_results-cojo.jma.cojo
│   # Independent genome-wide significant variants
└── top_hits_annotated.tsv
    # Annotated top hits

Complete Workflow Example

Full GWAS Configuration:

pipeline:
  name: "my_gwas_study"
  base_output_dir: "/data/gwas_results"

  steps:
    # Step 1: Prepare data
    - name: "preparatory"
      enabled: true
      module: "ideal_genom.gwas.preparatory"
      class: "Preparatory"
      init_params:
        input_path: "/data/qc/clean_files"
        input_name: "study_clean"
        output_path: "${base_output_dir}"
        output_name: "gwas_prep"
        high_ld_regions_file: "auto"
        build: "38"
      execute_params:
        mind: 0.02
        maf: 0.01
        geno: 0.02
        hwe: 1.0e-6
        ind_pair: [50, 5, 0.2]
        pca: 10

    # Step 2: Fixed effects analysis
    - name: "gwas_glm"
      enabled: true
      module: "ideal_genom.gwas.gen_linear_model"
      class: "GWAS_GLM"
      init_params:
        input_path: "${steps.preparatory.input_path}"
        input_name: "${steps.preparatory.input_name}"
        output_path: "${base_output_dir}"
        output_name: "gwas_glm"
        recompute: false
      execute_params:
        maf: 0.01
        mind: 0.02
        hwe: 1.0e-6
        ci: 0.95
        build: "38"
        anno_source: "ensembl"

    # Step 3: Mixed model analysis (optional)
    - name: "gwas_glmm"
      enabled: false  # Enable if needed
      module: "ideal_genom.gwas.gen_linear_mix_model"
      class: "GWAS_GLMM"
      init_params:
        input_path: "${steps.preparatory.input_path}"
        input_name: "${steps.preparatory.input_name}"
        output_path: "${base_output_dir}"
        output_name: "gwas_glmm"
        recompute: false
      execute_params:
        maf: 0.01
        pruned_file: "${steps.preparatory.pruned_file}"
        build: "38"
        anno_source: "ensembl"

settings:
  logging:
    level: "INFO"
    file_logging: true
  resources:
    max_memory: null
    max_threads: null
  files:
    keep_intermediate: true

Running the Analysis:

# Validate configuration
ideal-genom validate --config my_gwas.yaml

# Dry run to preview
ideal-genom run --config my_gwas.yaml --dry-run

# Execute pipeline
ideal-genom run --config my_gwas.yaml

Parameter Recommendations

Conservative Analysis (reduce false positives):

execute_params:
  maf: 0.05      # Higher MAF
  hwe: 1.0e-10   # Stricter HWE
  mind: 0.01     # Stricter missingness
  geno: 0.01

Liberal Analysis (increase power):

execute_params:
  maf: 0.01      # Lower MAF
  hwe: 1.0e-4    # Relaxed HWE
  mind: 0.05
  geno: 0.05

Rare Variant Analysis:

execute_params:
  maf: 0.001     # Include rare variants
  geno: 0.02     # Stricter missingness for rare variants
  hwe: 1.0e-6

Resource Management

For Large Datasets (> 500K SNPs, > 10K samples):

settings:
  resources:
    max_memory: 64000    # 64GB
    max_threads: 16      # Use many cores
  files:
    keep_intermediate: false  # Save disk space

For Small Datasets:

settings:
  resources:
    max_memory: 16000    # 16GB sufficient
    max_threads: 4
  files:
    keep_intermediate: true  # Keep for inspection

Troubleshooting

Common Issues and Solutions

Issue: “No genome-wide significant hits”

Solutions:

  1. Check sample size (need sufficient power)

  2. Verify phenotype coding (cases=2, controls=1)

  3. Review QC stringency (may be too strict)

  4. Check for population stratification

  5. Consider suggestive hits (p < 1×10⁻⁵)

Issue: “Inflation of test statistics (λ > 1.1)”

Solutions:

  1. Check for population stratification

  2. Increase number of PCs used as covariates

  3. Consider using GLMM instead of GLM

  4. Review sample quality (duplicates, relatedness)

Issue: “GRM computation fails or runs out of memory”

Solutions:

  1. Reduce max_threads (more memory per thread)

  2. Increase max_memory setting

  3. Ensure pruned file has reasonable number of SNPs (50-100K)

  4. Check available system memory

Issue: “COJO analysis produces no results”

Solutions:

  1. Verify you have genome-wide significant hits

  2. Check reference LD panel is appropriate

  3. Ensure sufficient sample size

  4. Review MAF threshold (not too high)

Issue: “Annotation fails”

Solutions:

  1. Check internet connection (for auto-download)

  2. Provide custom GTF file with gtf_path

  3. Verify genome build matches your data

  4. Check gene database is accessible

Performance Optimization

Speed up analysis:

  1. Use GLM instead of GLMM when possible

  2. Set recompute: false for completed steps

  3. Reduce number of PCs if population homogeneous

  4. Use keep_intermediate: false to save I/O

  5. Allocate more threads for parallel processing

Reduce memory usage:

  1. Reduce number of threads (more memory per thread)

  2. Use sparse GRM for GLMM

  3. Process chromosomes separately if needed

  4. Close other applications during analysis

See Also

Additional Resources

PLINK Documentation: - PLINK 2.0: https://www.cog-genomics.org/plink/2.0/ - Logistic regression: https://www.cog-genomics.org/plink/2.0/assoc

GCTA Documentation: - GCTA overview: https://yanglab.westlake.edu.cn/software/gcta/ - fastGWA: https://yanglab.westlake.edu.cn/software/gcta/#fastGWA - COJO: https://yanglab.westlake.edu.cn/software/gcta/#COJO