GWAS Pipeline ============= The GWAS (Genome-Wide Association Study) pipeline in IDEAL-GENOM provides comprehensive tools for performing association analysis between genetic variants and phenotypes. The pipeline includes preparatory steps, statistical analysis using both fixed and mixed effects models, and gene annotation of significant findings. Overview -------- The GWAS pipeline consists of three main components: 1. **Preparatory Analysis**: LD pruning and PCA decomposition to prepare data 2. **GLM Analysis**: Fixed effects association testing using generalized linear models 3. **GLMM Analysis**: Mixed model analysis accounting for relatedness and population structure **Key Features:** - Automated LD pruning with high-LD region filtering - Principal component analysis for population stratification - Fixed effects (GLM) and random effects (GLMM) association testing - Independent signal identification using GCTA COJO - Automatic gene annotation with Ensembl or RefSeq - Support for binary (case-control) phenotypes - Resource-aware parallel processing Prerequisites ------------- **Required Software:** - PLINK 2.0 (for GLM analysis) - GCTA (for GLMM analysis and COJO) - Python 3.11+ with required packages **Required Data:** - Quality-controlled PLINK binary files (.bed, .bim, .fam) - Phenotype data in .fam file (case=2, control=1, missing=0/-9) - High-LD regions file (auto-downloaded if not provided) **Recommended Preprocessing:** Run the QC pipeline first to ensure data quality: .. code-block:: bash # Run QC pipeline first ideal-genom run --config qc_pipeline.yaml # Then run GWAS on clean data ideal-genom run --config gwas_pipeline.yaml Quick Start ----------- **1. Get the GWAS Configuration Template** .. code-block:: bash # Copy the GWAS template cp yaml_configs/gwas_config_template.yaml my_gwas.yaml **2. Edit Configuration** Update paths to your QC-cleaned data: .. code-block:: yaml pipeline: name: "my_gwas" base_output_dir: "/data/gwas_output" steps: - name: "preparatory" enabled: true init_params: input_path: "/data/qc_output/variant_qc/clean_files" input_name: "mydata_variantQCed" output_path: "${base_output_dir}" build: "38" **3. Run the Pipeline** .. code-block:: bash ideal-genom run --config my_gwas.yaml Pipeline Steps -------------- Step 1: Preparatory Analysis ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Prepares genetic data for association testing by performing LD pruning and PCA decomposition. **What it does:** 1. Filters high-LD regions 2. Applies QC filters (MAF, missingness, HWE) 3. Performs LD-based pruning to identify independent SNPs 4. Computes principal components for population structure **Configuration:** .. code-block:: yaml - name: "preparatory" enabled: true module: "ideal_genom.gwas.preparatory" class: "Preparatory" init_params: input_path: "/data/qc/clean_files" input_name: "mydata_clean" output_path: "${base_output_dir}" output_name: "gwas_prep" high_ld_regions_file: "auto" build: "38" execute_params: # QC filters mind: 0.1 # Max missing per individual maf: 0.01 # Minor allele frequency geno: 0.1 # Max missing per SNP hwe: 5.0e-6 # Hardy-Weinberg p-value # LD pruning ind_pair: [50, 5, 0.2] # Window, step, r² threshold # PCA pca: 10 # Number of PCs to compute # Resources memory: null # Auto-detect threads: null # Auto-detect **Parameters Explained:** - **mind** (float, 0-1): Maximum missing genotype rate per individual - **maf** (float, 0-0.5): Minor allele frequency threshold for filtering - **geno** (float, 0-1): Maximum missing genotype rate per SNP - **hwe** (float, 0-1): Hardy-Weinberg equilibrium p-value threshold - **ind_pair** (list): LD pruning [window_size, step_size, r²_threshold] - window_size: Number of variants in sliding window (default: 50) - step_size: Window shift size in variants (default: 5) - r² threshold: Prune variants with r² > threshold (default: 0.2) - **pca** (int): Number of principal components to compute (typically 10-20) - **memory** (int or null): Memory in MB (null = auto-detect 2/3 available RAM) - **threads** (int or null): CPU threads (null = auto-detect cores - 2) **Output Files:** .. code-block:: text preparatory/ ├── gwas_prep-prunning.bed/bim/fam # After QC filters ├── gwas_prep-prunning.prune.in # SNPs passing LD pruning ├── gwas_prep-prunning.prune.out # SNPs removed by LD pruning ├── gwas_prep-pruned.bed/bim/fam # Pruned dataset ├── mydata_clean.eigenvec # PC scores └── mydata_clean.eigenval # Eigenvalues (variance explained) Step 2: GLM Analysis ^^^^^^^^^^^^^^^^^^^^ Performs fixed effects association testing using generalized linear models with PCA covariates. **What it does:** 1. Runs logistic regression with PCA covariates 2. Identifies genome-wide significant variants (p < 5×10⁻⁸) 3. Uses GCTA COJO to find independent signals 4. Annotates significant variants with gene information **Configuration:** .. code-block:: yaml - name: "gwas_glm" enabled: true module: "ideal_genom.gwas.gen_linear_model" class: "GWAS_GLM" init_params: input_path: "${steps.preparatory.input_path}" input_name: "${steps.preparatory.input_name}" output_path: "${base_output_dir}" output_name: "gwas_glm_results" recompute: true execute_params: # Association testing filters maf: 0.01 # MAF threshold mind: 0.1 # Missing per individual hwe: 5.0e-6 # HWE p-value ci: 0.95 # Confidence interval # Annotation gtf_path: null # Custom GTF (null = auto-download) build: "38" # Genome build anno_source: "ensembl" # "ensembl" or "refseq" **Parameters Explained:** - **maf** (float): MAF threshold for variants included in analysis - **mind** (float): Maximum missing rate per individual - **hwe** (float): Hardy-Weinberg equilibrium p-value threshold - **ci** (float): Confidence interval for effect size estimates (0-1) - **gtf_path** (string or null): Path to custom GTF annotation file - **build** (string): "37" for GRCh37/hg19, "38" for GRCh38/hg38 - **anno_source** (string): Annotation source - "ensembl" or "refseq" - **recompute** (bool): If false, skip if results already exist **Output Files:** .. code-block:: text gwas_glm/ ├── gwas_glm_results_glm.PHENO1.glm.logistic.hybrid │ # Full GWAS results with all tested variants ├── gwas_glm_results_glm.PHENO1.glm.logistic.hybrid.adjusted │ # Results with adjusted p-values ├── cojo_file.ma │ # Prepared for COJO analysis ├── gwas_glm_results-cojo.jma.cojo │ # Independent genome-wide significant variants └── top_hits_annotated.tsv # Annotated top hits with gene names, positions, effects Step 3: GLMM Analysis ^^^^^^^^^^^^^^^^^^^^^ Performs mixed model association testing accounting for population structure and cryptic relatedness. **What it does:** 1. Computes genetic relationship matrix (GRM) from pruned SNPs 2. Creates sparse GRM for computational efficiency 3. Runs fastGWA mixed model with PCA and sex covariates 4. Identifies independent signals using GCTA COJO 5. Annotates significant variants with gene information **Configuration:** .. code-block:: yaml - name: "gwas_glmm" enabled: true module: "ideal_genom.gwas.gen_linear_mix_model" class: "GWAS_GLMM" init_params: input_path: "${steps.preparatory.input_path}" input_name: "${steps.preparatory.input_name}" output_path: "${base_output_dir}" output_name: "gwas_glmm_results" recompute: true execute_params: # Association parameters maf: 0.01 # GRM computation pruned_file: "${steps.preparatory.pruned_file}" max_threads: null # Auto-detect # Annotation gtf_path: null build: "38" anno_source: "ensembl" **Parameters Explained:** - **maf** (float): MAF threshold for variants in analysis - **pruned_file** (string): Path to pruned PLINK files (without extension) - **max_threads** (int or null): Threads for GRM computation - Other parameters same as GLM **Output Files:** .. code-block:: text gwas_glmm/ ├── mydata_clean_pheno.phen │ # Phenotype file (FID, IID, phenotype) ├── mydata_clean_sex.covar │ # Sex covariate file ├── mydata_clean_grm.grm.bin │ # Genetic relationship matrix (binary) ├── mydata_clean_grm.grm.id │ # Sample IDs for GRM ├── mydata_clean_grm.grm.N.bin │ # Number of variants used per pair ├── mydata_clean_sparse.grm.sp │ # Sparse GRM (efficient storage) ├── mydata_clean_sparse.grm.id │ # Sample IDs for sparse GRM ├── gwas_glmm_results_assocSparseCovar_pca_sex-mlm-binary.fastGWA │ # Full GWAS results ├── cojo_file.ma │ # Prepared for COJO analysis ├── gwas_glmm_results-cojo.jma.cojo │ # Independent genome-wide significant variants └── top_hits_annotated.tsv # Annotated top hits Complete Workflow Example -------------------------- **Full GWAS Configuration:** .. code-block:: yaml pipeline: name: "my_gwas_study" base_output_dir: "/data/gwas_results" steps: # Step 1: Prepare data - name: "preparatory" enabled: true module: "ideal_genom.gwas.preparatory" class: "Preparatory" init_params: input_path: "/data/qc/clean_files" input_name: "study_clean" output_path: "${base_output_dir}" output_name: "gwas_prep" high_ld_regions_file: "auto" build: "38" execute_params: mind: 0.02 maf: 0.01 geno: 0.02 hwe: 1.0e-6 ind_pair: [50, 5, 0.2] pca: 10 # Step 2: Fixed effects analysis - name: "gwas_glm" enabled: true module: "ideal_genom.gwas.gen_linear_model" class: "GWAS_GLM" init_params: input_path: "${steps.preparatory.input_path}" input_name: "${steps.preparatory.input_name}" output_path: "${base_output_dir}" output_name: "gwas_glm" recompute: false execute_params: maf: 0.01 mind: 0.02 hwe: 1.0e-6 ci: 0.95 build: "38" anno_source: "ensembl" # Step 3: Mixed model analysis (optional) - name: "gwas_glmm" enabled: false # Enable if needed module: "ideal_genom.gwas.gen_linear_mix_model" class: "GWAS_GLMM" init_params: input_path: "${steps.preparatory.input_path}" input_name: "${steps.preparatory.input_name}" output_path: "${base_output_dir}" output_name: "gwas_glmm" recompute: false execute_params: maf: 0.01 pruned_file: "${steps.preparatory.pruned_file}" build: "38" anno_source: "ensembl" settings: logging: level: "INFO" file_logging: true resources: max_memory: null max_threads: null files: keep_intermediate: true **Running the Analysis:** .. code-block:: bash # Validate configuration ideal-genom validate --config my_gwas.yaml # Dry run to preview ideal-genom run --config my_gwas.yaml --dry-run # Execute pipeline ideal-genom run --config my_gwas.yaml Parameter Recommendations ^^^^^^^^^^^^^^^^^^^^^^^^^ **Conservative Analysis (reduce false positives):** .. code-block:: yaml execute_params: maf: 0.05 # Higher MAF hwe: 1.0e-10 # Stricter HWE mind: 0.01 # Stricter missingness geno: 0.01 **Liberal Analysis (increase power):** .. code-block:: yaml execute_params: maf: 0.01 # Lower MAF hwe: 1.0e-4 # Relaxed HWE mind: 0.05 geno: 0.05 **Rare Variant Analysis:** .. code-block:: yaml execute_params: maf: 0.001 # Include rare variants geno: 0.02 # Stricter missingness for rare variants hwe: 1.0e-6 Resource Management ^^^^^^^^^^^^^^^^^^^ **For Large Datasets (> 500K SNPs, > 10K samples):** .. code-block:: yaml settings: resources: max_memory: 64000 # 64GB max_threads: 16 # Use many cores files: keep_intermediate: false # Save disk space **For Small Datasets:** .. code-block:: yaml settings: resources: max_memory: 16000 # 16GB sufficient max_threads: 4 files: keep_intermediate: true # Keep for inspection Troubleshooting --------------- Common Issues and Solutions ^^^^^^^^^^^^^^^^^^^^^^^^^^^ **Issue: "No genome-wide significant hits"** Solutions: 1. Check sample size (need sufficient power) 2. Verify phenotype coding (cases=2, controls=1) 3. Review QC stringency (may be too strict) 4. Check for population stratification 5. Consider suggestive hits (p < 1×10⁻⁵) **Issue: "Inflation of test statistics (λ > 1.1)"** Solutions: 1. Check for population stratification 2. Increase number of PCs used as covariates 3. Consider using GLMM instead of GLM 4. Review sample quality (duplicates, relatedness) **Issue: "GRM computation fails or runs out of memory"** Solutions: 1. Reduce max_threads (more memory per thread) 2. Increase max_memory setting 3. Ensure pruned file has reasonable number of SNPs (50-100K) 4. Check available system memory **Issue: "COJO analysis produces no results"** Solutions: 1. Verify you have genome-wide significant hits 2. Check reference LD panel is appropriate 3. Ensure sufficient sample size 4. Review MAF threshold (not too high) **Issue: "Annotation fails"** Solutions: 1. Check internet connection (for auto-download) 2. Provide custom GTF file with gtf_path 3. Verify genome build matches your data 4. Check gene database is accessible Performance Optimization ^^^^^^^^^^^^^^^^^^^^^^^^ **Speed up analysis:** 1. Use GLM instead of GLMM when possible 2. Set ``recompute: false`` for completed steps 3. Reduce number of PCs if population homogeneous 4. Use ``keep_intermediate: false`` to save I/O 5. Allocate more threads for parallel processing **Reduce memory usage:** 1. Reduce number of threads (more memory per thread) 2. Use sparse GRM for GLMM 3. Process chromosomes separately if needed 4. Close other applications during analysis See Also -------- - :doc:`configuration` - Full parameter reference - :doc:`qc_pipeline` - Run QC before GWAS - :doc:`examples` - Complete workflow examples - :doc:`troubleshooting` - Detailed problem-solving guide Additional Resources -------------------- **PLINK Documentation:** - PLINK 2.0: https://www.cog-genomics.org/plink/2.0/ - Logistic regression: https://www.cog-genomics.org/plink/2.0/assoc **GCTA Documentation:** - GCTA overview: https://yanglab.westlake.edu.cn/software/gcta/ - fastGWA: https://yanglab.westlake.edu.cn/software/gcta/#fastGWA - COJO: https://yanglab.westlake.edu.cn/software/gcta/#COJO