Quality Control Pipeline ========================= The Quality Control (QC) pipeline in IDEAL-GENOM provides comprehensive tools for ensuring the quality and integrity of genotype data in case-control studies. The pipeline systematically addresses individual-level quality issues, population stratification, and variant-level quality metrics to produce analysis-ready data. Overview -------- The QC pipeline consists of three sequential components: 1. **Sample QC**: Individual-level quality control and relatedness analysis 2. **Ancestry QC**: Population structure detection and ancestry outlier removal 3. **Variant QC**: Marker-level quality filtering and Hardy-Weinberg equilibrium testing **Key Features:** - Automated multi-stage quality control workflow - Sex verification and heterozygosity analysis - Kinship and IBD-based relatedness detection - Ancestry projection with 1000 Genomes reference - Hardy-Weinberg equilibrium testing by case-control status - Differential missingness and batch effect detection - Resource-aware parallel processing - Publication-ready QC plots and comprehensive reports Use Cases --------- This pipeline is ideal for: - **GWAS Preparation**: Ensure data quality before association testing - **Case-Control Studies**: Remove biased samples and variants Prerequisites ------------- **Required Software:** - PLINK 1.9 (for functionalities not yet in PLINK 2.0) - PLINK 2.0 (for most QC operations) - Python 3.11+ with required packages (matplotlib, seaborn, pandas, numpy, scipy, umap-learn) **Required Data:** - PLINK binary files (.bed, .bim, .fam) - Phenotype data in .fam file (case=2, control=1, missing=0/-9) - 1000 Genomes reference data (auto-downloaded if not provided) **Input Data Requirements:** - Genome-wide genotype data (not pre-filtered) - At least 50,000 SNPs for reliable QC metrics - Both cases and controls with phenotype labels - Autosomal chromosomes (1-22) Quick Start ----------- **1. Get the QC Configuration Template** .. code-block:: bash # Copy the QC pipeline template cp yaml_configs/qc_pipeline_config_template.yaml my_qc_pipeline.yaml **2. Edit Configuration** Update paths to your data: .. code-block:: yaml pipeline: name: "my_qc_pipeline" base_output_dir: "/data/qc_output" steps: - name: "sample_qc" enabled: true init_params: input_path: "/data/raw_data" input_name: "mydata" output_path: "${base_output_dir}/sample_qc" build: "38" **3. Run the Pipeline** .. code-block:: bash # Validate configuration ideal-genom validate --config my_qc_pipeline.yaml # Run the full QC pipeline ideal-genom run --config my_qc_pipeline.yaml Pipeline Components ------------------- 1. Sample QC ~~~~~~~~~~~~ The first stage performs individual-level quality control to identify and remove problematic samples. **Quality Metrics:** - **Genotyping Call Rate**: Remove samples with high missingness (default: >2%) - **Sex Verification**: Check X chromosome homozygosity against reported sex (F<0.2=female, F>0.8=male) - **Heterozygosity Rate**: Identify samples with unusual heterozygosity (±3 SD from mean) - **Relatedness**: Detect duplicates, parent-offspring, siblings using kinship coefficients **Outputs:** - List of samples passing QC filters - Quality control plots (missingness, heterozygosity, sex check, kinship heatmap) - Cleaned PLINK files ready for ancestry analysis - Detailed QC metrics log **Configuration Example:** .. code-block:: yaml - name: "sample_qc" enabled: true init_params: input_path: "/data/raw_data" input_name: "mydata" output_path: "/data/qc_output/sample_qc" build: "38" run_params: geno_mind: 0.02 # Max individual missingness: 2% geno_snp: 0.05 # Max SNP missingness: 5% (for QC only) het_threshold: 3 # Heterozygosity SD threshold pihat_threshold: 0.1875 # Relatedness threshold (2nd degree) maf: 0.01 # MAF threshold for LD pruning **Typical Sample Exclusions:** - Genotyping call rate <98% (~0.5-2% of samples) - Sex discordance (XX as male or XY as female, ~0.1-0.5%) - Extreme heterozygosity (possible contamination, ~0.1-0.5%) - Cryptic relatedness (duplicates/relatives, ~1-5%) 2. Ancestry QC ~~~~~~~~~~~~~~ The second stage projects samples onto reference populations to identify ancestry outliers and ensure population homogeneity. **Methodology:** 1. **Reference Integration**: Merge study data with 1000 Genomes Phase 3 reference 2. **LD Pruning**: Prune to independent SNPs for ancestry analysis (r²<0.2) 3. **PCA Projection**: Compute principal components with reference populations 4. **Outlier Detection**: Identify samples >6 SD from population centroid **Reference Populations:** - **EUR**: European (British, Finnish, Spanish, Italian, Iberian) - **AFR**: African (Yoruba, Luhya, Gambian, Mende, Esan) - **EAS**: East Asian (Han Chinese, Japanese, Southern Han Chinese, Dai, Kinh) - **SAS**: South Asian (Gujarati, Punjabi, Bengali, Tamil, Telugu) - **AMR**: Admixed American (Mexican, Puerto Rican, Colombian, Peruvian) **Outputs:** - PCA plots showing study samples vs. reference populations - List of ancestry outliers - Cleaned PLINK files ready for variant QC - Population assignment for each sample **Configuration Example:** .. code-block:: yaml - name: "ancestry_qc" enabled: true init_params: input_path: "${base_output_dir}/sample_qc" input_name: "mydata_SampleQCed" output_path: "${base_output_dir}/ancestry_qc" build: "38" reference_1kG: "/data/references/1000genomes_build_38" reference_1kG_name: "1kG_phase3_GRCh38" run_params: expected_population: "EUR" ancestry_filter: true # Remove outliers sd_threshold: 6 # SD for outlier detection **Typical Ancestry Exclusions:** - Population outliers 3 SD from centroid (~0.5-2% for homogeneous cohorts) using the Chebyshev metric as default. - Visual inspection of PCA plots recommended 3. Variant QC ~~~~~~~~~~~~~ The final stage performs marker-level quality control to ensure variant reliability. **Quality Filters:** - **SNP Call Rate**: Remove variants with high missingness (default: >2%) - **Minor Allele Frequency (MAF)**: Filter rare variants (default: <1%) - **Hardy-Weinberg Equilibrium**: Separate HWE testing in cases (p<1e-10) and controls (p<1e-6) - **Differential Missingness**: Remove variants with case-control missingness differences (p<1e-5) **Outputs:** - Analysis-ready PLINK binary files - Variant-level QC statistics - Manhattan plots showing removed variants - Summary report with variant counts **Configuration Example:** .. code-block:: yaml - name: "variant_qc" enabled: true init_params: input_path: "${base_output_dir}/ancestry_qc" input_name: "mydata_AncestryQCed" output_path: "${base_output_dir}/variant_qc" build: "38" run_params: geno: 0.02 # Max SNP missingness: 2% maf: 0.01 # Min MAF: 1% hwe_control: 1e-6 # HWE p-value in controls hwe_case: 1e-10 # HWE p-value in cases diff_miss: 1e-5 # Differential missingness p-value **Typical Variant Exclusions:** - Genotyping call rate <98% (~1-3% of variants) - MAF <1% (~20-40% of variants, depending on array) - HWE violations (~0.1-0.5% in controls, ~0.01-0.1% in cases) - Differential missingness (~0.1-0.5%) - Final retention: typically 60-80% of initial markers Advanced Configuration ---------------------- Data Flow ~~~~~~~~~ The pipeline handles data flow automatically using variable substitution: .. code-block:: yaml pipeline: name: "comprehensive_qc" base_output_dir: "/data/qc_results" steps: - name: "sample_qc" init_params: output_path: "${base_output_dir}/step1_sample" - name: "ancestry_qc" init_params: # Automatically uses output from sample_qc input_path: "${base_output_dir}/step1_sample" output_path: "${base_output_dir}/step2_ancestry" - name: "variant_qc" init_params: # Automatically uses output from ancestry_qc input_path: "${base_output_dir}/step2_ancestry" output_path: "${base_output_dir}/step3_variant" Resource Management ~~~~~~~~~~~~~~~~~~~ Control computational resources for each step: .. code-block:: yaml - name: "sample_qc" init_params: threads: 8 # CPU cores memory_gb: 16 # RAM allocation run_params: run_het: true # Enable heterozygosity check run_ibd: true # Enable relatedness analysis Population-Specific Settings ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Adjust parameters for different populations: .. code-block:: yaml # European cohort (default settings work well) - name: "ancestry_qc" run_params: expected_population: "EUR" sd_threshold: 6 # Admixed cohort (more lenient thresholds) - name: "ancestry_qc" run_params: expected_population: "AMR" sd_threshold: 8 # More permissive ancestry_filter: false # Keep all samples # African cohort - name: "ancestry_qc" run_params: expected_population: "AFR" sd_threshold: 6 Interpretation Guide -------------------- Sample QC Metrics ~~~~~~~~~~~~~~~~~ **Call Rate** - Normal: >98% (most samples should have >99%) - Problematic: <98% indicates genotyping failure - Action: Remove samples <98% call rate **Heterozygosity** - Normal: Within ±3 SD of cohort mean - High: May indicate DNA contamination - Low: May indicate inbreeding or DNA quality issues - Action: Remove outliers beyond ±3 SD **Sex Discordance** - F statistic: <0.2 (XX/female), >0.8 (XY/male), 0.2-0.8 (ambiguous) - Discordance: F statistic doesn't match reported sex - Action: Remove discordant samples or verify phenotype file **Relatedness** - PI_HAT: 0.5 (parent-offspring/full sibling), 0.25 (half-sibling), 0.125 (first cousin) - Threshold: 0.1875 (between 2nd and 3rd degree) - Action: Remove one sample from each related pair (keep higher call rate) Ancestry QC Metrics ~~~~~~~~~~~~~~~~~~~ **PCA Interpretation** - PC1 typically separates African vs. non-African - PC2 typically separates European vs. East Asian - Outliers: >6 SD from expected population centroid - Visual inspection recommended for borderline cases **Population Stratification** - Presence of clusters suggests population substructure - May need to include PCs as covariates in GWAS - Consider ancestry-stratified analysis for multi-ethnic cohorts Variant QC Metrics ~~~~~~~~~~~~~~~~~~ **Hardy-Weinberg Equilibrium** - Controls: Strict threshold (p<1e-6) - genotyping errors - Cases: Lenient threshold (p<1e-10) - allow disease-associated variants - Rationale: Disease SNPs may violate HWE in cases due to selection **Differential Missingness** - Tests: Is missingness rate different between cases and controls? - Significance: p<1e-5 indicates batch effects or technical artifacts - Action: Remove variants with significant differential missingness **MAF Filtering** - Threshold: Typically 1% for GWAS, higher (5%) for rare variant analysis - Rationale: Rare variants have low statistical power and higher error rates - Consider: Keep all variants for imputation reference panels Best Practices -------------- Workflow Recommendations ~~~~~~~~~~~~~~~~~~~~~~~~ 1. **Pre-QC Checks** - Verify input file integrity (check .bim, .bed, .fam consistency) - Ensure phenotypes are correctly coded (1=control, 2=case) - Review number of samples and variants before QC 2. **QC Order Matters** - Always run: Sample QC → Ancestry QC → Variant QC - Rationale: Poor samples bias variant statistics; ancestry outliers bias association tests 3. **Iterative QC** - Run QC, review plots, adjust thresholds if needed - Document all threshold changes and rationale - Rerun with adjusted parameters if necessary 4. **Documentation** - Save all QC plots and summary statistics - Record number of samples/variants removed at each step - Document rationale for any non-standard thresholds Troubleshooting ~~~~~~~~~~~~~~~ **High Sample Failure Rate (>10%)** - Check: DNA quality, genotyping platform issues - Review: Call rate distributions, batch effects - Action: May need to adjust thresholds or exclude problematic batches **Many Ancestry Outliers (>5%)** - Check: Is expected_population correct? - Review: PCA plots - is there substructure? - Action: Consider ancestry-stratified QC or adjust sd_threshold **Excessive Variant Removal (>50%)** - Check: Are MAF and HWE thresholds too strict? - Review: Variant call rate distributions - Action: May indicate genotyping quality issues **No Relatedness Detected (Unexpected)** - Check: Are kinship calculations enabled (run_ibd: true)? - Review: Are there family structures in study design? - Action: Verify input data includes expected relatives Output Files ------------ Each QC step produces standardized outputs: **Sample QC Outputs:** - ``mydata_SampleQCed.bed/bim/fam`` - Cleaned PLINK files - ``sample_qc_report.txt`` - Summary statistics - ``missingness_plot.png`` - Sample call rate distribution - ``heterozygosity_plot.png`` - F coefficient distribution - ``sex_check_plot.png`` - X chromosome homozygosity - ``kinship_heatmap.png`` - Relatedness matrix - ``removed_samples.txt`` - List of excluded samples with reasons **Ancestry QC Outputs:** - ``mydata_AncestryQCed.bed/bim/fam`` - Cleaned PLINK files - ``pca_plot_PC1_PC2.png`` - Primary PCs with reference - ``pca_plot_PC3_PC4.png`` - Secondary PCs - ``ancestry_outliers.txt`` - List of outlier samples - ``pca_coordinates.txt`` - PC scores for all samples **Variant QC Outputs:** - ``mydata_variantQCed.bed/bim/fam`` - Analysis-ready PLINK files - ``variant_qc_summary.txt`` - Filtering statistics - ``hwe_case_plot.png`` - HWE distribution in cases - ``hwe_control_plot.png`` - HWE distribution in controls - ``maf_distribution.png`` - Allele frequency spectrum - ``removed_variants.txt`` - List of excluded variants with reasons Next Steps ---------- After completing QC, the clean data is ready for: 1. **GWAS Analysis** .. code-block:: bash # Use QC output as GWAS input ideal-genom run --config gwas_config.yaml 2. **Imputation** - Upload to Michigan/TopMed imputation servers - Use ``mydata_variantQCed`` files for pre-imputation QC 3. **Population Structure Analysis** .. code-block:: bash # Compute FST statistics, run PCA/UMAP ideal-genom run --config population_config.yaml See Also -------- - :doc:`configuration` - Detailed YAML configuration reference - :doc:`SampleQC` - Sample QC API documentation - :doc:`AncestryQC` - Ancestry QC API documentation - :doc:`VariantQC` - Variant QC API documentation - :doc:`examples` - Example configurations and workflows - :doc:`gwas_pipeline` - GWAS pipeline documentation