Configuration Guide
===================

This guide explains the YAML-based configuration system in IDEAL-GENOM v1.1.0. The configuration file controls all aspects of your genomic analysis pipeline, from data paths to QC thresholds.

Overview
--------

IDEAL-GENOM uses a **single YAML configuration file** that defines:

- Pipeline metadata (name, output directory)
- Analysis steps to execute (QC, GWAS, VCF processing)
- Parameters for each step (thresholds, options)
- Global settings (logging, resources, file handling)

**Benefits of YAML Configuration:**

- **Single Source of Truth**: All settings in one file
- **Hierarchical Structure**: Clear organization of related parameters
- **Variable Substitution**: Reference values dynamically (e.g., ``${base_output_dir}``)
- **Step Control**: Enable/disable steps without editing code
- **Self-Documenting**: Comments explain parameters inline

Configuration File Structure
-----------------------------

A configuration file has three main parts: pipeline metadata, the list of steps, and global settings:

.. code-block:: yaml

   pipeline:
     # Pipeline metadata and steps
     name: "my_analysis"
     base_output_dir: "/path/to/output"
     steps:
       - name: "step_name"
         # Step configuration...

   settings:
     # Global settings
     logging: { ... }
     resources: { ... }
     files: { ... }

Getting Started with Configuration
-----------------------------------

**1. Start from a Template**

Copy a template from the repository:

.. code-block:: bash

   cp yaml_configs/qc_pipeline_config_template.yaml my_config.yaml

**2. Edit Required Fields**

At minimum, update these paths:

.. code-block:: yaml

   pipeline:
     base_output_dir: "/your/output/path"  # Where results will be saved
     steps:
       - name: "sample_qc"
         init_params:
           input_path: "/your/input/path"  # Where your PLINK files are
           input_name: "your_dataset"      # PLINK file prefix (without .bed/.bim/.fam)

**3. Validate Your Configuration**

.. code-block:: bash

   ideal-genom validate --config my_config.yaml

Pipeline Section
----------------

The ``pipeline`` section defines your analysis workflow.

Pipeline Metadata
^^^^^^^^^^^^^^^^^

.. code-block:: yaml

   pipeline:
     name: "my_study_qc"              # Descriptive name for this analysis
     base_output_dir: "/data/output"  # Root directory for all outputs

**name** (string, required)
   A descriptive identifier for your pipeline. Used in logging and output organization.

**base_output_dir** (string, required)
   Absolute path where all pipeline outputs will be saved. Each step creates subdirectories here.

Pipeline Steps
^^^^^^^^^^^^^^

Steps are executed in the order listed:

.. code-block:: yaml

   pipeline:
     steps:
       - name: "sample_qc"
         enabled: true
         module: "ideal_genom.qc.sample_qc"
         class: "SampleQC"
         init_params:
           # Parameters passed to class __init__
         execute_params:
           # Parameters passed to execute() method

**name** (string, required)
   Unique identifier for this step. Used for variable substitution and logging.

**enabled** (boolean, required)
   Set to ``true`` to run this step, ``false`` to skip it.

**module** (string, required)
   Python module path containing the step's class.

**class** (string, required)
   Class name to instantiate for this step.

**init_params** (mapping, required)
   Parameters passed to the class constructor (``__init__``).

**execute_params** (mapping, optional)
   Parameters passed to the ``execute()`` method when running the step.

Variable Substitution
^^^^^^^^^^^^^^^^^^^^^

Reference values from elsewhere in the configuration:

.. code-block:: yaml

   pipeline:
     base_output_dir: "/data/output"
     steps:
       - name: "sample_qc"
         init_params:
           output_path: "${base_output_dir}"  # Expands to /data/output

       - name: "variant_qc"
         init_params:
           # Use output from previous step
           input_path: "${steps.sample_qc.clean_dir}"
           input_name: "${steps.sample_qc.output_name}"

**Available substitutions:**

- ``${base_output_dir}`` - Pipeline's base output directory
- ``${steps.STEP_NAME.ATTRIBUTE}`` - Attributes from previous steps

  - ``.clean_dir`` - Path to clean output files
  - ``.output_name`` - Output file prefix
  - ``.output_path`` - Output directory path

QC Pipeline Configuration
--------------------------

Sample QC Step
^^^^^^^^^^^^^^

Performs individual-level quality control:

.. code-block:: yaml

   - name: "sample_qc"
     enabled: true
     module: "ideal_genom.qc.sample_qc"
     class: "SampleQC"
     init_params:
       input_path: "/data/input"           # Directory containing PLINK files
       input_name: "mydata"                # PLINK file prefix
       output_path: "${base_output_dir}"   # Output directory
       output_name: "mydata_sampleQCed"    # Output file prefix
       high_ld_regions_file: "auto"        # LD regions file (or "auto" for built-in)
       build: "38"                         # Genome build: "37" or "38"
     execute_params:
       rename_snp: true          # Rename SNPs to chr:pos format
       hh_to_missing: true       # Convert heterozygous haploid to missing
       use_kinship: true         # Use kinship instead of IBD
       ind_pair: [50, 5, 0.2]    # LD pruning [window, step, r²]
       mind: 0.02                # Max missing rate per individual
       sex_check: [0.2, 0.8]     # F coefficient [female_max, male_min]
       maf: 0.01                 # Minor allele frequency threshold
       het_deviation: 3          # Heterozygosity SD threshold
       kinship: 0.354            # Kinship coefficient threshold
       ibd_threshold: 0.185      # IBD threshold for duplicates

**init_params:**

- **input_path** (string): Directory containing input .bed/.bim/.fam files
- **input_name** (string): PLINK file prefix (e.g., "mydata" for mydata.bed)
- **output_path** (string): Where to save QC results
- **output_name** (string): Prefix for output files
- **high_ld_regions_file** (string): Path to high-LD regions file, or "auto" to use built-in
- **build** (string): Genome build version - "37" (GRCh37/hg19) or "38" (GRCh38/hg38)

**execute_params:**

- **rename_snp** (bool): Rename SNPs to chr:pos format for consistency
- **hh_to_missing** (bool): Convert heterozygous haploid calls to missing
- **use_kinship** (bool): Use KING kinship estimation (recommended over IBD)
- **ind_pair** (list): LD pruning parameters [window_size, step_size, r²_threshold]

  - window_size: SNP window in variant count (default: 50)
  - step_size: Step size in variant count (default: 5)
  - r² threshold: Correlation threshold (default: 0.2)

- **mind** (float, 0-1): Maximum missing genotype rate per individual (default: 0.02 = 2%)
- **sex_check** (list[float]): F coefficient thresholds [female_max, male_min]

  - female_max: Maximum F for females (default: 0.2)
  - male_min: Minimum F for males (default: 0.8)
  - Samples outside these ranges fail sex check

- **maf** (float, 0-0.5): Minor allele frequency threshold for LD pruning
- **het_deviation** (float): Standard deviations from mean heterozygosity (default: 3)
- **kinship** (float): Kinship coefficient threshold for relatedness. For reference, the standard KING inference ranges are:

  - above 0.354: duplicates or monozygotic twins
  - 0.177 to 0.354: 1st degree relatives
  - 0.0884 to 0.177: 2nd degree relatives
  - 0.0442 to 0.0884: 3rd degree relatives

- **ibd_threshold** (float): IBD threshold for identifying duplicates/monozygotic twins

Ancestry QC Step
^^^^^^^^^^^^^^^^

Detects population structure and removes ancestry outliers:

.. code-block:: yaml

   - name: "ancestry_qc"
     enabled: true
     module: "ideal_genom.qc.ancestry_qc"
     class: "AncestryQC"
     init_params:
       input_path: "${steps.sample_qc.clean_dir}"
       input_name: "${steps.sample_qc.output_name}"
       output_path: "${base_output_dir}"
       output_name: "mydata_ancestryQCed"
       high_ld_regions_file: "auto"
       build: "38"
     execute_params:
       ind_pair: [50, 5, 0.2]        # LD pruning for PCA
       pca: 10                       # Number of principal components
       maf: 0.01                     # MAF threshold for PCA
       ref_threshold: 4              # SD threshold for reference outliers
       stu_threshold: 4              # SD threshold for study outliers
       reference_pop: "EUR"          # Expected population
       num_pcs: 10                   # PCs for ancestry assignment
       distance_metric: "infinity"   # Distance metric for outlier detection

**execute_params:**

- **ind_pair** (list): LD pruning parameters for PCA variants
- **pca** (int): Number of principal components to compute
- **maf** (float): MAF threshold for variants included in PCA
- **ref_threshold** (float): Standard deviations for reference population outliers
- **stu_threshold** (float): Standard deviations for study population outliers
- **reference_pop** (string): Expected population ancestry

  - "EUR": European
  - "AFR": African
  - "AMR": Admixed American
  - "EAS": East Asian
  - "SAS": South Asian

- **num_pcs** (int): Number of PCs used for ancestry classification
- **distance_metric** (string): "euclidean", "manhattan", or "infinity" (Chebyshev)

Variant QC Step
^^^^^^^^^^^^^^^

Performs variant-level quality control:

.. code-block:: yaml

   - name: "variant_qc"
     enabled: true
     module: "ideal_genom.qc.variant_qc"
     class: "VariantQC"
     init_params:
       input_path: "${steps.ancestry_qc.clean_dir}"
       input_name: "${steps.ancestry_qc.output_name}"
       output_path: "${base_output_dir}"
       output_name: "mydata_variantQCed"
     execute_params:
       miss_data_rate: 0.02          # Max missing rate across samples
       diff_genotype_rate: 1.0e-5    # Differential missingness p-value
       geno: 0.02                    # Max missing rate per variant
       maf: 0.01                     # Minor allele frequency
       hwe: 1.0e-6                   # Hardy-Weinberg equilibrium p-value
       chr_y: 24                     # Y chromosome identifier

**execute_params:**

- **miss_data_rate** (float, 0-1): Maximum overall missing data rate threshold
- **diff_genotype_rate** (float): P-value threshold for differential missingness between cases/controls
- **geno** (float, 0-1): Maximum missing genotype rate per variant
- **maf** (float, 0-0.5): Minor allele frequency threshold

  - Standard GWAS: 0.01-0.05
  - Rare variant analysis: 0.001-0.01
  - Very permissive: 0.001

- **hwe** (float, 0-1): Hardy-Weinberg equilibrium p-value threshold

  - Standard: 1e-6
  - Strict: 1e-10 (for genotyping array data)
  - Relaxed: 1e-4

- **chr_y** (int): Y chromosome identifier (23 for hg19, 24 for hg38)

Population Analysis Step
^^^^^^^^^^^^^^^^^^^^^^^^

Performs dimensionality reduction and population visualization:

.. code-block:: yaml

   - name: "dimensionality_reduction"
     enabled: true
     module: "ideal_genom.population.projection"
     class: "DimensionalityReductionPipeline"
     init_params:
       input_path: "${steps.variant_qc.clean_dir}"
       input_name: "${steps.variant_qc.output_name}"
       output_path: "${base_output_dir}"
       build: "38"
       high_ld_regions_file: "auto"
       generate_plot: true
     execute_params:
       # PCA parameters
       pca_params:
         pca: 10
       force_pca_recompute: false

       # UMAP parameters
       run_umap: true
       umap_params:
         n_neighbors: 15
         min_dist: 0.1
         n_components: 2

       # t-SNE parameters
       run_tsne: true
       tsne_params:
         perplexity: 30

       # Plotting options
       case_control_markers: true
       plot_format: "png"
       dpi: 600

**execute_params:**

- **pca_params** (mapping): PCA configuration

  - **pca** (int): Number of components to compute

- **force_pca_recompute** (bool): Recompute PCA even if results exist
- **run_umap** (bool): Enable UMAP analysis
- **umap_params** (mapping): UMAP configuration

  - **n_neighbors** (int): Number of neighbors (5-50, default: 15)
  - **min_dist** (float): Minimum distance (0.0-1.0, default: 0.1)
  - **n_components** (int): Output dimensions (typically 2 or 3)

- **run_tsne** (bool): Enable t-SNE analysis
- **tsne_params** (mapping): t-SNE configuration

  - **perplexity** (int): Perplexity value (5-50, default: 30)

- **case_control_markers** (bool): Color by case/control status
- **plot_format** (string): "png", "svg", or "pdf"
- **dpi** (int): Plot resolution (default: 600)

Settings Section
----------------

Global settings that apply to the entire pipeline:

Logging Settings
^^^^^^^^^^^^^^^^

.. code-block:: yaml

   settings:
     logging:
       level: "INFO"           # Logging verbosity
       file_logging: true      # Write to log file
       console_logging: true   # Print to console

**level** (string): Log message detail level

- "DEBUG": Very detailed, for troubleshooting
- "INFO": Standard informational messages (recommended)
- "WARNING": Only warnings and errors
- "ERROR": Only errors

**file_logging** (bool): Save logs to ``pipeline.log`` in output directory

**console_logging** (bool): Print log messages to terminal

Resource Settings
^^^^^^^^^^^^^^^^^

.. code-block:: yaml

   settings:
     resources:
       max_memory: null    # Maximum memory in MB
       max_threads: null   # Maximum CPU threads

**max_memory** (int or null): Maximum memory allocation in MB

- ``null``: Auto-detect (uses 2/3 of available RAM)
- Explicit value: Set specific limit (e.g., 32000 for 32GB)

**max_threads** (int or null): Maximum CPU threads to use

- ``null``: Auto-detect (uses available cores - 2)
- Explicit value: Set specific number

File Management Settings
^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: yaml

   settings:
     files:
       keep_intermediate: true     # Preserve temporary files
       compress_outputs: false     # Compress output files
       overwrite_existing: false   # Overwrite existing results

**keep_intermediate** (bool): Keep temporary intermediate files

- ``true``: Keep all files (useful for debugging)
- ``false``: Clean up after each step (saves disk space)

**compress_outputs** (bool): Compress output files with gzip

**overwrite_existing** (bool): Overwrite existing output files

- ``true``: Overwrite without asking
- ``false``: Fail if outputs exist (safer)

Report Generation Settings
^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: yaml

   settings:
     reports:
       generate_reports: true   # Generate visualization reports
       plot_format: "png"       # Plot file format

**generate_reports** (bool): Automatically generate QC plots and reports

**plot_format** (string): Output format for plots

- "png": Standard format, good quality
- "svg": Vector format, scalable
- "pdf": Publication-ready format

Advanced Configuration Patterns
--------------------------------

Conditional Step Execution
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Skip steps based on your needs:

.. code-block:: yaml

   pipeline:
     steps:
       - name: "sample_qc"
         enabled: true

       - name: "ancestry_qc"
         enabled: false  # Skip for homogeneous population

       - name: "variant_qc"
         enabled: true
         init_params:
           # Connect directly to sample QC
           input_path: "${steps.sample_qc.clean_dir}"

Using Pre-existing Results
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Resume the pipeline from an intermediate step:

.. code-block:: yaml

   pipeline:
     steps:
       - name: "sample_qc"
         enabled: false  # Already completed

       - name: "variant_qc"
         enabled: true
         init_params:
           # Use existing sample QC output
           input_path: "/data/output/my_study/sample_qc/clean_files"
           input_name: "mydata_sampleQCed"

Multiple Output Directories
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Organize outputs by analysis type:

.. code-block:: yaml

   pipeline:
     base_output_dir: "/data/project"
     steps:
       - name: "sample_qc"
         init_params:
           output_path: "${base_output_dir}/qc_results"

       - name: "gwas_prep"
         init_params:
           output_path: "${base_output_dir}/gwas_analysis"

Parameter Tuning Guidelines
----------------------------

Sample QC Thresholds
^^^^^^^^^^^^^^^^^^^^

**For Standard Case-Control GWAS:**

- mind: 0.02 (2% missing)
- maf: 0.01 (1% MAF)
- het_deviation: 3 SD
- kinship: 0.354 (KING cutoff for duplicates/monozygotic twins; lower to 0.177 to also exclude 1st degree relatives)

**For Rare Variant Analysis:**

- mind: 0.01 (stricter)
- maf: 0.001 (include rare variants)
- het_deviation: 4 SD (more lenient)

**For Family-Based Studies:**

- kinship: 0.088 (allow up to 3rd degree relatives)
- Adjust sex_check if samples include children

Ancestry QC Thresholds
^^^^^^^^^^^^^^^^^^^^^^^

**For Homogeneous Populations:**

- ref_threshold: 6 SD (softer)
- stu_threshold: 6 SD (softer)
- Consider disabling ancestry QC entirely

Variant QC Thresholds
^^^^^^^^^^^^^^^^^^^^^^

**For Array-Based Data:**

- geno: 0.02 (2% missing)
- hwe: 1e-10 (very strict)
- maf: 0.01

**For Sequencing Data:**

- geno: 0.05 (more lenient)
- hwe: 1e-6 (standard)
- maf: 0.001 (include rare variants)

Common Configuration Examples
------------------------------

Minimal QC Pipeline
^^^^^^^^^^^^^^^^^^^

.. code-block:: yaml

   pipeline:
     name: "minimal_qc"
     base_output_dir: "/data/output"
     steps:
       - name: "sample_qc"
         enabled: true
         module: "ideal_genom.qc.sample_qc"
         class: "SampleQC"
         init_params:
           input_path: "/data/input"
           input_name: "mydata"
           output_path: "${base_output_dir}"
           output_name: "mydata_clean"
           high_ld_regions_file: "auto"
           build: "38"
         execute_params:
           mind: 0.02
           maf: 0.01

   settings:
     logging:
       level: "INFO"

Complete QC with Ancestry
^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: yaml

   pipeline:
     name: "full_qc"
     base_output_dir: "/data/output"
     steps:
       - name: "sample_qc"
         enabled: true
         module: "ideal_genom.qc.sample_qc"
         class: "SampleQC"
         init_params:
           input_path: "/data/input"
           input_name: "mydata"
           output_path: "${base_output_dir}"
           output_name: "mydata_sampleQCed"
           high_ld_regions_file: "auto"
           build: "38"
         execute_params:
           mind: 0.02
           sex_check: [0.2, 0.8]
           maf: 0.01
           het_deviation: 3
           kinship: 0.354

       - name: "ancestry_qc"
         enabled: true
         module: "ideal_genom.qc.ancestry_qc"
         class: "AncestryQC"
         init_params:
           input_path: "${steps.sample_qc.clean_dir}"
           input_name: "${steps.sample_qc.output_name}"
           output_path: "${base_output_dir}"
           output_name: "mydata_ancestryQCed"
           high_ld_regions_file: "auto"
           build: "38"
         execute_params:
           pca: 10
           ref_threshold: 4
           stu_threshold: 4
           reference_pop: "EUR"

       - name: "variant_qc"
         enabled: true
         module: "ideal_genom.qc.variant_qc"
         class: "VariantQC"
         init_params:
           input_path: "${steps.ancestry_qc.clean_dir}"
           input_name: "${steps.ancestry_qc.output_name}"
           output_path: "${base_output_dir}"
           output_name: "mydata_final"
         execute_params:
           geno: 0.02
           maf: 0.01
           hwe: 1.0e-6

Troubleshooting Configuration
------------------------------

**Configuration validation fails:**

1. Check YAML syntax (indentation, colons, quotes)
2. Verify all required fields are present
3. Ensure paths exist and are accessible
4. Check module and class names are correct

**Pipeline runs but produces no output:**

1. Verify ``enabled: true`` for desired steps
2. Check input file paths are correct
3. Review ``pipeline.log`` for errors
4. Ensure output directory is writable

**Memory errors:**

1. Set ``max_memory`` explicitly
2. Reduce ``max_threads`` to free memory
3. Process datasets in batches
4. Set ``keep_intermediate: false`` to free disk space

**Variable substitution not working:**

1. Ensure correct syntax: ``${variable_name}``
2. Check referenced step names match exactly
3. Verify step order (can't reference future steps)

See Also
--------

- :doc:`getting_started` - Quick start guide
- :doc:`examples` - Complete workflow examples
- :doc:`troubleshooting` - Detailed problem-solving
- :doc:`faq` - Frequently asked questions

**Docker Paths:**

When using Docker, paths should be relative to the container's ``/data`` directory:

.. code-block:: json

   {
     "input_directory": "/data/inputData",
     "input_prefix": "mydata",
     "output_directory": "/data/outputData",
     "output_prefix": "clean_data",
     "high_ld_file": "/data/dependables/high-LD-regions.txt"
   }

Steps Configuration
-------------------

The ``steps.json`` file controls which pipeline steps to execute:

.. code-block:: json

   {
     "ancestry": true,
     "sample": true,
     "variant": true,
     "umap": true,
     "fst": true
   }

**Step Dependencies:**

- ``sample`` → ``ancestry`` → ``variant`` → ``dim reduction`` → ``fst``
- You can skip steps, but maintain dependencies
- Results from previous steps are required for subsequent steps

Advanced Configuration
----------------------

Custom LD Regions
^^^^^^^^^^^^^^^^^^

Provide your own high-LD regions file:

.. code-block:: text

   # high-LD-regions.txt format
   1 48000000 52000000   # Chromosome, start, end
   2 85000000 100000000
   6 25000000 35000000

Performance Tuning
-------------------

**Memory Optimization:**

- Increase ``ind_pair`` window size for large datasets
- Reduce ``pca`` components if memory is limited
- Process chromosomes separately for very large datasets

**Speed Optimization:**

- Use SSD storage for temporary files
- Increase available CPU cores
- Consider splitting large datasets

**Disk Space Management:**

- Monitor intermediate file sizes
- Clean up temporary files regularly
- Use compression for archival storage

Best Practices
--------------

1. **Version Control**: Keep configuration files under version control
2. **Documentation**: Document parameter choices and rationale
3. **Validation**: Always validate results visually
4. **Backup**: Keep copies of successful configurations
5. **Testing**: Test parameter changes on small datasets first

Troubleshooting
---------------

**Common Configuration Issues:**

- **Path not found**: Check absolute paths and permissions
- **Parameter out of range**: Verify threshold values are reasonable
- **JSON syntax errors**: Validate JSON format
- **Memory errors**: Reduce dataset size or adjust parameters

See the :doc:`troubleshooting` guide for more detailed solutions.
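
The path and threshold checks listed above can be sketched as a small pre-flight script. This is a minimal illustration and not part of IDEAL-GENOM: ``preflight`` and ``RANGES`` are hypothetical names, the ranges are copied from the parameter descriptions in this guide, and the configuration is assumed to be already loaded into a Python dictionary (for example with a YAML parser of your choice).

.. code-block:: python

   from pathlib import Path

   # Allowed ranges, taken from the parameter descriptions in this guide.
   RANGES = {
       "mind": (0.0, 1.0),
       "geno": (0.0, 1.0),
       "maf": (0.0, 0.5),
       "hwe": (0.0, 1.0),
   }


   def preflight(config):
       """Return a list of human-readable problems found in a loaded config."""
       problems = []
       for step in config.get("pipeline", {}).get("steps", []):
           if not step.get("enabled", False):
               continue  # disabled steps are not checked
           name = step.get("name", "<unnamed>")
           init = step.get("init_params") or {}
           input_path = init.get("input_path", "")
           # Paths containing ${...} are resolved at run time, so skip them here.
           if input_path and "${" not in input_path and not Path(input_path).is_dir():
               problems.append(f"{name}: input_path not found: {input_path}")
           for key, value in (step.get("execute_params") or {}).items():
               if key in RANGES and isinstance(value, (int, float)):
                   low, high = RANGES[key]
                   if not low <= value <= high:
                       problems.append(f"{name}: {key}={value} outside [{low}, {high}]")
       return problems


   # Hypothetical config with a missing directory and an out-of-range MAF.
   example = {
       "pipeline": {
           "steps": [
               {
                   "name": "sample_qc",
                   "enabled": True,
                   "init_params": {"input_path": "/no/such/dir"},
                   "execute_params": {"mind": 0.02, "maf": 0.7},
               }
           ]
       }
   }

   for problem in preflight(example):
       print(problem)

Run on the example dictionary, the sketch reports the missing input directory and the out-of-range ``maf`` value. It complements ``ideal-genom validate`` rather than replacing it.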