Configuration Guide

This guide explains the YAML-based configuration system in IDEAL-GENOM v0.2.0. The configuration file controls every aspect of a genomic analysis pipeline, from data paths to QC thresholds.

Overview

IDEAL-GENOM uses a single YAML configuration file that defines:

Pipeline metadata (name, output directory)
Analysis steps to execute (QC, GWAS, VCF processing)
Parameters for each step (thresholds, options)
Global settings (logging, resources, file handling)

Benefits of YAML Configuration:

Single Source of Truth: All settings in one file
Hierarchical Structure: Clear organization of related parameters
Variable Substitution: Reference values dynamically (e.g., ${base_output_dir})
Step Control: Enable/disable steps without editing code
Self-Documenting: Comments explain parameters inline

Configuration File Structure

A configuration file has two top-level sections, pipeline and settings:

pipeline:
  # Pipeline metadata and steps
  name: "my_analysis"
  base_output_dir: "/path/to/output"
  steps:
    - name: "step_name"
      # Step configuration...

settings:
  # Global settings
  logging: { ... }
  resources: { ... }
  files: { ... }

Getting Started with Configuration

1. Start from a Template

Copy a template from the repository:

cp yaml_configs/qc_pipeline_config_template.yaml my_config.yaml

2. Edit Required Fields

At minimum, update these paths:

pipeline:
  base_output_dir: "/your/output/path"  # Where results will be saved
  steps:
    - name: "sample_qc"
      init_params:
        input_path: "/your/input/path"  # Where your PLINK files are
        input_name: "your_dataset"      # PLINK file prefix (without .bed/.bim/.fam)

3. Validate Your Configuration

ideal-genom validate --config my_config.yaml

Pipeline Section

The pipeline section defines your analysis workflow.

Pipeline Metadata

pipeline:
  name: "my_study_qc"           # Descriptive name for this analysis
  base_output_dir: "/data/output"  # Root directory for all outputs

name (string, required): A descriptive identifier for your pipeline. Used in logging and output organization.
base_output_dir (string, required): Absolute path where all pipeline outputs will be saved. Each step creates subdirectories here.

Pipeline Steps

Steps are executed in the order listed:

pipeline:
  steps:
    - name: "sample_qc"
      enabled: true
      module: "ideal_genom.qc.sample_qc"
      class: "SampleQC"
      init_params:
        # Parameters passed to class __init__
      execute_params:
        # Parameters passed to execute() method

name (string, required): Unique identifier for this step. Used for variable substitution and logging.
enabled (boolean, required): Set to true to run this step, false to skip it.
module (string, required): Python module path containing the step’s class.
class (string, required): Class name to instantiate for this step.
init_params (mapping, required): Parameters passed to the class constructor (__init__).
execute_params (mapping, optional): Parameters passed to the execute() method when running the step.
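The fields above suggest that each step is loaded by module path and class name, constructed with init_params, and then run through execute(). How IDEAL-GENOM does this internally is not shown in this guide, so the following is only a minimal Python sketch of that general pattern. The module demo_steps, the class EchoStep, and the function run_step are hypothetical stand-ins so the example runs without IDEAL-GENOM installed.

```python
import importlib
import sys
import types


class EchoStep:
    """Hypothetical stand-in for a step class such as SampleQC."""

    def __init__(self, input_name):
        self.input_name = input_name

    def execute(self, mind=0.02):
        return f"{self.input_name}: mind={mind}"


# Register the stand-in under a fake module name so importlib can resolve it.
demo_module = types.ModuleType("demo_steps")
demo_module.EchoStep = EchoStep
sys.modules["demo_steps"] = demo_module


def run_step(step):
    """Instantiate and run one step entry; return None if it is disabled."""
    if not step.get("enabled", False):
        return None
    module = importlib.import_module(step["module"])
    step_class = getattr(module, step["class"])
    instance = step_class(**step["init_params"])
    # execute_params is optional, so fall back to an empty mapping.
    return instance.execute(**step.get("execute_params", {}))


step = {
    "name": "sample_qc",
    "enabled": True,
    "module": "demo_steps",
    "class": "EchoStep",
    "init_params": {"input_name": "mydata"},
    "execute_params": {"mind": 0.05},
}
result = run_step(step)
print(result)
```

Under this pattern, a typo in module or class surfaces as an ImportError or AttributeError when the step is loaded, which is one reason to validate the configuration before a long run.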

Variable Substitution

Reference values from elsewhere in the configuration:

pipeline:
  base_output_dir: "/data/output"
  steps:
    - name: "sample_qc"
      init_params:
        output_path: "${base_output_dir}"  # Expands to /data/output

    - name: "variant_qc"
      init_params:
        # Use output from previous step
        input_path: "${steps.sample_qc.clean_dir}"
        input_name: "${steps.sample_qc.output_name}"

Available substitutions:

${base_output_dir} - Pipeline’s base output directory
${steps.STEP_NAME.ATTRIBUTE} - Attributes from previous steps
- .clean_dir - Path to clean output files
- .output_name - Output file prefix
- .output_path - Output directory path
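To make the substitution rules concrete, here is a minimal Python sketch of a resolver for the two placeholder forms listed above. It is not IDEAL-GENOM's implementation. The function name substitute, the step_attrs mapping, and the example paths are assumptions made for illustration.

```python
import re

# Matches ${...} and captures the text between the braces.
PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def substitute(value, base_output_dir, step_attrs):
    """Expand ${base_output_dir} and ${steps.NAME.ATTR} inside a string."""

    def replace(match):
        key = match.group(1)
        if key == "base_output_dir":
            return base_output_dir
        parts = key.split(".")
        if len(parts) == 3 and parts[0] == "steps":
            _, step_name, attribute = parts
            if step_name in step_attrs and attribute in step_attrs[step_name]:
                return step_attrs[step_name][attribute]
        raise KeyError(f"unknown reference: ${{{key}}}")

    return PLACEHOLDER.sub(replace, value)


# Illustrative attribute values; real ones are produced by the pipeline.
step_attrs = {
    "sample_qc": {
        "clean_dir": "/data/output/sample_qc/clean_files",
        "output_name": "mydata_sampleQCed",
    }
}

out_dir = substitute("${base_output_dir}/qc_results", "/data/project", step_attrs)
in_name = substitute("${steps.sample_qc.output_name}", "/data/project", step_attrs)
print(out_dir)
print(in_name)
```

Raising on an unknown reference, rather than leaving the placeholder in the string, is a design choice of this sketch: it turns a misspelled step name into an immediate error instead of a confusing missing-file failure later.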

QC Pipeline Configuration

Sample QC Step

Performs individual-level quality control:

- name: "sample_qc"
  enabled: true
  module: "ideal_genom.qc.sample_qc"
  class: "SampleQC"
  init_params:
    input_path: "/data/input"           # Directory containing PLINK files
    input_name: "mydata"                # PLINK file prefix
    output_path: "${base_output_dir}"   # Output directory
    output_name: "mydata_sampleQCed"    # Output file prefix
    high_ld_regions_file: "auto"        # LD regions file (or "auto" for built-in)
    build: "38"                         # Genome build: "37" or "38"
  execute_params:
    rename_snp: true                    # Rename SNPs to chr:pos format
    hh_to_missing: true                 # Convert heterozygous haploid calls to missing
    use_kinship: true                   # Use kinship instead of IBD
    ind_pair: [50, 5, 0.2]              # LD pruning [window, step, r²]
    mind: 0.02                          # Max missing rate per individual
    sex_check: [0.2, 0.8]               # F coefficient [female_max, male_min]
    maf: 0.01                           # Minor allele frequency threshold
    het_deviation: 3                    # Heterozygosity SD threshold
    kinship: 0.354                      # Kinship coefficient threshold
    ibd_threshold: 0.185                # IBD threshold for duplicates

init_params:

input_path (string): Directory containing input .bed/.bim/.fam files
input_name (string): PLINK file prefix (e.g., “mydata” for mydata.bed)
output_path (string): Where to save QC results
output_name (string): Prefix for output files
high_ld_regions_file (string): Path to high-LD regions file, or “auto” to use built-in
build (string): Genome build version - “37” (GRCh37/hg19) or “38” (GRCh38/hg38)

execute_params:

rename_snp (bool): Rename SNPs to chr:pos format for consistency
hh_to_missing (bool): Convert heterozygous haploid calls to missing
use_kinship (bool): Use KING kinship estimation (recommended over IBD)
ind_pair (list): LD pruning parameters [window_size, step_size, r²_threshold]
- window_size: SNP window in variant count (default: 50)
- step_size: Step size in variant count (default: 5)
- r² threshold: Correlation threshold (default: 0.2)
mind (float, 0-1): Maximum missing genotype rate per individual (default: 0.02 = 2%)
sex_check (list[float]): F coefficient thresholds [female_max, male_min]
- female_max: Maximum F for females (default: 0.2)
- male_min: Minimum F for males (default: 0.8)
- Females with F above female_max and males with F below male_min fail the sex check
maf (float, 0-0.5): Minor allele frequency threshold for LD pruning
het_deviation (float): Standard deviations from mean heterozygosity (default: 3)
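kinship (float): Kinship coefficient threshold for relatedness. On the KING scale, these values mark the boundaries between relationship classes:
- 0.354: boundary between duplicates/monozygotic twins (above) and 1st degree relatives (below)
- 0.177: boundary between 1st and 2nd degree relatives
- 0.088: boundary between 2nd and 3rd degree relatives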
ibd_threshold (float): IBD threshold for identifying duplicates/monozygotic twins
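The sex_check thresholds can be read as a simple per-sample rule. The sketch below illustrates that rule in Python under stated assumptions: the function name sex_check_passes and the "female"/"male" labels are hypothetical, and treating a value exactly on a threshold as passing is a choice made here, not documented behavior.

```python
def sex_check_passes(f_coefficient, reported_sex, sex_check=(0.2, 0.8)):
    """Return True when the X-chromosome F coefficient matches the reported sex."""
    female_max, male_min = sex_check
    if reported_sex == "female":
        return f_coefficient <= female_max
    if reported_sex == "male":
        return f_coefficient >= male_min
    raise ValueError(f"unsupported reported sex: {reported_sex!r}")


samples = [
    ("S1", 0.05, "female"),  # low F, consistent with female
    ("S2", 0.50, "female"),  # ambiguous F, fails
    ("S3", 0.95, "male"),    # high F, consistent with male
    ("S4", 0.50, "male"),    # ambiguous F, fails
]
failed = [sid for sid, f, sex in samples if not sex_check_passes(f, sex)]
print(failed)
```

Widening the gap between female_max and male_min flags more samples as ambiguous, while narrowing it lets borderline samples through.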

Ancestry QC Step

Detects population structure and removes ancestry outliers:

- name: "ancestry_qc"
  enabled: true
  module: "ideal_genom.qc.ancestry_qc"
  class: "AncestryQC"
  init_params:
    input_path: "${steps.sample_qc.clean_dir}"
    input_name: "${steps.sample_qc.output_name}"
    output_path: "${base_output_dir}"
    output_name: "mydata_ancestryQCed"
    high_ld_regions_file: "auto"
    build: "38"
  execute_params:
    ind_pair: [50, 5, 0.2]        # LD pruning for PCA
    pca: 10                       # Number of principal components
    maf: 0.01                     # MAF threshold for PCA
    ref_threshold: 4              # SD threshold for reference outliers
    stu_threshold: 4              # SD threshold for study outliers
    reference_pop: "EUR"          # Expected population
    num_pcs: 10                   # PCs for ancestry assignment
    distance_metric: "infinity"   # Distance metric for outlier detection

execute_params:

ind_pair (list[int]): LD pruning parameters for PCA variants
pca (int): Number of principal components to compute
maf (float): MAF threshold for variants included in PCA
ref_threshold (float): Standard deviations for reference population outliers
stu_threshold (float): Standard deviations for study population outliers
reference_pop (string): Expected population ancestry
- “EUR”: European
- “AFR”: African
- “AMR”: Admixed American
- “EAS”: East Asian
- “SAS”: South Asian
num_pcs (int): Number of PCs used for ancestry classification
distance_metric (string): “euclidean”, “manhattan”, or “infinity” (Chebyshev)
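The three distance_metric options differ in how they combine per-component differences in PC space. The following Python sketch only shows the distances themselves. How the pipeline combines them with ref_threshold and stu_threshold is not described here, and the function name distance is an assumption.

```python
import math


def distance(point, center, metric="infinity"):
    """Distance between two points in PC space under the chosen metric."""
    diffs = [abs(p - c) for p, c in zip(point, center)]
    if metric == "euclidean":
        return math.sqrt(sum(d * d for d in diffs))
    if metric == "manhattan":
        return sum(diffs)
    if metric == "infinity":
        # Chebyshev distance: the largest difference along any single PC.
        return max(diffs)
    raise ValueError(f"unknown metric: {metric!r}")


sample_pcs = (3.0, -4.0)
reference_center = (0.0, 0.0)
distances = {
    name: distance(sample_pcs, reference_center, name)
    for name in ("euclidean", "manhattan", "infinity")
}
print(distances)
```

With the "infinity" metric a sample is judged by its single most extreme component, so adding more PCs never shrinks a distance that is already large on one axis.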

Variant QC Step

Performs variant-level quality control:

- name: "variant_qc"
  enabled: true
  module: "ideal_genom.qc.variant_qc"
  class: "VariantQC"
  init_params:
    input_path: "${steps.ancestry_qc.clean_dir}"
    input_name: "${steps.ancestry_qc.output_name}"
    output_path: "${base_output_dir}"
    output_name: "mydata_variantQCed"
  execute_params:
    miss_data_rate: 0.02          # Max missing rate across samples
    diff_genotype_rate: 1.0e-5    # Differential missingness p-value
    geno: 0.02                    # Max missing rate per variant
    maf: 0.01                     # Minor allele frequency
    hwe: 1.0e-6                   # Hardy-Weinberg equilibrium p-value
    chr_y: 24                     # Y chromosome identifier

execute_params:

miss_data_rate (float, 0-1): Maximum overall missing data rate threshold
diff_genotype_rate (float): P-value threshold for differential missingness between cases/controls
geno (float, 0-1): Maximum missing genotype rate per variant
maf (float, 0-0.5): Minor allele frequency threshold
- Standard GWAS: 0.01-0.05
- Rare variant analysis: 0.001-0.01
- Very lenient: 0.001 (retains most rare variants)
hwe (float, 0-1): Hardy-Weinberg equilibrium p-value threshold
- Standard: 1e-6
- Strict: 1e-10 (for genotyping array data)
- Relaxed: 1e-4
chr_y (int): Numeric Y chromosome identifier. PLINK encodes chromosome Y as 24 (23 is chromosome X), independent of genome build.
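As a rough illustration of how the geno, maf, and hwe thresholds interact, the Python sketch below applies them to per-variant summary statistics. It assumes PLINK-style semantics (remove a variant when its missing rate exceeds geno, its MAF is below maf, or its HWE p-value is below hwe). The dictionary keys and the function name variant_passes are hypothetical.

```python
def variant_passes(stats, geno=0.02, maf=0.01, hwe=1.0e-6):
    """Return True when a variant survives all three filters."""
    return (
        stats["missing_rate"] <= geno
        and stats["maf"] >= maf
        and stats["hwe_p"] >= hwe
    )


variants = {
    "1:1000": {"missing_rate": 0.001, "maf": 0.20, "hwe_p": 0.50},
    "1:2000": {"missing_rate": 0.050, "maf": 0.20, "hwe_p": 0.50},  # too much missingness
    "1:3000": {"missing_rate": 0.001, "maf": 0.005, "hwe_p": 0.50},  # too rare
    "1:4000": {"missing_rate": 0.001, "maf": 0.20, "hwe_p": 1e-8},  # HWE deviation
}
kept = [vid for vid, stats in variants.items() if variant_passes(stats)]
print(kept)
```

Under these semantics a smaller hwe value removes fewer variants: with hwe set to 1e-10, the variant 1:4000 above would be retained.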

Population Analysis Step

Performs dimensionality reduction and population visualization:

- name: "dimensionality_reduction"
  enabled: true
  module: "ideal_genom.population.projection"
  class: "DimensionalityReductionPipeline"
  init_params:
    input_path: "${steps.variant_qc.clean_dir}"
    input_name: "${steps.variant_qc.output_name}"
    output_path: "${base_output_dir}"
    build: "38"
    high_ld_regions_file: "auto"
    generate_plot: true
  execute_params:
    # PCA parameters
    pca_params:
      pca: 10
    force_pca_recompute: false

    # UMAP parameters
    run_umap: true
    umap_params:
      n_neighbors: 15
      min_dist: 0.1
      n_components: 2

    # t-SNE parameters
    run_tsne: true
    tsne_params:
      perplexity: 30

    # Plotting options
    case_control_markers: true
    plot_format: "png"
    dpi: 600

execute_params:

pca_params (mapping): PCA configuration
- pca (int): Number of components to compute
force_pca_recompute (bool): Recompute PCA even if results exist
run_umap (bool): Enable UMAP analysis
umap_params (mapping): UMAP configuration
- n_neighbors (int): Number of neighbors (5-50, default: 15)
- min_dist (float): Minimum distance (0.0-1.0, default: 0.1)
- n_components (int): Output dimensions (typically 2 or 3)
run_tsne (bool): Enable t-SNE analysis
tsne_params (mapping): t-SNE configuration
- perplexity (int): Perplexity value (5-50, default: 30)
case_control_markers (bool): Color by case/control status
plot_format (string): “png”, “svg”, or “pdf”
dpi (int): Plot resolution (default: 600)

Settings Section

Global settings that apply to the entire pipeline:

Logging Settings

settings:
  logging:
    level: "INFO"              # Logging verbosity
    file_logging: true         # Write to log file
    console_logging: true      # Print to console

level (string): Log message detail level

“DEBUG”: Very detailed, for troubleshooting
“INFO”: Standard informational messages (recommended)
“WARNING”: Only warnings and errors
“ERROR”: Only errors

file_logging (bool): Save logs to pipeline.log in output directory

console_logging (bool): Print log messages to terminal

Resource Settings

settings:
  resources:
    max_memory: null           # Maximum memory in MB
    max_threads: null          # Maximum CPU threads

max_memory (int or null): Maximum memory allocation in MB

null: Auto-detect (uses 2/3 of available RAM)
Explicit value: Set specific limit (e.g., 32000 for 32GB)

max_threads (int or null): Maximum CPU threads to use

null: Auto-detect (uses the number of available cores minus 2)
Explicit value: Set specific number
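The auto-detect rules for both resource settings amount to a small calculation. The Python sketch below mirrors the rules as described above. The function names are hypothetical, and the floor of one thread on small machines is an assumption added here, not documented behavior.

```python
import os


def resolve_threads(max_threads=None, cpu_count=None):
    """Explicit value wins; otherwise use available cores minus 2."""
    if max_threads is not None:
        return max_threads
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    # Assumption: never drop below one thread on machines with few cores.
    return max(1, cpu_count - 2)


def resolve_memory_mb(max_memory=None, available_mb=0):
    """Explicit value wins; otherwise use two thirds of available RAM (in MB)."""
    if max_memory is not None:
        return max_memory
    return available_mb * 2 // 3


threads = resolve_threads(None, cpu_count=8)
memory = resolve_memory_mb(None, available_mb=48000)
print(threads, memory)
```

On shared servers or job schedulers, setting both values explicitly is usually safer than auto-detection, because the detected totals may exceed what the job was actually allocated.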

File Management Settings

settings:
  files:
    keep_intermediate: true    # Preserve temporary files
    compress_outputs: false    # Compress output files
    overwrite_existing: false  # Overwrite existing results

keep_intermediate (bool): Keep temporary intermediate files

true: Keep all files (useful for debugging)
false: Clean up after each step (saves disk space)

compress_outputs (bool): Compress output files with gzip

overwrite_existing (bool): Overwrite existing output files

true: Overwrite without asking
false: Fail if outputs exist (safer)

Report Generation Settings

settings:
  reports:
    generate_reports: true     # Generate visualization reports
    plot_format: "png"         # Plot file format

generate_reports (bool): Automatically generate QC plots and reports

plot_format (string): Output format for plots

“png”: Standard format, good quality
“svg”: Vector format, scalable
“pdf”: Publication-ready format

Advanced Configuration Patterns

Conditional Step Execution

Skip steps based on your needs:

pipeline:
  steps:
    - name: "sample_qc"
      enabled: true
    - name: "ancestry_qc"
      enabled: false  # Skip for homogeneous population
    - name: "variant_qc"
      enabled: true
      init_params:
        # Connect directly to sample QC
        input_path: "${steps.sample_qc.clean_dir}"

Using Pre-existing Results

Resume the pipeline from an intermediate step:

pipeline:
  steps:
    - name: "sample_qc"
      enabled: false  # Already completed
    - name: "variant_qc"
      enabled: true
      init_params:
        # Use existing sample QC output
        input_path: "/data/output/my_study/sample_qc/clean_files"
        input_name: "mydata_sampleQCed"

Multiple Output Directories

Organize outputs by analysis type:

pipeline:
  base_output_dir: "/data/project"
  steps:
    - name: "sample_qc"
      init_params:
        output_path: "${base_output_dir}/qc_results"
    - name: "gwas_prep"
      init_params:
        output_path: "${base_output_dir}/gwas_analysis"

Parameter Tuning Guidelines

Sample QC Thresholds

For Standard Case-Control GWAS:

mind: 0.02 (2% missing)
maf: 0.01 (1% MAF)
het_deviation: 3 SD
kinship: 0.354 (exclude 1st degree relatives)

For Rare Variant Analysis:

mind: 0.01 (stricter)
maf: 0.001 (include rare variants)
het_deviation: 4 SD (more lenient)

For Family-Based Studies:

kinship: 0.088 (allow up to 3rd degree relatives)
Adjust sex_check if samples include children

Ancestry QC Thresholds

For Homogeneous Populations:

ref_threshold: 6 SD (more lenient)
stu_threshold: 6 SD (more lenient)
Consider disabling ancestry QC entirely

Variant QC Thresholds

For Array-Based Data:

geno: 0.02 (2% missing)
hwe: 1e-10 (very strict)
maf: 0.01

For Sequencing Data:

geno: 0.05 (more lenient)
hwe: 1e-6 (standard)
maf: 0.001 (include rare variants)

Common Configuration Examples

Minimal QC Pipeline

pipeline:
  name: "minimal_qc"
  base_output_dir: "/data/output"
  steps:
    - name: "sample_qc"
      enabled: true
      module: "ideal_genom.qc.sample_qc"
      class: "SampleQC"
      init_params:
        input_path: "/data/input"
        input_name: "mydata"
        output_path: "${base_output_dir}"
        output_name: "mydata_clean"
        high_ld_regions_file: "auto"
        build: "38"
      execute_params:
        mind: 0.02
        maf: 0.01

settings:
  logging:
    level: "INFO"

Complete QC with Ancestry

pipeline:
  name: "full_qc"
  base_output_dir: "/data/output"
  steps:
    - name: "sample_qc"
      enabled: true
      module: "ideal_genom.qc.sample_qc"
      class: "SampleQC"
      init_params:
        input_path: "/data/input"
        input_name: "mydata"
        output_path: "${base_output_dir}"
        output_name: "mydata_sampleQCed"
        high_ld_regions_file: "auto"
        build: "38"
      execute_params:
        mind: 0.02
        sex_check: [0.2, 0.8]
        maf: 0.01
        het_deviation: 3
        kinship: 0.354

    - name: "ancestry_qc"
      enabled: true
      module: "ideal_genom.qc.ancestry_qc"
      class: "AncestryQC"
      init_params:
        input_path: "${steps.sample_qc.clean_dir}"
        input_name: "${steps.sample_qc.output_name}"
        output_path: "${base_output_dir}"
        output_name: "mydata_ancestryQCed"
        high_ld_regions_file: "auto"
        build: "38"
      execute_params:
        pca: 10
        ref_threshold: 4
        stu_threshold: 4
        reference_pop: "EUR"

    - name: "variant_qc"
      enabled: true
      module: "ideal_genom.qc.variant_qc"
      class: "VariantQC"
      init_params:
        input_path: "${steps.ancestry_qc.clean_dir}"
        input_name: "${steps.ancestry_qc.output_name}"
        output_path: "${base_output_dir}"
        output_name: "mydata_final"
      execute_params:
        geno: 0.02
        maf: 0.01
        hwe: 1.0e-6

Troubleshooting Configuration

Configuration validation fails:

Check YAML syntax (indentation, colons, quotes)
Verify all required fields are present
Ensure paths exist and are accessible
Check module and class names are correct
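When validation fails, a quick check of the required fields can narrow down the cause. The Python sketch below is not the implementation of ideal-genom validate, which remains the authoritative check. It assumes the YAML file has already been parsed into a dictionary (for example with PyYAML's yaml.safe_load) and uses the fields marked "required" in this guide. The function name find_missing_fields is hypothetical.

```python
REQUIRED_PIPELINE_KEYS = ("name", "base_output_dir")
REQUIRED_STEP_KEYS = ("name", "enabled", "module", "class", "init_params")


def find_missing_fields(config):
    """Return a list of dotted paths for required fields that are absent."""
    problems = []
    pipeline = config.get("pipeline", {})
    for key in REQUIRED_PIPELINE_KEYS:
        if key not in pipeline:
            problems.append(f"pipeline.{key}")
    for index, step in enumerate(pipeline.get("steps", [])):
        # Fall back to the list position when the step has no name.
        label = step.get("name", f"#{index}")
        for key in REQUIRED_STEP_KEYS:
            if key not in step:
                problems.append(f"steps[{label}].{key}")
    return problems


config = {
    "pipeline": {
        "name": "minimal_qc",
        "steps": [
            {
                "name": "sample_qc",
                "enabled": True,
                "module": "ideal_genom.qc.sample_qc",
                "init_params": {"input_path": "/data/input"},
            }
        ],
    }
}
missing = find_missing_fields(config)
print(missing)
```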

Pipeline runs but produces no output:

Verify enabled: true for desired steps
Check input file paths are correct
Review pipeline.log for errors
Ensure output directory is writable

Memory errors:

Set max_memory explicitly
Reduce max_threads to free memory
Process datasets in batches
Set keep_intermediate: false to save disk space (this frees storage, not memory)

Variable substitution not working:

Ensure correct syntax: ${variable_name}
Check referenced step names match exactly
Verify step order (can’t reference future steps)
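The step-order rule can be checked mechanically before running anything. The following Python sketch scans init_params for ${steps.NAME.ATTRIBUTE} references and reports any that point at a step not yet listed. It assumes the steps list has already been parsed from YAML, and the function name find_forward_references is hypothetical.

```python
import re

# Captures the step name from ${steps.NAME.ATTRIBUTE}.
STEP_REFERENCE = re.compile(r"\$\{steps\.([^.}]+)\.[^}]+\}")


def find_forward_references(steps):
    """Return (step, referenced_step) pairs where the reference is not yet defined."""
    seen = set()
    problems = []
    for step in steps:
        for value in step.get("init_params", {}).values():
            if not isinstance(value, str):
                continue
            for referenced in STEP_REFERENCE.findall(value):
                if referenced not in seen:
                    problems.append((step["name"], referenced))
        seen.add(step["name"])
    return problems


wrong_order = [
    {"name": "variant_qc", "init_params": {"input_path": "${steps.sample_qc.clean_dir}"}},
    {"name": "sample_qc", "init_params": {"input_path": "/data/input"}},
]
right_order = list(reversed(wrong_order))

print(find_forward_references(wrong_order))
print(find_forward_references(right_order))
```

The same check also catches misspelled step names, since a reference to a name that never appears is reported in the same way as a reference to a later step.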

Steps Configuration

The steps.json file controls which pipeline steps to execute:

{
    "ancestry": true,
    "sample": true,
    "variant": true,
    "umap": true,
    "fst": true
}

Step Dependencies:

sample → ancestry → variant → dim reduction → fst
You can skip steps, but maintain dependencies
Results from previous steps are required for subsequent steps

Advanced Configuration

Custom LD Regions

Provide your own high-LD regions file:

# high-LD-regions.txt format
 48000000    52000000    # Chromosome, start, end
 85000000    100000000
 25000000    35000000

Performance Tuning

Memory Optimization:

Increase ind_pair window size for large datasets
Reduce pca components if memory is limited
Process chromosomes separately for very large datasets

Speed Optimization:

Use SSD storage for temporary files
Increase available CPU cores
Consider splitting large datasets

Disk Space Management:

Monitor intermediate file sizes
Clean up temporary files regularly
Use compression for archival storage

Best Practices

Version Control: Keep configuration files under version control
Documentation: Document parameter choices and rationale
Validation: Always validate results visually
Backup: Keep copies of successful configurations
Testing: Test parameter changes on small datasets first

Troubleshooting

Common Configuration Issues:

Path not found: Check absolute paths and permissions
Parameter out of range: Verify threshold values are reasonable
JSON syntax errors: Validate JSON format
Memory errors: Reduce dataset size or adjust parameters

See the Troubleshooting Guide for more detailed solutions.
