Configuration Guide
This comprehensive guide explains the YAML-based configuration system in IDEAL-GENOM v0.2.0. The configuration file controls all aspects of your genomic analysis pipeline, from data paths to QC thresholds.
Overview
IDEAL-GENOM uses a single YAML configuration file that defines:
Pipeline metadata (name, output directory)
Analysis steps to execute (QC, GWAS, VCF processing)
Parameters for each step (thresholds, options)
Global settings (logging, resources, file handling)
Benefits of YAML Configuration:
Single Source of Truth: All settings in one file
Hierarchical Structure: Clear organization of related parameters
Variable Substitution: Reference values dynamically (e.g., ${base_output_dir})
Step Control: Enable/disable steps without editing code
Self-Documenting: Comments explain parameters inline
Configuration File Structure
A configuration file has two top-level sections, pipeline and settings; the pipeline section holds both the metadata and the list of steps:
pipeline:
  # Pipeline metadata and steps
  name: "my_analysis"
  base_output_dir: "/path/to/output"
  steps:
    - name: "step_name"
      # Step configuration...

settings:
  # Global settings
  logging: { ... }
  resources: { ... }
  files: { ... }
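As a rough illustration of this layout, the Python sketch below checks an already-parsed configuration (a plain dict, as a YAML loader would produce) for the expected top-level keys. The helper name check_structure and its messages are hypothetical and not part of IDEAL-GENOM, and treating settings as optional is an assumption of the sketch.

```python
from typing import Any


def check_structure(config: dict[str, Any]) -> list[str]:
    """Return a list of layout problems (empty when the layout looks right)."""
    pipeline = config.get("pipeline")
    if not isinstance(pipeline, dict):
        return ["missing 'pipeline' section"]

    problems = []
    for key in ("name", "base_output_dir"):
        if not isinstance(pipeline.get(key), str):
            problems.append(f"pipeline.{key} must be a string")

    steps = pipeline.get("steps")
    if not isinstance(steps, list) or not steps:
        problems.append("pipeline.steps must be a non-empty list")

    # Assumption of this sketch: 'settings' may be omitted entirely.
    if "settings" in config and not isinstance(config["settings"], dict):
        problems.append("'settings' must be a mapping")
    return problems


example = {
    "pipeline": {
        "name": "my_analysis",
        "base_output_dir": "/path/to/output",
        "steps": [{"name": "step_name"}],
    },
    "settings": {"logging": {}, "resources": {}, "files": {}},
}
print(check_structure(example))
```

An empty list means the sketch found nothing wrong; it does not replace the validation command described in the next section.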
Getting Started with Configuration
1. Start from a Template
Copy a template from the repository:
cp yaml_configs/qc_pipeline_config_template.yaml my_config.yaml
2. Edit Required Fields
At minimum, update these paths:
pipeline:
  base_output_dir: "/your/output/path"  # Where results will be saved
  steps:
    - name: "sample_qc"
      init_params:
        input_path: "/your/input/path"  # Where your PLINK files are
        input_name: "your_dataset"      # PLINK file prefix (without .bed/.bim/.fam)
3. Validate Your Configuration
ideal-genom validate --config my_config.yaml
Pipeline Section
The pipeline section defines your analysis workflow.
Pipeline Metadata
pipeline:
  name: "my_study_qc"              # Descriptive name for this analysis
  base_output_dir: "/data/output"  # Root directory for all outputs
- name (string, required)
A descriptive identifier for your pipeline. Used in logging and output organization.
- base_output_dir (string, required)
Absolute path where all pipeline outputs will be saved. Each step creates subdirectories here.
Pipeline Steps
Steps are executed in the order listed:
pipeline:
  steps:
    - name: "sample_qc"
      enabled: true
      module: "ideal_genom.qc.sample_qc"
      class: "SampleQC"
      init_params:
        # Parameters passed to class __init__
      execute_params:
        # Parameters passed to execute() method
- name (string, required)
Unique identifier for this step. Used for variable substitution and logging.
- enabled (boolean, required)
Set to true to run this step, false to skip it.
- module (string, required)
Python module path containing the step’s class.
- class (string, required)
Class name to instantiate for this step.
- init_params (mapping, required)
Parameters passed to the class constructor (__init__).
- execute_params (mapping, optional)
Parameters passed to the execute() method when running the step.
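The fields above suggest a load-then-call pattern: import the module, look up the class, construct it with init_params, and call execute() with execute_params. The sketch below shows one way this could work. The run_step helper and the stand-in demo_steps module are hypothetical, and the real pipeline runner may behave differently.

```python
import importlib
import sys
import types


def run_step(step: dict):
    """Import step['module'], instantiate step['class'], then call execute()."""
    if not step.get("enabled", False):
        return None  # disabled steps are skipped
    module = importlib.import_module(step["module"])
    cls = getattr(module, step["class"])
    instance = cls(**step.get("init_params", {}))
    return instance.execute(**(step.get("execute_params") or {}))


# Stand-in module and class so the sketch runs without IDEAL-GENOM installed.
class DemoStep:
    def __init__(self, input_name: str):
        self.input_name = input_name

    def execute(self, mind: float = 0.02):
        return f"{self.input_name}: mind={mind}"


demo_module = types.ModuleType("demo_steps")
demo_module.DemoStep = DemoStep
sys.modules["demo_steps"] = demo_module

step = {
    "name": "demo",
    "enabled": True,
    "module": "demo_steps",
    "class": "DemoStep",
    "init_params": {"input_name": "mydata"},
    "execute_params": {"mind": 0.05},
}
print(run_step(step))  # mydata: mind=0.05
```

Under this pattern, a misspelled module or class name surfaces as an ImportError or AttributeError, which is why the names must match exactly.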
Variable Substitution
Reference values from elsewhere in the configuration:
pipeline:
  base_output_dir: "/data/output"
  steps:
    - name: "sample_qc"
      init_params:
        output_path: "${base_output_dir}"  # Expands to /data/output
    - name: "variant_qc"
      init_params:
        # Use output from previous step
        input_path: "${steps.sample_qc.clean_dir}"
        input_name: "${steps.sample_qc.output_name}"
Available substitutions:
${base_output_dir} - Pipeline’s base output directory
${steps.STEP_NAME.ATTRIBUTE} - Attributes from previous steps:
- .clean_dir - Path to clean output files
- .output_name - Output file prefix
- .output_path - Output directory path
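To make the substitution rules concrete, here is a minimal Python sketch of how ${...} placeholders could be expanded. It is not the IDEAL-GENOM implementation: the substitute function, the regular expression, and the example paths are assumptions made for illustration.

```python
import re

PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def substitute(value: str, base_output_dir: str, step_attrs: dict) -> str:
    """Expand ${base_output_dir} and ${steps.NAME.ATTR} inside a string."""

    def lookup(match: re.Match) -> str:
        key = match.group(1)
        if key == "base_output_dir":
            return base_output_dir
        parts = key.split(".")
        if len(parts) == 3 and parts[0] == "steps":
            _, step_name, attribute = parts
            return step_attrs[step_name][attribute]
        raise KeyError(f"unknown placeholder: {key}")

    return PLACEHOLDER.sub(lookup, value)


# Values a completed sample_qc step might expose (illustrative paths only).
finished = {
    "sample_qc": {
        "clean_dir": "/data/output/sample_qc/clean_files",
        "output_name": "mydata_sampleQCed",
    }
}

print(substitute("${base_output_dir}/qc_results", "/data/output", finished))
print(substitute("${steps.sample_qc.output_name}", "/data/output", finished))
```

In this sketch an unknown step name or attribute raises a KeyError rather than leaving the placeholder in place, so typos fail early.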
QC Pipeline Configuration
Sample QC Step
Performs individual-level quality control:
- name: "sample_qc"
  enabled: true
  module: "ideal_genom.qc.sample_qc"
  class: "SampleQC"
  init_params:
    input_path: "/data/input"          # Directory containing PLINK files
    input_name: "mydata"               # PLINK file prefix
    output_path: "${base_output_dir}"  # Output directory
    output_name: "mydata_sampleQCed"   # Output file prefix
    high_ld_regions_file: "auto"       # LD regions file (or "auto" for built-in)
    build: "38"                        # Genome build: "37" or "38"
  execute_params:
    rename_snp: true         # Rename SNPs to chr:pos format
    hh_to_missing: true      # Convert heterozygous haploid calls to missing
    use_kinship: true        # Use kinship instead of IBD
    ind_pair: [50, 5, 0.2]   # LD pruning [window, step, r²]
    mind: 0.02               # Max missing rate per individual
    sex_check: [0.2, 0.8]    # F coefficient [female_max, male_min]
    maf: 0.01                # Minor allele frequency threshold
    het_deviation: 3         # Heterozygosity SD threshold
    kinship: 0.354           # Kinship coefficient threshold
    ibd_threshold: 0.185     # IBD threshold for duplicates
init_params:
input_path (string): Directory containing input .bed/.bim/.fam files
input_name (string): PLINK file prefix (e.g., “mydata” for mydata.bed)
output_path (string): Where to save QC results
output_name (string): Prefix for output files
high_ld_regions_file (string): Path to high-LD regions file, or “auto” to use built-in
build (string): Genome build version - “37” (GRCh37/hg19) or “38” (GRCh38/hg38)
execute_params:
rename_snp (bool): Rename SNPs to chr:pos format for consistency
hh_to_missing (bool): Convert heterozygous haploid calls to missing
use_kinship (bool): Use KING kinship estimation (recommended over IBD)
ind_pair (list): LD pruning parameters [window_size, step_size, r²_threshold]
window_size: Window size in variant count (default: 50)
step_size: Step size in variant count (default: 5)
r² threshold: Correlation threshold (default: 0.2)
mind (float, 0-1): Maximum missing genotype rate per individual (default: 0.02 = 2%)
sex_check (list[float]): F coefficient thresholds [female_max, male_min]
female_max: Maximum F for females (default: 0.2)
male_min: Minimum F for males (default: 0.8)
Samples outside these ranges fail sex check
maf (float, 0-0.5): Minor allele frequency threshold for LD pruning
het_deviation (float): Standard deviations from mean heterozygosity (default: 3)
kinship (float): Kinship coefficient threshold for relatedness
0.354: lower bound for duplicates/monozygotic twins (KING's conventional cutoffs)
0.177: lower bound for 1st degree relatives
0.088: lower bound for 2nd degree relatives
ibd_threshold (float): IBD threshold for identifying duplicates/monozygotic twins
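The ranges listed above can be checked before a long run starts. The sketch below is one possible pre-flight check written for illustration only. The function name is hypothetical, the fallback values are taken from the example configuration, and the checks on r² and het_deviation are assumptions rather than documented rules.

```python
def check_sample_qc_params(params: dict) -> list[str]:
    """Return human-readable errors for out-of-range sample QC thresholds."""
    errors = []

    mind = params.get("mind", 0.02)
    if not 0 <= mind <= 1:
        errors.append("mind must be between 0 and 1")

    maf = params.get("maf", 0.01)
    if not 0 <= maf <= 0.5:
        errors.append("maf must be between 0 and 0.5")

    ind_pair = params.get("ind_pair", [50, 5, 0.2])
    if len(ind_pair) != 3 or not 0 < ind_pair[2] <= 1:
        errors.append("ind_pair must be [window, step, r2] with 0 < r2 <= 1")

    female_max, male_min = params.get("sex_check", [0.2, 0.8])
    if not female_max < male_min:
        errors.append("sex_check needs female_max < male_min")

    if params.get("het_deviation", 3) <= 0:
        errors.append("het_deviation must be positive")

    return errors


print(check_sample_qc_params({"mind": 0.02, "maf": 0.01}))
print(check_sample_qc_params({"mind": 2, "sex_check": [0.8, 0.2]}))
```

The first call prints an empty list; the second reports both the mind value and the swapped sex_check bounds.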
Ancestry QC Step
Detects population structure and removes ancestry outliers:
- name: "ancestry_qc"
  enabled: true
  module: "ideal_genom.qc.ancestry_qc"
  class: "AncestryQC"
  init_params:
    input_path: "${steps.sample_qc.clean_dir}"
    input_name: "${steps.sample_qc.output_name}"
    output_path: "${base_output_dir}"
    output_name: "mydata_ancestryQCed"
    high_ld_regions_file: "auto"
    build: "38"
  execute_params:
    ind_pair: [50, 5, 0.2]        # LD pruning for PCA
    pca: 10                       # Number of principal components
    maf: 0.01                     # MAF threshold for PCA
    ref_threshold: 4              # SD threshold for reference outliers
    stu_threshold: 4              # SD threshold for study outliers
    reference_pop: "EUR"          # Expected population
    num_pcs: 10                   # PCs for ancestry assignment
    distance_metric: "infinity"   # Distance metric for outlier detection
execute_params:
ind_pair (list[int]): LD pruning parameters for PCA variants
pca (int): Number of principal components to compute
maf (float): MAF threshold for variants included in PCA
ref_threshold (float): Standard deviations for reference population outliers
stu_threshold (float): Standard deviations for study population outliers
reference_pop (string): Expected population ancestry
“EUR”: European
“AFR”: African
“AMR”: Admixed American
“EAS”: East Asian
“SAS”: South Asian
num_pcs (int): Number of PCs used for ancestry classification
distance_metric (string): “euclidean”, “manhattan”, or “infinity” (Chebyshev)
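The three distance_metric options differ in how per-component differences are combined. The sketch below computes each one for a pair of points in PC space; the distance function is written for illustration and is not the routine AncestryQC uses internally.

```python
import math


def distance(a, b, metric="infinity"):
    """Distance between two points under the named metric."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if metric == "euclidean":
        return math.sqrt(sum(d * d for d in diffs))
    if metric == "manhattan":
        return sum(diffs)
    if metric == "infinity":  # Chebyshev: the largest single-axis difference
        return max(diffs)
    raise ValueError(f"unknown metric: {metric}")


sample = [3.0, 4.0]   # made-up coordinates on PC1 and PC2
centre = [0.0, 0.0]
for name in ("euclidean", "manhattan", "infinity"):
    print(name, distance(sample, centre, name))
# euclidean 5.0
# manhattan 7.0
# infinity 4.0
```

For the same pair of points the infinity metric never exceeds the Euclidean distance, which in turn never exceeds the Manhattan distance, so a fixed cutoff flags the fewest samples under "infinity".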
Variant QC Step
Performs variant-level quality control:
- name: "variant_qc"
  enabled: true
  module: "ideal_genom.qc.variant_qc"
  class: "VariantQC"
  init_params:
    input_path: "${steps.ancestry_qc.clean_dir}"
    input_name: "${steps.ancestry_qc.output_name}"
    output_path: "${base_output_dir}"
    output_name: "mydata_variantQCed"
  execute_params:
    miss_data_rate: 0.02         # Max missing rate across samples
    diff_genotype_rate: 1.0e-5   # Differential missingness p-value
    geno: 0.02                   # Max missing rate per variant
    maf: 0.01                    # Minor allele frequency
    hwe: 1.0e-6                  # Hardy-Weinberg equilibrium p-value
    chr_y: 24                    # Y chromosome identifier
execute_params:
miss_data_rate (float, 0-1): Maximum overall missing data rate threshold
diff_genotype_rate (float): P-value threshold for differential missingness between cases/controls
geno (float, 0-1): Maximum missing genotype rate per variant
maf (float, 0-0.5): Minor allele frequency threshold
Standard GWAS: 0.01-0.05
Rare variant analysis: 0.001-0.01
Very strict: 0.001
hwe (float, 0-1): Hardy-Weinberg equilibrium p-value threshold
Standard: 1e-6
Strict: 1e-10 (for genotyping array data)
Relaxed: 1e-4
chr_y (int): Y chromosome identifier (24 in PLINK's numeric chromosome coding, where X is 23; the coding does not depend on the genome build)
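To show how the geno, maf, and hwe thresholds interact, the sketch below applies them to a few made-up per-variant summaries. The dictionary keys, variant names, and the inclusive comparisons are assumptions for illustration; the real filtering is performed by the VariantQC step.

```python
def passes_variant_qc(stats: dict, geno: float = 0.02, maf: float = 0.01,
                      hwe: float = 1e-6) -> bool:
    """Return True when a variant clears all three thresholds."""
    return (
        stats["missing_rate"] <= geno   # not too many missing genotypes
        and stats["maf"] >= maf         # minor allele is common enough
        and stats["hwe_p"] >= hwe       # no strong deviation from HWE
    )


variants = {
    "var_a": {"missing_rate": 0.01, "maf": 0.12, "hwe_p": 0.40},
    "var_b": {"missing_rate": 0.05, "maf": 0.12, "hwe_p": 0.40},   # too much missing data
    "var_c": {"missing_rate": 0.00, "maf": 0.004, "hwe_p": 0.40},  # too rare
    "var_d": {"missing_rate": 0.00, "maf": 0.30, "hwe_p": 1e-9},   # fails HWE
}
kept = [name for name, stats in variants.items() if passes_variant_qc(stats)]
print(kept)  # ['var_a']
```

Lowering maf to 0.001 in the call would also keep var_c, which is the effect of the rare-variant settings listed above.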
Population Analysis Step
Performs dimensionality reduction and population visualization:
- name: "dimensionality_reduction"
  enabled: true
  module: "ideal_genom.population.projection"
  class: "DimensionalityReductionPipeline"
  init_params:
    input_path: "${steps.variant_qc.clean_dir}"
    input_name: "${steps.variant_qc.output_name}"
    output_path: "${base_output_dir}"
    build: "38"
    high_ld_regions_file: "auto"
    generate_plot: true
  execute_params:
    # PCA parameters
    pca_params:
      pca: 10
    force_pca_recompute: false
    # UMAP parameters
    run_umap: true
    umap_params:
      n_neighbors: 15
      min_dist: 0.1
      n_components: 2
    # t-SNE parameters
    run_tsne: true
    tsne_params:
      perplexity: 30
    # Plotting options
    case_control_markers: true
    plot_format: "png"
    dpi: 600
execute_params:
pca_params (mapping): PCA configuration
pca (int): Number of components to compute
force_pca_recompute (bool): Recompute PCA even if results exist
run_umap (bool): Enable UMAP analysis
umap_params (mapping): UMAP configuration
n_neighbors (int): Number of neighbors (5-50, default: 15)
min_dist (float): Minimum distance (0.0-1.0, default: 0.1)
n_components (int): Output dimensions (typically 2 or 3)
run_tsne (bool): Enable t-SNE analysis
tsne_params (mapping): t-SNE configuration
perplexity (int): Perplexity value (5-50, default: 30)
case_control_markers (bool): Color by case/control status
plot_format (string): “png”, “svg”, or “pdf”
dpi (int): Plot resolution (default: 600)
Settings Section
Global settings that apply to the entire pipeline:
Logging Settings
settings:
  logging:
    level: "INFO"           # Logging verbosity
    file_logging: true      # Write to log file
    console_logging: true   # Print to console
level (string): Log message detail level
“DEBUG”: Very detailed, for troubleshooting
“INFO”: Standard informational messages (recommended)
“WARNING”: Only warnings and errors
“ERROR”: Only errors
file_logging (bool): Save logs to pipeline.log in output directory
console_logging (bool): Print log messages to terminal
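These three settings map naturally onto Python's standard logging module. The sketch below shows one way they could be wired up. The build_logger helper and the logger name are hypothetical, and only the pipeline.log file name comes from the description above.

```python
import logging
import os
import tempfile


def build_logger(settings: dict, output_dir: str) -> logging.Logger:
    """Create a logger whose handlers follow the three logging settings."""
    logger = logging.getLogger("pipeline_sketch")
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(getattr(logging, settings.get("level", "INFO")))
    if settings.get("console_logging", True):
        logger.addHandler(logging.StreamHandler())
    if settings.get("file_logging", True):
        log_path = os.path.join(output_dir, "pipeline.log")
        logger.addHandler(logging.FileHandler(log_path))
    return logger


with tempfile.TemporaryDirectory() as tmp:
    settings = {"level": "WARNING", "file_logging": True, "console_logging": False}
    log = build_logger(settings, tmp)
    log.info("not written: below the WARNING level")
    log.warning("written to pipeline.log")
    for handler in log.handlers:
        handler.close()
    with open(os.path.join(tmp, "pipeline.log")) as fh:
        content = fh.read()

print(content.strip())  # written to pipeline.log
```

With level set to "WARNING", the informational message is dropped and only the warning reaches the file.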
Resource Settings
settings:
  resources:
    max_memory: null    # Maximum memory in MB
    max_threads: null   # Maximum CPU threads
max_memory (int or null): Maximum memory allocation in MB
null: Auto-detect (uses 2/3 of available RAM)
Explicit value: Set specific limit (e.g., 32000 for 32GB)
max_threads (int or null): Maximum CPU threads to use
null: Auto-detect (uses available cores - 2)
Explicit value: Set specific number
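The auto-detect rules can be written out as simple arithmetic. The sketch below is an illustration under stated assumptions: the function names are hypothetical, total RAM is passed in rather than detected, and the floor of one thread on small machines is a guess rather than documented behaviour.

```python
import os


def resolve_threads(max_threads, cpu_count=None):
    """Explicit value wins; otherwise use the available cores minus two."""
    if max_threads is not None:
        return max_threads
    cores = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, cores - 2)  # never drop below one thread (assumption)


def resolve_memory_mb(max_memory, total_ram_mb):
    """Explicit value wins; otherwise use two thirds of the available RAM."""
    if max_memory is not None:
        return max_memory
    return int(total_ram_mb * 2 / 3)


print(resolve_threads(None, cpu_count=8))      # 6
print(resolve_memory_mb(None, 48000))          # 32000
print(resolve_memory_mb(16000, 48000))         # 16000
```

On a machine with 8 cores and 48000 MB of available RAM, leaving both settings as null would therefore yield 6 threads and 32000 MB.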
File Management Settings
settings:
  files:
    keep_intermediate: true     # Preserve temporary files
    compress_outputs: false     # Compress output files
    overwrite_existing: false   # Overwrite existing results
keep_intermediate (bool): Keep temporary intermediate files
true: Keep all files (useful for debugging)
false: Clean up after each step (saves disk space)
compress_outputs (bool): Compress output files with gzip
overwrite_existing (bool): Overwrite existing output files
true: Overwrite without asking
false: Fail if outputs exist (safer)
Report Generation Settings
settings:
  reports:
    generate_reports: true   # Generate visualization reports
    plot_format: "png"       # Plot file format
generate_reports (bool): Automatically generate QC plots and reports
plot_format (string): Output format for plots
“png”: Standard format, good quality
“svg”: Vector format, scalable
“pdf”: Publication-ready format
Advanced Configuration Patterns
Conditional Step Execution
Skip steps based on your needs:
pipeline:
  steps:
    - name: "sample_qc"
      enabled: true
    - name: "ancestry_qc"
      enabled: false  # Skip for homogeneous population
    - name: "variant_qc"
      enabled: true
      init_params:
        # Connect directly to sample QC
        input_path: "${steps.sample_qc.clean_dir}"
Using Pre-existing Results
Resume the pipeline from an intermediate step:
pipeline:
  steps:
    - name: "sample_qc"
      enabled: false  # Already completed
    - name: "variant_qc"
      enabled: true
      init_params:
        # Use existing sample QC output
        input_path: "/data/output/my_study/sample_qc/clean_files"
        input_name: "mydata_sampleQCed"
Multiple Output Directories
Organize outputs by analysis type:
pipeline:
  base_output_dir: "/data/project"
  steps:
    - name: "sample_qc"
      init_params:
        output_path: "${base_output_dir}/qc_results"
    - name: "gwas_prep"
      init_params:
        output_path: "${base_output_dir}/gwas_analysis"
Parameter Tuning Guidelines
Sample QC Thresholds
For Standard Case-Control GWAS:
mind: 0.02 (2% missing)
maf: 0.01 (1% MAF)
het_deviation: 3 SD
kinship: 0.354 (KING's conventional cutoff for duplicates/monozygotic twins; use 0.177 to also exclude 1st degree relatives)
For Rare Variant Analysis:
mind: 0.01 (stricter)
maf: 0.001 (include rare variants)
het_deviation: 4 SD (more lenient)
For Family-Based Studies:
kinship: 0.088 (allow up to 3rd degree relatives)
Adjust sex_check if samples include children
Ancestry QC Thresholds
For Homogeneous Populations:
ref_threshold: 6 SD (softer)
stu_threshold: 6 SD (softer)
Consider disabling ancestry QC entirely
Variant QC Thresholds
For Array-Based Data:
geno: 0.02 (2% missing)
hwe: 1e-10 (very strict)
maf: 0.01
For Sequencing Data:
geno: 0.05 (more lenient)
hwe: 1e-6 (standard)
maf: 0.001 (include rare variants)
Common Configuration Examples
Minimal QC Pipeline
pipeline:
  name: "minimal_qc"
  base_output_dir: "/data/output"
  steps:
    - name: "sample_qc"
      enabled: true
      module: "ideal_genom.qc.sample_qc"
      class: "SampleQC"
      init_params:
        input_path: "/data/input"
        input_name: "mydata"
        output_path: "${base_output_dir}"
        output_name: "mydata_clean"
        high_ld_regions_file: "auto"
        build: "38"
      execute_params:
        mind: 0.02
        maf: 0.01

settings:
  logging:
    level: "INFO"
Complete QC with Ancestry
pipeline:
  name: "full_qc"
  base_output_dir: "/data/output"
  steps:
    - name: "sample_qc"
      enabled: true
      module: "ideal_genom.qc.sample_qc"
      class: "SampleQC"
      init_params:
        input_path: "/data/input"
        input_name: "mydata"
        output_path: "${base_output_dir}"
        output_name: "mydata_sampleQCed"
        high_ld_regions_file: "auto"
        build: "38"
      execute_params:
        mind: 0.02
        sex_check: [0.2, 0.8]
        maf: 0.01
        het_deviation: 3
        kinship: 0.354
    - name: "ancestry_qc"
      enabled: true
      module: "ideal_genom.qc.ancestry_qc"
      class: "AncestryQC"
      init_params:
        input_path: "${steps.sample_qc.clean_dir}"
        input_name: "${steps.sample_qc.output_name}"
        output_path: "${base_output_dir}"
        output_name: "mydata_ancestryQCed"
        high_ld_regions_file: "auto"
        build: "38"
      execute_params:
        pca: 10
        ref_threshold: 4
        stu_threshold: 4
        reference_pop: "EUR"
    - name: "variant_qc"
      enabled: true
      module: "ideal_genom.qc.variant_qc"
      class: "VariantQC"
      init_params:
        input_path: "${steps.ancestry_qc.clean_dir}"
        input_name: "${steps.ancestry_qc.output_name}"
        output_path: "${base_output_dir}"
        output_name: "mydata_final"
      execute_params:
        geno: 0.02
        maf: 0.01
        hwe: 1.0e-6
Troubleshooting Configuration
Configuration validation fails:
Check YAML syntax (indentation, colons, quotes)
Verify all required fields are present
Ensure paths exist and are accessible
Check module and class names are correct
Pipeline runs but produces no output:
Verify enabled: true for desired steps
Check input file paths are correct
Review pipeline.log for errors
Ensure output directory is writable
Memory errors:
Set max_memory explicitly
Reduce max_threads to free memory
Process datasets in batches
Enable keep_intermediate: false to save space
Variable substitution not working:
Ensure correct syntax: ${variable_name}
Check referenced step names match exactly
Verify step order (can’t reference future steps)
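The step-order rule for variable substitution can be checked mechanically. The sketch below scans init_params for ${steps.NAME...} references and reports any that point at a step not yet listed. The function name and message format are hypothetical, and the sketch only inspects string values directly inside init_params.

```python
import re

STEP_REF = re.compile(r"\$\{steps\.([^.}]+)\.[^}]+\}")


def find_bad_references(steps: list[dict]) -> list[str]:
    """Report references to steps that are not defined earlier in the list."""
    seen = set()
    problems = []
    for step in steps:
        for value in step.get("init_params", {}).values():
            if not isinstance(value, str):
                continue
            for ref in STEP_REF.findall(value):
                if ref not in seen:
                    problems.append(
                        f"{step['name']} references '{ref}' before it is defined"
                    )
        seen.add(step["name"])
    return problems


wrong_order = [
    {"name": "variant_qc",
     "init_params": {"input_path": "${steps.sample_qc.clean_dir}"}},
    {"name": "sample_qc", "init_params": {"input_path": "/data/input"}},
]
print(find_bad_references(wrong_order))
print(find_bad_references(list(reversed(wrong_order))))
```

The first call reports the forward reference from variant_qc; with the steps in the right order the list is empty. A misspelled step name is reported the same way, because it never appears among the earlier steps.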
See Also
Getting Started - Quick start guide
Examples - Complete workflow examples
Troubleshooting Guide - Detailed problem-solving
Frequently Asked Questions - Answers to common questions
Docker Paths:
When using Docker, paths should be relative to the container’s /data directory:
{
"input_directory": "/data/inputData",
"input_prefix": "mydata",
"output_directory": "/data/outputData",
"output_prefix": "clean_data",
"high_ld_file": "/data/dependables/high-LD-regions.txt"
}
Steps Configuration
The steps.json file controls which pipeline steps to execute:
{
"ancestry": true,
"sample": true,
"variant": true,
"umap": true,
"fst": true
}
Step Dependencies:
sample → ancestry → variant → dim reduction → fst
You can skip steps, but maintain dependencies
Results from previous steps are required for subsequent steps
Advanced Configuration
Custom LD Regions
Provide your own high-LD regions file:
# high-LD-regions.txt format
1 48000000 52000000 # Chromosome, start, end
2 85000000 100000000
6 25000000 35000000
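A file in this three-column layout is straightforward to read and sanity-check before use. The parser below is a sketch under assumptions: the function name is hypothetical, and treating everything after # as a comment follows the example above rather than a documented rule.

```python
def parse_ld_regions(text: str) -> list[tuple[str, int, int]]:
    """Parse 'chromosome start end' lines into (chrom, start, end) tuples."""
    regions = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        chrom, start, end = line.split()[:3]
        start, end = int(start), int(end)
        if start >= end:
            raise ValueError(f"start must be below end on chromosome {chrom}")
        regions.append((chrom, start, end))
    return regions


example = """\
# high-LD-regions.txt format
1 48000000 52000000 # Chromosome, start, end
2 85000000 100000000
6 25000000 35000000
"""
for region in parse_ld_regions(example):
    print(region)
```

Running the parser on a custom file before launching the pipeline catches swapped coordinates and non-numeric positions early.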
Performance Tuning
Memory Optimization:
Increase ind_pair window size for large datasets
Reduce pca components if memory is limited
Process chromosomes separately for very large datasets
Speed Optimization:
Use SSD storage for temporary files
Increase available CPU cores
Consider splitting large datasets
Disk Space Management:
Monitor intermediate file sizes
Clean up temporary files regularly
Use compression for archival storage
Best Practices
Version Control: Keep configuration files under version control
Documentation: Document parameter choices and rationale
Validation: Always validate results visually
Backup: Keep copies of successful configurations
Testing: Test parameter changes on small datasets first
Troubleshooting
Common Configuration Issues:
Path not found: Check absolute paths and permissions
Parameter out of range: Verify threshold values are reasonable
JSON syntax errors: Validate JSON format
Memory errors: Reduce dataset size or adjust parameters
See the Troubleshooting Guide for more detailed solutions.