Getting Started

This guide walks through setting up your first genomic analysis pipeline with IDEAL-GENOM, step by step, using the new YAML-based configuration system.

Overview

A typical IDEAL-GENOM analysis follows six steps:

Prepare Your Data: Ensure data is in PLINK1.9 format
Generate Configuration: Create a YAML configuration file
Customize Pipeline: Edit configuration to match your needs
Validate Configuration: Check for errors before running
Execute Pipeline: Run the analysis
Review Results: Examine outputs and visualizations

The New Configuration System

IDEAL-GENOM v0.2.0 introduces a YAML-based configuration system that replaces the previous JSON approach. Benefits include:

Single File: All settings in one place (no more separate parameters.json, paths.json, steps.json)
Hierarchical Structure: Clear organization of pipeline steps and parameters
Variable Substitution: Reference outputs from previous steps automatically
Enable/Disable Steps: Easily control which analyses to run
Comments: Built-in documentation within the config file

Quick Start: 5-Minute Tutorial

1. Get a Configuration Template

Configuration templates are included in the repository under yaml_configs/:

# Clone the repository (if you haven't already)
git clone https://github.com/cge-tubingens/ideal-genom-qc.git
cd ideal-genom-qc

# Copy the QC pipeline template
cp yaml_configs/qc_pipeline_config_template.yaml my_qc_pipeline.yaml

Available templates:

qc_pipeline_config_template.yaml: Complete QC pipeline
gwas_config_template.yaml: GWAS analysis pipeline
vcf_config_template.yaml: VCF post-imputation processing

2. Edit the Configuration

Open my_qc_pipeline.yaml and update the paths to match your data:

pipeline:
  name: "my_study_qc"
  base_output_dir: "/path/to/output"

  steps:
    - name: "sample_qc"
      enabled: true
      module: "ideal_genom.qc.sample_qc"
      class: "SampleQC"
      init_params:
        input_path: "/path/to/your/data"
        input_name: "mydata"
        output_path: "${base_output_dir}"
        output_name: "mydata_sampleQCed"
        high_ld_regions_file: "/path/to/high_ld_regions.txt"
        build: "38"

3. Validate Your Configuration

ideal-genom validate --config my_qc_pipeline.yaml

4. Run the Pipeline

ideal-genom run --config my_qc_pipeline.yaml

That’s it! The pipeline will execute all enabled steps in order.

Step-by-Step Guide

Step 1: Prepare Your Data

IDEAL-GENOM works with PLINK1.9 binary format files:

.bed: Binary genotype data
.bim: Variant information (chromosome, position, alleles, etc.)
.fam: Sample information (family ID, individual ID, phenotype, etc.)
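
Before pointing a configuration at a dataset, it can help to confirm that all three files are present. The sketch below is a standalone helper, not part of IDEAL-GENOM; the function name missing_plink_files and the example location are hypothetical.

```python
from pathlib import Path

PLINK_EXTENSIONS = (".bed", ".bim", ".fam")


def missing_plink_files(input_path, input_name):
    """Return the PLINK binary files that are missing for a given prefix."""
    directory = Path(input_path)
    # Append the extension to the full name so that prefixes containing
    # dots (e.g. "cohort.v2") are not truncated.
    candidates = [directory / (input_name + ext) for ext in PLINK_EXTENSIONS]
    return [str(path) for path in candidates if not path.is_file()]


# Hypothetical location; replace with your own input_path and input_name.
missing = missing_plink_files("/data/input", "mydata")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("Complete PLINK fileset found")
```

The two arguments correspond to the input_path and input_name values used in init_params.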

Convert from VCF (if needed):

plink --vcf mydata.vcf.gz --make-bed --out mydata

Data Requirements:

Genome build: GRCh37 (hg19) or GRCh38 (hg38)
For ancestry QC: 1000 Genomes reference files (auto-downloaded if not provided)
For high LD region filtering: high-LD-regions file (included with package)

Step 2: Create Your Configuration

Option A: Use a Template (Recommended)

Copy one of the provided templates from the repository:

# Copy the QC pipeline template
cp yaml_configs/qc_pipeline_config_template.yaml my_qc_pipeline.yaml

# Or for GWAS analysis
cp yaml_configs/gwas_config_template.yaml my_gwas_pipeline.yaml

# Or for VCF processing
cp yaml_configs/vcf_config_template.yaml my_vcf_pipeline.yaml

Option B: Start from Scratch

Create a minimal configuration file:

pipeline:
  name: "my_analysis"
  base_output_dir: "/data/output"

  steps:
    - name: "sample_qc"
      enabled: true
      module: "ideal_genom.qc.sample_qc"
      class: "SampleQC"
      init_params:
        input_path: "/data/input"
        input_name: "mydata"
        output_path: "${base_output_dir}"
        output_name: "mydata_sampleQCed"
        high_ld_regions_file: "auto"  # Use built-in file
        build: "38"
      execute_params:
        mind: 0.02
        sex_check: [0.2, 0.8]
        maf: 0.01
        het_deviation: 3
        kinship: 0.354

settings:
  logging:
    level: "INFO"
  resources:
    max_memory: null  # Auto-detect
    max_threads: null  # Auto-detect

Step 3: Understanding the Configuration Structure

The YAML configuration has two top-level sections, pipeline and settings; values in either can be linked with variable substitution:

Pipeline Section

pipeline:
  name: "pipeline_name"              # Descriptive name for your analysis
  base_output_dir: "/path/to/output" # All outputs will go here
  steps:                             # List of analysis steps (in order)
    - name: "step_name"
      enabled: true                  # Set to false to skip this step
      module: "ideal_genom.module"   # Python module path
      class: "ClassName"             # Class to instantiate
      init_params:                   # Parameters passed to __init__
        # ...
      execute_params:                # Parameters passed to execute()
        # ...

Variable Substitution

Reference values from elsewhere in the config:

pipeline:
  base_output_dir: "/data/output"
  steps:
    - name: "sample_qc"
      init_params:
        output_path: "${base_output_dir}"  # Uses /data/output

    - name: "variant_qc"
      init_params:
        # Use output from previous step
        input_path: "${steps.sample_qc.clean_dir}"
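
To make the substitution idea concrete, here is a minimal sketch of how ${...} placeholders could be resolved against a lookup table. It is an illustration only and not IDEAL-GENOM's actual resolver; the resolve function and the flat context dictionary are assumptions made for this example.

```python
import re

# Matches ${name} as well as dotted forms such as ${steps.sample_qc.clean_dir}
PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def resolve(value, context):
    """Replace every ${key} in a string with context[key].

    Unknown keys raise KeyError so that typos surface early.
    """
    def lookup(match):
        key = match.group(1).strip()
        if key not in context:
            raise KeyError(f"Unknown variable: {key}")
        return str(context[key])

    return PLACEHOLDER.sub(lookup, value)


# Hypothetical context: one top-level value and one value published by a step
context = {
    "base_output_dir": "/data/output",
    "steps.sample_qc.clean_dir": "/data/output/sample_qc/clean_files",
}

print(resolve("${base_output_dir}", context))            # /data/output
print(resolve("${steps.sample_qc.clean_dir}", context))
```

Failing loudly on an unknown key is a design choice for the sketch: a silently unresolved placeholder would otherwise end up as a literal path on disk.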

Settings Section

settings:
  logging:
    level: "INFO"                    # DEBUG, INFO, WARNING, ERROR
    file_logging: true               # Log to file
    console_logging: true            # Log to console

  resources:
    max_memory: null                 # null = auto-detect (uses 2/3 available)
    max_threads: null                # null = auto-detect (uses cores - 2)

  files:
    keep_intermediate: true          # Keep temporary files
    compress_outputs: false          # Compress output files
    overwrite_existing: false        # Overwrite existing results
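
The auto-detect defaults described in the comments above (2/3 of available memory, cores minus 2) amount to simple arithmetic. The sketch below reproduces that arithmetic under stated assumptions: the function name default_resources is hypothetical, available memory is passed in by the caller, and the floor of one thread is an addition for small machines rather than documented behavior.

```python
import os


def default_resources(available_memory_mb, cpu_count=None):
    """Apply the documented defaults: 2/3 of memory and (cores - 2) threads."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    max_memory = int(available_memory_mb * 2 / 3)
    # Assumption: never drop below one thread on machines with few cores
    max_threads = max(1, cpu_count - 2)
    return max_memory, max_threads


memory_mb, threads = default_resources(available_memory_mb=48000, cpu_count=16)
print(memory_mb, threads)  # 32000 14
```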

Step 4: Configure Your Pipeline Steps

Sample QC - Remove low-quality samples

- name: "sample_qc"
  enabled: true
  module: "ideal_genom.qc.sample_qc"
  class: "SampleQC"
  init_params:
    input_path: "/data/input"
    input_name: "mydata"
    output_path: "${base_output_dir}"
    output_name: "mydata_sampleQCed"
    high_ld_regions_file: "auto"
    build: "38"
  execute_params:
    rename_snp: true              # Rename SNPs to chr:pos format
    hh_to_missing: true           # Convert homozygous haploid calls to missing
    use_kinship: true             # Use kinship instead of IBD
    ind_pair: [50, 5, 0.2]        # LD pruning: window, step, r² threshold
    mind: 0.02                    # Max missing rate per individual (2%)
    sex_check: [0.2, 0.8]         # F coefficient bounds [female_max, male_min]
    maf: 0.01                     # Minor allele frequency threshold
    het_deviation: 3              # Heterozygosity SD threshold
    kinship: 0.354                # Kinship coefficient cutoff (0.354 flags duplicates/MZ twins on the KING scale)
    ibd_threshold: 0.185          # IBD threshold for duplicate detection
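
As an illustration of what a threshold such as het_deviation expresses, the sketch below flags samples whose heterozygosity rate lies more than a given number of standard deviations from the cohort mean. It is a simplified reading of the parameter, not the SampleQC implementation; the function name and the made-up rates are assumptions.

```python
from statistics import mean, stdev


def heterozygosity_outliers(het_rates, het_deviation=3):
    """Return sample IDs whose heterozygosity rate lies more than
    het_deviation standard deviations from the cohort mean."""
    center = mean(het_rates.values())
    spread = stdev(het_rates.values())
    return [
        sample_id
        for sample_id, rate in het_rates.items()
        if abs(rate - center) > het_deviation * spread
    ]


# Made-up rates: 19 typical samples and one with excess heterozygosity
het_rates = {f"S{i}": 0.30 + 0.001 * (i % 3) for i in range(19)}
het_rates["OUT"] = 0.45

print(heterozygosity_outliers(het_rates))  # ['OUT']
```

In very small cohorts a single outlier inflates the standard deviation enough that it can never exceed 3 SD, so thresholds of this kind are only informative with a reasonable number of samples.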

Ancestry QC - Detect population outliers

- name: "ancestry_qc"
  enabled: true
  module: "ideal_genom.qc.ancestry_qc"
  class: "AncestryQC"
  init_params:
    input_path: "${steps.sample_qc.clean_dir}"
    input_name: "${steps.sample_qc.output_name}"
    output_path: "${base_output_dir}"
    output_name: "mydata_ancestryQCed"
    high_ld_regions_file: "auto"
    build: "38"
  execute_params:
    ind_pair: [50, 5, 0.2]        # LD pruning for PCA
    pca: 10                       # Number of PCs to compute
    maf: 0.01                     # MAF threshold
    ref_threshold: 4              # SD threshold for reference outliers
    stu_threshold: 4              # SD threshold for study outliers
    reference_pop: "EUR"          # Expected population (EUR, AFR, AMR, EAS, SAS)
    num_pcs: 10                   # Number of PCs for ancestry assignment
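
One simple way to read stu_threshold is as a per-component distance cutoff: a study sample is an outlier if any of its principal components lies more than that many standard deviations from the reference population mean. The sketch below implements this reading with made-up PC values; the actual AncestryQC procedure may differ, and the function name is hypothetical.

```python
from statistics import mean, stdev


def ancestry_outliers(study_pcs, reference_pcs, threshold=4):
    """Flag study samples lying more than `threshold` SDs from the
    reference-population mean on any principal component."""
    n_pcs = len(reference_pcs[0])
    centers = [mean(s[i] for s in reference_pcs) for i in range(n_pcs)]
    spreads = [stdev(s[i] for s in reference_pcs) for i in range(n_pcs)]
    return [
        sample_id
        for sample_id, pcs in study_pcs.items()
        if any(
            abs(pcs[i] - centers[i]) > threshold * spreads[i]
            for i in range(n_pcs)
        )
    ]


# Made-up values for two PCs
reference_pcs = [(0.0, 0.0), (1.0, 1.0), (-1.0, -1.0), (0.5, -0.5), (-0.5, 0.5)]
study_pcs = {"A": (0.2, 0.1), "B": (5.0, 0.0)}

print(ancestry_outliers(study_pcs, reference_pcs))  # ['B']
```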

Variant QC - Remove low-quality variants

- name: "variant_qc"
  enabled: true
  module: "ideal_genom.qc.variant_qc"
  class: "VariantQC"
  init_params:
    input_path: "${steps.ancestry_qc.clean_dir}"
    input_name: "${steps.ancestry_qc.output_name}"
    output_path: "${base_output_dir}"
    output_name: "mydata_variantQCed"
  execute_params:
    miss_data_rate: 0.02          # Max missing rate across all samples
    diff_genotype_rate: 1.0e-5    # Differential missingness p-value
    geno: 0.02                    # Max missing rate per variant
    maf: 0.01                     # Minor allele frequency threshold
    hwe: 1.0e-6                   # Hardy-Weinberg equilibrium p-value
    chr_y: 24                     # Y chromosome code (24 in PLINK's numeric coding)
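
The variant-level thresholds can be read as three independent filters. The sketch below applies geno, maf, and hwe to precomputed per-variant statistics; it is illustrative only, and the field names missing_rate, maf, and hwe_p are hypothetical rather than taken from IDEAL-GENOM output.

```python
def failed_filters(variant, geno=0.02, maf=0.01, hwe=1.0e-6):
    """Return the names of the filters a variant fails (empty list = pass)."""
    failures = []
    if variant["missing_rate"] > geno:
        failures.append("geno")   # too many missing genotype calls
    if variant["maf"] < maf:
        failures.append("maf")    # minor allele too rare
    if variant["hwe_p"] < hwe:
        failures.append("hwe")    # strong departure from Hardy-Weinberg
    return failures


# Made-up statistics for three variants
variants = {
    "rs1": {"missing_rate": 0.001, "maf": 0.25, "hwe_p": 0.4},
    "rs2": {"missing_rate": 0.05, "maf": 0.004, "hwe_p": 0.3},
    "rs3": {"missing_rate": 0.0, "maf": 0.12, "hwe_p": 1e-9},
}

for variant_id, stats in variants.items():
    print(variant_id, failed_filters(stats) or "pass")
```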

Step 5: Validate Your Configuration

Before running the pipeline, validate your configuration:

ideal-genom validate --config my_qc_pipeline.yaml

This checks for:

File paths existence
Required parameters
Parameter value ranges
Module and class availability
Configuration syntax
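
For intuition about the "required parameters" check, the sketch below inspects an already-parsed configuration dictionary for missing keys and duplicate step names. It is not the ideal-genom validate implementation; the set of required keys and the function name are assumptions chosen for the example.

```python
# Assumption: these keys are treated as mandatory for every step
REQUIRED_STEP_KEYS = ("name", "module", "class", "init_params")


def find_config_problems(config):
    """Return a list of human-readable problems found in a parsed config."""
    pipeline = config.get("pipeline")
    if not isinstance(pipeline, dict):
        return ["missing top-level 'pipeline' section"]
    problems = []
    for key in ("name", "base_output_dir"):
        if key not in pipeline:
            problems.append(f"pipeline.{key} is required")
    seen = set()
    for index, step in enumerate(pipeline.get("steps", [])):
        label = step.get("name", f"#{index}")
        for key in REQUIRED_STEP_KEYS:
            if key not in step:
                problems.append(f"step {label}: missing '{key}'")
        if label in seen:
            problems.append(f"step {label}: duplicate name")
        seen.add(label)
    return problems


# Deliberately incomplete example configuration
config = {
    "pipeline": {
        "name": "my_study_qc",
        "steps": [
            {"name": "sample_qc", "module": "ideal_genom.qc.sample_qc",
             "class": "SampleQC", "init_params": {}},
            {"name": "sample_qc", "module": "ideal_genom.qc.sample_qc"},
        ],
    }
}

for problem in find_config_problems(config):
    print(problem)
```

Collecting every problem before reporting, rather than stopping at the first, lets a user fix the whole file in one pass.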

Example output:

✓ Configuration file is valid
✓ Pipeline 'my_study_qc' configured with 3/3 enabled steps

Step 6: Run the Pipeline

Basic Execution

ideal-genom run --config my_qc_pipeline.yaml

Dry Run (Preview Without Executing)

ideal-genom run --config my_qc_pipeline.yaml --dry-run

Example dry-run output:

============================================================
PIPELINE SUMMARY (DRY RUN)
============================================================
Pipeline Name: my_study_qc
Output Directory: /data/output
Total Steps: 3
Enabled Steps: 3

Enabled Steps:
  1. sample_qc (ideal_genom.qc.sample_qc.SampleQC)
  2. ancestry_qc (ideal_genom.qc.ancestry_qc.AncestryQC)
  3. variant_qc (ideal_genom.qc.variant_qc.VariantQC)
============================================================

Custom Logging Level

ideal-genom run --config my_qc_pipeline.yaml --log-level DEBUG

Step 7: Understanding the Results

After pipeline execution, your output directory will contain:

/data/output/
├── my_study_qc/                    # Pipeline-specific directory
│   ├── sample_qc/
│   │   ├── clean_files/            # QC-passed data
│   │   │   ├── mydata_sampleQCed.bed
│   │   │   ├── mydata_sampleQCed.bim
│   │   │   └── mydata_sampleQCed.fam
│   │   ├── fail_samples/           # Removed samples with reasons
│   │   │   ├── failed_mind.txt
│   │   │   ├── failed_sexcheck.txt
│   │   │   ├── failed_het.txt
│   │   │   └── failed_kinship.txt
│   │   └── plots/                  # Visualization reports
│   │       ├── call_rate.png
│   │       ├── heterozygosity.png
│   │       ├── sex_check.png
│   │       └── kinship_distribution.png
│   ├── ancestry_qc/
│   │   ├── clean_files/
│   │   ├── fail_samples/
│   │   │   └── ancestry_outliers.txt
│   │   └── plots/
│   │       ├── pca_all_samples.png
│   │       ├── pca_after_qc.png
│   │       └── scree_plot.png
│   └── variant_qc/
│       ├── clean_files/            # Final QC-passed variants
│       │   ├── mydata_variantQCed.bed  # Ready for GWAS!
│       │   ├── mydata_variantQCed.bim
│       │   └── mydata_variantQCed.fam
│       ├── fail_variants/
│       │   ├── failed_geno.txt
│       │   ├── failed_hwe.txt
│       │   └── failed_maf.txt
│       └── plots/
│           ├── maf_distribution.png
│           ├── hwe_distribution.png
│           └── missingness.png
└── pipeline.log                    # Complete execution log

Key Output Files:

clean_files/: Final PLINK binary files ready for downstream analysis (GWAS, etc.)
fail_samples/ and fail_variants/: Lists of excluded samples and variants, with the QC failure reason for each
plots/: Publication-ready visualizations for QC reporting
pipeline.log: Detailed log of all operations, parameters, and results
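
A quick way to review how many samples or variants each filter removed is to count the entries in the failure lists. The sketch below assumes the directory layout shown above and one record per line; if the files carry a header line it would be counted as well. The function name count_exclusions is hypothetical.

```python
from pathlib import Path


def count_exclusions(step_dir):
    """Count non-empty lines in each .txt file under fail_samples/
    and fail_variants/ of a single step directory."""
    counts = {}
    for folder in ("fail_samples", "fail_variants"):
        directory = Path(step_dir) / folder
        if not directory.is_dir():
            continue  # a step only has the folder relevant to it
        for path in sorted(directory.glob("*.txt")):
            with path.open() as handle:
                n_lines = sum(1 for line in handle if line.strip())
            counts[f"{folder}/{path.name}"] = n_lines
    return counts


# Path follows the layout above; adjust to your own base_output_dir
for name, n in count_exclusions("/data/output/my_study_qc/sample_qc").items():
    print(f"{name}: {n}")
```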

Using the Python API

For more control, use the Python API directly:

Basic Example

from ideal_genom.core.config import load_config
from ideal_genom.core.pipeline import PipelineExecutor

# Load configuration
config = load_config("my_qc_pipeline.yaml")

# Create and execute pipeline
executor = PipelineExecutor(config)
executor.execute()

Advanced Example with Custom Handling

from ideal_genom.core.config import load_config
from ideal_genom.core.pipeline import PipelineExecutor
import logging

# Setup custom logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

# Load configuration
config = load_config("my_qc_pipeline.yaml")

# Create executor
executor = PipelineExecutor(config, dry_run=False)

# Get pipeline summary
summary = executor.get_pipeline_summary()
print(f"Running pipeline: {summary['pipeline_name']}")
print(f"Enabled steps: {summary['enabled_steps']}")

# Execute
try:
    executor.execute()
    print("✓ Pipeline completed successfully!")
except Exception as e:
    print(f"✗ Pipeline failed: {e}")
    raise

Using Individual Modules

from ideal_genom.qc.sample_qc import SampleQC
from pathlib import Path

# Initialize Sample QC
sample_qc = SampleQC(
    input_path=Path("/data/input"),
    input_name="mydata",
    output_path=Path("/data/output"),
    output_name="mydata_sampleQCed",
    high_ld_regions_file="auto",
    build="38"
)

# Run with custom parameters
sample_qc.execute_sample_qc_pipeline(sample_params={
    "rename_snp": True,
    "mind": 0.02,
    "sex_check": [0.2, 0.8],
    "maf": 0.01,
    "het_deviation": 3,
    "kinship": 0.354
})

# Access results
print(f"Clean data saved to: {sample_qc.clean_dir}")

Common Workflows

Workflow 1: Complete QC Pipeline

pipeline:
  name: "full_qc"
  base_output_dir: "/data/output"
  steps:
    - name: "sample_qc"
      enabled: true
      # ... (sample QC config)
    - name: "ancestry_qc"
      enabled: true
      # ... (ancestry QC config)
    - name: "variant_qc"
      enabled: true
      # ... (variant QC config)

Workflow 2: Skip Ancestry QC (Homogeneous Population)

pipeline:
  steps:
    - name: "sample_qc"
      enabled: true
      # ...
    - name: "ancestry_qc"
      enabled: false  # Skip ancestry analysis
    - name: "variant_qc"
      enabled: true
      init_params:
        # Connect directly to sample QC output
        input_path: "${steps.sample_qc.clean_dir}"
        input_name: "${steps.sample_qc.output_name}"
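
When a step is disabled, later steps that still reference its outputs need rewiring, as variant_qc is above. The sketch below scans a list of step dictionaries for such dangling references. It assumes a disabled step publishes no outputs and that enabled defaults to true when omitted; neither is confirmed IDEAL-GENOM behavior, and the function name is hypothetical.

```python
import re

# Captures the step name in references such as ${steps.sample_qc.clean_dir}
STEP_REFERENCE = re.compile(r"\$\{steps\.([^.}]+)\.[^}]+\}")


def references_to_disabled_steps(steps):
    """Return (step, referenced_step) pairs where an enabled step reads
    a ${steps.<name>...} variable from a disabled step."""
    disabled = {s["name"] for s in steps if not s.get("enabled", True)}
    problems = []
    for step in steps:
        if not step.get("enabled", True):
            continue
        for value in step.get("init_params", {}).values():
            if not isinstance(value, str):
                continue
            for referenced in STEP_REFERENCE.findall(value):
                if referenced in disabled:
                    problems.append((step["name"], referenced))
    return problems


# variant_qc still points at the disabled ancestry_qc step
steps = [
    {"name": "sample_qc", "enabled": True, "init_params": {}},
    {"name": "ancestry_qc", "enabled": False},
    {
        "name": "variant_qc",
        "enabled": True,
        "init_params": {"input_path": "${steps.ancestry_qc.clean_dir}"},
    },
]

print(references_to_disabled_steps(steps))  # [('variant_qc', 'ancestry_qc')]
```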

Workflow 3: Resume from Previous Step

pipeline:
  steps:
    - name: "sample_qc"
      enabled: false  # Already completed
    - name: "ancestry_qc"
      enabled: false  # Already completed
    - name: "variant_qc"
      enabled: true
      init_params:
        # Use existing ancestry QC results
        input_path: "/data/output/my_study/ancestry_qc/clean_files"
        input_name: "mydata_ancestryQCed"

Tips and Best Practices

Configuration Management

Use descriptive pipeline names
Comment your configuration extensively
Keep configuration files in version control (git)
Create separate configs for different studies/populations

Resource Management

Set max_memory and max_threads to null for auto-detection
For large datasets (>100K samples), consider increasing memory allocation
Monitor logs for memory/performance issues

Quality Control Thresholds

Standard thresholds work for most datasets
For rare variant analysis, lower the MAF threshold (e.g., 0.001)
For array data, consider a stricter HWE threshold (e.g., 1e-10)
Adjust kinship threshold based on study design (family vs. unrelated)

File Organization

Use consistent naming conventions
Keep intermediate files during initial runs (keep_intermediate: true)
Enable logging to files (file_logging: true)
Generate visualization reports (generate_reports: true)

Debugging

Always validate configuration before running
Use --dry-run to preview pipeline execution
Set --log-level DEBUG for detailed troubleshooting
Check fail_samples/fail_variants files to understand QC failures

Troubleshooting Common Issues

Issue: “Module not found” error

Solution: Check that the module path in your config is correct.
Example: "ideal_genom.qc.sample_qc" not "ideal_genom_qc.sample_qc"

Issue: “File not found” for input data

Solution: Ensure paths are absolute or relative to execution directory.
Use ${base_output_dir} for variable substitution.

Issue: Pipeline runs but produces no output

Solution: Check that each step you expect to run has enabled: true in the configuration.
Verify input files exist at specified paths.

Issue: High memory usage

Solution: Set max_memory explicitly in settings.resources.
Consider splitting large datasets or increasing available RAM.

Next Steps

Now that you understand the basics:

Explore Examples: See Examples for complete workflows
Understand Configuration: Read Configuration Guide for all parameters
Learn GWAS: Check GWAS Pipeline for association analysis
Process VCF Files: See VCF Processing Pipeline for post-imputation workflows
API Reference: Browse module documentation for advanced usage

Additional Resources:

Configuration templates: Clone the repository to access yaml_configs/ directory
Example notebooks in notebooks/ directory
Frequently Asked Questions for answers to common questions
Troubleshooting Guide for detailed problem-solving

Getting Help:

GitHub Issues: https://github.com/cge-tubingens/IDEAL-GENOM-QC/issues
Check logs: Review pipeline.log for detailed execution information
Community: Join discussions on the GitHub repository