VCF Processing Pipeline

The VCF (Variant Call Format) processing pipeline in IDEAL-GENOM provides tools for handling post-imputation VCF files. It automates the complete workflow from imputed VCF files to analysis-ready PLINK binary format, including quality filtering, normalization, annotation, and format conversion.

Overview

The VCF pipeline consists of two main components:

VCF Processing: Filter, normalize, annotate, and concatenate imputed VCF files
PLINK Conversion: Convert processed VCF files to PLINK binary format

Key Features:

Parallel processing of multiple chromosome VCF files
Quality filtering based on imputation R² score
Reference genome normalization
Variant ID annotation with dbSNP
Automatic handling of multiallelic variants
Efficient conversion to PLINK format
Resource-aware parallel execution

Use Cases

This pipeline is ideal for:

Post-Imputation Processing: Clean and prepare imputed genotype data
TOPMed/HRC Imputation: Process results from imputation servers
Data Harmonization: Standardize variant representation and IDs
GWAS Preparation: Convert VCF to PLINK for association analysis

Prerequisites

Required Software:

BCFtools (v1.10+) for VCF manipulation
PLINK 2.0 for format conversion
Python 3.11+ with required packages

Required Data:

Imputed VCF files (can be gzipped)
Reference genome FASTA (auto-downloaded if not provided)
dbSNP VCF for annotation (optional but recommended)

Input File Formats:

Supported input formats:

.vcf.gz - Gzipped VCF files (recommended)
.vcf - Uncompressed VCF files
.zip - Zipped VCF files (automatically extracted)

Quick Start

1. Get the VCF Configuration Template

# Copy the VCF processing template
cp yaml_configs/vcf_config_template.yaml my_vcf_pipeline.yaml

2. Edit Configuration

Update paths to your imputed data:

pipeline:
  name: "vcf_processing"
  base_output_dir: "/data/output"
  steps:
    - name: "process_vcf"
      enabled: true
      init_params:
        input_path: "/data/imputed_vcfs"
        output_path: "${base_output_dir}"
        output_name: "processed_data.vcf.gz"
      execute_params:
        r2_threshold: 0.3
        build: "38"

3. Run the Pipeline

ideal-genom run --config my_vcf_pipeline.yaml

Pipeline Steps

Step 1: VCF Processing

Processes imputed VCF files through multiple quality control and normalization steps.

What it does:

Unzip Files: Extract compressed VCF files if needed
Filter by R²: Remove poorly imputed variants (R² < threshold)
Normalize: Split multiallelic variants and left-align indels
Reference Normalization: Ensure variants match reference genome
Annotate IDs: Add or update variant IDs with dbSNP rs numbers
Concatenate: Merge all chromosome VCFs into a single file
Index: Create tabix index for the final VCF

Configuration:

- name: "process_vcf"
  enabled: true
  module: "ideal_genom.post_imputation.vcf_process"
  class: "ProcessVCF"
  init_params:
    input_path: "/data/imputed_vcfs"
    output_path: "${base_output_dir}"
    input_name: "placeholder"
    output_name: "processed_data.vcf.gz"
  execute_params:
    password: null              # For password-protected zip files
    r2_threshold: 0.3           # Minimum imputation quality
    build: "38"                 # Genome build ("37" or "38")
    ref_genome: null            # Custom reference (auto-downloaded if null)
    ref_annotation: null        # dbSNP VCF for annotation
    max_threads: null           # Thread count (auto-detect if null)

Parameters Explained:

init_params:

input_path (string): Directory containing imputed VCF files
- Can contain .vcf.gz, .vcf, or .zip files
- Automatically detects chromosome-specific files
output_path (string): Where to save processed files
input_name (string): Placeholder for compatibility (can be any value)
output_name (string): Name for final concatenated VCF file

execute_params:

password (string or null): Password for encrypted zip files
r2_threshold (float, 0-1): Minimum R² (imputation quality) score
- 0.3: Liberal (includes more variants, some lower quality)
- 0.5: Moderate (balance quality and quantity)
- 0.8: Conservative (high quality, fewer variants)
- Variants with R² < threshold are removed
build (string): Genome build version
- “37”: GRCh37/hg19
- “38”: GRCh38/hg38
ref_genome (string or null): Path to reference genome FASTA
- null: Auto-download from Ensembl based on build
- Custom path: Use your own reference FASTA file
ref_annotation (string or null): Path to dbSNP VCF for variant annotation
- null: Skip annotation step
- Path to dbSNP VCF: Annotate variants with rs IDs
max_threads (int or null): Maximum threads for parallel processing
- null: Auto-detect (uses available cores)
- Specific number: Limit thread usage

Processing Pipeline Details:

Input VCF files (per chromosome)
       ↓
[1. Unzip if needed]
       ↓
[2. Filter: R² >= threshold]
       ↓
[3. Normalize: split multiallelic, left-align]
       ↓
[4. Reference normalization]
       ↓
[5. Annotate with dbSNP IDs]
       ↓
[6. Concatenate all chromosomes]
       ↓
[7. Index final VCF]
       ↓
Output: processed_data.vcf.gz + .tbi

Output Files:

process_vcf/
├── unzipped/                   # Extracted VCF files
│   ├── chr1.vcf.gz
│   ├── chr2.vcf.gz
│   └── ...
├── filtered/                   # After R² filtering
│   ├── chr1_filtered.vcf.gz
│   ├── chr2_filtered.vcf.gz
│   └── ...
├── normalized/                 # After normalization
│   ├── chr1_norm.vcf.gz
│   ├── chr2_norm.vcf.gz
│   └── ...
├── ref_normalized/             # After reference normalization
│   ├── chr1_refnorm.vcf.gz
│   └── ...
├── annotated/                  # After dbSNP annotation
│   ├── chr1_annotated.vcf.gz
│   └── ...
└── processed_data.vcf.gz       # Final concatenated file
    └── processed_data.vcf.gz.tbi  # Tabix index

Step 2: PLINK Conversion

Converts processed VCF files to PLINK binary format for GWAS and other analyses.

What it does:

Converts VCF to PLINK1.9 binary format (.bed/.bim/.fam)
Handles sample ID formatting
Updates family information if provided
Optimizes for downstream analysis

Configuration:

- name: "plink_conversion"
  enabled: true
  module: "ideal_genom.post_imputation.vcf_to_plink"
  class: "GetPLINK"
  init_params:
    input_path: "${steps.process_vcf.output_path}"
    input_name: "${steps.process_vcf.concatenated_file}"
    output_path: "${base_output_dir}"
    output_name: "imputed_data_plink"
  execute_params:
    double_id: true             # Use sample ID for both FID and IID
    for_fam_update_file: null   # FAM file for updating family info
    threads: null               # Thread count (auto-detect)
    memory: null                # Memory in MB (auto-detect)

Parameters Explained:

init_params:

input_path (string): Directory containing processed VCF
input_name (string): Name of input VCF file (without extension)
output_path (string): Where to save PLINK files
output_name (string): Prefix for output PLINK files

execute_params:

double_id (bool): Sample ID handling
- true: Set FID = IID (recommended for most cases)
- false: Use VCF sample IDs as-is
for_fam_update_file (string or null): Path to FAM file for updating
- null: No update, use VCF phenotype information
- Path: Update .fam with information from this file (FID, IID, phenotype, sex)
threads (int or null): Number of threads for conversion
memory (int or null): Memory allocation in MB

Output Files:

plink_conversion/
├── imputed_data_plink.bed      # Binary genotype file
├── imputed_data_plink.bim      # Variant information
├── imputed_data_plink.fam      # Sample information
└── imputed_data_plink.log      # Conversion log

Complete Workflow Example

Full VCF Processing Configuration:

pipeline:
  name: "imputed_data_processing"
  base_output_dir: "/data/processed_imputation"

  steps:
    # Step 1: Process VCF files
    - name: "process_vcf"
      enabled: true
      module: "ideal_genom.post_imputation.vcf_process"
      class: "ProcessVCF"
      init_params:
        input_path: "/data/imputation_results"
        output_path: "${base_output_dir}"
        input_name: "placeholder"
        output_name: "imputed_clean.vcf.gz"
      execute_params:
        password: null
        r2_threshold: 0.3
        build: "38"
        ref_genome: null
        ref_annotation: "/data/references/dbSNP156_GRCh38.vcf.gz"
        max_threads: null

    # Step 2: Convert to PLINK
    - name: "plink_conversion"
      enabled: true
      module: "ideal_genom.post_imputation.vcf_to_plink"
      class: "GetPLINK"
      init_params:
        input_path: "${steps.process_vcf.output_path}"
        input_name: "${steps.process_vcf.concatenated_file}"
        output_path: "${base_output_dir}"
        output_name: "imputed_plink"
      execute_params:
        double_id: true
        for_fam_update_file: "/data/original_study.fam"
        threads: null
        memory: null

settings:
  logging:
    level: "INFO"
    file_logging: true
  resources:
    max_memory: null
    max_threads: null
  files:
    keep_intermediate: true

Running the Pipeline:

# Validate
ideal-genom validate --config vcf_pipeline.yaml

# Execute
ideal-genom run --config vcf_pipeline.yaml

Input Data Organization

Expected Directory Structure

For chromosome-specific VCF files:

/data/imputation_results/
├── chr1.dose.vcf.gz
├── chr2.dose.vcf.gz
├── chr3.dose.vcf.gz
├── ...
├── chr22.dose.vcf.gz
└── chrX.dose.vcf.gz

Or as zip files:

/data/imputation_results/
├── chr1.zip
├── chr2.zip
├── ...
└── chr22.zip

File Naming Patterns:

The pipeline automatically detects files with these patterns:

chr*.vcf.gz
chr*.vcf
*.dose.vcf.gz
chromosome_*.vcf.gz
chr*.zip

VCF File Requirements

Required INFO Fields:

R2 or INFO: Imputation quality score (R²)
- Used for quality filtering
- Typically ranges from 0 to 1

Quality Control Considerations

R² Threshold Selection

Recommended Thresholds:

R² Threshold  |  Quality     |  Use Case
--------------|--------------|--------------------------------
0.3           |  Liberal     |  Discovery, maximize variants
0.5           |  Moderate    |  Standard analysis
0.8           |  Conservative|  High-confidence variants only
0.9           |  Very strict |  Fine-mapping, functional studies

Trade-offs:

Lower threshold (0.3): More variants, some lower quality
Higher threshold (0.8): Fewer variants, higher confidence

Check Imputation Quality:

# Before filtering
bcftools query -f '%INFO/R2\n' chr1.vcf.gz | \\
    awk '{sum+=$1; count++} END {print "Mean R2:", sum/count}'

# Distribution
bcftools query -f '%INFO/R2\n' chr1.vcf.gz | \\
    awk '{if($1<0.3) low++; else if($1<0.8) med++; else high++}
         END {print "R2<0.3:", low, "0.3-0.8:", med, "R2>=0.8:", high}'

Reference Genome Selection

Auto-Download (Recommended):

Set ref_genome: null and specify build:

Build “37”: Downloads GRCh37 from Ensembl
Build “38”: Downloads GRCh38 from Ensembl

Custom Reference:

Use your own reference if:

Using non-standard reference assembly
Working offline
Need specific reference version

Ensure reference matches imputation reference!

dbSNP Annotation

Benefits of Annotation:

Standardized variant IDs (rs numbers)
Easier result interpretation
Facilitates meta-analysis
Improves result sharing

Download dbSNP:

# GRCh38
wget https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.40.gz
wget https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.40.gz.tbi

# GRCh37
wget https://ftp.ncbi.nih.gov/snp/archive/b151/VCF/GCF_000001405.25.gz

Best Practices

Pre-Processing Checklist

Before running the VCF pipeline:

Verify File Integrity:

# Check VCF files
for vcf in chr*.vcf.gz; do
    bcftools view -H $vcf | head -1 || echo "Error in $vcf"
done

Check Sample Counts:

# Ensure consistent sample numbers
for vcf in chr*.vcf.gz; do
    echo "$vcf: $(bcftools query -l $vcf | wc -l) samples"
done

Inspect Imputation Quality:

# Check R² distribution
bcftools query -f '%INFO/R2\n' chr1.vcf.gz | \\
    python -c "import sys; import numpy as np; \\
    r2 = [float(x) for x in sys.stdin]; \\
    print(f'Mean: {np.mean(r2):.3f}'); \\
    print(f'Median: {np.median(r2):.3f}')"

Workflow Optimization

For Large Datasets (>500K variants per chromosome):

execute_params:
  r2_threshold: 0.5          # More stringent
  max_threads: 16            # Use more cores

settings:
  resources:
    max_memory: 64000        # 64GB
    max_threads: 16
  files:
    keep_intermediate: false  # Save disk space

For Small Datasets:

execute_params:
  r2_threshold: 0.3          # More lenient
  max_threads: 4

settings:
  files:
    keep_intermediate: true  # Keep for inspection

For Repeated Analysis:

Set keep_intermediate: true and recompute: false in init_params to avoid reprocessing.

Common Workflows

Workflow 1: Basic VCF Processing

Process and convert imputed VCF files:

pipeline:
  steps:
    - name: "process_vcf"
      enabled: true
      execute_params:
        r2_threshold: 0.3
        build: "38"

    - name: "plink_conversion"
      enabled: true
      execute_params:
        double_id: true

Workflow 2: High-Quality Variant Subset

Extract only high-confidence variants:

pipeline:
  steps:
    - name: "process_vcf"
      enabled: true
      execute_params:
        r2_threshold: 0.8      # Strict filtering
        build: "38"
        ref_annotation: "/data/dbSNP.vcf.gz"

Workflow 3: Update Phenotype Information

Merge with existing phenotype data:

pipeline:
  steps:
    - name: "process_vcf"
      enabled: true

    - name: "plink_conversion"
      enabled: true
      execute_params:
        for_fam_update_file: "/data/phenotypes.fam"
        double_id: true

Troubleshooting

Common Issues

Issue: “R2 field not found in VCF”

Solutions:

Check INFO field name (might be INFO, IMP, IMPINFO)
Verify VCF format with bcftools view -h file.vcf.gz
Update code to match your VCF’s R² field name

Issue: “Reference genome download fails”

Solutions:

Check internet connection
Download manually and provide path with ref_genome
Use alternative Ensembl mirror

Issue: “Memory error during concatenation”

Solutions:

Set max_memory explicitly in settings
Reduce max_threads (more memory per thread)
Process subsets of chromosomes separately
Close other applications

Issue: “Sample IDs don’t match phenotype file”

Solutions:

Verify sample ID format in VCF
Check FAM file sample ID format (FID and IID)
Use double_id: true if VCF has only one ID per sample
Manually create matching FAM file

Issue: “Missing variants after filtering”

Solutions:

Lower r2_threshold (e.g., from 0.8 to 0.3)
Check imputation quality of input files
Verify R² field is correctly parsed

Performance Issues

Slow processing:

Increase max_threads in execute_params
Use SSD storage for temporary files
Reduce r2_threshold to filter earlier
Process chromosomes in parallel manually

High memory usage:

Reduce max_threads (each thread uses memory)
Process chromosomes separately
Set explicit max_memory limit
Enable keep_intermediate: false

Output Validation

Verify Pipeline Success

# Check file integrity
plink2 --bfile imputed_plink --validate

# Verify sample count
wc -l imputed_plink.fam

# Check variant info
head imputed_plink.bim

Integration with QC Pipeline

After VCF processing, run the QC pipeline:

# Step 1: Process VCF
ideal-genom run --config vcf_pipeline.yaml

# Step 2: Run QC on converted PLINK files
# Update qc_pipeline.yaml to use processed data
ideal-genom run --config qc_pipeline.yaml

QC Configuration Update:

pipeline:
  steps:
    - name: "sample_qc"
      init_params:
        input_path: "/data/processed_imputation/plink_conversion"
        input_name: "imputed_plink"

External Resources

BCFtools Documentation: - Main page: http://samtools.github.io/bcftools/ - VCF filtering: http://samtools.github.io/bcftools/bcftools.html#view - Normalization: http://samtools.github.io/bcftools/bcftools.html#norm

PLINK Documentation: - PLINK 2.0: https://www.cog-genomics.org/plink/2.0/ - VCF import: https://www.cog-genomics.org/plink/2.0/input#vcf

Imputation Services: - TOPMed: https://imputation.biodatacatalyst.nhlbi.nih.gov/ - Michigan Imputation Server: https://imputationserver.sph.umich.edu/ - HRC: http://www.haplotype-reference-consortium.org/

VCF Format: - Specification: https://samtools.github.io/hts-specs/VCFv4.2.pdf - Best practices: https://www.internationalgenome.org/wiki/Analysis/vcf4.0/

VCF Processing Pipeline

Overview

Use Cases

Prerequisites

Quick Start

Pipeline Steps

Step 1: VCF Processing

Step 2: PLINK Conversion

Complete Workflow Example

Input Data Organization

Expected Directory Structure

VCF File Requirements

Quality Control Considerations

R² Threshold Selection

Reference Genome Selection

dbSNP Annotation

Best Practices

Pre-Processing Checklist

Workflow Optimization

Common Workflows

Workflow 1: Basic VCF Processing

Workflow 2: High-Quality Variant Subset

Workflow 3: Update Phenotype Information

Troubleshooting

Common Issues

Performance Issues

Output Validation

Verify Pipeline Success

Integration with QC Pipeline

See Also

External Resources