Frequently Asked Questions

This page answers common questions about using IDEAL-GENOM. If you don’t find your answer here, please check the Troubleshooting Guide or open an issue on GitHub.

General Questions

Q: What types of genomic data does IDEAL-GENOM support?

A: IDEAL-GENOM primarily works with human genomic data in PLINK binary format (.bed, .bim, .fam files) and VCF format. It supports:

  • SNP array data (e.g., Illumina, Affymetrix)

  • Imputed genotype data (VCF from TOPMed, Michigan Imputation Server)

  • Whole genome sequencing (WGS) data

  • Whole exome sequencing (WES) data

  • Targeted sequencing panels

The pipeline is optimized for autosomal chromosomes but can handle X and Y chromosomes with appropriate configuration.

Q: Which genome builds are supported?

A: IDEAL-GENOM supports both major human genome builds:

  • GRCh37/hg19 (use build: "37" in YAML config)

  • GRCh38/hg38 (use build: "38" in YAML config, recommended)

Reference files and LD regions are automatically adjusted based on the build you specify.

Q: Can I use IDEAL-GENOM for non-human data?

A: The current version is specifically designed for human genomic data. While the underlying algorithms could theoretically work with other species, the reference panels, LD regions, and ancestry databases are human-specific.

Installation and Setup

Q: Why do I need both PLINK 1.9 and PLINK 2.0?

A: Different QC steps require different PLINK versions:

  • PLINK 1.9: Core QC operations, kinship analysis, basic statistics

  • PLINK 2.0: Advanced features, faster VCF processing, improved performance for large datasets

The two tools complement each other, and both are required for full pipeline functionality.

Q: Can I install IDEAL-GENOM without admin privileges?

A: Yes! You can install IDEAL-GENOM in user space:

# Install to user directory
pip install --user ideal-genom-qc

# Or use a virtual environment
python -m venv ideal_qc_env
source ideal_qc_env/bin/activate
pip install ideal-genom-qc

Just ensure PLINK tools are available in your PATH or specify their location.
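
A quick way to confirm that the PLINK tools are visible on your PATH is sketched below. The binary names "plink" and "plink2" are assumptions; your installation may use different names.

```python
import shutil


def find_missing_tools(names):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # "plink" and "plink2" are assumed binary names; adjust to your install.
    missing = find_missing_tools(["plink", "plink2"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All PLINK binaries found.")
```

If a tool is reported as missing, either extend PATH or point the pipeline at the binary's location.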

Q: How do I install PLINK without admin privileges?

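A: Download the PLINK binaries and add them to your PATH. The example below installs PLINK 1.9; PLINK 2.0 can be installed the same way from its own download archive: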

# Create local bin directory
mkdir -p ~/local/bin

# Download and install PLINK 1.9
wget https://s3.amazonaws.com/plink1-assets/plink_linux_x86_64_20231211.zip
unzip plink_linux_x86_64_20231211.zip
mv plink ~/local/bin/

# Add to PATH
echo 'export PATH=$HOME/local/bin:$PATH' >> ~/.bashrc
source ~/.bashrc

Configuration and Parameters

Q: How do I choose the right QC parameters for my study?

A: Parameter selection depends on your study type:

Parameter Guidelines by Study Type

Parameter   Population Study   Case-Control   Rare Disease
mind        0.05-0.1           0.1-0.2        0.2-0.3
geno        0.05               0.1            0.2
maf         0.01-0.05          0.01           0.0-0.001
hwe         1e-6               5e-8           1e-10

Start with conservative values and relax them if you lose too many samples/variants.
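
The guideline table in this answer can also be kept as data, so that a chosen parameter set can be checked against it. This is a minimal sketch, not part of IDEAL-GENOM: the study-type keys and the function name are hypothetical, and the ranges are copied from the table.

```python
# Ranges copied from the guideline table: (low, high) per study type.
# The dictionary keys are hypothetical labels, not IDEAL-GENOM settings.
GUIDELINES = {
    "population": {
        "mind": (0.05, 0.1), "geno": (0.05, 0.05),
        "maf": (0.01, 0.05), "hwe": (1e-6, 1e-6),
    },
    "case_control": {
        "mind": (0.1, 0.2), "geno": (0.1, 0.1),
        "maf": (0.01, 0.01), "hwe": (5e-8, 5e-8),
    },
    "rare_disease": {
        "mind": (0.2, 0.3), "geno": (0.2, 0.2),
        "maf": (0.0, 0.001), "hwe": (1e-10, 1e-10),
    },
}


def outside_guidelines(study_type, params):
    """Return {name: (value, low, high)} for params outside the table's range."""
    ranges = GUIDELINES[study_type]
    flagged = {}
    for name, value in params.items():
        if name not in ranges:
            continue  # parameter not covered by the table
        low, high = ranges[name]
        if not (low <= value <= high):
            flagged[name] = (value, low, high)
    return flagged
```

A flagged parameter is not necessarily wrong; it only means the value departs from the guideline and deserves a second look.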

Q: What happens if I set parameters too strictly?

A: Overly strict parameters can lead to:

  • Excessive sample removal (>20% loss is concerning)

  • Loss of rare variants important for your analysis

  • Population bias if certain ethnic groups are disproportionately affected

  • Reduced statistical power

Monitor the QC plots and logs to ensure reasonable filtering.
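
As a rough aid when reading the logs, the sketch below computes the fraction of samples removed and compares it with the 20% level mentioned in this answer. The function names are illustrative, and the counts have to be taken from your own QC output.

```python
def removal_fraction(n_before, n_after):
    """Fraction of items removed between two QC stages."""
    if n_before <= 0:
        raise ValueError("n_before must be positive")
    return (n_before - n_after) / n_before


def check_sample_loss(n_before, n_after, warn_above=0.20):
    """Return (fraction_removed, is_concerning) for a QC step."""
    fraction = removal_fraction(n_before, n_after)
    return fraction, fraction > warn_above
```

For example, going from 1000 to 750 samples is a 25% loss, which exceeds the 20% level and would be flagged.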

Q: Can I run only specific QC steps?

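A: Yes! Set enabled to true or false for each step in the steps configuration file (passed as config/steps.json in the cluster example on this page). The layout below is shown in YAML notation: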

pipeline:
  steps:
    - name: "sample_qc"
      enabled: true
    - name: "ancestry_qc"
      enabled: false  # Skip ancestry QC
    - name: "variant_qc"
      enabled: true
    - name: "umap_plot"
      enabled: false  # Skip UMAP plotting

Note that some steps depend on others (e.g., variant QC needs ancestry QC results), so check that a step you disable is not required by one you keep enabled.
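
A dependency check of this kind can be sketched as follows. Only the link from variant QC to ancestry QC comes from this FAQ; the dependency map and the function name are assumptions, and the step list is expected as already-parsed dictionaries.

```python
# Assumed dependency map; only variant_qc -> ancestry_qc is stated in this FAQ.
DEPENDS_ON = {"variant_qc": ["ancestry_qc"]}


def unmet_dependencies(steps, depends_on=DEPENDS_ON):
    """Return (step, prerequisite) pairs where the prerequisite is disabled."""
    enabled = {s["name"] for s in steps if s.get("enabled", False)}
    problems = []
    for step in steps:
        name = step["name"]
        if name not in enabled:
            continue  # disabled steps cannot have unmet dependencies
        for prereq in depends_on.get(name, []):
            if prereq not in enabled:
                problems.append((name, prereq))
    return problems
```

Run against a configuration that enables variant_qc but disables ancestry_qc, this reports the pair ("variant_qc", "ancestry_qc").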

Data and File Formats

Q: How do I convert my data to PLINK format?

A: Common conversions:

# VCF to PLINK
plink --vcf mydata.vcf --make-bed --out mydata

# 23andMe format to PLINK
plink --23file mydata.txt --make-bed --out mydata

# PLINK text to binary
plink --file mydata --make-bed --out mydata

Q: What should I do if my .fam file doesn’t have phenotype information?

A: If your .fam file has missing phenotypes (all -9 or 0), you can:

  1. Add phenotype data manually:

# Create phenotype file (FID, IID, phenotype)
# 1=control, 2=case in PLINK format
echo "FAM1 SAMPLE1 1" > phenotypes.txt
echo "FAM2 SAMPLE2 2" >> phenotypes.txt

# Update .fam file
plink --bfile mydata --pheno phenotypes.txt --make-bed --out mydata_pheno

  2. Run without case-control specific steps:

{
    "umap_plot": {
        "case_control_marker": false
    }
}
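
Before choosing between these options, it can help to confirm that the phenotype column really is empty. The sketch below relies on the standard .fam layout, where the sixth whitespace-separated column holds the phenotype; the function name is illustrative.

```python
def phenotypes_all_missing(fam_lines, missing_codes=("-9", "0")):
    """Return True if every sample's phenotype (column 6) is a missing code."""
    seen_any = False
    for line in fam_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) < 6:
            raise ValueError(f"expected 6 columns, got {len(fields)}: {line!r}")
        seen_any = True
        if fields[5] not in missing_codes:
            return False
    return seen_any
```

Usage: open the file and pass the handle directly, for example `with open("mydata.fam") as fh: print(phenotypes_all_missing(fh))`.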

Q: My data has non-standard chromosome coding. How do I fix this?

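A: PLINK 1.9 already reads the numeric codes 23, 24, 25, and 26 as X, Y, XY, and MT in human data, so these need no conversion on input. To write letter codes in the output, use --output-chr:

# Write X, Y, XY, MT instead of 23-26
plink --bfile mydata --output-chr MT --make-bed --out mydata_fixed

For other non-standard codes, --update-chr reassigns chromosomes variant by variant:

plink --bfile mydata --update-chr update_chr.txt --make-bed --out mydata_fixed

Here update_chr.txt lists one variant per line, with the variant ID in the first column and its new chromosome code in the second. It is not a table of old-to-new chromosome codes.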
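
To see which chromosome codes a dataset actually contains, the first column of the .bim file can be tallied. This is a minimal sketch: the expected set (1-22 plus X, Y, XY, MT) is an assumption, so pass a different set if your tools accept other codes.

```python
from collections import Counter

# Assumed set of expected codes; adjust to what your tools accept.
EXPECTED = {str(i) for i in range(1, 23)} | {"X", "Y", "XY", "MT"}


def chromosome_counts(bim_lines):
    """Count variants per chromosome code (first column of a .bim file)."""
    counts = Counter()
    for line in bim_lines:
        fields = line.split()
        if fields:
            counts[fields[0]] += 1
    return counts


def unexpected_codes(counts, expected=EXPECTED):
    """Return the chromosome codes that are not in the expected set."""
    return sorted(code for code in counts if code not in expected)
```

Usage: `with open("mydata.bim") as fh: print(unexpected_codes(chromosome_counts(fh)))`.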

Performance and Memory

Q: My analysis is running very slowly. How can I speed it up?

A: Several optimization strategies:

  1. Use faster storage: SSD instead of HDD

  2. Increase memory: Add more RAM if possible

  3. Parallel processing: Use multi-core systems

  4. Reduce data size: Filter variants/samples beforehand

  5. Optimize parameters: Larger LD pruning windows, fewer PCs

Q: I’m getting “out of memory” errors. What should I do?

A: Reduce memory use by adjusting the run parameters:

pipeline:
  steps:
    - name: "sample_qc"
      run_params:
        ind_pair: [200, 50, 0.2]  # Larger LD windows
        chunk_size: 5000          # Process in smaller chunks
    - name: "ancestry_qc"
      run_params:
        pca: 5        # Fewer PCs
        maf: 0.05     # Higher MAF threshold

Or process chromosomes separately:

# Split by chromosome first
for chr in {1..22}; do
    plink --bfile mydata --chr $chr --make-bed --out chr${chr}_data
done

Q: How much disk space do I need?

A: Disk space requirements depend on your dataset size:

Disk Space Requirements

Dataset                          Input Size   Temp Files   Total Needed
Small (1K samples, 100K SNPs)    100MB        500MB        1GB
Medium (10K samples, 1M SNPs)    1GB          5GB          10GB
Large (100K samples, 5M SNPs)    10GB         50GB         100GB
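
In the table, the total needed is about ten times the input size. Under that assumption, the sketch below checks free space before a run; the factor of 10 is read off the table and is not a guarantee, and the function names are illustrative.

```python
import shutil


def estimated_total_bytes(input_bytes, factor=10):
    """Estimate total disk need as a multiple of the input size."""
    return input_bytes * factor


def has_enough_space(input_bytes, path=".", factor=10):
    """Compare free space at `path` with the estimated requirement."""
    free = shutil.disk_usage(path).free
    return free >= estimated_total_bytes(input_bytes, factor)
```

For a 1GB input, `estimated_total_bytes(10**9)` gives 10GB, matching the "Medium" row.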

Results and Interpretation

Q: How do I interpret the QC plots?

A: Key plots to examine:

  1. Heterozygosity plot: Should show an approximately normal distribution; outliers indicate DNA quality issues

  2. Missing data plot: Should show most samples with <10% missing data

  3. PCA plot: Should show clear population clusters

  4. Kinship plot: Should identify related individuals
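
For the heterozygosity plot, outliers are often defined as samples more than a few standard deviations from the mean. The sketch below uses 3 standard deviations, which is a common convention and an assumption here, not necessarily the rule IDEAL-GENOM applies.

```python
import statistics


def heterozygosity_outliers(rates, n_sd=3.0):
    """Return sample IDs whose rate lies more than n_sd SDs from the mean.

    `rates` maps sample ID -> heterozygosity rate; needs at least 2 samples.
    """
    values = list(rates.values())
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    low, high = mean - n_sd * sd, mean + n_sd * sd
    return sorted(s for s, r in rates.items() if r < low or r > high)
```

Because extreme samples inflate the standard deviation, this simple rule can miss outliers in very small datasets.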

Q: What constitutes “normal” QC results?

A: Typical expectations:

  • Sample removal: 5-15% of samples failing QC

  • Variant removal: 10-30% of variants failing QC

  • Population outliers: 1-5% of samples (depends on population)

  • Related individuals: Variable (0-10% depending on study design)

Q: Should I be concerned if many samples fail ancestry QC?

A: High ancestry failure rates could indicate:

  1. Wrong reference population: Check your reference_pop setting

  2. Population admixture: Use more lenient thresholds or “ALL” reference

  3. Technical issues: Check for batch effects or DNA quality problems

  4. Study design: Expected in multi-ethnic studies

Quality Control Interpretation

Q: A sample failed multiple QC steps. Should I remove it?

A: Generally yes, especially when it shows one of these combinations:

  • High missing data + ancestry outlier

  • Sex discordance + high heterozygosity

  • Multiple relatedness flags

  • Technical replicates with poor concordance

Q: Can I recover samples that failed QC?

A: Sometimes. Options include:

  1. Relaxing thresholds: If loss is excessive

  2. Investigating causes: Address systematic issues

  3. Manual review: Check borderline cases individually

  4. Batch correction: If batch effects are detected

Q: How do I handle related individuals?

A: Strategies for related samples:

  1. Remove one from each pair: Keep the individual with the higher call rate

  2. Family-based analysis: Use appropriate statistical methods

  3. Clustering approach: Remove minimal set to break all relationships

  4. Separate analysis: Analyze related/unrelated separately
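
The pair-removal and clustering strategies can be combined in a greedy sketch: repeatedly drop the sample with the most remaining relatives, breaking ties by lower call rate. This is a heuristic that does not guarantee the smallest possible removal set, and it is not presented as the pipeline's own method.

```python
from collections import defaultdict


def samples_to_remove(related_pairs, call_rate):
    """Greedily pick samples to drop until no related pair remains.

    related_pairs: iterable of (sample_a, sample_b) tuples.
    call_rate: dict mapping sample ID -> call rate (missing IDs count as 0.0).
    """
    neighbours = defaultdict(set)
    for a, b in related_pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    removed = []
    while True:
        candidates = [s for s in neighbours if neighbours[s]]
        if not candidates:
            break
        # Most relatives first; ties go to the lower call rate, then the name.
        worst = min(
            candidates,
            key=lambda s: (-len(neighbours[s]), call_rate.get(s, 0.0), s),
        )
        for other in neighbours.pop(worst):
            neighbours[other].discard(worst)
        removed.append(worst)
    return removed
```

In a trio where B is related to both A and C, only B is removed, so two samples are kept instead of one.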

Technical Issues

Q: The pipeline crashed with a PLINK error. What should I do?

A: Work through these checks:

  1. Check file formats: Ensure files are not corrupted

  2. Verify file paths: Use absolute paths in configuration

  3. Check disk space: Ensure sufficient space for temp files

  4. Update PLINK: Use latest versions

  5. Check logs: Look for specific error messages
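
The log check can be partly automated. The sketch below assumes that PLINK prefixes its messages with "Error:" and "Warning:"; it captures only the first line of each message, so wrapped messages are truncated.

```python
def log_problems(log_text):
    """Split a PLINK log into lists of error and warning lines."""
    errors, warnings = [], []
    for line in log_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Error:"):
            errors.append(stripped)
        elif stripped.startswith("Warning:"):
            warnings.append(stripped)
    return errors, warnings
```

Usage: `errors, warnings = log_problems(open("mydata.log").read())`, where the log path depends on your --out prefix.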

Q: Reference data download failed. How do I fix this?

A: Try triggering the download manually:

from ideal_genom_qc.get_references import FetcherReference

# Manually download reference data
fetcher = FetcherReference(built="38")
fetcher.download_references(force_redownload=True)

Or provide your own reference files in the configuration.

Q: Can I run IDEAL-GENOM-QC on a cluster/HPC system?

A: Yes! Example SLURM script:

#!/bin/bash
#SBATCH --job-name=ideal_qc
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=24:00:00

module load python/3.9
module load plink/1.9

python -m ideal_genom_qc \
    --path_params config/parameters.json \
    --file_folders config/paths.json \
    --steps config/steps.json \
    --built 38

Contributing and Support

Q: I found a bug. How do I report it?

A: Please report bugs on our GitHub Issues page with:

  1. Complete error message

  2. Configuration files used

  3. System information (OS, Python version, PLINK versions)

  4. Steps to reproduce the issue
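
The system information for a report can be gathered with a short script. This is a sketch under assumptions: the binary names "plink" and "plink2" may differ on your system, and each tool is queried with its --version flag.

```python
import platform
import shutil
import subprocess


def tool_version(binary):
    """Return the output of `<binary> --version`, or a short failure note."""
    path = shutil.which(binary)
    if path is None:
        return "not found on PATH"
    try:
        result = subprocess.run(
            [path, "--version"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"could not run: {exc}"
    return (result.stdout or result.stderr).strip()


def system_report(binaries=("plink", "plink2")):
    """Collect OS, Python, and tool versions for a bug report."""
    report = {"os": platform.platform(), "python": platform.python_version()}
    for binary in binaries:
        report[binary] = tool_version(binary)  # assumed binary names
    return report


if __name__ == "__main__":
    for key, value in system_report().items():
        print(f"{key}: {value}")
```

Paste the printed lines into the issue together with the error message and configuration files.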

Q: Can I contribute to IDEAL-GENOM-QC development?

A: Absolutely! We welcome contributions:

  1. Bug fixes: Submit pull requests for any bugs you fix

  2. New features: Propose enhancements via GitHub issues first

  3. Documentation: Help improve documentation and examples

  4. Testing: Report issues with different data types/systems

See our Contributing Guide for details.

Q: Is commercial use allowed?

A: Yes, IDEAL-GENOM-QC is open source under the MIT license, allowing commercial use. Please review the license terms in the repository for full details.