Frequently Asked Questions

This page answers common questions about using IDEAL-GENOM. If you don’t find your answer here, please check the Troubleshooting Guide guide or open an issue on GitHub.

General Questions

Q: What types of genomic data does IDEAL-GENOM support?

A: IDEAL-GENOM primarily works with human genomic data in PLINK binary format (.bed, .bim, .fam files) and VCF format. It supports:

SNP array data (e.g., Illumina, Affymetrix)
Imputed genotype data (VCF from TOPMed, Michigan Imputation Server)
Whole genome sequencing (WGS) data
Whole exome sequencing (WES) data
Targeted sequencing panels

The pipeline is optimized for autosomal chromosomes but can handle X and Y chromosomes with appropriate configuration.

Q: Which genome builds are supported?

A: IDEAL-GENOM supports both major human genome builds:

GRCh37/hg19 (use build: "37" in YAML config)
GRCh38/hg38 (use build: "38" in YAML config, recommended)

Reference files and LD regions are automatically adjusted based on the build you specify.

Q: Can I use IDEAL-GENOM for non-human data?

A: The current version is specifically designed for human genomic data. While the underlying algorithms could theoretically work with other species, the reference panels, LD regions, and ancestry databases are human-specific.

Installation and Setup

Q: Why do I need both PLINK 1.9 and PLINK 2.0?

A: Different QC steps require different PLINK versions:

PLINK 1.9: Core QC operations, kinship analysis, basic statistics
PLINK 2.0: Advanced features, faster VCF processing, improved performance for large datasets

Both tools complement each other and are required for the full pipeline functionality.

Q: Can I install IDEAL-GENOM without admin privileges?

A: Yes! You can install IDEAL-GENOM in user space:

# Install to user directory
pip install --user ideal-genom-qc

# Or use a virtual environment
python -m venv ideal_qc_env
source ideal_qc_env/bin/activate
pip install ideal-genom-qc

Just ensure PLINK tools are available in your PATH or specify their location.

Q: How do I install PLINK without admin privileges?

A: Download PLINK binaries and add them to your PATH:

# Create local bin directory
mkdir -p ~/local/bin

# Download and install PLINK 1.9
wget https://s3.amazonaws.com/plink1-assets/plink_linux_x86_64_20231211.zip
unzip plink_linux_x86_64_20231211.zip
mv plink ~/local/bin/

# Add to PATH
echo 'export PATH=$HOME/local/bin:$PATH' >> ~/.bashrc
source ~/.bashrc

Configuration and Parameters

Q: How do I choose the right QC parameters for my study?

A: Parameter selection depends on your study type:

Parameter Guidelines by Study Type
Parameter	Population Study	Case-Control	Rare Disease
mind	0.05-0.1	0.1-0.2	0.2-0.3
geno	0.05	0.1	0.2
maf	0.01-0.05	0.01	0.0-0.001
hwe	1e-6	5e-8	1e-10

Start with conservative values and relax them if you lose too many samples/variants.

Q: What happens if I set parameters too strictly?

A: Overly strict parameters can lead to:

Excessive sample removal (>20% loss is concerning)
Loss of rare variants important for your analysis
Population bias if certain ethnic groups are disproportionately affected
Reduced statistical power

Monitor the QC plots and logs to ensure reasonable filtering.

Q: Can I run only specific QC steps?

A: Yes! Use the steps.json configuration file:

pipeline:
  steps:
    - name: "sample_qc"
      enabled: true
    - name: "ancestry_qc"
      enabled: false  # Skip ancestry QC
    - name: "variant_qc"
      enabled: true
    - name: "umap_plot"
      enabled: false  # Skip UMAP plotting

Note that some steps depend on others (e.g., variant QC needs ancestry QC results).

Data and File Formats

Q: How do I convert my data to PLINK format?

A: Common conversions:

# VCF to PLINK
plink --vcf mydata.vcf --make-bed --out mydata

# 23andMe format to PLINK
plink --23file mydata.txt --make-bed --out mydata

# PLINK text to binary
plink --file mydata --make-bed --out mydata

Q: What should I do if my .fam file doesn’t have phenotype information?

A: If your .fam file has missing phenotypes (all -9 or 0), you can:

Add phenotype data manually:

# Create phenotype file (FID, IID, phenotype)
# 1=control, 2=case in PLINK format
echo "FAM1 SAMPLE1 1" > phenotypes.txt
echo "FAM2 SAMPLE2 2" >> phenotypes.txt

# Update .fam file
plink --bfile mydata --pheno phenotypes.txt --make-bed --out mydata_pheno

Run without case-control specific steps:

{
    "umap_plot": {
        "case_control_marker": false
    }
}

Q: My data has non-standard chromosome coding. How do I fix this?

A: PLINK expects standard chromosome codes (1-22, X, Y). Convert non-standard coding:

# If using 23=X, 24=Y, 25=XY, 26=MT
plink --bfile mydata --update-chr update_chr.txt --make-bed --out mydata_fixed

Where update_chr.txt contains mappings like:

X
Y
XY
MT

Performance and Memory

Q: My analysis is running very slowly. How can I speed it up?

A: Several optimization strategies:

Use faster storage: SSD instead of HDD
Increase memory: Add more RAM if possible
Parallel processing: Use multi-core systems
Reduce data size: Filter variants/samples beforehand
Optimize parameters: Larger LD pruning windows, fewer PCs

Q: I’m getting “out of memory” errors. What should I do?

A: Memory issues can be addressed by:

pipeline:
  steps:
    - name: "sample_qc"
      run_params:
        ind_pair: [200, 50, 0.2]  # Larger LD windows
        chunk_size: 5000          # Process in smaller chunks
    - name: "ancestry_qc"
      run_params:
        pca: 5        # Fewer PCs
        maf: 0.05     # Higher MAF threshold

Or process chromosomes separately:

# Split by chromosome first
for chr in {1..22}; do
    plink --bfile mydata --chr $chr --make-bed --out chr${chr}_data
done

Q: How much disk space do I need?

A: Disk space requirements depend on your dataset size:

Disk Space Requirements
Dataset	Input Size	Temp Files	Total Needed
Small (1K samples, 100K SNPs)	100MB	500MB	1GB
Medium (10K samples, 1M SNPs)	1GB	5GB	10GB
Large (100K samples, 5M SNPs)	10GB	50GB	100GB

Results and Interpretation

Q: How do I interpret the QC plots?

A: Key plots to examine:

Heterozygosity plot: Should show normal distribution; outliers indicate DNA quality issues
Missing data plot: Should show most samples with <10% missing data
PCA plot: Should show clear population clusters
Kinship plot: Should identify related individuals

Q: What constitutes “normal” QC results?

A: Typical expectations:

Sample removal: 5-15% of samples failing QC
Variant removal: 10-30% of variants failing QC
Population outliers: 1-5% of samples (depends on population)
Related individuals: Variable (0-10% depending on study design)

Q: Should I be concerned if many samples fail ancestry QC?

A: High ancestry failure rates could indicate:

Wrong reference population: Check your reference_pop setting
Population admixture: Use more lenient thresholds or “ALL” reference
Technical issues: Check for batch effects or DNA quality problems
Study design: Expected in multi-ethnic studies

Quality Control Interpretation

Q: A sample failed multiple QC steps. Should I remove it?

A: Generally yes, especially if it failed:

High missing data + ancestry outlier
Sex discordance + high heterozygosity
Multiple relatedness flags
Technical replicates with poor concordance

Q: Can I recover samples that failed QC?

A: Sometimes. Options include:

Relaxing thresholds: If loss is excessive
Investigating causes: Address systematic issues
Manual review: Check borderline cases individually
Batch correction: If batch effects are detected

Q: How do I handle related individuals?

A: Strategies for related samples:

Remove one from each pair: Keep higher call rate individual
Family-based analysis: Use appropriate statistical methods
Clustering approach: Remove minimal set to break all relationships
Separate analysis: Analyze related/unrelated separately

Technical Issues

Q: The pipeline crashed with a PLINK error. What should I do?

A: Common PLINK issues:

Check file formats: Ensure files are not corrupted
Verify file paths: Use absolute paths in configuration
Check disk space: Ensure sufficient space for temp files
Update PLINK: Use latest versions
Check logs: Look for specific error messages

Q: Reference data download failed. How do I fix this?

A: Reference data issues:

from ideal_genom_qc.get_references import FetcherReference

# Manually download reference data
fetcher = FetcherReference(built="38")
fetcher.download_references(force_redownload=True)

Or provide your own reference files in the configuration.

Q: Can I run IDEAL-GENOM-QC on a cluster/HPC system?

A: Yes! Example SLURM script:

#!/bin/bash
#SBATCH --job-name=ideal_qc
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=24:00:00

module load python/3.9
module load plink/1.9

python -m ideal_genom_qc \\
    --path_params config/parameters.json \\
    --file_folders config/paths.json \\
    --steps config/steps.json \\
    --built 38

Contributing and Support

Q: I found a bug. How do I report it?

A: Please report bugs on our GitHub Issues page with:

Complete error message
Configuration files used
System information (OS, Python version, PLINK versions)
Steps to reproduce the issue

Q: Can I contribute to IDEAL-GENOM-QC development?

A: Absolutely! We welcome contributions:

Bug fixes: Submit pull requests for any bugs you fix
New features: Propose enhancements via GitHub issues first
Documentation: Help improve documentation and examples
Testing: Report issues with different data types/systems

See our Contributing Guide guide for details.

Q: Is commercial use allowed?

A: Yes, IDEAL-GENOM-QC is open source under the MIT license, allowing commercial use. Please review the license terms in the repository for full details.