Troubleshooting Guide

This guide helps you diagnose and resolve common issues when using IDEAL-GENOM-QC. Issues are organized by category for easier navigation.

Installation Issues

PLINK Not Found

Error: plink: command not found or plink2: command not found

Solution:

Check if PLINK is installed:

which plink
which plink2

Install PLINK if missing:

# Download PLINK 1.9
wget https://s3.amazonaws.com/plink1-assets/plink_linux_x86_64_20231211.zip
unzip plink_linux_x86_64_20231211.zip
sudo mv plink /usr/local/bin/

# Download PLINK 2.0
wget https://s3.amazonaws.com/plink2-assets/alpha5/plink2_linux_x86_64_20231212.zip
unzip plink2_linux_x86_64_20231212.zip
sudo mv plink2 /usr/local/bin/

Add to PATH if installed elsewhere:

export PATH=/path/to/plink:$PATH
# Add to ~/.bashrc for persistence
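
The same PATH check can be run from Python, which helps confirm that the interpreter running the pipeline sees the same executables as your shell. This is a minimal sketch using only the standard library; `find_plink_binaries` is a hypothetical helper name, not part of IDEAL-GENOM-QC.

```python
import shutil


def find_plink_binaries(names=("plink", "plink2")):
    """Map each executable name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


# Report what the current Python process can see
for name, path in find_plink_binaries().items():
    print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

If a binary is found in the shell but not here, the Python process is likely running with a different PATH (for example inside a scheduler job or an IDE).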

Permission Denied Errors

Error: Permission denied when installing or running

Solutions:

# Install to user directory
pip install --user ideal-genom-qc

# Or use virtual environment
python -m venv qc_env
source qc_env/bin/activate
pip install ideal-genom-qc

# Fix file permissions
chmod +x /path/to/plink

Python Module Import Errors

Error: ModuleNotFoundError: No module named 'ideal_genom_qc'

Solutions:

Check installation:

pip list | grep ideal
python -c "import ideal_genom_qc; print(ideal_genom_qc.__version__)"

Reinstall if needed:

pip uninstall ideal-genom-qc
pip install ideal-genom-qc

Check Python environment:

which python
which pip
# Ensure they point to the same environment

Configuration Issues

JSON Syntax Errors

Error: JSONDecodeError: Expecting ',' delimiter

Solution: Validate your JSON files:

# Check JSON syntax
python -m json.tool configFiles/parameters.json
python -m json.tool configFiles/paths.json
python -m json.tool configFiles/steps.json

Common JSON mistakes:

Missing commas between elements
Trailing commas after last element
Unescaped quotes in strings
Comments (not allowed in JSON)
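
To locate the offending line quickly, the position reported by Python's `json` module can be printed directly. The sketch below is illustrative only; `describe_json_error` is a hypothetical helper, and the sample string simply reproduces the trailing-comma mistake listed above.

```python
import json


def describe_json_error(text):
    """Return None if text is valid JSON, else a 'line N, column M: message' string."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None


# A trailing comma after the last element is rejected by the parser
print(describe_json_error('{"mind": 0.2,}'))
```

To check a real configuration file, pass it the file contents, for example `describe_json_error(open("configFiles/parameters.json").read())`.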

File Path Issues

Error: FileNotFoundError: [Errno 2] No such file or directory

Solutions:

Use absolute paths:

{
    "input_directory": "/full/path/to/inputData",
    "output_directory": "/full/path/to/outputData"
}

Check file permissions:

ls -la /path/to/your/files
# Ensure read/write permissions

Verify file existence:

# Check if input files exist
ls inputData/mydata.bed
ls inputData/mydata.bim
ls inputData/mydata.fam
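
The three existence checks above can be combined into one step. This is a small sketch, assuming the input is a PLINK binary fileset sharing one prefix; `missing_plink_files` is a hypothetical helper name.

```python
from pathlib import Path


def missing_plink_files(prefix):
    """Return the .bed/.bim/.fam paths for `prefix` that are missing or empty."""
    problems = []
    for ext in (".bed", ".bim", ".fam"):
        path = Path(f"{prefix}{ext}")
        if not path.is_file() or path.stat().st_size == 0:
            problems.append(str(path))
    return problems


# An empty list means all three files exist and are non-empty
print(missing_plink_files("inputData/mydata"))
```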

Invalid Parameter Values

Error: ValueError: Parameter 'mind' must be between 0 and 1

Solution: Check parameter ranges:

{
    "sample_qc": {
        "mind": 0.2,
        "maf": 0.01,
        "hwe": 5e-8,
        "sex_check": [0.2, 0.8]
    }
}

Parameter validation checklist:

mind, geno: 0.0 to 1.0
maf: 0.0 to 0.5
hwe: > 0 (p-value threshold)
sex_check: [female_threshold, male_threshold] where female < male
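
The checklist above can be applied to a parameter block before launching the pipeline. The function below is a sketch of those range checks only; it is not the package's own validation code, `validate_qc_params` is a hypothetical name, and the keys are the ones shown in this guide.

```python
def validate_qc_params(params):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    # mind and geno are missingness rates, so they must lie in [0, 1]
    for key in ("mind", "geno"):
        if key in params and not 0.0 <= params[key] <= 1.0:
            problems.append(f"{key} must be between 0 and 1, got {params[key]}")
    # A minor allele frequency cannot exceed 0.5
    if "maf" in params and not 0.0 <= params["maf"] <= 0.5:
        problems.append(f"maf must be between 0 and 0.5, got {params['maf']}")
    # hwe is a p-value threshold and must be positive
    if "hwe" in params and not params["hwe"] > 0:
        problems.append(f"hwe must be > 0, got {params['hwe']}")
    # sex_check is [female_threshold, male_threshold] with female < male
    if "sex_check" in params:
        thresholds = params["sex_check"]
        if len(thresholds) != 2 or not thresholds[0] < thresholds[1]:
            problems.append(f"sex_check must be [female, male] with female < male, got {thresholds}")
    return problems


print(validate_qc_params({"mind": 0.2, "maf": 0.01, "hwe": 5e-8, "sex_check": [0.2, 0.8]}))
```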

Data Format Issues

Corrupted PLINK Files

Error: Error: Invalid .bed file or Error: .fam file has wrong number of columns

Solutions:

Validate PLINK files:

# Check file integrity
plink --bfile inputData/mydata --freq --out test_freq

# Check file formats
head inputData/mydata.fam  # Should have 6 columns
head inputData/mydata.bim  # Should have 6 columns
file inputData/mydata.bed  # Should be binary

Regenerate binary files:

# From PLINK text format
plink --file inputData/mydata --make-bed --out inputData/mydata_fixed

# From VCF
plink --vcf inputData/mydata.vcf --make-bed --out inputData/mydata
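
Before regenerating files, a structural check without PLINK can narrow down which file is damaged. The sketch below assumes a PLINK 1 binary fileset in the default variant-major layout, where the .bed file starts with the bytes 0x6c 0x1b 0x01 and stores two bits per sample for each variant. The helper names are hypothetical.

```python
from pathlib import Path

# First three bytes of a PLINK 1 binary .bed file in variant-major mode
BED_MAGIC = bytes([0x6C, 0x1B, 0x01])


def count_rows(path, expected_columns, problems):
    """Count non-blank rows, recording any row with the wrong number of columns."""
    rows = 0
    with open(path) as fh:
        for lineno, line in enumerate(fh, start=1):
            fields = line.split()
            if not fields:
                continue
            rows += 1
            if len(fields) != expected_columns:
                problems.append(
                    f"{path}:{lineno}: expected {expected_columns} columns, found {len(fields)}"
                )
    return rows


def check_bed_fileset(prefix):
    """Return a list of problems found in prefix.bed/.bim/.fam (empty if none)."""
    problems = []
    n_samples = count_rows(f"{prefix}.fam", 6, problems)
    n_variants = count_rows(f"{prefix}.bim", 6, problems)
    bed = Path(f"{prefix}.bed")
    with bed.open("rb") as fh:
        if fh.read(3) != BED_MAGIC:
            problems.append(f"{bed}: missing PLINK .bed magic bytes")
    # Each variant uses 2 bits per sample, padded up to a whole byte
    expected_size = 3 + n_variants * ((n_samples + 3) // 4)
    actual_size = bed.stat().st_size
    if actual_size != expected_size:
        problems.append(f"{bed}: size is {actual_size} bytes, expected {expected_size}")
    return problems
```

Calling `check_bed_fileset("inputData/mydata")` returns an empty list when the three files are mutually consistent. A size mismatch usually points to a truncated copy or to .bim/.fam files that belong to a different dataset.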

Chromosome Encoding Issues

Error: Error: Unrecognized chromosome code

Solution: Standardize chromosome codes:

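# --update-chr expects one variant per line: variant ID, then new chromosome code.
# Build that file from the .bim, mapping numeric codes 23-26 to X, Y, XY, MT.
awk 'BEGIN { m["23"]="X"; m["24"]="Y"; m["25"]="XY"; m["26"]="MT" }
     ($1 in m) { print $2, m[$1] }' inputData/mydata.bim > update_chr.txt

# Update chromosome codes
plink --bfile inputData/mydata --update-chr update_chr.txt --make-bed --out inputData/mydata_fixed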

Missing Phenotype Data

Error: Warning: No phenotype data available

Solution: Add phenotype information:

# Create phenotype file (FID, IID, phenotype)
# 1=control, 2=case, -9=missing
# This placeholder marks every sample as a control; replace the third column with real values
awk '{print $1, $2, "1"}' inputData/mydata.fam > phenotypes.txt

# Update phenotypes
plink --bfile inputData/mydata --pheno phenotypes.txt --make-bed --out inputData/mydata_pheno

Runtime Issues

Memory Errors

Error: MemoryError or Killed (out of memory)

Solutions:

Reduce memory usage by pruning with larger LD windows (ind_pair), processing in chunks (chunk_size), computing fewer PCs (pca), and applying a higher MAF filter (maf):

{
    "sample_qc": {
        "ind_pair": [200, 50, 0.2],
        "chunk_size": 5000
    },
    "ancestry_qc": {
        "pca": 5,
        "maf": 0.05
    }
}

Process chromosomes separately:

# Split by chromosome
for chr in {1..22}; do
    plink --bfile inputData/mydata --chr $chr --make-bed --out chr${chr}_data
done

Monitor memory usage:

# Check available memory
free -h

# Monitor during execution
# pgrep -d, joins multiple PIDs with commas, as top -p expects
top -p "$(pgrep -d, -f ideal_genom_qc)"

Disk Space Issues

Error: OSError: [Errno 28] No space left on device

Solutions:

Check disk space:

df -h .
du -sh outputData/

Clean temporary files:

# Remove temporary PLINK files
find . -name "*.tmp" -delete
find . -name "plink.log" -delete
find . -name "*.nosex" -delete

Use different output directory:

{
    "output_directory": "/path/to/larger/disk/outputData"
}
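
A free-space check can also be done programmatically before a long run starts. This is a minimal sketch using the standard library; `has_free_space` is a hypothetical helper, and the 50 GB figure is only an illustrative threshold, not a documented requirement of the pipeline.

```python
import shutil


def has_free_space(path, required_gb):
    """Return (enough, free_gb) for the filesystem that contains `path`."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return free_gb >= required_gb, free_gb


# Illustrative threshold only; size it to your own dataset
enough, free_gb = has_free_space(".", required_gb=50)
print(f"{free_gb:.1f} GB free; enough for this run: {enough}")
```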

Long Runtime Issues

Issue: Pipeline takes much longer than expected

Solutions:

Check system resources:

# CPU usage
htop

# I/O wait
iostat -x 1

# Check for bottlenecks
iotop

Optimize parameters: adjust ind_pair for faster LD pruning, and set use_kingship to false to skip that step if it is not needed:

{
    "sample_qc": {
        "ind_pair": [100, 25, 0.3],
        "use_kingship": false
    }
}

Use SSD storage: Move data to faster storage if possible

Output and Results Issues

Missing Output Files

Issue: Expected output files are not generated

Solutions:

Check pipeline logs:

# Look for error messages
grep -i error outputData/*.log
grep -i warning outputData/*.log

Verify step completion:

# Check if steps completed
ls outputData/*/clean_files/
ls outputData/*/fail_samples/

Re-run specific steps by setting completed steps to false and the failed step to true:

{
    "ancestry": false,
    "sample": false,
    "variant": true,
    "umap": true
}
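
When there are many log files, the grep commands above can be replaced by a scan that also reports line numbers. This sketch assumes the logs are plain-text `*.log` files directly inside the output directory, as in the grep examples; `scan_logs` is a hypothetical helper.

```python
import re
from pathlib import Path


def scan_logs(log_dir, pattern=r"error|warning"):
    """Yield (file name, line number, text) for log lines matching `pattern`, ignoring case."""
    if not Path(log_dir).is_dir():
        return
    regex = re.compile(pattern, re.IGNORECASE)
    for log_file in sorted(Path(log_dir).glob("*.log")):
        with log_file.open(errors="replace") as fh:
            for lineno, line in enumerate(fh, start=1):
                if regex.search(line):
                    yield log_file.name, lineno, line.rstrip()


for name, lineno, text in scan_logs("outputData"):
    print(f"{name}:{lineno}: {text}")
```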

Empty or Invalid Results

Issue: Output files exist but are empty or contain unexpected results

Solutions:

Check input data quality:

# Basic statistics
plink --bfile inputData/mydata --freq --missing --out data_check

# Check sample sizes
wc -l inputData/mydata.fam
wc -l outputData/*/clean_files/*.fam

Review QC thresholds:

# Check how many samples/variants were removed
grep -i "removed" outputData/*.log

Visualize intermediate results:

import pandas as pd
import matplotlib.pyplot as plt

# Load and plot QC metrics
metrics = pd.read_csv("outputData/sample_qc_results/qc_metrics.txt", sep="\t")
metrics.hist(figsize=(12, 8))
plt.show()
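
The sample-size comparison above can be turned into a single retention figure, which makes overly strict thresholds easier to spot. This is a sketch that counts non-blank lines, so it works for .fam (samples) and .bim (variants) alike; the helper names are hypothetical.

```python
def count_records(path):
    """Count non-blank lines in a whitespace-delimited PLINK text file."""
    with open(path) as fh:
        return sum(1 for line in fh if line.strip())


def retention(before_path, after_path):
    """Return the fraction of records that survived QC (0.0 if the input is empty)."""
    before = count_records(before_path)
    after = count_records(after_path)
    return after / before if before else 0.0
```

For example, `retention("inputData/mydata.fam", cleaned_fam)` with `cleaned_fam` pointing at the .fam file under `clean_files/` gives the share of samples kept. A value near zero suggests a threshold is removing almost everything.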

Plotting and Visualization Issues

Missing Plots

Issue: QC plots are not generated

Solutions:

Check plotting dependencies:

python -c "import matplotlib, seaborn, pandas; print('All plotting modules available')"

Check output directories:

ls outputData/*/plots/
ls outputData/*_plots/

Generate plots manually:

from ideal_genom_qc import UMAPplot

plotter = UMAPplot(
    input_path="outputData/ancestry_results/clean_files",
    input_name="clean_data",
    output_path="outputData/manual_plots"
)
plotter.create_umap_plots()

Plot Display Issues

Issue: Plots are generated but not displaying correctly

Solutions:

Check image formats:

file outputData/*/plots/*.png
# Should show valid PNG files

Convert formats if needed:

# Convert to a different format (requires ImageMagick)
for img in outputData/*/plots/*.png; do
    convert "$img" "${img%.png}.pdf"
done

Check plotting backend:

import matplotlib
print(matplotlib.get_backend())

# Set a non-interactive backend on headless systems (ideally before importing matplotlib.pyplot)
matplotlib.use('Agg')

Network and Download Issues

Reference Data Download Failures

Error: ConnectionError or TimeoutError when downloading reference data

Solutions:

Check internet connection:

ping -c 4 google.com
curl -I https://github.com

Manual download:

from ideal_genom_qc.get_references import FetcherReference

fetcher = FetcherReference(built="38")
fetcher.download_references(
    force_redownload=True,
    timeout=300  # Increase timeout
)

Use local reference files:

{
    "high_ld_file": "/path/to/local/high-LD-regions.txt"
}

Proxy or Firewall Issues

Error: Download fails due to network restrictions

Solutions:

Configure proxy:

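export http_proxy=http://proxy.company.com:8080
# Many proxies are reached over plain HTTP even for HTTPS traffic; confirm the scheme with your network administrator
export https_proxy=http://proxy.company.com:8080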

Download manually: Get reference files from the GitHub repository and place them locally

Performance Optimization

Slow Performance Debugging

Issue: Pipeline runs slower than expected

Debugging steps:

Profile system resources:

# CPU and memory usage
htop

# Disk I/O
iotop -a

# Network usage (if downloading references)
nethogs

Identify bottlenecks:

import cProfile
import ideal_genom_qc

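# Profile the QC pipeline
# The output file is in binary pstats format (despite the .txt name); read it with the pstats module
cProfile.run('ideal_genom_qc.main()', 'profile_output.txt')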

Optimize based on bottleneck:

CPU bound: Use fewer PCs, larger LD windows
Memory bound: Process in chunks, reduce dataset size
I/O bound: Use SSD, reduce intermediate file writes
Network bound: Download references once, use local files
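
The profile written by cProfile is a binary file, so it has to be read with the standard `pstats` module to find the bottleneck. The sketch below profiles a stand-in function (`busy_work`, a placeholder and not part of IDEAL-GENOM-QC) and prints the functions with the highest cumulative time; the helper names are hypothetical.

```python
import cProfile
import io
import os
import pstats
import tempfile


def busy_work(n=200_000):
    """Stand-in workload; replace with the pipeline call you want to profile."""
    return sum(i * i for i in range(n))


def profile_to_file(func, path):
    """Run func() under cProfile and save the binary stats to `path`."""
    profiler = cProfile.Profile()
    profiler.runcall(func)
    profiler.dump_stats(path)


def top_functions(path, limit=10):
    """Return a text report of the `limit` entries with the highest cumulative time."""
    stream = io.StringIO()
    stats = pstats.Stats(path, stream=stream)
    stats.strip_dirs().sort_stats("cumulative").print_stats(limit)
    return stream.getvalue()


profile_path = os.path.join(tempfile.mkdtemp(), "profile_output.prof")
profile_to_file(busy_work, profile_path)
print(top_functions(profile_path))
```

`top_functions` can be pointed at any file produced by `cProfile.run(..., filename)` to read an existing profile.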

Getting Help

When to Seek Additional Help

Contact the development team if you encounter:

Reproducible bugs not covered in this guide
Unexpected scientific results that need expert interpretation
Feature requests for new functionality
Performance issues on large datasets

How to Report Issues Effectively

When reporting issues, please include:

Complete error message (copy-paste from terminal)
Configuration files (parameters.json, paths.json, steps.json)
System information:

# System info
uname -a
python --version
pip show ideal-genom-qc
plink --version
plink2 --version

Data characteristics:

# Dataset size
wc -l inputData/*.fam
wc -l inputData/*.bim

Steps to reproduce the issue
Expected vs. actual behavior
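
The system information listed above can be gathered in one step and pasted into an issue. This is a sketch using only the standard library; `collect_environment_report` is a hypothetical helper, and it reports the installed package version from pip metadata rather than calling any IDEAL-GENOM-QC function.

```python
import platform
import shutil
import sys
from importlib import metadata


def collect_environment_report(package="ideal-genom-qc", tools=("plink", "plink2")):
    """Return a short plain-text summary of the OS, Python, package version, and tool paths."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = "not installed"
    lines = [
        f"OS: {platform.platform()}",
        f"Python: {sys.version.split()[0]}",
        f"{package}: {version}",
    ]
    for tool in tools:
        lines.append(f"{tool}: {shutil.which(tool) or 'not found on PATH'}")
    return "\n".join(lines)


print(collect_environment_report())
```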

Where to get help:

GitHub Issues: https://github.com/cge-tubingens/IDEAL-GENOM-QC/issues
Documentation: https://ideal-genom-qc.readthedocs.io/
Email: Contact information in the repository

Debug Mode

Enable debug logging for more detailed information:

import logging
logging.basicConfig(level=logging.DEBUG)

# Run your QC pipeline with debug output

Or use the command line with verbose output:

python -m ideal_genom_qc --verbose \
    --path_params config/parameters.json \
    --file_folders config/paths.json \
    --steps config/steps.json
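
Debug output is easier to attach to a bug report when it is also written to a file. The sketch below extends the `logging.basicConfig` call shown earlier in this section with timestamps and a file handler; `configure_debug_logging` and the `qc_debug.log` file name are hypothetical, and it assumes the pipeline logs through Python's standard `logging` module.

```python
import logging


def configure_debug_logging(log_file="qc_debug.log"):
    """Send DEBUG-level messages to both the terminal and `log_file`."""
    logging.basicConfig(
        level=logging.DEBUG,
        format="%(asctime)s %(name)s %(levelname)s: %(message)s",
        handlers=[logging.StreamHandler(), logging.FileHandler(log_file)],
        force=True,  # replace any handlers configured earlier (Python 3.8+)
    )
```

Call `configure_debug_logging()` once before running the QC pipeline, then attach the resulting file to your report.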