Troubleshooting Guide

This guide helps you diagnose and resolve common issues when using IDEAL-GENOM-QC. Issues are organized by category for easier navigation.

Installation Issues

Permission Denied Errors

Error: Permission denied when installing or running

Solutions:

# Install to user directory
pip install --user ideal-genom-qc

# Or use virtual environment
python -m venv qc_env
source qc_env/bin/activate
pip install ideal-genom-qc

# Fix file permissions
chmod +x /path/to/plink

Python Module Import Errors

Error: ModuleNotFoundError: No module named 'ideal_genom_qc'

Solutions:

  1. Check installation:

pip list | grep ideal
python -c "import ideal_genom_qc; print(ideal_genom_qc.__version__)"
  2. Reinstall if needed:

pip uninstall ideal-genom-qc
pip install ideal-genom-qc
  3. Check Python environment:

which python
which pip
# Ensure they point to the same environment
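
The same environment check can be done from inside Python, which avoids guessing which interpreter a given shell resolves to. The sketch below is illustrative only: the helper name describe_environment is hypothetical and not part of IDEAL-GENOM-QC. It reports the running interpreter and whether the module can be located from it.

```python
import importlib.util
import sys


def describe_environment(module_name="ideal_genom_qc"):
    """Return the interpreter path and where (if anywhere) the module resolves."""
    spec = importlib.util.find_spec(module_name)
    return {
        "interpreter": sys.executable,
        "prefix": sys.prefix,
        "module_found": spec is not None,
        "module_location": spec.origin if spec is not None else None,
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If module_found is False while pip list shows the package, pip and python most likely belong to different environments.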

Configuration Issues

JSON Syntax Errors

Error: JSONDecodeError: Expecting ',' delimiter

Solution: Validate your JSON files:

# Check JSON syntax
python -m json.tool configFiles/parameters.json
python -m json.tool configFiles/paths.json
python -m json.tool configFiles/steps.json

Common JSON mistakes:

  • Missing commas between elements

  • Trailing commas after last element

  • Unescaped quotes in strings

  • Comments (not allowed in JSON)
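
To locate the offending line rather than just confirm that a file is invalid, the standard json module exposes the position of the first syntax error. A minimal sketch follows; check_json is a hypothetical helper name, and the file names in the usage note are the ones used in this guide.

```python
import json
from pathlib import Path


def check_json(path):
    """Return None if the file parses, else a message locating the syntax error."""
    try:
        json.loads(Path(path).read_text(encoding="utf-8"))
    except json.JSONDecodeError as err:
        return f"{path}: line {err.lineno}, column {err.colno}: {err.msg}"
    return None
```

Calling check_json("configFiles/parameters.json") returns None for a valid file. Trailing commas and comments are reported at the position where the parser first fails, which is usually on or just after the mistake.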

File Path Issues

Error: FileNotFoundError: [Errno 2] No such file or directory

Solutions:

  1. Use absolute paths:

{
    "input_directory": "/full/path/to/inputData",
    "output_directory": "/full/path/to/outputData"
}
  2. Check file permissions:

ls -la /path/to/your/files
# Ensure read/write permissions
  3. Verify file existence:

# Check if input files exist
ls inputData/mydata.bed
ls inputData/mydata.bim
ls inputData/mydata.fam
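
The existence check above can be scripted so that all three members of a PLINK binary fileset are verified at once. This is a sketch under the assumption that the input is a standard .bed/.bim/.fam triplet sharing one prefix; missing_fileset_members is a hypothetical helper name.

```python
from pathlib import Path


def missing_fileset_members(prefix):
    """List the .bed/.bim/.fam files that are absent or empty for a PLINK prefix."""
    problems = []
    for ext in (".bed", ".bim", ".fam"):
        path = Path(f"{prefix}{ext}")
        if not path.is_file():
            problems.append(f"{path}: not found")
        elif path.stat().st_size == 0:
            problems.append(f"{path}: empty file")
    return problems
```

For example, missing_fileset_members("inputData/mydata") returns an empty list when the fileset is complete.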

Invalid Parameter Values

Error: ValueError: Parameter 'mind' must be between 0 and 1

Solution: Check parameter ranges:

{
    "sample_qc": {
        "mind": 0.2,
        "maf": 0.01,
        "hwe": 5e-8,
        "sex_check": [0.2, 0.8]
    }
}

Parameter validation checklist:

  • mind, geno: 0.0 to 1.0

  • maf: 0.0 to 0.5

  • hwe: > 0 (p-value threshold)

  • sex_check: [female_threshold, male_threshold] where female < male
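
The checklist above can be applied before launching the pipeline. The sketch below encodes only the ranges listed in this guide; check_qc_parameters is a hypothetical helper, not the validation IDEAL-GENOM-QC performs internally, and it expects one section of parameters.json (for example the "sample_qc" object) as a dictionary.

```python
def check_qc_parameters(params):
    """Return a list of range violations for one section of parameters.json."""
    errors = []
    # Proportions with an inclusive upper bound, as given in the checklist
    for name, upper in (("mind", 1.0), ("geno", 1.0), ("maf", 0.5)):
        if name in params and not 0.0 <= params[name] <= upper:
            errors.append(f"{name}={params[name]} is outside [0, {upper}]")
    # hwe is a p-value threshold and must be strictly positive
    if "hwe" in params and not params["hwe"] > 0:
        errors.append(f"hwe={params['hwe']} must be greater than 0")
    # sex_check is [female_threshold, male_threshold] with female < male
    if "sex_check" in params:
        female, male = params["sex_check"]
        if not female < male:
            errors.append(
                f"sex_check={params['sex_check']}: female threshold must be below male threshold"
            )
    return errors
```

An empty list means every parameter present in the section falls inside the documented range.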

Data Format Issues

Chromosome Encoding Issues

Error: Error: Unrecognized chromosome code

Solution: Standardize chromosome codes:

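# Create a mapping of old to new chromosome codes
echo "23 X" > chr_map.txt
echo "24 Y" >> chr_map.txt
echo "25 XY" >> chr_map.txt
echo "26 MT" >> chr_map.txt

# --update-chr expects one line per variant (variant ID, new code),
# so expand the mapping against the .bim (column 1 = code, column 2 = variant ID)
awk 'NR == FNR { map[$1] = $2; next } { print $2, (($1 in map) ? map[$1] : $1) }' chr_map.txt inputData/mydata.bim > update_chr.txt

# Update chromosome codes (--allow-extra-chr lets PLINK load nonstandard codes)
plink --bfile inputData/mydata --allow-extra-chr --update-chr update_chr.txt --make-bed --out inputData/mydata_fixed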
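
Before recoding anything, it helps to see which chromosome codes the data actually contains. A minimal sketch, assuming a standard whitespace-delimited .bim file whose first column is the chromosome code; chromosome_counts is a hypothetical helper name.

```python
from collections import Counter


def chromosome_counts(bim_path):
    """Count variants per chromosome code (first column of a .bim file)."""
    counts = Counter()
    with open(bim_path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.split()
            if fields:
                counts[fields[0]] += 1
    return counts
```

Printing sorted(chromosome_counts("inputData/mydata.bim").items()) shows every code in use, so unexpected contig names stand out immediately.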

Missing Phenotype Data

Error: Warning: No phenotype data available

Solution: Add phenotype information:

# Create phenotype file (FID, IID, phenotype)
# 1=control, 2=case, -9=missing
awk '{print $1, $2, "1"}' inputData/mydata.fam > phenotypes.txt

# Update phenotypes
plink --bfile inputData/mydata --pheno phenotypes.txt --make-bed --out inputData/mydata_pheno
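
To confirm that the update took effect, the phenotype column of the resulting .fam file can be tallied. This sketch assumes the coding described above (1=control, 2=case) and groups every other value, including -9, under a single label; phenotype_counts is a hypothetical helper name.

```python
from collections import Counter

# Phenotype codes as described in this guide; anything else is grouped together
LABELS = {"1": "control", "2": "case"}


def phenotype_counts(fam_path):
    """Tally the phenotype column (sixth field) of a .fam file."""
    counts = Counter()
    with open(fam_path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.split()
            if len(fields) >= 6:
                counts[LABELS.get(fields[5], "missing or other")] += 1
    return counts
```

If every sample lands in "missing or other" after running the update, the phenotype file was probably not matched on FID and IID.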

Runtime Issues

Memory Errors

Error: MemoryError or Killed (out of memory)

Solutions:

  1. Reduce memory usage (larger LD windows, chunked processing, fewer PCs, higher MAF filter):

{
    "sample_qc": {
        "ind_pair": [200, 50, 0.2],
        "chunk_size": 5000
    },
    "ancestry_qc": {
        "pca": 5,
        "maf": 0.05
    }
}
  2. Process chromosomes separately:

# Split by chromosome
for chr in {1..22}; do
    plink --bfile inputData/mydata --chr $chr --make-bed --out chr${chr}_data
done
  3. Monitor memory usage:

# Check available memory
free -h

# Monitor during execution
# -d, joins multiple matching PIDs with commas, as top -p requires
top -p "$(pgrep -d, -f ideal_genom_qc)"

Disk Space Issues

Error: OSError: [Errno 28] No space left on device

Solutions:

  1. Check disk space:

df -h .
du -sh outputData/
  2. Clean temporary files:

# Remove temporary PLINK files
find . -name "*.tmp" -delete
find . -name "plink.log" -delete
find . -name "*.nosex" -delete
  3. Use different output directory:

{
    "output_directory": "/path/to/larger/disk/outputData"
}
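
Free space can also be checked programmatically before a run starts. The sketch below uses shutil.disk_usage from the standard library; the 10 GiB default is an arbitrary placeholder rather than a documented requirement of the pipeline, and both helper names are hypothetical.

```python
import shutil


def free_gib(path="."):
    """Free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 2**30


def has_room(path=".", required_gib=10.0):
    """True when at least `required_gib` GiB are free (placeholder threshold)."""
    return free_gib(path) >= required_gib
```

Intermediate PLINK files can be several times the size of the input, so it is safer to choose the threshold from the size of inputData/ than from a fixed number.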

Long Runtime Issues

Issue: Pipeline takes much longer than expected

Solutions:

  1. Check system resources:

# CPU usage
htop

# I/O wait
iostat -x 1

# Check for bottlenecks
iotop
  2. Optimize parameters (faster LD pruning; set use_kingship to false if it is not needed):

{
    "sample_qc": {
        "ind_pair": [100, 25, 0.3],
        "use_kingship": false
    }
}
  3. Use SSD storage: Move data to faster storage if possible

Output and Results Issues

Missing Output Files

Issue: Expected output files are not generated

Solutions:

  1. Check pipeline logs:

# Look for error messages
grep -i error outputData/*.log
grep -i warning outputData/*.log
  2. Verify step completion:

# Check if steps completed
ls outputData/*/clean_files/
ls outputData/*/fail_samples/
  3. Re-run specific steps (set completed steps to false to skip them, and the failed step to true):

{
    "ancestry": false,
    "sample": false,
    "variant": true,
    "umap": true
}
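
The log check in step 1 can be scripted so that each hit carries its file name and line number. A minimal sketch, assuming the logs are plain-text *.log files in a single directory; scan_logs is a hypothetical helper name.

```python
import re
from pathlib import Path

# Case-insensitive match, equivalent to grep -i for either keyword
PATTERN = re.compile(r"error|warning", re.IGNORECASE)


def scan_logs(directory):
    """Yield (file name, line number, text) for log lines mentioning errors or warnings."""
    for log_file in sorted(Path(directory).glob("*.log")):
        with open(log_file, encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if PATTERN.search(line):
                    yield log_file.name, number, line.rstrip()
```

Iterating over scan_logs("outputData") and printing each tuple gives the same information as the two grep commands, in the order the messages were written.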

Empty or Invalid Results

Issue: Output files exist but are empty or contain unexpected results

Solutions:

  1. Check input data quality:

# Basic statistics
plink --bfile inputData/mydata --freq --missing --out data_check

# Check sample sizes
wc -l inputData/mydata.fam
wc -l outputData/*/clean_files/*.fam
  2. Review QC thresholds:

# Check how many samples/variants were removed
grep -i "removed" outputData/*.log
  3. Visualize intermediate results:

import pandas as pd
import matplotlib.pyplot as plt

# Load and plot QC metrics
metrics = pd.read_csv("outputData/sample_qc_results/qc_metrics.txt", sep="\t")
metrics.hist(figsize=(12, 8))
plt.show()
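
The sample-size comparison from step 1 can be reduced to a single retention figure, which makes overly strict thresholds easy to spot. This sketch relies only on the fact that .fam and .bim files hold one record per line; the helper names are hypothetical.

```python
def count_lines(path):
    """Number of lines in a file, read in binary mode to avoid decoding issues."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def retention(before_path, after_path):
    """Fraction of rows kept between two .fam (samples) or .bim (variants) files."""
    before = count_lines(before_path)
    if before == 0:
        raise ValueError(f"{before_path} is empty")
    return count_lines(after_path) / before
```

A retention close to zero usually points at a threshold problem rather than at the data, so it is worth re-checking the parameter ranges before re-running.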

Plotting and Visualization Issues

Missing Plots

Issue: QC plots are not generated

Solutions:

  1. Check plotting dependencies:

python -c "import matplotlib, seaborn, pandas; print('All plotting modules available')"
  2. Check output directories:

ls outputData/*/plots/
ls outputData/*_plots/
  3. Generate plots manually:

from ideal_genom_qc import UMAPplot

plotter = UMAPplot(
    input_path="outputData/ancestry_results/clean_files",
    input_name="clean_data",
    output_path="outputData/manual_plots"
)
plotter.create_umap_plots()

Plot Display Issues

Issue: Plots are generated but not displaying correctly

Solutions:

  1. Check image formats:

file outputData/*/plots/*.png
# Should show valid PNG files
  2. Convert formats if needed:

# Convert to different format
for img in outputData/*/plots/*.png; do
    convert "$img" "${img%.png}.pdf"
done
  3. Check plotting backend:

import matplotlib
print(matplotlib.get_backend())

# Set non-interactive backend if needed
matplotlib.use('Agg')

Network and Download Issues

Reference Data Download Failures

Error: ConnectionError or TimeoutError when downloading reference data

Solutions:

  1. Check internet connection:

ping -c 4 google.com
curl -I https://github.com
  2. Manual download:

from ideal_genom_qc.get_references import FetcherReference

fetcher = FetcherReference(built="38")
fetcher.download_references(
    force_redownload=True,
    timeout=300  # Increase timeout
)
  3. Use local reference files:

{
    "high_ld_file": "/path/to/local/high-LD-regions.txt"
}

Proxy or Firewall Issues

Error: Download fails due to network restrictions

Solutions:

  1. Configure proxy:

export http_proxy=http://proxy.company.com:8080
export https_proxy=https://proxy.company.com:8080
  2. Download manually: Get reference files from the GitHub repository and place them locally

Performance Optimization

Slow Performance Debugging

Issue: Pipeline runs slower than expected

Debugging steps:

  1. Profile system resources:

# CPU and memory usage
htop

# Disk I/O
iotop -a

# Network usage (if downloading references)
nethogs
  2. Identify bottlenecks:

import cProfile
import ideal_genom_qc

# Profile the QC pipeline; the output is a binary stats file for pstats, not plain text
cProfile.run('ideal_genom_qc.main()', 'profile_output.prof')
  3. Optimize based on bottleneck:

  • CPU bound: Use fewer PCs, larger LD windows

  • Memory bound: Process in chunks, reduce dataset size

  • I/O bound: Use SSD, reduce intermediate file writes

  • Network bound: Download references once, use local files
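
The file that cProfile.run writes is a binary stats dump, whatever its extension, so it has to be read with the standard pstats module rather than opened in an editor. A minimal sketch for listing the most expensive calls follows; top_functions is a hypothetical helper name.

```python
import pstats


def top_functions(stats_path, limit=10):
    """Print the `limit` most expensive calls by cumulative time from a cProfile dump."""
    stats = pstats.Stats(stats_path)
    stats.sort_stats("cumulative").print_stats(limit)
    return stats
```

Sorting by cumulative time highlights the pipeline stage that dominates the run; sorting by "tottime" instead shows where time is spent inside individual functions.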

Getting Help

When to Seek Additional Help

Contact the development team if you encounter:

  • Reproducible bugs not covered in this guide

  • Unexpected scientific results that need expert interpretation

  • Feature requests for new functionality

  • Performance issues on large datasets

How to Report Issues Effectively

When reporting issues, please include:

  1. Complete error message (copy-paste from terminal)

  2. Configuration files (parameters.json, paths.json, steps.json)

  3. System information:

# System info
uname -a
python --version
pip show ideal-genom-qc
plink --version
plink2 --version
  4. Data characteristics:

# Dataset size
wc -l inputData/*.fam
wc -l inputData/*.bim
  5. Steps to reproduce the issue

  6. Expected vs. actual behavior
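
The system information in item 3 can be gathered in one step and pasted into a report. This is a sketch using only the standard library; collect_report_info is a hypothetical helper. It reports where the PLINK executables resolve on PATH and does not query their versions.

```python
import platform
import shutil
import sys
from importlib import metadata


def collect_report_info(package="ideal-genom-qc", tools=("plink", "plink2")):
    """Gather platform, Python, package version, and tool locations for a bug report."""
    try:
        package_version = metadata.version(package)
    except metadata.PackageNotFoundError:
        package_version = "not installed"
    info = {
        "platform": platform.platform(),
        "python": sys.version.split()[0],
        package: package_version,
    }
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH
        info[tool] = shutil.which(tool) or "not found on PATH"
    return info
```

Printing each key and value of the returned dictionary produces a compact block that can be attached to the issue alongside the error message.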

Where to get help:

Debug Mode

Enable debug logging for more detailed information:

import logging
logging.basicConfig(level=logging.DEBUG)

# Run your QC pipeline with debug output

Or use the command line with verbose output:

python -m ideal_genom_qc --verbose \
    --path_params config/parameters.json \
    --file_folders config/paths.json \
    --steps config/steps.json
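
Debug output is often too long to read in a terminal, so it can help to write it to a file that can be attached to an issue report. The sketch below assumes the pipeline emits its messages through Python's standard logging module, as the basicConfig example above suggests; enable_debug_logging and the default file name are hypothetical.

```python
import logging


def enable_debug_logging(log_path="qc_debug.log"):
    """Send DEBUG-level messages to both the console and `log_path`."""
    logging.basicConfig(
        level=logging.DEBUG,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
        handlers=[logging.StreamHandler(), logging.FileHandler(log_path)],
        force=True,  # replace handlers configured earlier in the session
    )
```

Call enable_debug_logging() once, before running the pipeline from Python, so that messages from every step are captured.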