Troubleshooting Guide
This guide helps you diagnose and resolve common issues when using IDEAL-GENOM-QC. Issues are organized by category for easier navigation.
Installation Issues
PLINK Not Found
Error: plink: command not found or plink2: command not found
Solution:
Check if PLINK is installed:
which plink
which plink2
Install PLINK if missing:
# Download PLINK 1.9
wget https://s3.amazonaws.com/plink1-assets/plink_linux_x86_64_20231211.zip
unzip plink_linux_x86_64_20231211.zip
sudo mv plink /usr/local/bin/
# Download PLINK 2.0
wget https://s3.amazonaws.com/plink2-assets/alpha5/plink2_linux_x86_64_20231212.zip
unzip plink2_linux_x86_64_20231212.zip
sudo mv plink2 /usr/local/bin/
Add to PATH if installed elsewhere:
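# Use the directory that contains the plink executable, not the executable itself
export PATH="/path/to/plink/directory:$PATH"
# Add the export line to ~/.bashrc to make it persistent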
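The same check can be run from Python, which is useful when the shell and the Python environment see different PATH values. This is a minimal sketch using only the standard library; the helper name `find_plink_binaries` is hypothetical and not part of IDEAL-GENOM-QC.

```python
import shutil


def find_plink_binaries(names=("plink", "plink2")):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


# Report what the current Python process can see
for name, path in find_plink_binaries().items():
    print(f"{name}: {path or 'NOT FOUND on PATH'}")
```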
Permission Denied Errors
Error: Permission denied when installing or running
Solutions:
# Install to user directory
pip install --user ideal-genom-qc
# Or use virtual environment
python -m venv qc_env
source qc_env/bin/activate
pip install ideal-genom-qc
# Fix file permissions
chmod +x /path/to/plink
Python Module Import Errors
Error: ModuleNotFoundError: No module named 'ideal_genom_qc'
Solutions:
Check installation:
pip list | grep ideal
python -c "import ideal_genom_qc; print(ideal_genom_qc.__version__)"
Reinstall if needed:
pip uninstall ideal-genom-qc
pip install ideal-genom-qc
Check Python environment:
which python
which pip
# Ensure they point to the same environment
Configuration Issues
JSON Syntax Errors
Error: JSONDecodeError: Expecting ',' delimiter
Solution: Validate your JSON files:
# Check JSON syntax
python -m json.tool configFiles/parameters.json
python -m json.tool configFiles/paths.json
python -m json.tool configFiles/steps.json
Common JSON mistakes:
Missing commas between elements
Trailing commas after last element
Unescaped quotes in strings
Comments (not allowed in JSON)
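To locate a syntax error rather than only detect it, a short script can report the line and column where parsing stopped. This is a sketch that assumes the three configuration files live in configFiles/ as in the commands above; `check_json` is a hypothetical helper name.

```python
import json
from pathlib import Path


def check_json(path):
    """Return None if the file parses as JSON, else a message locating the error."""
    try:
        json.loads(Path(path).read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return f"{path}: line {exc.lineno}, column {exc.colno}: {exc.msg}"
    return None


if __name__ == "__main__":
    for name in ("parameters.json", "paths.json", "steps.json"):
        config = Path("configFiles") / name
        if config.exists():
            print(check_json(config) or f"{config}: OK")
```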
File Path Issues
Error: FileNotFoundError: [Errno 2] No such file or directory
Solutions:
Use absolute paths:
{
"input_directory": "/full/path/to/inputData",
"output_directory": "/full/path/to/outputData"
}
Check file permissions:
ls -la /path/to/your/files
# Ensure read/write permissions
Verify file existence:
# Check if input files exist
ls inputData/mydata.bed
ls inputData/mydata.bim
ls inputData/mydata.fam
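The three existence checks above can be combined into one helper that also flags empty files. This is a sketch; the function name `missing_plink_files` and the prefix `inputData/mydata` are illustrative only.

```python
from pathlib import Path


def missing_plink_files(prefix):
    """Return the .bed/.bim/.fam files that are missing or empty for a PLINK prefix."""
    problems = []
    for ext in (".bed", ".bim", ".fam"):
        path = Path(f"{prefix}{ext}")
        if not path.is_file() or path.stat().st_size == 0:
            problems.append(str(path))
    return problems


# An empty list means all three files exist and are non-empty
print(missing_plink_files("inputData/mydata"))
```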
Invalid Parameter Values
Error: ValueError: Parameter 'mind' must be between 0 and 1
Solution: Check parameter ranges. The values below are examples; the allowed range for each parameter is given in the checklist that follows (JSON does not allow inline comments):
{
  "sample_qc": {
    "mind": 0.2,
    "maf": 0.01,
    "hwe": 5e-8,
    "sex_check": [0.2, 0.8]
  }
}
Parameter validation checklist:
mind, geno: 0.0 to 1.0
maf: 0.0 to 0.5
hwe: > 0 (p-value threshold)
sex_check: [female_max, male_min] thresholds, where female_max < male_min
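The checklist above can be expressed as a small pre-flight check that runs before the pipeline starts. This is only a sketch of the listed ranges, not the package's own validation code; `validate_sample_qc` is a hypothetical name.

```python
def validate_sample_qc(params):
    """Return a list of problems; an empty list means every listed check passed."""
    problems = []
    for key in ("mind", "geno"):
        if key in params and not 0.0 <= params[key] <= 1.0:
            problems.append(f"{key} must be between 0 and 1, got {params[key]}")
    if "maf" in params and not 0.0 <= params["maf"] <= 0.5:
        problems.append(f"maf must be between 0 and 0.5, got {params['maf']}")
    if "hwe" in params and not params["hwe"] > 0:
        problems.append(f"hwe must be > 0, got {params['hwe']}")
    if "sex_check" in params:
        thresholds = params["sex_check"]
        if len(thresholds) != 2 or not thresholds[0] < thresholds[1]:
            problems.append("sex_check must be [female_max, male_min] with female_max < male_min")
    return problems


print(validate_sample_qc({"mind": 0.2, "maf": 0.01, "hwe": 5e-8, "sex_check": [0.2, 0.8]}))
```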
Data Format Issues
Corrupted PLINK Files
Error: Error: Invalid .bed file or Error: .fam file has wrong number of columns
Solutions:
Validate PLINK files:
# Check file integrity
plink --bfile inputData/mydata --freq --out test_freq
# Check file formats
head inputData/mydata.fam # Should have 6 columns
head inputData/mydata.bim # Should have 6 columns
file inputData/mydata.bed # Should be binary
Regenerate binary files:
# From PLINK text format
plink --file inputData/mydata --make-bed --out inputData/mydata_fixed
# From VCF
plink --vcf inputData/mydata.vcf --make-bed --out inputData/mydata
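A truncated or mismatched .bed file can also be detected without running PLINK. A variant-major PLINK 1 .bed file starts with the three bytes 0x6c 0x1b 0x01, followed by ceil(samples / 4) bytes per variant. The sketch below checks the header and the expected size against the line counts of the .fam and .bim files; `check_bed` is a hypothetical helper name.

```python
from pathlib import Path

BED_MAGIC = b"\x6c\x1b\x01"  # PLINK 1 binary header, variant-major mode


def count_lines(path):
    """Count lines in a text file without loading it into memory."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def check_bed(prefix):
    """Return a list of problems found in the .bed/.bim/.fam fileset at `prefix`."""
    problems = []
    n_samples = count_lines(f"{prefix}.fam")
    n_variants = count_lines(f"{prefix}.bim")
    bed = Path(f"{prefix}.bed")
    with open(bed, "rb") as handle:
        if handle.read(3) != BED_MAGIC:
            problems.append("unexpected .bed header (not a variant-major PLINK 1 file)")
    # Each variant stores 2 bits per sample, rounded up to whole bytes
    expected = 3 + n_variants * ((n_samples + 3) // 4)
    actual = bed.stat().st_size
    if actual != expected:
        problems.append(f".bed size is {actual} bytes, expected {expected}")
    return problems
```

Calling `check_bed("inputData/mydata")` returns an empty list when the header and size are consistent with the .fam and .bim files.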
Chromosome Encoding Issues
Error: Error: Unrecognized chromosome code
Solution: Standardize chromosome codes:
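# --update-chr expects one line per variant: variant ID, then the new chromosome code.
# Build that file from the .bim (column 1 = chromosome, column 2 = variant ID),
# mapping the numeric codes 23-26 to X, Y, XY, and MT
awk 'BEGIN {m["23"]="X"; m["24"]="Y"; m["25"]="XY"; m["26"]="MT"} {print $2, (($1 in m) ? m[$1] : $1)}' inputData/mydata.bim > update_chr.txt
# Update chromosome codes
plink --bfile inputData/mydata --update-chr update_chr.txt --make-bed --out inputData/mydata_fixed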
Missing Phenotype Data
Error: Warning: No phenotype data available
Solution: Add phenotype information:
# Create phenotype file (FID, IID, phenotype)
# 1=control, 2=case, -9=missing
awk '{print $1, $2, "1"}' inputData/mydata.fam > phenotypes.txt
# Update phenotypes
plink --bfile inputData/mydata --pheno phenotypes.txt --make-bed --out inputData/mydata_pheno
Runtime Issues
Memory Errors
Error: MemoryError or Killed (out of memory)
Solutions:
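Reduce memory usage by using larger LD windows, processing in chunks, computing fewer PCs, and applying a higher MAF filter (the settings are described here because JSON does not allow inline comments):
{
  "sample_qc": {
    "ind_pair": [200, 50, 0.2],
    "chunk_size": 5000
  },
  "ancestry_qc": {
    "pca": 5,
    "maf": 0.05
  }
}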
Process chromosomes separately:
# Split by chromosome
for chr in {1..22}; do
    plink --bfile inputData/mydata --chr "$chr" --make-bed --out "chr${chr}_data"
done
Monitor memory usage:
# Check available memory
free -h
# Monitor during execution (-d, joins multiple PIDs with commas, as top -p expects)
top -p "$(pgrep -d, -f ideal_genom_qc)"
Disk Space Issues
Error: OSError: [Errno 28] No space left on device
Solutions:
Check disk space:
df -h .
du -sh outputData/
Clean temporary files:
# Remove temporary PLINK files
find . -name "*.tmp" -delete
find . -name "plink.log" -delete
find . -name "*.nosex" -delete
Use different output directory:
{
"output_directory": "/path/to/larger/disk/outputData"
}
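Free space can also be checked programmatically before a long run starts. The sketch below uses only the standard library; the 50 GiB default is an arbitrary placeholder, not a documented requirement of the pipeline, and both function names are hypothetical.

```python
import shutil


def free_gib(path="."):
    """Free space, in GiB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 1024**3


def has_room(path=".", required_gib=50.0):
    """True if at least `required_gib` GiB are free (threshold is a placeholder)."""
    return free_gib(path) >= required_gib


print(f"{free_gib('.'):.1f} GiB free in the current directory")
```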
Long Runtime Issues
Issue: Pipeline takes much longer than expected
Solutions:
Check system resources:
# CPU usage
htop
# I/O wait
iostat -x 1
# Check for bottlenecks
iotop
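Optimize parameters: a smaller LD window speeds up pruning, and use_kingship can be set to false if that step is not needed (JSON does not allow inline comments, so the settings are described here):
{
  "sample_qc": {
    "ind_pair": [100, 25, 0.3],
    "use_kingship": false
  }
}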
Use SSD storage: Move data to faster storage if possible
Output and Results Issues
Missing Output Files
Issue: Expected output files are not generated
Solutions:
Check pipeline logs:
# Look for error messages
grep -i error outputData/*.log
grep -i warning outputData/*.log
Verify step completion:
# Check if steps completed
ls outputData/*/clean_files/
ls outputData/*/fail_samples/
Re-run specific steps: set completed steps to false to skip them and the failed step to true to re-run it (JSON does not allow inline comments):
{
  "ancestry": false,
  "sample": false,
  "variant": true,
  "umap": true
}
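When the output tree contains many logs, a short script can collect every error and warning line together with its file and line number. This sketch searches recursively, unlike the grep commands above, and assumes logs end in .log; `scan_logs` is a hypothetical helper name.

```python
import re
from pathlib import Path

PATTERN = re.compile(r"error|warning", re.IGNORECASE)


def scan_logs(directory):
    """Return (file, line number, text) for every log line mentioning an error or warning."""
    hits = []
    for log in sorted(Path(directory).rglob("*.log")):
        with open(log, encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if PATTERN.search(line):
                    hits.append((str(log), number, line.rstrip()))
    return hits


for path, number, text in scan_logs("outputData"):
    print(f"{path}:{number}: {text}")
```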
Empty or Invalid Results
Issue: Output files exist but are empty or contain unexpected results
Solutions:
Check input data quality:
# Basic statistics
plink --bfile inputData/mydata --freq --missing --out data_check
# Check sample sizes
wc -l inputData/mydata.fam
wc -l outputData/*/clean_files/*.fam
Review QC thresholds:
# Check how many samples/variants were removed
grep -i "removed" outputData/*.log
Visualize intermediate results:
import pandas as pd
import matplotlib.pyplot as plt
# Load and plot QC metrics
metrics = pd.read_csv("outputData/sample_qc_results/qc_metrics.txt", sep="\t")
metrics.hist(figsize=(12, 8))
plt.show()
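The sample-count comparison above can be turned into a retention figure, which makes overly strict thresholds easy to spot. This is a sketch; the file paths are placeholders and both function names are hypothetical.

```python
def count_samples(fam_path):
    """Count non-blank lines in a .fam file (one line per sample)."""
    with open(fam_path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def retention(before_fam, after_fam):
    """Return (samples before, samples after, fraction retained)."""
    before = count_samples(before_fam)
    after = count_samples(after_fam)
    fraction = after / before if before else 0.0
    return before, after, fraction
```

For example, `retention("inputData/mydata.fam", "path/to/clean.fam")` returns the two counts and the fraction of samples that survived QC; a very low fraction suggests reviewing the thresholds.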
Plotting and Visualization Issues
Missing Plots
Issue: QC plots are not generated
Solutions:
Check plotting dependencies:
python -c "import matplotlib, seaborn, pandas; print('All plotting modules available')"
Check output directories:
ls outputData/*/plots/
ls outputData/*_plots/
Generate plots manually:
from ideal_genom_qc import UMAPplot
plotter = UMAPplot(
    input_path="outputData/ancestry_results/clean_files",
    input_name="clean_data",
    output_path="outputData/manual_plots"
)
plotter.create_umap_plots()
Plot Display Issues
Issue: Plots are generated but not displaying correctly
Solutions:
Check image formats:
file outputData/*/plots/*.png
# Should show valid PNG files
Convert formats if needed:
# Convert to a different format (the convert command requires ImageMagick)
for img in outputData/*/plots/*.png; do
    convert "$img" "${img%.png}.pdf"
done
Check plotting backend:
import matplotlib
print(matplotlib.get_backend())
# Set a non-interactive backend if needed (for example on a headless server);
# call this before importing matplotlib.pyplot
matplotlib.use('Agg')
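If the `file` utility is unavailable, the PNG check can be done in Python by comparing the first eight bytes of each file with the fixed PNG signature. This is a sketch; `invalid_pngs` is a hypothetical helper name and the search is recursive.

```python
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every valid PNG file


def invalid_pngs(directory):
    """Return .png files under `directory` that do not start with the PNG signature."""
    bad = []
    for png in sorted(Path(directory).rglob("*.png")):
        with open(png, "rb") as handle:
            if handle.read(8) != PNG_SIGNATURE:
                bad.append(str(png))
    return bad


print(invalid_pngs("outputData"))
```

Empty or truncated files are reported too, because they cannot contain the full signature.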
Network and Download Issues
Reference Data Download Failures
Error: ConnectionError or TimeoutError when downloading reference data
Solutions:
Check internet connection:
ping -c 4 google.com
curl -I https://github.com
Manual download:
from ideal_genom_qc.get_references import FetcherReference
fetcher = FetcherReference(built="38")
fetcher.download_references(
    force_redownload=True,
    timeout=300  # Increase timeout
)
Use local reference files:
{
"high_ld_file": "/path/to/local/high-LD-regions.txt"
}
Proxy or Firewall Issues
Error: Download fails due to network restrictions
Solutions:
Configure proxy:
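export http_proxy=http://proxy.company.com:8080
# Most proxies are reached over plain HTTP even for HTTPS traffic;
# use an https:// proxy URL only if your proxy requires it
export https_proxy=http://proxy.company.com:8080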
Download manually: Get reference files from the GitHub repository and place them locally
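To confirm that the proxy variables are visible to Python, the standard library can report the settings it detects. This sketch only inspects configuration and makes no network request; whether the pipeline's downloader honours these settings is an assumption that depends on the HTTP library it uses. `proxy_report` is a hypothetical helper name.

```python
import urllib.request


def proxy_report():
    """Return the HTTP and HTTPS proxy URLs that Python detects, or None for each."""
    proxies = urllib.request.getproxies()
    return {scheme: proxies.get(scheme) for scheme in ("http", "https")}


for scheme, url in proxy_report().items():
    print(f"{scheme}: {url or 'no proxy configured'}")
```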
Performance Optimization
Slow Performance Debugging
Issue: Pipeline runs slower than expected
Debugging steps:
Profile system resources:
# CPU and memory usage
htop
# Disk I/O
iotop -a
# Network usage (if downloading references)
nethogs
Identify bottlenecks:
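import cProfile
import pstats
import ideal_genom_qc
# Profile the QC pipeline; the output is a binary stats file, not plain text
cProfile.run('ideal_genom_qc.main()', 'profile_output.prof')
# Print the 20 most expensive calls by cumulative time
pstats.Stats('profile_output.prof').sort_stats('cumulative').print_stats(20)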
Optimize based on bottleneck:
CPU bound: Use fewer PCs, larger LD windows
Memory bound: Process in chunks, reduce dataset size
I/O bound: Use SSD, reduce intermediate file writes
Network bound: Download references once, use local files
Getting Help
When to Seek Additional Help
Contact the development team if you encounter:
Reproducible bugs not covered in this guide
Unexpected scientific results that need expert interpretation
Feature requests for new functionality
Performance issues on large datasets
How to Report Issues Effectively
When reporting issues, please include:
Complete error message (copy-paste from terminal)
Configuration files (parameters.json, paths.json, steps.json)
System information:
# System info
uname -a
python --version
pip show ideal-genom-qc
plink --version
plink2 --version
Data characteristics:
# Dataset size
wc -l inputData/*.fam
wc -l inputData/*.bim
Steps to reproduce the issue
Expected vs. actual behavior
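The system information above can be gathered in one step and pasted into an issue report. This sketch uses only the standard library and assumes that plink and plink2 accept --version, as in the commands listed earlier; the function names are hypothetical.

```python
import platform
import shutil
import subprocess
import sys


def tool_version(name):
    """Return the --version output of an executable, or a short failure note."""
    path = shutil.which(name)
    if path is None:
        return "not found"
    try:
        result = subprocess.run(
            [path, "--version"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"failed to run ({exc})"
    return (result.stdout or result.stderr).strip()


def environment_report():
    """Build a plain-text summary of the platform, Python, and PLINK versions."""
    lines = [
        f"platform: {platform.platform()}",
        f"python: {sys.version.split()[0]}",
    ]
    lines += [f"{name}: {tool_version(name)}" for name in ("plink", "plink2")]
    return "\n".join(lines)


print(environment_report())
```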
Where to get help:
GitHub Issues: https://github.com/cge-tubingens/IDEAL-GENOM-QC/issues
Documentation: https://ideal-genom-qc.readthedocs.io/
Email: Contact information in the repository
Debug Mode
Enable debug logging for more detailed information:
import logging
logging.basicConfig(level=logging.DEBUG)
# Run your QC pipeline with debug output
Or use the command line with verbose output:
python -m ideal_genom_qc --verbose \
    --path_params configFiles/parameters.json \
    --file_folders configFiles/paths.json \
    --steps configFiles/steps.json
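Debug output is easier to attach to a bug report when it is also written to a file. The sketch below configures the root logger to write to both the console and a log file; it assumes the pipeline emits its messages through Python's standard logging module, and the function name and default file name are hypothetical.

```python
import logging


def configure_debug_logging(logfile="qc_debug.log"):
    """Send DEBUG-level messages from all loggers to the console and to `logfile`."""
    root = logging.getLogger()
    root.setLevel(logging.DEBUG)
    formatter = logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    for handler in (logging.StreamHandler(), logging.FileHandler(logfile)):
        handler.setFormatter(formatter)
        root.addHandler(handler)
    return root


# Usage: call once before running the pipeline, for example
# configure_debug_logging("qc_debug.log")
```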