Troubleshooting Guide
=====================

This guide helps you diagnose and resolve common issues when using IDEAL-GENOM-QC. Issues are organized by category for easier navigation.

Installation Issues
-------------------

PLINK Not Found
^^^^^^^^^^^^^^^

**Error:** ``plink: command not found`` or ``plink2: command not found``

**Solution:**

1. **Check if PLINK is installed:**

   .. code-block:: bash

      which plink
      which plink2

2. **Install PLINK if missing:**

   .. code-block:: bash

      # Download PLINK 1.9
      wget https://s3.amazonaws.com/plink1-assets/plink_linux_x86_64_20231211.zip
      unzip plink_linux_x86_64_20231211.zip
      sudo mv plink /usr/local/bin/

      # Download PLINK 2.0
      wget https://s3.amazonaws.com/plink2-assets/alpha5/plink2_linux_x86_64_20231212.zip
      unzip plink2_linux_x86_64_20231212.zip
      sudo mv plink2 /usr/local/bin/

3. **Add to PATH if installed elsewhere:**

   .. code-block:: bash

      export PATH=/path/to/plink:$PATH
      # Add the line above to ~/.bashrc for persistence

Permission Denied Errors
^^^^^^^^^^^^^^^^^^^^^^^^^

**Error:** ``Permission denied`` when installing or running

**Solutions:**

.. code-block:: bash

   # Install to user directory
   pip install --user ideal-genom-qc

   # Or use a virtual environment
   python -m venv qc_env
   source qc_env/bin/activate
   pip install ideal-genom-qc

   # Fix file permissions
   chmod +x /path/to/plink

Python Module Import Errors
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Error:** ``ModuleNotFoundError: No module named 'ideal_genom_qc'``

**Solutions:**

1. **Check installation:**

   .. code-block:: bash

      pip list | grep ideal
      python -c "import ideal_genom_qc; print(ideal_genom_qc.__version__)"

2. **Reinstall if needed:**

   .. code-block:: bash

      pip uninstall ideal-genom-qc
      pip install ideal-genom-qc

3. **Check Python environment:**

   .. code-block:: bash

      which python
      which pip
      # Ensure they point to the same environment

Configuration Issues
--------------------

JSON Syntax Errors
^^^^^^^^^^^^^^^^^^^

**Error:** ``JSONDecodeError: Expecting ',' delimiter``

**Solution:** Validate your JSON files:

.. code-block:: bash

   # Check JSON syntax
   python -m json.tool configFiles/parameters.json
   python -m json.tool configFiles/paths.json
   python -m json.tool configFiles/steps.json

**Common JSON mistakes:**

- Missing commas between elements
- Trailing commas after the last element
- Unescaped quotes in strings
- Comments (not allowed in JSON)

File Path Issues
^^^^^^^^^^^^^^^^

**Error:** ``FileNotFoundError: [Errno 2] No such file or directory``

**Solutions:**

1. **Use absolute paths:**

   .. code-block:: json

      {
        "input_directory": "/full/path/to/inputData",
        "output_directory": "/full/path/to/outputData"
      }

2. **Check file permissions:**

   .. code-block:: bash

      ls -la /path/to/your/files
      # Ensure read/write permissions

3. **Verify file existence:**

   .. code-block:: bash

      # Check if input files exist
      ls inputData/mydata.bed
      ls inputData/mydata.bim
      ls inputData/mydata.fam

Invalid Parameter Values
^^^^^^^^^^^^^^^^^^^^^^^^

**Error:** ``ValueError: Parameter 'mind' must be between 0 and 1``

**Solution:** Check that each parameter lies within its valid range. Because JSON does not allow comments, the ranges are listed in the checklist after the example:

.. code-block:: json

   {
     "sample_qc": {
       "mind": 0.2,
       "maf": 0.01,
       "hwe": 5e-8,
       "sex_check": [0.2, 0.8]
     }
   }

**Parameter validation checklist:**

- ``mind``, ``geno``: 0.0 to 1.0
- ``maf``: 0.0 to 0.5
- ``hwe``: > 0 (p-value threshold)
- ``sex_check``: ``[female_max, male_min]``, where the female threshold is lower than the male threshold

Data Format Issues
------------------

Corrupted PLINK Files
^^^^^^^^^^^^^^^^^^^^^

**Error:** ``Error: Invalid .bed file`` or ``Error: .fam file has wrong number of columns``

**Solutions:**

1. **Validate PLINK files:**

   .. code-block:: bash

      # Check file integrity
      plink --bfile inputData/mydata --freq --out test_freq

      # Check file formats
      head inputData/mydata.fam   # Should have 6 columns
      head inputData/mydata.bim   # Should have 6 columns
      file inputData/mydata.bed   # Should be binary

2. **Regenerate binary files:**

   .. code-block:: bash

      # From PLINK text format
      plink --file inputData/mydata --make-bed --out inputData/mydata_fixed

      # From VCF
      plink --vcf inputData/mydata.vcf --make-bed --out inputData/mydata

Chromosome Encoding Issues
^^^^^^^^^^^^^^^^^^^^^^^^^^

**Error:** ``Error: Unrecognized chromosome code``

**Solution:** Standardize chromosome codes:

.. code-block:: bash

   # Create chromosome update file.
   # --update-chr expects one line per variant: <variant ID> <new chromosome code>
   awk '{
       chr = $1
       if (chr == "23") chr = "X"
       else if (chr == "24") chr = "Y"
       else if (chr == "25") chr = "XY"
       else if (chr == "26") chr = "MT"
       print $2, chr
   }' inputData/mydata.bim > update_chr.txt

   # Update chromosome codes
   plink --bfile inputData/mydata --update-chr update_chr.txt --make-bed --out inputData/mydata_fixed

Missing Phenotype Data
^^^^^^^^^^^^^^^^^^^^^^

**Error:** ``Warning: No phenotype data available``

**Solution:** Add phenotype information:

.. code-block:: bash

   # Create phenotype file (FID, IID, phenotype)
   # 1=control, 2=case, -9=missing
   # This placeholder marks every sample as a control; edit phenotypes.txt as needed
   awk '{print $1, $2, "1"}' inputData/mydata.fam > phenotypes.txt

   # Update phenotypes
   plink --bfile inputData/mydata --pheno phenotypes.txt --make-bed --out inputData/mydata_pheno

Runtime Issues
--------------

Memory Errors
^^^^^^^^^^^^^

**Error:** ``MemoryError`` or ``Killed`` (out of memory)

**Solutions:**

1. **Reduce memory usage:** Use larger LD windows (``ind_pair``), process data in chunks (``chunk_size``), compute fewer principal components (``pca``), and apply a higher MAF filter (``maf``):

   .. code-block:: json

      {
        "sample_qc": {
          "ind_pair": [200, 50, 0.2],
          "chunk_size": 5000
        },
        "ancestry_qc": {
          "pca": 5,
          "maf": 0.05
        }
      }

2. **Process chromosomes separately:**

   .. code-block:: bash

      # Split by chromosome
      for chr in {1..22}; do
          plink --bfile inputData/mydata --chr $chr --make-bed --out chr${chr}_data
      done

3. **Monitor memory usage:**

   .. code-block:: bash

      # Check available memory
      free -h

      # Monitor during execution (-d, joins multiple PIDs with commas, as top expects)
      top -p "$(pgrep -d, -f ideal_genom_qc)"

Disk Space Issues
^^^^^^^^^^^^^^^^^

**Error:** ``OSError: [Errno 28] No space left on device``

**Solutions:**

1. **Check disk space:**

   .. code-block:: bash

      df -h .
      du -sh outputData/

2. **Clean temporary files:**

   .. code-block:: bash

      # Remove temporary PLINK files
      find . -name "*.tmp" -delete
      find . -name "plink.log" -delete
      find . -name "*.nosex" -delete

3. **Use a different output directory:**

   .. code-block:: json

      {
        "output_directory": "/path/to/larger/disk/outputData"
      }

Long Runtime Issues
^^^^^^^^^^^^^^^^^^^

**Issue:** Pipeline takes much longer than expected

**Solutions:**

1. **Check system resources:**

   .. code-block:: bash

      # CPU usage
      htop

      # I/O wait
      iostat -x 1

      # Check for bottlenecks
      iotop

2. **Optimize parameters:** Use ``ind_pair`` values that make LD pruning faster, and set ``use_kingship`` to ``false`` if it is not needed:

   .. code-block:: json

      {
        "sample_qc": {
          "ind_pair": [100, 25, 0.3],
          "use_kingship": false
        }
      }

3. **Use SSD storage:** Move data to faster storage if possible.

Output and Results Issues
-------------------------

Missing Output Files
^^^^^^^^^^^^^^^^^^^^

**Issue:** Expected output files are not generated

**Solutions:**

1. **Check pipeline logs:**

   .. code-block:: bash

      # Look for error messages
      grep -i error outputData/*.log
      grep -i warning outputData/*.log

2. **Verify step completion:**

   .. code-block:: bash

      # Check if steps completed
      ls outputData/*/clean_files/
      ls outputData/*/fail_samples/

3. **Re-run specific steps:** Set completed steps to ``false`` to skip them and set the failed step to ``true`` to re-run it:

   .. code-block:: json

      {
        "ancestry": false,
        "sample": false,
        "variant": true,
        "umap": true
      }

Empty or Invalid Results
^^^^^^^^^^^^^^^^^^^^^^^^

**Issue:** Output files exist but are empty or contain unexpected results

**Solutions:**

1. **Check input data quality:**

   .. code-block:: bash

      # Basic statistics
      plink --bfile inputData/mydata --freq --missing --out data_check

      # Check sample sizes
      wc -l inputData/mydata.fam
      wc -l outputData/*/clean_files/*.fam

2. **Review QC thresholds:**

   .. code-block:: bash

      # Check how many samples/variants were removed
      grep -i "removed" outputData/*.log

3. **Visualize intermediate results:**

   .. code-block:: python

      import pandas as pd
      import matplotlib.pyplot as plt

      # Load and plot QC metrics
      metrics = pd.read_csv("outputData/sample_qc_results/qc_metrics.txt", sep="\t")
      metrics.hist(figsize=(12, 8))
      plt.show()

Plotting and Visualization Issues
---------------------------------

Missing Plots
^^^^^^^^^^^^^

**Issue:** QC plots are not generated

**Solutions:**

1. **Check plotting dependencies:**

   .. code-block:: bash

      python -c "import matplotlib, seaborn, pandas; print('All plotting modules available')"

2. **Check output directories:**

   .. code-block:: bash

      ls outputData/*/plots/
      ls outputData/*_plots/

3. **Generate plots manually:**

   .. code-block:: python

      from ideal_genom_qc import UMAPplot

      plotter = UMAPplot(
          input_path="outputData/ancestry_results/clean_files",
          input_name="clean_data",
          output_path="outputData/manual_plots"
      )
      plotter.create_umap_plots()

Plot Display Issues
^^^^^^^^^^^^^^^^^^^

**Issue:** Plots are generated but do not display correctly

**Solutions:**

1. **Check image formats:**

   .. code-block:: bash

      file outputData/*/plots/*.png
      # Should show valid PNG files

2. **Convert formats if needed:**

   .. code-block:: bash

      # Convert to a different format (requires ImageMagick)
      for img in outputData/*/plots/*.png; do
          convert "$img" "${img%.png}.pdf"
      done

3. **Check plotting backend:**

   .. code-block:: python

      import matplotlib
      print(matplotlib.get_backend())

      # Set non-interactive backend if needed
      matplotlib.use('Agg')

Network and Download Issues
---------------------------

Reference Data Download Failures
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Error:** ``ConnectionError`` or ``TimeoutError`` when downloading reference data

**Solutions:**

1. **Check internet connection:**

   .. code-block:: bash

      ping google.com
      curl -I https://github.com

2. **Manual download:**

   .. code-block:: python

      from ideal_genom_qc.get_references import FetcherReference

      fetcher = FetcherReference(built="38")
      fetcher.download_references(
          force_redownload=True,
          timeout=300  # Increase timeout
      )

3. **Use local reference files:**

   .. code-block:: json

      {
        "high_ld_file": "/path/to/local/high-LD-regions.txt"
      }

Proxy or Firewall Issues
^^^^^^^^^^^^^^^^^^^^^^^^

**Error:** Download fails due to network restrictions

**Solutions:**

1. **Configure proxy:**

   .. code-block:: bash

      export http_proxy=http://proxy.company.com:8080
      export https_proxy=https://proxy.company.com:8080

2. **Download manually:** Get the reference files from the GitHub repository and place them locally.

Performance Optimization
------------------------

Slow Performance Debugging
^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Issue:** Pipeline runs slower than expected

**Debugging steps:**

1. **Profile system resources:**

   .. code-block:: bash

      # CPU and memory usage
      htop

      # Disk I/O
      iotop -a

      # Network usage (if downloading references)
      nethogs

2. **Identify bottlenecks:**

   .. code-block:: python

      import cProfile
      import pstats

      import ideal_genom_qc

      # Profile the QC pipeline; cProfile writes binary statistics, not plain text
      cProfile.run('ideal_genom_qc.main()', 'profile_output.prof')

      # Print the 20 most expensive calls by cumulative time
      pstats.Stats('profile_output.prof').sort_stats('cumulative').print_stats(20)

3. **Optimize based on bottleneck:**

   - **CPU bound:** Use fewer PCs, larger LD windows
   - **Memory bound:** Process in chunks, reduce dataset size
   - **I/O bound:** Use SSD, reduce intermediate file writes
   - **Network bound:** Download references once, use local files

Getting Help
------------

When to Seek Additional Help
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Contact the development team if you encounter:

- Reproducible bugs not covered in this guide
- Unexpected scientific results that need expert interpretation
- Feature requests for new functionality
- Performance issues on large datasets

How to Report Issues Effectively
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When reporting issues, please include:

1. **Complete error message** (copy-paste from terminal)

2. **Configuration files** (parameters.json, paths.json, steps.json)

3. **System information:**

   .. code-block:: bash

      # System info
      uname -a
      python --version
      pip show ideal-genom-qc
      plink --version
      plink2 --version

4. **Data characteristics:**

   .. code-block:: bash

      # Dataset size
      wc -l inputData/*.fam
      wc -l inputData/*.bim

5. **Steps to reproduce** the issue

6. **Expected vs. actual behavior**

**Where to get help:**

- GitHub Issues: https://github.com/cge-tubingens/IDEAL-GENOM-QC/issues
- Documentation: https://ideal-genom-qc.readthedocs.io/
- Email: see the contact information in the repository

Debug Mode
^^^^^^^^^^

Enable debug logging for more detailed information:

.. code-block:: python

   import logging
   logging.basicConfig(level=logging.DEBUG)

   # Run your QC pipeline with debug output

Or use the command line with verbose output:

.. code-block:: bash

   python -m ideal_genom_qc --verbose \
       --path_params config/parameters.json \
       --file_folders config/paths.json \
       --steps config/steps.json
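
To keep the debug output for an issue report, the logging setup can also write to a file. The following is a minimal sketch using only the Python standard library. It assumes the pipeline emits its messages through Python's ``logging`` module; ``configure_debug_logging`` and ``qc_debug.log`` are hypothetical names chosen for illustration and are not part of IDEAL-GENOM-QC:

.. code-block:: python

   import logging
   import sys
   from pathlib import Path


   def configure_debug_logging(log_path="qc_debug.log"):
       """Send DEBUG-level log records to the console and to a file.

       Hypothetical helper for illustration; not part of IDEAL-GENOM-QC.
       """
       handlers = [
           # Console output, as logging.basicConfig(level=DEBUG) would give
           logging.StreamHandler(sys.stderr),
           # File copy that can be attached to an issue report
           logging.FileHandler(log_path, mode="w", encoding="utf-8"),
       ]
       logging.basicConfig(
           level=logging.DEBUG,
           format="%(asctime)s %(levelname)s %(name)s: %(message)s",
           handlers=handlers,
           force=True,  # replace handlers set up earlier (Python 3.8+)
       )
       return Path(log_path)


   if __name__ == "__main__":
       log_file = configure_debug_logging()
       logging.getLogger("qc_run").debug("Debug logging enabled")
       # Run your QC pipeline here; records emitted via logging are captured too.
       print(f"Debug log written to {log_file.resolve()}")

Call the helper once, before starting the pipeline, so that all later log records reach both handlers. The resulting file can then be attached alongside the error message and configuration files when reporting an issue.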