Getting Started
===============

This guide will help you get up and running with IDEAL-GENOM quickly. We'll walk through setting up your first genomic analysis pipeline step by step using the new YAML-based configuration system.

Overview
--------

A typical IDEAL-GENOM analysis follows six steps:

1. **Prepare Your Data**: Ensure data is in PLINK 1.9 binary format (``.bed``, ``.bim``, ``.fam``)
2. **Generate Configuration**: Create a YAML configuration file
3. **Customize Pipeline**: Edit configuration to match your needs
4. **Validate Configuration**: Check for errors before running
5. **Execute Pipeline**: Run the analysis
6. **Review Results**: Examine outputs and visualizations

The New Configuration System
-----------------------------

IDEAL-GENOM v0.2.0 introduces a **YAML-based configuration system** that replaces the previous JSON approach. Benefits include:

- **Single File**: All settings in one place (no more separate parameters.json, paths.json, steps.json)
- **Hierarchical Structure**: Clear organization of pipeline steps and parameters
- **Variable Substitution**: Reference outputs from previous steps automatically
- **Enable/Disable Steps**: Easily control which analyses to run
- **Comments**: Built-in documentation within the config file

Quick Start: 5-Minute Tutorial
-------------------------------

**1. Get a Configuration Template**

Configuration templates are included in the repository under ``yaml_configs/``:

.. code-block:: bash

    # Clone the repository (if you haven't already)
    git clone https://github.com/cge-tubingens/ideal-genom-qc.git
    cd ideal-genom-qc
    
    # Copy the QC pipeline template
    cp yaml_configs/qc_pipeline_config_template.yaml my_qc_pipeline.yaml

Available templates:

- ``qc_pipeline_config_template.yaml`` - Complete QC pipeline
- ``gwas_config_template.yaml`` - GWAS analysis pipeline
- ``vcf_config_template.yaml`` - VCF post-imputation processing

**2. Edit the Configuration**

Open ``my_qc_pipeline.yaml`` and update the paths to match your data:

.. code-block:: yaml

    pipeline:
      name: "my_study_qc"
      base_output_dir: "/path/to/output"
      
      steps:
        - name: "sample_qc"
          enabled: true
          module: "ideal_genom.qc.sample_qc"
          class: "SampleQC"
          init_params:
            input_path: "/path/to/your/data"
            input_name: "mydata"
            output_path: "${base_output_dir}"
            output_name: "mydata_sampleQCed"
            high_ld_regions_file: "/path/to/high_ld_regions.txt"
            build: "38"

**3. Validate Your Configuration**

.. code-block:: bash

    ideal-genom validate --config my_qc_pipeline.yaml

**4. Run the Pipeline**

.. code-block:: bash

    ideal-genom run --config my_qc_pipeline.yaml

That's it! The pipeline will execute all enabled steps in order.

Step-by-Step Guide
------------------

Step 1: Prepare Your Data
^^^^^^^^^^^^^^^^^^^^^^^^^

IDEAL-GENOM works with PLINK1.9 binary format files:

- ``.bed``: Binary genotype data
- ``.bim``: Variant information (chromosome, position, alleles, etc.)
- ``.fam``: Sample information (family ID, individual ID, phenotype, etc.)

**Convert from VCF (if needed):**

.. code-block:: bash

    plink --vcf mydata.vcf.gz --make-bed --out mydata

**Data Requirements:**

- Genome build: GRCh37 (hg19) or GRCh38 (hg38)
- For ancestry QC: 1000 Genomes reference files (auto-downloaded if not provided)
- For high LD region filtering: high-LD-regions file (included with package)
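
Before pointing the pipeline at a dataset, it can help to confirm that the three PLINK files are present. The following is a minimal sketch and not part of IDEAL-GENOM: ``check_plink_fileset`` is a hypothetical helper whose arguments mirror the ``input_path`` and ``input_name`` settings used in this guide. It also checks the three-byte signature that PLINK 1.9 writes at the start of a variant-major ``.bed`` file.

.. code-block:: python

    from pathlib import Path

    # First three bytes of a PLINK 1.9 .bed file in variant-major mode.
    BED_MAGIC = bytes([0x6C, 0x1B, 0x01])


    def check_plink_fileset(input_path, input_name):
        """Return a list of problems found for the given PLINK prefix."""
        directory = Path(input_path)
        problems = []
        for ext in (".bed", ".bim", ".fam"):
            # Append the extension to the name so prefixes containing dots still work.
            candidate = directory / (input_name + ext)
            if not candidate.is_file():
                problems.append(f"missing file: {candidate}")
        bed = directory / (input_name + ".bed")
        if bed.is_file():
            with bed.open("rb") as handle:
                if handle.read(3) != BED_MAGIC:
                    problems.append(f"{bed} lacks the PLINK .bed signature")
        return problems

An empty list means the fileset passed these basic checks; it does not guarantee that the contents are otherwise valid.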

Step 2: Create Your Configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Option A: Use a Template (Recommended)**

Copy one of the provided templates from the repository:

.. code-block:: bash

    # Copy the QC pipeline template
    cp yaml_configs/qc_pipeline_config_template.yaml my_qc_pipeline.yaml
    
    # Or for GWAS analysis
    cp yaml_configs/gwas_config_template.yaml my_gwas_pipeline.yaml
    
    # Or for VCF processing
    cp yaml_configs/vcf_config_template.yaml my_vcf_pipeline.yaml

**Option B: Start from Scratch**

Create a minimal configuration file:

.. code-block:: yaml

    pipeline:
      name: "my_analysis"
      base_output_dir: "/data/output"
      
      steps:
        - name: "sample_qc"
          enabled: true
          module: "ideal_genom.qc.sample_qc"
          class: "SampleQC"
          init_params:
            input_path: "/data/input"
            input_name: "mydata"
            output_path: "${base_output_dir}"
            output_name: "mydata_sampleQCed"
            high_ld_regions_file: "auto"  # Use built-in file
            build: "38"
          execute_params:
            mind: 0.02
            sex_check: [0.2, 0.8]
            maf: 0.01
            het_deviation: 3
            kinship: 0.354

    settings:
      logging:
        level: "INFO"
      resources:
        max_memory: null  # Auto-detect
        max_threads: null  # Auto-detect

Step 3: Understanding the Configuration Structure
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The YAML configuration has two top-level sections, ``pipeline`` and ``settings``, and supports variable substitution between values:

**Pipeline Section**

.. code-block:: yaml

    pipeline:
      name: "pipeline_name"              # Descriptive name for your analysis
      base_output_dir: "/path/to/output" # All outputs will go here
      steps:                             # List of analysis steps (in order)
        - name: "step_name"
          enabled: true                  # Set to false to skip this step
          module: "ideal_genom.module"   # Python module path
          class: "ClassName"             # Class to instantiate
          init_params:                   # Parameters passed to __init__
            # ...
          execute_params:                # Parameters passed to execute()
            # ...
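
The ``module``, ``class``, and ``init_params`` keys suggest that each step is resolved by importing a module and instantiating a class. The sketch below illustrates that general pattern with ``importlib``. It is an assumption about the mechanism, not IDEAL-GENOM's actual executor, and ``build_step`` is a hypothetical name. A standard-library class stands in for a QC class so the example runs without IDEAL-GENOM installed.

.. code-block:: python

    import importlib


    def build_step(step):
        """Import step['module'], look up step['class'], and instantiate it."""
        if not step.get("enabled", True):
            return None  # disabled steps are skipped
        module = importlib.import_module(step["module"])
        cls = getattr(module, step["class"])
        return cls(**step.get("init_params", {}))


    # Stand-in step: fractions.Fraction plays the role of a QC class.
    demo_step = {
        "name": "demo",
        "enabled": True,
        "module": "fractions",
        "class": "Fraction",
        "init_params": {"numerator": 3, "denominator": 4},
    }
    instance = build_step(demo_step)
    print(instance)  # prints 3/4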

**Variable Substitution**

Reference values from elsewhere in the config:

.. code-block:: yaml

    pipeline:
      base_output_dir: "/data/output"
      steps:
        - name: "sample_qc"
          init_params:
            output_path: "${base_output_dir}"  # Uses /data/output
        
        - name: "variant_qc"
          init_params:
            # Use output from previous step
            input_path: "${steps.sample_qc.clean_dir}"
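
To make the ``${...}`` syntax concrete, here is a minimal sketch of placeholder substitution using a regular expression. It only illustrates the idea and is not the resolver IDEAL-GENOM uses; ``substitute`` and the flat ``context`` dictionary are hypothetical, and the ``clean_dir`` value is made up for the example.

.. code-block:: python

    import re

    PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


    def substitute(value, context):
        """Replace each ${key} in value with context[key]; unknown keys stay as-is."""
        def repl(match):
            key = match.group(1)
            return str(context.get(key, match.group(0)))
        return PLACEHOLDER.sub(repl, value)


    context = {
        "base_output_dir": "/data/output",
        # Hypothetical value: the real path is reported by the sample_qc step at run time.
        "steps.sample_qc.clean_dir": "/data/output/sample_qc/clean_files",
    }
    print(substitute("${base_output_dir}", context))
    print(substitute("${steps.sample_qc.clean_dir}", context))

Leaving unknown placeholders untouched, rather than raising an error, is a design choice of this sketch; a real resolver may prefer to fail early.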

**Settings Section**

.. code-block:: yaml

    settings:
      logging:
        level: "INFO"                    # DEBUG, INFO, WARNING, ERROR
        file_logging: true               # Log to file
        console_logging: true            # Log to console
      
      resources:
        max_memory: null                 # null = auto-detect (uses 2/3 available)
        max_threads: null                # null = auto-detect (uses cores - 2)
      
      files:
        keep_intermediate: true          # Keep temporary files
        compress_outputs: false          # Compress output files
        overwrite_existing: false        # Overwrite existing results
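
The auto-detection rules noted in the comments above (two thirds of available memory, all cores minus two) can be written out as a short sketch. The function names, the megabyte unit, and the floor of one thread are assumptions made for illustration; the package's own logic may differ.

.. code-block:: python

    import os


    def resolve_threads(max_threads=None, cpu_count=None):
        """Return max_threads if set, otherwise cores - 2 (assumed floor of 1)."""
        if max_threads is not None:
            return max_threads
        cores = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
        return max(1, cores - 2)


    def resolve_memory_mb(max_memory=None, available_mb=0):
        """Return max_memory if set, otherwise two thirds of available_mb."""
        if max_memory is not None:
            return max_memory
        return (2 * available_mb) // 3


    print(resolve_threads(cpu_count=8))            # 8 cores -> 6 threads
    print(resolve_memory_mb(available_mb=12000))   # 12000 MB -> 8000 MB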

Step 4: Configure Your Pipeline Steps
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Sample QC** - Remove low-quality samples

.. code-block:: yaml

    - name: "sample_qc"
      enabled: true
      module: "ideal_genom.qc.sample_qc"
      class: "SampleQC"
      init_params:
        input_path: "/data/input"
        input_name: "mydata"
        output_path: "${base_output_dir}"
        output_name: "mydata_sampleQCed"
        high_ld_regions_file: "auto"
        build: "38"
      execute_params:
        rename_snp: true              # Rename SNPs to chr:pos format
        hh_to_missing: true           # Convert homozygous haploid calls to missing
        use_kinship: true             # Use kinship instead of IBD
        ind_pair: [50, 5, 0.2]        # LD pruning: window, step, r² threshold
        mind: 0.02                    # Max missing rate per individual (2%)
        sex_check: [0.2, 0.8]         # F coefficient bounds [female_max, male_min]
        maf: 0.01                     # Minor allele frequency threshold
        het_deviation: 3              # Heterozygosity SD threshold
        kinship: 0.354                # Kinship cutoff (KING scale: > 0.354 flags duplicates/MZ twins)
        ibd_threshold: 0.185          # IBD (PI_HAT) threshold for flagging related pairs
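
The ``sex_check`` and ``het_deviation`` thresholds are easier to reason about with a small example. The sketch below assumes the usual conventions: an X-chromosome F coefficient below the first bound is called female, above the second bound male, and heterozygosity outliers are samples more than ``het_deviation`` standard deviations from the mean. Both helpers are hypothetical and are not IDEAL-GENOM functions.

.. code-block:: python

    from statistics import mean, stdev


    def sex_from_f(f_stat, bounds=(0.2, 0.8)):
        """Classify inferred sex from the X-chromosome F coefficient."""
        female_max, male_min = bounds
        if f_stat < female_max:
            return "female"
        if f_stat > male_min:
            return "male"
        return "ambiguous"


    def het_outliers(het_rates, het_deviation=3):
        """Return sample IDs whose rate is more than het_deviation SDs from the mean."""
        values = list(het_rates.values())
        centre, spread = mean(values), stdev(values)
        return [s for s, v in het_rates.items()
                if abs(v - centre) > het_deviation * spread]

Samples classified as ``ambiguous``, or whose inferred sex disagrees with the reported sex, are the ones a sex check would typically flag for review.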

**Ancestry QC** - Detect population outliers

.. code-block:: yaml

    - name: "ancestry_qc"
      enabled: true
      module: "ideal_genom.qc.ancestry_qc"
      class: "AncestryQC"
      init_params:
        input_path: "${steps.sample_qc.clean_dir}"
        input_name: "${steps.sample_qc.output_name}"
        output_path: "${base_output_dir}"
        output_name: "mydata_ancestryQCed"
        high_ld_regions_file: "auto"
        build: "38"
      execute_params:
        ind_pair: [50, 5, 0.2]        # LD pruning for PCA
        pca: 10                       # Number of PCs to compute
        maf: 0.01                     # MAF threshold
        ref_threshold: 4              # SD threshold for reference outliers
        stu_threshold: 4              # SD threshold for study outliers
        reference_pop: "EUR"          # Expected population (EUR, AFR, AMR, EAS, SAS)
        num_pcs: 10                   # Number of PCs for ancestry assignment

**Variant QC** - Remove low-quality variants

.. code-block:: yaml

    - name: "variant_qc"
      enabled: true
      module: "ideal_genom.qc.variant_qc"
      class: "VariantQC"
      init_params:
        input_path: "${steps.ancestry_qc.clean_dir}"
        input_name: "${steps.ancestry_qc.output_name}"
        output_path: "${base_output_dir}"
        output_name: "mydata_variantQCed"
      execute_params:
        miss_data_rate: 0.02          # Max missing rate across all samples
        diff_genotype_rate: 1.0e-5    # Differential missingness p-value
        geno: 0.02                    # Max missing rate per variant
        maf: 0.01                     # Minor allele frequency threshold
        hwe: 1.0e-6                   # Hardy-Weinberg equilibrium p-value
        chr_y: 24                     # Y chromosome code (24 in PLINK's numeric chromosome coding)

Step 5: Validate Your Configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Before running the pipeline, validate your configuration:

.. code-block:: bash

    ideal-genom validate --config my_qc_pipeline.yaml

This checks for:

- File paths existence
- Required parameters
- Parameter value ranges
- Module and class availability
- Configuration syntax

**Example output:**

.. code-block:: text

    ✓ Configuration file is valid
    ✓ Pipeline 'my_study_qc' configured with 3/3 enabled steps
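
For intuition, a few of the checks listed above can be sketched in plain Python on a configuration that has already been loaded into a dictionary. This is an illustrative approximation, not the logic behind ``ideal-genom validate``: ``validate_config``, the set of required keys, and the MAF range are all assumptions.

.. code-block:: python

    from pathlib import Path

    # Assumed set of required keys, based on the examples in this guide.
    REQUIRED_STEP_KEYS = ("name", "module", "class", "init_params")


    def validate_config(config):
        """Return a list of problems; an empty list means none were found."""
        pipeline = config.get("pipeline")
        if not isinstance(pipeline, dict):
            return ["missing 'pipeline' section"]
        problems = []
        for index, step in enumerate(pipeline.get("steps", [])):
            label = step.get("name", f"step #{index + 1}")
            for key in REQUIRED_STEP_KEYS:
                if key not in step:
                    problems.append(f"{label}: missing required key '{key}'")
            input_path = step.get("init_params", {}).get("input_path", "")
            # Paths containing ${...} are only known at run time, so skip them.
            if input_path and "${" not in input_path and not Path(input_path).is_dir():
                problems.append(f"{label}: input_path does not exist: {input_path}")
            maf = step.get("execute_params", {}).get("maf")
            if maf is not None and not 0 <= maf <= 0.5:
                problems.append(f"{label}: maf must be between 0 and 0.5")
        return problems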

Step 6: Run the Pipeline
^^^^^^^^^^^^^^^^^^^^^^^^^

**Basic Execution**

.. code-block:: bash

    ideal-genom run --config my_qc_pipeline.yaml

**Dry Run (Preview Without Executing)**

.. code-block:: bash

    ideal-genom run --config my_qc_pipeline.yaml --dry-run

**Example dry-run output:**

.. code-block:: text

    ============================================================
    PIPELINE SUMMARY (DRY RUN)
    ============================================================
    Pipeline Name: my_study_qc
    Output Directory: /data/output
    Total Steps: 3
    Enabled Steps: 3
    
    Enabled Steps:
      1. sample_qc (ideal_genom.qc.sample_qc.SampleQC)
      2. ancestry_qc (ideal_genom.qc.ancestry_qc.AncestryQC)
      3. variant_qc (ideal_genom.qc.variant_qc.VariantQC)
    ============================================================
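
The summary above is essentially a filtered view of the ``steps`` list. As a rough sketch (``summarize`` is a hypothetical helper, and the dictionary stands in for a parsed configuration), the same information can be derived like this:

.. code-block:: python

    def summarize(config):
        """Return (total_steps, labels_of_enabled_steps) for a pipeline config."""
        steps = config["pipeline"].get("steps", [])
        enabled = [
            f"{s['name']} ({s['module']}.{s['class']})"
            for s in steps
            if s.get("enabled", True)
        ]
        return len(steps), enabled


    demo = {"pipeline": {"name": "my_study_qc", "steps": [
        {"name": "sample_qc", "enabled": True,
         "module": "ideal_genom.qc.sample_qc", "class": "SampleQC"},
        {"name": "ancestry_qc", "enabled": False,
         "module": "ideal_genom.qc.ancestry_qc", "class": "AncestryQC"},
    ]}}
    total, enabled = summarize(demo)
    for number, label in enumerate(enabled, start=1):
        print(f"{number}. {label}")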

**Custom Logging Level**

.. code-block:: bash

    ideal-genom run --config my_qc_pipeline.yaml --log-level DEBUG

Step 7: Understanding the Results
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

After pipeline execution, your output directory will contain:

.. code-block:: text

    /data/output/
    ├── my_study_qc/                    # Pipeline-specific directory
    │   ├── sample_qc/
    │   │   ├── clean_files/            # QC-passed data
    │   │   │   ├── mydata_sampleQCed.bed
    │   │   │   ├── mydata_sampleQCed.bim
    │   │   │   └── mydata_sampleQCed.fam
    │   │   ├── fail_samples/           # Removed samples with reasons
    │   │   │   ├── failed_mind.txt
    │   │   │   ├── failed_sexcheck.txt
    │   │   │   ├── failed_het.txt
    │   │   │   └── failed_kinship.txt
    │   │   └── plots/                  # Visualization reports
    │   │       ├── call_rate.png
    │   │       ├── heterozygosity.png
    │   │       ├── sex_check.png
    │   │       └── kinship_distribution.png
    │   ├── ancestry_qc/
    │   │   ├── clean_files/
    │   │   ├── fail_samples/
    │   │   │   └── ancestry_outliers.txt
    │   │   └── plots/
    │   │       ├── pca_all_samples.png
    │   │       ├── pca_after_qc.png
    │   │       └── scree_plot.png
    │   └── variant_qc/
    │       ├── clean_files/            # Final QC-passed variants
    │       │   ├── mydata_variantQCed.bed  # Ready for GWAS!
    │       │   ├── mydata_variantQCed.bim
    │       │   └── mydata_variantQCed.fam
    │       ├── fail_variants/
    │       │   ├── failed_geno.txt
    │       │   ├── failed_hwe.txt
    │       │   └── failed_maf.txt
    │       └── plots/
    │           ├── maf_distribution.png
    │           ├── hwe_distribution.png
    │           └── missingness.png
    └── pipeline.log                    # Complete execution log

**Key Output Files:**

- **clean_files/**: Final PLINK binary files ready for downstream analysis (GWAS, etc.)
- **fail_samples/** and **fail_variants/**: Lists of excluded samples and variants, one file per QC failure reason
- **plots/**: Publication-ready visualizations for QC reporting
- **pipeline.log**: Detailed log of all operations, parameters, and results
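
A quick way to see how many samples or variants each filter removed is to count the entries in the ``failed_*.txt`` files. The sketch below assumes one record per line with no header row, which may not match the actual file layout; ``count_failures`` is a hypothetical helper.

.. code-block:: python

    from pathlib import Path


    def count_failures(fail_dir):
        """Map each failed_*.txt file in fail_dir to its number of non-empty lines."""
        counts = {}
        for path in sorted(Path(fail_dir).glob("failed_*.txt")):
            with path.open() as handle:
                counts[path.name] = sum(1 for line in handle if line.strip())
        return counts

For example, ``count_failures("/data/output/my_study_qc/sample_qc/fail_samples")`` would return a dictionary keyed by file name for the layout shown above.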

Using the Python API
---------------------

For more control, use the Python API directly:

**Basic Example**

.. code-block:: python

    from ideal_genom.core.config import load_config
    from ideal_genom.core.pipeline import PipelineExecutor
    
    # Load configuration
    config = load_config("my_qc_pipeline.yaml")
    
    # Create and execute pipeline
    executor = PipelineExecutor(config)
    executor.execute()

**Advanced Example with Custom Handling**

.. code-block:: python

    from ideal_genom.core.config import load_config
    from ideal_genom.core.pipeline import PipelineExecutor
    import logging
    
    # Setup custom logging
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
    )
    
    # Load configuration
    config = load_config("my_qc_pipeline.yaml")
    
    # Create executor
    executor = PipelineExecutor(config, dry_run=False)
    
    # Get pipeline summary
    summary = executor.get_pipeline_summary()
    print(f"Running pipeline: {summary['pipeline_name']}")
    print(f"Enabled steps: {summary['enabled_steps']}")
    
    # Execute
    try:
        executor.execute()
        print("✓ Pipeline completed successfully!")
    except Exception as e:
        print(f"✗ Pipeline failed: {e}")
        raise

**Using Individual Modules**

.. code-block:: python

    from ideal_genom.qc.sample_qc import SampleQC
    from pathlib import Path
    
    # Initialize Sample QC
    sample_qc = SampleQC(
        input_path=Path("/data/input"),
        input_name="mydata",
        output_path=Path("/data/output"),
        output_name="mydata_sampleQCed",
        high_ld_regions_file="auto",
        build="38"
    )
    
    # Run with custom parameters
    sample_qc.execute_sample_qc_pipeline(sample_params={
        "rename_snp": True,
        "mind": 0.02,
        "sex_check": [0.2, 0.8],
        "maf": 0.01,
        "het_deviation": 3,
        "kinship": 0.354
    })
    
    # Access results
    print(f"Clean data saved to: {sample_qc.clean_dir}")

Common Workflows
----------------

**Workflow 1: Complete QC Pipeline**

.. code-block:: yaml

    pipeline:
      name: "full_qc"
      base_output_dir: "/data/output"
      steps:
        - name: "sample_qc"
          enabled: true
          # ... (sample QC config)
        - name: "ancestry_qc"
          enabled: true
          # ... (ancestry QC config)
        - name: "variant_qc"
          enabled: true
          # ... (variant QC config)

**Workflow 2: Skip Ancestry QC (Homogeneous Population)**

.. code-block:: yaml

    pipeline:
      steps:
        - name: "sample_qc"
          enabled: true
          # ...
        - name: "ancestry_qc"
          enabled: false  # Skip ancestry analysis
        - name: "variant_qc"
          enabled: true
          init_params:
            # Connect directly to sample QC output
            input_path: "${steps.sample_qc.clean_dir}"
            input_name: "${steps.sample_qc.output_name}"
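
When a step is disabled, any ``${steps.<name>...}`` reference to it in a later step may no longer resolve, which is why ``variant_qc`` is rewired to ``sample_qc`` above. The following sketch scans a list of step dictionaries for such references. It is a hypothetical pre-flight check, not an IDEAL-GENOM feature.

.. code-block:: python

    import re

    STEP_REF = re.compile(r"\$\{steps\.([^.}]+)\.[^}]+\}")


    def dangling_references(steps):
        """Return (step, referenced_step) pairs where an enabled step
        references a disabled one in its init_params."""
        disabled = {s["name"] for s in steps if not s.get("enabled", True)}
        found = []
        for step in steps:
            if not step.get("enabled", True):
                continue
            for value in step.get("init_params", {}).values():
                if not isinstance(value, str):
                    continue
                for ref in STEP_REF.findall(value):
                    if ref in disabled:
                        found.append((step["name"], ref))
        return found


    steps = [
        {"name": "sample_qc", "enabled": True},
        {"name": "ancestry_qc", "enabled": False},
        {"name": "variant_qc", "enabled": True,
         "init_params": {"input_path": "${steps.ancestry_qc.clean_dir}"}},
    ]
    print(dangling_references(steps))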

**Workflow 3: Resume from Previous Step**

.. code-block:: yaml

    pipeline:
      steps:
        - name: "sample_qc"
          enabled: false  # Already completed
        - name: "ancestry_qc"
          enabled: false  # Already completed
        - name: "variant_qc"
          enabled: true
          init_params:
            # Use existing ancestry QC results
            input_path: "/data/output/my_study_qc/ancestry_qc/clean_files"
            input_name: "mydata_ancestryQCed"

Tips and Best Practices
------------------------

**Configuration Management**

- Use descriptive pipeline names
- Comment your configuration extensively
- Keep configuration files in version control (git)
- Create separate configs for different studies/populations

**Resource Management**

- Set ``max_memory`` and ``max_threads`` to ``null`` for auto-detection
- For large datasets (>100K samples), consider increasing memory allocation
- Monitor logs for memory/performance issues

**Quality Control Thresholds**

- The thresholds shown in this guide are a reasonable starting point for most datasets
- For rare-variant analysis, lower the MAF threshold (e.g., 0.001)
- For array data, consider a stricter HWE threshold (e.g., 1e-10)
- Adjust the kinship threshold to your study design (family-based vs. unrelated samples)
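
When choosing a kinship threshold, it helps to know which relationships each cutoff separates. The sketch below uses the inference ranges published for KING-style kinship coefficients. Whether IDEAL-GENOM's ``kinship`` parameter is on exactly this scale is an assumption, and ``relationship_from_kinship`` is a hypothetical helper.

.. code-block:: python

    def relationship_from_kinship(kinship):
        """Map a KING-style kinship coefficient to a relationship category."""
        if kinship > 0.354:
            return "duplicate or MZ twin"
        if kinship > 0.177:
            return "first degree"
        if kinship > 0.0884:
            return "second degree"
        if kinship > 0.0442:
            return "third degree"
        return "unrelated"


    for value in (0.5, 0.25, 0.125, 0.0625, 0.0):
        print(value, relationship_from_kinship(value))

On this scale, a cutoff of 0.354 removes only duplicates and monozygotic twins, while 0.0884 also removes first- and second-degree relatives.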

**File Organization**

- Use consistent naming conventions
- Keep intermediate files during initial runs (``keep_intermediate: true``)
- Enable logging to files (``file_logging: true``)
- Generate visualization reports (``generate_reports: true``)

**Debugging**

- Always validate configuration before running
- Use ``--dry-run`` to preview pipeline execution
- Set ``--log-level DEBUG`` for detailed troubleshooting
- Check fail_samples/fail_variants files to understand QC failures

Troubleshooting Common Issues
------------------------------

**Issue: "Module not found" error**

.. code-block:: text

    Solution: Check that the module path in your config is correct.
    Example: "ideal_genom.qc.sample_qc" not "ideal_genom_qc.sample_qc"
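
To check a ``module``/``class`` pair outside the pipeline, a short standard-library sketch such as the following can help. ``check_step_target`` is a hypothetical helper; it only reports whether Python can resolve the names in the current environment.

.. code-block:: python

    import importlib
    import importlib.util


    def check_step_target(module_name, class_name):
        """Return None if module_name.class_name resolves, else an error message."""
        try:
            spec = importlib.util.find_spec(module_name)
        except (ImportError, ValueError):
            # Raised when a parent package of a dotted name cannot be imported.
            spec = None
        if spec is None:
            return f"module not found: {module_name}"
        module = importlib.import_module(module_name)
        if not hasattr(module, class_name):
            return f"class {class_name} not found in {module_name}"
        return None

Calling ``check_step_target("ideal_genom.qc.sample_qc", "SampleQC")`` in the environment where IDEAL-GENOM is installed should return ``None`` if the configuration values are spelled correctly.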

**Issue: "File not found" for input data**

.. code-block:: text

    Solution: Ensure paths are absolute or relative to execution directory.
    Use ${base_output_dir} for variable substitution.

**Issue: Pipeline runs but produces no output**

.. code-block:: text

    Solution: Check that steps are enabled: true in configuration.
    Verify input files exist at specified paths.

**Issue: High memory usage**

.. code-block:: text

    Solution: Set max_memory explicitly in settings.resources.
    Consider splitting large datasets or increasing available RAM.

Next Steps
----------

Now that you understand the basics:

- **Explore Examples**: See :doc:`examples` for complete workflows
- **Understand Configuration**: Read :doc:`configuration` for all parameters
- **Learn GWAS**: Check :doc:`gwas_pipeline` for association analysis
- **Process VCF Files**: See :doc:`vcf_pipeline` for post-imputation workflows
- **API Reference**: Browse module documentation for advanced usage

**Additional Resources:**

- Configuration templates: Clone the repository to access the ``yaml_configs/`` directory
- Example notebooks in the ``notebooks/`` directory
- :doc:`faq` for frequently asked questions
- :doc:`troubleshooting` for detailed problem-solving

**Getting Help:**

- GitHub Issues: https://github.com/cge-tubingens/IDEAL-GENOM-QC/issues
- Check logs: Review ``pipeline.log`` for detailed execution information
- Community: Join discussions on the GitHub repository