Ancestry QC Module

The Ancestry QC module performs population structure analysis and ancestry-based quality control.

Main Class

class ideal_genom.qc.ancestry_qc.AncestryQC(input_path: Path, input_name: str, output_path: Path, output_name: str, high_ld_regions_file: Path, reference_files: dict = {}, recompute_merge: bool = True, build: str = '38', rename_snps: bool = False)

Bases: object

__init__(input_path: Path, input_name: str, output_path: Path, output_name: str, high_ld_regions_file: Path, reference_files: dict = {}, recompute_merge: bool = True, build: str = '38', rename_snps: bool = False) → None

Initialize AncestryQC class.

This class performs ancestry quality control analysis on genetic data by merging it with 1000 Genomes reference data and running principal component analysis.

Parameters:

input_path: Path

Path to directory containing input files

input_name: str

Base name of input files (without extension)

output_path: Path

Path to directory where output files will be saved

output_name: str

Base name for output files

high_ld_regions_file: Path

Path to file containing high LD regions to exclude

reference_files: dict (optional)

Dictionary with paths to reference files. Must contain ‘bim’, ‘bed’, ‘fam’ and ‘psam’ keys. If not provided, will download 1000 Genomes reference files. Defaults to empty dict.

recompute_merge: bool (optional)

Whether to recompute merge with reference even if merged files exist. Defaults to True.

build: str (optional)

Genome build version, either ‘37’ or ‘38’. Defaults to ‘38’.

rename_snps: bool (optional)

Whether to rename SNPs to avoid duplicates during merge. Defaults to False.

raises TypeError:

If input arguments are not of expected types

raises ValueError:

If build is not ‘37’ or ‘38’

raises FileNotFoundError:

If input_path or output_path do not exist
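The documented exceptions can be reproduced before constructing the class. The following is a minimal sketch of such a pre-flight check, not the library's own validation code: the helper name `check_ancestry_qc_args` is hypothetical, and the order in which the real constructor performs its checks is an assumption.

```python
from pathlib import Path
import tempfile


def check_ancestry_qc_args(input_path, input_name, output_path, output_name,
                           high_ld_regions_file, build="38"):
    """Hypothetical helper mirroring the exceptions documented for AncestryQC."""
    path_args = {"input_path": input_path, "output_path": output_path,
                 "high_ld_regions_file": high_ld_regions_file}
    str_args = {"input_name": input_name, "output_name": output_name,
                "build": build}
    for label, value in path_args.items():
        if not isinstance(value, Path):
            raise TypeError(f"{label} must be a pathlib.Path")
    for label, value in str_args.items():
        if not isinstance(value, str):
            raise TypeError(f"{label} must be a str")
    if build not in ("37", "38"):
        raise ValueError("build must be '37' or '38'")
    for label in ("input_path", "output_path"):
        if not path_args[label].exists():
            raise FileNotFoundError(f"{label} does not exist: {path_args[label]}")
    # Keyword arguments named exactly as in the documented signature.
    return {**path_args, **str_args}


# Demonstration with a temporary directory standing in for real data folders.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    kwargs = check_ancestry_qc_args(folder, "study", folder, "study_clean",
                                    folder / "high-LD-regions.txt")
```

If the checks pass, the returned dictionary could be unpacked into the constructor, for example `AncestryQC(**kwargs)`. All file and base names in the demonstration are placeholders.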

Notes

Creates the following directory structure under output_path:

  • ancestry_qc_results/
    • merging/

    • plots/

    • fail_samples/

    • clean_files/
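The layout above can be reproduced with `pathlib`. This is an illustrative sketch only: the helper `make_results_tree` is a hypothetical name, and the class may create its directories differently.

```python
from pathlib import Path
import tempfile

# Sub-directory names taken from the documented structure.
SUBDIRS = ("merging", "plots", "fail_samples", "clean_files")


def make_results_tree(output_path: Path) -> Path:
    """Create ancestry_qc_results/ and its sub-directories under output_path."""
    results = output_path / "ancestry_qc_results"
    for name in SUBDIRS:
        (results / name).mkdir(parents=True, exist_ok=True)
    return results


with tempfile.TemporaryDirectory() as tmp:
    results = make_results_tree(Path(tmp))
    created = sorted(p.name for p in results.iterdir())
```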

merge_reference_study(ind_pair: list = [50, 5, 0.2]) → None

Merge reference and study data by applying quality control filters and merging steps. This method performs the following quality control steps to merge study data with reference data:

  1. Filters problematic SNPs
  2. Performs LD pruning
  3. Fixes chromosome mismatches
  4. Fixes position mismatches
  5. Fixes allele flips
  6. Removes remaining mismatches
  7. Merges the datasets

Parameters:

ind_pair (list, default [50, 5, 0.2]) – Parameters for LD pruning: [window size, step size, r2 threshold]

Return type:

None

Notes

If recompute_merge is False, the method will skip the merging process and expect merged data to already exist in the merging directory.

Raises:

TypeError – If ind_pair is not a list
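To make the meaning of `ind_pair` concrete, the sketch below maps its three values onto the arguments of PLINK's `--indep-pairwise` option, which takes a window size, a step size and an r2 threshold in that order. Whether `merge_reference_study` invokes PLINK in exactly this way is an assumption, and `ld_prune_args` is a hypothetical helper.

```python
def ld_prune_args(ind_pair=None):
    """Translate [window size, step size, r2 threshold] into command-line arguments."""
    if ind_pair is None:
        ind_pair = [50, 5, 0.2]  # documented default
    if not isinstance(ind_pair, list):
        raise TypeError("ind_pair must be a list")
    if len(ind_pair) != 3:
        raise ValueError("ind_pair must hold window size, step size and r2 threshold")
    window, step, r2 = ind_pair
    return ["--indep-pairwise", str(window), str(step), str(r2)]


default_args = ld_prune_args()
```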

execute_pca(ref_population: str, pca: int = 10, maf: float = 0.01, num_pca: int = 10, ref_threshold: float = 4, stu_threshold: float = 4, distance_metric: str | float = 'infinity') → None

Performs Principal Component Analysis (PCA) on genetic data and identifies ancestry outliers.

This method executes a complete PCA workflow including:

  1. Running the PCA analysis
  2. Identifying ancestry outliers using distance-based detection
  3. Removing identified outliers

Parameters:
  • ref_population (str) – Reference population identifier for ancestry comparison

  • pca (int, optional) – Number of principal components to calculate (default=10)

  • maf (float, optional) – Minor allele frequency threshold for filtering (default=0.01)

  • num_pca (int, optional) – Number of principal components to use in outlier detection (default=10)

  • ref_threshold (float, optional) – Distance threshold for reference population outlier detection (default=4)

  • stu_threshold (float, optional) – Distance threshold for study population outlier detection (default=4)

  • aspect_ratio (str or float, optional) – Aspect ratio for PCA plots (default=’equal’)

  • distance_metric (str or float, optional) – Distance metric to use for outlier detection. Default is ‘infinity’ (Chebyshev distance).
    • ‘infinity’ or ‘chebyshev’ → Chebyshev distance (L∞ norm)
    • numeric p >= 1 → Minkowski distance with order p (e.g., 2 for Euclidean)

  • explained_variance_threshold (float, optional) – Threshold for reporting significant principal components based on explained variance (default=0.01)

Returns:

Results are saved to specified output directories

Return type:

None

Notes

The method uses the GenomicOutlierAnalyzer class to perform the analysis and saves results in the directories specified during class initialization. The distance-based outlier detection provides more robust identification of ancestry outliers compared to per-dimension thresholds.
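The two families of distance metric can be illustrated with a small, self-contained sketch. It is not the GenomicOutlierAnalyzer implementation: standardising each principal component by the reference mean and standard deviation before measuring distance is an assumption made for this example, and all function names are hypothetical.

```python
import math
from statistics import mean, stdev


def metric_order(distance_metric):
    """Map the documented distance_metric values onto a Minkowski order p."""
    if isinstance(distance_metric, str):
        if distance_metric.lower() in ("infinity", "chebyshev"):
            return math.inf
        raise ValueError(f"unknown distance metric: {distance_metric!r}")
    p = float(distance_metric)
    if p < 1:
        raise ValueError("numeric distance_metric must be >= 1")
    return p


def norm(vector, p):
    """Chebyshev norm for p = infinity, Minkowski norm of order p otherwise."""
    if math.isinf(p):
        return max(abs(v) for v in vector)
    return sum(abs(v) ** p for v in vector) ** (1.0 / p)


def flag_outliers(reference, study, threshold=4.0, distance_metric="infinity"):
    """Return IDs of study samples farther than threshold from the reference centre.

    reference: list of PC coordinate lists for the reference population.
    study: dict mapping sample ID to its PC coordinate list.
    Distances are measured in reference standard deviations (an assumption).
    """
    p = metric_order(distance_metric)
    n_dims = len(reference[0])
    centres = [mean(row[i] for row in reference) for i in range(n_dims)]
    scales = [stdev(row[i] for row in reference) for i in range(n_dims)]
    flagged = []
    for sample_id, coords in study.items():
        z = [(c - m) / s for c, m, s in zip(coords, centres, scales)]
        if norm(z, p) > threshold:
            flagged.append(sample_id)
    return flagged


# Made-up coordinates: the reference has mean 0 and standard deviation 1 on both PCs.
reference = [[0, 0], [1, 1], [-1, -1], [1, -1], [-1, 1]]
study = {"s1": [0.5, 0.5], "s2": [5.0, 0.0], "s3": [3.0, 3.0]}
chebyshev_fails = flag_outliers(reference, study, 4.0, "infinity")
euclidean_fails = flag_outliers(reference, study, 4.0, 2)
```

In this toy data, sample `s3` lies 3 standard deviations from the centre on each axis, so it passes under the Chebyshev metric but fails under the Euclidean metric, where its distance is about 4.24.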

execute_ancestry_qc_pipeline(ancestry_params: dict) → None

Execute complete ancestry QC pipeline.

This method runs the full ancestry quality control workflow including:

  1. Merging reference and study data
  2. Cleaning intermediate files
  3. Running PCA analysis and outlier detection

Parameters:

ancestry_params (dict) – Dictionary containing pipeline parameters. Required keys:

  • ind_pair : list - LD pruning parameters [window, step, r2]

  • reference_pop : str - Reference population name

  • pca : int - Number of PCs to calculate

  • maf : float - MAF threshold

  • num_pcs : int - Number of PCs for outlier detection

  • ref_threshold : float - Reference outlier threshold

  • stu_threshold : float - Study outlier threshold

  • aspect_ratio : str or float - Plot aspect ratio

  • distance_metric : str or float - Distance metric for outliers

  • explained_variance_threshold : float - Variance threshold for reporting

Return type:

None
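A parameter dictionary with the required keys might look like the sketch below. The numeric values repeat the defaults documented for the individual methods; the `reference_pop` value `'EUR'` is only an illustrative placeholder, and `missing_keys` is a hypothetical helper for checking a dictionary before passing it to the pipeline.

```python
# Keys listed as required for execute_ancestry_qc_pipeline.
REQUIRED_KEYS = (
    "ind_pair", "reference_pop", "pca", "maf", "num_pcs",
    "ref_threshold", "stu_threshold", "aspect_ratio",
    "distance_metric", "explained_variance_threshold",
)

ancestry_params = {
    "ind_pair": [50, 5, 0.2],          # LD pruning: window, step, r2
    "reference_pop": "EUR",            # placeholder population name
    "pca": 10,
    "maf": 0.01,
    "num_pcs": 10,
    "ref_threshold": 4,
    "stu_threshold": 4,
    "aspect_ratio": "equal",
    "distance_metric": "infinity",
    "explained_variance_threshold": 0.01,
}


def missing_keys(params):
    """Return the required keys that are absent from params, in documented order."""
    return [key for key in REQUIRED_KEYS if key not in params]


absent = missing_keys(ancestry_params)
```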

Supporting Classes

class ideal_genom.qc.ancestry_qc.AncestryQCReport(output_path: Path, einvectors: Path, eigenvalues: Path, ancestry_fails: Path, population_tags: Path)

Bases: object

__init__(output_path: Path, einvectors: Path, eigenvalues: Path, ancestry_fails: Path, population_tags: Path) → None

Initialize the AncestryQCReport class for generating ancestry QC reports and visualizations.

Parameters:
  • output_path (Path) – Path to output directory for reports and plots

  • population_tags (Path) – Path to population tags file

  • einvectors (Path) – Path to eigenvectors file from PCA

  • eigenvalues (Path) – Path to eigenvalues file from PCA

  • ancestry_fails (Path) – Path to ancestry fails file

Raises:

TypeError – If output_path is not a Path object, or if output_name is not a string

report_ancestry_qc(reference_pop: str, aspect_ratio: Literal['auto', 'equal'] | float = 'equal', format: str = 'svg') → None

draw_pca_plot(reference_pop: str, aspect_ratio: Literal['auto', 'equal'] | float, exclude_outliers: bool = False, plot_dir: Path = PosixPath('.'), plot_name: str = 'pca_plot', format: str = 'svg') → None

Generate 2D and 3D PCA plots from eigenvector data and population tags. This method creates two PCA visualization plots:

  • A 2D scatter plot showing PC1 vs PC2 colored by super-population

  • A 3D scatter plot showing PC1 vs PC2 vs PC3 colored by super-population

Parameters:
  • reference_pop (str) – Reference population identifier for zoomed plots

  • aspect_ratio (Union[Literal['auto', 'equal'], float]) – Aspect ratio for the plot axes. Can be ‘auto’, ‘equal’, or a numeric value

  • exclude_outliers (bool, default=False) – Whether to exclude ancestry outliers from the plots

  • plot_dir (Path, optional) – Directory path where plots will be saved. Defaults to current directory. If directory doesn’t exist, plots will be saved in self.output_path

  • plot_name (str, optional) – Base name for the plot files. Defaults to ‘pca_plot’. Final filenames will be prefixed with ‘2D-’ and ‘3D-’

Return type:

None

Raises:
  • TypeError – If plot_dir is not a Path object, if plot_name is not a string, or if reference_pop is not a string

  • ValueError – If required attributes (population_tags, einvectors, eigenvalues) are not set

Notes

Requires the following class attributes to be set:

  • self.population_tags : Path to population tags file (tab-separated)

  • self.einvectors : Path to eigenvectors file (space-separated)

  • self.eigenvalues : Path to eigenvalues file

  • self.ancestry_fails : Path to ancestry fails file (if exclude_outliers=True)

The population tags file should contain the columns ‘ID1’, ‘ID2’, and ‘SuperPop’. The eigenvectors file should contain the principal components data.
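The two input files can be joined as in the following sketch, which reads in-memory stand-ins for the files. The tab-separated tags file with columns ID1, ID2 and SuperPop follows the description above. The eigenvector layout (two ID columns followed by the components, no header), the `'study'` fallback label, and all sample IDs and values are assumptions made for illustration.

```python
import csv
import io

# Made-up file contents standing in for the real inputs.
TAGS_TEXT = "ID1\tID2\tSuperPop\nref1\tref1\tEUR\nref2\tref2\tAFR\n"
EIGENVEC_TEXT = (
    "ref1 ref1 0.01 -0.02 0.03\n"
    "ref2 ref2 -0.04 0.05 0.06\n"
    "stu1 stu1 0.02 -0.01 0.00\n"
)


def read_population_tags(handle):
    """Map (ID1, ID2) to SuperPop from a tab-separated file with a header row."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {(row["ID1"], row["ID2"]): row["SuperPop"] for row in reader}


def read_eigenvectors(handle):
    """Map (ID1, ID2) to a list of principal component values (assumed layout)."""
    samples = {}
    for line in handle:
        fields = line.split()
        if fields:
            samples[(fields[0], fields[1])] = [float(x) for x in fields[2:]]
    return samples


def label_samples(tags, eigenvectors, default_label="study"):
    """Attach a population label to every sample that has eigenvector data."""
    return {key: (tags.get(key, default_label), pcs)
            for key, pcs in eigenvectors.items()}


tags = read_population_tags(io.StringIO(TAGS_TEXT))
pcs = read_eigenvectors(io.StringIO(EIGENVEC_TEXT))
labelled = label_samples(tags, pcs)
```

Samples without an entry in the tags file, here `stu1`, receive the fallback label, which is one way to distinguish study samples from reference samples when coloring a plot.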

report_pca(threshold: float = 0.01) → None

Generate PCA report including scree plot and variance explained analysis.

Parameters:

threshold (float, default=0.01) – Threshold for determining significant principal components (as fraction, e.g., 0.01 = 1%)

Return type:

None

Notes

Creates two output files:

  • Scree plot with eigenvalues and cumulative variance

  • TSV file with detailed PCA statistics
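The quantities behind a scree plot can be computed directly from the eigenvalues. The sketch below assumes that a component counts as significant when its share of the total variance is at least `threshold`; the exact rule used by `report_pca` is not stated here, and the eigenvalues are made-up numbers.

```python
from itertools import accumulate


def explained_variance(eigenvalues, threshold=0.01):
    """Return per-component fractions, cumulative fractions and a significance count."""
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("eigenvalues must sum to a positive number")
    fractions = [value / total for value in eigenvalues]
    cumulative = list(accumulate(fractions))
    # Assumed rule: significant means the fraction reaches the threshold.
    n_significant = sum(1 for fraction in fractions if fraction >= threshold)
    return fractions, cumulative, n_significant


fractions, cumulative, n_significant = explained_variance(
    [5.0, 3.0, 1.5, 0.4, 0.1], threshold=0.05
)
```

With these illustrative eigenvalues and a 5% threshold, the first three components (50%, 30% and 15% of the variance) are counted as significant.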