Ancestry QC Module

The Ancestry QC module performs population structure analysis and ancestry-based quality control.

Main Class

class ideal_genom.qc.ancestry_qc.AncestryQC(input_path: Path, input_name: str, output_path: Path, output_name: str, high_ld_regions_file: Path, reference_files: dict = {}, recompute_merge: bool = True, build: str = '38', rename_snps: bool = False)

Bases: object

__init__(input_path: Path, input_name: str, output_path: Path, output_name: str, high_ld_regions_file: Path, reference_files: dict = {}, recompute_merge: bool = True, build: str = '38', rename_snps: bool = False) → None

Initialize AncestryQC class.

This class performs ancestry quality control analysis on genetic data by merging it with 1000 Genomes reference data and running principal component analysis.

Parameters:

input_path (Path) – Path to directory containing input files
input_name (str) – Base name of input files (without extension)
output_path (Path) – Path to directory where output files will be saved
output_name (str) – Base name for output files
high_ld_regions_file (Path) – Path to file containing high LD regions to exclude
reference_files (dict, optional) – Dictionary with paths to reference files. Must contain ‘bim’, ‘bed’, ‘fam’ and ‘psam’ keys. If not provided, the 1000 Genomes reference files are downloaded. Defaults to an empty dict.
recompute_merge (bool, optional) – Whether to recompute the merge with the reference even if merged files exist. Defaults to True.
build (str, optional) – Genome build version, either ‘37’ or ‘38’. Defaults to ‘38’.
rename_snps (bool, optional) – Whether to rename SNPs to avoid duplicates during merge. Defaults to False.

Raises:

TypeError – If input arguments are not of the expected types
ValueError – If build is not ‘37’ or ‘38’
FileNotFoundError – If input_path or output_path does not exist

Notes

Creates the following directory structure under output_path:

ancestry_qc_results/
- merging/
- plots/
- fail_samples/
- clean_files/
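
The layout above can be reproduced with a short pathlib sketch. This is only an illustration of the documented directory tree, not the library's own code: the helper name make_results_tree is hypothetical, and the FileNotFoundError check simply mirrors the behaviour documented for a missing output_path.

```python
from pathlib import Path
import tempfile

# Subdirectory names come from the Notes above; the helper itself is hypothetical.
SUBDIRS = ("merging", "plots", "fail_samples", "clean_files")


def make_results_tree(output_path: Path) -> dict[str, Path]:
    """Create ancestry_qc_results/ and its subdirectories under output_path."""
    if not output_path.is_dir():
        # Mirrors the documented FileNotFoundError for a missing output_path.
        raise FileNotFoundError(f"output_path does not exist: {output_path}")
    results_dir = output_path / "ancestry_qc_results"
    tree = {"results": results_dir}
    for name in SUBDIRS:
        tree[name] = results_dir / name
        tree[name].mkdir(parents=True, exist_ok=True)
    return tree


# Demonstrate on a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    tree = make_results_tree(Path(tmp))
    created = sorted(p.name for p in tree["results"].iterdir())
```

Here created lists the four subdirectory names, which is a quick way to check that an output location is writable before starting a long run.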

merge_reference_study(ind_pair: list = [50, 5, 0.2]) → None

Merge reference and study data by applying quality control filters and merging steps. This method performs the following quality control steps to merge study data with reference data:

1. Filters problematic SNPs
2. Performs LD pruning
3. Fixes chromosome mismatches
4. Fixes position mismatches
5. Fixes allele flips
6. Removes remaining mismatches
7. Merges the datasets

Parameters:

ind_pair (list, default [50, 5, 0.2]) – Parameters for LD pruning: [window size, step size, r2 threshold]

Return type:

None

Notes

If recompute_merge is False, the method will skip the merging process and expect merged data to already exist in the merging directory.

Raises:

TypeError – If ind_pair is not a list
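
As a rough illustration of what the three ind_pair values mean, the sketch below validates the list and maps it onto PLINK's --indep-pairwise option, which takes a window size, a step size, and an r2 threshold in that order. Whether the library builds its command this way is an assumption; the helper ld_prune_args and its ValueError checks are hypothetical additions, and only the TypeError is documented above.

```python
def ld_prune_args(ind_pair: list) -> list[str]:
    """Validate ind_pair and return PLINK-style LD pruning arguments."""
    if not isinstance(ind_pair, list):
        # Documented behaviour: a non-list ind_pair raises TypeError.
        raise TypeError("ind_pair should be a list")
    if len(ind_pair) != 3:
        # Assumption: exactly three values are expected.
        raise ValueError("ind_pair needs [window size, step size, r2 threshold]")
    window, step, r2 = ind_pair
    if not 0 < r2 <= 1:
        # Assumption: r2 is a squared correlation, so it lies in (0, 1].
        raise ValueError("r2 threshold must be in (0, 1]")
    return ["--indep-pairwise", str(window), str(step), str(r2)]


args = ld_prune_args([50, 5, 0.2])
```

With the default value, args holds the option followed by 50, 5 and 0.2 as strings, ready to be appended to a PLINK command line.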

execute_pca(ref_population: str, pca: int = 10, maf: float = 0.01, num_pca: int = 10, ref_threshold: float = 4, stu_threshold: float = 4, distance_metric: str | float = 'infinity') → None

Performs Principal Component Analysis (PCA) on genetic data and identifies ancestry outliers.

This method executes a complete PCA workflow including:

1. Running the PCA analysis
2. Identifying ancestry outliers using distance-based detection
3. Removing identified outliers

Parameters:

ref_population (str) – Reference population identifier for ancestry comparison
pca (int, optional) – Number of principal components to calculate (default=10)
maf (float, optional) – Minor allele frequency threshold for filtering (default=0.01)
num_pca (int, optional) – Number of principal components to use in outlier detection (default=10)
ref_threshold (float, optional) – Distance threshold for reference population outlier detection (default=4)
stu_threshold (float, optional) – Distance threshold for study population outlier detection (default=4)
aspect_ratio (str or float, optional) – Aspect ratio for PCA plots (default=’equal’)
distance_metric (str or float, optional) – Distance metric to use for outlier detection (default=’infinity’, i.e. Chebyshev distance):
- ‘infinity’ or ‘chebyshev’ → Chebyshev distance (L∞ norm)
- numeric p >= 1 → Minkowski distance with order p (e.g., 2 for Euclidean)
explained_variance_threshold (float, optional) – Threshold for reporting significant principal components based on explained variance (default=0.01)

Returns:

Results are saved to specified output directories

Return type:

None

Notes

The method uses the GenomicOutlierAnalyzer class to perform the analysis and saves results in the directories specified during class initialization. Distance-based outlier detection identifies ancestry outliers more robustly than per-dimension thresholds.
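
The distance metrics named above can be illustrated with a small pure-Python sketch. It assumes that each sample's principal components are standardized against the reference population (centroid and per-PC standard deviation) and that a sample is flagged when its distance exceeds the threshold. That scheme, and the function names minkowski and flag_outliers, are assumptions made for illustration; the internals of GenomicOutlierAnalyzer are not described here.

```python
from statistics import mean, pstdev


def minkowski(z, metric="infinity"):
    """Vector norm: Chebyshev for 'infinity'/'chebyshev', otherwise L_p with p >= 1."""
    if metric in ("infinity", "chebyshev"):
        return max(abs(v) for v in z)
    p = float(metric)
    if p < 1:
        raise ValueError("Minkowski order p must be >= 1")
    return sum(abs(v) ** p for v in z) ** (1.0 / p)


def flag_outliers(samples, reference, threshold=4.0, metric="infinity"):
    """Return indices of samples whose standardized distance exceeds threshold."""
    dims = range(len(reference[0]))
    # Centroid and spread of the reference population, one value per PC.
    centre = [mean(r[d] for r in reference) for d in dims]
    spread = [pstdev([r[d] for r in reference]) for d in dims]
    flagged = []
    for i, s in enumerate(samples):
        z = [(s[d] - centre[d]) / spread[d] for d in dims]
        if minkowski(z, metric) > threshold:
            flagged.append(i)
    return flagged


# Toy data with two PCs: reference centroid (1, 1), unit spread on each PC.
reference = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
samples = [(1.5, 1.0), (4.5, 4.5), (6.0, 1.0)]

cheb = flag_outliers(samples, reference, threshold=4.0, metric="infinity")
eucl = flag_outliers(samples, reference, threshold=4.0, metric=2)
```

On this toy data the Chebyshev metric flags only the third sample, while the Euclidean metric also flags the second one, which lies 3.5 standard deviations away on both PCs at once. The choice of metric therefore changes which samples count as outliers at the same threshold.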

execute_ancestry_qc_pipeline(ancestry_params: dict) → None

Execute complete ancestry QC pipeline.

This method runs the full ancestry quality control workflow including:

1. Merging reference and study data
2. Cleaning intermediate files
3. Running PCA analysis and outlier detection

Parameters:

ancestry_params (dict) – Dictionary containing pipeline parameters. Required keys:
- ind_pair (list): LD pruning parameters [window, step, r2]
- reference_pop (str): Reference population name
- pca (int): Number of PCs to calculate
- maf (float): MAF threshold
- num_pcs (int): Number of PCs for outlier detection
- ref_threshold (float): Reference outlier threshold
- stu_threshold (float): Study outlier threshold
- aspect_ratio (str or float): Plot aspect ratio
- distance_metric (str or float): Distance metric for outliers
- explained_variance_threshold (float): Variance threshold for reporting

Return type:

None
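
A parameter dictionary with the required keys might look like the sketch below. The numeric values reuse the defaults documented for merge_reference_study and execute_pca; ‘SAS’ is only an example reference population label, and the missing-key check is a hypothetical convenience rather than part of the library. The final call is left as a comment because it needs the package and real input files.

```python
# Keys listed as required in the parameter description above.
REQUIRED_KEYS = {
    "ind_pair", "reference_pop", "pca", "maf", "num_pcs",
    "ref_threshold", "stu_threshold", "aspect_ratio",
    "distance_metric", "explained_variance_threshold",
}

ancestry_params = {
    "ind_pair": [50, 5, 0.2],         # LD pruning: window, step, r2
    "reference_pop": "SAS",           # example label, adjust to your study
    "pca": 10,                        # number of PCs to calculate
    "maf": 0.01,                      # MAF threshold
    "num_pcs": 10,                    # PCs used for outlier detection
    "ref_threshold": 4,               # reference outlier threshold
    "stu_threshold": 4,               # study outlier threshold
    "aspect_ratio": "equal",          # plot aspect ratio
    "distance_metric": "infinity",    # Chebyshev distance
    "explained_variance_threshold": 0.01,
}

# Hypothetical pre-flight check: fail early if a required key is absent.
missing = REQUIRED_KEYS - ancestry_params.keys()
if missing:
    raise KeyError(f"ancestry_params is missing keys: {sorted(missing)}")

# With the package installed and input files in place, the pipeline would
# then be started roughly like this (qc is an AncestryQC instance):
# qc.execute_ancestry_qc_pipeline(ancestry_params)
```

Checking the keys up front avoids discovering a missing parameter only after the merge step has already run.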

Supporting Classes

class ideal_genom.qc.ancestry_qc.AncestryQCReport(output_path: Path, einvectors: Path, eigenvalues: Path, ancestry_fails: Path, population_tags: Path)

Bases: object

__init__(output_path: Path, einvectors: Path, eigenvalues: Path, ancestry_fails: Path, population_tags: Path) → None

Initialize the AncestryQCReport class for generating ancestry QC reports and visualizations.

Parameters:

output_path (Path) – Path to output directory for reports and plots
population_tags (Path) – Path to population tags file
einvectors (Path) – Path to eigenvectors file from PCA
eigenvalues (Path) – Path to eigenvalues file from PCA
ancestry_fails (Path) – Path to ancestry fails file

Raises:

TypeError – If output_path is not a Path object, or if output_name is not a string

report_ancestry_qc(reference_pop: str, aspect_ratio: Literal['auto', 'equal'] | float = 'equal', format: str = 'svg') → None

draw_pca_plot(reference_pop: str, aspect_ratio: Literal['auto', 'equal'] | float, exclude_outliers: bool = False, plot_dir: Path = PosixPath('.'), plot_name: str = 'pca_plot', format: str = 'svg') → None

Generate 2D and 3D PCA plots from eigenvector data and population tags. This method creates two PCA visualization plots:

- A 2D scatter plot showing PC1 vs PC2 colored by super-population
- A 3D scatter plot showing PC1 vs PC2 vs PC3 colored by super-population

Parameters:

reference_pop (str) – Reference population identifier for zoomed plots
aspect_ratio (Union[Literal['auto', 'equal'], float]) – Aspect ratio for the plot axes. Can be ‘auto’, ‘equal’, or a numeric value
exclude_outliers (bool, default=False) – Whether to exclude ancestry outliers from the plots
plot_dir (Path, optional) – Directory path where plots will be saved. Defaults to the current directory. If the directory doesn’t exist, plots will be saved in self.output_path
plot_name (str, optional) – Base name for the plot files. Defaults to ‘pca_plot’. Final filenames will be prefixed with ‘2D-’ and ‘3D-’

Return type:

None

Raises:

TypeError – If plot_dir is not a Path object, if plot_name is not a string, or if reference_pop is not a string
ValueError – If required attributes (population_tags, einvectors, eigenvalues) are not set

Notes

Requires the following class attributes to be set:

- self.population_tags : Path to population tags file (tab-separated)
- self.einvectors : Path to eigenvectors file (space-separated)
- self.eigenvalues : Path to eigenvalues file
- self.ancestry_fails : Path to ancestry fails file (if exclude_outliers=True)

The population tags file should contain the columns ‘ID1’, ‘ID2’, and ‘SuperPop’. The eigenvectors file should contain the principal components data.

report_pca(threshold: float = 0.01) → None

Generate PCA report including scree plot and variance explained analysis.

Parameters:

threshold (float, default=0.01) – Threshold for determining significant principal components (as fraction, e.g., 0.01 = 1%)

Return type:

None

Raises:

ValueError – If eigenvalues attribute is not set
TypeError – If threshold is not a float

Notes

Creates two output files:

- Scree plot with eigenvalues and cumulative variance
- TSV file with detailed PCA statistics
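
The variance-explained figures behind such a report can be sketched as follows. The example assumes that each component's share is its eigenvalue divided by the sum of the eigenvalues supplied, and that a component counts as significant when its share is at least the threshold. Both points, along with the function name explained_variance and the toy eigenvalues, are assumptions for illustration and not taken from the library.

```python
def explained_variance(eigenvalues, threshold=0.01):
    """Per-PC explained and cumulative variance, relative to the given eigenvalues."""
    total = sum(eigenvalues)
    rows = []
    cumulative = 0.0
    for pc, value in enumerate(eigenvalues, start=1):
        fraction = value / total
        cumulative += fraction
        rows.append({
            "pc": pc,
            "eigenvalue": value,
            "explained": fraction,
            "cumulative": cumulative,
            # Assumption: 'significant' means at least `threshold` of the variance.
            "significant": fraction >= threshold,
        })
    return rows


# Toy eigenvalues, not real PCA output.
rows = explained_variance([6.0, 3.0, 0.95, 0.05], threshold=0.01)
significant_pcs = [row["pc"] for row in rows if row["significant"]]
```

Because PCA tools often write only the leading eigenvalues, fractions computed this way are relative to the components that were actually calculated, not to the total variance of the genotype data.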