Ancestry QC Module
The Ancestry QC module performs population structure analysis and ancestry-based quality control.
Main Class
- class ideal_genom.qc.ancestry_qc.AncestryQC(input_path: Path, input_name: str, output_path: Path, output_name: str, high_ld_regions_file: Path, reference_files: dict = {}, recompute_merge: bool = True, build: str = '38', rename_snps: bool = False)
Bases: object
- __init__(input_path: Path, input_name: str, output_path: Path, output_name: str, high_ld_regions_file: Path, reference_files: dict = {}, recompute_merge: bool = True, build: str = '38', rename_snps: bool = False) -> None
Initialize AncestryQC class.
This class performs ancestry quality control analysis on genetic data by merging it with 1000 Genomes reference data and running principal component analysis.
Parameters:
- input_path: Path
Path to directory containing input files
- input_name: str
Base name of input files (without extension)
- output_path: Path
Path to directory where output files will be saved
- output_name: str
Base name for output files
- high_ld_regions_file: Path
Path to file containing high LD regions to exclude
- reference_files: dict (optional)
Dictionary with paths to reference files. If provided, it must contain the ‘bim’, ‘bed’, ‘fam’ and ‘psam’ keys. If left empty (the default), the 1000 Genomes reference files are downloaded.
- recompute_merge: bool (optional)
Whether to recompute merge with reference even if merged files exist. Defaults to True.
- build: str (optional)
Genome build version, either ‘37’ or ‘38’. Defaults to ‘38’.
- rename_snps: bool (optional)
Whether to rename SNPs to avoid duplicates during merge. Defaults to False.
- raises TypeError:
If input arguments are not of expected types
- raises ValueError:
If build is not ‘37’ or ‘38’
- raises FileNotFoundError:
If input_path or output_path does not exist
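The reference_files requirement can be checked before constructing the class. The following is a minimal sketch, not the library's own validation: the helper name `missing_reference_keys` and the example paths are hypothetical, and only the four key names come from the parameter description.

```python
from pathlib import Path

# Keys the documentation says a non-empty reference_files dict must contain.
REQUIRED_REFERENCE_KEYS = {"bim", "bed", "fam", "psam"}


def missing_reference_keys(reference_files: dict) -> set:
    """Return the required keys absent from reference_files.

    An empty dict is treated as valid here, because the documentation
    says the 1000 Genomes files are downloaded in that case.
    (Hypothetical helper, not part of ideal_genom.)
    """
    if not reference_files:
        return set()
    return REQUIRED_REFERENCE_KEYS - set(reference_files)


# Hypothetical paths, for illustration only.
partial = {"bim": Path("ref/ref.bim"), "bed": Path("ref/ref.bed")}
print(sorted(missing_reference_keys(partial)))
```

Running a check like this up front surfaces an incomplete dictionary before any merge work starts.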
Notes
Creates the following directory structure under output_path:
- ancestry_qc_results/
merging/
plots/
fail_samples/
clean_files/
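As an illustration of this layout, the sketch below creates the same directory names with pathlib. It assumes the four subdirectories are nested inside ancestry_qc_results/, which the listing suggests but does not state explicitly. The helper `make_results_dirs` is hypothetical and is not how the class itself is implemented.

```python
import tempfile
from pathlib import Path

# Subdirectory names taken from the Notes; nesting is an assumption.
SUBDIRS = ("merging", "plots", "fail_samples", "clean_files")


def make_results_dirs(output_path: Path) -> Path:
    """Create ancestry_qc_results/ and its subdirectories under output_path."""
    results_dir = output_path / "ancestry_qc_results"
    for name in SUBDIRS:
        (results_dir / name).mkdir(parents=True, exist_ok=True)
    return results_dir


# Demonstrate in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    results = make_results_dirs(Path(tmp))
    created = sorted(p.name for p in results.iterdir())

print(created)
```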
- merge_reference_study(ind_pair: list = [50, 5, 0.2]) -> None
Merge reference and study data by applying quality control filters and merging steps. This method performs the following steps to merge study data with reference data:
1. Filters problematic SNPs
2. Performs LD pruning
3. Fixes chromosome mismatches
4. Fixes position mismatches
5. Fixes allele flips
6. Removes remaining mismatches
7. Merges the datasets
- Parameters:
ind_pair (list, default [50, 5, 0.2]) – Parameters for LD pruning: [window size, step size, r2 threshold]
- Return type:
None
Notes
If recompute_merge is False, the method will skip the merging process and expect merged data to already exist in the merging directory.
- Raises:
TypeError – If ind_pair is not a list
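To make the three ind_pair values concrete, the sketch below maps them onto PLINK's `--indep-pairwise` flag, which takes a window size, a step size and an r2 threshold in that order. Whether merge_reference_study invokes PLINK this way is an assumption. The helper `indep_pairwise_args` is hypothetical, and its length check is an addition not described in the documentation.

```python
def indep_pairwise_args(ind_pair: list) -> list:
    """Turn [window size, step size, r2 threshold] into PLINK-style arguments.

    Hypothetical helper: mirrors the documented TypeError for non-list
    input and adds a length check for illustration.
    """
    if not isinstance(ind_pair, list):
        raise TypeError("ind_pair should be a list")
    if len(ind_pair) != 3:
        raise ValueError("ind_pair should have exactly three elements")
    window, step, r2 = ind_pair
    return ["--indep-pairwise", str(window), str(step), str(r2)]


# Default values from the method signature.
args = indep_pairwise_args([50, 5, 0.2])
print(" ".join(args))
```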
- execute_pca(ref_population: str, pca: int = 10, maf: float = 0.01, num_pca: int = 10, ref_threshold: float = 4, stu_threshold: float = 4, distance_metric: str | float = 'infinity') -> None
Performs Principal Component Analysis (PCA) on genetic data and identifies ancestry outliers.
This method executes a complete PCA workflow including:
1. Running the PCA analysis
2. Identifying ancestry outliers using distance-based detection
3. Removing identified outliers
- Parameters:
ref_population (str) – Reference population identifier for ancestry comparison
pca (int, optional) – Number of principal components to calculate (default=10)
maf (float, optional) – Minor allele frequency threshold for filtering (default=0.01)
num_pca (int, optional) – Number of principal components to use in outlier detection (default=10)
ref_threshold (float, optional) – Distance threshold for reference population outlier detection (default=4)
stu_threshold (float, optional) – Distance threshold for study population outlier detection (default=4)
aspect_ratio (str or float, optional) – Aspect ratio for PCA plots (default=’equal’)
distance_metric (str or float, optional) – Distance metric to use for outlier detection. Default is ‘infinity’ (Chebyshev distance).
- ‘infinity’ or ‘chebyshev’ → Chebyshev distance (L∞ norm)
- numeric p >= 1 → Minkowski distance with order p (e.g., 2 for Euclidean)
explained_variance_threshold (float, optional) – Threshold for reporting significant principal components based on explained variance (default=0.01)
- Returns:
Results are saved to specified output directories
- Return type:
None
Notes
The method uses the GenomicOutlierAnalyzer class to perform the analysis and saves results in the directories specified during class initialization. The distance-based outlier detection provides more robust identification of ancestry outliers compared to per-dimension thresholds.
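The two families of distance metric accepted by distance_metric can be sketched in plain Python. This is an illustration of the Chebyshev and Minkowski definitions only. It is not the GenomicOutlierAnalyzer implementation, and how the library centres or scales the principal components before applying ref_threshold and stu_threshold is not covered here. The function name `pc_distance` is hypothetical.

```python
from typing import Sequence, Union


def pc_distance(
    point: Sequence[float],
    center: Sequence[float],
    metric: Union[str, float] = "infinity",
) -> float:
    """Distance between a sample and a centre in principal-component space.

    'infinity' or 'chebyshev' gives the L-infinity norm; a number p >= 1
    gives the Minkowski distance of order p (p = 2 is Euclidean).
    """
    diffs = [abs(p - c) for p, c in zip(point, center)]
    if isinstance(metric, str):
        if metric.lower() in ("infinity", "chebyshev"):
            return max(diffs)
        raise ValueError(f"unknown distance metric: {metric!r}")
    if metric < 1:
        raise ValueError("Minkowski order p must be >= 1")
    return sum(d ** metric for d in diffs) ** (1.0 / metric)


# Toy coordinates chosen for easy arithmetic, not real PCA output.
sample, centre = (3.0, 4.0), (0.0, 0.0)
print(pc_distance(sample, centre, "infinity"))
print(pc_distance(sample, centre, 2))
```

With the Chebyshev metric a sample's distance is its largest deviation along any single component, while a finite p combines the deviations across all components.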
- execute_ancestry_qc_pipeline(ancestry_params: dict) -> None
Execute complete ancestry QC pipeline.
This method runs the full ancestry quality control workflow including:
1. Merging reference and study data
2. Cleaning intermediate files
3. Running PCA analysis and outlier detection
- Parameters:
ancestry_params (dict) – Dictionary containing pipeline parameters. Required keys:
- ind_pair : list - LD pruning parameters [window, step, r2]
- reference_pop : str - Reference population name
- pca : int - Number of PCs to calculate
- maf : float - MAF threshold
- num_pcs : int - Number of PCs for outlier detection
- ref_threshold : float - Reference outlier threshold
- stu_threshold : float - Study outlier threshold
- aspect_ratio : str or float - Plot aspect ratio
- distance_metric : str or float - Distance metric for outliers
- explained_variance_threshold : float - Variance threshold for reporting
- Return type:
None
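A parameter dictionary with the required keys might look like the sketch below. The numeric values reuse the defaults shown in the method signatures on this page. The "EUR" value for reference_pop is a placeholder, and the key check is an illustrative addition rather than something the pipeline is documented to perform.

```python
# Required keys as listed in the ancestry_params description.
REQUIRED_KEYS = {
    "ind_pair", "reference_pop", "pca", "maf", "num_pcs",
    "ref_threshold", "stu_threshold", "aspect_ratio",
    "distance_metric", "explained_variance_threshold",
}

ancestry_params = {
    "ind_pair": [50, 5, 0.2],             # LD pruning: window, step, r2
    "reference_pop": "EUR",               # placeholder population label
    "pca": 10,                            # number of PCs to calculate
    "maf": 0.01,                          # MAF threshold
    "num_pcs": 10,                        # PCs used for outlier detection
    "ref_threshold": 4,                   # reference outlier threshold
    "stu_threshold": 4,                   # study outlier threshold
    "aspect_ratio": "equal",              # plot aspect ratio
    "distance_metric": "infinity",        # Chebyshev distance
    "explained_variance_threshold": 0.01, # variance threshold for reporting
}

# Illustrative pre-flight check for missing keys.
missing = REQUIRED_KEYS - ancestry_params.keys()
print(sorted(missing))
```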
Supporting Classes
- class ideal_genom.qc.ancestry_qc.AncestryQCReport(output_path: Path, einvectors: Path, eigenvalues: Path, ancestry_fails: Path, population_tags: Path)
Bases: object
- __init__(output_path: Path, einvectors: Path, eigenvalues: Path, ancestry_fails: Path, population_tags: Path) -> None
Initialize AncestryQCReport class for generating ancestry QC reports and visualizations.
- Parameters:
output_path (Path) – Path to output directory for reports and plots
population_tags (Path) – Path to population tags file
einvectors (Path) – Path to eigenvectors file from PCA
eigenvalues (Path) – Path to eigenvalues file from PCA
ancestry_fails (Path) – Path to ancestry fails file
- Raises:
TypeError – If output_path is not a Path object, or if output_name is not a string
- report_ancestry_qc(reference_pop: str, aspect_ratio: Literal['auto', 'equal'] | float = 'equal', format: str = 'svg') -> None
- draw_pca_plot(reference_pop: str, aspect_ratio: Literal['auto', 'equal'] | float, exclude_outliers: bool = False, plot_dir: Path = PosixPath('.'), plot_name: str = 'pca_plot', format: str = 'svg') -> None
Generate 2D and 3D PCA plots from eigenvector data and population tags. This method creates two PCA visualization plots:
- A 2D scatter plot showing PC1 vs PC2 colored by super-population
- A 3D scatter plot showing PC1 vs PC2 vs PC3 colored by super-population
- Parameters:
reference_pop (str) – Reference population identifier for zoomed plots
aspect_ratio (Union[Literal['auto', 'equal'], float]) – Aspect ratio for the plot axes. Can be ‘auto’, ‘equal’, or a numeric value
exclude_outliers (bool, default=False) – Whether to exclude ancestry outliers from the plots
plot_dir (Path, optional) – Directory path where plots will be saved. Defaults to current directory. If directory doesn’t exist, plots will be saved in self.output_path
plot_name (str, optional) – Base name for the plot files. Defaults to ‘pca_plot’. Final filenames are prefixed with ‘2D-’ and ‘3D-’
- Return type:
None
- Raises:
TypeError – If plot_dir is not a Path object, if plot_name is not a string, or if reference_pop is not a string
ValueError – If required attributes (population_tags, einvectors, eigenvalues) are not set
Notes
Requires the following class attributes to be set:
- self.population_tags : Path to population tags file (tab-separated)
- self.einvectors : Path to eigenvectors file (space-separated)
- self.eigenvalues : Path to eigenvalues file
- self.ancestry_fails : Path to ancestry fails file (if exclude_outliers=True)
The population tags file should contain the columns ‘ID1’, ‘ID2’, and ‘SuperPop’. The eigenvectors file should contain the principal components data.
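The expected shape of the population tags file can be illustrated with the standard csv module. This is a sketch under stated assumptions: the sample rows are made up, the helper `read_population_tags` is hypothetical, and the class itself may read the file differently. Only the tab separator and the three column names come from the Notes.

```python
import csv
import io
from collections import Counter

# Made-up, tab-separated example content with the documented columns.
tags_text = "ID1\tID2\tSuperPop\n0\tS1\tEUR\n0\tS2\tAFR\n0\tS3\tEUR\n"


def read_population_tags(handle) -> dict:
    """Map (ID1, ID2) to SuperPop, checking the documented columns exist."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = {"ID1", "ID2", "SuperPop"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return {(row["ID1"], row["ID2"]): row["SuperPop"] for row in reader}


tags = read_population_tags(io.StringIO(tags_text))
counts = Counter(tags.values())  # samples per super-population
print(dict(counts))
```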
- report_pca(threshold: float = 0.01) -> None
Generate PCA report including scree plot and variance explained analysis.
- Parameters:
threshold (float, default=0.01) – Threshold for determining significant principal components (as fraction, e.g., 0.01 = 1%)
- Return type:
None
- Raises:
ValueError – If eigenvalues attribute is not set
TypeError – If threshold is not a float
Notes
Creates two output files:
- Scree plot with eigenvalues and cumulative variance
- TSV file with detailed PCA statistics
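The variance-explained analysis can be sketched from a list of eigenvalues as follows. Two points are assumptions rather than documented behaviour: fractions are taken relative to the sum of the eigenvalues supplied (which may be only the leading components), and a component counts as significant when its fraction is at or above the threshold. The function name `explained_variance` and the eigenvalues are illustrative.

```python
def explained_variance(eigenvalues: list, threshold: float = 0.01) -> list:
    """Per-component and cumulative fraction of variance explained.

    Fractions are relative to the sum of the eigenvalues passed in
    (an assumption; the library may normalise differently).
    """
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("eigenvalues must sum to a positive number")
    rows, cumulative = [], 0.0
    for index, value in enumerate(eigenvalues, start=1):
        fraction = value / total
        cumulative += fraction
        rows.append({
            "pc": index,
            "fraction": fraction,
            "cumulative": cumulative,
            "significant": fraction >= threshold,
        })
    return rows


# Toy eigenvalues chosen for easy arithmetic, not real PCA output.
rows = explained_variance([5.0, 3.0, 1.5, 0.5], threshold=0.1)
significant_pcs = [r["pc"] for r in rows if r["significant"]]
print(significant_pcs)
```

The per-component fractions correspond to the heights in a scree plot, and the running total corresponds to the cumulative variance curve.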