Utility Modules

Core utilities and helper functions.

Core Utilities

Shared utility functions for genomic analysis pipelines.

ideal_genom.core.utils.get_optimal_threads(reserve: int = 2, default: int = 10, max_threads: int | None = None) → int

Calculate optimal thread count for genomic analysis operations.

Determines the number of threads to use based on available CPU cores, reserving some cores for system operations and falling back to a default value if CPU detection fails.

Parameters:

reserve (int, default=2) – Number of cores to reserve for system operations
default (int, default=10) – Default thread count if CPU detection fails
max_threads (int, optional) – Maximum number of threads to use (caps the result)

Returns:

Optimal number of threads to use (always >= 1)

Return type:

int

Examples

>>> threads = get_optimal_threads()  # On a 16-core system, returns 14
>>> threads = get_optimal_threads(reserve=4)  # On a 16-core system, returns 12
>>> threads = get_optimal_threads(max_threads=8)  # Never exceeds 8
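
The arithmetic behind these examples can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: `optimal_threads_sketch` and its `cpu_count` argument are hypothetical names, introduced so the core count can be passed in explicitly, and the order in which the cap and the fallback are applied is assumed.

```python
import os
from typing import Optional


def optimal_threads_sketch(
    cpu_count: Optional[int],
    reserve: int = 2,
    default: int = 10,
    max_threads: Optional[int] = None,
) -> int:
    """Hypothetical restatement of the documented behaviour."""
    if cpu_count is None:
        # Assumed fallback: detection failed, so use the default count.
        threads = default
    else:
        threads = cpu_count - reserve
    if max_threads is not None:
        threads = min(threads, max_threads)
    # The documented contract guarantees a result of at least 1.
    return max(1, threads)


print(optimal_threads_sketch(16))                 # 14
print(optimal_threads_sketch(16, reserve=4))      # 12
print(optimal_threads_sketch(16, max_threads=8))  # 8
print(optimal_threads_sketch(os.cpu_count()))     # depends on the machine
```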

ideal_genom.core.utils.get_available_memory(fraction: float = 0.6666666666666666, min_mb: int = 512, max_mb: int | None = None, safety_buffer_mb: int = 1024) → int

Calculate available memory for genomic analysis operations with safety checks.

Determines the amount of memory to allocate based on currently available system memory, using a configurable fraction with minimum and maximum limits to avoid system instability.

Parameters:

fraction (float, default=2/3) – Fraction of available memory to use (should be between 0 and 1)
min_mb (int, default=512) – Minimum memory to allocate in MB
max_mb (int, optional) – Maximum memory to allocate in MB (None for no limit)
safety_buffer_mb (int, default=1024) – Safety buffer to always leave available for system (MB)

Returns:

Memory in MB to allocate

Return type:

int

Raises:

ValueError – If fraction is not between 0 and 1, or other parameters are invalid
RuntimeError – If insufficient memory is available

Examples

>>> memory_mb = get_available_memory()  # Uses 2/3 of available memory
>>> memory_mb = get_available_memory(fraction=0.5, max_mb=8192)  # Uses half, max 8GB
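
How the fraction, limits, and safety buffer might interact is sketched below. The combination rule is an assumption made for illustration (the source does not specify the order of the checks), and `memory_budget_sketch` with its explicit `available_mb` argument is a hypothetical name, not part of the library.

```python
from typing import Optional


def memory_budget_sketch(
    available_mb: int,
    fraction: float = 2 / 3,
    min_mb: int = 512,
    max_mb: Optional[int] = None,
    safety_buffer_mb: int = 1024,
) -> int:
    """Hypothetical budget calculation; the real rules may differ."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    # Assumption: take the requested fraction, but never so much that
    # less than the safety buffer would remain for the system.
    budget = min(int(available_mb * fraction), available_mb - safety_buffer_mb)
    if budget < min_mb:
        raise RuntimeError("insufficient memory available")
    if max_mb is not None:
        budget = min(budget, max_mb)
    return budget


print(memory_budget_sketch(16384, fraction=0.5))               # 8192
print(memory_budget_sketch(16384, fraction=0.5, max_mb=4096))  # 4096
```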

ideal_genom.core.utils.count_file_lines(file_path: Path) → int

Count lines in a file efficiently.

Uses a generator expression for memory-efficient line counting, suitable for large genomic data files.

Parameters:

file_path (Path) – Path to the file to count

Returns:

Number of lines in the file

Return type:

int

Raises:

FileNotFoundError – If the file does not exist
IOError – If the file cannot be read

Examples

>>> from pathlib import Path
>>> count = count_file_lines(Path('variants.bim'))
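
The generator-expression technique mentioned above can be illustrated with a self-contained sketch. `count_lines_sketch` is a hypothetical stand-in, and the two-line `.bim` content is made up for the demonstration.

```python
import tempfile
from pathlib import Path


def count_lines_sketch(file_path: Path) -> int:
    # Iterating over the file object yields one line at a time, so memory
    # use stays flat regardless of file size.
    with file_path.open("rb") as handle:
        return sum(1 for _ in handle)


with tempfile.TemporaryDirectory() as tmp:
    bim = Path(tmp) / "variants.bim"
    bim.write_text("1\trs1\t0\t100\tA\tG\n1\trs2\t0\t200\tC\tT\n")
    line_count = count_lines_sketch(bim)

print(line_count)  # 2
```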

ideal_genom.core.utils.validate_input_file(file_path: Path, extensions: List[str] | None = None) → Path

Validate that a file exists and optionally check its extension.

Validates file existence and optionally ensures the file has one of the specified extensions. Useful for genomic data files that must have specific formats (e.g., .vcf, .bim, .fam).

Parameters:

file_path (Path) – Path to the file to validate
extensions (List[str], optional) – List of valid file extensions (including the dot, e.g., ['.vcf', '.vcf.gz']). If None, no extension validation is performed.

Returns:

Validated file path

Return type:

Path

Raises:

TypeError – If file_path is not a Path object
FileNotFoundError – If the file does not exist
IsADirectoryError – If the path points to a directory instead of a file
ValueError – If the file extension is not in the allowed extensions list

Examples

>>> from pathlib import Path
>>> vcf_file = validate_input_file(Path('data.vcf'), ['.vcf', '.vcf.gz'])
>>> any_file = validate_input_file(Path('output.txt'))  # No extension check
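
One detail worth illustrating is how a compound extension such as '.vcf.gz' can be matched. The sketch below is an assumption about a reasonable approach, not the library's code: `validate_input_file_sketch` is a hypothetical name, and it matches on the end of the file name because `Path.suffix` only returns the last component.

```python
import tempfile
from pathlib import Path
from typing import List, Optional


def validate_input_file_sketch(
    file_path: Path, extensions: Optional[List[str]] = None
) -> Path:
    if not isinstance(file_path, Path):
        raise TypeError("file_path must be a pathlib.Path")
    if not file_path.exists():
        raise FileNotFoundError(file_path)
    if file_path.is_dir():
        raise IsADirectoryError(file_path)
    # Path('data.vcf.gz').suffix is only '.gz', so compare name endings.
    if extensions is not None and not any(
        file_path.name.endswith(ext) for ext in extensions
    ):
        raise ValueError(f"{file_path.name} must end with one of {extensions}")
    return file_path


with tempfile.TemporaryDirectory() as tmp:
    vcf = Path(tmp) / "data.vcf.gz"
    vcf.touch()
    accepted = validate_input_file_sketch(vcf, [".vcf", ".vcf.gz"])
    try:
        validate_input_file_sketch(vcf, [".bim"])
        rejected = False
    except ValueError:
        rejected = True

print(accepted.name, rejected)  # data.vcf.gz True
```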

ideal_genom.core.utils.validate_file_path(file_path: Path, must_exist: bool = True, must_be_file: bool = True) → Path

Generic file path validation with flexible requirements.

Provides flexible validation for file paths with configurable requirements. Useful when you need different validation rules for different scenarios (e.g., input files that must exist vs. output files that may not exist yet).

Parameters:

file_path (Path) – Path to validate
must_exist (bool, default=True) – If True, the path must already exist
must_be_file (bool, default=True) – If True, the path must be a file (not a directory). Only checked if must_exist=True and the path exists.

Returns:

Validated file path

Return type:

Path

Raises:

TypeError – If file_path is not a Path object
FileNotFoundError – If must_exist=True and the path does not exist
IsADirectoryError – If must_be_file=True and the path is a directory

Examples

>>> from pathlib import Path
>>> # Validate existing input file
>>> input_file = validate_file_path(Path('input.txt'))
>>>
>>> # Validate output file path (may not exist yet)
>>> output_file = validate_file_path(Path('output.txt'), must_exist=False)
>>>
>>> # Validate path that could be file or directory
>>> path = validate_file_path(Path('data'), must_be_file=False)

ideal_genom.core.utils.validate_output_dir(output_dir: Path, create: bool = True) → Path

Validate and optionally create output directory.

Parameters:

output_dir (Path) – Path to the output directory
create (bool, default=True) – If True, create the directory if it doesn’t exist

Returns:

Validated output directory path

Return type:

Path

Raises:

FileNotFoundError – If directory doesn’t exist and create=False
PermissionError – If directory cannot be created due to permissions

Examples

>>> from pathlib import Path
>>> output = validate_output_dir(Path('/data/results'))
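
A minimal sketch of the create-if-missing behaviour, assuming it is built on `Path.mkdir`; `validate_output_dir_sketch` is a hypothetical name and the nested directory is created inside a temporary location for the demonstration.

```python
import tempfile
from pathlib import Path


def validate_output_dir_sketch(output_dir: Path, create: bool = True) -> Path:
    if not output_dir.exists():
        if not create:
            raise FileNotFoundError(output_dir)
        # parents=True builds intermediate directories; mkdir itself raises
        # PermissionError if the location is not writable.
        output_dir.mkdir(parents=True, exist_ok=True)
    return output_dir


with tempfile.TemporaryDirectory() as tmp:
    target = validate_output_dir_sketch(Path(tmp) / "results" / "run1")
    was_created = target.is_dir()
    try:
        validate_output_dir_sketch(Path(tmp) / "missing", create=False)
        missing_raised = False
    except FileNotFoundError:
        missing_raised = True

print(was_created, missing_raised)  # True True
```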

ideal_genom.core.utils.format_memory_size(bytes_size: int) → str

Format byte size into human-readable string.

Parameters:

bytes_size (int) – Size in bytes

Returns:

Formatted size string (e.g., '1.50 GB', '512.00 KB')

Return type:

str

Examples

>>> format_memory_size(1536 * 1024 * 1024)
'1.50 GB'
>>> format_memory_size(512 * 1024)
'512.00 KB'
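
The conversion shown in these examples can be reproduced with repeated division by 1024. The sketch below is an assumption consistent with the two outputs above; `format_memory_size_sketch` is a hypothetical name, and the unit list and the handling of sizes below 1 KB are guesses.

```python
def format_memory_size_sketch(bytes_size: int) -> str:
    units = ("B", "KB", "MB", "GB", "TB")
    size = float(bytes_size)
    index = 0
    # Step up one unit each time the value reaches 1024, stopping at TB.
    while size >= 1024 and index < len(units) - 1:
        size /= 1024
        index += 1
    return f"{size:.2f} {units[index]}"


print(format_memory_size_sketch(1536 * 1024 * 1024))  # 1.50 GB
print(format_memory_size_sketch(512 * 1024))          # 512.00 KB
```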

ideal_genom.core.utils.get_system_resource_info() → dict

Get comprehensive system resource information.

Returns:

Dictionary containing CPU, memory, and disk information

Return type:

dict

Examples

>>> info = get_system_resource_info()
>>> print(f"Available memory: {info['memory']['available_mb']:.0f} MB")

ideal_genom.core.utils.download_file(url: str, local_filename: Path) → None

ideal_genom.core.utils.unzip_file_flat(in_file: Path, target_file: str, out_dir: Path, remove_zip: bool = False) → Path

Extracts a specific file from a ZIP archive, decompresses it if it’s a .gz file, and optionally deletes original files.

Parameters:

in_file (Path) – Path to the ZIP file.
target_file (str) – The file inside the ZIP to extract.
out_dir (Path) – Directory where the extracted file will be saved.
remove_zip (bool, default=False) – If True, delete the original ZIP file after extraction.
remove_gz (bool) – If True, delete the .gz file after decompression. This parameter is not part of the signature shown above.

Returns:

Path to the final extracted file.

Return type:

Path
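
The ZIP-extraction step can be sketched with the standard library. This is an illustration under assumptions, not the library's code: `unzip_member_flat_sketch` is a hypothetical name, "flat" is assumed to mean that directory prefixes stored in the archive are dropped, and the follow-up .gz decompression is left out.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def unzip_member_flat_sketch(
    in_file: Path, target_file: str, out_dir: Path, remove_zip: bool = False
) -> Path:
    out_dir.mkdir(parents=True, exist_ok=True)
    # Keep only the member's base name so it lands directly in out_dir.
    extracted = out_dir / Path(target_file).name
    with zipfile.ZipFile(in_file) as archive:
        with archive.open(target_file) as src, extracted.open("wb") as dst:
            shutil.copyfileobj(src, dst)
    if remove_zip:
        in_file.unlink()
    return extracted


with tempfile.TemporaryDirectory() as tmp:
    archive_path = Path(tmp) / "bundle.zip"
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("nested/dir/genes.txt", "example\n")
    result = unzip_member_flat_sketch(
        archive_path, "nested/dir/genes.txt", Path(tmp) / "out"
    )
    result_name = result.name
    parent_name = result.parent.name
    result_text = result.read_text()

print(parent_name, result_name)  # out genes.txt
```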

ideal_genom.core.utils.extract_gz_file(gz_file: Path, out_dir: Path, remove_gz: bool = False) → Path

Extracts a .gz file and saves the decompressed content in out_dir.

Parameters:

gz_file (Path) – Path to the .gz file.
out_dir (Path) – Directory where the decompressed file will be saved.
remove_gz (bool, default=False) – If True, delete the .gz file after extraction.

Returns:

Path to the extracted file.

Return type:

Path
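
A streaming decompression along these lines can be written with `gzip` and `shutil`. The sketch is a hedged illustration: `extract_gz_sketch` is a hypothetical name, and deriving the output name by dropping the `.gz` suffix is an assumption.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def extract_gz_sketch(gz_file: Path, out_dir: Path, remove_gz: bool = False) -> Path:
    out_dir.mkdir(parents=True, exist_ok=True)
    # Path('notes.txt.gz').stem is 'notes.txt', the name without '.gz'.
    target = out_dir / gz_file.stem
    # copyfileobj streams in chunks, so large files are never fully in memory.
    with gzip.open(gz_file, "rb") as src, target.open("wb") as dst:
        shutil.copyfileobj(src, dst)
    if remove_gz:
        gz_file.unlink()
    return target


with tempfile.TemporaryDirectory() as tmp:
    gz_path = Path(tmp) / "notes.txt.gz"
    with gzip.open(gz_path, "wt") as handle:
        handle.write("chr1\t100\n")
    extracted = extract_gz_sketch(gz_path, Path(tmp) / "out", remove_gz=True)
    extracted_name = extracted.name
    extracted_text = extracted.read_text()
    gz_still_there = gz_path.exists()

print(extracted_name, gz_still_there)  # notes.txt False
```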

Annotations

This module provides functions to annotate genomic variants with gene information and effects.

Includes functions to:

- Find the closest gene to a given SNP position.
- Map chromosome numbers to identifiers.
- Convert GTF files to a format containing all genes.
- Annotate SNPs with gene names using Ensembl or RefSeq databases.
- Prepare genome data from GTF files.
- Annotate variants with their effects relying on Ensembl VEP.

ideal_genom.utilities.annotations.get_closest_gene(x, data: Genome, chrom: str = 'CHR', pos: str = 'POS', max_iter: int = 20000, step: int = 50, source: str = 'ensembl', build: str = '38') → tuple

Find the closest gene to a given position in the genome.

This function searches for the closest gene to a specified SNP position in the genome. It checks the position in the specified chromosome and returns the distance to the closest gene along with the gene name(s). If no gene is found within the specified distance, it returns “intergenic”.

Parameters:

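x – SNP record (for example, a dictionary or a DataFrame row) holding the chromosome and position under the keys given by chrom and pos.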
data (pyensembl.Genome) – An instance of the Genome class containing gene annotations.
chrom (str, optional) – The key in the dictionary x that corresponds to the chromosome. Default is “CHR”.
pos (str, optional) – The key in the dictionary x that corresponds to the position. Default is “POS”.
max_iter (int, optional) – The maximum number of iterations to search for a gene. Default is 20000.
step (int, optional) – The step size for each iteration when searching for a gene. Default is 50.
source (str, optional) – The source of the gene annotations, either “ensembl” or “refseq”. Default is “ensembl”.
build (str, optional) – The genome build version, used when source is “refseq”. Default is “38”.

Returns:

A tuple containing the distance to the closest gene and the gene name(s). If no gene is found, returns the distance and “intergenic”.

Return type:

tuple

Raises:

TypeError – If data is not an instance of Genome, or if chrom or pos are not strings, or if max_iter or step are not integers.
ValueError – If source is not “ensembl” or “refseq”, or if build is not “37” or “38”.
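
The roles of max_iter and step can be illustrated with a toy expanding-window search. This sketch does not use pyensembl and is not the library's algorithm: `closest_gene_sketch`, the `(start, end, name)` tuples, and the rule of widening the window symmetrically by `step` per iteration are all assumptions made for illustration.

```python
from typing import List, Tuple

GeneInterval = Tuple[int, int, str]  # (start, end, name); hypothetical layout


def closest_gene_sketch(
    pos: int, genes: List[GeneInterval], max_iter: int = 20000, step: int = 50
) -> Tuple[int, str]:
    def names_overlapping(lo: int, hi: int) -> List[str]:
        return sorted({name for start, end, name in genes if start <= hi and end >= lo})

    # Distance 0: the position falls inside at least one gene.
    hits = names_overlapping(pos, pos)
    if hits:
        return 0, ",".join(hits)
    distance = 0
    for i in range(1, max_iter + 1):
        # Widen the window by one step on each side per iteration.
        distance = i * step
        hits = names_overlapping(pos - distance, pos + distance)
        if hits:
            return distance, ",".join(hits)
    # Nothing found within max_iter * step bases.
    return distance, "intergenic"


toy_genes = [(1000, 2000, "GENE_A")]
print(closest_gene_sketch(1500, toy_genes))       # (0, 'GENE_A')
print(closest_gene_sketch(2120, toy_genes))       # (150, 'GENE_A')
print(closest_gene_sketch(10_000, [], max_iter=3))  # (150, 'intergenic')
```

With the defaults, such a search would cover up to 20000 × 50 = 1,000,000 bases on either side of the position before reporting "intergenic".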

ideal_genom.utilities.annotations.get_number_to_chr(in_chr: bool = False, xymt: list = ['X', 'Y', 'MT'], xymt_num: list = [23, 24, 25], prefix: str = '') → dict

Creates a dictionary mapping chromosome numbers to chromosome identifiers.

This function generates a mapping between chromosome numbers (as keys) and chromosome identifiers (as values), with special handling for sex chromosomes and mitochondrial chromosome.

Parameters:

in_chr (bool, default=False) – If True, dictionary keys will be strings; if False, keys will be integers.
xymt (list, default=["X","Y","MT"]) – List of string identifiers for the X, Y, and mitochondrial chromosomes.
xymt_num (list, default=[23,24,25]) – List of numeric identifiers corresponding to X, Y, and MT chromosomes.
prefix (str, default="") – String prefix to add to all chromosome identifiers.

Returns:

A dictionary mapping chromosome numbers to chromosome identifiers. For autosomal chromosomes (1-199), maps to prefix+number. For sex and mitochondrial chromosomes, maps to prefix+X/Y/MT.

Return type:

dict

Raises:

TypeError – If in_chr is not a boolean, xymt or xymt_num are not lists, or prefix is not a string.

Examples

>>> get_number_to_chr()
{1: '1', 2: '2', ..., 23: 'X', 24: 'Y', 25: 'MT', ...}

>>> get_number_to_chr(in_chr=True, prefix="chr")
{'1': 'chr1', '2': 'chr2', ..., '23': 'chrX', '24': 'chrY', '25': 'chrMT', ...}
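
A mapping with the shape described above could be built as follows. The sketch is an assumption based on the documented return value (keys 1 to 199, with the X, Y, and MT entries overriding their numeric slots); `number_to_chr_sketch` is a hypothetical name.

```python
from typing import Dict, Sequence, Union


def number_to_chr_sketch(
    in_chr: bool = False,
    xymt: Sequence[str] = ("X", "Y", "MT"),
    xymt_num: Sequence[int] = (23, 24, 25),
    prefix: str = "",
) -> Dict[Union[int, str], str]:
    # Start with every number mapped to prefix + number.
    mapping: Dict[Union[int, str], str] = {i: prefix + str(i) for i in range(1, 200)}
    # Override the slots reserved for the sex and mitochondrial chromosomes.
    for number, label in zip(xymt_num, xymt):
        mapping[number] = prefix + label
    if in_chr:
        # String keys instead of integer keys.
        mapping = {str(key): value for key, value in mapping.items()}
    return mapping


plain = number_to_chr_sketch()
prefixed = number_to_chr_sketch(in_chr=True, prefix="chr")
print(plain[1], plain[23], plain[25])  # 1 X MT
print(prefixed["2"], prefixed["24"])   # chr2 chrY
```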

ideal_genom.utilities.annotations.get_chr_to_NC(build: str = '38', inverse: bool = False) → dict

Returns a dictionary mapping between chromosome names and NCBI NC identifiers.

This function provides a mapping between chromosome names (like “1”, “X”, “MT”) and their corresponding NCBI RefSeq accession numbers (like “NC_000001.10”) for different human genome builds.

Parameters:

build (str, optional) – The genome build version. Accepted values are “19”, “37”, or “38”. Note that builds “19” and “37” return the same mapping.
inverse (bool, optional) – If True, returns an inverted dictionary where NC identifiers are keys and chromosome names are values. Defaults to False.

Returns:

A dictionary mapping chromosome names to NC identifiers (if inverse=False) or NC identifiers to chromosome names (if inverse=True).

Return type:

dict

Raises:

TypeError – If build is not a string or inverse is not a boolean.
ValueError – If build is not one of “19”, “37”, or “38”.

References

https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.13
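
The effect of inverse=True amounts to swapping keys and values. The sketch below uses a three-entry GRCh38 excerpt rather than the full table, and `invert_mapping` is a hypothetical helper, not a library function.

```python
from typing import Dict

# Three-entry excerpt for GRCh38; the full table covers every chromosome.
chr_to_nc_excerpt: Dict[str, str] = {
    "1": "NC_000001.11",
    "X": "NC_000023.11",
    "MT": "NC_012920.1",
}


def invert_mapping(mapping: Dict[str, str]) -> Dict[str, str]:
    # Safe only because every accession appears exactly once.
    return {accession: chrom for chrom, accession in mapping.items()}


nc_to_chr_excerpt = invert_mapping(chr_to_nc_excerpt)
print(nc_to_chr_excerpt["NC_000023.11"])  # X
```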

ideal_genom.utilities.annotations.gtf_to_all_genes(gtfpath: str) → str

Extract all gene records from a GTF file and save them to a new file.

This function reads a GTF file, extracts all gene records, and saves them to a new file with the suffix ‘_all_genes.gtf.gz’. If the output file already exists, it will be returned without regenerating it.

Parameters:

gtfpath (str) – Path to the input GTF file.

Returns:

Path to the output file containing all gene records.

Return type:

str

Raises:

TypeError – If gtfpath is not a string.

Notes

The function uses the read_gtf function for initial parsing and pandas for extraction. The function assumes the GTF file has a standard format with gene_id attributes.
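
The filter-and-cache behaviour can be approximated with the standard library alone. This sketch deliberately swaps the read_gtf and pandas approach for plain line filtering; `gtf_gene_records_sketch` is a hypothetical name, and the way the output file name is derived from the input name is an assumption.

```python
import gzip
import tempfile
from pathlib import Path


def gtf_gene_records_sketch(gtfpath: str) -> str:
    src = Path(gtfpath)
    # Assumed naming rule: strip '.gtf.gz' or '.gtf', add the documented suffix.
    base = src.name
    for ending in (".gtf.gz", ".gtf"):
        if base.endswith(ending):
            base = base[: -len(ending)]
            break
    out = src.with_name(base + "_all_genes.gtf.gz")
    if out.exists():
        return str(out)  # reuse the existing output instead of regenerating
    opener = gzip.open if src.name.endswith(".gz") else open
    with opener(src, "rt") as fin, gzip.open(out, "wt") as fout:
        for line in fin:
            if line.startswith("#"):
                continue  # header and comment lines are skipped here
            fields = line.split("\t")
            # Column 3 of a GTF record holds the feature type.
            if len(fields) >= 9 and fields[2] == "gene":
                fout.write(line)
    return str(out)


with tempfile.TemporaryDirectory() as tmp:
    gtf = Path(tmp) / "demo.gtf"
    gtf.write_text(
        "#!genome-build GRCh38\n"
        '1\tsrc\tgene\t100\t200\t.\t+\t.\tgene_id "G1";\n'
        '1\tsrc\ttranscript\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
        '2\tsrc\tgene\t300\t400\t.\t-\t.\tgene_id "G2";\n'
    )
    first = gtf_gene_records_sketch(str(gtf))
    second = gtf_gene_records_sketch(str(gtf))  # served from the cached file
    with gzip.open(first, "rt") as handle:
        gene_lines = handle.readlines()
    output_name = Path(first).name

print(output_name, len(gene_lines))  # demo_all_genes.gtf.gz 2
```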

ideal_genom.utilities.annotations.annotate_snp(insumstats: pandas.DataFrame, gtf_path: str, chrom: str = 'CHR', pos: str = 'POS', build: str = '38', source: str = 'ensembl') → pandas.DataFrame

Annotate SNPs with nearest gene name(s) using either Ensembl or RefSeq databases.

This function takes a DataFrame containing SNP data and annotates each variant with information about the nearest gene(s) based on genomic coordinates.

Parameters:

insumstats (pandas.DataFrame) – DataFrame containing SNP data with chromosome and position information.
gtf_path (str) – Path to the GTF (Gene Transfer Format) file for gene annotations.
chrom (str, optional) – Column name in the DataFrame that contains chromosome information. Defaults to “CHR”.
pos (str, optional) – Column name in the DataFrame that contains position information. Defaults to “POS”.
build (str, optional) – Genome build version. Must be one of “19”, “37”, or “38”. Defaults to “38”.
source (str, optional) – Source for gene annotation. Must be either “ensembl” or “refseq”. Defaults to “ensembl”.

Returns:

A copy of the input DataFrame with additional gene annotation columns.

Return type:

pandas.DataFrame

Raises:

TypeError – If input is not a pandas DataFrame or if GTF path is not a string.
ValueError – If required columns are missing in the input DataFrame or if build/source parameters are invalid.

ideal_genom.utilities.annotations.annotate_with_ensembl(output: pandas.DataFrame, chrom: str, pos: str, build: str, gtf_path: str, is_gtf_path: bool) → pandas.DataFrame

Annotate variants with gene information from Ensembl database.

This function adds gene annotations to a DataFrame containing variant information by looking up the genomic coordinates in Ensembl data. It adds ‘LOCATION’ and ‘GENE’ columns to the input DataFrame.

Parameters:

output (pandas.DataFrame) – DataFrame containing variant information with chromosome and position columns.
chrom (str) – Name of the column in the DataFrame that contains chromosome information.
pos (str) – Name of the column in the DataFrame that contains position information.
build (str) – Genome build version to use. Must be one of ‘19’, ‘37’, or ‘38’. Note that ‘19’ and ‘37’ are treated as equivalent (GRCh37).
gtf_path (str) – Path to GTF file with gene annotations or None to use default paths. If None, the appropriate GTF file will be downloaded or used from cache.
is_gtf_path (bool) – If True, gtf_path is treated as a direct path to a GTF file. If False, gtf_path is treated as a directory where the GTF file should be downloaded.

Returns:

The input DataFrame with additional ‘LOCATION’ and ‘GENE’ columns containing gene annotations from Ensembl.

Return type:

pandas.DataFrame

Raises:

TypeError – If output is not a pandas DataFrame or if gtf_path is not a string (when provided).
ValueError – If the required columns are not in the DataFrame or if the build is invalid.

Notes

The function supports both GRCh37 (build ‘19’ or ‘37’) and GRCh38 (build ‘38’) and will download the appropriate annotation files if not already available.

ideal_genom.utilities.annotations.annotate_with_refseq(output: pandas.DataFrame, chrom: str, pos: str, build: str, gtf_path: str, is_gtf_path: bool) → pandas.DataFrame

Annotate genomic variants with RefSeq gene information.

This function adds gene and location annotations to genomic variants using NCBI RefSeq data. It processes the input DataFrame and adds two new columns: ‘LOCATION’ and ‘GENE’.

Parameters:

output (pandas.DataFrame) – DataFrame containing variant information to annotate.
chrom (str) – Column name in DataFrame that contains chromosome information.
pos (str) – Column name in DataFrame that contains position information.
build (str) – Genome build version. Must be one of ‘19’, ‘37’, or ‘38’.
gtf_path (str) – Path to the GTF file. If None, a default path will be used.
is_gtf_path (bool) – If True, gtf_path is treated as a direct file path. If False, gtf_path is treated as a directory.

Returns:

The input DataFrame with added ‘LOCATION’ and ‘GENE’ columns.

Return type:

pandas.DataFrame

Raises:

TypeError – If output is not a pandas DataFrame or if gtf_path is provided but not a string.
ValueError – If required columns are missing from output or if build is invalid.

Notes

For builds ‘19’ and ‘37’, GRCh37 RefSeq annotations are used.
For build ‘38’, GRCh38 RefSeq annotations are used.
Only protein-coding genes are considered for annotation.

ideal_genom.utilities.annotations.prepare_gtf_path(gtf_path: str, is_gtf_path: bool, source: str, build: str) → str

Prepares the path to a GTF (Gene Transfer Format) file for annotation purposes.

This function either uses a user-provided GTF file or downloads one from the specified source (Ensembl or RefSeq) for the given genome build (GRCh37 or GRCh38). If a download is required, it fetches the latest release, unzips it, and processes it to extract all genes.

Parameters:

gtf_path (str) – Path to an existing GTF file, or None if one should be downloaded.
is_gtf_path (bool) – If True, gtf_path is used as an existing GTF file. If False, a GTF file is downloaded instead.
source (str) – Source database for GTF file (‘ensembl’ or ‘refseq’).
build (str) – Genome build version (‘37’ or ‘38’).

Returns:

Path to the prepared GTF file with all genes.

Return type:

str

Notes

If gtf_path is None or is_gtf_path is False, a new GTF file will be downloaded.
If a valid gtf_path is provided, it will be processed using gtf_to_all_genes.
The function logs the actions being performed.

ideal_genom.utilities.annotations.prepare_genome(gtf_path: str, reference_name: str, annotation_name: str) → Genome

Prepare a genome annotation by loading or creating a database from a GTF file.

This function creates a Genome object from a GTF file and ensures that the corresponding database is indexed for efficient access.

Parameters:

gtf_path (str) – Path to the GTF (Gene Transfer Format) file
reference_name (str) – Name of the reference genome
annotation_name (str) – Name of the annotation

Returns:

A Genome object initialized with the provided reference and annotation

Return type:

pyensembl.Genome

Notes

If the database file (with extension .db) doesn’t exist, this function will create it by calling the index() method on the Genome object.

ideal_genom.utilities.annotations.annotate_variants(output: pandas.DataFrame, data: Genome, chrom: str, pos: str, source: str, build: str = '38') → pandas.DataFrame

Annotate variants with their closest genes.

This function processes a DataFrame containing genomic variants and enriches it with gene annotation information by finding the closest gene for each variant.

Parameters:

output (pandas.DataFrame) – DataFrame containing variant information to be annotated.
data (Genome) – Genome object containing reference data for annotation.
chrom (str) – Column name in the output DataFrame that contains chromosome information.
pos (str) – Column name in the output DataFrame that contains position information.
source (str) – Source of the gene annotation data.
build (str, default='38') – Genome build version (default is GRCh38).

Returns:

DataFrame containing gene annotation information for each variant.

Return type:

pandas.DataFrame

Notes

This function applies the get_closest_gene function to each row in the input DataFrame and returns the results as a DataFrame with the same index as the input.