Get References

Module for downloading and managing reference datasets (1000 Genomes, reference genomes, dbSNP).

class ideal_genom.core.get_references.Fetcher1000Genome(destination: Path | None = None, build: str = '38')[source]

Bases: object

__init__(destination: Path | None = None, build: str = '38')[source]

Initialize a reference data handler.

This class manages reference data files from the 1000 Genomes Project.

Parameters:

destination (Path, optional) – Path where reference files will be stored. If not provided, defaults to ‘../data/1000genomes_build_{build}’.
build (str, optional) – Human genome build version. Defaults to ‘38’.

destination

Directory path where reference files are stored

Type:: Path

build

Human genome build version being used

Type:: str

pgen_file

Path to PGEN format file

Type:: Path

pvar_file

Path to PVAR format file

Type:: Path

psam_file

Path to PSAM format file

Type:: Path

bed_file

Path to BED format file

Type:: Path

bim_file

Path to BIM format file

Type:: Path

fam_file

Path to FAM format file

Type:: Path

get_1000genomes(url_pgen: str | None = None, url_pvar: str | None = None, url_psam: str | None = None) → Path[source]

Download and decompress 1000 Genomes reference data. This method downloads the PLINK2 binary files (.pgen, .pvar, .psam) for the 1000 Genomes reference dataset, corresponding to the specified genome build (37 or 38). If the files already exist in the destination directory, the download is skipped.

Parameters:

url_pgen (str, optional) – Custom URL for downloading the .pgen file. If None, uses the default URL for the genome build.
url_pvar (str, optional) – Custom URL for downloading the .pvar file. If None, uses the default URL for the genome build.
url_psam (str, optional) – Custom URL for downloading the .psam file. If None, uses the default URL for the genome build.

Returns:

Path: Path object pointing to the decompressed .pgen file location.

Note:

The method requires plink2 to be installed and available on the system PATH so that the .pgen file can be decompressed.
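The "skip if already present" behaviour described above can be pictured with a small check. This is only a sketch: the helper name missing_pfiles and the prefix argument are hypothetical, and the real file names used by Fetcher1000Genome are not stated in this reference.

```python
from pathlib import Path


def missing_pfiles(destination: Path, prefix: str) -> list[Path]:
    """Return the PLINK2 files (.pgen, .pvar, .psam) that are not yet on disk.

    `prefix` is a hypothetical file stem; the library's real naming may differ.
    """
    expected = [destination / f"{prefix}{ext}" for ext in (".pgen", ".pvar", ".psam")]
    # Only files that are absent would need to be downloaded.
    return [path for path in expected if not path.exists()]
```

An empty result would correspond to the case where the download is skipped entirely.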

get_1000genomes_binaries() → Path[source]

Convert downloaded 1000 Genomes data into PLINK binary files (.bed, .bim, .fam).

This method processes the downloaded 1000 Genomes data files and converts them into PLINK binary format. If the binary files already exist, it skips the conversion process. The method handles file cleanup and proper renaming of output files. The conversion is done in two steps:

1. Convert pfile to binary format, including only SNPs from chromosomes 1-22, X, Y, and MT
2. Update variant IDs and create final binary files

Returns:: Path object pointing to the generated binary files (without extension). The files actually created are .bed, .bim, .fam and .psam, all sharing that prefix.
Return type:: Path
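The two-step conversion could be expressed as two plink2 invocations along the following lines. The flags shown (--pfile, --snps-only, --chr, --make-bed, --bfile, --set-all-var-ids, --out) are standard PLINK 2 options, but whether the library uses exactly these flags, this ID template, or this temporary prefix is an assumption. The sketch only builds the argument lists and does not run plink2.

```python
def plink2_conversion_commands(pfile_prefix: str, out_prefix: str) -> list[list[str]]:
    """Build argument lists for a two-step pfile-to-bfile conversion (illustrative only)."""
    tmp_prefix = f"{out_prefix}-tmp"  # hypothetical intermediate prefix
    # Step 1: keep only SNPs on chromosomes 1-22, X, Y, MT and write .bed/.bim/.fam.
    step1 = [
        "plink2", "--pfile", pfile_prefix,
        "--snps-only",
        "--chr", "1-22,X,Y,MT",
        "--make-bed", "--out", tmp_prefix,
    ]
    # Step 2: rewrite variant IDs (assumed template chrom:pos:ref:alt) and write final files.
    step2 = [
        "plink2", "--bfile", tmp_prefix,
        "--set-all-var-ids", "@:#:$r:$a",
        "--make-bed", "--out", out_prefix,
    ]
    return [step1, step2]
```

Each list could be passed to subprocess.run(cmd, check=True) on a machine where plink2 is installed.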

class ideal_genom.core.get_references.ReferenceDataFetcher(base_url: str, build: str, source: str, destination_folder: str | None = None)[source]

Bases: object

A class for fetching, downloading, and processing reference genome data.

This class provides a framework for retrieving genomic reference data from various sources. It handles downloading compressed files, unzipping them, and extracting gene information from GTF files.

build

The genome build (e.g., ‘hg38’, ‘GRCh38’).

Type:: str

source

The data source (e.g., ‘ensembl’, ‘ucsc’).

Type:: str

base_url

The base URL to fetch data from.

Type:: str

destination_folder

The directory to save downloaded files. If None, defaults to project_root/data/{source}_latest.

Type:: Optional[str]

latest_url

The URL of the latest release after calling get_latest_release().

Type:: Optional[str]

gz_file

Path to the downloaded compressed file.

Type:: Optional[str]

gtf_file

Path to the uncompressed GTF file.

Type:: Optional[str]

Notes

This is an abstract base class that requires subclasses to implement the get_latest_release() method for specific data sources.
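The subclassing contract can be illustrated with a stand-alone toy hierarchy. SketchFetcher and StaticFetcher are hypothetical names used only for this sketch; it does not import ideal_genom, and the real base class may enforce the contract differently (for example without the abc module).

```python
from abc import ABC, abstractmethod
from typing import Optional


class SketchFetcher(ABC):
    """Toy stand-in for a fetcher base class (not the library's implementation)."""

    def __init__(self, base_url: str, build: str, source: str) -> None:
        self.base_url = base_url
        self.build = build
        self.source = source
        self.latest_url: Optional[str] = None  # filled in by get_latest_release()

    @abstractmethod
    def get_latest_release(self) -> None:
        """Subclasses decide how the download URL is discovered."""


class StaticFetcher(SketchFetcher):
    """Hypothetical subclass that 'discovers' a fixed file name."""

    def get_latest_release(self) -> None:
        self.latest_url = f"{self.base_url}/annotation.gtf.gz"
```

With this layout, shared steps such as downloading and unzipping can live in the base class while each source supplies only its own URL discovery.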

__init__(base_url: str, build: str, source: str, destination_folder: str | None = None) → None[source]

get_latest_release() → None[source]: Determine the specific URL for fetching data.

download_latest() → str[source]

Downloads the latest file from self.latest_url to self.destination_folder.

Raises:

AttributeError – If self.latest_url is not set.
requests.exceptions.RequestException – If the HTTP request fails.
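A minimal sketch of the download step and its precondition follows. It uses the standard library's urllib instead of requests purely to stay dependency-free, so the exception raised on a failed transfer differs from the one documented above. The function name download_to_folder is hypothetical.

```python
import shutil
import urllib.request
from pathlib import Path
from typing import Optional


def download_to_folder(latest_url: Optional[str], destination_folder: Path) -> str:
    """Stream the file at `latest_url` into `destination_folder` and return its path."""
    if latest_url is None:
        # Mirrors the documented precondition: the URL must be resolved first.
        raise AttributeError("latest_url is not set; call get_latest_release() first")
    destination_folder.mkdir(parents=True, exist_ok=True)
    # Name the local file after the last URL path component.
    target = destination_folder / latest_url.rsplit("/", 1)[-1]
    with urllib.request.urlopen(latest_url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)  # copy in chunks, not all at once
    return str(target)
```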

get_destination_folder() → Path[source]: Determine the destination folder for downloads.

unzip_latest() → str[source]: Unzips the latest downloaded file and stores it as a GTF file.

get_all_genes() → str[source]

Extract all genes from the GTF file and save them to a new compressed file.

This method reads the GTF file specified in self.gtf_file, filters for gene features, and creates a new GTF file containing only the gene entries. If the output file already exists, it will return the path without reprocessing.

Returns:

Path to the compressed GTF file containing all genes

Return type:

str

Raises:

FileNotFoundError – If the reference GTF file (self.gtf_file) is not found
TypeError – If read_gtf does not return a pandas DataFrame

Note

The output file will be named based on the input GTF file with “-all_genes.gtf.gz” suffix
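The gene-filtering step can be sketched with plain text processing. The documented method works on a pandas DataFrame returned by read_gtf; the version below avoids that dependency and simply keeps rows whose third GTF column is "gene". The function name and the exact way the output name is derived from the input are assumptions.

```python
import gzip
from pathlib import Path


def extract_gene_rows(gtf_file: str) -> str:
    """Write only the 'gene' feature rows of a GTF file to a gzip-compressed copy."""
    source = Path(gtf_file)
    if not source.is_file():
        raise FileNotFoundError(f"GTF file not found: {source}")
    # Assumed naming: drop a trailing ".gtf" and append the documented suffix.
    output = source.with_name(source.name.removesuffix(".gtf") + "-all_genes.gtf.gz")
    if output.exists():
        return str(output)  # already processed, nothing to do
    with open(source, "rt") as src, gzip.open(output, "wt") as dst:
        for line in src:
            if line.startswith("#"):
                continue  # skip header/comment lines
            fields = line.split("\t")
            # Column 3 of a GTF record is the feature type.
            if len(fields) >= 3 and fields[2] == "gene":
                dst.write(line)
    return str(output)
```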

class ideal_genom.core.get_references.FetcherLDRegions(destination: Path | None = None, build: str = '38')[source]

Bases: ReferenceDataFetcher

__init__(destination: Path | None = None, build: str = '38')[source]

Initialize a FetcherLDRegions object. This initializer sets up the destination path for LD regions files and the genome build version. If no destination is provided, it defaults to a ‘data/ld_regions_files’ directory relative to the parent directory of the current file.

Parameters:

destination (Path, optional) – Path where LD region files will be stored. If None, uses default path.
build (str, optional) – Genome build version, defaults to ‘38’.

destination

Directory path where LD region files are stored

Type:: Path

build

Genome build version being used

Type:: str

ld_regions

Placeholder for LD regions data, initially set to None

Type:: None

get_ld_regions() → Path[source]

Download or create high LD regions file based on genome build version.

This method handles the retrieval of high Linkage Disequilibrium (LD) regions for different genome builds (37 or 38). For build 37, it downloads the regions from a GitHub repository. For build 38, it creates the file from predefined coordinates.

Returns:: Path to the created/downloaded LD regions file. Returns an empty Path if the download fails for build 37.
Return type:: Path
Raises:: None explicitly, but standard I/O-related exceptions may propagate.

Notes

For build 37: Downloads from genepi-freiburg/gwas repository
For build 38: Creates file from hardcoded coordinates from GWAS-pipeline
Files are named as ‘high-LD-regions_GRCh{build}.txt’
Creates destination directory if it doesn’t exist
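The naming rule and the "empty Path" failure signal from the notes above can be made concrete as follows. Both helper names are hypothetical; the sketch covers only the file name and directory creation, not the download or the hardcoded coordinates.

```python
from pathlib import Path


def ld_regions_path(destination: Path, build: str) -> Path:
    """Return the expected LD regions file path, creating the folder if needed."""
    destination.mkdir(parents=True, exist_ok=True)
    return destination / f"high-LD-regions_GRCh{build}.txt"


def download_failed(result: Path) -> bool:
    """An 'empty' Path is Path(), which compares equal to Path('.')."""
    return result == Path()
```

A caller would likely want to check for the empty Path before using the result, since it points at the current directory rather than at a file.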

class ideal_genom.core.get_references.Ensembl38Fetcher(destination_folder=None)[source]

Bases: ReferenceDataFetcher

A class for fetching human genome reference data from Ensembl based on GRCh38 build.

This class extends ReferenceDataFetcher to specifically handle Ensembl’s human genome data with build 38. It provides functionality to find and retrieve the latest GTF file from Ensembl’s FTP server.

base_url

Base URL for Ensembl FTP server where GTF files are stored

Type:: str

build

Genome build version (‘38’)

Type:: str

source

Data source (‘ensembl’)

Type:: str

destination_folder

Local folder to store downloaded files

Type:: str

latest_url

URL of the latest GTF file after calling get_latest_release()

Type:: str

Raises:

Exception – If the Ensembl FTP server cannot be accessed
FileNotFoundError – If no matching GTF file is found

__init__(destination_folder=None)[source]

get_latest_release() → None[source]

Retrieves the URL of the latest GTF file for human genome (GRCh38) from the base URL.

This method scrapes the base URL to find the most recent Homo_sapiens GRCh38 GTF file available for download. Upon finding the file, it constructs the complete URL and stores it in the instance variable latest_url.

Return type:

None

Raises:

Exception – If the base URL cannot be accessed (non-200 response)
FileNotFoundError – If no GTF file matching the criteria is found
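The scraping step might look roughly like the sketch below, which parses a directory listing that has already been fetched. The file-name pattern (Homo_sapiens.GRCh38.<release>.gtf.gz) is an assumption about what "matching the criteria" means, and LinkCollector and find_grch38_gtf are hypothetical names.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def find_grch38_gtf(html: str, base_url: str) -> str:
    """Return the full URL of the first link matching the assumed GTF pattern."""
    collector = LinkCollector()
    collector.feed(html)
    pattern = re.compile(r"Homo_sapiens\.GRCh38\.\d+\.gtf\.gz")
    for href in collector.hrefs:
        if pattern.fullmatch(href):
            return base_url.rstrip("/") + "/" + href
    raise FileNotFoundError("No GRCh38 GTF file found in the directory listing")
```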

class ideal_genom.core.get_references.Ensembl37Fetcher(destination_folder=None)[source]

Bases: ReferenceDataFetcher

A class for fetching reference genome data from Ensembl’s GRCh37 (hg19) repository.

This class specializes the ReferenceDataFetcher to work specifically with Ensembl’s GRCh37 human genome build. It provides functionality to automatically detect and download the latest available GTF file for Homo sapiens from the Ensembl GRCh37 archive. The fetcher connects to Ensembl’s FTP server, identifies the most recent release available for GRCh37, and locates the chromosome GTF file for human genome data.

base_url

The base URL for Ensembl’s GRCh37 repository

Type:: str

build

The genome build identifier (‘37’)

Type:: str

source

The data source identifier (‘ensembl’)

Type:: str

latest_url

The complete URL to the latest GTF file, populated after calling get_latest_release()

Type:: str

__init__(destination_folder=None)[source]

Initialize a reference genome downloader for Ensembl GRCh37. This constructor configures the downloader to retrieve data from Ensembl’s GRCh37 repository.

Parameters:: destination_folder (str, optional) – The folder where downloaded files will be stored. If None, a default location will be used based on the parent class implementation.

get_latest_release() → None[source]

Fetches the URL of the latest GTF file for Homo sapiens GRCh37 from Ensembl.

This method:

1. Connects to the base URL and identifies all available release folders
2. Determines the latest release by finding the highest release number
3. Navigates to the GTF directory for that release
4. Locates the Homo sapiens GRCh37 chromosome GTF file
5. Stores the complete download URL in self.latest_url

Raises:

Exception – If the base URL cannot be accessed
Exception – If no release folders are found
Exception – If the latest release folder cannot be accessed
FileNotFoundError – If the GTF file is not found in the latest release

Return type:

None
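Picking the latest release by highest release number requires a numeric comparison rather than a plain string sort. The sketch below assumes folder names of the form release-<number>/, which is not confirmed by this reference; the function name is hypothetical.

```python
import re


def latest_release_folder(folder_names: list[str]) -> str:
    """Return the folder whose release number is numerically largest."""
    numbered = []
    for name in folder_names:
        match = re.fullmatch(r"release-(\d+)/?", name)
        if match:
            # Store the number as int so 100 sorts after 87 and 9.
            numbered.append((int(match.group(1)), name))
    if not numbered:
        raise RuntimeError("No release folders found")
    return max(numbered)[1]
```

Sorting the names as strings would rank release-9 above release-87, which is why the number is parsed first.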

class ideal_genom.core.get_references.RefSeqFetcher(build: str, destination_folder: str | None = None)[source]

Bases: ReferenceDataFetcher

A class for fetching and downloading reference genome data from NCBI’s RefSeq repository.

This class extends ReferenceDataFetcher to specifically handle downloading human genome reference files from the RefSeq database. It supports different genome builds (e.g., ‘GRCh37’, ‘GRCh38’) and automatically identifies the latest version available for the specified build. The class handles navigating the NCBI FTP directory structure, finding the appropriate GTF files for the requested genome build, and managing the download process.

base_url

The base URL for the NCBI RefSeq FTP server directory.

Type:: str

build

The genome build version (‘37’ for GRCh37, ‘38’ for GRCh38).

Type:: str

source

The source of the reference data (set to ‘refseq’).

Type:: str

latest_url

URL to the latest GTF file, set after calling get_latest_release().

Type:: str

__init__(build: str, destination_folder: str | None = None)[source]

get_latest_release() → None[source]

Fetches the latest GTF file dynamically from the specified base URL.

This method sends a GET request to the base URL, parses the HTML response to find the latest GTF file link, and sets the latest_url attribute to the full URL of the latest GTF file.

Raises:: FileNotFoundError – If no GTF file is found in the HTML response.
Return type:: None

class ideal_genom.core.get_references.AssemblyReferenceFetcher(base_url: str, build: str, extension: str, destination_folder: str | None = None, avoid_substring: str = 'extra')[source]

Bases: ReferenceDataFetcher

A class for fetching and preparing genomic reference files from online repositories.

This class handles the process of:

1. Finding the appropriate reference file URL based on build parameters
2. Downloading the reference file
3. Unzipping compressed reference files if necessary

Parameters:

base_url (str) – The base URL where reference files are hosted
build (str) – The genome build identifier (e.g., ‘GRCh38’, ‘hg19’)
extension (str) – File extension to look for (e.g., ‘.gtf.gz’, ‘.fa.gz’)
destination_folder (Optional[str], default=None) – Path where files should be downloaded. If None, uses project_root/data/assembly_references
avoid_substring (str, default='extra') – Substring to avoid when selecting reference files

reference_url

URL of the identified reference file

Type:: str or None

reference_file

Filename of the identified reference file

Type:: str or None

file_path

Local path to the downloaded reference file

Type:: Path or None

Raises:

Exception – If the base URL cannot be accessed
FileNotFoundError – If no matching reference file is found
AttributeError – If methods are called out of sequence
ValueError – If required attributes are None when needed

__init__(base_url: str, build: str, extension: str, destination_folder: str | None = None, avoid_substring: str = 'extra') → None[source]

get_reference_url() → str[source]

Retrieves the URL for the reference file from the base URL.

This method performs an HTTP GET request to the base URL, parses the HTML content, and searches for links matching specific criteria:

Contains the build version string
Ends with the specified extension
Does not contain the specified substring to avoid

The first matching link is considered the reference file.

Returns:

The complete URL to the reference file

Return type:

str

Raises:

Exception – If the base URL cannot be accessed
FileNotFoundError – If no matching reference file is found

Notes

Sets self.reference_file to the name of the found file
Sets self.reference_url to the complete URL
Logs information about the found file and URL
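The three link criteria translate directly into a predicate over candidate links. This sketch operates on a list of hrefs that has already been extracted from the page; the function name pick_reference_link is hypothetical.

```python
from typing import Iterable


def pick_reference_link(
    hrefs: Iterable[str],
    build: str,
    extension: str,
    avoid_substring: str = "extra",
) -> str:
    """Return the first link that names the build, has the extension, and avoids the substring."""
    for href in hrefs:
        if build in href and href.endswith(extension) and avoid_substring not in href:
            return href
    raise FileNotFoundError(
        f"No link containing {build!r} and ending with {extension!r} was found"
    )
```

Because the first match wins, the order of links on the page determines the result when several files satisfy all three criteria.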

download_reference_file() → str[source]

Downloads a reference file from the specified URL to the destination folder.

This method first checks whether the reference file already exists locally. If not found, it also looks for an alternative version with a ‘.fa’ extension. If neither is present, it downloads the file from the given URL.

Raises:

AttributeError – If self.reference_url or self.reference_file are not set.
ValueError – If self.reference_url or self.reference_file are set to None.

Returns:

The path to the downloaded or existing reference file.

Return type:

str

Note

self.reference_url and self.reference_file must be set by calling get_reference_url() before using this method.

unzip_reference_file() → str[source]

Unzips a reference genome file (typically .fa.gz to .fa) and returns the path to the unzipped file.

This method checks if the file is already unzipped, and if not, unzips it using gzip. After successful unzipping, the original compressed file is deleted.

Returns:

Path to the unzipped reference file (.fa)

Return type:

str

Raises:

AttributeError – If self.reference_file is not set (get_reference_url should be called first)
AttributeError – If self.file_path is not set or None (download_reference_file should be called first)
OSError – If an error occurs during the unzipping process
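The unzip-then-delete behaviour can be sketched with the standard gzip module. The function name gunzip_and_remove is hypothetical, and the sketch takes the path as an argument instead of reading self.file_path.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_and_remove(gz_path: Path) -> str:
    """Decompress `name.gz` to `name`, delete the archive, and return the new path."""
    target = gz_path.with_suffix("")  # "ref.fa.gz" becomes "ref.fa"
    if target.exists():
        return str(target)  # already unzipped, nothing to do
    with gzip.open(gz_path, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream to avoid loading the genome into memory
    gz_path.unlink()  # remove the compressed original only after a successful write
    return str(target)
```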