Utilities
nplinker.genomics.utils
generate_mappings_genome_id_bgc_id
generate_mappings_genome_id_bgc_id(
bgc_dir: str | PathLike,
output_file: str | PathLike | None = None,
) -> None
Generate a file that maps genome id to BGC id.
The input bgc_dir
must follow the structure of the antismash
directory defined in
Working Directory Structure, e.g.:
bgc_dir
├── genome_id_1
│ ├── bgc_id_1.gbk
│ └── ...
├── genome_id_2
│ ├── bgc_id_2.gbk
│ └── ...
└── ...
Parameters:
-
bgc_dir
(str | PathLike
) –The directory containing one layer of subfolders, where each subfolder contains BGC files in
.gbk
format. It assumes that
- the subfolder name is the genome id (e.g. a RefSeq id),
- the BGC file name is the BGC id.
-
output_file
(str | PathLike | None
, default:None
) –The path to the output file. The file will be overwritten if it already exists.
Defaults to None, in which case the output file will be placed in the directory
bgc_dir
with the file name GENOME_BGC_MAPPINGS_FILENAME.
Source code in src/nplinker/genomics/utils.py
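The directory convention above can be illustrated with a minimal sketch that uses only the standard library. The helper `collect_genome_bgc_ids` is hypothetical and is not part of `nplinker`; it only shows how subfolder names and `.gbk` file names translate into genome ids and BGC ids. The format of the real output file is not shown here.

```python
import tempfile
from pathlib import Path


def collect_genome_bgc_ids(bgc_dir):
    """Map each subfolder name (genome id) to the sorted stems of its .gbk files (BGC ids).

    Hypothetical helper for illustration only; not the nplinker implementation.
    """
    mappings = {}
    for genome_dir in sorted(Path(bgc_dir).iterdir()):
        if not genome_dir.is_dir():
            continue  # only one layer of subfolders is considered
        mappings[genome_dir.name] = sorted(p.stem for p in genome_dir.glob("*.gbk"))
    return mappings


# Build a throwaway directory that follows the layout shown above.
with tempfile.TemporaryDirectory() as tmp:
    for genome_id, bgc_id in [("genome_id_1", "bgc_id_1"), ("genome_id_2", "bgc_id_2")]:
        genome_dir = Path(tmp) / genome_id
        genome_dir.mkdir()
        (genome_dir / f"{bgc_id}.gbk").touch()
    example_mappings = collect_genome_bgc_ids(tmp)

print(example_mappings)
# {'genome_id_1': ['bgc_id_1'], 'genome_id_2': ['bgc_id_2']}
```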
add_strain_to_bgc
Assign a Strain object to BGC.strain
for input BGCs.
The BGC id is used to find the corresponding Strain object. It is possible that no Strain object is found for a BGC id.
Note
The input bgcs
will be changed in place.
Parameters:
-
strains
(StrainCollection
) –A collection of all strain objects.
-
bgcs
(Sequence[BGC]
) –A list of BGC objects.
Returns:
-
tuple[list[BGC], list[BGC]]
–A tuple of two lists of BGC objects,
- the first list contains BGC objects that were updated with a Strain object;
- the second list contains BGC objects that were not updated because no Strain object was found.
Raises:
-
ValueError
–If multiple Strain objects are found for a BGC id.
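The behaviour described above (in-place assignment, a two-list return value, and a ValueError on ambiguous matches) can be sketched without `nplinker`. Everything in this example is an assumption made for illustration: `ToyBGC` stands in for BGC, a plain dict of BGC id to a list of strain names stands in for StrainCollection, and `assign_strains` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyBGC:
    """Hypothetical stand-in for BGC: an id plus an optional strain."""

    id: str
    strain: Optional[str] = None


def assign_strains(strain_lookup, bgcs):
    """Set bgc.strain in place; return (bgcs with a strain, bgcs without one).

    strain_lookup is a dict of BGC id -> list of matching strain names
    (a simplification of a real strain collection lookup).
    """
    with_strain, without_strain = [], []
    for bgc in bgcs:
        matches = strain_lookup.get(bgc.id, [])
        if len(matches) > 1:
            raise ValueError(f"Multiple strains found for BGC id '{bgc.id}'")
        if matches:
            bgc.strain = matches[0]
            with_strain.append(bgc)
        else:
            without_strain.append(bgc)
    return with_strain, without_strain


strain_lookup = {"bgc_id_1": ["strain_A"]}
toy_bgcs = [ToyBGC("bgc_id_1"), ToyBGC("bgc_id_2")]
found, missing = assign_strains(strain_lookup, toy_bgcs)
print([b.id for b in found], [b.id for b in missing])
# ['bgc_id_1'] ['bgc_id_2']
```

Because the input objects are modified in place, `toy_bgcs[0].strain` is set after the call, while `toy_bgcs[1].strain` stays `None`.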
add_bgc_to_gcf
add_bgc_to_gcf(
bgcs: Sequence[BGC], gcfs: Sequence[GCF]
) -> tuple[list[GCF], list[GCF], dict[GCF, set[str]]]
Add BGC objects to GCF objects based on each GCF's BGC ids.
The attribute GCF.bgc_ids
contains the ids of BGC objects. These ids
are used to find BGC objects from the input bgcs
list. The found BGC
objects are added to the bgcs
attribute of the GCF object. It is possible that
some BGC ids are not found in the input bgcs
list, and so their BGC
objects are missing in the GCF object.
Note
This function changes the lists bgcs
and gcfs
in place.
Parameters:
-
bgcs
(Sequence[BGC]
) –A list of BGC objects.
-
gcfs
(Sequence[GCF]
) –A list of GCF objects.
Returns:
-
tuple[list[GCF], list[GCF], dict[GCF, set[str]]]
–A tuple of two lists and a dictionary,
- The first list contains GCF objects that are updated with BGC objects;
- The second list contains GCF objects that are not updated with BGC objects because no BGC objects are found;
- The dictionary contains GCF objects as keys and a set of ids of missing BGC objects as values.
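The three-part return value can be illustrated with a small sketch. `ToyGCF` and `attach_bgcs` are hypothetical stand-ins, not the `nplinker` classes or implementation, and BGC objects are represented by plain strings looked up in a dict.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # eq=False keeps identity-based hashing, so instances can be dict keys
class ToyGCF:
    """Hypothetical stand-in for GCF: an id, the expected BGC ids, and the attached BGCs."""

    id: str
    bgc_ids: set
    bgcs: list = field(default_factory=list)


def attach_bgcs(bgcs_by_id, gcfs):
    """Attach known BGCs to each GCF in place and report what could not be found.

    bgcs_by_id is a dict of BGC id -> BGC-like object.
    Returns (gcfs with at least one BGC, gcfs with none, {gcf: missing BGC ids}).
    """
    with_bgc, without_bgc, missing = [], [], {}
    for gcf in gcfs:
        for bgc_id in sorted(gcf.bgc_ids):
            if bgc_id in bgcs_by_id:
                gcf.bgcs.append(bgcs_by_id[bgc_id])
            else:
                missing.setdefault(gcf, set()).add(bgc_id)
        (with_bgc if gcf.bgcs else without_bgc).append(gcf)
    return with_bgc, without_bgc, missing


gcf_1 = ToyGCF("gcf_1", {"bgc_a", "bgc_b"})
gcf_2 = ToyGCF("gcf_2", {"bgc_c"})
updated, not_updated, missing_ids = attach_bgcs({"bgc_a": "BGC(bgc_a)"}, [gcf_1, gcf_2])

print([g.id for g in updated])      # ['gcf_1']
print([g.id for g in not_updated])  # ['gcf_2']
print({g.id: ids for g, ids in missing_ids.items()})
# {'gcf_1': {'bgc_b'}, 'gcf_2': {'bgc_c'}}
```

Note that a GCF can appear in the first list and still have entries in the dictionary, as `gcf_1` does here: it received one BGC but another of its ids was not found.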
get_mibig_from_gcf
Get MIBiG BGCs and strains from GCF objects.
Parameters:
Returns:
-
tuple[list[BGC], StrainCollection]
–A tuple of two objects,
- the first is a list of MIBiG BGC objects used in the GCFs;
- the second is a StrainCollection object that contains all Strain objects used in the GCFs.
extract_mappings_strain_id_original_genome_id
extract_mappings_strain_id_original_genome_id(
podp_project_json_file: str | PathLike,
) -> dict[str, set[str]]
Extract mappings "strain_id <-> original_genome_id".
Tip
The podp_project_json_file
is the JSON file downloaded from the PODP platform.
For example, for PODP project MSV000079284, its JSON file is https://pairedomicsdata.bioinformatics.nl/api/projects/4b29ddc3-26d0-40d7-80c5-44fb6631dbf9.4.
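Parameters:
-
podp_project_json_file
(str | PathLike
) –The path to the PODP project JSON file.
Returns:
-
dict[str, set[str]]
–Mappings "strain_id <-> original_genome_id".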
See Also
- podp_generate_strain_mappings: Generate strain mappings JSON file for PODP pipeline.
extract_mappings_original_genome_id_resolved_genome_id
extract_mappings_original_genome_id_resolved_genome_id(
genome_status_json_file: str | PathLike,
) -> dict[str, str]
Extract mappings "original_genome_id <-> resolved_genome_id".
Tip
The genome_status_json_file
is generated by the podp_download_and_extract_antismash_data
function with a default file name GENOME_STATUS_FILENAME.
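Parameters:
-
genome_status_json_file
(str | PathLike
) –The path to the genome status JSON file.
Returns:
-
dict[str, str]
–Mappings "original_genome_id <-> resolved_genome_id".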
See Also
- podp_generate_strain_mappings: Generate strain mappings JSON file for PODP pipeline.
extract_mappings_resolved_genome_id_bgc_id
extract_mappings_resolved_genome_id_bgc_id(
genome_bgc_mappings_file: str | PathLike,
) -> dict[str, set[str]]
Extract mappings "resolved_genome_id <-> bgc_id".
Tip
The genome_bgc_mappings_file
is usually generated by the
generate_mappings_genome_id_bgc_id
function with a default file name GENOME_BGC_MAPPINGS_FILENAME.
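Parameters:
-
genome_bgc_mappings_file
(str | PathLike
) –The path to the genome-BGC mappings file.
Returns:
-
dict[str, set[str]]
–Mappings "resolved_genome_id <-> bgc_id".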
See Also
- podp_generate_strain_mappings: Generate strain mappings JSON file for PODP pipeline.
get_mappings_strain_id_bgc_id
get_mappings_strain_id_bgc_id(
mappings_strain_id_original_genome_id: Mapping[
str, set[str]
],
mappings_original_genome_id_resolved_genome_id: Mapping[
str, str
],
mappings_resolved_genome_id_bgc_id: Mapping[
str, set[str]
],
) -> dict[str, set[str]]
Get mappings "strain_id <-> bgc_id".
Parameters:
-
mappings_strain_id_original_genome_id
(Mapping[str, set[str]]
) –Mappings "strain_id <-> original_genome_id".
-
mappings_original_genome_id_resolved_genome_id
(Mapping[str, str]
) –Mappings "original_genome_id <-> resolved_genome_id".
-
mappings_resolved_genome_id_bgc_id
(Mapping[str, set[str]]
) –Mappings "resolved_genome_id <-> bgc_id".
Returns:
-
dict[str, set[str]]
–Mappings "strain_id <-> bgc_id".
See Also
- extract_mappings_strain_id_original_genome_id: Extract mappings "strain_id <-> original_genome_id".
- extract_mappings_original_genome_id_resolved_genome_id: Extract mappings "original_genome_id <-> resolved_genome_id".
- extract_mappings_resolved_genome_id_bgc_id: Extract mappings "resolved_genome_id <-> bgc_id".
- podp_generate_strain_mappings: Generate strain mappings JSON file for PODP pipeline.
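The three input mappings form a chain from strain id to BGC id, which can be sketched in plain Python. `chain_mappings` is a hypothetical helper, not the `nplinker` implementation. Skipping original genome ids that have no resolved id, and omitting strains that end up with no BGC ids, are assumptions made for this sketch.

```python
def chain_mappings(strain_to_original, original_to_resolved, resolved_to_bgc):
    """Compose strain_id -> original_genome_id -> resolved_genome_id -> bgc_id.

    Illustrative only: unresolved genome ids are skipped and strains without
    any BGC id are left out of the result (assumptions, not confirmed behaviour).
    """
    strain_to_bgc = {}
    for strain_id, original_ids in strain_to_original.items():
        bgc_ids = set()
        for original_id in original_ids:
            resolved_id = original_to_resolved.get(original_id)
            if resolved_id is None:
                continue  # no resolved genome id for this original id
            bgc_ids.update(resolved_to_bgc.get(resolved_id, set()))
        if bgc_ids:
            strain_to_bgc[strain_id] = bgc_ids
    return strain_to_bgc


strain_to_original = {"strain_1": {"orig_1", "orig_2"}, "strain_2": {"orig_3"}}
original_to_resolved = {"orig_1": "res_1", "orig_2": "res_2"}
resolved_to_bgc = {"res_1": {"bgc_1", "bgc_2"}, "res_2": {"bgc_3"}}

strain_to_bgc = chain_mappings(strain_to_original, original_to_resolved, resolved_to_bgc)
print(sorted(strain_to_bgc["strain_1"]))  # ['bgc_1', 'bgc_2', 'bgc_3']
print("strain_2" in strain_to_bgc)        # False
```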