Dataset Arranger

nplinker.arranger

PODP_PROJECT_URL (module-attribute)

DatasetArranger
Arrange datasets based on the fixed working directory structure with the given configuration.
"Arrange datasets" means:
- For local mode (config.mode is local), the datasets provided by users are validated.
- For podp mode (config.mode is podp), the datasets are automatically downloaded or generated, then validated.
The datasets include MIBiG, GNPS, antiSMASH, and BiG-SCAPE data.
Attributes:
- config – A Dynaconf object that contains the configuration settings.
- root_dir – The root directory of the datasets.
- downloads_dir – The directory to store downloaded files.
- mibig_dir – The directory to store MIBiG metadata.
- gnps_dir – The directory to store GNPS data.
- antismash_dir – The directory to store antiSMASH data.
- bigscape_dir – The directory to store BiG-SCAPE data.
- bigscape_running_output_dir – The directory to store the running output of BiG-SCAPE.
Parameters:
- config (Dynaconf) – A Dynaconf object that contains the configuration settings.
Examples:
>>> from nplinker.config import load_config
>>> from nplinker.arranger import DatasetArranger
>>> config = load_config("nplinker.toml")
>>> arranger = DatasetArranger(config)
>>> arranger.arrange()
See Also
DatasetLoader: Load all data from files to memory.
bigscape_running_output_dir (instance-attribute)

bigscape_running_output_dir = bigscape_dir / BIGSCAPE_RUNNING_OUTPUT_DIRNAME
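As a quick illustration, the attribute is simply the BiG-SCAPE directory joined with a fixed sub-directory name. A minimal sketch, assuming the dirname constant resolves to "bigscape_running_output" (the directory mentioned under arrange_bigscape below) and a hypothetical root directory:

```python
from pathlib import Path

BIGSCAPE_RUNNING_OUTPUT_DIRNAME = "bigscape_running_output"  # assumed value of the constant
bigscape_dir = Path("root_dir") / "bigscape"                 # hypothetical default BiG-SCAPE directory

bigscape_running_output_dir = bigscape_dir / BIGSCAPE_RUNNING_OUTPUT_DIRNAME
print(bigscape_running_output_dir)  # root_dir/bigscape/bigscape_running_output
```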
arrange
Arrange all datasets according to the configuration.
The datasets include MIBiG, GNPS, antiSMASH, and BiG-SCAPE.
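The individual arrange_* steps are documented below. As a rough sketch only (the exact call order here is an assumption, not taken from the source), arrange() amounts to something like:

```python
# `arranger` as constructed in the Examples above; rough sketch, not the real implementation.
arranger.arrange_podp_project_json()  # podp mode only
arranger.arrange_mibig()
arranger.arrange_gnps()
arranger.arrange_antismash()
arranger.arrange_bigscape()
arranger.arrange_strain_mappings()
arranger.arrange_strains_selected()
```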
arrange_podp_project_json
Arrange the PODP project JSON file.
This method only works for the podp
mode. If the JSON file does not exist, download it
first; then the downloaded or existing JSON file will be validated according to the
PODP_ADAPTED_SCHEMA.
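Schematically, the validation step is a plain JSON Schema check. A hedged sketch, with a placeholder schema and a hypothetical file path (NPLinker ships the real PODP_ADAPTED_SCHEMA):

```python
import json

from jsonschema import validate

# Placeholder only; the real PODP_ADAPTED_SCHEMA is far more detailed.
PODP_ADAPTED_SCHEMA = {"type": "object", "required": ["genomes", "metabolomics"]}

with open("downloads/podp_project.json") as f:  # hypothetical download location
    project = json.load(f)

validate(instance=project, schema=PODP_ADAPTED_SCHEMA)  # raises ValidationError if invalid
```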
arrange_mibig
Arrange the MIBiG metadata.
If config.mibig.to_use is True, download and extract the MIBiG metadata, overriding any existing MIBiG metadata. This ensures that the MIBiG metadata always matches the version specified in the configuration.
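A minimal sketch of this overwrite behaviour (the MIBiG directory location and the version config key are assumptions; only mibig.to_use is documented above):

```python
import shutil
from pathlib import Path

from nplinker.config import load_config

config = load_config("nplinker.toml")
mibig_dir = Path(config.root_dir) / "mibig"  # assumed location of the MIBiG metadata

if config.mibig.to_use:
    if mibig_dir.exists():
        shutil.rmtree(mibig_dir)  # discard whatever version is already there
    # download and extract the MIBiG metadata for the configured version here
```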
arrange_gnps
Arrange the GNPS data.
For local mode, validate the GNPS data directory.
For podp mode, if the GNPS data does not exist, download it; if it exists but is not valid, remove the data and re-download it (a rough sketch of this loop follows the checklist below).
The validation process includes:
- Check if the GNPS data directory exists.
- Check if the required files exist in the GNPS data directory, including:
  - file_mappings.tsv or file_mappings.csv
  - spectra.mgf
  - molecular_families.tsv
  - annotations.tsv
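A rough sketch of the podp-mode download-or-repair loop described above (the download helper is hypothetical, and the signature of validate_gnps is assumed; see its documentation further down this page):

```python
import shutil
from pathlib import Path

def ensure_gnps_data(gnps_dir: Path) -> None:
    """Download the GNPS data if missing, re-download it if invalid (sketch only)."""
    try:
        validate_gnps(gnps_dir)  # raises FileNotFoundError / ValueError on bad data
        return
    except (FileNotFoundError, ValueError):
        if gnps_dir.exists():
            shutil.rmtree(gnps_dir)  # drop the invalid data
    download_gnps(gnps_dir)  # hypothetical download step
    validate_gnps(gnps_dir)  # validate the fresh download
```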
arrange_antismash
Arrange the antiSMASH data.
For local mode, validate the antiSMASH data.
For podp mode, if the antiSMASH data does not exist, download it; if it exists but is not valid, remove the data and re-download it.
The validation process includes:
- Check if the antiSMASH data directory exists.
- Check if the antiSMASH data directory contains at least one sub-directory, and each sub-directory contains at least one BGC file (with the suffix .region???.gbk where ??? is a number).
The antiSMASH BGC directory must follow the structure below:
antismash
├── genome_id_1 (one antiSMASH output, e.g. GCF_000514775.1)
│ ├── GCF_000514775.1.gbk
│ ├── NZ_AZWO01000004.region001.gbk
│ └── ...
├── genome_id_2
│ ├── ...
└── ...
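Given that layout, collecting the BGC files per genome is a simple directory walk. A sketch assuming the structure above and a hypothetical root directory:

```python
from pathlib import Path

antismash_dir = Path("root_dir") / "antismash"  # assumed location
for genome_dir in sorted(p for p in antismash_dir.iterdir() if p.is_dir()):
    bgc_files = sorted(genome_dir.glob("*.region*.gbk"))
    print(genome_dir.name, len(bgc_files), "BGC files")
```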
arrange_bigscape
Arrange the BiG-SCAPE data.
For local mode, if the BiG-SCAPE data is provided by users, validate it and raise an error if it is invalid. If the BiG-SCAPE data does not exist, run BiG-SCAPE to generate the clustering file.
For podp mode, if the BiG-SCAPE data does not exist, run BiG-SCAPE to generate the clustering file; if it exists but is not valid, remove the data and re-run BiG-SCAPE to generate the data.
The running output of BiG-SCAPE will be saved to the directory bigscape_running_output in the default BiG-SCAPE directory, and the clustering file mix_clustering_c{self.config.bigscape.cutoff}.tsv will be copied to the default BiG-SCAPE directory (a sketch of this naming and copy step follows the checklist below).
The validation process includes:
- Check if the default BiG-SCAPE data directory exists.
- Check if the clustering file mix_clustering_c{self.config.bigscape.cutoff}.tsv exists in the BiG-SCAPE data directory.
- Check if the data_sqlite.db file exists in the BiG-SCAPE data directory.
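As an illustration of the naming: with a cutoff of 0.30, the file that gets copied is mix_clustering_c0.30.tsv. A hedged sketch of the copy step (the exact sub-path inside the running output is an assumption):

```python
import shutil
from pathlib import Path

cutoff = "0.30"  # example value of config.bigscape.cutoff
bigscape_dir = Path("root_dir") / "bigscape"  # assumed default BiG-SCAPE directory
running_output_dir = bigscape_dir / "bigscape_running_output"
clustering_file = f"mix_clustering_c{cutoff}.tsv"

# Search the running output for the clustering file and copy it into the
# default BiG-SCAPE directory.
src = next(running_output_dir.rglob(clustering_file))
shutil.copy(src, bigscape_dir / clustering_file)
```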
arrange_strain_mappings
Arrange the strain mappings file.
For local mode, validate the strain mappings file.
For podp mode, always generate a new strain mappings file and validate it.
The validation checks if the strain mappings file exists and if it is a valid JSON file according to STRAIN_MAPPINGS_SCHEMA.
arrange_strains_selected
Arrange the strains selected file.
If the file exists, validate it according to the schema defined in user_strains.json.
validate_gnps
Validate the GNPS data directory and its contents.
The GNPS data directory must contain the following files:
- file_mappings.tsv or file_mappings.csv
- spectra.mgf
- molecular_families.tsv
- annotations.tsv
Parameters:
Raises:
- FileNotFoundError – If the GNPS data directory is not found or any of the required files is not found.
- ValueError – If both file_mappings.tsv and file_mappings.csv are found.
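A hedged sketch of these checks (not the actual implementation; the parameter name is an assumption since the signature is not shown above):

```python
from pathlib import Path

def validate_gnps_sketch(gnps_dir: Path) -> None:
    if not gnps_dir.exists():
        raise FileNotFoundError(f"GNPS data directory not found: {gnps_dir}")
    has_tsv = (gnps_dir / "file_mappings.tsv").exists()
    has_csv = (gnps_dir / "file_mappings.csv").exists()
    if has_tsv and has_csv:
        raise ValueError("Both file_mappings.tsv and file_mappings.csv are found")
    if not (has_tsv or has_csv):
        raise FileNotFoundError("file_mappings.tsv or file_mappings.csv is missing")
    for name in ("spectra.mgf", "molecular_families.tsv", "annotations.tsv"):
        if not (gnps_dir / name).exists():
            raise FileNotFoundError(f"{name} is missing from {gnps_dir}")
```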
validate_antismash
Validate the antiSMASH data directory and its contents.
The validation only checks the structure of the antiSMASH data directory and file names. It does not check:
- the content of the BGC files
- the consistency between the antiSMASH data and the PODP project JSON file for the podp mode
The antiSMASH data directory must exist and contain at least one sub-directory. The names of the sub-directories must not contain any spaces. Each sub-directory must contain at least one BGC file (with the suffix .region???.gbk where ??? is the region number).
Parameters:
Raises:
- FileNotFoundError – If the antiSMASH data directory is not found, no sub-directories are found in the antiSMASH data directory, or no BGC files are found in any sub-directory.
- ValueError – If any sub-directory name contains a space.
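A sketch of the structural checks described above (again not the real implementation; the parameter name and the exact file-name pattern are assumptions):

```python
import re
from pathlib import Path

BGC_FILE_PATTERN = re.compile(r"\.region\d+\.gbk$")

def validate_antismash_sketch(antismash_dir: Path) -> None:
    if not antismash_dir.exists():
        raise FileNotFoundError(f"antiSMASH data directory not found: {antismash_dir}")
    sub_dirs = [p for p in antismash_dir.iterdir() if p.is_dir()]
    if not sub_dirs:
        raise FileNotFoundError("No sub-directories found in the antiSMASH data directory")
    for sub_dir in sub_dirs:
        if " " in sub_dir.name:
            raise ValueError(f"Sub-directory name contains a space: {sub_dir.name}")
        if not any(BGC_FILE_PATTERN.search(f.name) for f in sub_dir.iterdir() if f.is_file()):
            raise FileNotFoundError(f"No BGC files found in {sub_dir}")
```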
validate_bigscape
Validate the BiG-SCAPE data directory and its contents.
The BiG-SCAPE data directory must exist and contain the clustering file mix_clustering_c{self.config.bigscape.cutoff}.tsv, where {self.config.bigscape.cutoff} is the BiG-SCAPE cutoff value set in the config file.
Alternatively, the directory can contain the BiG-SCAPE database file generated by BiG-SCAPE v2. At the moment, all the family assignments in the database will be used, so this database should contain results from a single run with the desired cutoff.
Parameters:
- bigscape_dir (str | PathLike) – Path to the BiG-SCAPE data directory.
- cutoff (str) – The BiG-SCAPE cutoff value.
Raises:
- FileNotFoundError – If the BiG-SCAPE data directory or the clustering file is not found.
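Usage follows the documented parameters; for example (the path and cutoff value here are placeholders):

```python
from nplinker.arranger import validate_bigscape

try:
    validate_bigscape("root_dir/bigscape", cutoff="0.30")
except FileNotFoundError as e:
    print(f"BiG-SCAPE data is missing or incomplete: {e}")
```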