Dataset Arranger

nplinker.arranger

PODP_PROJECT_URL (module-attribute)

DatasetArranger
Arrange datasets based on the fixed working directory structure with the given configuration.
"Arrange datasets" means:
- For local mode (config.mode is local), the datasets provided by users are validated.
- For podp mode (config.mode is podp), the datasets are automatically downloaded or generated, then validated.
The datasets include MIBiG, GNPS, antiSMASH, and BiG-SCAPE data.
Attributes:
- config – A Dynaconf object that contains the configuration settings.
- root_dir – The root directory of the datasets.
- downloads_dir – The directory to store downloaded files.
- mibig_dir – The directory to store MIBiG metadata.
- gnps_dir – The directory to store GNPS data.
- antismash_dir – The directory to store antiSMASH data.
- bigscape_dir – The directory to store BiG-SCAPE data.
- bigscape_running_output_dir – The directory to store the running output of BiG-SCAPE.
Parameters:
- config (Dynaconf) – A Dynaconf object that contains the configuration settings.
Examples:
>>> from nplinker.config import load_config
>>> from nplinker.arranger import DatasetArranger
>>> config = load_config("nplinker.toml")
>>> arranger = DatasetArranger(config)
>>> arranger.arrange()
See Also
DatasetLoader: Load all data from files to memory.
bigscape_running_output_dir (instance-attribute)

bigscape_running_output_dir = bigscape_dir / BIGSCAPE_RUNNING_OUTPUT_DIRNAME
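As a quick illustration, the attribute is simply the BiG-SCAPE directory joined with a fixed sub-directory name. A minimal sketch, assuming the dirname constant resolves to "bigscape_running_output" (the directory mentioned under arrange_bigscape below) and a hypothetical root directory:

```python
from pathlib import Path

BIGSCAPE_RUNNING_OUTPUT_DIRNAME = "bigscape_running_output"  # assumed value of the constant
bigscape_dir = Path("root_dir") / "bigscape"                 # hypothetical default BiG-SCAPE directory

bigscape_running_output_dir = bigscape_dir / BIGSCAPE_RUNNING_OUTPUT_DIRNAME
print(bigscape_running_output_dir)  # root_dir/bigscape/bigscape_running_output
```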
arrange
Arrange all datasets according to the configuration.
The datasets include MIBiG, GNPS, antiSMASH, and BiG-SCAPE.
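The individual arrange_* steps are documented below. As a rough sketch only (the exact call order here is an assumption, not taken from the source), arrange() amounts to something like:

```python
# `arranger` as constructed in the Examples above; rough sketch, not the real implementation.
arranger.arrange_podp_project_json()  # podp mode only
arranger.arrange_mibig()
arranger.arrange_gnps()
arranger.arrange_antismash()
arranger.arrange_bigscape()
arranger.arrange_strain_mappings()
arranger.arrange_strains_selected()
```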
arrange_podp_project_json
Arrange the PODP project JSON file.
This method only works for the podp
mode. If the JSON file does not exist, download it
first; then the downloaded or existing JSON file will be validated according to the
PODP_ADAPTED_SCHEMA.
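Schematically, the validation step is a plain JSON Schema check. A hedged sketch, with a placeholder schema and a hypothetical file path (NPLinker ships the real PODP_ADAPTED_SCHEMA):

```python
import json

from jsonschema import validate

# Placeholder only; the real PODP_ADAPTED_SCHEMA is far more detailed.
PODP_ADAPTED_SCHEMA = {"type": "object", "required": ["genomes", "metabolomics"]}

with open("downloads/podp_project.json") as f:  # hypothetical download location
    project = json.load(f)

validate(instance=project, schema=PODP_ADAPTED_SCHEMA)  # raises ValidationError if invalid
```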
arrange_mibig
Arrange the MIBiG metadata.
If config.mibig.to_use is True, download and extract the MIBiG metadata, overriding any existing MIBiG metadata. This ensures that the MIBiG metadata always matches the version specified in the configuration.
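A minimal sketch of this overwrite behaviour (the MIBiG directory location and the version config key are assumptions; only mibig.to_use is documented above):

```python
import shutil
from pathlib import Path

from nplinker.config import load_config

config = load_config("nplinker.toml")
mibig_dir = Path(config.root_dir) / "mibig"  # assumed location of the MIBiG metadata

if config.mibig.to_use:
    if mibig_dir.exists():
        shutil.rmtree(mibig_dir)  # discard whatever version is already there
    # download and extract the MIBiG metadata for the configured version here
```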
arrange_gnps
Arrange the GNPS data.
For local mode, validate the GNPS data directory.
For podp mode, if the GNPS data does not exist, download it; if it exists but is not valid, remove the data and re-download it (a rough sketch of this loop follows the checklist below).
The validation process includes:
- Check if the GNPS data directory exists.
- Check if the required files exist in the GNPS data directory, including:
  - file_mappings.tsv or file_mappings.csv
  - spectra.mgf
  - molecular_families.tsv
  - annotations.tsv
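A rough sketch of the podp-mode download-or-repair loop described above (the download helper is hypothetical, and the signature of validate_gnps is assumed; see its documentation further down this page):

```python
import shutil
from pathlib import Path

def ensure_gnps_data(gnps_dir: Path) -> None:
    """Download the GNPS data if missing, re-download it if invalid (sketch only)."""
    try:
        validate_gnps(gnps_dir)  # raises FileNotFoundError / ValueError on bad data
        return
    except (FileNotFoundError, ValueError):
        if gnps_dir.exists():
            shutil.rmtree(gnps_dir)  # drop the invalid data
    download_gnps(gnps_dir)  # hypothetical download step
    validate_gnps(gnps_dir)  # validate the fresh download
```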
arrange_antismash
Arrange the antiSMASH data.
For local mode, validate the antiSMASH data.
For podp mode, if the antiSMASH data does not exist, download it; if it exists but is not valid, remove the data and re-download it.
The validation process includes:
- Check if the antiSMASH data directory exists.
- Check if the antiSMASH data directory contains at least one sub-directory, and each sub-directory contains at least one BGC file (with the suffix .region???.gbk where ??? is a number).
The antiSMASH BGC directory must follow the structure below:
antismash
├── genome_id_1 (one antiSMASH output, e.g. GCF_000514775.1)
│ ├── GCF_000514775.1.gbk
│ ├── NZ_AZWO01000004.region001.gbk
│ └── ...
├── genome_id_2
│ ├── ...
└── ...
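Given that layout, collecting the BGC files per genome is a simple directory walk. A sketch assuming the structure above and a hypothetical root directory:

```python
from pathlib import Path

antismash_dir = Path("root_dir") / "antismash"  # assumed location
for genome_dir in sorted(p for p in antismash_dir.iterdir() if p.is_dir()):
    bgc_files = sorted(genome_dir.glob("*.region*.gbk"))
    print(genome_dir.name, len(bgc_files), "BGC files")
```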
arrange_bigscape
Arrange the BiG-SCAPE data.
For local mode, if the BiG-SCAPE data is provided by users, validate it and raise an error if it is invalid. If the BiG-SCAPE data does not exist, run BiG-SCAPE to generate the clustering file.
For podp mode, if the BiG-SCAPE data does not exist, run BiG-SCAPE to generate the clustering file; if it exists but is not valid, remove the data and re-run BiG-SCAPE to generate the data.
The running output of BiG-SCAPE will be saved to the directory bigscape_running_output in the default BiG-SCAPE directory, and the clustering file mix_clustering_c{self.config.bigscape.cutoff}.tsv will be copied to the default BiG-SCAPE directory (a sketch of this naming and copy step follows the checklist below).
The validation process includes:
- Check if the default BiG-SCAPE data directory exists.
- Check if the clustering file mix_clustering_c{self.config.bigscape.cutoff}.tsv exists in the BiG-SCAPE data directory.
- Check if the data_sqlite.db file exists in the BiG-SCAPE data directory.
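As an illustration of the naming: with a cutoff of 0.30, the file that gets copied is mix_clustering_c0.30.tsv. A hedged sketch of the copy step (the exact sub-path inside the running output is an assumption):

```python
import shutil
from pathlib import Path

cutoff = "0.30"  # example value of config.bigscape.cutoff
bigscape_dir = Path("root_dir") / "bigscape"  # assumed default BiG-SCAPE directory
running_output_dir = bigscape_dir / "bigscape_running_output"
clustering_file = f"mix_clustering_c{cutoff}.tsv"

# Search the running output for the clustering file and copy it into the
# default BiG-SCAPE directory.
src = next(running_output_dir.rglob(clustering_file))
shutil.copy(src, bigscape_dir / clustering_file)
```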
arrange_strain_mappings
Arrange the strain mappings file.
For local mode, validate the strain mappings file.
For podp mode, always generate a new strain mappings file and validate it.
The validation checks if the strain mappings file exists and if it is a valid JSON file according to STRAIN_MAPPINGS_SCHEMA.
arrange_strains_selected
Arrange the strains selected file.
If the file exists, validate it according to the schema defined in user_strains.json.
validate_gnps
Validate the GNPS data directory and its contents.
The GNPS data directory must contain the following files:
- file_mappings.tsv or file_mappings.csv
- spectra.mgf
- molecular_families.tsv
- annotations.tsv
Parameters:
Raises:
- FileNotFoundError – If the GNPS data directory is not found or any of the required files is not found.
- ValueError – If both file_mappings.tsv and file_mappings.csv are found.
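A hedged sketch of these checks (not the actual implementation; the parameter name is an assumption since the signature is not shown above):

```python
from pathlib import Path

def validate_gnps_sketch(gnps_dir: Path) -> None:
    if not gnps_dir.exists():
        raise FileNotFoundError(f"GNPS data directory not found: {gnps_dir}")
    has_tsv = (gnps_dir / "file_mappings.tsv").exists()
    has_csv = (gnps_dir / "file_mappings.csv").exists()
    if has_tsv and has_csv:
        raise ValueError("Both file_mappings.tsv and file_mappings.csv are found")
    if not (has_tsv or has_csv):
        raise FileNotFoundError("file_mappings.tsv or file_mappings.csv is missing")
    for name in ("spectra.mgf", "molecular_families.tsv", "annotations.tsv"):
        if not (gnps_dir / name).exists():
            raise FileNotFoundError(f"{name} is missing from {gnps_dir}")
```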
validate_antismash
Validate the antiSMASH data directory and its contents.
The validation only checks the structure of the antiSMASH data directory and file names. It does not check:
- the content of the BGC files
- the consistency between the antiSMASH data and the PODP project JSON file for the podp mode
The antiSMASH data directory must exist and contain at least one sub-directory. The names of the sub-directories must not contain any spaces. Each sub-directory must contain at least one BGC file (with the suffix .region???.gbk where ??? is the region number).
Parameters:
Raises:
- FileNotFoundError – If the antiSMASH data directory is not found, no sub-directories are found in the antiSMASH data directory, or no BGC files are found in any sub-directory.
- ValueError – If any sub-directory name contains a space.
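A sketch of the structural checks described above (again not the real implementation; the parameter name and the exact file-name pattern are assumptions):

```python
import re
from pathlib import Path

BGC_FILE_PATTERN = re.compile(r"\.region\d+\.gbk$")

def validate_antismash_sketch(antismash_dir: Path) -> None:
    if not antismash_dir.exists():
        raise FileNotFoundError(f"antiSMASH data directory not found: {antismash_dir}")
    sub_dirs = [p for p in antismash_dir.iterdir() if p.is_dir()]
    if not sub_dirs:
        raise FileNotFoundError("No sub-directories found in the antiSMASH data directory")
    for sub_dir in sub_dirs:
        if " " in sub_dir.name:
            raise ValueError(f"Sub-directory name contains a space: {sub_dir.name}")
        if not any(BGC_FILE_PATTERN.search(f.name) for f in sub_dir.iterdir() if f.is_file()):
            raise FileNotFoundError(f"No BGC files found in {sub_dir}")
```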
validate_bigscape
Validate the BiG-SCAPE data directory and its contents.
The BiG-SCAPE data directory must exist and contain the clustering file mix_clustering_c{self.config.bigscape.cutoff}.tsv, where {self.config.bigscape.cutoff} is the BiG-SCAPE cutoff value set in the config file.
Alternatively, the directory can contain the BiG-SCAPE database file generated by BiG-SCAPE v2. At the moment, all the family assignments in the database will be used, so this database should contain results from a single run with the desired cutoff.
Parameters:
- bigscape_dir (str | PathLike) – Path to the BiG-SCAPE data directory.
- cutoff (str) – The BiG-SCAPE cutoff value.
Raises:
- FileNotFoundError – If the BiG-SCAPE data directory or the clustering file is not found.
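Usage follows the documented parameters; for example (the path and cutoff value here are placeholders):

```python
from nplinker.arranger import validate_bigscape

try:
    validate_bigscape("root_dir/bigscape", cutoff="0.30")
except FileNotFoundError as e:
    print(f"BiG-SCAPE data is missing or incomplete: {e}")
```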