1 Requirements and installation

1.1 Requirements

This pipeline is written using nextflow DSL2. You need to have Nextflow >= 23.04.4 available on your system. Please refer to corresponding Nextflow documentation Nextflow requires Java to be available. This pipeline has been tested using Java v 17.0.4

While nextflow will download/make software used in this pipeline available, you sill need to download some external databases and provide their path to use this pipeline.

1.2 Installation

1.3 Basic installation

To be able to run, [Nextflow](https://www.nextflow.io/) v > XX

must be installed.

We use containers to run each software. A container system (Docker, Singularity or Apptainer must be available our your system).

Notes:

We use Apptainer on our HPC, which is then what the pipeline has been tested with.

Running the pipeline with Conda is currently not available

1.3.1 Obtain necessary external databases

This pipeline make use of several external databases, you will have to modify the path of those in the nextflow.configfile, so it corresponds to your system (OR you can provide those in the running command, but that is certainly more combersome).

We unfortunately cannot provide image nor create those databases as ready to use for the pipeline, as they would require too much space or need to be adapted to the organism you are working with.

Moreover, internet access might not be accessible at run time on HPC clusters, so downloading those databases is the only option.

1.3.1.1 Blastn database (FILTER_CONTIGS track / option in COMPASS track)

A local blastn database must have been pre-downloaded before using this track (no internet query during compute). For more information, please see NCBI here for download. Please provide the path of blast database to the parameter blastDB in nextflow.config.

1.3.1.2 NCBI taxonomic rank file (FILTER_CONTIGS track / option in COMPASS track)

The taxonomic rank file rankedlineage.dmp from NCBI new taxonomy is used to link the taxon_id (number) to the scientific name.

This file must be downloaded and its path must be provided as parameter ranked_taxo_file in nextflow.config at installation.

1.3.1.3 KRAKEN2 (COMPASS track)

If you are working with organism that is not (or very poorly) represented in the database, you will get taxonomic classification that is attributed to other taxonomic groups.

If you decide to download an already made database, please ensure that the organism you are interested in is included in the database. Note that Kraken mini database is likely to be too restrictive and this will lead to spurious results.

For better classification results, we recomend you build your own database, ensuring that the organism of interst is included in the database.

Please have a look at Kraken2 documentation for database creating.

Please provide the path of kraken2 database to the parameter krakenDB in nextflow.config.

1.3.1.4 Busco (COMPASS track)

Please refer to BUSCO documentation for downloading the lineage markers sets that you will want to work with, and provide the path of the busco_downloads directory (that is created at download by BUSCO) to the parameter busco_download_path in nextflow.config.

1.3.2 Configuration on your system

Clone the diplotopia repository. Create a configuration file to run on your system, using a template in ./config

1.3.3 Configuration of your bashrc for nextflow

As per today, this pipeline will only run with containers. You will need some configuration of paths nextflow will use to cache and store those containers, according to your system. This is related to nextflow usage, and not to diplotopia specifically.

Please have a look at nextflow documentation to configure containers.

NB: SAGA users, please look HERE, to configure your system (.bashrc) ready to use nexflow