Updating the config files
1 Introduction
Running the GPMelt pipeline (locally or on an HPC cluster) requires three config files:

- nextflow.config, which contains general configurations valid for running the pipeline either locally or on an HPC cluster,
- one of:
  - nextflow_local.config, which contains the additional configuration needed to run the pipeline locally,
  - nextflow_cluster.config, which contains the configurations specific to the HPC cluster (typically provided directly by the computing center; otherwise, examples are provided on this page),
- user_params.config, which contains the set of input parameters for the GPMelt workflow.

These files already exist in the Nextflow folder and simply need to be updated accordingly.
2 nextflow.config
The default nextflow.config only contains the following lines:
```
report.overwrite = true                                    // (1)

plugins {
    id 'nf-co2footprint@1.0.0-beta'                        // (2)
}

co2footprint {
    summaryFile = "reports/co2footprint_summary.txt"
    traceFile   = "reports/co2footprint_trace.txt"
    reportFile  = "reports/co2footprint_report.html"       // (3)
}
```
1. This ensures that the workflow report (named GPMelt_workflow_report.html here) overwrites the previous one (can be set to false if necessary).
2. Adds the nf-co2footprint plugin to estimate the CO2 footprint of pipeline runs (see the website for more info).
3. Specifies that the created co2footprint reports will be saved in the reports folder.
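Equivalently, the same setting can be written in Nextflow's block form; fixing the report file name in the config (rather than on the command line) is optional and shown here only as an illustration of the standard report scope:

```
report {
    overwrite = true
    // optional: fix the workflow report name in the config instead of on the command line
    file = "GPMelt_workflow_report.html"
}
```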
3 user_params.config
The user_params.config has two parts:

- the params block, which is common to running the workflow locally and on an HPC cluster,
- the process block, which is specific to either running it locally or on an HPC cluster.

The user_params.config looks like this:
```
params {
    //////////////////////////////////////
    // Place the general parameters here
    //////////////////////////////////////
}

process {
    //////////////////////////////////////////////////////////////////////////////////////////
    // Place the resource requirements per process (defined on an HPC cluster or locally) here
    //////////////////////////////////////////////////////////////////////////////////////////
}
```
3.1 General parameters
```
// Should the dataset be subsetted to few IDs first?
dataset_subsetting = true                                             // (1)

// How to parallelise IDs for the GPMelt workflow
number_IDs_per_jobs = 4                                               // (2)

// Should all the output be saved or only the most important ones? (3)
minimise_required_memory_space = 'True'

// Permanent directory where the outputs will be saved (4)
permanent_results_directory = ""

input_data_dir  = "$projectDir/dummy_data/ATP2019"                    // (5)
dataset         = "$params.input_data_dir/dataForGPMelt.csv"          // (6)
nb_sample       = "$params.input_data_dir/NumberSamples_perID.csv"    // (7)
parameter_file  = "$params.input_data_dir/parameters.txt"             // (8)
ids4subset_file = "$params.input_data_dir/subset_ID.csv"              // (9)
```
1. Can be set to true or false (see Section 3.1.1).
2. Should be adapted accordingly (see Section 3.1.2).
3. If true is selected, additional outputs of the pipeline will be saved. This is up to the user to decide, depending on the amount of memory available and the user's interest in these additional outputs.
4. Directory where the final results should be saved (see Section 3.1.3).
5. Absolute path to the directory containing the data (see the variables dataset to ids4subset_file below).
6. Path, from the previously defined directory input_data_dir, to the dataset saved as a .csv file.
7. Path, from the previously defined directory input_data_dir, to the .csv file specifying, for each ID in the dataset, how many samples should be drawn (this allows for a different number of samples per ID, for example when IDs have different numbers of conditions). See here for how to define the file NumberSamples_perID.csv.
8. Path, from the previously defined directory input_data_dir, to the parameters.txt file, which contains additional parameters for the GPMelt functions. See here for how to define the parameters.txt file.
9. Path, from the previously defined directory input_data_dir, to the .csv file containing the names of the IDs to use for subsetting. If not specified, no subsetting will be done. See here for how to define this file.
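To point the pipeline at your own data rather than the dummy ATP2019 example, these path variables are the ones to update; the directory below is a hypothetical placeholder:

```
input_data_dir  = "/absolute/path/to/my_dataset"                     // hypothetical directory, adapt to your data
dataset         = "$params.input_data_dir/dataForGPMelt.csv"
nb_sample       = "$params.input_data_dir/NumberSamples_perID.csv"
parameter_file  = "$params.input_data_dir/parameters.txt"
ids4subset_file = "$params.input_data_dir/subset_ID.csv"
```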
3.1.1 Why define a subset of IDs?
We encourage the GPMelt user to define a subset of IDs to test:

- The model specification:
  - Are the replicates, groups, conditions and IDs variables well defined and correctly understood by the pipeline?
  - Is the number of levels in the HGP model appropriate?
  - Should some constraints on the parameters (lengthscales, output-scales) be added or removed?
- The resource requirements (see the GPMelt_workflow_report.html generated by Nextflow):
  - How much memory do the small jobs need?
  - How much memory do the most intensive job(s) need?
  - How long does the pipeline take?
  - Is the number of CPUs appropriate?

To answer these questions, we encourage the user to select a set of IDs with properties as different as possible (see here), for example:

- a different number of conditions,
- a different number of total tasks (typically replicates x groups x conditions for a four-level HGP model, or replicates x conditions for a three-level HGP model),
- different types of curves (sigmoidal and non-sigmoidal),
- different amounts of noise.

See here for how to define and save this subset of IDs.
Important note: once the model specification and the fits are satisfactory, do not forget to set dataset_subsetting to false to run GPMelt on the full dataset!
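For this final run, the corresponding line in the params block of user_params.config simply becomes:

```
dataset_subsetting = false   // run GPMelt on the full dataset
```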
3.1.2 Parallelisation efficiency
The process RUN_GPMELT_IN_BATCH sequentially runs the different steps of GPMelt on number_IDs_per_jobs IDs. This number therefore defines the parallelisation efficiency.
For example, if you have \(N=1000\) IDs in the dataset, and \(number\_IDs\_per\_jobs = 10\), then Nextflow will take care of parallelising \(\frac{N}{number\_IDs\_per\_jobs} = 100\) jobs, while each job will sequentially run \(10\) IDs.
In our tutorial case, we have subsetted the dataset to \(N=16\) test IDs: by setting \(number\_IDs\_per\_jobs = 4\), we will have 4 jobs running in parallel.
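The batching arithmetic can be sketched in a few lines of Groovy (purely illustrative, not the pipeline's actual implementation):

```
// Purely illustrative: split a list of IDs into batches of size number_IDs_per_jobs.
// Each batch corresponds to one parallel job, which processes its IDs sequentially.
def N = 1000
def number_IDs_per_jobs = 10
def ids = (1..N).collect { "ID_${it}" }
def batches = ids.collate(number_IDs_per_jobs)        // 100 batches of 10 IDs each
println "Number of parallel jobs: ${batches.size()}"  // -> 100
```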
3.1.3 Path to final results
In Nextflow, temporary results are saved in the work directory, which typically has to be cleaned by the user at some point. The location of the temporary directory is defined by the scratch variable in the process block (see the Nextflow documentation). If the permanent directory is not specified appropriately, the results of the GPMelt pipeline might be lost.
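For example, permanent_results_directory can point to a storage location outside the Nextflow work directory; the path below is purely hypothetical and should be adapted:

```
permanent_results_directory = "/data/my_project/GPMelt_results"   // hypothetical permanent storage location
```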
3.1.4 The resource requirements
We propose to divide the resource requirements into two blocks: one for all processes, and one specific to RUN_GPMELT_IN_BATCH, which is the most 'resource-consuming' part of the workflow.
For all processes:
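The global requirements are placed directly at the top level of the process block. The values below mirror those used in the process examples of Sections 4.2 and 5.2:

```
cpus = 2                                                        // (1)
memory = 1.GB                                                   // (2)
container = 'docker://ceccads/gpmelt-image-with-dask:0.0.1'     // (3)
time = '24h'                                                    // (4)
```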
1. Number of CPUs to allocate to the process (see the Nextflow documentation).
2. Amount of memory to allocate to the process (see the Nextflow documentation).
3. Path to the Docker container required for GPMelt.
4. Maximal allowed time for a process to run (see the Nextflow documentation).
With Nextflow, the resource requirements of each process can be specified individually. If specified both globally and specifically for a process, the specific resource requirements are used in place of the global resource requirements.
```
withName: RUN_GPMELT_IN_BATCH {
    memory = 8.GB
    cpus = 8
}
```
RUN_GPMELT_IN_BATCH is the most 'resource-consuming' part of the workflow. We exemplify here how the memory and CPU requirements could be increased for this specific process. Important: these resource requirements should be adjusted to the needs of each specific dataset.
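If the memory needed by RUN_GPMELT_IN_BATCH is difficult to estimate in advance, one option (a generic Nextflow pattern, shown here as a sketch and not as part of the GPMelt defaults) is to let the process retry with more memory after a failure:

```
withName: RUN_GPMELT_IN_BATCH {
    errorStrategy = 'retry'
    maxRetries    = 2
    // increase the memory with each attempt: 8 GB, then 16 GB, then 24 GB
    memory        = { 8.GB * task.attempt }
    cpus          = 8
}
```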
4 If running the pipeline on an HPC cluster
4.1 nextflow_cluster.config
This file contains the configurations specific to each HPC cluster. It is typically provided directly by the computing center; otherwise, examples are provided on this page.
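As an illustration only (the actual file from your computing centre may look quite different), a minimal nextflow_cluster.config for a SLURM-based cluster could resemble the following; the queue name and cluster options are placeholders:

```
process {
    executor       = 'slurm'   // or 'pbs', 'lsf', ..., depending on the scheduler
    queue          = 'normal'  // placeholder queue name, adapt to your cluster
    clusterOptions = ''        // any extra scheduler options required by your computing centre
}
```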
4.2 process requirements of user_params.config
Below is an example of the process section when running the pipeline on an HPC cluster. The resource requirements have to be updated according to the HPC cluster resources and to the dataset.
```
process {
    cpus = 2
    memory = 1.GB
    container = 'docker://ceccads/gpmelt-image-with-dask:0.0.1'
    time = '24h'

    withName: RUN_GPMELT_IN_BATCH {
        memory = 8.GB
        cpus = 8
    }
}
```
5 If running the pipeline locally
5.1 nextflow_local.config
The only requirement for this file is that it contains the following line. Additional lines are at the discretion of the user.
```
docker.enabled = true
```
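As an example of such an optional addition (our suggestion, not a GPMelt requirement), the Docker scope can be extended so that output files are owned by the current user rather than by root:

```
docker {
    enabled = true
    // optional: run the container with the host user's UID/GID
    runOptions = '-u $(id -u):$(id -g)'
}
```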
5.2 process requirements of user_params.config
Below is an example of the process section when running the pipeline locally. The resource requirements have to be updated according to the computer's resources and to the dataset. Note that the CPU count has not been increased for RUN_GPMELT_IN_BATCH, to avoid local CPU starvation (allocating too many CPUs to the Nextflow pipeline could cause the rest of the system to slow down and/or become unresponsive).
```
process {
    executor = 'local'
    cpus = 2
    memory = 1.GB
    container = 'docker://ceccads/gpmelt-image-with-dask:0.0.1'
    time = '24h'

    withName: RUN_GPMELT_IN_BATCH {
        memory = 8.GB
    }
}
```
We can now run the Nextflow pipeline!