Updating the config files

1 Introduction

Running the GPMelt pipeline (locally or on an HPC cluster) requires three config files:

  1. nextflow.config contains general configurations, valid for running the pipeline both locally and on an HPC cluster.

  2. one of

  • nextflow_local.config contains the additional configuration needed to run the pipeline locally,
  • nextflow_cluster.config contains the configurations specific to the HPC cluster (typically provided directly by the computing center; otherwise, examples are provided on this page).

  3. user_params.config contains the set of input parameters for the GPMelt workflow.

These files already exist in the Nextflow folder, and simply need to be updated accordingly.
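
For reference, Nextflow provides two standard mechanisms to combine several config files: includeConfig statements inside a config file, and the -c option on the command line. How the GPMelt Nextflow folder wires the three files together is already defined in the repository, so the snippet below is purely illustrative:

// Illustrative only: pulling the other config files into nextflow.config
includeConfig 'user_params.config'
includeConfig 'nextflow_local.config'   // or 'nextflow_cluster.config' on an HPC cluster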

2 nextflow.config

The default nextflow.config only contains the following lines:


report.overwrite = true

plugins {
  id 'nf-co2footprint@1.0.0-beta'
}

co2footprint {
    summaryFile = "reports/co2footprint_summary.txt"
    traceFile   = "reports/co2footprint_trace.txt"
    reportFile  = "reports/co2footprint_report.html"
}

  1. report.overwrite = true ensures that the workflow report (named GPMelt_workflow_report.html here) overwrites the previous one (can be set to false if necessary).
  2. The plugins block adds the nf-co2footprint plugin to estimate the CO2 footprint of pipeline runs (see the website for more info).
  3. The co2footprint block specifies that the created co2footprint reports will be saved in the reports folder.

3 user_params.config

The user_params.config file has two parts:

  • the params block, which is common to running the workflow locally and on an HPC cluster,
  • the process block, which is specific to either running it locally or on an HPC cluster.

The user_params.config looks like this:


params {

    //////////////////////////////////////
    // Place the general parameters here
    //////////////////////////////////////

}


process {

    ////////////////////////////////////////////////////////////////////////////////////////////
    // Place the resource requirements per process (defined on an HPC cluster or locally) here //
    ////////////////////////////////////////////////////////////////////////////////////////////

}

3.1 General parameters


// Should the dataset be subsetted to a few IDs first?
dataset_subsetting = true
// How to parallelise IDs for the GPMelt workflow
number_IDs_per_jobs = 4
// Should all the outputs be saved, or only the most important ones?
minimise_required_memory_space = 'True'
// Permanent directory where the outputs will be saved
permanent_results_directory = ""

input_data_dir = "$projectDir/dummy_data/ATP2019"
dataset = "$params.input_data_dir/dataForGPMelt.csv"
nb_sample = "$params.input_data_dir/NumberSamples_perID.csv"
parameter_file = "$params.input_data_dir/parameters.txt"
ids4subset_file = "$params.input_data_dir/subset_ID.csv"
    
  1. dataset_subsetting can be set to true or false (see Section 3.1.1).
  2. number_IDs_per_jobs should be adapted accordingly (see Section 3.1.2).
  3. minimise_required_memory_space determines whether all outputs of the pipeline are saved or only the most important ones. This is up to the user to decide, depending on the amount of available memory and the user's interest in these additional outputs.
  4. permanent_results_directory is the directory where the final results should be saved (see Section 3.1.3).
  5. input_data_dir is the absolute path to the directory containing the data (see the variables dataset to ids4subset_file below).
  6. dataset is the path, from the previously defined directory input_data_dir, to the dataset saved as a .csv file.
  7. nb_sample is the path, from input_data_dir, to the .csv file specifying, for each ID in the dataset, how many samples should be drawn (this allows for different numbers of samples per ID, for example when IDs have different numbers of conditions). See here for how to define the file NumberSamples_perID.csv.
  8. parameter_file is the path, from input_data_dir, to the parameters.txt file, which contains additional parameters for the GPMelt functions. See here for how to define the parameters.txt file.
  9. ids4subset_file is the path, from input_data_dir, to the .csv file containing the names of the IDs to use for subsetting. If not specified, no subsetting will be done. See here for how to define this file.
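
For example, to point the pipeline to your own data instead of the dummy dataset (assuming your files keep the same names as in the dummy data folder), only input_data_dir needs to be adapted; the path below is a placeholder:

// Hypothetical example: your own data directory (placeholder path)
input_data_dir = "/absolute/path/to/my_experiment"
dataset = "$params.input_data_dir/dataForGPMelt.csv"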

3.1.1 Why define a subset of IDs?

We encourage the GPMelt user to define a subset of IDs to test:

  • The model specification:
    • are the replicates, groups, conditions and IDs variables well defined and correctly understood by the pipeline?
    • is the number of levels in the HGP model appropriate?
    • should some constraints on the parameters (lengthscales, output-scales) be added or removed?
  • The resource requirements (see the GPMelt_workflow_report.html generated by Nextflow):
    • how much memory do the small jobs need?
    • how much memory do the most intensive job(s) need?
    • how long does the pipeline take?
    • is the number of CPUs appropriate?

To answer these questions, we encourage the user to select a set of IDs with properties as different as possible (see here), for example:

  • different numbers of conditions,
  • different numbers of total tasks (typically replicates × groups × conditions for a four-level HGP model, or replicates × conditions for a three-level HGP model),
  • different types of curves (sigmoidal and non-sigmoidal),
  • different amounts of noise.

See here for how to define and save this subset of IDs.

Important note: once the model specification and the fits are satisfactory, do not forget to set dataset_subsetting to false to run GPMelt on the full dataset!
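
In terms of configuration, the subsetting behaviour is controlled by the two parameters already shown in user_params.config:

// Testing phase: fit only the IDs listed in ids4subset_file
dataset_subsetting = true
ids4subset_file = "$params.input_data_dir/subset_ID.csv"

// Production run: fit the full dataset
// dataset_subsetting = false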

3.1.2 Parallelisation efficiency

The process RUN_GPMELT_IN_BATCH runs the different steps of GPMelt sequentially on batches of number_IDs_per_jobs IDs. This number therefore defines the parallelisation efficiency.

For example, if you have \(N=1000\) IDs in the dataset, and \(number\_IDs\_per\_jobs = 10\), then Nextflow will take care of parallelising \(\frac{N}{number\_IDs\_per\_jobs} = 100\) jobs, while each job will sequentially run \(10\) IDs.

In our tutorial case, we have subsetted the dataset to \(N=16\) test IDs: by setting \(number\_IDs\_per\_jobs = 4\), we will have \(4\) jobs running in parallel.
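
In user_params.config, this tutorial setting therefore reads:

// 16 test IDs split into batches of 4 IDs => 16 / 4 = 4 jobs run in parallel
number_IDs_per_jobs = 4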

3.1.3 Path to final results

In Nextflow, temporary results are saved in the work directory, which typically has to be cleaned by the user at some point.

The location of the temporary directory is defined by the scratch variable in the process block (see the Nextflow documentation).

If permanent_results_directory is not specified appropriately, the results of the GPMelt pipeline might be lost once the work directory is cleaned.
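
A minimal sketch, assuming a hypothetical absolute path on a permanent (backed-up) file system; adapt this to your own storage layout:

params {
    // Hypothetical example: permanent storage location for the final GPMelt outputs
    permanent_results_directory = "/data/my_project/GPMelt_results"
}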

3.1.4 The resource requirements

We propose to divide the resource requirements into two blocks: one for all processes, and one specific to RUN_GPMELT_IN_BATCH, which is the most ‘resource consuming’ part of the workflow.

For all processes:


cpus = 2
memory = 1.GB
container = 'docker://ceccads/gpmelt-image-with-dask:0.0.1'
time = '24h'

  1. cpus: number of CPUs to allocate for the process (see the Nextflow documentation).
  2. memory: amount of memory to allocate for the process (see the Nextflow documentation).
  3. container: path to the Docker container required for GPMelt.
  4. time: maximal allowed time for a process to run (see the Nextflow documentation).

With Nextflow, the resource requirements of each process can be specified individually. If a resource is specified both globally and for a specific process, the process-specific value takes precedence over the global one.


withName: RUN_GPMELT_IN_BATCH {
    memory = 8.GB
    cpus = 8
}

  1. RUN_GPMELT_IN_BATCH is the most ‘resource consuming’ part of the workflow. We exemplify here how the memory and CPU requirements could be increased for this specific process. Important: these resource requirements should be adjusted to the needs of each specific dataset.

4 If running the pipeline on an HPC cluster

4.1 nextflow_cluster.config

This file contains the configurations specific to each HPC cluster. It is typically provided by the computing center; otherwise, examples are provided on this page.
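
Purely as an illustration (the actual content depends on your computing center), a minimal sketch for a SLURM-based cluster that runs the GPMelt container through Singularity could look as follows; the queue name is a placeholder:

// Illustrative sketch only; replace with the configuration provided by your computing center
process {
    executor = 'slurm'
    queue    = 'standard'      // placeholder partition/queue name
}

singularity {
    enabled    = true
    autoMounts = true
}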

4.2 process requirements of user_params.config

Below is an example of the process section when running the pipeline on an HPC cluster. The resource requirements have to be updated according to the HPC cluster resources and to the dataset.


process {
    cpus = 2
    memory = 1.GB
    container = 'docker://ceccads/gpmelt-image-with-dask:0.0.1'
    time = '24h'

    withName: RUN_GPMELT_IN_BATCH {
       memory = 8.GB
       cpus = 8
    } 

}

5 If running the pipeline locally

5.1 nextflow_local.config

The only requirement for this file is that it contains the following line; additional lines are at the discretion of the user.

docker.enabled = true
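
Optionally (this is an assumption about a convenient local setup, not a GPMelt requirement), the total resources made available to Nextflow's local executor can be capped so that the pipeline cannot saturate the machine; the values below are placeholders:

docker.enabled = true

// Optional cap on the resources used by the local executor (placeholder values)
executor {
    cpus   = 8
    memory = '16 GB'
}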

5.2 process requirements of user_params.config

Below is an example of the process section when running the pipeline locally. The resource requirements have to be updated according to the computer's resources and to the dataset. Note that the CPU allocation has not been increased for RUN_GPMELT_IN_BATCH, to avoid local CPU starvation (allocating too many CPUs to the Nextflow pipeline could cause the rest of the system to slow down and/or become unresponsive).


process {
    executor = 'local'
    cpus = 2
    memory = 1.GB
    container = 'docker://ceccads/gpmelt-image-with-dask:0.0.1'
    time = '24h'

    withName: RUN_GPMELT_IN_BATCH {
       memory = 8.GB
    }

}

We can now run the Nextflow pipeline!