Matching variables names to the GPMelt nomenclature
1 Introduction
Although originally developed to model TPP-TR datasets, GPMelt is broadly applicable to any time-series datasets with replicates and conditions. For this reason, the naming conventions used in GPMelt have been designed to be general.
In this step, we will learn how to redefine the variables of a dataset to match the GPMelt nomenclature.
2 From experiments to hierarchy
The GPMelt framework is based on the idea of translating the experimental design parameters of an experiment (e.g. a biological protocol) into a hierarchy. By experimental design parameters, we refer to information such as replicates, conditions, batches, and more.
To translate these experimental design parameters into a hierarchy, the user needs to identify each level of the hierarchy, with Level_1 being the top of the hierarchy and Level_L the bottom level, where \(L\) is the number of levels in the hierarchy (e.g. \(L=3\) for a three-level hierarchical Gaussian process (HGP) model, \(L=4\) for a four-level HGP model).
We refer the reader to this page for a detailed explanation.
2.1 One possible translation
We illustrate hereafter how to translate the experimental design parameters of the ATP 2019 into levels of the hierarchy.
It is important to understand that these parameters carry a-priori knowledge on expected similarities between the modeled curves.
For example, an experimental design with replicates & conditions for each ID can be translated in a three-level HGP model, translating the following assumptions:
The bottom level of the hierarchy (Level_3) corresponds to the individual replicates. They are the leaves of the hierarchy: observations measured through a replicate (i.e at different temperatures for TPP-TR or different time points for a time-series dataset) are expected to be correlated. By correlated, we mean that each observation is expected to carry some information to help predicting the surrounding observations.
Observations from replicates coming from the same condition are expected to be similar, and this similarity can be captured by measuring the correlations between these replicates. Hence the second level of the hierarchy (Level_2) corresponds to the conditions.
Finally, observations from replicates coming from any conditions but one ID have the potential to share more similarities than observations from a different ID. For this reason, the top level of the hierarchy (Level_1) corresponds to the ID.
Note: This is one possible translation of the data into a hierarchy. If you have different hypotheses about your data, the levels of the hierarchy should be adjusted accordingly.
3 Renaming the variables
Below is a summary of the content of the dataset for each protein_id. By content, we mean the number of observations per replicate, denoted NumberOfTemperatures:
Leveraging the principle of hierarchical models, we aim to improve the individual replicates fits by sharing information between replicates within a condition.
To translate this assumption in terms of hierarchy, we add a second level, which captures which replicate belongs to which condition.
This second level is directly given by the variable condition. Let’s rename this variable Level_2.
Finally, we hypothesis that observations from replicates across different conditions but within the same ID may share more similarities than those from different IDs.
Typically, if the treatment condition(s) have no effect, we would expect the melting curves across all conditions to be similar and therefore strongly correlated.
To translate this assumption in our model, we define the IDs as the top level of the hierarchy. This will allow information sharing between replicates across different conditions within each ID.
Let’s rename the variable gene_name by Level_1:
Data_forPython <- Data_forPython %>% dplyr::rename("Level_1"="gene_name") %>%mutate(Level_1 =gsub("/|\\._" , "-", Level_1, fixed=TRUE)) #names with slash can be annoyinghead(Data_forPython)
An additional variable condition can be added to the dataset. For this variable, only the control condition will have the exact same value for both condition and Level_2.
This condition variable is especially useful when more than one treatment condition is measured. In this case, the condition variable differentiates between control and treatment, while the Level_2 variable specifies the different treatment conditions.
Note that we use the variable name y without specifying the scaling method used to scale the intensities: this makes GPMelt applicable with any scaling method (see scaling methods explained here)
3.6 Variables ordering
For ease of readability, we can reorder the variables:
To ensure that the Python code correctly handles the data, the input data should be properly sorted. The dataset will be sorted using the following variables in this specific order (this step is handled directly in the Nextflow pipeline, no need to do it manually):
Level_1
condition
Level_2
Level_3
rep (replicate)
x (temperature)
4 Check the data
Before running GPMelt, we strongly encourage you to check how the preprocessed data look like.
Here is an example R code to do so. See here, section 5 for more examples of R codes.
In the previous code, we selected a set of proteins which can be used to further test different model specifications.
We selected IDs with different melting curve patterns (sigmoidal and non-sigmoidal), and different levels of noise (e.g. BANF1 present an outlier replicate, the vehicle condition of MYO1G and NDUFB6 are also noisy, while other IDs like IMPDH1 or AAR2 have well-reproducible replicates). Comparing the quality of the fits for these IDs will help us deciding how to update the model specification if needed (see here).
If you have no idea of the IDs you want to use to test the model specification, you can randomly pick names of IDs in the dataset:
set.seed(837456) # for reproducibilityRandomSample <-sample(unique(Data_forPython$Level_1), size=10) # randomly select 10 names from the Level_1
5 Save the data
We then save the data in the folder Nextflow/dummy_data/ATP2019, using the name dataForGPMelt.csv: