Choosing the null distribution estimation


The null distribution is discussed on this page, please have a look to understand what we do in this step!

1 Introduction

With the current GPMelt implementation, we need to define a file called NumberSamples_perID.csv.

This file contains one row per value of Level_1, and the following two variables: Level_1 and NSamples.

We explain in the following how to define it in the case of the different null distribution approximation methods.

Note: This file can also be used to define a custom null distribution approximation method, in which the user can define for each ID in Level_1 the desired number of samples per condition for this ID. We only recommend this custom null distribution approximation method for advanced users who know why they are doing it.

2 A general explanation on how to create NumberSamples_perID.csv using R

For the explanation, we assume here that Level_1 corresponds to the IDs/proteins \(p\) observed in the dataset.

NSamples thus corresponds to the variable \(\widetilde{S_p}\), (as explained in Section 2.1 and in the GPMelt paper (Le Sueur, Rattray, and Savitski 2024), especially in Supporting Information B and Table E in S1 file.)

2.1 Notation

Again, check this page of the website to understand the principle of the null distribution approximation if not done already!

We consider the case where the IDs would present different numbers of observed conditions. We aim to generate a final size \(S\) of the null distribution approximation, with \(S=1e4\).

S=1e4

Using the notation used in the GPMelt paper, and denoting by \(S\) the total number of samples used to approximate the null distribution, we consider an ID \(p\) and define:

  • \(C_p\) the number of conditions for this ID \(p\),
  • \(\widetilde{S_p}\) the number of samples to be drawn for each condition of this ID \(p\),
  • \(S_p\) the total number of samples to be drawn for this ID \(p\).

These four quantities are linked as follows:

\[ S = \sum_p S_p \quad \text{with} \quad \widetilde{S_p} = \left\lceil \frac{S_p}{C_p} \right\rceil \ . \] with:

  • \(S = \sum_p S_p\): The final size is obtained by summing the results obtained for each protein.
  • \(\widetilde{S_p} = \left\lceil \frac{S_p}{C_p} \right\rceil\): The number of values per condition \(\widetilde{S_p}\) is obtained by dividing the total number of values for a protein \(S_p\) by the total number of conditions for this protein \(C_p\).

2.2 Creation of a dummy dataset

We start by creating a dummy dataset of 100 IDs presenting different numbers of conditions.

Code
library(dplyr)
library(ggplot2)
source("../../utils.R")
set.seed(123)

ExampleDataset<- data.frame(
  Level_1 = paste0("ID_", c(1:100)),
  NumberOfObservedConditions = sample(x= c(1:20), replace =TRUE, size=100)
  )

head(ExampleDataset)
  Level_1 NumberOfObservedConditions
1    ID_1                         15
2    ID_2                         19
3    ID_3                         14
4    ID_4                          3
5    ID_5                         10
6    ID_6                         18

2.3 The dataset-wise approximation (method C, Table E in S1 file)

It consists in computing the same number \(\widetilde{S_p}\) of values for each comparison, and combining all values across proteins to form a unique null distribution.

In the present case, we have:

\[ S_p = \left\lceil C_p \cdot \frac{S}{\sum_{p'} C_{p'}} \right\rceil \ .\]

Let’s first compute \(\sum_{p'} C_{p'}\), the total number of conditions over all IDs of the dataset.

Total_N_cond <- sum(ExampleDataset$NumberOfObservedConditions)

We now compute the number of samples per condition \(\widetilde{S_p} = \frac{S}{\sum_{p'} C_{p'}}\) (done row-wise) and compute the associated value of \(S_p\) (ToTal_NSamples_perID):

NumberSamples_perID <- ExampleDataset %>%
  rowwise() %>%
  mutate(NSamples = ceiling(S / Total_N_cond ) ) %>%
  mutate(ToTal_NSamples_perID =  NSamples *  NumberOfObservedConditions)
  

NumberSamples_perID
# A tibble: 100 × 4
# Rowwise: 
   Level_1 NumberOfObservedConditions NSamples ToTal_NSamples_perID
   <chr>                        <int>    <dbl>                <dbl>
 1 ID_1                            15       10                  150
 2 ID_2                            19       10                  190
 3 ID_3                            14       10                  140
 4 ID_4                             3       10                   30
 5 ID_5                            10       10                  100
 6 ID_6                            18       10                  180
 7 ID_7                            11       10                  110
 8 ID_8                             5       10                   50
 9 ID_9                            20       10                  200
10 ID_10                           14       10                  140
# ℹ 90 more rows

The total number of samples should be at least \(1e4\):

sum(NumberSamples_perID$ToTal_NSamples_perID)
[1] 11030

2.4 The ID-wise approximation (method D, Table E in S1 file)

In this case \(S_p \equiv S = 1e4\) and \(\widetilde{S_p} = \left\lceil \frac{S_p}{C_p} \right\rceil\)

NumberSamples_perID <- ExampleDataset %>%
  rowwise() %>%
  mutate(NSamples = ceiling(S / NumberOfObservedConditions)) %>%
  mutate(ToTal_NSamples_perID =  NSamples *  NumberOfObservedConditions)
  

NumberSamples_perID
# A tibble: 100 × 4
# Rowwise: 
   Level_1 NumberOfObservedConditions NSamples ToTal_NSamples_perID
   <chr>                        <int>    <dbl>                <dbl>
 1 ID_1                            15      667                10005
 2 ID_2                            19      527                10013
 3 ID_3                            14      715                10010
 4 ID_4                             3     3334                10002
 5 ID_5                            10     1000                10000
 6 ID_6                            18      556                10008
 7 ID_7                            11      910                10010
 8 ID_8                             5     2000                10000
 9 ID_9                            20      500                10000
10 ID_10                           14      715                10010
# ℹ 90 more rows

The total number of samples should be at least \(1e4\) for each ID:

which(NumberSamples_perID$ToTal_NSamples_perID < 1e4)
integer(0)

2.5 The group-wise approximation (method E, Table E in S1 file)

We propose to group IDs according to the number of observed conditions, thus reducing the number of samples to draw per ID.

This method offers a compromise between the ID-wise and dataset-wise approximation. We decided here to group proteins by number of observed conditions because this maximises the similarity of IDs within a group in terms of number of measurements .

We group IDs by the observed numbers of conditions, and measure the size of the groups.

NumberSamples_perID <- ExampleDataset %>%
  group_by(NumberOfObservedConditions) %>%
  mutate(Ng = n_distinct(Level_1))

1NumberSamples_perID
1
\(N_g\) refers to the size of the group presenting NumberOfObservedConditions. For example, below, \(7\) IDs present \(15\) observed conditions
# A tibble: 100 × 3
# Groups:   NumberOfObservedConditions [20]
   Level_1 NumberOfObservedConditions    Ng
   <chr>                        <int> <int>
 1 ID_1                            15     7
 2 ID_2                            19     6
 3 ID_3                            14    10
 4 ID_4                             3     6
 5 ID_5                            10     8
 6 ID_6                            18     4
 7 ID_7                            11     4
 8 ID_8                             5     6
 9 ID_9                            20     4
10 ID_10                           14    10
# ℹ 90 more rows

We have \[S_p = \left\lceil C_p \cdot \frac{S}{\sum_{p' \in g} C_{p'}} \right\rceil \ , \] with \(g\) the group of the protein \(p\).

Noticing that \(C_{p'} \equiv C_p\) for all proteins \(p'\) in group \(g\), we get that \[ S_p = \left\lceil C_p \cdot \frac{S}{N_g \cdot C_{p}} \right\rceil = \left\lceil \frac{S}{N_g} \right\rceil \ .\]

As usual \(\widetilde{S_p} = \left\lceil \frac{S_p}{C_p} \right\rceil\).

We first calculate \(\sum_{p' \in G_g} C_{p'}\), the total number of conditions in group \(g\) considering all IDs in group \(g\), and we denote it by Total_N_cond_group_g:

NumberSamples_perID <- NumberSamples_perID %>%
  group_by(NumberOfObservedConditions) %>%
  rowwise() %>%
  mutate(Total_N_cond_group_g = NumberOfObservedConditions * Ng )%>%
  ungroup() 

NumberSamples_perID
# A tibble: 100 × 4
   Level_1 NumberOfObservedConditions    Ng Total_N_cond_group_g
   <chr>                        <int> <int>                <int>
 1 ID_1                            15     7                  105
 2 ID_2                            19     6                  114
 3 ID_3                            14    10                  140
 4 ID_4                             3     6                   18
 5 ID_5                            10     8                   80
 6 ID_6                            18     4                   72
 7 ID_7                            11     4                   44
 8 ID_8                             5     6                   30
 9 ID_9                            20     4                   80
10 ID_10                           14    10                  140
# ℹ 90 more rows

We then compute \(\widetilde{S_p}\) (NSamples):

NumberSamples_perID <- NumberSamples_perID %>%
  rowwise() %>%
  mutate(NSamples = ceiling(S / Total_N_cond_group_g)) %>%
  mutate(ToTal_NSamples_perID =  NSamples *  NumberOfObservedConditions)

NumberSamples_perID
# A tibble: 100 × 6
# Rowwise: 
   Level_1 NumberOfObservedConditions    Ng Total_N_cond_group_g NSamples
   <chr>                        <int> <int>                <int>    <dbl>
 1 ID_1                            15     7                  105       96
 2 ID_2                            19     6                  114       88
 3 ID_3                            14    10                  140       72
 4 ID_4                             3     6                   18      556
 5 ID_5                            10     8                   80      125
 6 ID_6                            18     4                   72      139
 7 ID_7                            11     4                   44      228
 8 ID_8                             5     6                   30      334
 9 ID_9                            20     4                   80      125
10 ID_10                           14    10                  140       72
# ℹ 90 more rows
# ℹ 1 more variable: ToTal_NSamples_perID <dbl>

The total number of samples should be at least \(1e4\) per group:

NumberSamples_perID %>% 
  group_by( NumberOfObservedConditions ) %>% 
  summarise(ToTal_NSamples = sum(ToTal_NSamples_perID))
# A tibble: 20 × 2
   NumberOfObservedConditions ToTal_NSamples
                        <int>          <dbl>
 1                          1          10000
 2                          2          10002
 3                          3          10008
 4                          4          10000
 5                          5          10020
 6                          6          10008
 7                          7          10024
 8                          8          10032
 9                          9          10035
10                         10          10000
11                         11          10032
12                         12          10020
13                         13          10023
14                         14          10080
15                         15          10080
16                         16          10032
17                         17          10030
18                         18          10008
19                         19          10032
20                         20          10000

Important note: while the group-wise approximation is not directly implemented in GPMelt, it can be obtained by defining NumberSamples_perID.csv as above and recomputing the p-values accordingly using GPMelt output files likelihood_ratios_df.csv and likelihood_ratios_sampled_dataset_df.csv, described here, section 5.

3 Application to the ATP 2019 dataset

Coming back to the example of the ATP 2019 dataset, we choose the dataset-wise approximation (see Section 2.3), which consists in computing the same number of samples for each ID, and combining all samples across ID to form a unique null distribution.

We choose 10 samples per ID:

n_samples=10

NumberSamples_perID <- Data_forPython %>%
  dplyr::select(Level_1) %>%
  distinct() %>%
  mutate(NSamples = n_samples) 
  

head(NumberSamples_perID)
# A tibble: 6 × 2
  Level_1 NSamples
  <chr>      <dbl>
1 AAGAB         10
2 AAK1          10
3 AAMDC         10
4 AAMP          10
5 AAR2          10
6 BANF1         10

The total number of samples is given by:

sum(NumberSamples_perID$NSamples)
[1] 47720

3.1 Save NumberSamples_perID.csv

We can now save this file in the right folder:

write.csv(NumberSamples_perID,
          file ="Nextflow/dummy_data/ATP2019/NumberSamples_perID.csv")

References

Le Sueur, Cecile, Magnus Rattray, and Mikhail Savitski. 2024. “GPMelt: A Hierarchical Gaussian Process Framework to Explore the Dark Meltome of Thermal Proteome Profiling Experiments.” PLOS Computational Biology 20 (9): e1011632.