=1e4 S
Choosing the null distribution estimation
The null distribution is discussed on this page, please have a look to understand what we do in this step!
1 Introduction
With the current GPMelt implementation, we need to define a file called NumberSamples_perID.csv
.
This file contains one row per value of Level_1
, and the following two variables: Level_1
and NSamples
.
We explain in the following how to define it in the case of the different null distribution approximation methods.
Note: This file can also be used to define a custom null distribution approximation method, in which the user can define for each ID in Level_1
the desired number of samples per condition for this ID. We only recommend this custom null distribution approximation method for advanced users who know why they are doing it.
2 A general explanation on how to create NumberSamples_perID.csv
using R
For the explanation, we assume here that Level_1
corresponds to the IDs/proteins \(p\) observed in the dataset.
NSamples
thus corresponds to the variable \(\widetilde{S_p}\), (as explained in Section 2.1 and in the GPMelt paper (Le Sueur, Rattray, and Savitski 2024), especially in Supporting Information B
and Table E
in S1 file.)
2.1 Notation
Again, check this page of the website to understand the principle of the null distribution approximation if not done already!
We consider the case where the IDs would present different numbers of observed conditions. We aim to generate a final size \(S\) of the null distribution approximation, with \(S=1e4\).
Using the notation used in the GPMelt paper, and denoting by \(S\) the total number of samples used to approximate the null distribution, we consider an ID \(p\) and define:
- \(C_p\) the number of conditions for this ID \(p\),
- \(\widetilde{S_p}\) the number of samples to be drawn for each condition of this ID \(p\),
- \(S_p\) the total number of samples to be drawn for this ID \(p\).
These four quantities are linked as follows:
\[ S = \sum_p S_p \quad \text{with} \quad \widetilde{S_p} = \left\lceil \frac{S_p}{C_p} \right\rceil \ . \] with:
- \(S = \sum_p S_p\): The final size is obtained by summing the results obtained for each protein.
- \(\widetilde{S_p} = \left\lceil \frac{S_p}{C_p} \right\rceil\): The number of values per condition \(\widetilde{S_p}\) is obtained by dividing the total number of values for a protein \(S_p\) by the total number of conditions for this protein \(C_p\).
2.2 Creation of a dummy dataset
We start by creating a dummy dataset of 100 IDs presenting different numbers of conditions.
Code
library(dplyr)
library(ggplot2)
source("../../utils.R")
set.seed(123)
<- data.frame(
ExampleDatasetLevel_1 = paste0("ID_", c(1:100)),
NumberOfObservedConditions = sample(x= c(1:20), replace =TRUE, size=100)
)
head(ExampleDataset)
Level_1 NumberOfObservedConditions
1 ID_1 15
2 ID_2 19
3 ID_3 14
4 ID_4 3
5 ID_5 10
6 ID_6 18
2.3 The dataset-wise
approximation (method C, Table E
in S1 file)
It consists in computing the same number \(\widetilde{S_p}\) of values for each comparison, and combining all values across proteins to form a unique null distribution.
In the present case, we have:
\[ S_p = \left\lceil C_p \cdot \frac{S}{\sum_{p'} C_{p'}} \right\rceil \ .\]
Let’s first compute \(\sum_{p'} C_{p'}\), the total number of conditions over all IDs of the dataset.
<- sum(ExampleDataset$NumberOfObservedConditions) Total_N_cond
We now compute the number of samples per condition \(\widetilde{S_p} = \frac{S}{\sum_{p'} C_{p'}}\) (done row-wise) and compute the associated value of \(S_p\) (ToTal_NSamples_perID
):
<- ExampleDataset %>%
NumberSamples_perID rowwise() %>%
mutate(NSamples = ceiling(S / Total_N_cond ) ) %>%
mutate(ToTal_NSamples_perID = NSamples * NumberOfObservedConditions)
NumberSamples_perID
# A tibble: 100 × 4
# Rowwise:
Level_1 NumberOfObservedConditions NSamples ToTal_NSamples_perID
<chr> <int> <dbl> <dbl>
1 ID_1 15 10 150
2 ID_2 19 10 190
3 ID_3 14 10 140
4 ID_4 3 10 30
5 ID_5 10 10 100
6 ID_6 18 10 180
7 ID_7 11 10 110
8 ID_8 5 10 50
9 ID_9 20 10 200
10 ID_10 14 10 140
# ℹ 90 more rows
The total number of samples should be at least \(1e4\):
sum(NumberSamples_perID$ToTal_NSamples_perID)
[1] 11030
2.4 The ID-wise
approximation (method D, Table E
in S1 file)
In this case \(S_p \equiv S = 1e4\) and \(\widetilde{S_p} = \left\lceil \frac{S_p}{C_p} \right\rceil\)
<- ExampleDataset %>%
NumberSamples_perID rowwise() %>%
mutate(NSamples = ceiling(S / NumberOfObservedConditions)) %>%
mutate(ToTal_NSamples_perID = NSamples * NumberOfObservedConditions)
NumberSamples_perID
# A tibble: 100 × 4
# Rowwise:
Level_1 NumberOfObservedConditions NSamples ToTal_NSamples_perID
<chr> <int> <dbl> <dbl>
1 ID_1 15 667 10005
2 ID_2 19 527 10013
3 ID_3 14 715 10010
4 ID_4 3 3334 10002
5 ID_5 10 1000 10000
6 ID_6 18 556 10008
7 ID_7 11 910 10010
8 ID_8 5 2000 10000
9 ID_9 20 500 10000
10 ID_10 14 715 10010
# ℹ 90 more rows
The total number of samples should be at least \(1e4\) for each ID:
which(NumberSamples_perID$ToTal_NSamples_perID < 1e4)
integer(0)
2.5 The group-wise
approximation (method E, Table E
in S1 file)
We propose to group IDs according to the number of observed conditions, thus reducing the number of samples to draw per ID.
This method offers a compromise between the ID-wise
and dataset-wise
approximation. We decided here to group proteins by number of observed conditions because this maximises the similarity of IDs within a group in terms of number of measurements .
We group IDs by the observed numbers of conditions, and measure the size of the groups.
<- ExampleDataset %>%
NumberSamples_perID group_by(NumberOfObservedConditions) %>%
mutate(Ng = n_distinct(Level_1))
1 NumberSamples_perID
- 1
-
\(N_g\) refers to the size of the group presenting
NumberOfObservedConditions
. For example, below, \(7\) IDs present \(15\) observed conditions
# A tibble: 100 × 3
# Groups: NumberOfObservedConditions [20]
Level_1 NumberOfObservedConditions Ng
<chr> <int> <int>
1 ID_1 15 7
2 ID_2 19 6
3 ID_3 14 10
4 ID_4 3 6
5 ID_5 10 8
6 ID_6 18 4
7 ID_7 11 4
8 ID_8 5 6
9 ID_9 20 4
10 ID_10 14 10
# ℹ 90 more rows
We have \[S_p = \left\lceil C_p \cdot \frac{S}{\sum_{p' \in g} C_{p'}} \right\rceil \ , \] with \(g\) the group of the protein \(p\).
Noticing that \(C_{p'} \equiv C_p\) for all proteins \(p'\) in group \(g\), we get that \[ S_p = \left\lceil C_p \cdot \frac{S}{N_g \cdot C_{p}} \right\rceil = \left\lceil \frac{S}{N_g} \right\rceil \ .\]
As usual \(\widetilde{S_p} = \left\lceil \frac{S_p}{C_p} \right\rceil\).
We first calculate \(\sum_{p' \in G_g} C_{p'}\), the total number of conditions in group \(g\) considering all IDs in group \(g\), and we denote it by Total_N_cond_group_g
:
<- NumberSamples_perID %>%
NumberSamples_perID group_by(NumberOfObservedConditions) %>%
rowwise() %>%
mutate(Total_N_cond_group_g = NumberOfObservedConditions * Ng )%>%
ungroup()
NumberSamples_perID
# A tibble: 100 × 4
Level_1 NumberOfObservedConditions Ng Total_N_cond_group_g
<chr> <int> <int> <int>
1 ID_1 15 7 105
2 ID_2 19 6 114
3 ID_3 14 10 140
4 ID_4 3 6 18
5 ID_5 10 8 80
6 ID_6 18 4 72
7 ID_7 11 4 44
8 ID_8 5 6 30
9 ID_9 20 4 80
10 ID_10 14 10 140
# ℹ 90 more rows
We then compute \(\widetilde{S_p}\) (NSamples
):
<- NumberSamples_perID %>%
NumberSamples_perID rowwise() %>%
mutate(NSamples = ceiling(S / Total_N_cond_group_g)) %>%
mutate(ToTal_NSamples_perID = NSamples * NumberOfObservedConditions)
NumberSamples_perID
# A tibble: 100 × 6
# Rowwise:
Level_1 NumberOfObservedConditions Ng Total_N_cond_group_g NSamples
<chr> <int> <int> <int> <dbl>
1 ID_1 15 7 105 96
2 ID_2 19 6 114 88
3 ID_3 14 10 140 72
4 ID_4 3 6 18 556
5 ID_5 10 8 80 125
6 ID_6 18 4 72 139
7 ID_7 11 4 44 228
8 ID_8 5 6 30 334
9 ID_9 20 4 80 125
10 ID_10 14 10 140 72
# ℹ 90 more rows
# ℹ 1 more variable: ToTal_NSamples_perID <dbl>
The total number of samples should be at least \(1e4\) per group:
%>%
NumberSamples_perID group_by( NumberOfObservedConditions ) %>%
summarise(ToTal_NSamples = sum(ToTal_NSamples_perID))
# A tibble: 20 × 2
NumberOfObservedConditions ToTal_NSamples
<int> <dbl>
1 1 10000
2 2 10002
3 3 10008
4 4 10000
5 5 10020
6 6 10008
7 7 10024
8 8 10032
9 9 10035
10 10 10000
11 11 10032
12 12 10020
13 13 10023
14 14 10080
15 15 10080
16 16 10032
17 17 10030
18 18 10008
19 19 10032
20 20 10000
Important note: while the group-wise approximation is not directly implemented in GPMelt, it can be obtained by defining NumberSamples_perID.csv
as above and recomputing the p-values accordingly using GPMelt output files likelihood_ratios_df.csv
and likelihood_ratios_sampled_dataset_df.csv
, described here, section 5.
3 Application to the ATP 2019 dataset
Coming back to the example of the ATP 2019 dataset, we choose the dataset-wise
approximation (see Section 2.3), which consists in computing the same number of samples for each ID, and combining all samples across ID to form a unique null distribution.
We choose 10 samples per ID:
=10
n_samples
<- Data_forPython %>%
NumberSamples_perID ::select(Level_1) %>%
dplyrdistinct() %>%
mutate(NSamples = n_samples)
head(NumberSamples_perID)
# A tibble: 6 × 2
Level_1 NSamples
<chr> <dbl>
1 AAGAB 10
2 AAK1 10
3 AAMDC 10
4 AAMP 10
5 AAR2 10
6 BANF1 10
The total number of samples is given by:
sum(NumberSamples_perID$NSamples)
[1] 47720
3.1 Save NumberSamples_perID.csv
We can now save this file in the right folder:
write.csv(NumberSamples_perID,
file ="Nextflow/dummy_data/ATP2019/NumberSamples_perID.csv")