
Expand on peak calling and FDR in intro

Ryan C. Thompson 5 years ago
parent
commit
e35d6afcfa
1 changed file with 130 additions and 13 deletions
+ 130 - 13
thesis.lyx

@@ -2063,6 +2063,23 @@ ChIP-seq
  
 \end_layout
 
+\begin_layout Standard
+The challenge in peak calling is that the immunoprecipitation step is not
+ 100% selective, so some fraction of reads are 
+\emph on
+not 
+\emph default
+derived from DNA fragments that were bound by the immunoprecipitated protein.
+ These are referred to as background reads.
+ Biases in amplification and sequencing, as well as the aforementioned Poisson
+ randomness of the sequencing itself, can cause fluctuations in the background
+ level of reads that resemble peaks, and the true peaks must be distinguished
+ from these.
+ It is common to sequence the input to the ChIP-seq reaction as well as
+ the immunoprecipitated sample in order to aid in estimating the fluctuations
+ in background level across the genome.
+\end_layout
+
 \begin_layout Standard
 There are generally two kinds of peaks that can be identified: narrow peaks
  and broadly enriched regions.
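
The background-read paragraph added in this hunk can be illustrated with a small sketch. The Python below is not part of thesis.lyx. It assumes, purely for illustration, that read counts in a fixed-size window follow a Poisson distribution whose rate is estimated from the input sample, and it computes how surprising an observed ChIP count would be under that background alone. The function name `poisson_upper_tail` and the example numbers are hypothetical.

```python
import math

def poisson_upper_tail(count, background_rate):
    """P(X >= count) for X ~ Poisson(background_rate).

    Hypothetical helper: background_rate stands in for the expected
    number of background reads in a window, e.g. estimated from input.
    """
    if count <= 0:
        return 1.0
    # Sum P(X = i) for i < count, then take the complement.
    term = math.exp(-background_rate)  # P(X = 0)
    below = term
    for i in range(1, count):
        term *= background_rate / i    # P(X = i) from P(X = i - 1)
        below += term
    return max(0.0, 1.0 - below)

# Illustrative numbers only: 5 background reads expected per window.
p_modest = poisson_upper_tail(8, 5.0)   # a fluctuation that chance can produce
p_strong = poisson_upper_tail(25, 5.0)  # far harder to explain by background
```

A per-window tail probability like this accounts only for Poisson randomness. It says nothing about amplification or sequencing biases, which is why the text points to the input sample for estimating how the background level varies across the genome.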
@@ -2176,18 +2193,6 @@ literal "false"
 \end_inset
 
 .
- In all cases, better results are obtained if the local background coverage
- level can be estimated from 
-\begin_inset Flex Glossary Term
-status open
-
-\begin_layout Plain Layout
-ChIP-seq
-\end_layout
-
-\end_inset
-
- input samples, since various biases can result in uneven background coverage.
 \end_layout
 
 \begin_layout Standard
@@ -2950,11 +2955,102 @@ s in the linear model in a similar fashion to known batch effects in order
 Benjamini-Hochberg + pval dist
 \end_layout
 
+\begin_layout Standard
+When testing thousands of genes for differential expression or performing
+ thousands of statistical tests for other kinds of genomic data, the result
+ is thousands of p-values.
+ By construction, p-values have a 
+\begin_inset Formula $\mathrm{Uniform}(0,1)$
+\end_inset
+
+ distribution under the null hypothesis.
+ This means that if all null hypotheses are true in a large number 
+\begin_inset Formula $N$
+\end_inset
+
+ of tests, then for any significance threshold 
+\begin_inset Formula $T$
+\end_inset
+
+, approximately 
+\begin_inset Formula $N\times T$
+\end_inset
+
+ p-values will be 
+\begin_inset Quotes eld
+\end_inset
+
+significant
+\begin_inset Quotes erd
+\end_inset
+
+ at that threshold even though the null hypotheses are all true.
+ These are called false discoveries.
+\end_layout
+
+\begin_layout Standard
+When only a fraction of null hypotheses are true, the p-value distribution
+ will be a mixture of a uniform component representing the null hypotheses
+ that are true and a non-uniform component representing the null hypotheses
+ that are not true.
+ The fraction belonging to the uniform component is referred to as 
+\begin_inset Formula $\pi_{0}$
+\end_inset
+
+, which ranges from 1 (all null hypotheses true) to 0 (all null hypotheses
+ false).
+ Furthermore, the non-uniform component must be biased toward zero, since
+ any evidence against the null hypothesis must push the p-value for a test
+ toward zero.
+ We can exploit this fact to estimate the 
+\begin_inset Flex Glossary Term
+status open
+
+\begin_layout Plain Layout
+FDR
+\end_layout
+
+\end_inset
+
+ for any significance threshold by estimating the degree to which the density
+ of p-values left of that threshold exceeds what would be expected for a
+ uniform distribution.
+ In genomics, the most commonly used FDR estimation method, and the one
+ used in this work, is that of 
+\begin_inset ERT
+status open
+
+\begin_layout Plain Layout
+
+
+\backslash
+glsdisp{BH}{Benjamini and Hochberg}
+\end_layout
+
+\end_inset
+
+ 
+\begin_inset CommandInset citation
+LatexCommand cite
+key "Benjamini1995"
+literal "false"
+
+\end_inset
+
+.
+ This is a conservative method that effectively assumes 
+\begin_inset Formula $\pi_{0}=1$
+\end_inset
+
+ unconditionally.
+ Hence it gives an upper bound for the FDR at any significance threshold.
+\end_layout
+
 \begin_layout Standard
 \begin_inset Float figure
 wide false
 sideways false
-status open
+status collapsed
 
 \begin_layout Plain Layout
 \align center
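
To make the Benjamini-Hochberg description in this hunk concrete, here is a minimal pure-Python sketch of BH-adjusted p-values. It is an illustration only, not code from the thesis, and the function name `bh_adjust` is hypothetical. For the p-value of rank k among N tests, the adjusted value is the smallest p_(j) * N / j over all ranks j >= k. Each such ratio is the expected number of false discoveries at threshold T = p_(j), with pi_0 taken as 1, divided by the number of p-values at or below that threshold.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Hypothetical helper for illustration; assumes every input lies in [0, 1].
    """
    n = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the raw p-value decreases.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

example = bh_adjust([0.01, 0.04, 0.03, 0.5])
```

For the four example p-values, the adjusted values come out to roughly 0.04, 0.053, 0.053, and 0.5. Calling a test significant when its adjusted value is at most 0.1 would therefore select the first three.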
@@ -2982,6 +3078,27 @@ name "fig:Example-pval-hist"
 
 \series bold
 Example p-value histogram.
+ 
+\series default
+The distribution of p-values from a large number of independent tests (such
+ as differential expression tests for each gene in the genome) is a mixture
+ of a uniform component representing the null hypotheses that are true (blue
+ shading) and a zero-biased component representing the null hypotheses that
+ are false (red shading).
+ The FDR for any column in the histogram is the fraction of that column
+ that is blue.
+ The line 
+\begin_inset Formula $y=\pi_{0}$
+\end_inset
+
+ represents the theoretical uniform component of this p-value distribution,
+ while the line 
+\begin_inset Formula $y=1$
+\end_inset
+
+ represents the uniform component when all null hypotheses are true.
+ Note that in real data, the true status of each hypothesis is unknown,
+ so only the overall shape of the distribution is known.
 \end_layout
 
 \end_inset
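
The caption's statement about column fractions can be written out as a short worked equation. This assumes the histogram is drawn on a density scale, so that the uniform component has height pi_0, as the line y = pi_0 suggests. The symbol h(p) for the height of the column containing p and the example numbers are introduced here for illustration only, and the relation holds in expectation for columns with h(p) >= pi_0.

```latex
\[
  \text{blue fraction of a column} \;\approx\; \frac{\pi_{0}}{h(p)}
  \;\le\; \frac{1}{h(p)}
\]
% Example: a column of height h(p) = 4 with \pi_{0} = 0.8 is about
% 0.8 / 4 = 0.2 blue; taking \pi_{0} = 1 gives the upper bound 1 / 4 = 0.25.
```

The right-hand bound corresponds to drawing the reference line at y = 1 instead of y = pi_0, which is the conservative choice the diff attributes to the Benjamini-Hochberg method.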