
Fixes from CMT, plus other small revisions

Ryan C. Thompson 5 years ago
parent
commit
2ce3ff34c4
2 changed files with 348 additions and 87 deletions
  1. refs.bib: 0 additions, 14 deletions
  2. thesis.lyx: 348 additions, 73 deletions

File diff suppressed because it is too large
+ 0 - 14
refs.bib


+ 348 - 73
thesis.lyx

@@ -632,23 +632,6 @@ Thanks again for your help, and happy reading!
 Introduction
 \end_layout
 
-\begin_layout Section*
-Structure of the thesis
-\end_layout
-
-\begin_layout Standard
-\begin_inset Flex TODO Note (inline)
-status open
-
-\begin_layout Plain Layout
-Put at end up intro
-\end_layout
-
-\end_inset
-
-
-\end_layout
-
 \begin_layout Section
 \begin_inset CommandInset label
 LatexCommand label
@@ -1188,9 +1171,9 @@ The studies presented in this work all involve the analysis of high-throughput
  genomic and epigenomic data.
  These data present many unique analysis challenges, and a wide array of
  software tools are available to analyze them.
- This section presents an overview of the methods used, including what problems
- they solve, what assumptions they make, and a basic description of how
- they work.
+ This section presents an overview of the most important methods used throughout
+ the following analyses, including what problems they solve, what assumptions
+ they make, and a basic description of how they work.
 \end_layout
 
 \begin_layout Subsection
@@ -1297,6 +1280,19 @@ RNA-seq
  modeling is appropriate.
 \end_layout
 
+\begin_layout Standard
+\begin_inset Flex TODO Note (inline)
+status open
+
+\begin_layout Plain Layout
+Include an eBayes example figure
+\end_layout
+
+\end_inset
+
+
+\end_layout
+
 \begin_layout Standard
 The central challenge when fitting a linear model is to estimate the variance
  of the data accurately.
@@ -1306,14 +1302,15 @@ The central challenge when fitting a linear model is to estimate the variance
  A single shared variance could be estimated for all of the features together,
  and this estimate would be very stable, in contrast to the individual feature
  variance estimates.
- However, this would require the assumption that every feature is equally
- variable, which is known to be false for most genomic data sets.
+ However, this would require the assumption that all features have equal
+ variance, which is known to be false for most genomic data sets (for example,
+ some genes' expression is known to be more variable than others').
  
 \begin_inset Flex Code
 status open
 
 \begin_layout Plain Layout
-limma
+Limma
 \end_layout
 
 \end_inset
@@ -1517,8 +1514,8 @@ ChIP-seq
 
 \end_inset
 
-, which tend to be much smaller and therefore violate the assumption of
- a normal distribution more severely.
+ and other sources, which tend to be much smaller and therefore violate
+ the assumption of a normal distribution more severely.
  For all count-based data, the 
 \begin_inset Flex Code
 status open
@@ -1593,7 +1590,17 @@ NB
 \end_inset
 
  distribution rather than modeling the normalized log counts using a normal
- distribution 
+ distribution as 
+\begin_inset Flex Code
+status open
+
+\begin_layout Plain Layout
+limma
+\end_layout
+
+\end_inset
+
+ does 
 \begin_inset CommandInset citation
 LatexCommand cite
 key "Chen2014,McCarthy2012,Robinson2010a"
@@ -1602,7 +1609,11 @@ literal "false"
 \end_inset
 
 .
- The 
+ 
+\end_layout
+
+\begin_layout Standard
+The 
 \begin_inset Flex Glossary Term
 status open
 
@@ -1612,12 +1623,136 @@ NB
 
 \end_inset
 
- is a good fit for count data because it can be derived as a gamma-distributed
- mixture of Poisson distributions.
- The Poisson distribution accurately represents the distribution of counts
- expected for a given gene abundance, and the gamma distribution is then
- used to represent the variation in gene abundance between biological replicates.
- For this reason, the square root of the dispersion parameter of the 
+ distribution is a good fit for count data because it can be derived as
+ a gamma-distributed mixture of Poisson distributions.
+ The reads in an 
+\begin_inset Flex Glossary Term
+status open
+
+\begin_layout Plain Layout
+RNA-seq
+\end_layout
+
+\end_inset
+
+ sample are assumed to be sampled from a much larger population, such that
+ the sampling process does not significantly affect the proportions.
+ Under this assumption, a gene's read count in an 
+\begin_inset Flex Glossary Term
+status open
+
+\begin_layout Plain Layout
+RNA-seq
+\end_layout
+
+\end_inset
+
+ sample is distributed as 
+\begin_inset Formula $\mathrm{Binomial}(n,p)$
+\end_inset
+
+, where 
+\begin_inset Formula $n$
+\end_inset
+
+ is the total number of reads sequenced from the sample and 
+\begin_inset Formula $p$
+\end_inset
+
+ is the proportion of total fragments in the sample derived from that gene.
+ When 
+\begin_inset Formula $n$
+\end_inset
+
+ is large and 
+\begin_inset Formula $p$
+\end_inset
+
+ is small, a 
+\begin_inset Formula $\mathrm{Binomial}(n,p)$
+\end_inset
+
+ distribution is well-approximated by 
+\begin_inset Formula $\mathrm{Poisson}(np)$
+\end_inset
+
+.
+ Hence, if multiple sequencing runs are performed on the same 
+\begin_inset Flex Glossary Term
+status open
+
+\begin_layout Plain Layout
+RNA-seq
+\end_layout
+
+\end_inset
+
+ sample (with the same gene mixing proportions each time), each gene's read
+ count is expected to follow a Poisson distribution.
+ If the abundance of a gene, 
+\begin_inset Formula $p,$
+\end_inset
+
+ varies across biological replicates according to a gamma distribution,
+ and 
+\begin_inset Formula $n$
+\end_inset
+
+ is held constant, then the resulting distribution is a gamma-distributed
+ mixture of Poisson distributions, which is equivalent to the 
+\begin_inset Flex Glossary Term
+status open
+
+\begin_layout Plain Layout
+NB
+\end_layout
+
+\end_inset
+
+ distribution.
+ The choice of a gamma distribution for the mixing weights is arbitrary,
+ motivated by the convenience of the numerically tractable 
+\begin_inset Flex Glossary Term
+status open
+
+\begin_layout Plain Layout
+NB
+\end_layout
+
+\end_inset
+
+ distribution, since the true shape of the distribution of biological variance
+ is unknown.
+\end_layout
+
+\begin_layout Standard
+Thus, 
+\begin_inset Flex Code
+status open
+
+\begin_layout Plain Layout
+edgeR
+\end_layout
+
+\end_inset
+
+'s use of the 
+\begin_inset Flex Glossary Term
+status open
+
+\begin_layout Plain Layout
+NB
+\end_layout
+
+\end_inset
+
+ is equivalent to an 
+\emph on
+a priori 
+\emph default
+assumption that the variation in gene abundances between replicates follows
+ a gamma distribution.
+ The gamma shape parameter in the context of the 
 \begin_inset Flex Glossary Term
 status open
 
@@ -1627,7 +1762,8 @@ NB
 
 \end_inset
 
- is sometimes referred to as the 
+ is called the dispersion, and the square root of this dispersion is referred
+ to as the 
 \begin_inset Flex Glossary Term
 status open
 
@@ -1637,8 +1773,8 @@ BCV
 
 \end_inset
 
-, since it represents the variability that was present in the samples prior
- to the Poisson 
+, since it represents the variability in abundance that was present in the
+ biological samples prior to the Poisson 
 \begin_inset Quotes eld
 \end_inset
 
@@ -1648,20 +1784,17 @@ noise
 
  that was generated by the random sampling of reads in proportion to feature
  abundances.
- The choice of a gamma distribution is arbitrary and motivated by mathematical
- convenience, since a gamma-Poisson mixture yields the numerically tractable
- 
-\begin_inset Flex Glossary Term
+ Like 
+\begin_inset Flex Code
 status open
 
 \begin_layout Plain Layout
-NB
+limma
 \end_layout
 
 \end_inset
 
- distribution.
- Thus, 
+, 
 \begin_inset Flex Code
 status open
 
@@ -1671,11 +1804,19 @@ edgeR
 
 \end_inset
 
- assumes 
-\emph on
-a prioi 
-\emph default
-that the variation in abundances between replicates follows a gamma distribution.
+ estimates the 
+\begin_inset Flex Glossary Term
+status open
+
+\begin_layout Plain Layout
+BCV
+\end_layout
+
+\end_inset
+
+ for each feature using an empirical Bayes procedure that represents a compromis
+e between per-feature dispersions and a single pooled dispersion estimate
+ shared across all features.
  For differential abundance testing, 
 \begin_inset Flex Code
 status open
@@ -1686,9 +1827,34 @@ edgeR
 
 \end_inset
 
- offers a likelihood ratio test, but more recently recommends a quasi-likelihood
- test that properly factors the uncertainty in variance estimation into
- the statistical significance for each feature 
+ offers a likelihood ratio test based on the 
+\begin_inset Flex Glossary Term
+status open
+
+\begin_layout Plain Layout
+NB
+\end_layout
+
+\end_inset
+
+ 
+\begin_inset Flex Glossary Term
+status open
+
+\begin_layout Plain Layout
+GLM
+\end_layout
+
+\end_inset
+
+.
+ However, this test assumes the dispersion parameter is known exactly rather
+ than estimated from the data, which can result in overstating the significance
+ of differential abundance results.
+ More recently, a quasi-likelihood test has been introduced that properly
+ factors the uncertainty in dispersion estimation into the estimates of
+ statistical significance, and this test is recommended over the likelihood
+ ratio test in most cases 
 \begin_inset CommandInset citation
 LatexCommand cite
 key "Lund2012"
@@ -2392,12 +2558,12 @@ literal "false"
  more likely to be the result of outlier observations that happen to line
  up with the batches rather than a genuine batch effect.
  The result is a batch correction that is more robust against outliers than
- simple subtraction of mean differences subtraction.
+ simple subtraction of mean differences.
 \end_layout
 
 \begin_layout Standard
 In some data sets, unknown batch effects may be present due to inherent
- variability in in the data, either caused by technical or biological effects.
+ variability in the data, either caused by technical or biological effects.
  Examples of unknown batch effects include variations in enrichment efficiency
  between 
 \begin_inset Flex Glossary Term
@@ -2431,7 +2597,8 @@ SVD
  variation in the data) and take the first few singular vectors as batch
  effects.
  While this can be effective, it makes the unreasonable assumption that
- all batch effects are uncorrelated with any of the effects being modeled.
+ all batch effects are completely uncorrelated with any of the effects being
+ modeled.
  
 \begin_inset Flex Glossary Term
 status open
@@ -2483,6 +2650,23 @@ s in the linear model in a similar fashion to known batch effects in order
  to subtract out their effects on each feature's abundance.
 \end_layout
 
+\begin_layout Subsection
+Benjamini-Hochberg + pval dist
+\end_layout
+
+\begin_layout Standard
+\begin_inset Flex TODO Note (inline)
+status open
+
+\begin_layout Plain Layout
+Include figure showing uniform and non-uniform components of p-value dist
+\end_layout
+
+\end_inset
+
+
+\end_layout
+
 \begin_layout Subsection
 Factor analysis: PCA, MDS, MOFA
 \end_layout
@@ -2514,6 +2698,10 @@ PCA
  is informative, but careful application is required to avoid bias
 \end_layout
 
+\begin_layout Section
+Structure of the thesis
+\end_layout
+
 \begin_layout Chapter
 Reproducible genome-wide epigenetic analysis of H3K4 and H3K27 methylation
  in naïve and memory CD4
@@ -2674,9 +2862,9 @@ ChIP-seq
 
  T-cell samples in a time course before and after activation.
  Like the original analysis, this analysis looks at the dynamics of these
- marks histone marks and compare them to gene expression dynamics at the
- same time points during activation, as well as compare them between naïve
- and memory cells, in hope of discovering evidence of new mechanistic details
+ histone marks and compares them to gene expression dynamics at the same
+ time points during activation, as well as compares them between naïve and
+ memory cells, in hope of discovering evidence of new mechanistic details
  in the interplay between them.
  The original analysis of this data treated each gene promoter as a monolithic
  unit and mostly assumed that 
@@ -3138,9 +3326,31 @@ literal "false"
 .
  Comparisons of downstream results from each combination of quantification
  method and reference revealed that all quantifications gave broadly similar
- results for most genes, so shoal with the Ensembl annotation was chosen
- as the method theoretically most likely to partially mitigate some of the
- batch effect in the data.
+ results for most genes, so 
+\begin_inset Flex Code
+status open
+
+\begin_layout Plain Layout
+shoal
+\end_layout
+
+\end_inset
+
+ with the Ensembl annotation was chosen as the method theoretically most
+ likely to partially mitigate some of the batch effect in the data.
+\end_layout
+
+\begin_layout Standard
+\begin_inset Flex TODO Note (inline)
+status open
+
+\begin_layout Plain Layout
+Cite shoal
+\end_layout
+
+\end_inset
+
+
 \end_layout
 
 \begin_layout Standard
@@ -3756,7 +3966,7 @@ literal "false"
 \begin_inset Float figure
 wide false
 sideways false
-status collapsed
+status open
 
 \begin_layout Plain Layout
 \align center
@@ -3872,6 +4082,19 @@ bp.
 \end_inset
 
 
+\end_layout
+
+\begin_layout Plain Layout
+\begin_inset Flex TODO Note (inline)
+status open
+
+\begin_layout Plain Layout
+Figure font too small
+\end_layout
+
+\end_inset
+
+
 \end_layout
 
 \begin_layout Plain Layout
@@ -4130,13 +4353,13 @@ noprefix "false"
 \begin_inset Float figure
 wide false
 sideways false
-status collapsed
+status open
 
 \begin_layout Plain Layout
 \begin_inset Float figure
 wide false
 sideways false
-status collapsed
+status open
 
 \begin_layout Plain Layout
 \align center
@@ -4181,7 +4404,7 @@ H3K4me2, no correction
 \begin_inset Float figure
 wide false
 sideways false
-status collapsed
+status open
 
 \begin_layout Plain Layout
 \align center
@@ -4397,6 +4620,19 @@ H3K27me3, SVs subtracted
 \end_inset
 
 
+\end_layout
+
+\begin_layout Plain Layout
+\begin_inset Flex TODO Note (inline)
+status open
+
+\begin_layout Plain Layout
+Figure font too small
+\end_layout
+
+\end_inset
+
+
 \end_layout
 
 \begin_layout Plain Layout
@@ -4701,7 +4937,7 @@ begin{landscape}
 \begin_inset Float figure
 wide false
 sideways false
-status collapsed
+status open
 
 \begin_layout Plain Layout
 \begin_inset Float figure
@@ -4809,6 +5045,19 @@ Scatter plots of specific pairs of MOFA latent factors.
 \end_inset
 
 
+\end_layout
+
+\begin_layout Plain Layout
+\begin_inset Flex TODO Note (inline)
+status open
+
+\begin_layout Plain Layout
+Figure font a bit too small
+\end_layout
+
+\end_inset
+
+
 \end_layout
 
 \begin_layout Plain Layout
@@ -6394,8 +6643,8 @@ Expression distributions of genes with and without promoter peaks.
 \end_layout
 
 \begin_layout Subsection
-Gene expression and promoter histone methylation patterns in naïve and memory
- show convergence at day 14
+Gene expression and promoter histone methylation patterns show convergence
+ between naïve and memory cells at day 14
 \end_layout
 
 \begin_layout Standard
@@ -6519,7 +6768,7 @@ RNA-seq
 placement p
 wide false
 sideways false
-status collapsed
+status open
 
 \begin_layout Plain Layout
 \align center
@@ -6699,6 +6948,19 @@ RNA-seq PCoA showing principal coordinates 2 and 3.
 \end_inset
 
 
+\end_layout
+
+\begin_layout Plain Layout
+\begin_inset Flex TODO Note (inline)
+status open
+
+\begin_layout Plain Layout
+Figure font too small
+\end_layout
+
+\end_inset
+
+
 \end_layout
 
 \begin_layout Plain Layout
@@ -7402,7 +7664,7 @@ TSS
 \end_inset
 
 .
- In order from must upstream to most downstream, they are Clusters 6, 4,
+ In order from most upstream to most downstream, they are Clusters 6, 4,
  3, 1, and 2.
  There do not appear to be any clusters representing coverage patterns other
  than lone peaks, such as coverage troughs or double peaks.
@@ -7505,7 +7767,7 @@ begin{landscape}
 \begin_inset Float figure
 wide false
 sideways false
-status open
+status collapsed
 
 \begin_layout Plain Layout
 \align center
@@ -7640,6 +7902,19 @@ Gene expression grouped by promoter coverage clusters.
 \end_inset
 
 
+\end_layout
+
+\begin_layout Plain Layout
+\begin_inset Flex TODO Note (inline)
+status open
+
+\begin_layout Plain Layout
+Figure font too small
+\end_layout
+
+\end_inset
+
+
 \end_layout
 
 \begin_layout Plain Layout
@@ -9163,8 +9438,8 @@ status open
 \begin_inset Graphics
 	filename graphics/CD4-csaw/LaMere2016_fig8.pdf
 	lyxscale 50
-	width 60col%
-	groupId colwidth
+	width 100col%
+	groupId colfullwidth
 
 \end_inset
 
@@ -9342,7 +9617,7 @@ TSS
 \end_inset
 
  appears to be more strongly associated with elevated expression than coverage
- the same distance upstream, indicating that the 
+ at the same distance upstream, indicating that the 
 \begin_inset Quotes eld
 \end_inset
 

Some files were not shown because too many files changed in this diff