6 years ago · bb70ac8b47
--- a/thesis.lyx
+++ b/thesis.lyx
@@ -787,22 +787,17 @@ The studies presented in this work all involve the analysis of high-throughput
 
															  they work.
														
 
															 \end_layout
														
 
															-\begin_layout Standard
														
 
															-\begin_inset Flex TODO Note (inline)
														
 
															+\begin_layout Subsubsection
														
 
															+\begin_inset Flex Code
														
 
															 status open
														
 
															 \begin_layout Plain Layout
														
 
															-Many of these points may also be addressed in the approach/methods sections
														
 
															- of the following chapters? Redundant?
														
 
															+Limma
														
 
															 \end_layout
														
 
															 \end_inset
														
 
															-
														
 
															-\end_layout
														
 
															-
														
 
															-\begin_layout Subsubsection
														
 
															-Limma: The standard linear modeling framework for genomics
														
 
															+: The standard linear modeling framework for genomics
														
 
															 \end_layout
														
 
															 \begin_layout Standard
														
@@ -820,7 +815,7 @@ literal "false"
 
															 .
														
 
															  In a typical linear model, there is one dependent variable observation
														
 
															- per sample.
														
 
															+ per sample and a large number of samples.
														
 
															  For example, in a linear model of height as a function of age and sex,
														
 
															  there is one height measurement per person.
														
 
															  However, when analyzing genomic data, each sample consists of observations
														
@@ -833,18 +828,38 @@ literal "false"
 
															  independently to each feature.
														
 
															  However, this is undesirable for most genomics data sets.
														
 
															  Genomics assays like high-throughput sequencing are expensive, and often
														
 
															- generating the samples is also quite expensive and time-consuming.
														
 
															+ the process of generating the samples is also quite expensive and time-consumin
														
 
															+g.
														
 
															  This expense limits the sample sizes typically employed in genomics experiments
														
 
															-, and as a result the statistical power of each individual feature's linear
														
 
															- model is likewise limited.
														
 
															+, and as a result the statistical power of the linear model for each individual
														
 
															+ feature is likewise limited.
														
 
															  However, because thousands of features from the same samples are analyzed
														
 
															  together, there is an opportunity to improve the statistical power of the
														
 
															  analysis by exploiting shared patterns of variation across features.
														
 
															- This is the core feature of limma, a linear modeling framework designed
														
 
															- for genomic data.
														
 
															- Limma is typically used to analyze expression microarray data, and more
														
 
															- recently RNA-seq data, but it can also be used to analyze any other data
														
 
															- for which linear modeling is appropriate.
														
 
															+ This is the core feature of 
														
 
															+\begin_inset Flex Code
														
 
															+status open
														
 
															+
														
 
															+\begin_layout Plain Layout
														
 
															+limma
														
 
															+\end_layout
														
 
															+
														
 
															+\end_inset
														
 
															+
														
 
															+, a linear modeling framework designed for genomic data.
														
 
															+ 
														
 
															+\begin_inset Flex Code
														
 
															+status open
														
 
															+
														
 
															+\begin_layout Plain Layout
														
 
															+Limma
														
 
															+\end_layout
														
 
															+
														
 
															+\end_inset
														
 
															+
														
 
															+ is typically used to analyze expression microarray data, and more recently
														
 
															+ RNA-seq data, but it can also be used to analyze any other data for which
														
 
															+ linear modeling is appropriate.
														
 
															 \end_layout
														
 
															 \begin_layout Standard
														
@@ -858,7 +873,18 @@ The central challenge when fitting a linear model is to estimate the variance
 
															  variance estimates.
														
 
															  However, this would require the assumption that every feature is equally
														
 
															  variable, which is known to be false for most genomic data sets.
														
 
															- Limma offers a compromise between these two extremes by using a method
														
 
															+
														
 
															+
														
 
															+\begin_inset Flex Code
														
 
															+status open
														
 
															+
														
 
															+\begin_layout Plain Layout
														
 
															+limma
														
 
															+\end_layout
														
 
															+
														
 
															+\end_inset
														
 
															+
														
 
															+ offers a compromise between these two extremes by using a method
														
 
															  called empirical Bayes moderation to 
														
 
															 \begin_inset Quotes eld
														
 
															 \end_inset
														
@@ -885,7 +911,18 @@ on of the two yields a variance estimate for each feature with greater precision
 
															  toward the common value introduces some bias – the variance will be underestima
														
 
															 ted for features with high variance and overestimated for features with
														
 
															  low variance.
														
 
															- Essentially, limma assumes that extreme variances are less common than
														
 
															+ Essentially,
														
 
															+
														
 
															+\begin_inset Flex Code
														
 
															+status open
														
 
															+
														
 
															+\begin_layout Plain Layout
														
 
															+limma
														
 
															+\end_layout
														
 
															+
														
 
															+\end_inset
														
 
															+
														
 
															+ assumes that extreme variances are less common than
														
 
															  variances close to the common value.
														
 
															  The variance estimates from this empirical Bayes procedure are shown empiricall
														
 
															 y to yield greater statistical power than either the individual feature
														
@@ -893,10 +930,32 @@ y to yield greater statistical power than either the individual feature
 
															 \end_layout
														
 
															 \begin_layout Standard
														
 
															-On top of this core framework, limma also implements many other enhancements
														
 
															+On top of this core framework,
														
 
															+
														
 
															+\begin_inset Flex Code
														
 
															+status open
														
 
															+
														
 
															+\begin_layout Plain Layout
														
 
															+limma
														
 
															+\end_layout
														
 
															+
														
 
															+\end_inset
														
 
															+
														
 
															+ also implements many other enhancements
														
 
															  that further relax the assumptions of the model and extend the scope of
														
 
															  what kinds of data it can analyze.
														
 
															- Instead of squeezing toward a single common variance value, limma can model
														
 
															+ Instead of squeezing toward a single common variance value,
														
 
															+
														
 
															+\begin_inset Flex Code
														
 
															+status open
														
 
															+
														
 
															+\begin_layout Plain Layout
														
 
															+limma
														
 
															+\end_layout
														
 
															+
														
 
															+\end_inset
														
 
															+
														
 
															+ can model
														
 
															  the common variance as a function of a covariate, such as average expression
														
 
															 \begin_inset CommandInset citation
														
@@ -911,7 +970,18 @@ literal "false"
 
															  precise expression measurements and therefore smaller variances than low-count
														
 
															  genes.
														
 
															  While linear models typically assume that all samples have equal variance,
														
 
															- limma is able to relax this assumption by identifying and down-weighting
														
 
															+
														
 
															+
														
 
															+\begin_inset Flex Code
														
 
															+status open
														
 
															+
														
 
															+\begin_layout Plain Layout
														
 
															+limma
														
 
															+\end_layout
														
 
															+
														
 
															+\end_inset
														
 
															+
														
 
															+ is able to relax this assumption by identifying and down-weighting
														
 
															  samples that diverge more strongly from the linear model across many features
														
 
															 \begin_inset CommandInset citation
														
@@ -922,7 +992,18 @@ literal "false"
 
															 \end_inset
														
 
															 .
														
 
															- In addition, limma is also able to fit simple mixed models incorporating
														
 
															+ In addition,
														
 
															+
														
 
															+\begin_inset Flex Code
														
 
															+status open
														
 
															+
														
 
															+\begin_layout Plain Layout
														
 
															+limma
														
 
															+\end_layout
														
 
															+
														
 
															+\end_inset
														
 
															+
														
 
															+ is also able to fit simple mixed models incorporating
														
 
															  one random effect in addition to the fixed effects represented by an ordinary
														
 
															  linear model 
														
 
															 \begin_inset CommandInset citation
														
@@ -933,21 +1014,87 @@ literal "false"
 
															 \end_inset
														
 
															 .
														
 
															- Once again, limma shares information between features to obtain a robust
														
 
															+ Once again,
														
 
															+
														
 
															+\begin_inset Flex Code
														
 
															+status open
														
 
															+
														
 
															+\begin_layout Plain Layout
														
 
															+limma
														
 
															+\end_layout
														
 
															+
														
 
															+\end_inset
														
 
															+
														
 
															+ shares information between features to obtain a robust
														
 
															  estimate for the random effect correlation.
														
 
															 \end_layout
														
 
															 \begin_layout Subsubsection
														
 
															-edgeR provides limma-like analysis features for count data
														
 
															+edgeR provides
														
 
															+
														
 
															+\begin_inset Flex Code
														
 
															+status open
														
 
															+
														
 
															+\begin_layout Plain Layout
														
 
															+limma
														
 
															+\end_layout
														
 
															+
														
 
															+\end_inset
														
 
															+
														
 
															+-like analysis features for count data
														
 
															 \end_layout
														
 
															 \begin_layout Standard
														
 
															-Although limma can be applied to read counts from RNA-seq data, it is less
														
 
															+Although
														
 
															+
														
 
															+\begin_inset Flex Code
														
 
															+status open
														
 
															+
														
 
															+\begin_layout Plain Layout
														
 
															+limma
														
 
															+\end_layout
														
 
															+
														
 
															+\end_inset
														
 
															+
														
 
															+ can be applied to read counts from RNA-seq data, it is less
														
 
															  suitable for counts from ChIP-seq data, which tend to be much smaller and
														
 
															  therefore violate the assumption of a normal distribution more severely.
														
 
															- For all count-based data, the edgeR package works similarly to limma, but
														
 
															+ For all count-based data, the
														
 
															+
														
 
															+\begin_inset Flex Code
														
 
															+status open
														
 
															+
														
 
															+\begin_layout Plain Layout
														
 
															+edgeR
														
 
															+\end_layout
														
 
															+
														
 
															+\end_inset
														
 
															+
														
 
															+ package works similarly to
														
 
															+
														
 
															+\begin_inset Flex Code
														
 
															+status open
														
 
															+
														
 
															+\begin_layout Plain Layout
														
 
															+limma
														
 
															+\end_layout
														
 
															+
														
 
															+\end_inset
														
 
															+
														
 
															+, but
														
 
															  uses a generalized linear model instead of a linear model.
														
 
															- The most important difference is that the GLM in edgeR models the counts
														
 
															+ The most important difference is that the GLM in
														
 
															+
														
 
															+\begin_inset Flex Code
														
 
															+status open
														
 
															+
														
 
															+\begin_layout Plain Layout
														
 
															+edgeR
														
 
															+\end_layout
														
 
															+
														
 
															+\end_inset
														
 
															+
														
 
															+ models the counts
														
 
															  directly using a negative binomial distribution rather than modeling the
														
 
															  normalized log counts using a normal distribution 
														
 
															 \begin_inset CommandInset citation
														
@@ -979,12 +1126,34 @@ noise
 
															  The choice of a gamma distribution is arbitrary and motivated by mathematical
														
 
															  convenience, since a gamma-Poisson mixture yields the numerically tractable
														
 
															  negative binomial distribution.
														
 
															- Thus, edgeR assumes 
														
 
															+ Thus,
														
 
															+
														
 
															+\begin_inset Flex Code
														
 
															+status open
														
 
															+
														
 
															+\begin_layout Plain Layout
														
 
															+edgeR
														
 
															+\end_layout
														
 
															+
														
 
															+\end_inset
														
 
															+
														
 
															+ assumes 
														
 
															 \emph on
														
 
															 a priori 
														
 
															 \emph default
														
 
															 that the variation in abundances between replicates follows a gamma distribution.
														
 
															- For differential abundance testing, edgeR offers a likelihood ratio test,
														
 
															+ For differential abundance testing,
														
 
															+
														
 
															+\begin_inset Flex Code
														
 
															+status open
														
 
															+
														
 
															+\begin_layout Plain Layout
														
 
															+edgeR
														
 
															+\end_layout
														
 
															+
														
 
															+\end_inset
														
 
															+
														
 
															+ offers a likelihood ratio test,
														
 
															  but more recently recommends a quasi-likelihood test that properly factors
														
 
															  the uncertainty in variance estimation into the statistical significance
														
 
															  for each feature 
														
@@ -1268,7 +1437,18 @@ In addition to well-understood effects that can be easily normalized out,
 
															  However, as with variance estimation, estimating the differences in batch
														
 
															  means is not necessarily robust at the feature level, so the ComBat method
														
 
															  adds empirical Bayes squeezing of the batch mean differences toward a common
														
 
															- value, analogous to limma's empirical Bayes squeezing of feature variance
														
 
															+ value, analogous to
														
 
															+
														
 
															+\begin_inset Flex Code
														
 
															+status open
														
 
															+
														
 
															+\begin_layout Plain Layout
														
 
															+limma
														
 
															+\end_layout
														
 
															+
														
 
															+\end_inset
														
 
															+
														
 
															+'s empirical Bayes squeezing of feature variance
														
 
															  estimates 
														
 
															 \begin_inset CommandInset citation
														
 
															 LatexCommand cite
														
@@ -2155,7 +2335,18 @@ However, removing the systematic component of the batch effect still leaves
 
															  the noise component.
														
 
															  The gene quantifications from the first batch are substantially noisier
														
 
															  than those in the second batch.
														
 
															- This analysis corrected for this by using limma's sample weighting method
														
 
															+ This analysis corrected for this by using
														
 
															+
														
 
															+\begin_inset Flex Code
														
 
															+status open
														
 
															+
														
 
															+\begin_layout Plain Layout
														
 
															+limma
														
 
															+\end_layout
														
 
															+
														
 
															+\end_inset
														
 
															+
														
 
															+'s sample weighting method
														
 
															  to assign lower weights to the noisy samples of batch 1 
														
 
															 \begin_inset CommandInset citation
														
 
															 LatexCommand cite
														
@@ -2200,8 +2391,30 @@ literal "false"
 
															 , and batch-corrected at this point using ComBat.
														
 
															  A linear model was fit to the batch-corrected, quality-weighted data for
														
 
															- each gene using limma, and each gene was tested for differential expression
														
 
															- using limma's empirical Bayes moderated 
														
 
															+ each gene using
														
 
															+
														
 
															+\begin_inset Flex Code
														
 
															+status open
														
 
															+
														
 
															+\begin_layout Plain Layout
														
 
															+limma
														
 
															+\end_layout
														
 
															+
														
 
															+\end_inset
														
 
															+
														
 
															+, and each gene was tested for differential expression
														
 
															+ using
														
 
															+
														
 
															+\begin_inset Flex Code
														
 
															+status open
														
 
															+
														
 
															+\begin_layout Plain Layout
														
 
															+limma
														
 
															+\end_layout
														
 
															+
														
 
															+\end_inset
														
 
															+
														
 
															+'s empirical Bayes moderated 
														
 
															 \begin_inset Formula $t$
														
 
															 \end_inset
														
@@ -2869,7 +3082,18 @@ PCoA plots of ChIP-seq sliding window data, before and after subtracting
 
															 \begin_layout Standard
														
 
															 Reads in promoters, peaks, and sliding windows across the genome were counted
														
 
															  and normalized using csaw and analyzed for differential modification using
														
 
															- edgeR 
														
 
															+
														
 
															+
														
 
															+\begin_inset Flex Code
														
 
															+status open
														
 
															+
														
 
															+\begin_layout Plain Layout
														
 
															+edgeR
														
 
															+\end_layout
														
 
															+
														
 
															+\end_inset
														
 
															+
														
 
															+ 
														
 
															 \begin_inset CommandInset citation
														
 
															 LatexCommand cite
														
 
															 key "Lun2014,Lun2015a,Lund2012,Phipson2016"
														
@@ -13078,7 +13302,18 @@ literal "false"
 
															 .
														
 
															  Log2 counts per million values (logCPM) were calculated using the cpm function
														
 
															- in edgeR for individual samples and aveLogCPM function for averages across
														
 
															+ in
														
 
															+
														
 
															+\begin_inset Flex Code
														
 
															+status open
														
 
															+
														
 
															+\begin_layout Plain Layout
														
 
															+edgeR
														
 
															+\end_layout
														
 
															+
														
 
															+\end_inset
														
 
															+
														
 
															+ for individual samples and aveLogCPM function for averages across
														
 
															  groups of samples, using those functions’ default prior count values to
														
 
															  avoid taking the logarithm of 0.
														
 
															  Genes were considered “present” if their average normalized logCPM values
														
@@ -13129,7 +13364,18 @@ Differential Expression Analysis
 
															 \end_layout
														
 
															 \begin_layout Standard
														
 
															-All tests for differential gene expression were performed using edgeR, by
														
 
															+All tests for differential gene expression were performed using
														
 
															+
														
 
															+\begin_inset Flex Code
														
 
															+status open
														
 
															+
														
 
															+\begin_layout Plain Layout
														
 
															+edgeR
														
 
															+\end_layout
														
 
															+
														
 
															+\end_inset
														
 
															+
														
 
															+, by
														
 
															  first fitting a negative binomial generalized linear model to the counts
														
 
															  and normalization factors and then performing a quasi-likelihood F-test
														
 
															  with robust estimation of outlier gene dispersions 
														
@@ -14311,10 +14557,32 @@ noprefix "false"
 
															 , and genes with an average logCPM below -1 were filtered out.
														
 
															  Each remaining gene was tested for differential abundance with respect
														
 
															- to globin blocking (GB) using edgeR’s quasi-likelihood F-test, fitting
														
 
															+ to globin blocking (GB) using
														
 
															+
														
 
															+\begin_inset Flex Code
														
 
															+status open
														
 
															+
														
 
															+\begin_layout Plain Layout
														
 
															+edgeR
														
 
															+\end_layout
														
 
															+
														
 
															+\end_inset
														
 
															+
														
 
															+’s quasi-likelihood F-test, fitting
														
 
															  a negative binomial generalized linear model to the table of read counts in
														
 
															  each library.
														
 
															- For each gene, edgeR reported average abundance (logCPM), 
														
 
															+ For each gene,
														
 
															+
														
 
															+\begin_inset Flex Code
														
 
															+status open
														
 
															+
														
 
															+\begin_layout Plain Layout
														
 
															+edgeR
														
 
															+\end_layout
														
 
															+
														
 
															+\end_inset
														
 
															+
														
 
															+ reported average abundance (logCPM), 
														
 
															 \begin_inset Formula $\log_{2}$
														
 
															 \end_inset
														
@@ -14439,7 +14707,18 @@ Comparison of inter-sample gene abundance correlations with and without
 
															  All libraries were normalized together as described in Figure 2, and genes
														
 
															  with an average abundance (logCPM, log2 counts per million reads counted)
														
 
															  less than -1 were filtered out.
														
 
															- Each gene’s logCPM was computed in each library using the edgeR cpm function.
														
 
															+ Each gene’s logCPM was computed in each library using the
														
 
															+
														
 
															+\begin_inset Flex Code
														
 
															+status open
														
 
															+
														
 
															+\begin_layout Plain Layout
														
 
															+edgeR
														
 
															+\end_layout
														
 
															+
														
 
															+\end_inset
														
 
															+
														
 
															+ cpm function.
														
 
															  For each pair of biological samples, the Pearson correlation between those
														
 
															  samples' GB libraries was plotted against the correlation between the same
														
 
															  samples’ non-GB libraries.
														
@@ -14487,7 +14766,18 @@ ons than the non-GB libraries.
 
															  sign-rank test: V = 2195, P ≪ 2.2e-16).
														
 
															  Performing the same tests on the Spearman correlations gave the same conclusion
														
 
															  (t-test: t = 26.8, df = 665, P ≪ 2.2e-16; sign-rank test: V = 8781, P ≪ 2.2e-16).
														
 
															- The edgeR package was used to compute the overall biological coefficient
														
 
															+ The
														
 
															+
														
 
															+\begin_inset Flex Code
														
 
															+status open
														
 
															+
														
 
															+\begin_layout Plain Layout
														
 
															+edgeR
														
 
															+\end_layout
														
 
															+
														
 
															+\end_inset
														
 
															+
														
 
															+ package was used to compute the overall biological coefficient
														
 
															  of variation (BCV) for GB and non-GB libraries, which showed that globin blocking
														
 
															  resulted in a negligible increase in the BCV (0.417 with GB vs.
														
 
															  0.400 without).
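
Reviewer note: several hunks in this diff refer to logCPM values computed with a prior count "to avoid taking the logarithm of 0". The following is a minimal Python sketch of that idea, not the edgeR implementation. The function name `log_cpm`, the `prior_count=2.0` default, and the choice to scale the prior count by relative library size are assumptions made for illustration.

```python
import math


def log_cpm(counts, lib_sizes, prior_count=2.0):
    """Return log2 counts-per-million for a genes x libraries count table.

    counts      -- list of rows, one row per gene, one column per library
    lib_sizes   -- total read count of each library
    prior_count -- illustrative pseudo-count (an assumption, not a quoted default)
    """
    mean_lib = sum(lib_sizes) / len(lib_sizes)
    # Scale the pseudo-count so larger libraries receive a proportionally
    # larger prior; this keeps zero counts comparable across libraries.
    scaled_prior = [prior_count * n / mean_lib for n in lib_sizes]
    # Inflate each library size by twice its prior so proportions stay below 1.
    adjusted_lib = [n + 2.0 * p for n, p in zip(lib_sizes, scaled_prior)]
    return [
        [
            math.log2((c + p) / n * 1e6)
            for c, p, n in zip(row, scaled_prior, adjusted_lib)
        ]
        for row in counts
    ]


# Hypothetical two-gene, two-library table.
example = log_cpm([[0, 0], [10, 20]], [1_000_000, 2_000_000])
```

With the prior scaled this way, a gene with zero reads gets the same finite logCPM in every library, so differences in sequencing depth alone do not create apparent differences in abundance.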