5 years ago · bb70ac8b47
--- a/thesis.lyx
+++ b/thesis.lyx
@@ -787,22 +787,17 @@ The studies presented in this work all involve the analysis of high-throughput
 
				  they work.
			
 
				 \end_layout
			
 
				 
			
 
				-\begin_layout Standard
			
 
				-\begin_inset Flex TODO Note (inline)
			
 
				+\begin_layout Subsubsection
			
 
				+\begin_inset Flex Code
			
 
				 status open
			
 
				 
			
 
				 \begin_layout Plain Layout
			
 
				-Many of these points may also be addressed in the approach/methods sections
			
 
				- of the following chapters? Redundant?
			
 
				+limma
			
 
				 \end_layout
			
 
				 
			
 
				 \end_inset
			
 
				 
			
 
				-
			
 
				-\end_layout
			
 
				-
			
 
				-\begin_layout Subsubsection
			
 
				-Limma: The standard linear modeling framework for genomics
			
 
				+: The standard linear modeling framework for genomics
			
 
				 \end_layout
			
 
				 
			
 
				 \begin_layout Standard
			
@@ -820,7 +815,7 @@ literal "false"
 
				 
			
 
				 .
			
 
				  In a typical linear model, there is one dependent variable observation
			
 
				- per sample.
			
 
				+ per sample and a large number of samples.
			
 
				  For example, in a linear model of height as a function of age and sex,
			
 
				  there is one height measurement per person.
			
 
				  However, when analyzing genomic data, each sample consists of observations
			
@@ -833,18 +828,38 @@ literal "false"
 
				  independently to each feature.
			
 
				  However, this is undesirable for most genomics data sets.
			
 
				  Genomics assays like high-throughput sequencing are expensive, and often
			
 
				- generating the samples is also quite expensive and time-consuming.
			
 
				+ the process of generating the samples is also quite expensive and time-consumin
			
 
				+g.
			
 
				  This expense limits the sample sizes typically employed in genomics experiments
			
 
				-, and as a result the statistical power of each individual feature's linear
			
 
				- model is likewise limited.
			
 
				+, and as a result the statistical power of the linear model for each individual
			
 
				+ feature is likewise limited.
			
 
				  However, because thousands of features from the same samples are analyzed
			
 
				  together, there is an opportunity to improve the statistical power of the
			
 
				  analysis by exploiting shared patterns of variation across features.
			
 
				- This is the core feature of limma, a linear modeling framework designed
			
 
				- for genomic data.
			
 
				- Limma is typically used to analyze expression microarray data, and more
			
 
				- recently RNA-seq data, but it can also be used to analyze any other data
			
 
				- for which linear modeling is appropriate.
			
 
				+ This is the core feature of 
			
 
				+\begin_inset Flex Code
			
 
				+status open
			
 
				+
			
 
				+\begin_layout Plain Layout
			
 
				+limma
			
 
				+\end_layout
			
 
				+
			
 
				+\end_inset
			
 
				+
			
 
				+, a linear modeling framework designed for genomic data.
			
 
				+ 
			
 
				+\begin_inset Flex Code
			
 
				+status open
			
 
				+
			
 
				+\begin_layout Plain Layout
			
 
				+limma
			
 
				+\end_layout
			
 
				+
			
 
				+\end_inset
			
 
				+
			
 
				+ is typically used to analyze expression microarray data, and more recently
			
 
				+ RNA-seq data, but it can also be used to analyze any other data for which
			
 
				+ linear modeling is appropriate.
			
 
				 \end_layout
			
 
				 
			
 
				 \begin_layout Standard
			
@@ -858,7 +873,18 @@ The central challenge when fitting a linear model is to estimate the variance
 
				  variance estimates.
			
 
				  However, this would require the assumption that every feature is equally
			
 
				  variable, which is known to be false for most genomic data sets.
			
 
				- Limma offers a compromise between these two extremes by using a method
			
 
				+
			
 
				+
			
 
				+\begin_inset Flex Code
			
 
				+status open
			
 
				+
			
 
				+\begin_layout Plain Layout
			
 
				+limma
			
 
				+\end_layout
			
 
				+
			
 
				+\end_inset
			
 
				+
			
 
				+ offers a compromise between these two extremes by using a method
			
 
				  called empirical Bayes moderation to 
			
 
				 \begin_inset Quotes eld
			
 
				 \end_inset
			
@@ -885,7 +911,18 @@ on of the two yields a variance estimate for each feature with greater precision
 
				  toward the common value introduces some bias – the variance will be underestima
			
 
				 ted for features with high variance and overestimated for features with
			
 
				  low variance.
			
 
				- Essentially, limma assumes that extreme variances are less common than
			
 
				+ Essentially,
			
 
				+
			
 
				+\begin_inset Flex Code
			
 
				+status open
			
 
				+
			
 
				+\begin_layout Plain Layout
			
 
				+limma
			
 
				+\end_layout
			
 
				+
			
 
				+\end_inset
			
 
				+
			
 
				+ assumes that extreme variances are less common than
			
 
				  variances close to the common value.
			
 
				  The variance estimates from this empirical Bayes procedure are shown empiricall
			
 
				 y to yield greater statistical power than either the individual feature
			
@@ -893,10 +930,32 @@ y to yield greater statistical power than either the individual feature
 
				 \end_layout
			
 
				 
			
 
				 \begin_layout Standard
			
 
				-On top of this core framework, limma also implements many other enhancements
			
 
				+On top of this core framework,
			
 
				+
			
 
				+\begin_inset Flex Code
			
 
				+status open
			
 
				+
			
 
				+\begin_layout Plain Layout
			
 
				+limma
			
 
				+\end_layout
			
 
				+
			
 
				+\end_inset
			
 
				+
			
 
				+ also implements many other enhancements
			
 
				  that further relax the assumptions of the model and extend the scope of
			
 
				  what kinds of data it can analyze.
			
 
				- Instead of squeezing toward a single common variance value, limma can model
			
 
				+ Instead of squeezing toward a single common variance value,
			
 
				+
			
 
				+\begin_inset Flex Code
			
 
				+status open
			
 
				+
			
 
				+\begin_layout Plain Layout
			
 
				+limma
			
 
				+\end_layout
			
 
				+
			
 
				+\end_inset
			
 
				+
			
 
				+ can model
			
 
				  the common variance as a function of a covariate, such as average expression
			
 
				  
			
 
				 \begin_inset CommandInset citation
			
@@ -911,7 +970,18 @@ literal "false"
 
				  precise expression measurements and therefore smaller variances than low-count
			
 
				  genes.
			
 
				  While linear models typically assume that all samples have equal variance,
			
 
				- limma is able to relax this assumption by identifying and down-weighting
			
 
				+
			
 
				+
			
 
				+\begin_inset Flex Code
			
 
				+status open
			
 
				+
			
 
				+\begin_layout Plain Layout
			
 
				+limma
			
 
				+\end_layout
			
 
				+
			
 
				+\end_inset
			
 
				+
			
 
				+ is able to relax this assumption by identifying and down-weighting
			
 
				  samples that diverge more strongly from the linear model across many features
			
 
				  
			
 
				 \begin_inset CommandInset citation
			
@@ -922,7 +992,18 @@ literal "false"
 
				 \end_inset
			
 
				 
			
 
				 .
			
 
				- In addition, limma is also able to fit simple mixed models incorporating
			
 
				+ In addition,
			
 
				+
			
 
				+\begin_inset Flex Code
			
 
				+status open
			
 
				+
			
 
				+\begin_layout Plain Layout
			
 
				+limma
			
 
				+\end_layout
			
 
				+
			
 
				+\end_inset
			
 
				+
			
 
				+ is also able to fit simple mixed models incorporating
			
 
				  one random effect in addition to the fixed effects represented by an ordinary
			
 
				  linear model 
			
 
				 \begin_inset CommandInset citation
			
@@ -933,21 +1014,87 @@ literal "false"
 
				 \end_inset
			
 
				 
			
 
				 .
			
 
				- Once again, limma shares information between features to obtain a robust
			
 
				+ Once again,
			
 
				+
			
 
				+\begin_inset Flex Code
			
 
				+status open
			
 
				+
			
 
				+\begin_layout Plain Layout
			
 
				+limma
			
 
				+\end_layout
			
 
				+
			
 
				+\end_inset
			
 
				+
			
 
				+ shares information between features to obtain a robust
			
 
				  estimate for the random effect correlation.
			
 
				 \end_layout
			
 
				 
			
 
				 \begin_layout Subsubsection
			
 
				-edgeR provides limma-like analysis features for count data
			
 
				+edgeR provides
			
 
				+
			
 
				+\begin_inset Flex Code
			
 
				+status open
			
 
				+
			
 
				+\begin_layout Plain Layout
			
 
				+limma
			
 
				+\end_layout
			
 
				+
			
 
				+\end_inset
			
 
				+
			
 
				+-like analysis features for count data
			
 
				 \end_layout
			
 
				 
			
 
				 \begin_layout Standard
			
 
				-Although limma can be applied to read counts from RNA-seq data, it is less
			
 
				+Although
			
 
				+
			
 
				+\begin_inset Flex Code
			
 
				+status open
			
 
				+
			
 
				+\begin_layout Plain Layout
			
 
				+limma
			
 
				+\end_layout
			
 
				+
			
 
				+\end_inset
			
 
				+
			
 
				+ can be applied to read counts from RNA-seq data, it is less
			
 
				  suitable for counts from ChIP-seq data, which tend to be much smaller and
			
 
				  therefore violate the assumption of a normal distribution more severely.
			
 
				- For all count-based data, the edgeR package works similarly to limma, but
			
 
				+ For all count-based data, the
			
 
				+
			
 
				+\begin_inset Flex Code
			
 
				+status open
			
 
				+
			
 
				+\begin_layout Plain Layout
			
 
				+edgeR
			
 
				+\end_layout
			
 
				+
			
 
				+\end_inset
			
 
				+
			
 
				+ package works similarly to
			
 
				+
			
 
				+\begin_inset Flex Code
			
 
				+status open
			
 
				+
			
 
				+\begin_layout Plain Layout
			
 
				+limma
			
 
				+\end_layout
			
 
				+
			
 
				+\end_inset
			
 
				+
			
 
				+, but
			
 
				  uses a generalized linear model instead of a linear model.
			
 
				- The most important difference is that the GLM in edgeR models the counts
			
 
				+ The most important difference is that the GLM in
			
 
				+
			
 
				+\begin_inset Flex Code
			
 
				+status open
			
 
				+
			
 
				+\begin_layout Plain Layout
			
 
				+edgeR
			
 
				+\end_layout
			
 
				+
			
 
				+\end_inset
			
 
				+
			
 
				+ models the counts
			
 
				  directly using a negative binomial distribution rather than modeling the
			
 
				  normalized log counts using a normal distribution 
			
 
				 \begin_inset CommandInset citation
			
@@ -979,12 +1126,34 @@ noise
 
				  The choice of a gamma distribution is arbitrary and motivated by mathematical
			
 
				  convenience, since a gamma-Poisson mixture yields the numerically tractable
			
 
				  negative binomial distribution.
			
 
				- Thus, edgeR assumes 
			
 
				+ Thus,
			
 
				+
			
 
				+\begin_inset Flex Code
			
 
				+status open
			
 
				+
			
 
				+\begin_layout Plain Layout
			
 
				+edgeR
			
 
				+\end_layout
			
 
				+
			
 
				+\end_inset
			
 
				+
			
 
				+ assumes 
			
 
				 \emph on
			
 
				 a priori 
			
 
				 \emph default
			
 
				 that the variation in abundances between replicates follows a gamma distribution.
			
 
				- For differential abundance testing, edgeR offers a likelihood ratio test,
			
 
				+ For differential abundance testing,
			
 
				+
			
 
				+\begin_inset Flex Code
			
 
				+status open
			
 
				+
			
 
				+\begin_layout Plain Layout
			
 
				+edgeR
			
 
				+\end_layout
			
 
				+
			
 
				+\end_inset
			
 
				+
			
 
				+ offers a likelihood ratio test,
			
 
				  but more recently recommends a quasi-likelihood test that properly factors
			
 
				  the uncertainty in variance estimation into the statistical significance
			
 
				  for each feature 
			
@@ -1268,7 +1437,18 @@ In addition to well-understood effects that can be easily normalized out,
 
				  However, as with variance estimation, estimating the differences in batch
			
 
				  means is not necessarily robust at the feature level, so the ComBat method
			
 
				  adds empirical Bayes squeezing of the batch mean differences toward a common
			
 
				- value, analogous to limma's empirical Bayes squeezing of feature variance
			
 
				+ value, analogous to
			
 
				+
			
 
				+\begin_inset Flex Code
			
 
				+status open
			
 
				+
			
 
				+\begin_layout Plain Layout
			
 
				+limma
			
 
				+\end_layout
			
 
				+
			
 
				+\end_inset
			
 
				+
			
 
				+'s empirical Bayes squeezing of feature variance
			
 
				  estimates 
			
 
				 \begin_inset CommandInset citation
			
 
				 LatexCommand cite
			
@@ -2155,7 +2335,18 @@ However, removing the systematic component of the batch effect still leaves
 
				  the noise component.
			
 
				  The gene quantifications from the first batch are substantially noisier
			
 
				  than those in the second batch.
			
 
				- This analysis corrected for this by using limma's sample weighting method
			
 
				+ This analysis corrected for this by using
			
 
				+
			
 
				+\begin_inset Flex Code
			
 
				+status open
			
 
				+
			
 
				+\begin_layout Plain Layout
			
 
				+limma
			
 
				+\end_layout
			
 
				+
			
 
				+\end_inset
			
 
				+
			
 
				+'s sample weighting method
			
 
				  to assign lower weights to the noisy samples of batch 1 
			
 
				 \begin_inset CommandInset citation
			
 
				 LatexCommand cite
			
@@ -2200,8 +2391,30 @@ literal "false"
 
				 
			
 
				 , and batch-corrected at this point using ComBat.
			
 
				  A linear model was fit to the batch-corrected, quality-weighted data for
			
 
				- each gene using limma, and each gene was tested for differential expression
			
 
				- using limma's empirical Bayes moderated 
			
 
				+ each gene using
			
 
				+
			
 
				+\begin_inset Flex Code
			
 
				+status open
			
 
				+
			
 
				+\begin_layout Plain Layout
			
 
				+limma
			
 
				+\end_layout
			
 
				+
			
 
				+\end_inset
			
 
				+
			
 
				+, and each gene was tested for differential expression
			
 
				+ using
			
 
				+
			
 
				+\begin_inset Flex Code
			
 
				+status open
			
 
				+
			
 
				+\begin_layout Plain Layout
			
 
				+limma
			
 
				+\end_layout
			
 
				+
			
 
				+\end_inset
			
 
				+
			
 
				+'s empirical Bayes moderated 
			
 
				 \begin_inset Formula $t$
			
 
				 \end_inset
			
 
				 
			
@@ -2869,7 +3082,18 @@ PCoA plots of ChIP-seq sliding window data, before and after subtracting
 
				 \begin_layout Standard
			
 
				 Reads in promoters, peaks, and sliding windows across the genome were counted
			
 
				  and normalized using csaw and analyzed for differential modification using
			
 
				- edgeR 
			
 
				+
			
 
				+
			
 
				+\begin_inset Flex Code
			
 
				+status open
			
 
				+
			
 
				+\begin_layout Plain Layout
			
 
				+edgeR
			
 
				+\end_layout
			
 
				+
			
 
				+\end_inset
			
 
				+
			
 
				+ 
			
 
				 \begin_inset CommandInset citation
			
 
				 LatexCommand cite
			
 
				 key "Lun2014,Lun2015a,Lund2012,Phipson2016"
			
@@ -13078,7 +13302,18 @@ literal "false"
 
				 
			
 
				 .
			
 
				  Log2 counts per million values (logCPM) were calculated using the cpm function
			
 
				- in edgeR for individual samples and aveLogCPM function for averages across
			
 
				+ in
			
 
				+
			
 
				+\begin_inset Flex Code
			
 
				+status open
			
 
				+
			
 
				+\begin_layout Plain Layout
			
 
				+edgeR
			
 
				+\end_layout
			
 
				+
			
 
				+\end_inset
			
 
				+
			
 
				+ for individual samples and the aveLogCPM function for averages across
			
 
				  groups of samples, using those functions’ default prior count values to
			
 
				  avoid taking the logarithm of 0.
			
 
				  Genes were considered “present” if their average normalized logCPM values
			
@@ -13129,7 +13364,18 @@ Differential Expression Analysis
 
				 \end_layout
			
 
				 
			
 
				 \begin_layout Standard
			
 
				-All tests for differential gene expression were performed using edgeR, by
			
 
				+All tests for differential gene expression were performed using
			
 
				+
			
 
				+\begin_inset Flex Code
			
 
				+status open
			
 
				+
			
 
				+\begin_layout Plain Layout
			
 
				+edgeR
			
 
				+\end_layout
			
 
				+
			
 
				+\end_inset
			
 
				+
			
 
				+, by
			
 
				  first fitting a negative binomial generalized linear model to the counts
			
 
				  and normalization factors and then performing a quasi-likelihood F-test
			
 
				  with robust estimation of outlier gene dispersions 
			
@@ -14311,10 +14557,32 @@ noprefix "false"
 
				 
			
 
				 , and genes with an average logCPM below -1 were filtered out.
			
 
				  Each remaining gene was tested for differential abundance with respect
			
 
				- to globin blocking (GB) using edgeR’s quasi-likelihood F-test, fitting
			
 
				+ to globin blocking (GB) using
			
 
				+
			
 
				+\begin_inset Flex Code
			
 
				+status open
			
 
				+
			
 
				+\begin_layout Plain Layout
			
 
				+edgeR
			
 
				+\end_layout
			
 
				+
			
 
				+\end_inset
			
 
				+
			
 
				+’s quasi-likelihood F-test, fitting
			
 
				  a negative binomial generalized linear model to the table of read counts in
			
 
				  each library.
			
 
				- For each gene, edgeR reported average abundance (logCPM), 
			
 
				+ For each gene,
			
 
				+
			
 
				+\begin_inset Flex Code
			
 
				+status open
			
 
				+
			
 
				+\begin_layout Plain Layout
			
 
				+edgeR
			
 
				+\end_layout
			
 
				+
			
 
				+\end_inset
			
 
				+
			
 
				+ reported average abundance (logCPM), 
			
 
				 \begin_inset Formula $\log_{2}$
			
 
				 \end_inset
			
 
				 
			
@@ -14439,7 +14707,18 @@ Comparison of inter-sample gene abundance correlations with and without
 
				  All libraries were normalized together as described in Figure 2, and genes
			
 
				  with an average abundance (logCPM, log2 counts per million reads counted)
			
 
				  less than -1 were filtered out.
			
 
				- Each gene’s logCPM was computed in each library using the edgeR cpm function.
			
 
				+ Each gene’s logCPM was computed in each library using the
			
 
				+
			
 
				+\begin_inset Flex Code
			
 
				+status open
			
 
				+
			
 
				+\begin_layout Plain Layout
			
 
				+edgeR
			
 
				+\end_layout
			
 
				+
			
 
				+\end_inset
			
 
				+
			
 
				+ cpm function.
			
 
				  For each pair of biological samples, the Pearson correlation between those
			
 
				  samples' GB libraries was plotted against the correlation between the same
			
 
				  samples’ non-GB libraries.
			
@@ -14487,7 +14766,18 @@ ons than the non-GB libraries.
 
				  sign-rank test: V = 2195, P ≪ 2.2e-16).
			
 
				  Performing the same tests on the Spearman correlations gave the same conclusion
			
 
				  (t-test: t = 26.8, df = 665, P ≪ 2.2e-16; sign-rank test: V = 8781, P ≪ 2.2e-16).
			
 
				- The edgeR package was used to compute the overall biological coefficient
			
 
				+ The
			
 
				+
			
 
				+\begin_inset Flex Code
			
 
				+status open
			
 
				+
			
 
				+\begin_layout Plain Layout
			
 
				+edgeR
			
 
				+\end_layout
			
 
				+
			
 
				+\end_inset
			
 
				+
			
 
				+ package was used to compute the overall biological coefficient
			
 
				  of variation (BCV) for GB and non-GB libraries, which showed that globin blocking
			
 
				  resulted in a negligible increase in the BCV (0.417 with GB vs.
			
 
				  0.400 without).
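
Reviewer note: the BCV figures quoted above can be tied back to the negative binomial model described in the edgeR hunk with a short calculation. The sketch below assumes the usual negative binomial parametrization, in which the variance of a count with mean `mu` is `mu + dispersion * mu**2` and the BCV is the square root of the dispersion. The function name `nb_variance` and the example mean of 100 reads are hypothetical and chosen only for illustration; this is not code from the thesis or from edgeR.

```python
def nb_variance(mean, bcv):
    """Variance of a negative binomial count with the given mean and BCV.

    Assumes variance = mean + dispersion * mean**2, where the dispersion
    is the square of the biological coefficient of variation (BCV).
    """
    dispersion = bcv ** 2
    return mean + dispersion * mean ** 2


# BCV values as reported in the text; the mean of 100 is an arbitrary example.
for label, bcv in [("GB", 0.417), ("non-GB", 0.400)]:
    print(label, round(nb_variance(100.0, bcv), 2))
```

The first term is the Poisson counting noise and the second is the extra biological variation. At higher means the second term dominates, so a small difference in BCV translates into a correspondingly small relative difference in variance.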