Minor text edits, and format some package names as code

Ryan C. Thompson 5 years ago
Parent
Current commit
bb70ac8b47
1 file changed, with 332 insertions and 42 deletions
  1. 332 42
      thesis.lyx

+ 332 - 42
thesis.lyx

@@ -787,22 +787,17 @@ The studies presented in this work all involve the analysis of high-throughput
  they work.
 \end_layout
 
-\begin_layout Standard
-\begin_inset Flex TODO Note (inline)
+\begin_layout Subsubsection
+\begin_inset Flex Code
 status open
 
 \begin_layout Plain Layout
-Many of these points may also be addressed in the approach/methods sections
- of the following chapters? Redundant?
+Limma
 \end_layout
 
 \end_inset
 
-
-\end_layout
-
-\begin_layout Subsubsection
-Limma: The standard linear modeling framework for genomics
+: The standard linear modeling framework for genomics
 \end_layout
 
 \begin_layout Standard
@@ -820,7 +815,7 @@ literal "false"
 
 .
  In a typical linear model, there is one dependent variable observation
- per sample.
+ per sample and a large number of samples.
  For example, in a linear model of height as a function of age and sex,
  there is one height measurement per person.
  However, when analyzing genomic data, each sample consists of observations
@@ -833,18 +828,38 @@ literal "false"
  independently to each feature.
  However, this is undesirable for most genomics data sets.
  Genomics assays like high-throughput sequencing are expensive, and often
- generating the samples is also quite expensive and time-consuming.
+ the process of generating the samples is also quite expensive and time-consumin
+g.
  This expense limits the sample sizes typically employed in genomics experiments
-, and as a result the statistical power of each individual feature's linear
- model is likewise limited.
+, and as a result the statistical power of the linear model for each individual
+ feature is likewise limited.
  However, because thousands of features from the same samples are analyzed
  together, there is an opportunity to improve the statistical power of the
  analysis by exploiting shared patterns of variation across features.
- This is the core feature of limma, a linear modeling framework designed
- for genomic data.
- Limma is typically used to analyze expression microarray data, and more
- recently RNA-seq data, but it can also be used to analyze any other data
- for which linear modeling is appropriate.
+ This is the core feature of 
+\begin_inset Flex Code
+status open
+
+\begin_layout Plain Layout
+limma
+\end_layout
+
+\end_inset
+
+, a linear modeling framework designed for genomic data.
+ 
+\begin_inset Flex Code
+status open
+
+\begin_layout Plain Layout
+Limma
+\end_layout
+
+\end_inset
+
+ is typically used to analyze expression microarray data, and more recently
+ RNA-seq data, but it can also be used to analyze any other data for which
+ linear modeling is appropriate.
 \end_layout
 
 \begin_layout Standard
@@ -858,7 +873,18 @@ The central challenge when fitting a linear model is to estimate the variance
  variance estimates.
  However, this would require the assumption that every feature is equally
  variable, which is known to be false for most genomic data sets.
- Limma offers a compromise between these two extremes by using a method
+
+
+\begin_inset Flex Code
+status open
+
+\begin_layout Plain Layout
+limma
+\end_layout
+
+\end_inset
+
+ offers a compromise between these two extremes by using a method
  called empirical Bayes moderation to 
 \begin_inset Quotes eld
 \end_inset
@@ -885,7 +911,18 @@ on of the two yields a variance estimate for each feature with greater precision
  toward the common value introduces some bias – the variance will be underestima
 ted for features with high variance and overestimated for features with
  low variance.
- Essentially, limma assumes that extreme variances are less common than
+ Essentially,
+
+\begin_inset Flex Code
+status open
+
+\begin_layout Plain Layout
+limma
+\end_layout
+
+\end_inset
+
+ assumes that extreme variances are less common than
  variances close to the common value.
  The variance estimates from this empirical Bayes procedure are shown empiricall
 y to yield greater statistical power than either the individual feature
@@ -893,10 +930,32 @@ y to yield greater statistical power than either the individual feature
 \end_layout
 
 \begin_layout Standard
-On top of this core framework, limma also implements many other enhancements
+On top of this core framework,
+
+\begin_inset Flex Code
+status open
+
+\begin_layout Plain Layout
+limma
+\end_layout
+
+\end_inset
+
+ also implements many other enhancements
  that, further relax the assumptions of the model and extend the scope of
  what kinds of data it can analyze.
- Instead of squeezing toward a single common variance value, limma can model
+ Instead of squeezing toward a single common variance value,
+
+\begin_inset Flex Code
+status open
+
+\begin_layout Plain Layout
+limma
+\end_layout
+
+\end_inset
+
+ can model
  the common variance as a function of a covariate, such as average expression
  
 \begin_inset CommandInset citation
@@ -911,7 +970,18 @@ literal "false"
  precise expression measurements and therefore smaller variances than low-count
  genes.
  While linear models typically assume that all samples have equal variance,
- limma is able to relax this assumption by identifying and down-weighting
+
+
+\begin_inset Flex Code
+status open
+
+\begin_layout Plain Layout
+limma
+\end_layout
+
+\end_inset
+
+ is able to relax this assumption by identifying and down-weighting
  samples the diverge more strongly from the linear model across many features
  
 \begin_inset CommandInset citation
@@ -922,7 +992,18 @@ literal "false"
 \end_inset
 
 .
- In addition, limma is also able to fit simple mixed models incorporating
+ In addition,
+
+\begin_inset Flex Code
+status open
+
+\begin_layout Plain Layout
+limma
+\end_layout
+
+\end_inset
+
+ is also able to fit simple mixed models incorporating
  one random effect in addition to the fixed effects represented by an ordinary
  linear model 
 \begin_inset CommandInset citation
@@ -933,21 +1014,87 @@ literal "false"
 \end_inset
 
 .
- Once again, limma shares information between features to obtain a robust
+ Once again,
+
+\begin_inset Flex Code
+status open
+
+\begin_layout Plain Layout
+limma
+\end_layout
+
+\end_inset
+
+ shares information between features to obtain a robust
  estimate for the random effect correlation.
 \end_layout
 
 \begin_layout Subsubsection
-edgeR provides limma-like analysis features for count data
+edgeR provides
+
+\begin_inset Flex Code
+status open
+
+\begin_layout Plain Layout
+limma
+\end_layout
+
+\end_inset
+
+-like analysis features for count data
 \end_layout
 
 \begin_layout Standard
-Although limma can be applied to read counts from RNA-seq data, it is less
+Although
+
+\begin_inset Flex Code
+status open
+
+\begin_layout Plain Layout
+limma
+\end_layout
+
+\end_inset
+
+ can be applied to read counts from RNA-seq data, it is less
  suitable for counts from ChIP-seq data, which tend to be much smaller and
  therefore violate the assumption of a normal distribution more severely.
- For all count-based data, the edgeR package works similarly to limma, but
+ For all count-based data, the
+
+\begin_inset Flex Code
+status open
+
+\begin_layout Plain Layout
+edgeR
+\end_layout
+
+\end_inset
+
+ package works similarly to
+
+\begin_inset Flex Code
+status open
+
+\begin_layout Plain Layout
+limma
+\end_layout
+
+\end_inset
+
+, but
  uses a generalized linear model instead of a linear model.
- The most important difference is that the GLM in edgeR models the counts
+ The most important difference is that the GLM in
+
+\begin_inset Flex Code
+status open
+
+\begin_layout Plain Layout
+edgeR
+\end_layout
+
+\end_inset
+
+ models the counts
  directly using a negative binomial distribution rather than modeling the
  normalized log counts using a normal distribution 
 \begin_inset CommandInset citation
@@ -979,12 +1126,34 @@ noise
  The choice of a gamma distribution is arbitrary and motivated by mathematical
  convenience, since a gamma-Poisson mixture yields the numerically tractable
  negative binomial distribution.
- Thus, edgeR assumes 
+ Thus,
+
+\begin_inset Flex Code
+status open
+
+\begin_layout Plain Layout
+edgeR
+\end_layout
+
+\end_inset
+
+ assumes 
 \emph on
 a prioi 
 \emph default
 that the variation in abundances between replicates follows a gamma distribution.
- For differential abundance testing, edgeR offers a likelihood ratio test,
+ For differential abundance testing,
+
+\begin_inset Flex Code
+status open
+
+\begin_layout Plain Layout
+edgeR
+\end_layout
+
+\end_inset
+
+ offers a likelihood ratio test,
  but more recently recommends a quasi-likelihood test that properly factors
  the uncertainty in variance estimation into the statistical significance
  for each feature 
@@ -1268,7 +1437,18 @@ In addition to well-understood effects that can be easily normalized out,
  However, as with variance estimation, estimating the differences in batch
  means is not necessarily robust at the feature level, so the ComBat method
  adds empirical Bayes squeezing of the batch mean differences toward a common
- value, analogous to limma's empirical Bayes squeezing of feature variance
+ value, analogous to
+
+\begin_inset Flex Code
+status open
+
+\begin_layout Plain Layout
+limma
+\end_layout
+
+\end_inset
+
+'s empirical Bayes squeezing of feature variance
  estimates 
 \begin_inset CommandInset citation
 LatexCommand cite
@@ -2155,7 +2335,18 @@ However, removing the systematic component of the batch effect still leaves
  the noise component.
  The gene quantifications from the first batch are substantially noisier
  than those in the second batch.
- This analysis corrected for this by using limma's sample weighting method
+ This analysis corrected for this by using
+
+\begin_inset Flex Code
+status open
+
+\begin_layout Plain Layout
+limma
+\end_layout
+
+\end_inset
+
+'s sample weighting method
  to assign lower weights to the noisy samples of batch 1 
 \begin_inset CommandInset citation
 LatexCommand cite
@@ -2200,8 +2391,30 @@ literal "false"
 
 , and batch-corrected at this point using ComBat.
  A linear model was fit to the batch-corrected, quality-weighted data for
- each gene using limma, and each gene was tested for differential expression
- using limma's empirical Bayes moderated 
+ each gene using
+
+\begin_inset Flex Code
+status open
+
+\begin_layout Plain Layout
+limma
+\end_layout
+
+\end_inset
+
+, and each gene was tested for differential expression
+ using
+
+\begin_inset Flex Code
+status open
+
+\begin_layout Plain Layout
+limma
+\end_layout
+
+\end_inset
+
+'s empirical Bayes moderated 
 \begin_inset Formula $t$
 \end_inset
 
@@ -2869,7 +3082,18 @@ PCoA plots of ChIP-seq sliding window data, before and after subtracting
 \begin_layout Standard
 Reads in promoters, peaks, and sliding windows across the genome were counted
  and normalized using csaw and analyzed for differential modification using
- edgeR 
+
+
+\begin_inset Flex Code
+status open
+
+\begin_layout Plain Layout
+edgeR
+\end_layout
+
+\end_inset
+
+ 
 \begin_inset CommandInset citation
 LatexCommand cite
 key "Lun2014,Lun2015a,Lund2012,Phipson2016"
@@ -13078,7 +13302,18 @@ literal "false"
 
 .
  Log2 counts per million values (logCPM) were calculated using the cpm function
- in edgeR for individual samples and aveLogCPM function for averages across
+ in
+
+\begin_inset Flex Code
+status open
+
+\begin_layout Plain Layout
+edgeR
+\end_layout
+
+\end_inset
+
+ for individual samples and aveLogCPM function for averages across
  groups of samples, using those functions’ default prior count values to
  avoid taking the logarithm of 0.
  Genes were considered “present” if their average normalized logCPM values
@@ -13129,7 +13364,18 @@ Differential Expression Analysis
 \end_layout
 
 \begin_layout Standard
-All tests for differential gene expression were performed using edgeR, by
+All tests for differential gene expression were performed using
+
+\begin_inset Flex Code
+status open
+
+\begin_layout Plain Layout
+edgeR
+\end_layout
+
+\end_inset
+
+, by
  first fitting a negative binomial generalized linear model to the counts
  and normalization factors and then performing a quasi-likelihood F-test
  with robust estimation of outlier gene dispersions 
@@ -14311,10 +14557,32 @@ noprefix "false"
 
 , and genes with an average logCPM below -1 were filtered out.
  Each remaining gene was tested for differential abundance with respect
- to globin blocking (GB) using edgeR’s quasi-likelihood F-test, fitting
+ to globin blocking (GB) using
+
+\begin_inset Flex Code
+status open
+
+\begin_layout Plain Layout
+edgeR
+\end_layout
+
+\end_inset
+
+’s quasi-likelihood F-test, fitting
  a negative binomial generalized linear model to table of read counts in
  each library.
- For each gene, edgeR reported average abundance (logCPM), 
+ For each gene,
+
+\begin_inset Flex Code
+status open
+
+\begin_layout Plain Layout
+edgeR
+\end_layout
+
+\end_inset
+
+ reported average abundance (logCPM), 
 \begin_inset Formula $\log_{2}$
 \end_inset
 
@@ -14439,7 +14707,18 @@ Comparison of inter-sample gene abundance correlations with and without
  All libraries were normalized together as described in Figure 2, and genes
  with an average abundance (logCPM, log2 counts per million reads counted)
  less than -1 were filtered out.
- Each gene’s logCPM was computed in each library using the edgeR cpm function.
+ Each gene’s logCPM was computed in each library using the
+
+\begin_inset Flex Code
+status open
+
+\begin_layout Plain Layout
+edgeR
+\end_layout
+
+\end_inset
+
+ cpm function.
  For each pair of biological samples, the Pearson correlation between those
  samples' GB libraries was plotted against the correlation between the same
  samples’ non-GB libraries.
@@ -14487,7 +14766,18 @@ ons than the non-GB libraries.
  sign-rank test: V = 2195, P ≪ 2.2e-16).
  Performing the same tests on the Spearman correlations gave the same conclusion
  (t-test: t = 26.8, df = 665, P ≪ 2.2e-16; sign-rank test: V = 8781, P ≪ 2.2e-16).
- The edgeR package was used to compute the overall biological coefficient
+ The
+
+\begin_inset Flex Code
+status open
+
+\begin_layout Plain Layout
+edgeR
+\end_layout
+
+\end_inset
+
+ package was used to compute the overall biological coefficient
  of variation (BCV) for GB and non-GB libraries, and found that globin blocking
  resulted in a negligible increase in the BCV (0.417 with GB vs.
  0.400 without).