
Progress on methods overview in intro

Ryan C. Thompson 5 years ago
parent
commit
40551054ec
1 changed file with 213 additions and 28 deletions
  1. 213 28
      thesis.lyx

+ 213 - 28
thesis.lyx

@@ -662,8 +662,13 @@ Overview of bioinformatic analysis methods
 \end_layout
 
 \begin_layout Standard
-An overview of all the methods used, including what problem they solve,
- what assumptions they make, and a basic description of how they work.
+The studies presented in this work all involve the analysis of high-throughput
+ genomic and epigenomic data.
+ These data present many unique analysis challenges, and a wide array of
+ software tools are available to analyze them.
+ This section presents an overview of the methods used, including what problems
+ they solve, what assumptions they make, and a basic description of how
+ they work.
 \end_layout
 
 \begin_layout Standard
@@ -671,8 +676,8 @@ An overview of all the methods used, including what problem they solve,
 status open
 
 \begin_layout Plain Layout
-Many of these points are also addressed in the approach sections of the
- following chapters? Redundant?
+Many of these points may also be addressed in the approach/methods sections
+ of the following chapters? Redundant?
 \end_layout
 
 \end_inset
@@ -680,6 +685,193 @@ Many of these points are also addressed in the approach sections of the
 
 \end_layout
 
+\begin_layout Subsubsection
+Limma: The standard linear modeling framework for genomics
+\end_layout
+
+\begin_layout Standard
+Linear models are a generalization of the 
+\begin_inset Formula $t$
+\end_inset
+
+-test and ANOVA to arbitrarily complex experimental designs.
+ In a typical linear model, there is one dependent variable observation
+ per sample.
+ For example, in a linear model of height as a function of age and sex,
+ there is one height measurement per person.
+ However, when analyzing genomic data, each sample consists of observations
+ of thousands of dependent variables.
+ For example, in an RNA-seq experiment, the dependent variables may be the
+ count of RNA-seq reads for each annotated gene.
+ In abstract terms, each dependent variable being measured is referred to
+ as a feature.
+ The simplest approach to analyzing such data would be to fit the same model
+ independently to each feature.
+ However, this is undesirable for most genomics data sets.
+ Genomics assays like high-throughput sequencing are expensive, and often
+ generating the samples is also quite expensive and time-consuming.
+ This expense limits the sample sizes typically employed in genomics experiments
+, and as a result the statistical power of each individual feature's linear
+ model is likewise limited.
+ However, because thousands of features from the same samples are analyzed
+ together, there is an opportunity to improve the statistical power of the
+ analysis by exploiting shared patterns of variation across features.
+ This is the core feature of limma, a linear modeling framework designed
+ for genomic data.
+ Limma is typically used to analyze expression microarray data, and more
+ recently RNA-seq data, but it can also be used to analyze any other data
+ for which linear modeling is appropriate.
+\end_layout
+
+\begin_layout Standard
+The central challenge when fitting a linear model is to estimate the variance
+ of the data accurately.
+ This quantity is the most difficult to estimate when sample sizes are small.
+ A single shared variance could be estimated for all of the features together,
+ and this estimate would be very stable, in contrast to the individual feature
+ variance estimates.
+ However, this would require the assumption that every feature is equally
+ variable, which is known to be false for most genomic data sets.
+ Limma offers a compromise between these two extremes by using a method
+ called empirical Bayes moderation to 
+\begin_inset Quotes eld
+\end_inset
+
+squeeze
+\begin_inset Quotes erd
+\end_inset
+
+ the distribution of estimated variances toward a single common value that
+ represents the variance of an average feature in the data 
+\begin_inset CommandInset citation
+LatexCommand cite
+key "Smyth2004"
+literal "false"
+
+\end_inset
+
+.
+ While the individual feature variance estimates are not stable, the common
+ variance estimate for the entire data set is quite stable, so using a
+ combination of the two yields a variance estimate for each feature with
+ greater precision than the individual feature variances.
+ The trade-off for this improvement is that squeezing each estimated variance
+ toward the common value introduces some bias – the variance will be underestima
+ted for features with high variance and overestimated for features with
+ low variance.
+ Essentially, limma assumes that extreme variances are less common than
+ variances close to the common value.
+ The variance estimates from this empirical Bayes procedure are shown empiricall
+y to yield greater statistical power than either the individual feature
+ variances or the single common value.
+\end_layout
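The squeezing described in this paragraph can be illustrated with a minimal sketch. It assumes the moderated variance is a weighted average of the per-feature variance and the common value, weighted by their degrees of freedom (the form given in Smyth 2004). The function name `moderate_variances`, the simulated data, and the plugged-in prior values are hypothetical; limma itself estimates the prior variance and prior degrees of freedom from the data.

```python
import random
import statistics


def moderate_variances(sample_vars, df_residual, prior_var, df_prior):
    """Squeeze each per-feature variance toward a common prior value.

    Each result is a weighted average of the feature's own variance and
    prior_var, weighted by their respective degrees of freedom.
    """
    return [
        (df_prior * prior_var + df_residual * s2) / (df_prior + df_residual)
        for s2 in sample_vars
    ]


random.seed(0)
n_samples = 4      # small sample size, as is typical in genomics experiments
n_features = 1000

# Simulated features whose true variances differ (illustrative values only).
true_vars = [random.choice([0.5, 1.0, 2.0]) for _ in range(n_features)]
sample_vars = [
    statistics.variance([random.gauss(0.0, tv ** 0.5) for _ in range(n_samples)])
    for tv in true_vars
]

# ASSUMPTION: the prior is plugged in by hand here for illustration only;
# limma estimates both quantities from the observed variances.
prior_var = statistics.mean(sample_vars)
df_prior = 4.0

moderated = moderate_variances(sample_vars, n_samples - 1, prior_var, df_prior)

# The moderated variances are less spread out than the raw estimates.
spread_raw = statistics.pstdev(sample_vars)
spread_mod = statistics.pstdev(moderated)
print(f"spread of raw variances:       {spread_raw:.3f}")
print(f"spread of moderated variances: {spread_mod:.3f}")
```

A larger `df_prior` pulls the estimates more strongly toward the common value, which corresponds to a data set in which the features' true variances are more similar to one another.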
+
+\begin_layout Standard
+On top of this core framework, limma also implements many other enhancements
+ that further relax the assumptions of the model and extend the scope of
+ what kinds of data it can analyze.
+ Instead of squeezing toward a single common variance value, limma can model
+ the common variance as a function of a covariate, such as average expression
+ 
+\begin_inset CommandInset citation
+LatexCommand cite
+key "Law2013"
+literal "false"
+
+\end_inset
+
+.
+ This is essential for RNA-seq data, where higher gene counts yield more
+ precise expression measurements and therefore smaller variances than low-count
+ genes.
+ While linear models typically assume that all samples have equal variance,
+ limma is able to relax this assumption by identifying and down-weighting
+ samples that diverge more strongly from the linear model across many features
+ 
+\begin_inset CommandInset citation
+LatexCommand cite
+key "Ritchie2006,Liu2015"
+literal "false"
+
+\end_inset
+
+.
+ In addition, limma is also able to fit simple mixed models incorporating
+ one random effect in addition to the fixed effects represented by an ordinary
+ linear model 
+\begin_inset CommandInset citation
+LatexCommand cite
+key "Smyth2005a"
+literal "false"
+
+\end_inset
+
+.
+ Once again, limma shares information between features to obtain a robust
+ estimate for the random effect correlation.
+\end_layout
+
+\begin_layout Subsubsection
+edgeR provides limma-like analysis features for count data
+\end_layout
+
+\begin_layout Standard
+Although limma can be applied to read counts from RNA-seq data, it is less
+ suitable for counts from ChIP-seq data, which tend to be much smaller and
+ therefore violate the assumption of a normal distribution more severely.
+ For all count-based data, the edgeR package works similarly to limma, but
+ uses a generalized linear model instead of a linear model.
+ The most important difference is that the GLM in edgeR models the counts
+ directly using a negative binomial distribution rather than modeling the
+ normalized log counts using a normal distribution 
+\begin_inset CommandInset citation
+LatexCommand cite
+key "Chen2014,McCarthy2012,Robinson2010a"
+literal "false"
+
+\end_inset
+
+.
+ The negative binomial is a good fit for count data because it can be derived
+ as a gamma-distributed mixture of Poisson distributions.
+ The Poisson distribution accurately represents the distribution of counts
+ expected for a given gene abundance, and the gamma distribution is then
+ used to represent the variation in gene abundance between biological replicates.
+ For this reason, the square root of the dispersion parameter of the negative
+ binomial is sometimes referred to as the biological coefficient of variation,
+ since it represents the variability that was present in the samples prior
+ to the Poisson 
+\begin_inset Quotes eld
+\end_inset
+
+noise
+\begin_inset Quotes erd
+\end_inset
+
+ that was generated by the random sampling of reads in proportion to feature
+ abundances.
+ The choice of a gamma distribution is arbitrary and motivated by mathematical
+ convenience, since a gamma-Poisson mixture yields the numerically tractable
+ negative binomial distribution.
+ Thus, edgeR assumes 
+\emph on
+a priori 
+\emph default
+that the variation in abundances between replicates follows a gamma distribution.
+ For differential abundance testing, edgeR offers a likelihood ratio test,
+ but more recently recommends a quasi-likelihood test that properly factors
+ the uncertainty in variance estimation into the statistical significance
+ for each feature 
+\begin_inset CommandInset citation
+LatexCommand cite
+key "Lund2012"
+literal "false"
+
+\end_inset
+
+.
+\end_layout
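The gamma-Poisson construction described in this paragraph can be checked with a small simulation. This is a sketch under stated assumptions, not edgeR's implementation: the helper names `sample_poisson` and `sample_gamma_poisson` and the chosen mean and dispersion are hypothetical. It draws a gamma-distributed abundance for each replicate, then a Poisson count given that abundance, and compares the observed variance with the negative binomial variance, mean + dispersion × mean².

```python
import math
import random
import statistics


def sample_poisson(lam):
    """Draw one Poisson count using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, 1.0
    while True:
        product *= random.random()
        if product <= threshold:
            return count
        count += 1


def sample_gamma_poisson(mean, dispersion):
    """Draw one count from a gamma-distributed mixture of Poissons."""
    # Gamma with the requested mean and squared coefficient of variation
    # equal to the dispersion: the between-replicate biological variation.
    shape = 1.0 / dispersion
    scale = mean * dispersion
    abundance = random.gammavariate(shape, scale)
    # Poisson "noise" from random sampling of reads at that abundance.
    return sample_poisson(abundance)


random.seed(1)
mean, dispersion = 10.0, 0.25   # illustrative values only
counts = [sample_gamma_poisson(mean, dispersion) for _ in range(20000)]

mean_count = statistics.mean(counts)
var_count = statistics.variance(counts)
expected_var = mean + dispersion * mean ** 2   # negative binomial variance
bcv = math.sqrt(dispersion)                    # biological coefficient of variation

print(f"observed mean {mean_count:.2f}, observed variance {var_count:.2f}")
print(f"expected variance {expected_var:.2f}, BCV {bcv:.2f}")
```

The observed variance exceeds the mean, which a pure Poisson model cannot accommodate; the excess is the contribution of the gamma-distributed abundances.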
+
 \begin_layout Subsubsection
 ChIP-seq Peak calling
 \end_layout
@@ -726,27 +918,6 @@ ChIP-seq: complex with many considerations, dependent on experimental methods,
  biological system, and analysis goals
 \end_layout
 
-\begin_layout Subsubsection
-Limma: The standard linear modeling framework for genomics
-\end_layout
-
-\begin_layout Itemize
-empirical Bayes variance modeling: limma's core feature
-\end_layout
-
-\begin_layout Itemize
-edgeR & DESeq2: Extend with negative bonomial GLM for RNA-seq and other
- count data
-\end_layout
-
-\begin_layout Itemize
-voom: Extend with precision weights to model mean-variance trend
-\end_layout
-
-\begin_layout Itemize
-arrayWeights and duplicateCorrelation to handle complex variance structures
-\end_layout
-
 \begin_layout Subsubsection
 sva and ComBat for batch correction
 \end_layout
@@ -764,6 +935,20 @@ Batch-corrected PCA is informative, but careful application is required
 Innovation
 \end_layout
 
+\begin_layout Standard
+\begin_inset Flex TODO Note (inline)
+status open
+
+\begin_layout Plain Layout
+Is this entire section redundant with the Approach sections of each chapter?
+ I'm not really sure what to write here.
+\end_layout
+
+\end_inset
+
+
+\end_layout
+
 \begin_layout Subsection
 MSC infusion to improve transplant outcomes (prevent/delay rejection)
 \end_layout
@@ -13613,9 +13798,9 @@ noprefix "false"
 
 , and genes with an average logCPM below -1 were filtered out.
  Each remaining gene was tested for differential abundance with respect
- to globin blocking (GB) using edgeR’s quasi-likelihod F-test, fitting a
- negative binomial generalized linear model to table of read counts in each
- library.
+ to globin blocking (GB) using edgeR’s quasi-likelihood F-test, fitting
+ a negative binomial generalized linear model to the table of read counts in
+ each library.
  For each gene, edgeR reported average abundance (logCPM), 
 \begin_inset Formula $\log_{2}$
 \end_inset