Просмотр исходного кода

Fix issues noted by JRT in intro

Ryan C. Thompson 5 лет назад
Родитель
Сommit
a3d5b63dee
1 измененных файлов с 46 добавлено и 83 удалено
  1. 46 83
      thesis.lyx

+ 46 - 83
thesis.lyx

@@ -873,8 +873,6 @@ The central challenge when fitting a linear model is to estimate the variance
  variance estimates.
  However, this would require the assumption that every feature is equally
  variable, which is known to be false for most genomic data sets.
-
-
 \begin_inset Flex Code
 status open
 
@@ -884,8 +882,8 @@ limma
 
 \end_inset
 
- offers a compromise between these two extremes by using a method
- called empirical Bayes moderation to 
+ offers a compromise between these two extremes by using a method called
+ empirical Bayes moderation to 
 \begin_inset Quotes eld
 \end_inset
 
@@ -912,7 +910,6 @@ on of the two yields a variance estimate for each feature with greater precision
 ted for features with high variance and overestimated for features with
  low variance.
  Essentially,
-
 \begin_inset Flex Code
 status open
 
@@ -922,8 +919,8 @@ limma
 
 \end_inset
 
- assumes that extreme variances are less common than
- variances close to the common value.
+ assumes that extreme variances are less common than variances close to
+ the common value.
  The variance estimates from this empirical Bayes procedure are shown empiricall
 y to yield greater statistical power than either the individual feature
  variances or the single common value.
@@ -931,7 +928,6 @@ y to yield greater statistical power than either the individual feature
 
 \begin_layout Standard
 On top of this core framework,
-
 \begin_inset Flex Code
 status open
 
@@ -941,11 +937,9 @@ limma
 
 \end_inset
 
- also implements many other enhancements
- that, further relax the assumptions of the model and extend the scope of
- what kinds of data it can analyze.
+ also implements many other enhancements that, further relax the assumptions
+ of the model and extend the scope of what kinds of data it can analyze.
  Instead of squeezing toward a single common variance value,
-
 \begin_inset Flex Code
 status open
 
@@ -955,9 +949,8 @@ limma
 
 \end_inset
 
- can model
- the common variance as a function of a covariate, such as average expression
- 
+ can model the common variance as a function of a covariate, such as average
+ expression 
 \begin_inset CommandInset citation
 LatexCommand cite
 key "Law2013"
@@ -970,8 +963,6 @@ literal "false"
  precise expression measurements and therefore smaller variances than low-count
  genes.
  While linear models typically assume that all samples have equal variance,
-
-
 \begin_inset Flex Code
 status open
 
@@ -981,9 +972,8 @@ limma
 
 \end_inset
 
- is able to relax this assumption by identifying and down-weighting
- samples the diverge more strongly from the linear model across many features
- 
+ is able to relax this assumption by identifying and down-weighting samples
+ that diverge more strongly from the linear model across many features 
 \begin_inset CommandInset citation
 LatexCommand cite
 key "Ritchie2006,Liu2015"
@@ -993,7 +983,6 @@ literal "false"
 
 .
  In addition,
-
 \begin_inset Flex Code
 status open
 
@@ -1003,9 +992,9 @@ limma
 
 \end_inset
 
- is also able to fit simple mixed models incorporating
- one random effect in addition to the fixed effects represented by an ordinary
- linear model 
+ is also able to fit simple mixed models incorporating one random effect
+ in addition to the fixed effects represented by an ordinary linear model
+ 
 \begin_inset CommandInset citation
 LatexCommand cite
 key "Smyth2005a"
@@ -1015,7 +1004,6 @@ literal "false"
 
 .
  Once again,
-
 \begin_inset Flex Code
 status open
 
@@ -1025,13 +1013,12 @@ limma
 
 \end_inset
 
- shares information between features to obtain a robust
- estimate for the random effect correlation.
+ shares information between features to obtain a robust estimate for the
+ random effect correlation.
 \end_layout
 
 \begin_layout Subsubsection
 edgeR provides
-
 \begin_inset Flex Code
 status open
 
@@ -1046,7 +1033,6 @@ limma
 
 \begin_layout Standard
 Although
-
 \begin_inset Flex Code
 status open
 
@@ -1056,11 +1042,10 @@ limma
 
 \end_inset
 
- can be applied to read counts from RNA-seq data, it is less
- suitable for counts from ChIP-seq data, which tend to be much smaller and
- therefore violate the assumption of a normal distribution more severely.
+ can be applied to read counts from RNA-seq data, it is less suitable for
+ counts from ChIP-seq data, which tend to be much smaller and therefore
+ violate the assumption of a normal distribution more severely.
  For all count-based data, the
-
 \begin_inset Flex Code
 status open
 
@@ -1071,7 +1056,6 @@ edgeR
 \end_inset
 
  package works similarly to
-
 \begin_inset Flex Code
 status open
 
@@ -1081,10 +1065,8 @@ limma
 
 \end_inset
 
-, but
- uses a generalized linear model instead of a linear model.
+, but uses a generalized linear model instead of a linear model.
  The most important difference is that the GLM in
-
 \begin_inset Flex Code
 status open
 
@@ -1094,9 +1076,8 @@ edgeR
 
 \end_inset
 
- models the counts
- directly using a negative binomial distribution rather than modeling the
- normalized log counts using a normal distribution 
+ models the counts directly using a negative binomial distribution rather
+ than modeling the normalized log counts using a normal distribution 
 \begin_inset CommandInset citation
 LatexCommand cite
 key "Chen2014,McCarthy2012,Robinson2010a"
@@ -1127,7 +1108,6 @@ noise
  convenience, since a gamma-Poisson mixture yields the numerically tractable
  negative binomial distribution.
  Thus,
-
 \begin_inset Flex Code
 status open
 
@@ -1143,7 +1123,6 @@ a prioi
 \emph default
 that the variation in abundances between replicates follows a gamma distribution.
  For differential abundance testing,
-
 \begin_inset Flex Code
 status open
 
@@ -1153,10 +1132,9 @@ edgeR
 
 \end_inset
 
- offers a likelihood ratio test,
- but more recently recommends a quasi-likelihood test that properly factors
- the uncertainty in variance estimation into the statistical significance
- for each feature 
+ offers a likelihood ratio test, but more recently recommends a quasi-likelihood
+ test that properly factors the uncertainty in variance estimation into
+ the statistical significance for each feature 
 \begin_inset CommandInset citation
 LatexCommand cite
 key "Lund2012"
@@ -1173,8 +1151,8 @@ ChIP-seq Peak calling
 
 \begin_layout Standard
 Unlike RNA-seq data, in which gene annotations provide a well-defined set
- of genomic regions in which to count reads, ChIP-seq data can potentially
- occur anywhere in the genome.
+ of discrete genomic regions in which to count reads, ChIP-seq reads can
+ potentially occur anywhere in the genome.
  However, most genome regions will not contain significant ChIP-seq read
  coverage, and analyzing every position in the entire genome is statistically
  and computationally infeasible, so it is necessary to identify regions
@@ -1194,10 +1172,11 @@ Unlike RNA-seq data, in which gene annotations provide a well-defined set
 There are generally two kinds of peaks that can be identified: narrow peaks
  and broadly enriched regions.
  Proteins like transcription factors that bind specific sites in the genome
- typically show most of their read coverage at these specific sites and
- very little coverage anywhere else.
+ typically show most of their ChIP-seq read coverage at these specific sites
+ and very little coverage anywhere else.
  Because the footprint of the protein is consistent wherever it binds, each
- peak has a consistent size, typically tens to hundreds of base pairs.
+ peak has a consistent width, typically tens to hundreds of base pairs,
+ representing the length of DNA that it binds to.
  Algorithms like MACS exploit this pattern to identify specific loci at
  which such 
 \begin_inset Quotes eld
@@ -1293,7 +1272,7 @@ crossover point
 \begin_inset Quotes erd
 \end_inset
 
- between the signal and the noise by determining how for down the list the
+ between the signal and the noise by determining how far down the list the
  correspondence between feature ranks breaks down.
 \end_layout
 
@@ -1323,7 +1302,7 @@ Normalization of high-throughput data is non-trivial and application-dependent
 \end_layout
 
 \begin_layout Standard
-High-throughput data sets invariable require some kind of normalization
+High-throughput data sets invariably require some kind of normalization
  before further analysis can be conducted.
  In general, the goal of normalization is to remove effects in the data
  that are caused by technical factors that have nothing to do with the biology
@@ -1438,7 +1417,6 @@ In addition to well-understood effects that can be easily normalized out,
  means is not necessarily robust at the feature level, so the ComBat method
  adds empirical Bayes squeezing of the batch mean differences toward a common
  value, analogous to
-
 \begin_inset Flex Code
 status open
 
@@ -1448,8 +1426,7 @@ limma
 
 \end_inset
 
-'s empirical Bayes squeezing of feature variance
- estimates 
+'s empirical Bayes squeezing of feature variance estimates 
 \begin_inset CommandInset citation
 LatexCommand cite
 key "Johnson2007"
@@ -1479,7 +1456,7 @@ In some data sets, unknown batch effects may be present due to inherent
  shared across features to identify patterns of un-modeled variation that
  are repeated in many features.
  One attractive strategy would be to perform singular value decomposition
- (SVD) on the matrix os linear model residuals (which contain all the un-modeled
+ (SVD) on the matrix of linear model residuals (which contain all the un-modeled
  variation in the data) and take the first few singular vectors as batch
  effects.
  While this can be effective, it makes the unreasonable assumption that
@@ -2336,7 +2313,6 @@ However, removing the systematic component of the batch effect still leaves
  The gene quantifications from the first batch are substantially noisier
  than those in the second batch.
  This analysis corrected for this by using
-
 \begin_inset Flex Code
 status open
 
@@ -2346,8 +2322,8 @@ limma
 
 \end_inset
 
-'s sample weighting method
- to assign lower weights to the noisy samples of batch 1 
+'s sample weighting method to assign lower weights to the noisy samples
+ of batch 1 
 \begin_inset CommandInset citation
 LatexCommand cite
 key "Ritchie2006,Liu2015"
@@ -2392,7 +2368,6 @@ literal "false"
 , and batch-corrected at this point using ComBat.
  A linear model was fit to the batch-corrected, quality-weighted data for
  each gene using
-
 \begin_inset Flex Code
 status open
 
@@ -2402,9 +2377,7 @@ limma
 
 \end_inset
 
-, and each gene was tested for differential expression
- using
-
+, and each gene was tested for differential expression using
 \begin_inset Flex Code
 status open
 
@@ -3082,8 +3055,6 @@ PCoA plots of ChIP-seq sliding window data, before and after subtracting
 \begin_layout Standard
 Reads in promoters, peaks, and sliding windows across the genome were counted
  and normalized using csaw and analyzed for differential modification using
-
-
 \begin_inset Flex Code
 status open
 
@@ -13303,7 +13274,6 @@ literal "false"
 .
  Log2 counts per million values (logCPM) were calculated using the cpm function
  in
-
 \begin_inset Flex Code
 status open
 
@@ -13313,9 +13283,9 @@ edgeR
 
 \end_inset
 
- for individual samples and aveLogCPM function for averages across
- groups of samples, using those functions’ default prior count values to
- avoid taking the logarithm of 0.
+ for individual samples and aveLogCPM function for averages across groups
+ of samples, using those functions’ default prior count values to avoid
+ taking the logarithm of 0.
  Genes were considered “present” if their average normalized logCPM values
  across all libraries were at least -1.
  Normalizing for gene length was unnecessary because the sequencing protocol
@@ -13365,7 +13335,6 @@ Differential Expression Analysis
 
 \begin_layout Standard
 All tests for differential gene expression were performed using
-
 \begin_inset Flex Code
 status open
 
@@ -13375,8 +13344,7 @@ edgeR
 
 \end_inset
 
-, by
- first fitting a negative binomial generalized linear model to the counts
+, by first fitting a negative binomial generalized linear model to the counts
  and normalization factors and then performing a quasi-likelihood F-test
  with robust estimation of outlier gene dispersions 
 \begin_inset CommandInset citation
@@ -14558,7 +14526,6 @@ noprefix "false"
 , and genes with an average logCPM below -1 were filtered out.
  Each remaining gene was tested for differential abundance with respect
  to globin blocking (GB) using
-
 \begin_inset Flex Code
 status open
 
@@ -14568,11 +14535,9 @@ edgeR
 
 \end_inset
 
-’s quasi-likelihood F-test, fitting
- a negative binomial generalized linear model to table of read counts in
- each library.
+’s quasi-likelihood F-test, fitting a negative binomial generalized linear
+ model to table of read counts in each library.
  For each gene,
-
 \begin_inset Flex Code
 status open
 
@@ -14708,7 +14673,6 @@ Comparison of inter-sample gene abundance correlations with and without
  with an average abundance (logCPM, log2 counts per million reads counted)
  less than -1 were filtered out.
  Each gene’s logCPM was computed in each library using the
-
 \begin_inset Flex Code
 status open
 
@@ -14767,7 +14731,6 @@ ons than the non-GB libraries.
  Performing the same tests on the Spearman correlations gave the same conclusion
  (t-test: t = 26.8, df = 665, P ≪ 2.2e-16; sign-rank test: V = 8781, P ≪ 2.2e-16).
  The
-
 \begin_inset Flex Code
 status open
 
@@ -14777,9 +14740,9 @@ edgeR
 
 \end_inset
 
- package was used to compute the overall biological coefficient
- of variation (BCV) for GB and non-GB libraries, and found that globin blocking
- resulted in a negligible increase in the BCV (0.417 with GB vs.
+ package was used to compute the overall biological coefficient of variation
+ (BCV) for GB and non-GB libraries, and found that globin blocking resulted
+ in a negligible increase in the BCV (0.417 with GB vs.
  0.400 without).
  The near equality of the BCVs for both sets indicates that the higher correlati
 ons in the GB libraries are most likely a result of the increased yield