Progress on chapter 3

Ryan C. Thompson 6 years ago
parent
commit
3eedfcbe15
1 file changed with 161 additions and 95 deletions
+ 161 - 95
thesis.lyx

@@ -845,27 +845,64 @@ Approach
 \end_layout
 
 \begin_layout Subsection
-fRMA for classifiers
+Frozen RMA for clinical microarray classifiers
 \end_layout
 
-\begin_layout Itemize
-RMA makes the normalization of every sample depend on all other samples
- due to the quantile normalization and median polish steps
+\begin_layout Subsubsection
+Standard normalization methods are unsuitable for clinical application
 \end_layout
 
-\begin_deeper
-\begin_layout Itemize
-This makes standard RMA impractical to apply in a machine learning context,
- because adding in the new sample(s) to be classified changes the normalization
- of all samples
+\begin_layout Standard
+As the cost of performing microarray assays falls, there is increasing interest
+ in using genomic assays for diagnostic purposes, such as distinguishing
+ healthy transplants (TX) from transplants undergoing acute rejection (AR)
+ or acute dysfunction with no rejection (ADNR).
+ However, the standard normalization algorithm used for microarray data,
+ Robust Multi-chip Average (RMA) 
+\begin_inset CommandInset citation
+LatexCommand cite
+key "Irizarry2003a"
+literal "false"
+
+\end_inset
+
+, is not applicable in a clinical setting.
+ Two of the steps in RMA, quantile normalization and probe summarization
+ by median polish, depend on every array in the data set being normalized.
+ This means that adding or removing any arrays from a data set changes the
+ normalized values for all arrays, and data sets that have been normalized
+ separately cannot be compared to each other.
+ Hence, when using RMA, any arrays to be analyzed together must also be
+ normalized together, and the set of arrays included in the data set must
+ be held constant throughout an analysis.
 \end_layout
 
-\end_deeper
-\begin_layout Itemize
-Machine-learning applications demand a "single-channel" normalization method
+\begin_layout Standard
+These limitations present serious impediments to the use of arrays as a
+ diagnostic tool.
+ When training a classifier, the samples to be classified must not be involved
+ in any step of the training process, lest their inclusion bias the training
+ process.
+ Once a classifier is deployed in a clinical setting, the samples to be
+ classified will not even 
+\emph on
+exist
+\emph default
+ at the time of training, so including them would be impossible even if
+ it were statistically justifiable.
+ Therefore, any machine learning application for microarrays demands that
+ the normalized expression values computed for an array depend only
+ on information contained within that array.
+ This would ensure that each array's normalization is independent of every
+ other array, and that arrays normalized separately can still be compared
+ to each other without bias.
 \end_layout
 
-\begin_layout Itemize
+\begin_layout Subsubsection
+Frozen RMA satisfies clinical normalization requirements
+\end_layout
+
+\begin_layout Standard
 Frozen RMA (fRMA) addresses these concerns by replacing the quantile normalizati
 on and median polish with alternatives that do not introduce inter-array
  dependence, allowing each array to be normalized independently of all others
@@ -878,84 +915,65 @@ literal "false"
 \end_inset
 
 .
-\end_layout
-
-\begin_deeper
-\begin_layout Itemize
-Quantile normalization is performed against a pre-generated set of quantiles
- learned from a large collection of publically available array data in GEO
-\end_layout
-
-\begin_layout Itemize
-Median polish is replaced with a weighted average of probes, using weights
- learned form the same public GEO data
-\end_layout
-
-\begin_layout Itemize
-With fRMA, there is no difference between normalizaing 
-\begin_inset Quotes eld
-\end_inset
+ Quantile normalization is performed against a pre-generated set of quantiles
+ learned from a collection of 850 publicly available arrays sampled from
+ a wide variety of tissues in the Gene Expression Omnibus (GEO).
+ Each array's probe intensity distribution is normalized against these pre-gener
+ated quantiles.
+ The median polish step is replaced with a robust weighted average of probe
+ intensities, using inverse variance weights learned from the same public
+ GEO data.
+ The result is a normalization that satisfies the requirements mentioned
+ above: each array is normalized independently of all others, and any two
+ normalized arrays can be compared directly to each other.
+\end_layout
+
+\begin_layout Standard
+One important limitation of fRMA is that it requires a separate reference
+ data set from which to learn the parameters (reference quantiles and probe
+ weights) that will be used to normalize each array.
+ These parameters are specific to a given array platform, and pre-generated
+ parameters are only provided for the most common platforms, such as Affymetrix
+ hgu133plus2.
+ For a less common platform, it is necessary to learn custom parameters
+ from in-house data before fRMA can be used to normalize samples on that
+ platform 
+\begin_inset CommandInset citation
+LatexCommand cite
+key "HudsonK.&RemediosC.2010"
+literal "false"
 
-together
-\begin_inset Quotes erd
 \end_inset
 
- or separately, and any normalized sample can be compared to any other
-\end_layout
-
-\end_deeper
-\begin_layout Itemize
-frozen RMA is a good solution for common array platforms with large amounts
- of publically available data, but for less common platforms, ready-made
- normalization vectors are not provided, so custom vectors must be learned
- from in-house data
+.
 \end_layout
 
 \begin_layout Subsection
 Adapting voom to model heteroskedasticity in methylation array data
 \end_layout
 
-\begin_layout Itemize
-Methylation array data preprocessing induces heteroskedasticity
-\end_layout
-
-\begin_deeper
-\begin_layout Itemize
-\series bold
- 
-\series default
-values, interpreted as fraction of copies methylated, range from 0 to 1.
-\end_layout
-
-\begin_layout Itemize
-\series bold
- 
-\series default
-values, with their constrained range, are highly non-normal and not suitable
- for linear modeling
+\begin_layout Subsubsection
+Methylation array preprocessing induces heteroskedasticity
 \end_layout
 
-\begin_layout Itemize
-M-values, interpreted as ratio of methyled to unmethylated copies, maps
- the beta values from 
-\begin_inset Formula $[0,1]$
-\end_inset
-
- onto 
-\begin_inset Formula $(-\infty,+\infty)$
-\end_inset
-
-, also transforming them to have approximately normally distributed error
+\begin_layout Standard
+DNA methylation arrays are a relatively new kind of assay that uses microarrays
+ to measure the degree of methylation on cytosines in specific regions arrayed
+ across the genome.
+ First, bisulfite treatment converts all unmethylated cytosines to uracil
+ (which then become thymine after amplification) while leaving methylated
+ cytosines unaffected.
+ Then, each target region is interrogated with two probes: one binds to
+ the original genomic sequence and interrogates the level of methylated
+ DNA, and the other binds to the sequence with all Cs replaced by Ts and
+ interrogates the level of unmethylated DNA.
 \end_layout
 
-\end_deeper
 \begin_layout Standard
 \begin_inset Float figure
 wide false
 sideways false
-status open
+status collapsed
 
 \begin_layout Plain Layout
 \begin_inset Graphics
@@ -986,17 +1004,37 @@ Sigmoid shape of the mapping between β and M values
 
 \end_layout
 
-\begin_layout Plain Layout
+\end_inset
+
 
 \end_layout
 
+\begin_layout Standard
+After normalization, these two probe intensities are summarized in one of
+ two ways, each with advantages and disadvantages.
+ β
+\series bold
+ 
+\series default
+values, interpreted as fraction of DNA copies methylated, range from 0 to
+ 1.
+ β
+\series bold
+ 
+\series default
+values are conceptually easy to interpret, but the constrained range makes
+ them unsuitable for linear modeling, and their error distributions are
+ highly non-normal, which also frustrates linear modeling.
+ M-values, interpreted as the log ratio of methylated to unmethylated copies,
+ are computed by mapping the beta values from 
+\begin_inset Formula $[0,1]$
 \end_inset
 
+ onto 
+\begin_inset Formula $(-\infty,+\infty)$
+\end_inset
 
-\end_layout
-
-\begin_layout Itemize
-However, the sigmoid transformation (Figure 
+ using a sigmoid curve (Figure 
 \begin_inset CommandInset ref
 LatexCommand ref
 reference "fig:Sigmoid-beta-m-mapping"
@@ -1006,27 +1044,56 @@ noprefix "false"
 
 \end_inset
 
-) over-exaggerates the variance of extreme values, leading to a U-shaped
- trend in the mean-variance curve
+).
+ This transformation results in values with better statistical properties:
+ the unconstrained range is suitable for linear modeling, and the error
+ distributions are more normal.
+ Hence, most linear modeling and other statistical testing on methylation
+ arrays is performed using M-values.
 \end_layout
 
-\begin_layout Itemize
-This mean-variance dependency must be accounted for when fitting the linear
- model for differential methylation
+\begin_layout Standard
+However, the steep slope of the sigmoid transformation near 0 and 1 tends
+ to exaggerate small differences in β values near those extremes, which
+ in turn amplifies the error in those values, leading to a U-shaped trend
+ in the mean-variance curve.
+ This mean-variance dependency must be accounted for when fitting the linear
+ model for differential methylation, or else the variance will be systematically
+ overestimated for probes with moderate M-values and underestimated for
+ probes with extreme M-values.
 \end_layout
 
-\begin_layout Itemize
+\begin_layout Subsubsection
+The voom method for RNA-seq data can model the heteroskedasticity
+\end_layout
+
+\begin_layout Standard
+RNA-seq read count data are also known to show heteroskedasticity, and the
+ voom method was developed for modeling this heteroskedasticity by estimating
+ the mean-variance trend in the data and using this trend to assign precision
+ weights to each observation 
+\begin_inset CommandInset citation
+LatexCommand cite
+key "Law2013"
+literal "false"
+
+\end_inset
+
+.
+ While methylation array data are not derived from counts,
+\end_layout
+
+\begin_layout Standard
 Voom method, originally developed for RNA-seq data, can model mean-variance
  dependence
 \end_layout
 
-\begin_deeper
-\begin_layout Itemize
+\begin_layout Standard
 Standard implementation of voom assumes the input is read counts, and adjustment
 s are required to run it on M-values.
 \end_layout
 
-\begin_layout Itemize
+\begin_layout Standard
 \begin_inset Flex TODO Note (inline)
 status open
 
@@ -1039,8 +1106,7 @@ Put code on Github and reference it
 
 \end_layout
 
-\end_deeper
-\begin_layout Itemize
+\begin_layout Standard
 Other methods, such as duplicateCorrelation and arrayWeights, are also applicabl
 e with no need for custom adaptation
 \end_layout
@@ -1106,11 +1172,11 @@ fRMA eliminates unwanted dependence of classifier training on normalization
  strategy caused by RMA
 \end_layout
 
-\begin_layout Itemize
-Data set consists of training set (23 TX, 35 AR, 21 ADNR), validation set
- (23 TX, 34 AR, 21 ADNR), and external validation set gathered from public
- GEO data (37 TX, 38 AR, no ADNR), all on standard hgu133plus2 Affy arrays
- 
+\begin_layout Standard
+The initial data set for testing fRMA consisted of 157 hgu133plus2 arrays,
+ split into a training set (23 TX, 35 AR, 21 ADNR), validation set (23 TX,
+ 34 AR, 21 ADNR), and external validation set gathered from public GEO data
+ (37 TX, 38 AR, no ADNR), all on standard hgu133plus2 Affy arrays 
 \begin_inset CommandInset citation
 LatexCommand cite
 key "Kurian2014"