Progress on chapter 3

Ryan C. Thompson, 6 years ago
Parent
Current commit
3eedfcbe15
1 file changed, with 161 insertions and 95 deletions
+ 161 - 95
thesis.lyx

@@ -845,27 +845,64 @@ Approach
 \end_layout
 
 \begin_layout Subsection
-fRMA for classifiers
+Frozen RMA for clinical microarray classifiers
 \end_layout
 
-\begin_layout Itemize
-RMA makes the normalization of every sample depend on all other samples
- due to the quantile normalization and median polish steps
+\begin_layout Subsubsection
+Standard normalization methods are unsuitable for clinical application
 \end_layout
 
-\begin_deeper
-\begin_layout Itemize
-This makes standard RMA impractical to apply in a machine learning context,
- because adding in the new sample(s) to be classified changes the normalization
- of all samples
+\begin_layout Standard
+As the cost of performing microarray assays falls, there is increasing interest
+ in using genomic assays for diagnostic purposes, such as distinguishing
+ healthy transplants (TX) from transplants undergoing acute rejection (AR)
+ or acute dysfunction with no rejection (ADNR).
+ However, the standard normalization algorithm used for microarray data,
+ Robust Multi-chip Average (RMA) 
+\begin_inset CommandInset citation
+LatexCommand cite
+key "Irizarry2003a"
+literal "false"
+
+\end_inset
+
+, is not applicable in a clinical setting.
+ Two of the steps in RMA, quantile normalization and probe summarization
+ by median polish, depend on every array in the data set being normalized.
+ This means that adding or removing any arrays from a data set changes the
+ normalized values for all arrays, and data sets that have been normalized
+ separately cannot be compared to each other.
+ Hence, when using RMA, any arrays to be analyzed together must also be
+ normalized together, and the set of arrays included in the data set must
+ be held constant throughout an analysis.
 \end_layout
 
-\end_deeper
-\begin_layout Itemize
-Machine-learning applications demand a "single-channel" normalization method
+\begin_layout Standard
+These limitations present serious impediments to the use of arrays as a
+ diagnostic tool.
+ When training a classifier, the samples to be classified must not be involved
+ in any step of the training process, lest their inclusion bias the training
+ process.
+ Once a classifier is deployed in a clinical setting, the samples to be
+ classified will not even 
+\emph on
+exist
+\emph default
+ at the time of training, so including them would be impossible even if
+ it were statistically justifiable.
+ Therefore, any machine learning application for microarrays demands that
+ the normalized expression values computed for an array depend only on
+ information contained within that array.
+ This would ensure that each array's normalization is independent of every
+ other array, and that arrays normalized separately can still be compared
+ to each other without bias.
 \end_layout
 
-\begin_layout Itemize
+\begin_layout Subsubsection
+Frozen RMA satisfies clinical normalization requirements
+\end_layout
+
+\begin_layout Standard
 Frozen RMA (fRMA) addresses these concerns by replacing the quantile normalizati
 on and median polish with alternatives that do not introduce inter-array
  dependence, allowing each array to be normalized independently of all others
@@ -878,84 +915,65 @@ literal "false"
 \end_inset
 
 .
-\end_layout
-
-\begin_deeper
-\begin_layout Itemize
-Quantile normalization is performed against a pre-generated set of quantiles
- learned from a large collection of publically available array data in GEO
-\end_layout
-
-\begin_layout Itemize
-Median polish is replaced with a weighted average of probes, using weights
- learned form the same public GEO data
-\end_layout
-
-\begin_layout Itemize
-With fRMA, there is no difference between normalizaing 
-\begin_inset Quotes eld
-\end_inset
+ Quantile normalization is performed against a pre-generated set of quantiles
+ learned from a collection of 850 publicly available arrays sampled from
+ a wide variety of tissues in the Gene Expression Omnibus (GEO).
+ Each array's probe intensity distribution is normalized against these pre-gener
+ated quantiles.
+ The median polish step is replaced with a robust weighted average of probe
+ intensities, using inverse variance weights learned from the same public
+ GEO data.
+ The result is a normalization that satisfies the requirements mentioned
+ above: each array is normalized independently of all others, and any two
+ normalized arrays can be compared directly to each other.
+\end_layout
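The two frozen steps described above can be sketched in a few lines. This is a minimal illustration, not the fRMA implementation: the function names, the toy reference quantiles, and the plain inverse-variance average (used here in place of fRMA's robust weighted average) are all assumptions made for the example.

```python
import numpy as np


def frozen_quantile_normalize(intensities, reference_quantiles):
    """Map one array's probe intensities onto a fixed reference distribution.

    Hypothetical helper: only the array itself and the frozen reference
    quantiles are used, so no other array can influence the result.
    """
    intensities = np.asarray(intensities, dtype=float)
    reference = np.sort(np.asarray(reference_quantiles, dtype=float))
    # Rank of each probe within its own array (0 = smallest intensity).
    ranks = np.argsort(np.argsort(intensities, kind="stable"), kind="stable")
    n = intensities.size
    # Convert ranks to positions on [0, 1] and look them up in the reference.
    positions = ranks / (n - 1) if n > 1 else np.zeros(n)
    ref_positions = np.linspace(0.0, 1.0, reference.size)
    return np.interp(positions, ref_positions, reference)


def weighted_probe_summary(probe_values, probe_variances):
    """Summarize one probe set as an inverse-variance weighted average.

    Simplification (assumption): a plain weighted mean with frozen
    per-probe variances, without any robust down-weighting of outliers.
    """
    values = np.asarray(probe_values, dtype=float)
    weights = 1.0 / np.asarray(probe_variances, dtype=float)
    return float(np.sum(weights * values) / np.sum(weights))


# Toy frozen parameters, standing in for values learned from reference data.
reference_quantiles = [1.0, 2.0, 3.0, 4.0]

# Two arrays normalized separately land on the same reference scale.
array_a = frozen_quantile_normalize([10, 40, 20, 30], reference_quantiles)
array_b = frozen_quantile_normalize([500, 100, 900, 300], reference_quantiles)
```

Because neither function takes any other array as input, adding or removing arrays from a study cannot change the values already computed for `array_a` or `array_b`.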
+
+\begin_layout Standard
+One important limitation of fRMA is that it requires a separate reference
+ data set from which to learn the parameters (reference quantiles and probe
+ weights) that will be used to normalize each array.
+ These parameters are specific to a given array platform, and pre-generated
+ parameters are only provided for the most common platforms, such as Affymetrix
+ hgu133plus2.
+ For a less common platform, it is necessary to learn custom parameters
+ from in-house data before fRMA can be used to normalize samples on that
+ platform 
+\begin_inset CommandInset citation
+LatexCommand cite
+key "HudsonK.&RemediosC.2010"
+literal "false"
 
 
-together
-\begin_inset Quotes erd
 \end_inset
 
- or separately, and any normalized sample can be compared to any other
-\end_layout
-
-\end_deeper
-\begin_layout Itemize
-frozen RMA is a good solution for common array platforms with large amounts
- of publically available data, but for less common platforms, ready-made
- normalization vectors are not provided, so custom vectors must be learned
- from in-house data
+.
 \end_layout
 
 \begin_layout Subsection
 Adapting voom to model heteroskedasticity in methylation array data
 \end_layout
 
-\begin_layout Itemize
-Methylation array data preprocessing induces heteroskedasticity
-\end_layout
-
-\begin_deeper
-\begin_layout Itemize
-\series bold
- 
-\series default
-values, interpreted as fraction of copies methylated, range from 0 to 1.
-\end_layout
-
-\begin_layout Itemize
-\series bold
- 
-\series default
-values, with their constrained range, are highly non-normal and not suitable
- for linear modeling
+\begin_layout Subsubsection
+Methylation array preprocessing induces heteroskedasticity
 \end_layout
 
-\begin_layout Itemize
-M-values, interpreted as ratio of methyled to unmethylated copies, maps
- the beta values from 
-\begin_inset Formula $[0,1]$
-\end_inset
-
- onto 
-\begin_inset Formula $(-\infty,+\infty)$
-\end_inset
-
-, also transforming them to have approximately normally distributed error
+\begin_layout Standard
+DNA methylation arrays are a relatively new kind of assay that uses microarrays
+ to measure the degree of methylation on cytosines in specific regions arrayed
+ across the genome.
+ First, bisulfite treatment converts all unmethylated cytosines to uracil
+ (which then becomes thymine after amplification) while leaving methylated
+ cytosines unaffected.
+ Then, each target region is interrogated with two probes: one binds to
+ the original genomic sequence and interrogates the level of methylated
+ DNA, and the other binds to the sequence with all Cs replaced by Ts and
+ interrogates the level of unmethylated DNA.
 \end_layout
 
-\end_deeper
 \begin_layout Standard
 \begin_inset Float figure
 wide false
 sideways false
-status open
+status collapsed
 
 
 \begin_layout Plain Layout
 \begin_inset Graphics
@@ -986,17 +1004,37 @@ Sigmoid shape of the mapping between β and M values
 
 
 \end_layout
 
-\begin_layout Plain Layout
+\end_inset
+
 
 
 \end_layout
 
+\begin_layout Standard
+After normalization, these two probe intensities are summarized in one of
+ two ways, each with advantages and disadvantages.
+ β
+\series bold
+ 
+\series default
+values, interpreted as fraction of DNA copies methylated, range from 0 to
+ 1.
+ β
+\series bold
+ 
+\series default
+values are conceptually easy to interpret, but the constrained range makes
+ them unsuitable for linear modeling, and their error distributions are
+ highly non-normal, which also frustrates linear modeling.
+ M-values, interpreted as the log ratio of methylated to unmethylated copies,
+ are computed by mapping the beta values from 
+\begin_inset Formula $[0,1]$
 \end_inset
 
+ onto 
+\begin_inset Formula $(-\infty,+\infty)$
+\end_inset
 
 
-\end_layout
-
-\begin_layout Itemize
-However, the sigmoid transformation (Figure 
+ using a sigmoid curve (Figure 
 \begin_inset CommandInset ref
 LatexCommand ref
 reference "fig:Sigmoid-beta-m-mapping"
@@ -1006,27 +1044,56 @@ noprefix "false"
 
 
 \end_inset
 
-) over-exaggerates the variance of extreme values, leading to a U-shaped
- trend in the mean-variance curve
+).
+ This transformation results in values with better statistical properties:
+ the unconstrained range is suitable for linear modeling, and the error
+ distributions are more normal.
+ Hence, most linear modeling and other statistical testing on methylation
+ arrays is performed using M-values.
 \end_layout
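The mapping between β values and M-values described above can be written out directly. The sketch below assumes the conventional base-2 logit, M = log2(β / (1 - β)), with no stabilizing offset; the function names are hypothetical. It also shows numerically why a fixed difference in β corresponds to a much larger difference in M near the ends of the range.

```python
import numpy as np


def beta_to_m(beta):
    """Convert beta values in (0, 1) to M-values (base-2 logit, assumed)."""
    beta = np.asarray(beta, dtype=float)
    return np.log2(beta / (1.0 - beta))


def m_to_beta(m):
    """Inverse mapping: convert M-values back to beta values."""
    m = np.asarray(m, dtype=float)
    return 2.0 ** m / (2.0 ** m + 1.0)


# The same absolute difference in beta (0.01) at the middle of the range
# and near the upper extreme gives very different differences in M.
mid_gap = float(beta_to_m(0.51) - beta_to_m(0.50))
edge_gap = float(beta_to_m(0.99) - beta_to_m(0.98))
print(f"M difference near beta = 0.5:  {mid_gap:.3f}")
print(f"M difference near beta = 0.99: {edge_gap:.3f}")
```

Note that β values of exactly 0 or 1 map to infinite M-values under this formula, so real data would need values strictly inside the interval.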
 
 
-\begin_layout Itemize
-This mean-variance dependency must be accounted for when fitting the linear
- model for differential methylation
+\begin_layout Standard
+However, the steep slope of the sigmoid transformation near 0 and 1 tends
+ to exaggerate small differences in β values near those extremes, which
+ in turn amplifies the error in those values, leading to a U-shaped trend
+ in the mean-variance curve.
+ This mean-variance dependency must be accounted for when fitting the linear
+ model for differential methylation, or else the variance will be systematically
+ overestimated for probes with moderate M-values and underestimated for
+ probes with extreme M-values.
 \end_layout
 
-\begin_layout Itemize
+\begin_layout Subsubsection
+The voom method for RNA-seq data can model the heteroskedasticity
+\end_layout
+
+\begin_layout Standard
+RNA-seq read count data are also known to show heteroskedasticity, and the
+ voom method was developed for modeling this heteroskedasticity by estimating
+ the mean-variance trend in the data and using this trend to assign precision
+ weights to each observation 
+\begin_inset CommandInset citation
+LatexCommand cite
+key "Law2013"
+literal "false"
+
+\end_inset
+
+.
+ While methylation array data are not derived from counts,
+\end_layout
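The idea of turning a mean-variance trend into precision weights can be illustrated on M-values as follows. This is a simplified sketch under stated assumptions, not the voom implementation and not the adaptation used in this thesis: it assumes a single-group design (so each probe's fitted value is its mean), and it substitutes a low-degree polynomial for the lowess curve that voom fits. The function name and parameters are hypothetical.

```python
import numpy as np


def trend_precision_weights(m_values, degree=2):
    """Estimate precision weights from a mean-variance trend.

    m_values: probes x samples matrix of M-values from one group (assumption).
    Returns a matrix of weights with the same shape as the input.
    """
    m_values = np.asarray(m_values, dtype=float)
    probe_means = m_values.mean(axis=1)
    probe_sds = m_values.std(axis=1, ddof=1)
    # As in voom, model the trend on the square-root standard deviation scale.
    # A polynomial is used here as a stand-in for a lowess fit (assumption).
    coeffs = np.polyfit(probe_means, np.sqrt(probe_sds), deg=degree)
    predicted_sqrt_sd = np.polyval(coeffs, probe_means)
    # Guard against non-positive predictions from the polynomial.
    predicted_sqrt_sd = np.clip(predicted_sqrt_sd, 1e-6, None)
    # Predicted variance is (sqrt sd)^4, so the precision weight is its inverse.
    probe_weights = 1.0 / predicted_sqrt_sd ** 4
    # With a single group, every sample of a probe shares the same weight.
    return np.repeat(probe_weights[:, None], m_values.shape[1], axis=1)
```

With a U-shaped trend like the one described for M-values, probes with extreme means receive smaller weights than probes with moderate means, so they contribute less to the linear model fit.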
+
+\begin_layout Standard
 Voom method, originally developed for RNA-seq data, can model mean-variance
  dependence
 \end_layout
 
-\begin_deeper
-\begin_layout Itemize
+\begin_layout Standard
 Standard implementation of voom assumes the input is read counts, and adjustment
 s are required to run it on M-values.
 \end_layout
 
-\begin_layout Itemize
+\begin_layout Standard
 \begin_inset Flex TODO Note (inline)
 status open
 
@@ -1039,8 +1106,7 @@ Put code on Github and reference it
 
 
 \end_layout
 
-\end_deeper
-\begin_layout Itemize
+\begin_layout Standard
 Other methods, such as duplicateCorrelation and arrayWeights, are also applicabl
 e with no need for custom adaptation
 \end_layout
@@ -1106,11 +1172,11 @@ fRMA eliminates unwanted dependence of classifier training on normalization
  strategy caused by RMA
 \end_layout
 
-\begin_layout Itemize
-Data set consists of training set (23 TX, 35 AR, 21 ADNR), validation set
- (23 TX, 34 AR, 21 ADNR), and external validation set gathered from public
- GEO data (37 TX, 38 AR, no ADNR), all on standard hgu133plus2 Affy arrays
- 
+\begin_layout Standard
+The initial data set for testing fRMA consisted of 157 arrays, split into
+ a training set (23 TX, 35 AR, 21 ADNR) and a validation set (23 TX, 34 AR,
+ 21 ADNR), plus an external validation set gathered from public GEO data
+ (37 TX, 38 AR, no ADNR), all on standard hgu133plus2 Affy arrays 
 \begin_inset CommandInset citation
 LatexCommand cite
 key "Kurian2014"