|
@@ -845,27 +845,64 @@ Approach
|
|
|
\end_layout
|
|
|
|
|
|
\begin_layout Subsection
|
|
|
-fRMA for classifiers
|
|
|
+Frozen RMA for clinical microarray classifiers
|
|
|
\end_layout
|
|
|
|
|
|
-\begin_layout Itemize
|
|
|
-RMA makes the normalization of every sample depend on all other samples
|
|
|
- due to the quantile normalization and median polish steps
|
|
|
+\begin_layout Subsubsection
|
|
|
+Standard normalization methods are unsuitable for clinical application
|
|
|
\end_layout
|
|
|
|
|
|
-\begin_deeper
|
|
|
-\begin_layout Itemize
|
|
|
-This makes standard RMA impractical to apply in a machine learning context,
|
|
|
- because adding in the new sample(s) to be classified changes the normalization
|
|
|
- of all samples
|
|
|
+\begin_layout Standard
|
|
|
+As the cost of performing microarray assays falls, there is increasing interest
|
|
|
+ in using genomic assays for diagnostic purposes, such as distinguishing
|
|
|
+ healthy transplants (TX) from transplants undergoing acute rejection (AR)
|
|
|
+ or acute dysfunction with no rejection (ADNR).
|
|
|
+ However, the standard normalization algorithm used for microarray data,
|
|
|
+ Robust Multi-chip Average (RMA)
|
|
|
+\begin_inset CommandInset citation
|
|
|
+LatexCommand cite
|
|
|
+key "Irizarry2003a"
|
|
|
+literal "false"
|
|
|
+
|
|
|
+\end_inset
|
|
|
+
|
|
|
+, is not applicable in a clinical setting.
|
|
|
+ Two of the steps in RMA, quantile normalization and probe summarization
|
|
|
+ by median polish, depend on every array in the data set being normalized.
|
|
|
+ This means that adding or removing any arrays from a data set changes the
|
|
|
+ normalized values for all arrays, and data sets that have been normalized
|
|
|
+ separately cannot be compared to each other.
|
|
|
+ Hence, when using RMA, any arrays to be analyzed together must also be
|
|
|
+ normalized together, and the set of arrays included in the data set must
|
|
|
+ be held constant throughout an analysis.
|
|
|
\end_layout
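The inter-array dependence described above can be illustrated with a minimal sketch of quantile normalization alone. This is not the RMA implementation: the function name `quantile_normalize` and the simulated intensities are assumptions made for illustration, and the median polish step is omitted.

```python
import numpy as np

def quantile_normalize(x):
    """Quantile-normalize the columns (arrays) of a probes-by-arrays matrix."""
    # Rank of each probe within its own array (0 = lowest intensity).
    ranks = np.argsort(np.argsort(x, axis=0), axis=0)
    # The reference distribution is the mean of the sorted columns,
    # so it depends on every array in the data set.
    reference = np.sort(x, axis=0).mean(axis=1)
    return reference[ranks]

# Simulated intensities (hypothetical data): 1000 probes on 4 arrays.
rng = np.random.default_rng(0)
arrays = rng.lognormal(mean=7.0, sigma=1.0, size=(1000, 4))

with_all = quantile_normalize(arrays)
without_last = quantile_normalize(arrays[:, :3])

# The normalized values for array 0 change when array 3 is dropped.
changed = not np.allclose(with_all[:, 0], without_last[:, 0])
print("Array 0 changed after removing another array:", changed)
```

Because the reference distribution is recomputed from whichever arrays are present, the normalized values of the first array differ between the two runs even though its raw intensities are identical.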
|
|
|
|
|
|
-\end_deeper
|
|
|
-\begin_layout Itemize
|
|
|
-Machine-learning applications demand a "single-channel" normalization method
|
|
|
+\begin_layout Standard
|
|
|
+These limitations present serious impediments to the use of arrays as a
|
|
|
+ diagnostic tool.
|
|
|
+ When training a classifier, the samples to be classified must not be involved
|
|
|
+ in any step of the training process, lest their inclusion bias the training
|
|
|
+ process.
|
|
|
+ Once a classifier is deployed in a clinical setting, the samples to be
|
|
|
+ classified will not even
|
|
|
+\emph on
|
|
|
+exist
|
|
|
+\emph default
|
|
|
+ at the time of training, so including them would be impossible even if
|
|
|
+ it were statistically justifiable.
|
|
|
+ Therefore, any machine learning application for microarrays demands that
|
|
|
+ the normalized expression values computed for an array depend only
|
|
|
+ on information contained within that array.
|
|
|
+ This would ensure that each array's normalization is independent of every
|
|
|
+ other array, and that arrays normalized separately can still be compared
|
|
|
+ to each other without bias.
|
|
|
\end_layout
|
|
|
|
|
|
-\begin_layout Itemize
|
|
|
+\begin_layout Subsubsection
|
|
|
+Frozen RMA satisfies clinical normalization requirements
|
|
|
+\end_layout
|
|
|
+
|
|
|
+\begin_layout Standard
|
|
|
Frozen RMA (fRMA) addresses these concerns by replacing the quantile normalizati
|
|
|
on and median polish with alternatives that do not introduce inter-array
|
|
|
dependence, allowing each array to be normalized independently of all others
|
|
@@ -878,84 +915,65 @@ literal "false"
|
|
|
\end_inset
|
|
|
|
|
|
.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_deeper
|
|
|
-\begin_layout Itemize
|
|
|
-Quantile normalization is performed against a pre-generated set of quantiles
|
|
|
- learned from a large collection of publically available array data in GEO
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Itemize
|
|
|
-Median polish is replaced with a weighted average of probes, using weights
|
|
|
- learned form the same public GEO data
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Itemize
|
|
|
-With fRMA, there is no difference between normalizaing
|
|
|
-\begin_inset Quotes eld
|
|
|
-\end_inset
|
|
|
+ Quantile normalization is performed against a pre-generated set of quantiles
|
|
|
+ learned from a collection of 850 publicly available arrays sampled from
|
|
|
+ a wide variety of tissues in the Gene Expression Omnibus (GEO).
|
|
|
+ Each array's probe intensity distribution is normalized against these pre-gener
|
|
|
+ated quantiles.
|
|
|
+ The median polish step is replaced with a robust weighted average of probe
|
|
|
+ intensities, using inverse variance weights learned from the same public
|
|
|
+ GEO data.
|
|
|
+ The result is a normalization that satisfies the requirements mentioned
|
|
|
+ above: each array is normalized independently of all others, and any two
|
|
|
+ normalized arrays can be compared directly to each other.
|
|
|
+\end_layout
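The two frozen steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the code of the fRMA package: the function names, the simulated reference quantiles, and the probe variances are all hypothetical, and the robust estimation that fRMA applies to the weighted average is replaced here by a plain inverse-variance weighted mean.

```python
import numpy as np

def frozen_quantile_normalize(array, reference_quantiles):
    """Map one array's intensities onto a fixed set of reference quantiles."""
    # Only this array's own ranks are used; no other array is consulted.
    ranks = np.argsort(np.argsort(array))
    return reference_quantiles[ranks]

def weighted_summary(probe_values, probe_variances):
    """Summarize one probe set with inverse-variance weights (non-robust)."""
    return np.average(probe_values, weights=1.0 / probe_variances)

rng = np.random.default_rng(1)
# Hypothetical frozen parameters; in practice these are learned once
# from a reference data set and then held fixed.
reference_quantiles = np.sort(rng.normal(8.0, 2.0, size=500))
probe_variances = rng.uniform(0.5, 2.0, size=11)

new_array = rng.lognormal(2.0, 0.3, size=500)
other_array = rng.lognormal(2.0, 0.3, size=500)

alone = frozen_quantile_normalize(new_array, reference_quantiles)
batch = np.column_stack([
    frozen_quantile_normalize(a, reference_quantiles)
    for a in (new_array, other_array)
])

# Normalizing alone or alongside another array gives identical values.
same = np.array_equal(alone, batch[:, 0])
summary = weighted_summary(alone[:11], probe_variances)
print("Identical alone vs. in a batch:", same)
```

Since the reference quantiles and weights never change, an array normalized today and one normalized after the classifier is deployed end up on the same scale.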
|
|
|
+
|
|
|
+\begin_layout Standard
|
|
|
+One important limitation of fRMA is that it requires a separate reference
|
|
|
+ data set from which to learn the parameters (reference quantiles and probe
|
|
|
+ weights) that will be used to normalize each array.
|
|
|
+ These parameters are specific to a given array platform, and pre-generated
|
|
|
+ parameters are only provided for the most common platforms, such as Affymetrix
|
|
|
+ hgu133plus2.
|
|
|
+ For a less common platform, it is necessary to learn custom parameters
|
|
|
+ from in-house data before fRMA can be used to normalize samples on that
|
|
|
+ platform
|
|
|
+\begin_inset CommandInset citation
|
|
|
+LatexCommand cite
|
|
|
+key "HudsonK.&RemediosC.2010"
|
|
|
+literal "false"
|
|
|
|
|
|
-together
|
|
|
-\begin_inset Quotes erd
|
|
|
\end_inset
|
|
|
|
|
|
- or separately, and any normalized sample can be compared to any other
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_deeper
|
|
|
-\begin_layout Itemize
|
|
|
-frozen RMA is a good solution for common array platforms with large amounts
|
|
|
- of publically available data, but for less common platforms, ready-made
|
|
|
- normalization vectors are not provided, so custom vectors must be learned
|
|
|
- from in-house data
|
|
|
+.
|
|
|
\end_layout
|
|
|
|
|
|
\begin_layout Subsection
|
|
|
Adapting voom to model heteroskedasticity in methylation array data
|
|
|
\end_layout
|
|
|
|
|
|
-\begin_layout Itemize
|
|
|
-Methylation array data preprocessing induces heteroskedasticity
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_deeper
|
|
|
-\begin_layout Itemize
|
|
|
-β
|
|
|
-\series bold
|
|
|
-
|
|
|
-\series default
|
|
|
-values, interpreted as fraction of copies methylated, range from 0 to 1.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Itemize
|
|
|
-β
|
|
|
-\series bold
|
|
|
-
|
|
|
-\series default
|
|
|
-values, with their constrained range, are highly non-normal and not suitable
|
|
|
- for linear modeling
|
|
|
+\begin_layout Subsubsection
|
|
|
+Methylation array preprocessing induces heteroskedasticity
|
|
|
\end_layout
|
|
|
|
|
|
-\begin_layout Itemize
|
|
|
-M-values, interpreted as ratio of methyled to unmethylated copies, maps
|
|
|
- the beta values from
|
|
|
-\begin_inset Formula $[0,1]$
|
|
|
-\end_inset
|
|
|
-
|
|
|
- onto
|
|
|
-\begin_inset Formula $(-\infty,+\infty)$
|
|
|
-\end_inset
|
|
|
-
|
|
|
-, also transforming them to have approximately normally distributed error
|
|
|
+\begin_layout Standard
|
|
|
+DNA methylation arrays are a relatively new kind of assay that uses microarrays
|
|
|
+ to measure the degree of methylation on cytosines in specific regions arrayed
|
|
|
+ across the genome.
|
|
|
+ First, bisulfite treatment converts all unmethylated cytosines to uracil
|
|
|
+ (which then becomes thymine after amplification) while leaving methylated
|
|
|
+ cytosines unaffected.
|
|
|
+ Then, each target region is interrogated with two probes: one binds to
|
|
|
+ the original genomic sequence and interrogates the level of methylated
|
|
|
+ DNA, and the other binds to the sequence with all Cs replaced by Ts and
|
|
|
+ interrogates the level of unmethylated DNA.
|
|
|
\end_layout
|
|
|
|
|
|
-\end_deeper
|
|
|
\begin_layout Standard
|
|
|
\begin_inset Float figure
|
|
|
wide false
|
|
|
sideways false
|
|
|
-status open
|
|
|
+status collapsed
|
|
|
|
|
|
\begin_layout Plain Layout
|
|
|
\begin_inset Graphics
|
|
@@ -986,17 +1004,37 @@ Sigmoid shape of the mapping between β and M values
|
|
|
|
|
|
\end_layout
|
|
|
|
|
|
-\begin_layout Plain Layout
|
|
|
+\end_inset
|
|
|
+
|
|
|
|
|
|
\end_layout
|
|
|
|
|
|
+\begin_layout Standard
|
|
|
+After normalization, these two probe intensities are summarized in one of
|
|
|
+ two ways, each with advantages and disadvantages.
|
|
|
+ β
|
|
|
+\series bold
|
|
|
+
|
|
|
+\series default
|
|
|
+values, interpreted as the fraction of DNA copies methylated, range from 0 to
|
|
|
+ 1.
|
|
|
+ β
|
|
|
+\series bold
|
|
|
+
|
|
|
+\series default
|
|
|
+values are conceptually easy to interpret, but the constrained range makes
|
|
|
+ them unsuitable for linear modeling, and their error distributions are
|
|
|
+ highly non-normal, which also frustrates linear modeling.
|
|
|
+ M-values, interpreted as the log ratio of methylated to unmethylated copies,
|
|
|
+ are computed by mapping the beta values from
|
|
|
+\begin_inset Formula $[0,1]$
|
|
|
\end_inset
|
|
|
|
|
|
+ onto
|
|
|
+\begin_inset Formula $(-\infty,+\infty)$
|
|
|
+\end_inset
|
|
|
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Itemize
|
|
|
-However, the sigmoid transformation (Figure
|
|
|
+ using a sigmoid curve (Figure
|
|
|
\begin_inset CommandInset ref
|
|
|
LatexCommand ref
|
|
|
reference "fig:Sigmoid-beta-m-mapping"
|
|
@@ -1006,27 +1044,56 @@ noprefix "false"
|
|
|
|
|
|
\end_inset
|
|
|
|
|
|
-) over-exaggerates the variance of extreme values, leading to a U-shaped
|
|
|
- trend in the mean-variance curve
|
|
|
+).
|
|
|
+ This transformation results in values with better statistical properties:
|
|
|
+ the unconstrained range is suitable for linear modeling, and the error
|
|
|
+ distributions are more normal.
|
|
|
+ Hence, most linear modeling and other statistical testing on methylation
|
|
|
+ arrays is performed using M-values.
|
|
|
\end_layout
|
|
|
|
|
|
-\begin_layout Itemize
|
|
|
-This mean-variance dependency must be accounted for when fitting the linear
|
|
|
- model for differential methylation
|
|
|
+\begin_layout Standard
|
|
|
+However, the steep slope of the sigmoid transformation near 0 and 1 tends
|
|
|
+ to exaggerate small differences in β values near those extremes, which
|
|
|
+ in turn amplifies the error in those values, leading to a U-shaped trend
|
|
|
+ in the mean-variance curve.
|
|
|
+ This mean-variance dependency must be accounted for when fitting the linear
|
|
|
+ model for differential methylation, or else the variance will be systematically
|
|
|
+ overestimated for probes with moderate M-values and underestimated for
|
|
|
+ probes with extreme M-values.
|
|
|
\end_layout
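The amplification of error near the extremes can be checked numerically. The sketch below assumes the standard base-2 logit relationship between the two scales, M = log2(β / (1 - β)); the function names and the example β values are chosen only for illustration.

```python
import numpy as np

def beta_to_m(beta):
    """Base-2 logit transform from beta values to M-values."""
    return np.log2(beta / (1.0 - beta))

def m_slope(beta):
    """Derivative dM/dbeta = 1 / (beta * (1 - beta) * ln 2)."""
    return 1.0 / (beta * (1.0 - beta) * np.log(2.0))

betas = np.array([0.02, 0.5, 0.98])
error = 0.01  # the same small measurement error in beta at each point

# Shift in M caused by the same absolute error in beta.
m_shift = beta_to_m(betas + error) - beta_to_m(betas)

for b, shift, slope in zip(betas, m_shift, m_slope(betas)):
    print(f"beta={b:.2f}  shift in M={shift:.3f}  slope={slope:.1f}")
```

By the delta method, the variance of M is approximately the variance of β multiplied by the squared slope, so an identical error in β produces a much larger error in M near 0 and 1 than near 0.5, which is the source of the U-shaped mean-variance trend.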
|
|
|
|
|
|
-\begin_layout Itemize
|
|
|
+\begin_layout Subsubsection
|
|
|
+The voom method for RNA-seq data can model the heteroskedasticity
|
|
|
+\end_layout
|
|
|
+
|
|
|
+\begin_layout Standard
|
|
|
+RNA-seq read count data are also known to show heteroskedasticity, and the
|
|
|
+ voom method was developed for modeling this heteroskedasticity by estimating
|
|
|
+ the mean-variance trend in the data and using this trend to assign precision
|
|
|
+ weights to each observation
|
|
|
+\begin_inset CommandInset citation
|
|
|
+LatexCommand cite
|
|
|
+key "Law2013"
|
|
|
+literal "false"
|
|
|
+
|
|
|
+\end_inset
|
|
|
+
|
|
|
+.
|
|
|
+ While methylation array data are not derived from counts,
|
|
|
+\end_layout
|
|
|
+
|
|
|
+\begin_layout Standard
|
|
|
Voom method, originally developed for RNA-seq data, can model mean-variance
|
|
|
dependence
|
|
|
\end_layout
|
|
|
|
|
|
-\begin_deeper
|
|
|
-\begin_layout Itemize
|
|
|
+\begin_layout Standard
|
|
|
Standard implementation of voom assumes the input is read counts, and adjustment
|
|
|
s are required to run it on M-values.
|
|
|
\end_layout
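The idea of turning a mean-variance trend into precision weights can be sketched for M-values as follows. This is an illustration under several assumptions and not the voom implementation in limma: a quadratic polynomial stands in for the lowess smoother, the design is intercept-only so each probe's fitted value is its row mean, and the function name and simulated data are hypothetical.

```python
import numpy as np

def precision_weights(values, degree=2):
    """Weights from a smoothed trend of sqrt(standard deviation) vs. mean.

    values: probes-by-samples matrix (e.g. M-values).
    A polynomial fit is used here in place of a lowess smoother.
    """
    means = values.mean(axis=1)
    sqrt_sd = np.sqrt(values.std(axis=1, ddof=1))
    coef = np.polyfit(means, sqrt_sd, deg=degree)
    # Predicted sqrt(sd) at each probe's fitted value (its mean, under an
    # intercept-only design); clipped so the trend stays positive.
    fitted = np.clip(np.polyval(coef, means), 1e-3, None)
    # fitted**4 is the predicted variance, so the weight is its inverse.
    per_probe = 1.0 / fitted**4
    return np.tile(per_probe[:, None], (1, values.shape[1]))

# Simulated M-values (hypothetical) with a U-shaped mean-variance trend.
rng = np.random.default_rng(2)
true_means = rng.uniform(-5.0, 5.0, size=2000)
true_sd = 0.2 + 0.05 * true_means**2
m_values = true_means[:, None] + true_sd[:, None] * rng.normal(size=(2000, 6))

weights = precision_weights(m_values)
central = weights[np.abs(true_means) < 1.0].mean()
extreme = weights[np.abs(true_means) > 4.0].mean()
print("Mean weight, central probes:", central)
print("Mean weight, extreme probes:", extreme)
```

Probes with extreme M-values receive lower weights than probes with moderate M-values, so the noisier observations contribute less to a subsequent weighted linear model fit.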
|
|
|
|
|
|
-\begin_layout Itemize
|
|
|
+\begin_layout Standard
|
|
|
\begin_inset Flex TODO Note (inline)
|
|
|
status open
|
|
|
|
|
@@ -1039,8 +1106,7 @@ Put code on Github and reference it
|
|
|
|
|
|
\end_layout
|
|
|
|
|
|
-\end_deeper
|
|
|
-\begin_layout Itemize
|
|
|
+\begin_layout Standard
|
|
|
Other methods, such as duplicateCorrelation and arrayWeights, are also applicabl
|
|
|
e with no need for custom adaptation
|
|
|
\end_layout
|
|
@@ -1106,11 +1172,11 @@ fRMA eliminates unwanted dependence of classifier training on normalization
|
|
|
strategy caused by RMA
|
|
|
\end_layout
|
|
|
|
|
|
-\begin_layout Itemize
|
|
|
-Data set consists of training set (23 TX, 35 AR, 21 ADNR), validation set
|
|
|
- (23 TX, 34 AR, 21 ADNR), and external validation set gathered from public
|
|
|
- GEO data (37 TX, 38 AR, no ADNR), all on standard hgu133plus2 Affy arrays
|
|
|
-
|
|
|
+\begin_layout Standard
|
|
|
+The initial data set for testing fRMA consisted of 157 hgu133plus2 arrays,
|
|
|
+ split into a training set (23 TX, 35 AR, 21 ADNR) and a validation set (23
|
|
|
+ TX, 34 AR, 21 ADNR), plus an external validation set gathered from public
|
|
|
+ GEO data (37 TX, 38 AR, no ADNR), also on hgu133plus2 arrays
|
|
|
\begin_inset CommandInset citation
|
|
|
LatexCommand cite
|
|
|
key "Kurian2014"
|