hace 5 años · b6928049d1
--- a/examples/Salomon/Teaching/Cyno-RNASeq-SummarizedExperiment.RDS
+++ b/examples/Salomon/Teaching/Cyno-RNASeq-SummarizedExperiment.RDS
--- a/examples/Salomon/Teaching/RNA-Seq.Rmd
+++ b/examples/Salomon/Teaching/RNA-Seq.Rmd
@@ -0,0 +1,554 @@
 
				+---
			
 
				+title: "RNA-seq study of cynomolgus monkey mesenchymal stem cells treated with interferon gamma"
			
 
				+author: "Ryan C. Thompson"
			
 
				+date: "May 8, 2018"
			
 
				+output:
			
 
				+  html_document:
			
 
				+    fig_caption: true
			
 
				+  pdf_document:
			
 
				+    fig_caption: true
			
 
				+---
			
 
				+
			
 
				+```{r setup, include=FALSE}
			
 
				+knitr::opts_chunk$set(echo = TRUE, retina=2, cache=TRUE, autodep=TRUE)
			
 
				+```
			
 
				+
			
 
				+# Introduction
			
 
				+
			
 
				+Today we're going to look at a data set from my work. This data comes
			
 
				+from menchymnal stem cells (MSCs) cultured from 9 cynomolgus monkeys
			
 
				+(a closely related species to rhesus macaques). MSCs are known to have
			
 
				+immune modulatory effects, and the main goal of the project is to test
			
 
				+whether these stem cells, when activated with interferon gamma (IFNg),
			
 
				+can help suppress the body's immune response against an organ
			
 
				+transplant.
			
 
				+
			
 
				+![Known effects of MSC which may benefit organ transplants](images/msc1-cropped.png)
			
 
				+
			
 
				+The purpose of this specific data set is simply to determine which
			
 
				+genes' expression is affected when MSCs are treated with IFNg. As
			
 
				+such, we have both untreated (Control) samples and IFNg-treated
			
 
				+samples from 3 different passages of the cell cultures. So, we'll be
			
 
				+using limma to test each gene for differential expression between
			
 
				+Control and IFNg samples.
			
 
				+
			
 
				+# Preliminary setup
			
 
				+
			
 
				+First, let's install all the packages we'll need.
			
 
				+
			
 
				+```{r install_pkgs, eval=FALSE}
			
 
				+needed_packages <- c("edgeR", "limma", "SummarizedExperiment", "ggplot2", "locfit", "dplyr", "statmod")
			
 
				+already_installed_packages <- rownames(installed.packages())
			
 
				+need_to_install <- setdiff(needed_packages, already_installed_packages)
			
 
				+if (length(need_to_install) > 0) {
			
 
				+    ## http://bioconductor.org/install/
			
 
				+    source("https://bioconductor.org/biocLite.R")
			
 
				+    biocLite(need_to_install)
			
 
				+}
			
 
				+```
			
 
				+
			
 
				+Then we'll load those packages.
			
 
				+
			
 
				+```{r load_pkgs, results="hide", message=FALSE}
			
 
				+library(SummarizedExperiment)
			
 
				+library(edgeR)
			
 
				+library(limma)
			
 
				+library(ggplot2)
			
 
				+library(dplyr)
			
 
				+```
			
 
				+
			
 
				+# Initial data loading and exploration
			
 
				+
			
 
				+Let's load the data file. This uses the `readRDS` function, which
			
 
				+reads a single R object from a file and returns it. We assign the
			
 
				+result to a variable.
			
 
				+
			
 
				+```{r loading_data}
			
 
				+sexp <- readRDS(gzcon(url("https://www.dropbox.com/s/4lx6q9fgyz66rky/Cyno-RNASeq-SummarizedExperiment.RDS?dl=1")))
			
 
				+print(sexp)
			
 
				+```
			
 
				+
			
 
				+This object is called a SummarizedExperiment, and its purpose is to
			
 
				+hold the experimental data, sample metadata, gene metadata, and any
			
 
				+other relevant information for the experiment all in one place. This
			
 
				+makes it an excellent starting point for any analysis. In our case,
			
 
				+the experimental data consists of a matrix of counts for each gene in
			
 
				+each sample. This count matrix was generated by aligning all the RNA
			
 
				+sequence reads in each sample to the
			
 
				+[cyno genome](http://www.ncbi.nlm.nih.gov/genome/776) and then
			
 
				+counting the number of uniquely-mapped reads that overlap each gene. I
			
 
				+have done the aligning and counting process for you and provided the
			
 
				+count matrix, since the alignment and counting would take many hours
			
 
				+to commplete.
			
 
				+
			
 
				+Look at the output of the print statement above. The RNA-Seq read
			
 
				+counts are contained in the `assays` slot. The `colData` slot contains
			
 
				+information on the samples, while the `rowRanges` slot contains
			
 
				+information on the genes. The rows have "ranges" instead of just
			
 
				+"data" because in addition to things like the gene symbol and
			
 
				+description, the `rowRanges` slot also contains the genomic
			
 
				+coordinates of each gene, that is, the chromosome, start/end positions
			
 
				+of each exon, and the strand. This is the same information that was
			
 
				+used to count the overlapping reads for each gene, but we won't be
			
 
				+needing it today.
			
 
				+
			
 
				+You can read more about SummarizedExperiment objects
			
 
				+[here](http://bioconductor.org/packages/release/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html).
			
 
				+
			
 
				+![Structure of a SummarizedExperiment object](images/SEXP.svg)
			
 
				+
			
 
				+* How many samples does this SummarizedExperiment object contain? How
			
 
				+  many genes does it have? Where did you find this information?
			
 
				+* What happens when you try to select a subset of rows or columns?
			
 
				+
			
 
				+## Extracting the data and metadata
			
 
				+
			
 
				+Ok, let's pull out the information that we want from this
			
 
				+SummarizedExperiment.
			
 
				+
			
 
				+First, the sample information:
			
 
				+
			
 
				+```{r extract_sampletable}
			
 
				+sample_table <- as.data.frame(colData(sexp))
			
 
				+```
			
 
				+
			
 
				+Next, the gene information:
			
 
				+
			
 
				+```{r extract_geneinfo}
			
 
				+## Gene metadata
			
 
				+gene_info <- as.data.frame(mcols(sexp))
			
 
				+```
			
 
				+
			
 
				+And finally, the count matrix:
			
 
				+
			
 
				+```{r extract_counts}
			
 
				+count_matrix <- assay(sexp)
			
 
				+```
			
 
				+
			
 
				+Examine the `sample_table`, `gene_info`, and `count_matrix` objects.
			
 
				+Try `dim`, `class`, `nrow`, `ncol`, and similar. Try viewing the first
			
 
				+few rows or columns of each one.
			
 
				+
			
 
				+* What kind of object is each one?
			
 
				+* What kind of data do they contain?
			
 
				+* Which dimensions do they have in common? How do these dimensions
			
 
				+  match up to the dimension of the SummarizedExperiment?
			
 
				+
			
 
				+```{r example_sexp_components, include=FALSE}
			
 
				+dim(sexp)
			
 
				+dim(sample_table)
			
 
				+dim(gene_info)
			
 
				+dim(count_matrix)
			
 
				+
			
 
				+head(sample_table)
			
 
				+head(gene_info)
			
 
				+## Find the IFNG gene
			
 
				+gene_info[which(gene_info$symbol == "IFNG"),]
			
 
				+count_matrix[1:10,1:6]
			
 
				+```
			
 
				+
			
 
				+## Learning about the experimental design from the sample table
			
 
				+
			
 
				+Before we get to the analysis, we need to have a closer look at the
			
 
				+sample table, to gain a better understanding of the experimental
			
 
				+design.
			
 
				+
			
 
				+```{r head_sample_table}
			
 
				+names(sample_table)
			
 
				+head(sample_table)
			
 
				+```
			
 
				+
			
 
				+* How many Labs/Runs/Animals/Passages/Treatments are there in this
			
 
				+  experiment?
			
 
				+* How many samples are in each one?
			
 
				+
			
 
				+(You may find the `unique` and `table` functions useful for answering
			
 
				+these questions.)
			
 
				+
			
 
				+```{r tabulate_samples, include=FALSE}
			
 
				+lapply(sample_table[-1], unique)
			
 
				+lapply(sample_table[-1], table)
			
 
				+```
			
 
				+* How many Animals are there in each Lab?
			
 
				+* Are there Animals with samples from both Labs?
			
 
				+
			
 
				+```{r tabulate_animals_in_labs}
			
 
				+table(sample_table$Animal.ID, sample_table$Lab)
			
 
				+```
			
 
				+
			
 
				+* How many sequencing Runs are there in each Lab?
			
 
				+* How many Animals are there in each sequencing Run?
			
 
				+* Are there Animals with samples from multiple sequencing Runs?
			
 
				+* Is it better to put all of each animal's samples in the same run, or
			
 
				+  is it better to distribute each animal's samples across multiple
			
 
				+  runs?
			
 
				+
			
 
				+```{r tabulate_more_things, include=FALSE}
			
 
				+table(sample_table$Animal.ID, sample_table$Run)
			
 
				+table(sample_table$Run, sample_table$Lab)
			
 
				+```
			
 
				+
			
 
				+# Performing a basic limma analysis #
			
 
				+
			
 
				+## Filtering non-expressed genes ##
			
 
				+
			
 
				+Our count matrix contains information on every known gene in the cyno
			
 
				+genome, but we only want to look at the genes that are expressed in
			
 
				+MSCs. Many genes will not be expressed at all and will have zero or
			
 
				+very few counts in all samples, and the statistical method breaks down
			
 
				+when it is fed all zeros. So we need to decide on a threshold of gene
			
 
				+detection and filter out all the genes whose average expression does
			
 
				+not reach this threshold. Let's see how a threshold of logCPM = 1
			
 
				+looks on a histogram and a QQ plot against normal distribution
			
 
				+quantiles.
			
 
				+
			
 
				+```{r plot_cpm_threshold, fig.cap="logCPM histogram with threshold line"}
			
 
				+mean_log_cpm <- aveLogCPM(count_matrix)
			
 
				+filter_threshold <- 1
			
 
				+ggplot() + aes(x=mean_log_cpm) +
			
 
				+    geom_histogram(binwidth=0.2) +
			
 
				+    geom_vline(xintercept=filter_threshold) +
			
 
				+    ggtitle("Histogram of mean expression values")
			
 
				+```
			
 
				+
			
 
				+```{r plot_cpm_qq, fig.cap="logCPM QQ plot with threshold line"}
			
 
				+qqnorm(mean_log_cpm); abline(h=filter_threshold)
			
 
				+```
			
 
				+
			
 
				+* What do the shapes of these two plots indicate?
			
 
				+* How can we justify our choice of threshold using these plots?
			
 
				+* Would this same detection threshold be appropriate for other
			
 
				+  experiments?
			
 
				+* What biological or experimental factors might influence the choice
			
 
				+  of threshold?
			
 
				+
			
 
				+Having chosen our threshold, let's pick the subset of genes whose
			
 
				+average expression passes that threshold.
			
 
				+
			
 
				+```{r filter_genes}
			
 
				+keep_genes <- mean_log_cpm >= 1
			
 
				+filtered_count_matrix <- count_matrix[keep_genes,]
			
 
				+filtered_gene_info <- gene_info[keep_genes,]
			
 
				+nrow(filtered_count_matrix)
			
 
				+```
			
 
				+
			
 
				+* What fraction of the genes in the genome are considered expressed
			
 
				+  according to our threshold?
			
 
				+* Is this number sensitive to our choice of threshold? How much of a
			
 
				+  difference does it make if we use a threshold of 0.5 or 1.5 instead?
			
 
				+
			
 
				+```{r threshold_sensitivity, include=FALSE}
			
 
				+mean(keep_genes)
			
 
				+mean(mean_log_cpm >= 1.5)
			
 
				+mean(mean_log_cpm >= 0.5)
			
 
				+```
			
 
				+
			
 
				+## Normalization ##
			
 
				+
			
 
				+Ok, now that we have some understanding of the experimental design,
			
 
				+let's find some differentially expressed genes! We'll start by
			
 
				+computing the total counts in each sample and then normalizing these
			
 
				+total counts using the Trimmed Mean of M-values (TMM) method, provided
			
 
				+by the `calcNormFactors` function.
			
 
				+
			
 
				+```{r calcNormFactors}
			
 
				+total_counts <- colSums(count_matrix)
			
 
				+nf <- calcNormFactors(count_matrix, lib.size=total_counts, method="TMM")
			
 
				+print(nf)
			
 
				+normalized_total_counts <- total_counts * nf
			
 
				+```
			
 
				+
			
 
				+* What is the range of total_counts/normalization factors? (Try the
			
 
				+  `summary` function)
			
 
				+* What is the average total count per sample? Does this seem high or
			
 
				+  low? How might we expect this to affect our results?
			
 
				+
			
 
				+```{r inspect_nf, include=FALSE}
			
 
				+summary(total_counts)
			
 
				+summary(nf)
			
 
				+```
			
 
				+
			
 
				+## Fitting the linear models ##
			
 
				+
			
 
				+Now that we have computed our normalization and filtered out
			
 
				+non-expressed genes, it's time to fit our linear models. The basic
			
 
				+model fitting function provided by limma is `lmFit`, which is
			
 
				+essentially a shortcut for running `lm` once for each gene, using the
			
 
				+same model formula each time.
			
 
				+
			
 
				+First, we will construct our design matrix by selecting an appropriate
			
 
				+model formula. This step is performed automatically when you run `lm`,
			
 
				+but for `lmFit` we must do it manually. For our first model, we'll
			
 
				+keep it simple and use Treatment as the only covariate.
			
 
				+
			
 
				+```{r design}
			
 
				+design <- model.matrix(~ Treatment, data=sample_table)
			
 
				+head(design)
			
 
				+```
			
 
				+
			
 
				+Note that the design has two columns, representing the two
			
 
				+coefficients in our model. The first one, named `(Intercept)`,
			
 
				+represents the expression level of the Control samples, while the
			
 
				+second coefficient represents the difference between Control and IFNg
			
 
				+treatments. This is the coefficient that we will test for differential
			
 
				+expression once we have fit our model. This model will be equivalent
			
 
				+to a simple two-sample t-test.
			
 
				+
			
 
				+Before we fit our model, we have to run `voom` to compensate for the
			
 
				+heteroskedasticity of the counts. (While we're at it, we also insert
			
 
				+the gene metadata into the resulting object. This will allow limma to
			
 
				+include the gene metadata its result tables automatically.)
			
 
				+
			
 
				+```{r voom, fig.cap="Voom diagnostic plot"}
			
 
				+v <- voom(filtered_count_matrix, design, lib.size=normalized_total_counts, plot=TRUE)
			
 
				+v$genes <- filtered_gene_info
			
 
				+```
			
 
				+
			
 
				+Running `voom` with `plot=TRUE` produces a diagnostic plot showing the
			
 
				+empirical relationship between log count and variance. The fitted
			
 
				+curve is a running average of the cloud of points, and this curve is
			
 
				+what voom uses to assign a weight to each observed count.
			
 
				+
			
 
				+* According to the plot, which count would have the highest weight
			
 
				+  (i.e. the lowest expected variance): 4, 30, or 30000? (Remember the
			
 
				+  log2 scale)
			
 
				+* Which would have the lowest weight?
			
 
				+* Is the relationship between log count and variance monotonic? Why
			
 
				+  might it not be monotonic, if higher counts are supposed to be more
			
 
				+  precise?
			
 
				+* Why does `voom` need to know our experimental design?
			
 
				+
			
 
				+The object returned by `voom` is an `EList`, which is a complex object
			
 
				+similar in spirit to a SummarizedExperiment. You don't need to know
			
 
				+anything about its internals. Just know that it contains both the
			
 
				+matrix of log2(CPM) values and the corresponding matrix of weights. It
			
 
				+also contains the gene info, which we added manually. Together with
			
 
				+the design matrix, this is everything we need to fit our linear
			
 
				+models.
			
 
				+
			
 
				+```{r lmfit}
			
 
				+fit <- lmFit(v, design)
			
 
				+fit <- eBayes(fit, robust=TRUE)
			
 
				+```
			
 
				+
			
 
				+Recall that the `eBayes` function is responsible for squeezing each
			
 
				+gene's sample variance toward the overall mean variance of the whole
			
 
				+data set, in the process trading a bit of bias for stability.
			
 
				+
			
 
				+## Getting our results ##
			
 
				+
			
 
				+Anyway, now we can perform a "moderated t-test". "Moderated" means
			
 
				+that we are substituting the empirical Bayes squeezed variance for the
			
 
				+sample variance in the formula for the t statistic (and also adjusting
			
 
				+the degrees of freedom term accordingly). To get our results, we call
			
 
				+`topTable`, telling it which coefficient we wish to test, along with
			
 
				+which multiple testing correction to use on the p-values. The `n=Inf`
			
 
				+tells it to give us the results for all the genes in the data set.
			
 
				+
			
 
				+```{r results}
			
 
				+results <- topTable(fit, coef="TreatmentIFNg", adjust.method="BH", n=Inf)
			
 
				+head(results)
			
 
				+```
			
 
				+
			
 
				+* What is the estimated FDR for the most significant gene (the FDR is
			
 
				+  stored in the `adj.P.Val` column)?
			
 
				+* How many genes are significantly differentially expressed at a
			
 
				+  threshold of 10% FDR?
			
 
				+* Do you see any genes that make biological sense in the top few
			
 
				+  results?
			
 
				+
			
 
				+```{r num_sig, include=FALSE}
			
 
				+results$adj.P.Val[1]
			
 
				+table(results$adj.P.Val <= 0.1)
			
 
				+```
			
 
				+
			
 
				+Now let's inspect the p-value histogram. For reference, we'll add in a
			
 
				+horizontal line indicating what a uniform distribution would look
			
 
				+like.
			
 
				+
			
 
				+```{r pval_hist, fig.cap="P-value histogram for the first analysis"}
			
 
				+ggplot(results) +
			
 
				+    aes(x=P.Value) +
			
 
				+    geom_histogram(aes(y=..density..), binwidth=0.025, boundary=0) +
			
 
				+    geom_hline(yintercept=1) +
			
 
				+    ggtitle("P-value distribution for Control vs Treatment")
			
 
				+```
			
 
				+
			
 
				+* What can we conclude from this histogram?
			
 
				+
			
 
				+Lastly, we can generate an MA plot showing the log2 fold change vs the
			
 
				+log2 CPM for each gene:
			
 
				+
			
 
				+```{r maplot}
			
 
				+ggplot(arrange(results, desc(P.Value))) +
			
 
				+    aes(x=AveExpr, y=logFC,
			
 
				+         color=ifelse(adj.P.Val <= 0.1, "FDR <= 10%", "FDR > 10%")) +
			
 
				+    geom_point(size=0.1) +
			
 
				+    scale_color_hue(name="Significance") +
			
 
				+    theme(legend.justification=c(1,1), legend.position=c(1,1)) +
			
 
				+    ggtitle("MA Plot, IFNg vs Control")
			
 
				+```
			
 
				+
			
 
				+Another common plot is the volcano plot, which plots significance vs
			
 
				+log fold change:
			
 
				+
			
 
				+```{r volcanoplot}
			
 
				+ggplot(arrange(results, desc(P.Value))) +
			
 
				+    aes(x=logFC, y=-log10(P.Value),
			
 
				+         color=ifelse(adj.P.Val <= 0.1, "FDR <= 10%", "FDR > 10%")) +
			
 
				+    geom_point(size=0.1) +
			
 
				+    scale_color_hue(name="Significance") +
			
 
				+    theme(legend.justification=c(1,1), legend.position=c(1,1)) +
			
 
				+    ggtitle("Volcano Plot, IFNg vs Control")
			
 
				+```
			
 
				+
			
 
				+* Based on these plots, are the changes balanced between up and down,
			
 
				+  or is there a bias toward a certain direction of change? Does this
			
 
				+  make sense for the biology of interferon treatment?
			
 
				+
			
 
				+
			
 
				+# Improving the analysis by exploring the data
			
 
				+
			
 
				+We got pretty good results above, but how do we know that we analyzed
			
 
				+the data correctly? Are there any important covariates that we should
			
 
				+have included in our model? How can we figure out which covariates are
			
 
				+important? One way is to do a PCA plot. Limma actually provides
			
 
				+something called an MDS or PCoA plot, which is slightly different from
			
 
				+a PCA plot but serves the same purpose.
			
 
				+
			
 
				+```{r mds_init, fig.cap="Basic MDS plot from limma"}
			
 
				+mds <- data.frame(plotMDS(v)[c("x", "y")])
			
 
				+```
			
 
				+
			
 
				+Limma's `plotMDS` function creates an MDS plot using the sample
			
 
				+labels, but this results in a very crowded plot. Luckily, it also
			
 
				+returns the x and y coordinates of the plot, which we can use to make
			
 
				+our own. Because we want to see how each covariate relates to the
			
 
				+principal coordinates, let's make several versions of our MDS plot,
			
 
				+colored by each covariate.
			
 
				+
			
 
				+```{r mds, results="hide"}
			
 
				+mds <- cbind(mds, sample_table)
			
 
				+p <- ggplot(mds) +
			
 
				+    aes(x=x, y=y) +
			
 
				+    xlab("PC1") + ylab("PC2") +
			
 
				+    geom_point(size=3) +
			
 
				+    coord_fixed(ratio=1) +
			
 
				+    ggtitle("Sample MDS Plot")
			
 
				+for (i in c("Lab", "Run", "Animal.ID", "Passage", "Treatment")) {
			
 
				+    print(p + aes_string(color=i) +
			
 
				+          ggtitle(paste("Sample MDS Plot Colored by", i)))
			
 
				+}
			
 
				+```
			
 
				+
			
 
				+* Based on the MDS plots, which variables seem to be important, and
			
 
				+  which do not?
			
 
				+
			
 
				+## Investigating an outlier
			
 
				+
			
 
				+You might have noticed that a few of the IFNg-treated samples cluster
			
 
				+with the Control samples. These look like possible outlier samples,
			
 
				+and we should investigate them, since they could interfere with our
			
 
				+model fit. Let's try coloring by Animal ID and using a different shape
			
 
				+for the IFNg samples.
			
 
				+
			
 
				+```{r id_outlier, fig.cap="MDS plot for identifying the outlier animal"}
			
 
				+p + aes(color=Animal.ID, shape=Treatment)
			
 
				+```
			
 
				+
			
 
				+* Can you identify the misbehaving Animal?
			
 
				+
			
 
				+Let's remove the samples for this animal from the data set and repeat
			
 
				+the whole limma analysis.
			
 
				+
			
 
				+```{r filter_outlier_animal}
			
 
				+bad_animal <- "CN8351"
			
 
				+selected_samples <- !(sample_table$Animal.ID %in% bad_animal)
			
 
				+
			
 
				+good_sample_table <- droplevels(sample_table[selected_samples,])
			
 
				+good_filtered_count_matrix <- count_matrix[keep_genes,selected_samples]
			
 
				+good_normalized_total_counts <- normalized_total_counts[selected_samples]
			
 
				+
			
 
				+design <- model.matrix(~ Treatment, data=good_sample_table)
			
 
				+v <- voom(good_filtered_count_matrix, design, lib.size=good_normalized_total_counts)
			
 
				+v$genes <- filtered_gene_info
			
 
				+fit <- lmFit(v, design)
			
 
				+fit <- eBayes(fit, robust=TRUE)
			
 
				+results <- topTable(fit, coef="TreatmentIFNg", adjust.method="BH", n=Inf)
			
 
				+table(results$adj.P.Val <= 0.1)
			
 
				+```
			
 
				+
			
 
				+* How did removing the outlier animal affect the number of
			
 
				+  differentially expressed genes?
			
 
				+
			
 
				+Hopefully this demonstrates the importance of properly exploring your
			
 
				+data before deciding on a model. You can always fit any model to the
			
 
				+data, but the model will give you some kind of answer, sometimes even
			
 
				+a plausible-looking one, whether or not that model is a good fit for
			
 
				+the data. By eliminating an outlier, we significantly increased our
			
 
				+ability to detect differential expression. Similarly, the focus of the
			
 
				+homework will be to explore the consequences of adding additional
			
 
				+covariates into the model formula.
			
 
				+
			
 
				+# Homework: Basic Model Selection
			
 
				+
			
 
				+Try fitting models with Animal.ID, Passage, or both in addition to
			
 
				+Treatment. In other words, try all four of the following model
			
 
				+formulae:
			
 
				+
			
 
				+```{r model_options, eval=FALSE}
			
 
				+~ Treatment # (This is the model we just fit in class)
			
 
				+~ Animal.ID + Treatment
			
 
				+~ Passage + Treatment
			
 
				+~ Animal.ID + Passage + Treatment
			
 
				+```
			
 
				+
			
 
				+Use the filtered dataset with the outlier animal samples removed. In
			
 
				+addition to testing Treatment for differential expression in each
			
 
				+model, also test the other covariates for differential expression by
			
 
				+passing a different value for the `coef` argument to `topTable`. You
			
 
				+will need to pass a vector of all the columns of the design matrix
			
 
				+relevant to the covariate. Below, I have included an example of how to
			
 
				+fit the model and test for differential expression for the formula
			
 
				+`~Passage + Treatment`. (Hint: use `colnames(design)` to help you
			
 
				+figure out which coefficients to test.)
			
 
				+
			
 
				+```{r homework_example, eval=FALSE}
			
 
				+design <- model.matrix(~ Passage + Treatment, data=good_sample_table)
			
 
				+v <- voom(good_filtered_count_matrix, design, lib.size=good_normalized_total_counts)
			
 
				+v$genes <- filtered_gene_info
			
 
				+fit <- lmFit(v, design)
			
 
				+fit <- eBayes(fit, robust=TRUE)
			
 
				+results <- topTable(fit, coef="TreatmentIFNg", adjust.method="BH", n=Inf)
			
 
				+table(results$adj.P.Val <= 0.1)
			
 
				+passage.results <- topTable(fit, coef=c("PassageP5", "PassageP6"), n=Inf)
			
 
				+table(passage.results$adj.P.Val <= 0.1)
			
 
				+```
			
 
				+
			
 
				+Based on your results for all four models, decide which model best
			
 
				+fits the data. Justify your choice with MDS plots and p-value
			
 
				+distribution plots that show why you are including or excluding each
			
 
				+of Animal.ID and Passage from your model. How do your results for
			
 
				+Treatment change when you include your chosen covariates? Does your
			
 
				+model give more differentially expressed genes than the one we fit in
			
 
				+class? If so, how many more? How many genes are differentially
			
 
				+expressed with respect to your chosen covariates? Do these results
			
 
				+still hold when using a different signifcance threshold, such as 5% or
			
 
				+1% FDR instead of 10%?
			
 
				+
			
 
				+For help fitting models with multiple covariates, see the
			
 
				+[Limma User's Guide](https://www.bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf)
			
 
				+section 9.4, "Additive Models and Blocking".
			
 
				+
			
 
				+# Homework shortcut: `selectModel()`
			
 
				+
			
 
				+```{r homework_shortcut}
			
 
				+designList <- list(
			
 
				+  TrtOnly=model.matrix(~ Treatment, data=good_sample_table),
			
 
				+  TrtPass=model.matrix(~ Treatment + Passage, data=good_sample_table),
			
 
				+  TrtAnimal=model.matrix(~ Treatment + Animal.ID, data=good_sample_table),
			
 
				+  TrtPassAnimal=model.matrix(~ Treatment + Animal.ID + Passage, data=good_sample_table))
			
 
				+v <- voom(good_filtered_count_matrix, designList$TrtPassAnimal, lib.size=good_normalized_total_counts)
			
 
				+sm <- selectModel(v, designlist = designList, criterion = "aic")
			
 
				+table(sm$pref)
			
 
				+```