123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493494495496497498499500501502503504505506507508509510511512513514515516517518519520521522523524525526527528529530531532533534535536537538539540541542543544545546547548549550551552553554555556 |
- ---
- title: RNA-seq study of cynomolgus monkey mesenchymal stem cells treated with interferon
- gamma
- author: "Ryan C. Thompson"
- date: "May 8, 2018"
- output:
- html_notebook:
- fig_caption: yes
- pdf_document:
- fig_caption: yes
- ---
- ```{r setup, include=FALSE}
- knitr::opts_chunk$set(echo = TRUE, retina=2, cache=TRUE, autodep=TRUE)
- ```
- # Introduction
- Today we're going to look at a data set from my work. This data comes
- from menchymnal stem cells (MSCs) cultured from 9 cynomolgus monkeys
- (a closely related species to rhesus macaques). MSCs are known to have
- immune modulatory effects, and the main goal of the project is to test
- whether these stem cells, when activated with interferon gamma (IFNg),
- can help suppress the body's immune response against an organ
- transplant.
- 
- The purpose of this specific data set is simply to determine which
- genes' expression is affected when MSCs are treated with IFNg. As
- such, we have both untreated (Control) samples and IFNg-treated
- samples from 3 different passages of the cell cultures. So, we'll be
- using limma to test each gene for differential expression between
- Control and IFNg samples.
- # Preliminary setup
- First, let's install all the packages we'll need.
- ```{r install_pkgs, eval=FALSE}
- if (!requireNamespace("BiocManager", quietly = TRUE))
- install.packages("BiocManager")
- already_installed_packages <- rownames(installed.packages())
- needed_packages <- c("edgeR", "limma", "SummarizedExperiment", "ggplot2", "locfit", "dplyr", "statmod")
- need_to_install <- setdiff(needed_packages, already_installed_packages)
- if (length(need_to_install) > 0) {
- ## http://bioconductor.org/install/
- BiocManager::install(need_to_install)
- }
- ```
- Then we'll load those packages.
- ```{r load_pkgs, results="hide", message=FALSE}
- library(SummarizedExperiment)
- library(edgeR)
- library(limma)
- library(ggplot2)
- library(dplyr)
- ```
- # Initial data loading and exploration
- Let's load the data file. This uses the `readRDS` function, which
- reads a single R object from a file and returns it. We assign the
- result to a variable.
- ```{r loading_data}
- sexp <- readRDS(gzcon(url("https://darwinawardwinner.github.io/resume/examples/Salomon/Teaching/Cyno-RNASeq-SummarizedExperiment.RDS")))
- print(sexp)
- ```
- This object is called a SummarizedExperiment, and its purpose is to
- hold the experimental data, sample metadata, gene metadata, and any
- other relevant information for the experiment all in one place. This
- makes it an excellent starting point for any analysis. In our case,
- the experimental data consists of a matrix of counts for each gene in
- each sample. This count matrix was generated by aligning all the RNA
- sequence reads in each sample to the
- [cyno genome](http://www.ncbi.nlm.nih.gov/genome/776) and then
- counting the number of uniquely-mapped reads that overlap each gene. I
- have done the aligning and counting process for you and provided the
- count matrix, since the alignment and counting would take many hours
- to commplete.
- Look at the output of the print statement above. The RNA-Seq read
- counts are contained in the `assays` slot. The `colData` slot contains
- information on the samples, while the `rowRanges` slot contains
- information on the genes. The rows have "ranges" instead of just
- "data" because in addition to things like the gene symbol and
- description, the `rowRanges` slot also contains the genomic
- coordinates of each gene, that is, the chromosome, start/end positions
- of each exon, and the strand. This is the same information that was
- used to count the overlapping reads for each gene, but we won't be
- needing it today.
- You can read more about SummarizedExperiment objects
- [here](http://bioconductor.org/packages/release/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html).
- 
- * How many samples does this SummarizedExperiment object contain? How
- many genes does it have? Where did you find this information?
- * What happens when you try to select a subset of rows or columns?
- ## Extracting the data and metadata
- Ok, let's pull out the information that we want from this
- SummarizedExperiment.
- First, the sample information:
- ```{r extract_sampletable}
- sample_table <- as.data.frame(colData(sexp))
- ```
- Next, the gene information:
- ```{r extract_geneinfo}
- ## Gene metadata
- gene_info <- as.data.frame(mcols(sexp))
- ```
- And finally, the count matrix:
- ```{r extract_counts}
- count_matrix <- assay(sexp)
- ```
- Examine the `sample_table`, `gene_info`, and `count_matrix` objects.
- Try `dim`, `class`, `nrow`, `ncol`, and similar. Try viewing the first
- few rows or columns of each one.
- * What kind of object is each one?
- * What kind of data do they contain?
- * Which dimensions do they have in common? How do these dimensions
- match up to the dimension of the SummarizedExperiment?
- ```{r example_sexp_components, include=FALSE}
- dim(sexp)
- dim(sample_table)
- dim(gene_info)
- dim(count_matrix)
- head(sample_table)
- head(gene_info)
- ## Find the IFNG gene
- gene_info[which(gene_info$symbol == "IFNG"),]
- count_matrix[1:10,1:6]
- ```
- ## Learning about the experimental design from the sample table
- Before we get to the analysis, we need to have a closer look at the
- sample table, to gain a better understanding of the experimental
- design.
- ```{r head_sample_table}
- names(sample_table)
- head(sample_table)
- ```
- * How many Labs/Runs/Animals/Passages/Treatments are there in this
- experiment?
- * How many samples are in each one?
- (You may find the `unique` and `table` functions useful for answering
- these questions.)
- ```{r tabulate_samples, include=FALSE}
- lapply(sample_table[-1], unique)
- lapply(sample_table[-1], table)
- ```
- * How many Animals are there in each Lab?
- * Are there Animals with samples from both Labs?
- ```{r tabulate_animals_in_labs}
- table(sample_table$Animal.ID, sample_table$Lab)
- ```
- * How many sequencing Runs are there in each Lab?
- * How many Animals are there in each sequencing Run?
- * Are there Animals with samples from multiple sequencing Runs?
- * Is it better to put all of each animal's samples in the same run, or
- is it better to distribute each animal's samples across multiple
- runs?
- ```{r tabulate_more_things, include=FALSE}
- table(sample_table$Animal.ID, sample_table$Run)
- table(sample_table$Run, sample_table$Lab)
- ```
- # Performing a basic limma analysis #
- ## Filtering non-expressed genes ##
- Our count matrix contains information on every known gene in the cyno
- genome, but we only want to look at the genes that are expressed in
- MSCs. Many genes will not be expressed at all and will have zero or
- very few counts in all samples, and the statistical method breaks down
- when it is fed all zeros. So we need to decide on a threshold of gene
- detection and filter out all the genes whose average expression does
- not reach this threshold. Let's see how a threshold of logCPM = 1
- looks on a histogram and a QQ plot against normal distribution
- quantiles.
- ```{r plot_cpm_threshold, fig.cap="logCPM histogram with threshold line"}
- mean_log_cpm <- aveLogCPM(count_matrix)
- filter_threshold <- 1
- ggplot() + aes(x=mean_log_cpm) +
- geom_histogram(binwidth=0.2) +
- geom_vline(xintercept=filter_threshold) +
- ggtitle("Histogram of mean expression values")
- ```
- ```{r plot_cpm_qq, fig.cap="logCPM QQ plot with threshold line"}
- qqnorm(mean_log_cpm); abline(h=filter_threshold)
- ```
- * What do the shapes of these two plots indicate?
- * How can we justify our choice of threshold using these plots?
- * Would this same detection threshold be appropriate for other
- experiments?
- * What biological or experimental factors might influence the choice
- of threshold?
- Having chosen our threshold, let's pick the subset of genes whose
- average expression passes that threshold.
- ```{r filter_genes}
- keep_genes <- mean_log_cpm >= 1
- filtered_count_matrix <- count_matrix[keep_genes,]
- filtered_gene_info <- gene_info[keep_genes,]
- nrow(filtered_count_matrix)
- ```
- * What fraction of the genes in the genome are considered expressed
- according to our threshold?
- * Is this number sensitive to our choice of threshold? How much of a
- difference does it make if we use a threshold of 0.5 or 1.5 instead?
- ```{r threshold_sensitivity, include=FALSE}
- mean(keep_genes)
- mean(mean_log_cpm >= 1.5)
- mean(mean_log_cpm >= 0.5)
- ```
- ## Normalization ##
- Ok, now that we have some understanding of the experimental design,
- let's find some differentially expressed genes! We'll start by
- computing the total counts in each sample and then normalizing these
- total counts using the Trimmed Mean of M-values (TMM) method, provided
- by the `calcNormFactors` function.
- ```{r calcNormFactors}
- total_counts <- colSums(count_matrix)
- nf <- calcNormFactors(count_matrix, lib.size=total_counts, method="TMM")
- print(nf)
- normalized_total_counts <- total_counts * nf
- ```
- * What is the range of total_counts/normalization factors? (Try the
- `summary` function)
- * What is the average total count per sample? Does this seem high or
- low? How might we expect this to affect our results?
- ```{r inspect_nf, include=FALSE}
- summary(total_counts)
- summary(nf)
- ```
- ## Fitting the linear models ##
- Now that we have computed our normalization and filtered out
- non-expressed genes, it's time to fit our linear models. The basic
- model fitting function provided by limma is `lmFit`, which is
- essentially a shortcut for running `lm` once for each gene, using the
- same model formula each time.
- First, we will construct our design matrix by selecting an appropriate
- model formula. This step is performed automatically when you run `lm`,
- but for `lmFit` we must do it manually. For our first model, we'll
- keep it simple and use Treatment as the only covariate.
- ```{r design}
- design <- model.matrix(~ Treatment, data=sample_table)
- head(design)
- ```
- Note that the design has two columns, representing the two
- coefficients in our model. The first one, named `(Intercept)`,
- represents the expression level of the Control samples, while the
- second coefficient represents the difference between Control and IFNg
- treatments. This is the coefficient that we will test for differential
- expression once we have fit our model. This model will be equivalent
- to a simple two-sample t-test.
- Before we fit our model, we have to run `voom` to compensate for the
- heteroskedasticity of the counts. (While we're at it, we also insert
- the gene metadata into the resulting object. This will allow limma to
- include the gene metadata its result tables automatically.)
- ```{r voom, fig.cap="Voom diagnostic plot"}
- v <- voom(filtered_count_matrix, design, lib.size=normalized_total_counts, plot=TRUE)
- v$genes <- filtered_gene_info
- ```
- Running `voom` with `plot=TRUE` produces a diagnostic plot showing the
- empirical relationship between log count and variance. The fitted
- curve is a running average of the cloud of points, and this curve is
- what voom uses to assign a weight to each observed count.
- * According to the plot, which count would have the highest weight
- (i.e. the lowest expected variance): 4, 30, or 30000? (Remember the
- log2 scale)
- * Which would have the lowest weight?
- * Is the relationship between log count and variance monotonic? Why
- might it not be monotonic, if higher counts are supposed to be more
- precise?
- * Why does `voom` need to know our experimental design?
- The object returned by `voom` is an `EList`, which is a complex object
- similar in spirit to a SummarizedExperiment. You don't need to know
- anything about its internals. Just know that it contains both the
- matrix of log2(CPM) values and the corresponding matrix of weights. It
- also contains the gene info, which we added manually. Together with
- the design matrix, this is everything we need to fit our linear
- models.
- ```{r lmfit}
- fit <- lmFit(v, design)
- fit <- eBayes(fit, robust=TRUE)
- ```
- Recall that the `eBayes` function is responsible for squeezing each
- gene's sample variance toward the overall mean variance of the whole
- data set, in the process trading a bit of bias for stability.
- ## Getting our results ##
- Anyway, now we can perform a "moderated t-test". "Moderated" means
- that we are substituting the empirical Bayes squeezed variance for the
- sample variance in the formula for the t statistic (and also adjusting
- the degrees of freedom term accordingly). To get our results, we call
- `topTable`, telling it which coefficient we wish to test, along with
- which multiple testing correction to use on the p-values. The `n=Inf`
- tells it to give us the results for all the genes in the data set.
- ```{r results}
- results <- topTable(fit, coef="TreatmentIFNg", adjust.method="BH", n=Inf)
- head(results)
- ```
- * What is the estimated FDR for the most significant gene (the FDR is
- stored in the `adj.P.Val` column)?
- * How many genes are significantly differentially expressed at a
- threshold of 10% FDR?
- * Do you see any genes that make biological sense in the top few
- results?
- ```{r num_sig, include=FALSE}
- results$adj.P.Val[1]
- table(results$adj.P.Val <= 0.1)
- ```
- Now let's inspect the p-value histogram. For reference, we'll add in a
- horizontal line indicating what a uniform distribution would look
- like.
- ```{r pval_hist, fig.cap="P-value histogram for the first analysis"}
- ggplot(results) +
- aes(x=P.Value) +
- geom_histogram(aes(y=..density..), binwidth=0.025, boundary=0) +
- geom_hline(yintercept=1) +
- ggtitle("P-value distribution for Control vs Treatment")
- ```
- * What can we conclude from this histogram?
- Lastly, we can generate an MA plot showing the log2 fold change vs the
- log2 CPM for each gene:
- ```{r maplot}
- ggplot(arrange(results, desc(P.Value))) +
- aes(x=AveExpr, y=logFC,
- color=ifelse(adj.P.Val <= 0.1, "FDR <= 10%", "FDR > 10%")) +
- geom_point(size=0.1) +
- scale_color_hue(name="Significance") +
- theme(legend.justification=c(1,1), legend.position=c(1,1)) +
- ggtitle("MA Plot, IFNg vs Control")
- ```
- Another common plot is the volcano plot, which plots significance vs
- log fold change:
- ```{r volcanoplot}
- ggplot(arrange(results, desc(P.Value))) +
- aes(x=logFC, y=-log10(P.Value),
- color=ifelse(adj.P.Val <= 0.1, "FDR <= 10%", "FDR > 10%")) +
- geom_point(size=0.1) +
- scale_color_hue(name="Significance") +
- theme(legend.justification=c(1,1), legend.position=c(1,1)) +
- ggtitle("Volcano Plot, IFNg vs Control")
- ```
- * Based on these plots, are the changes balanced between up and down,
- or is there a bias toward a certain direction of change? Does this
- make sense for the biology of interferon treatment?
- # Improving the analysis by exploring the data
- We got pretty good results above, but how do we know that we analyzed
- the data correctly? Are there any important covariates that we should
- have included in our model? How can we figure out which covariates are
- important? One way is to do a PCA plot. Limma actually provides
- something called an MDS or PCoA plot, which is slightly different from
- a PCA plot but serves the same purpose.
- ```{r mds_init, fig.cap="Basic MDS plot from limma"}
- mds <- data.frame(plotMDS(v)[c("x", "y")])
- ```
- Limma's `plotMDS` function creates an MDS plot using the sample
- labels, but this results in a very crowded plot. Luckily, it also
- returns the x and y coordinates of the plot, which we can use to make
- our own. Because we want to see how each covariate relates to the
- principal coordinates, let's make several versions of our MDS plot,
- colored by each covariate.
- ```{r mds, results="hide"}
- mds <- cbind(mds, sample_table)
- p <- ggplot(mds) +
- aes(x=x, y=y) +
- xlab("PC1") + ylab("PC2") +
- geom_point(size=3) +
- coord_fixed(ratio=1) +
- ggtitle("Sample MDS Plot")
- for (i in c("Lab", "Run", "Animal.ID", "Passage", "Treatment")) {
- print(p + aes_string(color=i) +
- ggtitle(paste("Sample MDS Plot Colored by", i)))
- }
- ```
- * Based on the MDS plots, which variables seem to be important, and
- which do not?
- ## Investigating an outlier
- You might have noticed that a few of the IFNg-treated samples cluster
- with the Control samples. These look like possible outlier samples,
- and we should investigate them, since they could interfere with our
- model fit. Let's try coloring by Animal ID and using a different shape
- for the IFNg samples.
- ```{r id_outlier, fig.cap="MDS plot for identifying the outlier animal"}
- p + aes(color=Animal.ID, shape=Treatment)
- ```
- * Can you identify the misbehaving Animal?
- Let's remove the samples for this animal from the data set and repeat
- the whole limma analysis.
- ```{r filter_outlier_animal}
- bad_animal <- "CN8351"
- selected_samples <- !(sample_table$Animal.ID %in% bad_animal)
- good_sample_table <- droplevels(sample_table[selected_samples,])
- good_filtered_count_matrix <- count_matrix[keep_genes,selected_samples]
- good_normalized_total_counts <- normalized_total_counts[selected_samples]
- design <- model.matrix(~ Treatment, data=good_sample_table)
- v <- voom(good_filtered_count_matrix, design, lib.size=good_normalized_total_counts)
- v$genes <- filtered_gene_info
- fit <- lmFit(v, design)
- fit <- eBayes(fit, robust=TRUE)
- results <- topTable(fit, coef="TreatmentIFNg", adjust.method="BH", n=Inf)
- table(results$adj.P.Val <= 0.1)
- ```
- * How did removing the outlier animal affect the number of
- differentially expressed genes?
- Hopefully this demonstrates the importance of properly exploring your
- data before deciding on a model. You can always fit any model to the
- data, but the model will give you some kind of answer, sometimes even
- a plausible-looking one, whether or not that model is a good fit for
- the data. By eliminating an outlier, we significantly increased our
- ability to detect differential expression. Similarly, the focus of the
- homework will be to explore the consequences of adding additional
- covariates into the model formula.
- # Homework: Basic Model Selection
- Try fitting models with Animal.ID, Passage, or both in addition to
- Treatment. In other words, try all four of the following model
- formulae:
- ```{r model_options, eval=FALSE}
- ~ Treatment # (This is the model we just fit in class)
- ~ Animal.ID + Treatment
- ~ Passage + Treatment
- ~ Animal.ID + Passage + Treatment
- ```
- Use the filtered dataset with the outlier animal samples removed. In
- addition to testing Treatment for differential expression in each
- model, also test the other covariates for differential expression by
- passing a different value for the `coef` argument to `topTable`. You
- will need to pass a vector of all the columns of the design matrix
- relevant to the covariate. Below, I have included an example of how to
- fit the model and test for differential expression for the formula
- `~Passage + Treatment`. (Hint: use `colnames(design)` to help you
- figure out which coefficients to test.)
- ```{r homework_example, eval=FALSE}
- design <- model.matrix(~ Passage + Treatment, data=good_sample_table)
- v <- voom(good_filtered_count_matrix, design, lib.size=good_normalized_total_counts)
- v$genes <- filtered_gene_info
- fit <- lmFit(v, design)
- fit <- eBayes(fit, robust=TRUE)
- results <- topTable(fit, coef="TreatmentIFNg", adjust.method="BH", n=Inf)
- table(results$adj.P.Val <= 0.1)
- passage.results <- topTable(fit, coef=c("PassageP5", "PassageP6"), n=Inf)
- table(passage.results$adj.P.Val <= 0.1)
- ```
- Based on your results for all four models, decide which model best
- fits the data. Justify your choice with MDS plots and p-value
- distribution plots that show why you are including or excluding each
- of Animal.ID and Passage from your model. How do your results for
- Treatment change when you include your chosen covariates? Does your
- model give more differentially expressed genes than the one we fit in
- class? If so, how many more? How many genes are differentially
- expressed with respect to your chosen covariates? Do these results
- still hold when using a different signifcance threshold, such as 5% or
- 1% FDR instead of 10%?
- For help fitting models with multiple covariates, see the
- [Limma User's Guide](https://www.bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf)
- section 9.4, "Additive Models and Blocking".
- # Homework shortcut: `selectModel()`
- ```{r homework_shortcut}
- designList <- list(
- TrtOnly=model.matrix(~ Treatment, data=good_sample_table),
- TrtPass=model.matrix(~ Treatment + Passage, data=good_sample_table),
- TrtAnimal=model.matrix(~ Treatment + Animal.ID, data=good_sample_table),
- TrtPassAnimal=model.matrix(~ Treatment + Animal.ID + Passage, data=good_sample_table))
- v <- voom(good_filtered_count_matrix, designList$TrtPassAnimal, lib.size=good_normalized_total_counts)
- sm <- selectModel(v, designlist = designList, criterion = "aic")
- table(sm$pref)
- ```
|