RNA-Seq Lab.Rmd 20 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493494495496497498499500501502503504505506507508509510511512513514515516517518519520521522523524525526527528529530531532533534535536537538539540541542543544545546547548549550551552553554555556
  1. ---
  2. title: RNA-seq study of cynomolgus monkey mesenchymal stem cells treated with interferon
  3. gamma
  4. author: "Ryan C. Thompson"
  5. date: "May 8, 2018"
  6. output:
  7. html_notebook:
  8. fig_caption: yes
  9. pdf_document:
  10. fig_caption: yes
  11. ---
  12. ```{r setup, include=FALSE}
  13. knitr::opts_chunk$set(echo = TRUE, retina=2, cache=TRUE, autodep=TRUE)
  14. ```
  15. # Introduction
  16. Today we're going to look at a data set from my work. This data comes
  17. from menchymnal stem cells (MSCs) cultured from 9 cynomolgus monkeys
  18. (a closely related species to rhesus macaques). MSCs are known to have
  19. immune modulatory effects, and the main goal of the project is to test
  20. whether these stem cells, when activated with interferon gamma (IFNg),
  21. can help suppress the body's immune response against an organ
  22. transplant.
  23. ![Known effects of MSC which may benefit organ transplants](images/msc1-cropped.png)
  24. The purpose of this specific data set is simply to determine which
  25. genes' expression is affected when MSCs are treated with IFNg. As
  26. such, we have both untreated (Control) samples and IFNg-treated
  27. samples from 3 different passages of the cell cultures. So, we'll be
  28. using limma to test each gene for differential expression between
  29. Control and IFNg samples.
  30. # Preliminary setup
  31. First, let's install all the packages we'll need.
  32. ```{r install_pkgs, eval=FALSE}
  33. if (!requireNamespace("BiocManager", quietly = TRUE))
  34. install.packages("BiocManager")
  35. already_installed_packages <- rownames(installed.packages())
  36. needed_packages <- c("edgeR", "limma", "SummarizedExperiment", "ggplot2", "locfit", "dplyr", "statmod")
  37. need_to_install <- setdiff(needed_packages, already_installed_packages)
  38. if (length(need_to_install) > 0) {
  39. ## http://bioconductor.org/install/
  40. BiocManager::install(need_to_install)
  41. }
  42. ```
  43. Then we'll load those packages.
  44. ```{r load_pkgs, results="hide", message=FALSE}
  45. library(SummarizedExperiment)
  46. library(edgeR)
  47. library(limma)
  48. library(ggplot2)
  49. library(dplyr)
  50. ```
  51. # Initial data loading and exploration
  52. Let's load the data file. This uses the `readRDS` function, which
  53. reads a single R object from a file and returns it. We assign the
  54. result to a variable.
  55. ```{r loading_data}
  56. sexp <- readRDS(gzcon(url("https://darwinawardwinner.github.io/resume/examples/Salomon/Teaching/Cyno-RNASeq-SummarizedExperiment.RDS")))
  57. print(sexp)
  58. ```
  59. This object is called a SummarizedExperiment, and its purpose is to
  60. hold the experimental data, sample metadata, gene metadata, and any
  61. other relevant information for the experiment all in one place. This
  62. makes it an excellent starting point for any analysis. In our case,
  63. the experimental data consists of a matrix of counts for each gene in
  64. each sample. This count matrix was generated by aligning all the RNA
  65. sequence reads in each sample to the
  66. [cyno genome](http://www.ncbi.nlm.nih.gov/genome/776) and then
  67. counting the number of uniquely-mapped reads that overlap each gene. I
  68. have done the aligning and counting process for you and provided the
  69. count matrix, since the alignment and counting would take many hours
  70. to commplete.
  71. Look at the output of the print statement above. The RNA-Seq read
  72. counts are contained in the `assays` slot. The `colData` slot contains
  73. information on the samples, while the `rowRanges` slot contains
  74. information on the genes. The rows have "ranges" instead of just
  75. "data" because in addition to things like the gene symbol and
  76. description, the `rowRanges` slot also contains the genomic
  77. coordinates of each gene, that is, the chromosome, start/end positions
  78. of each exon, and the strand. This is the same information that was
  79. used to count the overlapping reads for each gene, but we won't be
  80. needing it today.
  81. You can read more about SummarizedExperiment objects
  82. [here](http://bioconductor.org/packages/release/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html).
  83. ![Structure of a SummarizedExperiment object](images/SEXP.svg)
  84. * How many samples does this SummarizedExperiment object contain? How
  85. many genes does it have? Where did you find this information?
  86. * What happens when you try to select a subset of rows or columns?
  87. ## Extracting the data and metadata
  88. Ok, let's pull out the information that we want from this
  89. SummarizedExperiment.
  90. First, the sample information:
  91. ```{r extract_sampletable}
  92. sample_table <- as.data.frame(colData(sexp))
  93. ```
  94. Next, the gene information:
  95. ```{r extract_geneinfo}
  96. ## Gene metadata
  97. gene_info <- as.data.frame(mcols(sexp))
  98. ```
  99. And finally, the count matrix:
  100. ```{r extract_counts}
  101. count_matrix <- assay(sexp)
  102. ```
  103. Examine the `sample_table`, `gene_info`, and `count_matrix` objects.
  104. Try `dim`, `class`, `nrow`, `ncol`, and similar. Try viewing the first
  105. few rows or columns of each one.
  106. * What kind of object is each one?
  107. * What kind of data do they contain?
  108. * Which dimensions do they have in common? How do these dimensions
  109. match up to the dimension of the SummarizedExperiment?
  110. ```{r example_sexp_components, include=FALSE}
  111. dim(sexp)
  112. dim(sample_table)
  113. dim(gene_info)
  114. dim(count_matrix)
  115. head(sample_table)
  116. head(gene_info)
  117. ## Find the IFNG gene
  118. gene_info[which(gene_info$symbol == "IFNG"),]
  119. count_matrix[1:10,1:6]
  120. ```
  121. ## Learning about the experimental design from the sample table
  122. Before we get to the analysis, we need to have a closer look at the
  123. sample table, to gain a better understanding of the experimental
  124. design.
  125. ```{r head_sample_table}
  126. names(sample_table)
  127. head(sample_table)
  128. ```
  129. * How many Labs/Runs/Animals/Passages/Treatments are there in this
  130. experiment?
  131. * How many samples are in each one?
  132. (You may find the `unique` and `table` functions useful for answering
  133. these questions.)
  134. ```{r tabulate_samples, include=FALSE}
  135. lapply(sample_table[-1], unique)
  136. lapply(sample_table[-1], table)
  137. ```
  138. * How many Animals are there in each Lab?
  139. * Are there Animals with samples from both Labs?
  140. ```{r tabulate_animals_in_labs}
  141. table(sample_table$Animal.ID, sample_table$Lab)
  142. ```
  143. * How many sequencing Runs are there in each Lab?
  144. * How many Animals are there in each sequencing Run?
  145. * Are there Animals with samples from multiple sequencing Runs?
  146. * Is it better to put all of each animal's samples in the same run, or
  147. is it better to distribute each animal's samples across multiple
  148. runs?
  149. ```{r tabulate_more_things, include=FALSE}
  150. table(sample_table$Animal.ID, sample_table$Run)
  151. table(sample_table$Run, sample_table$Lab)
  152. ```
  153. # Performing a basic limma analysis #
  154. ## Filtering non-expressed genes ##
  155. Our count matrix contains information on every known gene in the cyno
  156. genome, but we only want to look at the genes that are expressed in
  157. MSCs. Many genes will not be expressed at all and will have zero or
  158. very few counts in all samples, and the statistical method breaks down
  159. when it is fed all zeros. So we need to decide on a threshold of gene
  160. detection and filter out all the genes whose average expression does
  161. not reach this threshold. Let's see how a threshold of logCPM = 1
  162. looks on a histogram and a QQ plot against normal distribution
  163. quantiles.
  164. ```{r plot_cpm_threshold, fig.cap="logCPM histogram with threshold line"}
  165. mean_log_cpm <- aveLogCPM(count_matrix)
  166. filter_threshold <- 1
  167. ggplot() + aes(x=mean_log_cpm) +
  168. geom_histogram(binwidth=0.2) +
  169. geom_vline(xintercept=filter_threshold) +
  170. ggtitle("Histogram of mean expression values")
  171. ```
  172. ```{r plot_cpm_qq, fig.cap="logCPM QQ plot with threshold line"}
  173. qqnorm(mean_log_cpm); abline(h=filter_threshold)
  174. ```
  175. * What do the shapes of these two plots indicate?
  176. * How can we justify our choice of threshold using these plots?
  177. * Would this same detection threshold be appropriate for other
  178. experiments?
  179. * What biological or experimental factors might influence the choice
  180. of threshold?
  181. Having chosen our threshold, let's pick the subset of genes whose
  182. average expression passes that threshold.
  183. ```{r filter_genes}
  184. keep_genes <- mean_log_cpm >= 1
  185. filtered_count_matrix <- count_matrix[keep_genes,]
  186. filtered_gene_info <- gene_info[keep_genes,]
  187. nrow(filtered_count_matrix)
  188. ```
  189. * What fraction of the genes in the genome are considered expressed
  190. according to our threshold?
  191. * Is this number sensitive to our choice of threshold? How much of a
  192. difference does it make if we use a threshold of 0.5 or 1.5 instead?
  193. ```{r threshold_sensitivity, include=FALSE}
  194. mean(keep_genes)
  195. mean(mean_log_cpm >= 1.5)
  196. mean(mean_log_cpm >= 0.5)
  197. ```
  198. ## Normalization ##
  199. Ok, now that we have some understanding of the experimental design,
  200. let's find some differentially expressed genes! We'll start by
  201. computing the total counts in each sample and then normalizing these
  202. total counts using the Trimmed Mean of M-values (TMM) method, provided
  203. by the `calcNormFactors` function.
  204. ```{r calcNormFactors}
  205. total_counts <- colSums(count_matrix)
  206. nf <- calcNormFactors(count_matrix, lib.size=total_counts, method="TMM")
  207. print(nf)
  208. normalized_total_counts <- total_counts * nf
  209. ```
  210. * What is the range of total_counts/normalization factors? (Try the
  211. `summary` function)
  212. * What is the average total count per sample? Does this seem high or
  213. low? How might we expect this to affect our results?
  214. ```{r inspect_nf, include=FALSE}
  215. summary(total_counts)
  216. summary(nf)
  217. ```
  218. ## Fitting the linear models ##
  219. Now that we have computed our normalization and filtered out
  220. non-expressed genes, it's time to fit our linear models. The basic
  221. model fitting function provided by limma is `lmFit`, which is
  222. essentially a shortcut for running `lm` once for each gene, using the
  223. same model formula each time.
  224. First, we will construct our design matrix by selecting an appropriate
  225. model formula. This step is performed automatically when you run `lm`,
  226. but for `lmFit` we must do it manually. For our first model, we'll
  227. keep it simple and use Treatment as the only covariate.
  228. ```{r design}
  229. design <- model.matrix(~ Treatment, data=sample_table)
  230. head(design)
  231. ```
  232. Note that the design has two columns, representing the two
  233. coefficients in our model. The first one, named `(Intercept)`,
  234. represents the expression level of the Control samples, while the
  235. second coefficient represents the difference between Control and IFNg
  236. treatments. This is the coefficient that we will test for differential
  237. expression once we have fit our model. This model will be equivalent
  238. to a simple two-sample t-test.
  239. Before we fit our model, we have to run `voom` to compensate for the
  240. heteroskedasticity of the counts. (While we're at it, we also insert
  241. the gene metadata into the resulting object. This will allow limma to
  242. include the gene metadata its result tables automatically.)
  243. ```{r voom, fig.cap="Voom diagnostic plot"}
  244. v <- voom(filtered_count_matrix, design, lib.size=normalized_total_counts, plot=TRUE)
  245. v$genes <- filtered_gene_info
  246. ```
  247. Running `voom` with `plot=TRUE` produces a diagnostic plot showing the
  248. empirical relationship between log count and variance. The fitted
  249. curve is a running average of the cloud of points, and this curve is
  250. what voom uses to assign a weight to each observed count.
  251. * According to the plot, which count would have the highest weight
  252. (i.e. the lowest expected variance): 4, 30, or 30000? (Remember the
  253. log2 scale)
  254. * Which would have the lowest weight?
  255. * Is the relationship between log count and variance monotonic? Why
  256. might it not be monotonic, if higher counts are supposed to be more
  257. precise?
  258. * Why does `voom` need to know our experimental design?
  259. The object returned by `voom` is an `EList`, which is a complex object
  260. similar in spirit to a SummarizedExperiment. You don't need to know
  261. anything about its internals. Just know that it contains both the
  262. matrix of log2(CPM) values and the corresponding matrix of weights. It
  263. also contains the gene info, which we added manually. Together with
  264. the design matrix, this is everything we need to fit our linear
  265. models.
  266. ```{r lmfit}
  267. fit <- lmFit(v, design)
  268. fit <- eBayes(fit, robust=TRUE)
  269. ```
  270. Recall that the `eBayes` function is responsible for squeezing each
  271. gene's sample variance toward the overall mean variance of the whole
  272. data set, in the process trading a bit of bias for stability.
  273. ## Getting our results ##
  274. Anyway, now we can perform a "moderated t-test". "Moderated" means
  275. that we are substituting the empirical Bayes squeezed variance for the
  276. sample variance in the formula for the t statistic (and also adjusting
  277. the degrees of freedom term accordingly). To get our results, we call
  278. `topTable`, telling it which coefficient we wish to test, along with
  279. which multiple testing correction to use on the p-values. The `n=Inf`
  280. tells it to give us the results for all the genes in the data set.
  281. ```{r results}
  282. results <- topTable(fit, coef="TreatmentIFNg", adjust.method="BH", n=Inf)
  283. head(results)
  284. ```
  285. * What is the estimated FDR for the most significant gene (the FDR is
  286. stored in the `adj.P.Val` column)?
  287. * How many genes are significantly differentially expressed at a
  288. threshold of 10% FDR?
  289. * Do you see any genes that make biological sense in the top few
  290. results?
  291. ```{r num_sig, include=FALSE}
  292. results$adj.P.Val[1]
  293. table(results$adj.P.Val <= 0.1)
  294. ```
  295. Now let's inspect the p-value histogram. For reference, we'll add in a
  296. horizontal line indicating what a uniform distribution would look
  297. like.
  298. ```{r pval_hist, fig.cap="P-value histogram for the first analysis"}
  299. ggplot(results) +
  300. aes(x=P.Value) +
  301. geom_histogram(aes(y=..density..), binwidth=0.025, boundary=0) +
  302. geom_hline(yintercept=1) +
  303. ggtitle("P-value distribution for Control vs Treatment")
  304. ```
  305. * What can we conclude from this histogram?
  306. Lastly, we can generate an MA plot showing the log2 fold change vs the
  307. log2 CPM for each gene:
  308. ```{r maplot}
  309. ggplot(arrange(results, desc(P.Value))) +
  310. aes(x=AveExpr, y=logFC,
  311. color=ifelse(adj.P.Val <= 0.1, "FDR <= 10%", "FDR > 10%")) +
  312. geom_point(size=0.1) +
  313. scale_color_hue(name="Significance") +
  314. theme(legend.justification=c(1,1), legend.position=c(1,1)) +
  315. ggtitle("MA Plot, IFNg vs Control")
  316. ```
  317. Another common plot is the volcano plot, which plots significance vs
  318. log fold change:
  319. ```{r volcanoplot}
  320. ggplot(arrange(results, desc(P.Value))) +
  321. aes(x=logFC, y=-log10(P.Value),
  322. color=ifelse(adj.P.Val <= 0.1, "FDR <= 10%", "FDR > 10%")) +
  323. geom_point(size=0.1) +
  324. scale_color_hue(name="Significance") +
  325. theme(legend.justification=c(1,1), legend.position=c(1,1)) +
  326. ggtitle("Volcano Plot, IFNg vs Control")
  327. ```
  328. * Based on these plots, are the changes balanced between up and down,
  329. or is there a bias toward a certain direction of change? Does this
  330. make sense for the biology of interferon treatment?
  331. # Improving the analysis by exploring the data
  332. We got pretty good results above, but how do we know that we analyzed
  333. the data correctly? Are there any important covariates that we should
  334. have included in our model? How can we figure out which covariates are
  335. important? One way is to do a PCA plot. Limma actually provides
  336. something called an MDS or PCoA plot, which is slightly different from
  337. a PCA plot but serves the same purpose.
  338. ```{r mds_init, fig.cap="Basic MDS plot from limma"}
  339. mds <- data.frame(plotMDS(v)[c("x", "y")])
  340. ```
  341. Limma's `plotMDS` function creates an MDS plot using the sample
  342. labels, but this results in a very crowded plot. Luckily, it also
  343. returns the x and y coordinates of the plot, which we can use to make
  344. our own. Because we want to see how each covariate relates to the
  345. principal coordinates, let's make several versions of our MDS plot,
  346. colored by each covariate.
  347. ```{r mds, results="hide"}
  348. mds <- cbind(mds, sample_table)
  349. p <- ggplot(mds) +
  350. aes(x=x, y=y) +
  351. xlab("PC1") + ylab("PC2") +
  352. geom_point(size=3) +
  353. coord_fixed(ratio=1) +
  354. ggtitle("Sample MDS Plot")
  355. for (i in c("Lab", "Run", "Animal.ID", "Passage", "Treatment")) {
  356. print(p + aes_string(color=i) +
  357. ggtitle(paste("Sample MDS Plot Colored by", i)))
  358. }
  359. ```
  360. * Based on the MDS plots, which variables seem to be important, and
  361. which do not?
  362. ## Investigating an outlier
  363. You might have noticed that a few of the IFNg-treated samples cluster
  364. with the Control samples. These look like possible outlier samples,
  365. and we should investigate them, since they could interfere with our
  366. model fit. Let's try coloring by Animal ID and using a different shape
  367. for the IFNg samples.
  368. ```{r id_outlier, fig.cap="MDS plot for identifying the outlier animal"}
  369. p + aes(color=Animal.ID, shape=Treatment)
  370. ```
  371. * Can you identify the misbehaving Animal?
  372. Let's remove the samples for this animal from the data set and repeat
  373. the whole limma analysis.
  374. ```{r filter_outlier_animal}
  375. bad_animal <- "CN8351"
  376. selected_samples <- !(sample_table$Animal.ID %in% bad_animal)
  377. good_sample_table <- droplevels(sample_table[selected_samples,])
  378. good_filtered_count_matrix <- count_matrix[keep_genes,selected_samples]
  379. good_normalized_total_counts <- normalized_total_counts[selected_samples]
  380. design <- model.matrix(~ Treatment, data=good_sample_table)
  381. v <- voom(good_filtered_count_matrix, design, lib.size=good_normalized_total_counts)
  382. v$genes <- filtered_gene_info
  383. fit <- lmFit(v, design)
  384. fit <- eBayes(fit, robust=TRUE)
  385. results <- topTable(fit, coef="TreatmentIFNg", adjust.method="BH", n=Inf)
  386. table(results$adj.P.Val <= 0.1)
  387. ```
  388. * How did removing the outlier animal affect the number of
  389. differentially expressed genes?
  390. Hopefully this demonstrates the importance of properly exploring your
  391. data before deciding on a model. You can always fit any model to the
  392. data, but the model will give you some kind of answer, sometimes even
  393. a plausible-looking one, whether or not that model is a good fit for
  394. the data. By eliminating an outlier, we significantly increased our
  395. ability to detect differential expression. Similarly, the focus of the
  396. homework will be to explore the consequences of adding additional
  397. covariates into the model formula.
  398. # Homework: Basic Model Selection
  399. Try fitting models with Animal.ID, Passage, or both in addition to
  400. Treatment. In other words, try all four of the following model
  401. formulae:
  402. ```{r model_options, eval=FALSE}
  403. ~ Treatment # (This is the model we just fit in class)
  404. ~ Animal.ID + Treatment
  405. ~ Passage + Treatment
  406. ~ Animal.ID + Passage + Treatment
  407. ```
  408. Use the filtered dataset with the outlier animal samples removed. In
  409. addition to testing Treatment for differential expression in each
  410. model, also test the other covariates for differential expression by
  411. passing a different value for the `coef` argument to `topTable`. You
  412. will need to pass a vector of all the columns of the design matrix
  413. relevant to the covariate. Below, I have included an example of how to
  414. fit the model and test for differential expression for the formula
  415. `~Passage + Treatment`. (Hint: use `colnames(design)` to help you
  416. figure out which coefficients to test.)
  417. ```{r homework_example, eval=FALSE}
  418. design <- model.matrix(~ Passage + Treatment, data=good_sample_table)
  419. v <- voom(good_filtered_count_matrix, design, lib.size=good_normalized_total_counts)
  420. v$genes <- filtered_gene_info
  421. fit <- lmFit(v, design)
  422. fit <- eBayes(fit, robust=TRUE)
  423. results <- topTable(fit, coef="TreatmentIFNg", adjust.method="BH", n=Inf)
  424. table(results$adj.P.Val <= 0.1)
  425. passage.results <- topTable(fit, coef=c("PassageP5", "PassageP6"), n=Inf)
  426. table(passage.results$adj.P.Val <= 0.1)
  427. ```
  428. Based on your results for all four models, decide which model best
  429. fits the data. Justify your choice with MDS plots and p-value
  430. distribution plots that show why you are including or excluding each
  431. of Animal.ID and Passage from your model. How do your results for
  432. Treatment change when you include your chosen covariates? Does your
  433. model give more differentially expressed genes than the one we fit in
  434. class? If so, how many more? How many genes are differentially
  435. expressed with respect to your chosen covariates? Do these results
  436. still hold when using a different signifcance threshold, such as 5% or
  437. 1% FDR instead of 10%?
  438. For help fitting models with multiple covariates, see the
  439. [Limma User's Guide](https://www.bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf)
  440. section 9.4, "Additive Models and Blocking".
  441. # Homework shortcut: `selectModel()`
  442. ```{r homework_shortcut}
  443. designList <- list(
  444. TrtOnly=model.matrix(~ Treatment, data=good_sample_table),
  445. TrtPass=model.matrix(~ Treatment + Passage, data=good_sample_table),
  446. TrtAnimal=model.matrix(~ Treatment + Animal.ID, data=good_sample_table),
  447. TrtPassAnimal=model.matrix(~ Treatment + Animal.ID + Passage, data=good_sample_table))
  448. v <- voom(good_filtered_count_matrix, designList$TrtPassAnimal, lib.size=good_normalized_total_counts)
  449. sm <- selectModel(v, designlist = designList, criterion = "aic")
  450. table(sm$pref)
  451. ```