RNA-Seq.Rmd 20 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493494495496497498499500501502503504505506507508509510511512513514515516517518519520521522523524525526527528529530531532533534535536537538539540541542543544545546547548549550551552553554
  1. ---
  2. title: "RNA-seq study of cynomolgus monkey mesenchymal stem cells treated with interferon gamma"
  3. author: "Ryan C. Thompson"
  4. date: "May 8, 2018"
  5. output:
  6. html_document:
  7. fig_caption: true
  8. pdf_document:
  9. fig_caption: true
  10. ---
  11. ```{r setup, include=FALSE}
  12. knitr::opts_chunk$set(echo = TRUE, retina=2, cache=TRUE, autodep=TRUE)
  13. ```
  14. # Introduction
  15. Today we're going to look at a data set from my work. This data comes
  16. from menchymnal stem cells (MSCs) cultured from 9 cynomolgus monkeys
  17. (a closely related species to rhesus macaques). MSCs are known to have
  18. immune modulatory effects, and the main goal of the project is to test
  19. whether these stem cells, when activated with interferon gamma (IFNg),
  20. can help suppress the body's immune response against an organ
  21. transplant.
  22. ![Known effects of MSC which may benefit organ transplants](images/msc1-cropped.png)
  23. The purpose of this specific data set is simply to determine which
  24. genes' expression is affected when MSCs are treated with IFNg. As
  25. such, we have both untreated (Control) samples and IFNg-treated
  26. samples from 3 different passages of the cell cultures. So, we'll be
  27. using limma to test each gene for differential expression between
  28. Control and IFNg samples.
  29. # Preliminary setup
  30. First, let's install all the packages we'll need.
  31. ```{r install_pkgs, eval=FALSE}
  32. needed_packages <- c("edgeR", "limma", "SummarizedExperiment", "ggplot2", "locfit", "dplyr", "statmod")
  33. already_installed_packages <- rownames(installed.packages())
  34. need_to_install <- setdiff(needed_packages, already_installed_packages)
  35. if (length(need_to_install) > 0) {
  36. ## http://bioconductor.org/install/
  37. source("https://bioconductor.org/biocLite.R")
  38. biocLite(need_to_install)
  39. }
  40. ```
  41. Then we'll load those packages.
  42. ```{r load_pkgs, results="hide", message=FALSE}
  43. library(SummarizedExperiment)
  44. library(edgeR)
  45. library(limma)
  46. library(ggplot2)
  47. library(dplyr)
  48. ```
  49. # Initial data loading and exploration
  50. Let's load the data file. This uses the `readRDS` function, which
  51. reads a single R object from a file and returns it. We assign the
  52. result to a variable.
  53. ```{r loading_data}
  54. sexp <- readRDS(gzcon(url("https://www.dropbox.com/s/4lx6q9fgyz66rky/Cyno-RNASeq-SummarizedExperiment.RDS?dl=1")))
  55. print(sexp)
  56. ```
  57. This object is called a SummarizedExperiment, and its purpose is to
  58. hold the experimental data, sample metadata, gene metadata, and any
  59. other relevant information for the experiment all in one place. This
  60. makes it an excellent starting point for any analysis. In our case,
  61. the experimental data consists of a matrix of counts for each gene in
  62. each sample. This count matrix was generated by aligning all the RNA
  63. sequence reads in each sample to the
  64. [cyno genome](http://www.ncbi.nlm.nih.gov/genome/776) and then
  65. counting the number of uniquely-mapped reads that overlap each gene. I
  66. have done the aligning and counting process for you and provided the
  67. count matrix, since the alignment and counting would take many hours
  68. to commplete.
  69. Look at the output of the print statement above. The RNA-Seq read
  70. counts are contained in the `assays` slot. The `colData` slot contains
  71. information on the samples, while the `rowRanges` slot contains
  72. information on the genes. The rows have "ranges" instead of just
  73. "data" because in addition to things like the gene symbol and
  74. description, the `rowRanges` slot also contains the genomic
  75. coordinates of each gene, that is, the chromosome, start/end positions
  76. of each exon, and the strand. This is the same information that was
  77. used to count the overlapping reads for each gene, but we won't be
  78. needing it today.
  79. You can read more about SummarizedExperiment objects
  80. [here](http://bioconductor.org/packages/release/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html).
  81. ![Structure of a SummarizedExperiment object](images/SEXP.svg)
  82. * How many samples does this SummarizedExperiment object contain? How
  83. many genes does it have? Where did you find this information?
  84. * What happens when you try to select a subset of rows or columns?
  85. ## Extracting the data and metadata
  86. Ok, let's pull out the information that we want from this
  87. SummarizedExperiment.
  88. First, the sample information:
  89. ```{r extract_sampletable}
  90. sample_table <- as.data.frame(colData(sexp))
  91. ```
  92. Next, the gene information:
  93. ```{r extract_geneinfo}
  94. ## Gene metadata
  95. gene_info <- as.data.frame(mcols(sexp))
  96. ```
  97. And finally, the count matrix:
  98. ```{r extract_counts}
  99. count_matrix <- assay(sexp)
  100. ```
  101. Examine the `sample_table`, `gene_info`, and `count_matrix` objects.
  102. Try `dim`, `class`, `nrow`, `ncol`, and similar. Try viewing the first
  103. few rows or columns of each one.
  104. * What kind of object is each one?
  105. * What kind of data do they contain?
  106. * Which dimensions do they have in common? How do these dimensions
  107. match up to the dimension of the SummarizedExperiment?
  108. ```{r example_sexp_components, include=FALSE}
  109. dim(sexp)
  110. dim(sample_table)
  111. dim(gene_info)
  112. dim(count_matrix)
  113. head(sample_table)
  114. head(gene_info)
  115. ## Find the IFNG gene
  116. gene_info[which(gene_info$symbol == "IFNG"),]
  117. count_matrix[1:10,1:6]
  118. ```
  119. ## Learning about the experimental design from the sample table
  120. Before we get to the analysis, we need to have a closer look at the
  121. sample table, to gain a better understanding of the experimental
  122. design.
  123. ```{r head_sample_table}
  124. names(sample_table)
  125. head(sample_table)
  126. ```
  127. * How many Labs/Runs/Animals/Passages/Treatments are there in this
  128. experiment?
  129. * How many samples are in each one?
  130. (You may find the `unique` and `table` functions useful for answering
  131. these questions.)
  132. ```{r tabulate_samples, include=FALSE}
  133. lapply(sample_table[-1], unique)
  134. lapply(sample_table[-1], table)
  135. ```
  136. * How many Animals are there in each Lab?
  137. * Are there Animals with samples from both Labs?
  138. ```{r tabulate_animals_in_labs}
  139. table(sample_table$Animal.ID, sample_table$Lab)
  140. ```
  141. * How many sequencing Runs are there in each Lab?
  142. * How many Animals are there in each sequencing Run?
  143. * Are there Animals with samples from multiple sequencing Runs?
  144. * Is it better to put all of each animal's samples in the same run, or
  145. is it better to distribute each animal's samples across multiple
  146. runs?
  147. ```{r tabulate_more_things, include=FALSE}
  148. table(sample_table$Animal.ID, sample_table$Run)
  149. table(sample_table$Run, sample_table$Lab)
  150. ```
  151. # Performing a basic limma analysis #
  152. ## Filtering non-expressed genes ##
  153. Our count matrix contains information on every known gene in the cyno
  154. genome, but we only want to look at the genes that are expressed in
  155. MSCs. Many genes will not be expressed at all and will have zero or
  156. very few counts in all samples, and the statistical method breaks down
  157. when it is fed all zeros. So we need to decide on a threshold of gene
  158. detection and filter out all the genes whose average expression does
  159. not reach this threshold. Let's see how a threshold of logCPM = 1
  160. looks on a histogram and a QQ plot against normal distribution
  161. quantiles.
  162. ```{r plot_cpm_threshold, fig.cap="logCPM histogram with threshold line"}
  163. mean_log_cpm <- aveLogCPM(count_matrix)
  164. filter_threshold <- 1
  165. ggplot() + aes(x=mean_log_cpm) +
  166. geom_histogram(binwidth=0.2) +
  167. geom_vline(xintercept=filter_threshold) +
  168. ggtitle("Histogram of mean expression values")
  169. ```
  170. ```{r plot_cpm_qq, fig.cap="logCPM QQ plot with threshold line"}
  171. qqnorm(mean_log_cpm); abline(h=filter_threshold)
  172. ```
  173. * What do the shapes of these two plots indicate?
  174. * How can we justify our choice of threshold using these plots?
  175. * Would this same detection threshold be appropriate for other
  176. experiments?
  177. * What biological or experimental factors might influence the choice
  178. of threshold?
  179. Having chosen our threshold, let's pick the subset of genes whose
  180. average expression passes that threshold.
  181. ```{r filter_genes}
  182. keep_genes <- mean_log_cpm >= 1
  183. filtered_count_matrix <- count_matrix[keep_genes,]
  184. filtered_gene_info <- gene_info[keep_genes,]
  185. nrow(filtered_count_matrix)
  186. ```
  187. * What fraction of the genes in the genome are considered expressed
  188. according to our threshold?
  189. * Is this number sensitive to our choice of threshold? How much of a
  190. difference does it make if we use a threshold of 0.5 or 1.5 instead?
  191. ```{r threshold_sensitivity, include=FALSE}
  192. mean(keep_genes)
  193. mean(mean_log_cpm >= 1.5)
  194. mean(mean_log_cpm >= 0.5)
  195. ```
  196. ## Normalization ##
  197. Ok, now that we have some understanding of the experimental design,
  198. let's find some differentially expressed genes! We'll start by
  199. computing the total counts in each sample and then normalizing these
  200. total counts using the Trimmed Mean of M-values (TMM) method, provided
  201. by the `calcNormFactors` function.
  202. ```{r calcNormFactors}
  203. total_counts <- colSums(count_matrix)
  204. nf <- calcNormFactors(count_matrix, lib.size=total_counts, method="TMM")
  205. print(nf)
  206. normalized_total_counts <- total_counts * nf
  207. ```
  208. * What is the range of total_counts/normalization factors? (Try the
  209. `summary` function)
  210. * What is the average total count per sample? Does this seem high or
  211. low? How might we expect this to affect our results?
  212. ```{r inspect_nf, include=FALSE}
  213. summary(total_counts)
  214. summary(nf)
  215. ```
  216. ## Fitting the linear models ##
  217. Now that we have computed our normalization and filtered out
  218. non-expressed genes, it's time to fit our linear models. The basic
  219. model fitting function provided by limma is `lmFit`, which is
  220. essentially a shortcut for running `lm` once for each gene, using the
  221. same model formula each time.
  222. First, we will construct our design matrix by selecting an appropriate
  223. model formula. This step is performed automatically when you run `lm`,
  224. but for `lmFit` we must do it manually. For our first model, we'll
  225. keep it simple and use Treatment as the only covariate.
  226. ```{r design}
  227. design <- model.matrix(~ Treatment, data=sample_table)
  228. head(design)
  229. ```
  230. Note that the design has two columns, representing the two
  231. coefficients in our model. The first one, named `(Intercept)`,
  232. represents the expression level of the Control samples, while the
  233. second coefficient represents the difference between Control and IFNg
  234. treatments. This is the coefficient that we will test for differential
  235. expression once we have fit our model. This model will be equivalent
  236. to a simple two-sample t-test.
  237. Before we fit our model, we have to run `voom` to compensate for the
  238. heteroskedasticity of the counts. (While we're at it, we also insert
  239. the gene metadata into the resulting object. This will allow limma to
  240. include the gene metadata its result tables automatically.)
  241. ```{r voom, fig.cap="Voom diagnostic plot"}
  242. v <- voom(filtered_count_matrix, design, lib.size=normalized_total_counts, plot=TRUE)
  243. v$genes <- filtered_gene_info
  244. ```
  245. Running `voom` with `plot=TRUE` produces a diagnostic plot showing the
  246. empirical relationship between log count and variance. The fitted
  247. curve is a running average of the cloud of points, and this curve is
  248. what voom uses to assign a weight to each observed count.
  249. * According to the plot, which count would have the highest weight
  250. (i.e. the lowest expected variance): 4, 30, or 30000? (Remember the
  251. log2 scale)
  252. * Which would have the lowest weight?
  253. * Is the relationship between log count and variance monotonic? Why
  254. might it not be monotonic, if higher counts are supposed to be more
  255. precise?
  256. * Why does `voom` need to know our experimental design?
  257. The object returned by `voom` is an `EList`, which is a complex object
  258. similar in spirit to a SummarizedExperiment. You don't need to know
  259. anything about its internals. Just know that it contains both the
  260. matrix of log2(CPM) values and the corresponding matrix of weights. It
  261. also contains the gene info, which we added manually. Together with
  262. the design matrix, this is everything we need to fit our linear
  263. models.
  264. ```{r lmfit}
  265. fit <- lmFit(v, design)
  266. fit <- eBayes(fit, robust=TRUE)
  267. ```
  268. Recall that the `eBayes` function is responsible for squeezing each
  269. gene's sample variance toward the overall mean variance of the whole
  270. data set, in the process trading a bit of bias for stability.
  271. ## Getting our results ##
  272. Anyway, now we can perform a "moderated t-test". "Moderated" means
  273. that we are substituting the empirical Bayes squeezed variance for the
  274. sample variance in the formula for the t statistic (and also adjusting
  275. the degrees of freedom term accordingly). To get our results, we call
  276. `topTable`, telling it which coefficient we wish to test, along with
  277. which multiple testing correction to use on the p-values. The `n=Inf`
  278. tells it to give us the results for all the genes in the data set.
  279. ```{r results}
  280. results <- topTable(fit, coef="TreatmentIFNg", adjust.method="BH", n=Inf)
  281. head(results)
  282. ```
  283. * What is the estimated FDR for the most significant gene (the FDR is
  284. stored in the `adj.P.Val` column)?
  285. * How many genes are significantly differentially expressed at a
  286. threshold of 10% FDR?
  287. * Do you see any genes that make biological sense in the top few
  288. results?
  289. ```{r num_sig, include=FALSE}
  290. results$adj.P.Val[1]
  291. table(results$adj.P.Val <= 0.1)
  292. ```
  293. Now let's inspect the p-value histogram. For reference, we'll add in a
  294. horizontal line indicating what a uniform distribution would look
  295. like.
  296. ```{r pval_hist, fig.cap="P-value histogram for the first analysis"}
  297. ggplot(results) +
  298. aes(x=P.Value) +
  299. geom_histogram(aes(y=..density..), binwidth=0.025, boundary=0) +
  300. geom_hline(yintercept=1) +
  301. ggtitle("P-value distribution for Control vs Treatment")
  302. ```
  303. * What can we conclude from this histogram?
  304. Lastly, we can generate an MA plot showing the log2 fold change vs the
  305. log2 CPM for each gene:
  306. ```{r maplot}
  307. ggplot(arrange(results, desc(P.Value))) +
  308. aes(x=AveExpr, y=logFC,
  309. color=ifelse(adj.P.Val <= 0.1, "FDR <= 10%", "FDR > 10%")) +
  310. geom_point(size=0.1) +
  311. scale_color_hue(name="Significance") +
  312. theme(legend.justification=c(1,1), legend.position=c(1,1)) +
  313. ggtitle("MA Plot, IFNg vs Control")
  314. ```
  315. Another common plot is the volcano plot, which plots significance vs
  316. log fold change:
  317. ```{r volcanoplot}
  318. ggplot(arrange(results, desc(P.Value))) +
  319. aes(x=logFC, y=-log10(P.Value),
  320. color=ifelse(adj.P.Val <= 0.1, "FDR <= 10%", "FDR > 10%")) +
  321. geom_point(size=0.1) +
  322. scale_color_hue(name="Significance") +
  323. theme(legend.justification=c(1,1), legend.position=c(1,1)) +
  324. ggtitle("Volcano Plot, IFNg vs Control")
  325. ```
  326. * Based on these plots, are the changes balanced between up and down,
  327. or is there a bias toward a certain direction of change? Does this
  328. make sense for the biology of interferon treatment?
  329. # Improving the analysis by exploring the data
  330. We got pretty good results above, but how do we know that we analyzed
  331. the data correctly? Are there any important covariates that we should
  332. have included in our model? How can we figure out which covariates are
  333. important? One way is to do a PCA plot. Limma actually provides
  334. something called an MDS or PCoA plot, which is slightly different from
  335. a PCA plot but serves the same purpose.
  336. ```{r mds_init, fig.cap="Basic MDS plot from limma"}
  337. mds <- data.frame(plotMDS(v)[c("x", "y")])
  338. ```
  339. Limma's `plotMDS` function creates an MDS plot using the sample
  340. labels, but this results in a very crowded plot. Luckily, it also
  341. returns the x and y coordinates of the plot, which we can use to make
  342. our own. Because we want to see how each covariate relates to the
  343. principal coordinates, let's make several versions of our MDS plot,
  344. colored by each covariate.
  345. ```{r mds, results="hide"}
  346. mds <- cbind(mds, sample_table)
  347. p <- ggplot(mds) +
  348. aes(x=x, y=y) +
  349. xlab("PC1") + ylab("PC2") +
  350. geom_point(size=3) +
  351. coord_fixed(ratio=1) +
  352. ggtitle("Sample MDS Plot")
  353. for (i in c("Lab", "Run", "Animal.ID", "Passage", "Treatment")) {
  354. print(p + aes_string(color=i) +
  355. ggtitle(paste("Sample MDS Plot Colored by", i)))
  356. }
  357. ```
  358. * Based on the MDS plots, which variables seem to be important, and
  359. which do not?
  360. ## Investigating an outlier
  361. You might have noticed that a few of the IFNg-treated samples cluster
  362. with the Control samples. These look like possible outlier samples,
  363. and we should investigate them, since they could interfere with our
  364. model fit. Let's try coloring by Animal ID and using a different shape
  365. for the IFNg samples.
  366. ```{r id_outlier, fig.cap="MDS plot for identifying the outlier animal"}
  367. p + aes(color=Animal.ID, shape=Treatment)
  368. ```
  369. * Can you identify the misbehaving Animal?
  370. Let's remove the samples for this animal from the data set and repeat
  371. the whole limma analysis.
  372. ```{r filter_outlier_animal}
  373. bad_animal <- "CN8351"
  374. selected_samples <- !(sample_table$Animal.ID %in% bad_animal)
  375. good_sample_table <- droplevels(sample_table[selected_samples,])
  376. good_filtered_count_matrix <- count_matrix[keep_genes,selected_samples]
  377. good_normalized_total_counts <- normalized_total_counts[selected_samples]
  378. design <- model.matrix(~ Treatment, data=good_sample_table)
  379. v <- voom(good_filtered_count_matrix, design, lib.size=good_normalized_total_counts)
  380. v$genes <- filtered_gene_info
  381. fit <- lmFit(v, design)
  382. fit <- eBayes(fit, robust=TRUE)
  383. results <- topTable(fit, coef="TreatmentIFNg", adjust.method="BH", n=Inf)
  384. table(results$adj.P.Val <= 0.1)
  385. ```
  386. * How did removing the outlier animal affect the number of
  387. differentially expressed genes?
  388. Hopefully this demonstrates the importance of properly exploring your
  389. data before deciding on a model. You can always fit any model to the
  390. data, but the model will give you some kind of answer, sometimes even
  391. a plausible-looking one, whether or not that model is a good fit for
  392. the data. By eliminating an outlier, we significantly increased our
  393. ability to detect differential expression. Similarly, the focus of the
  394. homework will be to explore the consequences of adding additional
  395. covariates into the model formula.
  396. # Homework: Basic Model Selection
  397. Try fitting models with Animal.ID, Passage, or both in addition to
  398. Treatment. In other words, try all four of the following model
  399. formulae:
  400. ```{r model_options, eval=FALSE}
  401. ~ Treatment # (This is the model we just fit in class)
  402. ~ Animal.ID + Treatment
  403. ~ Passage + Treatment
  404. ~ Animal.ID + Passage + Treatment
  405. ```
  406. Use the filtered dataset with the outlier animal samples removed. In
  407. addition to testing Treatment for differential expression in each
  408. model, also test the other covariates for differential expression by
  409. passing a different value for the `coef` argument to `topTable`. You
  410. will need to pass a vector of all the columns of the design matrix
  411. relevant to the covariate. Below, I have included an example of how to
  412. fit the model and test for differential expression for the formula
  413. `~Passage + Treatment`. (Hint: use `colnames(design)` to help you
  414. figure out which coefficients to test.)
  415. ```{r homework_example, eval=FALSE}
  416. design <- model.matrix(~ Passage + Treatment, data=good_sample_table)
  417. v <- voom(good_filtered_count_matrix, design, lib.size=good_normalized_total_counts)
  418. v$genes <- filtered_gene_info
  419. fit <- lmFit(v, design)
  420. fit <- eBayes(fit, robust=TRUE)
  421. results <- topTable(fit, coef="TreatmentIFNg", adjust.method="BH", n=Inf)
  422. table(results$adj.P.Val <= 0.1)
  423. passage.results <- topTable(fit, coef=c("PassageP5", "PassageP6"), n=Inf)
  424. table(passage.results$adj.P.Val <= 0.1)
  425. ```
  426. Based on your results for all four models, decide which model best
  427. fits the data. Justify your choice with MDS plots and p-value
  428. distribution plots that show why you are including or excluding each
  429. of Animal.ID and Passage from your model. How do your results for
  430. Treatment change when you include your chosen covariates? Does your
  431. model give more differentially expressed genes than the one we fit in
  432. class? If so, how many more? How many genes are differentially
  433. expressed with respect to your chosen covariates? Do these results
  434. still hold when using a different signifcance threshold, such as 5% or
  435. 1% FDR instead of 10%?
  436. For help fitting models with multiple covariates, see the
  437. [Limma User's Guide](https://www.bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf)
  438. section 9.4, "Additive Models and Blocking".
  439. # Homework shortcut: `selectModel()`
  440. ```{r homework_shortcut}
  441. designList <- list(
  442. TrtOnly=model.matrix(~ Treatment, data=good_sample_table),
  443. TrtPass=model.matrix(~ Treatment + Passage, data=good_sample_table),
  444. TrtAnimal=model.matrix(~ Treatment + Animal.ID, data=good_sample_table),
  445. TrtPassAnimal=model.matrix(~ Treatment + Animal.ID + Passage, data=good_sample_table))
  446. v <- voom(good_filtered_count_matrix, designList$TrtPassAnimal, lib.size=good_normalized_total_counts)
  447. sm <- selectModel(v, designlist = designList, criterion = "aic")
  448. table(sm$pref)
  449. ```