Today we’re going to look at a data set from my work. This data comes from menchymnal stem cells (MSCs) cultured from 9 cynomolgus monkeys (a closely related species to rhesus macaques). MSCs are known to have immune modulatory effects, and the main goal of the project is to test whether these stem cells, when activated with interferon gamma (IFNg), can help suppress the body’s immune response against an organ transplant.
Known effects of MSC which may benefit organ transplants
The purpose of this specific data set is simply to determine which genes’ expression is affected when MSCs are treated with IFNg. As such, we have both untreated (Control) samples and IFNg-treated samples from 3 different passages of the cell cultures. So, we’ll be using limma to test each gene for differential expression between Control and IFNg samples.
First, let’s install all the packages we’ll need.
needed_packages <- c("edgeR", "limma", "SummarizedExperiment", "ggplot2", "locfit", "dplyr")
already_installed_packages <- rownames(installed.packages())
need_to_install <- setdiff(needed_packages, already_installed_packages)
if (length(need_to_install) > 0) {
## http://bioconductor.org/install/
source("https://bioconductor.org/biocLite.R")
biocLite(need_to_install)
}
Then we’ll load those packages.
library(SummarizedExperiment)
library(edgeR)
library(limma)
library(ggplot2)
library(dplyr)
Let’s load the data file. This uses the readRDS
function, which reads a single R object from a file and returns it. We assign the result to a variable.
sexp <- readRDS(gzcon(url("http://salomon24.scripps.edu/~ryan/Cyno-RNASeq-SummarizedExperiment.RDS")))
print(sexp)
## class: RangedSummarizedExperiment
## dim: 20416 54
## metadata(0):
## assays(1): counts
## rownames(20416): ENSP00000346839 ENSP00000260356 ...
## ENSP00000420502 ENSP00000420636
## rowRanges metadata column names(8): symbol ensembl_gene_id ...
## entrezgene source_species
## colnames(54): LabA.R104.6C4_65M.P4.Control
## LabA.R104.6C4_65M.P4.IFNg ... LabB.R95.CN8351.P6.Control
## LabB.R95.CN8351.P6.IFNg
## colData names(6): Sample.ID Animal.ID ... Lab Treatment
This object is called a SummarizedExperiment, and its purpose is to hold the experimental data, sample metadata, gene metadata, and any other relevant information for the experiment all in one place. This makes it an excellent starting point for any analysis. In our case, the experimental data consists of a matrix of counts for each gene in each sample. This count matrix was generated by aligning all the RNA sequence reads in each sample to the cyno genome and then counting the number of uniquely-mapped reads that overlap each gene. I have done the aligning and counting process for you and provided the count matrix, since the alignment and counting would take many hours to commplete.
Look at the output of the print statement above. The RNA-Seq read counts are contained in the assays
slot. The colData
slot contains information on the samples, while the rowRanges
slot contains information on the genes. The rows have “ranges” instead of just “data” because in addition to things like the gene symbol and description, the rowRanges
slot also contains the genomic coordinates of each gene, that is, the chromosome, start/end positions of each exon, and the strand. This is the same information that was used to count the overlapping reads for each gene, but we won’t be needing it today.
You can read more about SummarizedExperiment objects here.
Structure of a SummarizedExperiment object
Ok, let’s pull out the information that we want from this SummarizedExperiment.
First, the sample information:
sample_table <- as.data.frame(colData(sexp))
Next, the gene information:
## Gene metadata
gene_info <- as.data.frame(mcols(sexp))
And finally, the count matrix:
count_matrix <- assay(sexp)
Examine the sample_table
, gene_info
, and count_matrix
objects. Try dim
, class
, nrow
, ncol
, and similar. Try viewing the first few rows or columns of each one.
Before we get to the analysis, we need to have a closer look at the sample table, to gain a better understanding of the experimental design.
names(sample_table)
## [1] "Sample.ID" "Animal.ID" "Passage" "Run" "Lab" "Treatment"
head(sample_table)
## Sample.ID Animal.ID
## LabA.R104.6C4_65M.P4.Control LabA.R104.6C4_65M.P4.Control 6C4_65M
## LabA.R104.6C4_65M.P4.IFNg LabA.R104.6C4_65M.P4.IFNg 6C4_65M
## LabA.R104.6C4_65M.P5.Control LabA.R104.6C4_65M.P5.Control 6C4_65M
## LabA.R104.6C4_65M.P5.IFNg LabA.R104.6C4_65M.P5.IFNg 6C4_65M
## LabA.R104.6C4_65M.P6.Control LabA.R104.6C4_65M.P6.Control 6C4_65M
## LabA.R104.6C4_65M.P6.IFNg LabA.R104.6C4_65M.P6.IFNg 6C4_65M
## Passage Run Lab Treatment
## LabA.R104.6C4_65M.P4.Control P4 R104 A Control
## LabA.R104.6C4_65M.P4.IFNg P4 R104 A IFNg
## LabA.R104.6C4_65M.P5.Control P5 R104 A Control
## LabA.R104.6C4_65M.P5.IFNg P5 R104 A IFNg
## LabA.R104.6C4_65M.P6.Control P6 R104 A Control
## LabA.R104.6C4_65M.P6.IFNg P6 R104 A IFNg
(You may find the unique
and table
functions useful for answering these questions.)
table(sample_table$Animal.ID, sample_table$Lab)
##
## A B
## 6C4_65M 6 0
## 6C63 6 0
## 6C7 6 0
## 6C84 6 0
## 7C37 6 0
## 8C8 6 0
## CN7314 0 6
## CN7875 0 6
## CN8351 0 6
Ok, now that we have some understanding of the experimental design, let’s find some differentially expressed genes! We’ll start by computing the total counts in each sample and then normalizing these total counts using the Trimmed Mean of M-values (TMM) method, provided by the calcNormFactors
function.
total_counts <- colSums(count_matrix)
nf <- calcNormFactors(count_matrix, lib.size=total_counts, method="TMM")
print(nf)
## [1] 0.8800745 0.9898919 0.8203199 0.9952668 0.8838364 1.0259290 1.1900295
## [8] 1.3016013 1.1399379 1.1169711 1.0104253 1.2073377 1.0517275 1.0998364
## [15] 1.0280885 1.0526623 0.8782156 1.0749254 0.6720892 0.9333467 0.6241975
## [22] 0.8950369 0.6706472 0.9438313 1.0769607 1.1242476 1.0746693 1.0230317
## [29] 1.1620610 1.0660020 0.8419412 1.0470961 0.7138150 1.0934070 0.8403635
## [36] 1.1033499 0.9509877 1.0742606 0.9817360 1.0921636 1.2666923 1.0758749
## [43] 0.9013256 1.0833068 1.0313982 1.1667053 1.1918567 1.3519535 0.9827153
## [50] 0.9560459 1.0029934 0.9652303 0.9766554 0.9307331
normalized_total_counts <- total_counts * nf
summary
function)Our count matrix contains information on every known gene in the cyno genome, but we only want to look at the genes that are expressed in MSCs. Many genes will not be expressed at all and will have zero or very few counts in all samples, and the statistical method breaks down when it is fed all zeros. So we need to decide on a threshold of gene detection and filter out all the genes whose average expression does not reach this threshold. Let’s see how a threshold of logCPM = 1 looks on a histogram and a QQ plot against normal distribution quantiles.
mean_log_cpm <- aveLogCPM(count_matrix, normalized_total_counts)
filter_threshold <- 1
ggplot() + aes(x=mean_log_cpm) +
geom_histogram(binwidth=0.2) +
geom_vline(xintercept=filter_threshold) +
ggtitle("Histogram of mean expression values")
logCPM histogram with threshold line
qqnorm(mean_log_cpm); abline(h=filter_threshold)
logCPM QQ plot with threshold line
Having chosen our threshold, let’s pick the subset of genes whose average expression passes that threshold.
keep_genes <- mean_log_cpm >= 1
filtered_count_matrix <- count_matrix[keep_genes,]
filtered_gene_info <- gene_info[keep_genes,]
nrow(filtered_count_matrix)
## [1] 11939
Now that we have computed our normalization and filtered out non-expressed genes, it’s time to fit out linear models. The basic model fitting function provided by limma is lmFit
, which is essentially a shortcut for running lm
once for each gene, using the same model formula each time.
First, we will construct our design matrix by selecting an appropriate model formula. This step is performed automatically when you run lm
, but for lmFit
we must do it manually. For our first model, we’ll keep it simple and use Treatment as the only covariate.
design <- model.matrix(~ Treatment, data=sample_table)
head(design)
## (Intercept) TreatmentIFNg
## LabA.R104.6C4_65M.P4.Control 1 0
## LabA.R104.6C4_65M.P4.IFNg 1 1
## LabA.R104.6C4_65M.P5.Control 1 0
## LabA.R104.6C4_65M.P5.IFNg 1 1
## LabA.R104.6C4_65M.P6.Control 1 0
## LabA.R104.6C4_65M.P6.IFNg 1 1
Note that the design has two columns, representing the two coefficients in our model. The first one, named (Intercept)
, represents the expression level of the Control samples, while the second coefficient represents the difference between Control and IFNg treatments. This is the coefficient that we will test for differential expression once we have fit our model. This model will be equivalent to a simple two-sample t-test.
Before we fit our model, we have to run voom
to compensate for the heteroskedasticity of the counts. (While we’re at it, we also insert the gene metadata into the resulting object. This will allow limma to include the gene metadata its result tables automatically.)
v <- voom(filtered_count_matrix, design, lib.size=normalized_total_counts, plot=TRUE)
Voom diagnostic plot
v$genes <- filtered_gene_info
Running voom
with plot=TRUE
produces a diagnostic plot showing the empirical relationship between log count and variance. The fitted curve is a running average of the cloud of points, and this curve is what voom uses to assign a weight to each observed count.
voom
need to know our experimental design?The object returned by voom
is an EList
, which is a complex object similar in spirit to a SummarizedExperiment. You don’t need to know anything about its internals. Just know that it contains both the matrix of log2(CPM) values and the corresponding matrix of weights. It also contains the gene info, which we added manually. Together with the design matrix, this is everything we need to fit our linear models.
fit <- lmFit(v, design)
fit <- eBayes(fit, robust=TRUE)
Recall that the eBayes
function is responsible for squeezing each gene’s sample variance toward the overall mean variance of the whole data set, in the process trading a bit of bias for stability.
Anyway, now we can perform a “moderated t-test”. “Moderated” means that we are substituting the empirical Bayes squeezed variance for the sample variance in the formula for the t statistic (and also adjusting the degrees of freedom term accordingly). To get our results, we call topTable
, telling it which coefficient we wish to test, along with which multiple testing correction to use on the p-values. The n=Inf
tells it to give us the results for all the genes in the data set.
results <- topTable(fit, coef="TreatmentIFNg", adjust.method="BH", n=Inf)
head(results)
## symbol ensembl_gene_id ensembl_transcript_id
## ENSP00000396478 RNF213 ENSG00000173821 ENST00000411702
## ENSP00000372722 TAP1 ENSG00000206297 ENST00000383235
## ENSMMUP00000032577 WARS ENSMMUG00000013865 ENSMMUT00000039489
## ENSMMUP00000036861 TRIM21 ENSMMUG00000004974 ENSMMUT00000043874
## ENSP00000359504 GBP1 ENSG00000117228 ENST00000370473
## ENSP00000246911 IFI35 ENSG00000068079 ENST00000246911
## ensembl_peptide_id
## ENSP00000396478 ENSP00000396478
## ENSP00000372722 ENSP00000372722
## ENSMMUP00000032577 ENSMMUP00000032577
## ENSMMUP00000036861 ENSMMUP00000036861
## ENSP00000359504 ENSP00000359504
## ENSP00000246911 ENSP00000246911
## description
## ENSP00000396478 ring finger protein 213 [Source:HGNC Symbol;Acc:14539]
## ENSP00000372722 Antigen peptide transporter 1 (APT1)(Peptide transporter TAP1)(ATP-binding cassette sub-family B member 2)(Peptide supply factor 1)(Peptide transporter PSF1)(PSF-1)(Peptide transporter involved in antigen processing 1)(Really interesting new gene 4 /.../n) [Source:UniProtKB/Swiss-Prot;Acc:Q03518]
## ENSMMUP00000032577 Uncharacterized protein [Source:UniProtKB/TrEMBL;Acc:F6QVM3]
## ENSMMUP00000036861 E3 ubiquitin-protein ligase TRIM21; Uncharacterized protein [Source:UniProtKB/TrEMBL;Acc:F6ZU58]
## ENSP00000359504 guanylate binding protein 1, interferon-inducible, 67kDa [Source:HGNC Symbol;Acc:4182]
## ENSP00000246911 interferon-induced protein 35 [Source:HGNC Symbol;Acc:5399]
## uniprot_swissprot_accession entrezgene source_species
## ENSP00000396478 <NA> <NA> HSAP
## ENSP00000372722 Q03518 6890 HSAP
## ENSMMUP00000032577 <NA> 706855 MMUL
## ENSMMUP00000036861 <NA> 717826 MMUL
## ENSP00000359504 P32455 2633 HSAP
## ENSP00000246911 <NA> <NA> HSAP
## logFC AveExpr t P.Value adj.P.Val
## ENSP00000396478 2.331039 8.306125 13.84463 7.898112e-20 9.429556e-16
## ENSP00000372722 2.823017 8.034729 13.56115 1.952273e-19 1.165409e-15
## ENSMMUP00000032577 3.742958 9.372005 13.33520 4.236850e-19 1.686125e-15
## ENSMMUP00000036861 2.371993 5.042847 12.81394 2.227684e-18 5.295882e-15
## ENSP00000359504 5.202876 7.912861 12.83494 2.261993e-18 5.295882e-15
## ENSP00000246911 2.004546 6.449793 12.71623 3.079054e-18 5.295882e-15
## B
## ENSP00000396478 34.80672
## ENSP00000372722 33.92351
## ENSMMUP00000032577 33.13629
## ENSMMUP00000036861 31.50507
## ENSP00000359504 31.50434
## ENSP00000246911 31.23283
adj.P.Val
column)?Now let’s inspect the p-value histogram. For reference, we’ll add in a horizontal line indicating what a uniform distribution would look like.
ggplot(results) +
aes(x=P.Value) +
geom_histogram(aes(y=..density..), binwidth=0.025, boundary=0) +
geom_hline(yintercept=1) +
ggtitle("P-value distribution for Control vs Treatment")
P-value histogram for the first analysis
Lastly, we can generate an MA plot showing the log2 fold change vs the log2 CPM for each gene:
ggplot(arrange(results, desc(P.Value))) +
aes(x=AveExpr, y=logFC,
color=ifelse(adj.P.Val <= 0.1, "FDR <= 10%", "FDR > 10%")) +
geom_point(size=0.1) +
scale_color_hue(name="Significance") +
theme(legend.justification=c(1,1), legend.position=c(1,1)) +
ggtitle("MA Plot, IFNg vs Control")
Another common plot is the volcano plot, which plots significance vs log fold change:
ggplot(arrange(results, desc(P.Value))) +
aes(x=logFC, y=-log10(P.Value),
color=ifelse(adj.P.Val <= 0.1, "FDR <= 10%", "FDR > 10%")) +
geom_point(size=0.1) +
scale_color_hue(name="Significance") +
theme(legend.justification=c(1,1), legend.position=c(1,1)) +
ggtitle("Volcano Plot, IFNg vs Control")
We got pretty good results above, but how do we know that we analyzed the data correctly? Are there any important covariates that we should have included in our model? How can we figure out which covariates are important? One way is to do a PCA plot. Limma actually provides something called an MDS or PCoA plot, which is slightly different from a PCA plot but serves the same purpose.
mds <- data.frame(plotMDS(v)[c("x", "y")])
Basic MDS plot from limma
Limma’s plotMDS
function creates an MDS plot using the sample labels, but this results in a very crowded plot. Luckily, it also returns the x and y coordinates of the plot, which we can use to make our own. Because we want to see how each covariate relates to the principal coordinates, let’s make several versions of our MDS plot, colored by each covariate.
mds <- cbind(mds, sample_table)
p <- ggplot(mds) +
aes(x=x, y=y) +
xlab("PC1") + ylab("PC2") +
geom_point(size=3) +
coord_fixed(ratio=1) +
ggtitle("Sample MDS Plot")
for (i in c("Lab", "Run", "Animal.ID", "Passage", "Treatment")) {
print(p + aes_string(color=i) +
ggtitle(paste("Sample MDS Plot Colored by", i)))
}
You might have noticed that a few of the IFNg-treated samples cluster with the Control samples. These look like possible outlier samples, and we should investigate them, since they could interfere with out model fit. Let’s try coloring by Animal ID and using a different shape for the IFNg samples.
p + aes(color=Animal.ID, shape=Treatment)
MDS plot for identifying the outlier animal
Let’s remove the samples for this animal from the data set and repeat the whole limma analysis.
bad_animal <- "CN8351"
selected_samples <- !(sample_table$Animal.ID %in% bad_animal)
good_sample_table <- droplevels(sample_table[selected_samples,])
good_filtered_count_matrix <- count_matrix[keep_genes,selected_samples]
good_normalized_total_counts <- normalized_total_counts[selected_samples]
design <- model.matrix(~ Treatment, data=good_sample_table)
v <- voom(good_filtered_count_matrix, design, lib.size=good_normalized_total_counts)
v$genes <- filtered_gene_info
fit <- lmFit(v, design)
fit <- eBayes(fit, robust=TRUE)
results <- topTable(fit, coef="TreatmentIFNg", adjust.method="BH", n=Inf)
table(results$adj.P.Val <= 0.1)
##
## FALSE TRUE
## 6817 5122
Hopefully this demonstrates the importance of properly exploring your data before deciding on a model. You can always fit any model to the data, but the model will give you some kind of answer, sometimes even a plausible-looking one, whether or not that model is a good fit for the data. By eliminating an outlier, we significantly increased our ability to detect differential expression. Similarly, the focus of the homework will be to explore the consequences of adding additional covariates into the model formula.
Try fitting models with Animal.ID, Passage, or both in addition to Treatment. In other words, try all four of the following model formulae:
~ Treatment # (This is the model we just fit in class)
~ Animal.ID + Treatment
~ Passage + Treatment
~ Animal.ID + Passage + Treatment
Use the filtered dataset with the outlier animal samples removed. In addition to testing Treatment for differential expression in each model, also test the other covariates for differential expression by passing a different value for the coef
argument to topTable
. You will need to pass a vector of all the columns of the design matrix relevant to the covariate. Below, I have included an example of how to fit the model and test for differential expression for the formula ~Passage + Treatment
. (Hint: use colnames(design)
to help you figure out which coefficients to test.)
design <- model.matrix(~ Passage + Treatment, data=good_sample_table)
v <- voom(good_filtered_count_matrix, design, lib.size=good_normalized_total_counts)
v$genes <- filtered_gene_info
fit <- lmFit(v, design)
fit <- eBayes(fit, robust=TRUE)
results <- topTable(fit, coef="TreatmentIFNg", adjust.method="BH", n=Inf)
table(results$adj.P.Val <= 0.1)
passage.results <- topTable(fit, coef=c("PassageP5", "PassageP6"), n=Inf)
table(passage.results$adj.P.Val <= 0.1)
Based on your results for all four models, decide which model best fits the data. Justify your choice with MDS plots and p-value distribution plots that show why you are including or excluding each of Animal.ID and Passage from your model. How do your results for Treatment change when you include your chosen covariates? Does your model give more differentially expressed genes than the one we fit in class? If so, how many more? How many genes are differentially expressed with respect to your chosen covariates? Do these results still hold when using a different signifcance threshold, such as 5% or 1% FDR instead of 10%?
For help fitting models with multiple covariates, see the Limma User’s Guide section 9.4, “Additive Models and Blocking”.