Introduction

Today we’re going to look at a data set from my work. This data comes from menchymnal stem cells (MSCs) cultured from 9 cynomolgus monkeys (a closely related species to rhesus macaques). MSCs are known to have immune modulatory effects, and the main goal of the project is to test whether these stem cells, when activated with interferon gamma (IFNg), can help suppress the body’s immune response against an organ transplant.

Known effects of MSC which may benefit organ transplants

The purpose of this specific data set is simply to determine which genes’ expression is affected when MSCs are treated with IFNg. As such, we have both untreated (Control) samples and IFNg-treated samples from 3 different passages of the cell cultures. So, we’ll be using limma to test each gene for differential expression between Control and IFNg samples.

Preliminary setup

First, let’s install all the packages we’ll need.

needed_packages <- c("edgeR", "limma", "SummarizedExperiment", "ggplot2", "locfit", "dplyr")
already_installed_packages <- rownames(installed.packages())
need_to_install <- setdiff(needed_packages, already_installed_packages)
if (length(need_to_install) > 0) {
    ## http://bioconductor.org/install/
    source("https://bioconductor.org/biocLite.R")
    biocLite(need_to_install)
}

Then we’ll load those packages.

library(SummarizedExperiment)
library(edgeR)
library(limma)
library(ggplot2)
library(dplyr)

Initial data loading and exploration

Let’s load the data file. This uses the readRDS function, which reads a single R object from a file and returns it. We assign the result to a variable.

sexp <- readRDS(gzcon(url("http://salomon24.scripps.edu/~ryan/Cyno-RNASeq-SummarizedExperiment.RDS")))
print(sexp)

## class: RangedSummarizedExperiment 
## dim: 20416 54 
## metadata(0):
## assays(1): counts
## rownames(20416): ENSP00000346839 ENSP00000260356 ...
##   ENSP00000420502 ENSP00000420636
## rowRanges metadata column names(8): symbol ensembl_gene_id ...
##   entrezgene source_species
## colnames(54): LabA.R104.6C4_65M.P4.Control
##   LabA.R104.6C4_65M.P4.IFNg ... LabB.R95.CN8351.P6.Control
##   LabB.R95.CN8351.P6.IFNg
## colData names(6): Sample.ID Animal.ID ... Lab Treatment

This object is called a SummarizedExperiment, and its purpose is to hold the experimental data, sample metadata, gene metadata, and any other relevant information for the experiment all in one place. This makes it an excellent starting point for any analysis. In our case, the experimental data consists of a matrix of counts for each gene in each sample. This count matrix was generated by aligning all the RNA sequence reads in each sample to the cyno genome and then counting the number of uniquely-mapped reads that overlap each gene. I have done the aligning and counting process for you and provided the count matrix, since the alignment and counting would take many hours to commplete.

Look at the output of the print statement above. The RNA-Seq read counts are contained in the assays slot. The colData slot contains information on the samples, while the rowRanges slot contains information on the genes. The rows have “ranges” instead of just “data” because in addition to things like the gene symbol and description, the rowRanges slot also contains the genomic coordinates of each gene, that is, the chromosome, start/end positions of each exon, and the strand. This is the same information that was used to count the overlapping reads for each gene, but we won’t be needing it today.

You can read more about SummarizedExperiment objects here.

Structure of a SummarizedExperiment object

How many samples does this SummarizedExperiment object contain? How many genes does it have? Where did you find this information?
What happens when you try to select a subset of rows or columns?

Extracting the data and metadata

Ok, let’s pull out the information that we want from this SummarizedExperiment.

First, the sample information:

sample_table <- as.data.frame(colData(sexp))

Next, the gene information:

## Gene metadata
gene_info <- as.data.frame(mcols(sexp))

And finally, the count matrix:

count_matrix <- assay(sexp)

Examine the sample_table, gene_info, and count_matrix objects. Try dim, class, nrow, ncol, and similar. Try viewing the first few rows or columns of each one.

What kind of object is each one?
What kind of data do they contain?
Which dimensions do they have in common? How do these dimensions match up to the dimension of the SummarizedExperiment?

Learning about the experimental design from the sample table

Before we get to the analysis, we need to have a closer look at the sample table, to gain a better understanding of the experimental design.

names(sample_table)

## [1] "Sample.ID" "Animal.ID" "Passage"   "Run"       "Lab"       "Treatment"

head(sample_table)

##                                                 Sample.ID Animal.ID
## LabA.R104.6C4_65M.P4.Control LabA.R104.6C4_65M.P4.Control   6C4_65M
## LabA.R104.6C4_65M.P4.IFNg       LabA.R104.6C4_65M.P4.IFNg   6C4_65M
## LabA.R104.6C4_65M.P5.Control LabA.R104.6C4_65M.P5.Control   6C4_65M
## LabA.R104.6C4_65M.P5.IFNg       LabA.R104.6C4_65M.P5.IFNg   6C4_65M
## LabA.R104.6C4_65M.P6.Control LabA.R104.6C4_65M.P6.Control   6C4_65M
## LabA.R104.6C4_65M.P6.IFNg       LabA.R104.6C4_65M.P6.IFNg   6C4_65M
##                              Passage  Run Lab Treatment
## LabA.R104.6C4_65M.P4.Control      P4 R104   A   Control
## LabA.R104.6C4_65M.P4.IFNg         P4 R104   A      IFNg
## LabA.R104.6C4_65M.P5.Control      P5 R104   A   Control
## LabA.R104.6C4_65M.P5.IFNg         P5 R104   A      IFNg
## LabA.R104.6C4_65M.P6.Control      P6 R104   A   Control
## LabA.R104.6C4_65M.P6.IFNg         P6 R104   A      IFNg

How many Labs/Runs/Animals/Passages/Treatments are there in this experiment?
How many samples are in each one?

(You may find the unique and table functions useful for answering these questions.)

How many Animals are there in each Lab?
Are there Animals with samples from both Labs?

table(sample_table$Animal.ID, sample_table$Lab)

##          
##           A B
##   6C4_65M 6 0
##   6C63    6 0
##   6C7     6 0
##   6C84    6 0
##   7C37    6 0
##   8C8     6 0
##   CN7314  0 6
##   CN7875  0 6
##   CN8351  0 6

How many sequencing Runs are there in each Lab?
How many Animals are there in each sequencing Run?
Are there Animals with samples from multiple sequencing Runs?
Is it better to put all of each animal’s samples in the same run, or is it better to distribute each animal’s samples across multiple runs?

Performing a basic limma analysis

Normalization

Ok, now that we have some understanding of the experimental design, let’s find some differentially expressed genes! We’ll start by computing the total counts in each sample and then normalizing these total counts using the Trimmed Mean of M-values (TMM) method, provided by the calcNormFactors function.

total_counts <- colSums(count_matrix)
nf <- calcNormFactors(count_matrix, lib.size=total_counts, method="TMM")
print(nf)

##  [1] 0.8800745 0.9898919 0.8203199 0.9952668 0.8838364 1.0259290 1.1900295
##  [8] 1.3016013 1.1399379 1.1169711 1.0104253 1.2073377 1.0517275 1.0998364
## [15] 1.0280885 1.0526623 0.8782156 1.0749254 0.6720892 0.9333467 0.6241975
## [22] 0.8950369 0.6706472 0.9438313 1.0769607 1.1242476 1.0746693 1.0230317
## [29] 1.1620610 1.0660020 0.8419412 1.0470961 0.7138150 1.0934070 0.8403635
## [36] 1.1033499 0.9509877 1.0742606 0.9817360 1.0921636 1.2666923 1.0758749
## [43] 0.9013256 1.0833068 1.0313982 1.1667053 1.1918567 1.3519535 0.9827153
## [50] 0.9560459 1.0029934 0.9652303 0.9766554 0.9307331

normalized_total_counts <- total_counts * nf

What is the range of total_counts/normalization factors? (Try the summary function)
What is the average total count per sample? Does this seem high or low? How might we expect this to affect our results?

Filtering non-expressed genes

Our count matrix contains information on every known gene in the cyno genome, but we only want to look at the genes that are expressed in MSCs. Many genes will not be expressed at all and will have zero or very few counts in all samples, and the statistical method breaks down when it is fed all zeros. So we need to decide on a threshold of gene detection and filter out all the genes whose average expression does not reach this threshold. Let’s see how a threshold of logCPM = 1 looks on a histogram and a QQ plot against normal distribution quantiles.

mean_log_cpm <- aveLogCPM(count_matrix, normalized_total_counts)
filter_threshold <- 1
ggplot() + aes(x=mean_log_cpm) +
    geom_histogram(binwidth=0.2) +
    geom_vline(xintercept=filter_threshold) +
    ggtitle("Histogram of mean expression values")

logCPM histogram with threshold line

qqnorm(mean_log_cpm); abline(h=filter_threshold)

logCPM QQ plot with threshold line

What do the shapes of these two plots indicate?
How can we justify our choice of threshold using these plots?
Would this same detection threshold be appropriate for other experiments?
What biological or experimental factors might influence the choice of threshold?

Having chosen our threshold, let’s pick the subset of genes whose average expression passes that threshold.

keep_genes <- mean_log_cpm >= 1
filtered_count_matrix <- count_matrix[keep_genes,]
filtered_gene_info <- gene_info[keep_genes,]
nrow(filtered_count_matrix)

## [1] 11939

What fraction of the genes in the genome are considered expressed according to our threshold?
Is this number sensitive to our choice of threshold? How much of a difference does it make if we use a threshold of 0.5 or 1.5 instead?

Fitting the linear models

Now that we have computed our normalization and filtered out non-expressed genes, it’s time to fit out linear models. The basic model fitting function provided by limma is lmFit, which is essentially a shortcut for running lm once for each gene, using the same model formula each time.

First, we will construct our design matrix by selecting an appropriate model formula. This step is performed automatically when you run lm, but for lmFit we must do it manually. For our first model, we’ll keep it simple and use Treatment as the only covariate.

design <- model.matrix(~ Treatment, data=sample_table)
head(design)

##                              (Intercept) TreatmentIFNg
## LabA.R104.6C4_65M.P4.Control           1             0
## LabA.R104.6C4_65M.P4.IFNg              1             1
## LabA.R104.6C4_65M.P5.Control           1             0
## LabA.R104.6C4_65M.P5.IFNg              1             1
## LabA.R104.6C4_65M.P6.Control           1             0
## LabA.R104.6C4_65M.P6.IFNg              1             1

Note that the design has two columns, representing the two coefficients in our model. The first one, named (Intercept), represents the expression level of the Control samples, while the second coefficient represents the difference between Control and IFNg treatments. This is the coefficient that we will test for differential expression once we have fit our model. This model will be equivalent to a simple two-sample t-test.

Before we fit our model, we have to run voom to compensate for the heteroskedasticity of the counts. (While we’re at it, we also insert the gene metadata into the resulting object. This will allow limma to include the gene metadata its result tables automatically.)

v <- voom(filtered_count_matrix, design, lib.size=normalized_total_counts, plot=TRUE)

Voom diagnostic plot

v$genes <- filtered_gene_info

Running voom with plot=TRUE produces a diagnostic plot showing the empirical relationship between log count and variance. The fitted curve is a running average of the cloud of points, and this curve is what voom uses to assign a weight to each observed count.

According to the plot, which count would have the highest weight (i.e. the lowest expected variance): 4, 30, or 30000? (Remember the log2 scale)
Which would have the lowest weight?
Is the relationship between log count and variance monotonic? Why might it not be monotonic, if higher counts are supposed to be more precise?
Why does voom need to know our experimental design?

The object returned by voom is an EList, which is a complex object similar in spirit to a SummarizedExperiment. You don’t need to know anything about its internals. Just know that it contains both the matrix of log2(CPM) values and the corresponding matrix of weights. It also contains the gene info, which we added manually. Together with the design matrix, this is everything we need to fit our linear models.

fit <- lmFit(v, design)
fit <- eBayes(fit, robust=TRUE)

Recall that the eBayes function is responsible for squeezing each gene’s sample variance toward the overall mean variance of the whole data set, in the process trading a bit of bias for stability.

Getting our results

Anyway, now we can perform a “moderated t-test”. “Moderated” means that we are substituting the empirical Bayes squeezed variance for the sample variance in the formula for the t statistic (and also adjusting the degrees of freedom term accordingly). To get our results, we call topTable, telling it which coefficient we wish to test, along with which multiple testing correction to use on the p-values. The n=Inf tells it to give us the results for all the genes in the data set.

results <- topTable(fit, coef="TreatmentIFNg", adjust.method="BH", n=Inf)
head(results)

##                    symbol    ensembl_gene_id ensembl_transcript_id
## ENSP00000396478    RNF213    ENSG00000173821       ENST00000411702
## ENSP00000372722      TAP1    ENSG00000206297       ENST00000383235
## ENSMMUP00000032577   WARS ENSMMUG00000013865    ENSMMUT00000039489
## ENSMMUP00000036861 TRIM21 ENSMMUG00000004974    ENSMMUT00000043874
## ENSP00000359504      GBP1    ENSG00000117228       ENST00000370473
## ENSP00000246911     IFI35    ENSG00000068079       ENST00000246911
##                    ensembl_peptide_id
## ENSP00000396478       ENSP00000396478
## ENSP00000372722       ENSP00000372722
## ENSMMUP00000032577 ENSMMUP00000032577
## ENSMMUP00000036861 ENSMMUP00000036861
## ENSP00000359504       ENSP00000359504
## ENSP00000246911       ENSP00000246911
##                                                                                                                                                                                                                                                                                                                   description
## ENSP00000396478                                                                                                                                                                                                                                                        ring finger protein 213 [Source:HGNC Symbol;Acc:14539]
## ENSP00000372722    Antigen peptide transporter 1 (APT1)(Peptide transporter TAP1)(ATP-binding cassette sub-family B member 2)(Peptide supply factor 1)(Peptide transporter PSF1)(PSF-1)(Peptide transporter involved in antigen processing 1)(Really interesting new gene 4  /.../n) [Source:UniProtKB/Swiss-Prot;Acc:Q03518]
## ENSMMUP00000032577                                                                                                                                                                                                                                              Uncharacterized protein  [Source:UniProtKB/TrEMBL;Acc:F6QVM3]
## ENSMMUP00000036861                                                                                                                                                                                                          E3 ubiquitin-protein ligase TRIM21; Uncharacterized protein  [Source:UniProtKB/TrEMBL;Acc:F6ZU58]
## ENSP00000359504                                                                                                                                                                                                                        guanylate binding protein 1, interferon-inducible, 67kDa [Source:HGNC Symbol;Acc:4182]
## ENSP00000246911                                                                                                                                                                                                                                                   interferon-induced protein 35 [Source:HGNC Symbol;Acc:5399]
##                    uniprot_swissprot_accession entrezgene source_species
## ENSP00000396478                           <NA>       <NA>           HSAP
## ENSP00000372722                         Q03518       6890           HSAP
## ENSMMUP00000032577                        <NA>     706855           MMUL
## ENSMMUP00000036861                        <NA>     717826           MMUL
## ENSP00000359504                         P32455       2633           HSAP
## ENSP00000246911                           <NA>       <NA>           HSAP
##                       logFC  AveExpr        t      P.Value    adj.P.Val
## ENSP00000396478    2.331039 8.306125 13.84463 7.898112e-20 9.429556e-16
## ENSP00000372722    2.823017 8.034729 13.56115 1.952273e-19 1.165409e-15
## ENSMMUP00000032577 3.742958 9.372005 13.33520 4.236850e-19 1.686125e-15
## ENSMMUP00000036861 2.371993 5.042847 12.81394 2.227684e-18 5.295882e-15
## ENSP00000359504    5.202876 7.912861 12.83494 2.261993e-18 5.295882e-15
## ENSP00000246911    2.004546 6.449793 12.71623 3.079054e-18 5.295882e-15
##                           B
## ENSP00000396478    34.80672
## ENSP00000372722    33.92351
## ENSMMUP00000032577 33.13629
## ENSMMUP00000036861 31.50507
## ENSP00000359504    31.50434
## ENSP00000246911    31.23283

What is the estimated FDR for the most significant gene (the FDR is stored in the adj.P.Val column)?
How many genes are significantly differentially expressed at a threshold of 10% FDR?
Do you see any genes that make biological sense in the top few results?

Now let’s inspect the p-value histogram. For reference, we’ll add in a horizontal line indicating what a uniform distribution would look like.

ggplot(results) +
    aes(x=P.Value) +
    geom_histogram(aes(y=..density..), binwidth=0.025, boundary=0) +
    geom_hline(yintercept=1) +
    ggtitle("P-value distribution for Control vs Treatment")

P-value histogram for the first analysis

What can we conclude from this histogram?

Lastly, we can generate an MA plot showing the log2 fold change vs the log2 CPM for each gene:

ggplot(arrange(results, desc(P.Value))) +
    aes(x=AveExpr, y=logFC,
         color=ifelse(adj.P.Val <= 0.1, "FDR <= 10%", "FDR > 10%")) +
    geom_point(size=0.1) +
    scale_color_hue(name="Significance") +
    theme(legend.justification=c(1,1), legend.position=c(1,1)) +
    ggtitle("MA Plot, IFNg vs Control")

Another common plot is the volcano plot, which plots significance vs log fold change:

ggplot(arrange(results, desc(P.Value))) +
    aes(x=logFC, y=-log10(P.Value),
         color=ifelse(adj.P.Val <= 0.1, "FDR <= 10%", "FDR > 10%")) +
    geom_point(size=0.1) +
    scale_color_hue(name="Significance") +
    theme(legend.justification=c(1,1), legend.position=c(1,1)) +
    ggtitle("Volcano Plot, IFNg vs Control")

Based on these plots, are the changes balanced between up and down, or is there a bias toward a certain direction of change? Does this make sense for the biology of interferon treatment?

Improving the analysis by exploring the data

We got pretty good results above, but how do we know that we analyzed the data correctly? Are there any important covariates that we should have included in our model? How can we figure out which covariates are important? One way is to do a PCA plot. Limma actually provides something called an MDS or PCoA plot, which is slightly different from a PCA plot but serves the same purpose.

mds <- data.frame(plotMDS(v)[c("x", "y")])

Basic MDS plot from limma

Limma’s plotMDS function creates an MDS plot using the sample labels, but this results in a very crowded plot. Luckily, it also returns the x and y coordinates of the plot, which we can use to make our own. Because we want to see how each covariate relates to the principal coordinates, let’s make several versions of our MDS plot, colored by each covariate.

mds <- cbind(mds, sample_table)
p <- ggplot(mds) +
    aes(x=x, y=y) +
    xlab("PC1") + ylab("PC2") +
    geom_point(size=3) +
    coord_fixed(ratio=1) +
    ggtitle("Sample MDS Plot")
for (i in c("Lab", "Run", "Animal.ID", "Passage", "Treatment")) {
    print(p + aes_string(color=i) +
          ggtitle(paste("Sample MDS Plot Colored by", i)))
}

Based on the MDS plots, which variables seem to be important, and which do not?

Investigating an outlier

You might have noticed that a few of the IFNg-treated samples cluster with the Control samples. These look like possible outlier samples, and we should investigate them, since they could interfere with out model fit. Let’s try coloring by Animal ID and using a different shape for the IFNg samples.

p + aes(color=Animal.ID, shape=Treatment)

MDS plot for identifying the outlier animal

Can you identify the misbehaving Animal?

Let’s remove the samples for this animal from the data set and repeat the whole limma analysis.

bad_animal <- "CN8351"
selected_samples <- !(sample_table$Animal.ID %in% bad_animal)

good_sample_table <- droplevels(sample_table[selected_samples,])
good_filtered_count_matrix <- count_matrix[keep_genes,selected_samples]
good_normalized_total_counts <- normalized_total_counts[selected_samples]

design <- model.matrix(~ Treatment, data=good_sample_table)
v <- voom(good_filtered_count_matrix, design, lib.size=good_normalized_total_counts)
v$genes <- filtered_gene_info
fit <- lmFit(v, design)
fit <- eBayes(fit, robust=TRUE)
results <- topTable(fit, coef="TreatmentIFNg", adjust.method="BH", n=Inf)
table(results$adj.P.Val <= 0.1)

## 
## FALSE  TRUE 
##  6817  5122

How did removing the outlier animal affect the number of differentially expressed genes?

Hopefully this demonstrates the importance of properly exploring your data before deciding on a model. You can always fit any model to the data, but the model will give you some kind of answer, sometimes even a plausible-looking one, whether or not that model is a good fit for the data. By eliminating an outlier, we significantly increased our ability to detect differential expression. Similarly, the focus of the homework will be to explore the consequences of adding additional covariates into the model formula.

Homework: Basic Model Selection

Try fitting models with Animal.ID, Passage, or both in addition to Treatment. In other words, try all four of the following model formulae:

~ Treatment # (This is the model we just fit in class)
~ Animal.ID + Treatment
~ Passage + Treatment
~ Animal.ID + Passage + Treatment

Use the filtered dataset with the outlier animal samples removed. In addition to testing Treatment for differential expression in each model, also test the other covariates for differential expression by passing a different value for the coef argument to topTable. You will need to pass a vector of all the columns of the design matrix relevant to the covariate. Below, I have included an example of how to fit the model and test for differential expression for the formula ~Passage + Treatment. (Hint: use colnames(design) to help you figure out which coefficients to test.)

design <- model.matrix(~ Passage + Treatment, data=good_sample_table)
v <- voom(good_filtered_count_matrix, design, lib.size=good_normalized_total_counts)
v$genes <- filtered_gene_info
fit <- lmFit(v, design)
fit <- eBayes(fit, robust=TRUE)
results <- topTable(fit, coef="TreatmentIFNg", adjust.method="BH", n=Inf)
table(results$adj.P.Val <= 0.1)
passage.results <- topTable(fit, coef=c("PassageP5", "PassageP6"), n=Inf)
table(passage.results$adj.P.Val <= 0.1)

Based on your results for all four models, decide which model best fits the data. Justify your choice with MDS plots and p-value distribution plots that show why you are including or excluding each of Animal.ID and Passage from your model. How do your results for Treatment change when you include your chosen covariates? Does your model give more differentially expressed genes than the one we fit in class? If so, how many more? How many genes are differentially expressed with respect to your chosen covariates? Do these results still hold when using a different signifcance threshold, such as 5% or 1% FDR instead of 10%?

For help fitting models with multiple covariates, see the Limma User’s Guide section 9.4, “Additive Models and Blocking”.

RNA-seq study of cynomolgus monkey mesenchymal stem cells treated with interferon gamma

Ryan C. Thompson

April 27, 2016