Przeglądaj źródła

Add missing Ch2 methods for promoter radius and relative coverage

Ryan C. Thompson 5 lat temu
rodzic
commit
60c69ba077
1 zmienionych plików z 181 dodań i 143 usunięć
  1. 181 143
      thesis.lyx

+ 181 - 143
thesis.lyx

@@ -775,7 +775,7 @@ literal "true"
 \end_layout
 
 \begin_layout Subsection
-RNA-seq analysis
+RNA-seq differential expression analysis
 \end_layout
 
 \begin_layout Standard
@@ -1230,8 +1230,8 @@ zig-zag
  pattern, such as a gene whose expression goes up on day 1, down on day
  5, and back up again on day 14, will be attenuated or eliminated entirely.
  In the context of a T-cell activation time course, it is unlikely that
- many genes of interest will follow such an expression patter, so this loss
- was deemed an acceptable cost for correcting the batch effect.
+ many genes of interest will follow such an expression pattern, so this
+ loss was deemed an acceptable cost for correcting the batch effect.
 \end_layout
 
 \begin_layout Standard
@@ -1349,7 +1349,7 @@ literal "false"
 \end_layout
 
 \begin_layout Subsection
-ChIP-seq analysis
+ChIP-seq differential modification analysis
 \end_layout
 
 \begin_layout Standard
@@ -1552,11 +1552,158 @@ MA plot of H3K4me2 read counts in 10kb bins for two arbitrary samples.
 
 \end_layout
 
+\begin_layout Standard
+\begin_inset Flex TODO Note (inline)
+status open
+
+\begin_layout Plain Layout
+Be consistent about use of 
+\begin_inset Quotes eld
+\end_inset
+
+differential binding
+\begin_inset Quotes erd
+\end_inset
+
+ vs 
+\begin_inset Quotes eld
+\end_inset
+
+differential modification
+\begin_inset Quotes erd
+\end_inset
+
+ throughout this chapter.
+ The latter is usually preferred.
+\end_layout
+
+\end_inset
+
+
+\end_layout
+
+\begin_layout Standard
+Sequence reads were retrieved from SRA 
+\begin_inset CommandInset citation
+LatexCommand cite
+key "Leinonen2011"
+literal "false"
+
+\end_inset
+
+.
+ ChIP-seq (and input) reads were aligned to GRCh38 genome assembly using
+ Bowtie 2 
+\begin_inset CommandInset citation
+LatexCommand cite
+key "Langmead2012,Schneider2017,gh-hg38-ref"
+literal "false"
+
+\end_inset
+
+.
+ Artifact regions were annotated using a custom implementation of the GreyListCh
+IP algorithm, and these 
+\begin_inset Quotes eld
+\end_inset
+
+greylists
+\begin_inset Quotes erd
+\end_inset
+
+ were merged with the published ENCODE blacklists 
+\begin_inset CommandInset citation
+LatexCommand cite
+key "greylistchip,Amemiya2019,Dunham2012,gh-cd4-csaw"
+literal "false"
+
+\end_inset
+
+.
+ Any read or called peak overlapping one of these regions was regarded as
+ artifactual and excluded from downstream analyses.
+ Figure 
+\begin_inset CommandInset ref
+LatexCommand ref
+reference "fig:CCF-master"
+plural "false"
+caps "false"
+noprefix "false"
+
+\end_inset
+
+ shows the improvement after blacklisting in the strand cross-correlation
+ plots, a common quality control plot for ChIP-seq data.
+ Peaks were called using epic, an implementation of the SICER algorithm
+ 
+\begin_inset CommandInset citation
+LatexCommand cite
+key "Zang2009,gh-epic"
+literal "false"
+
+\end_inset
+
+.
+ Peaks were also called separately using MACS, but MACS was determined to
+ be a poor fit for the data, and these peak calls are not used in any further
+ analyses 
+\begin_inset CommandInset citation
+LatexCommand cite
+key "Zhang2008"
+literal "false"
+
+\end_inset
+
+.
+ Consensus peaks were determined by applying the irreproducible discovery
+ rate (IDR) framework 
+\begin_inset CommandInset citation
+LatexCommand cite
+key "Li2006,gh-idr"
+literal "false"
+
+\end_inset
+
+ to find peaks consistently called in the same locations across all 4 donors.
+\end_layout
+
+\begin_layout Standard
+Promoters were defined by computing the distance from each annotated TSS
+ to the nearest called peak and examining the distribution of distances,
+ observing that peaks for each histone mark were enriched within a certain
+ distance of the TSS.
+ For H3K4me2 and H3K4me3, this distance was about 1
+\begin_inset space ~
+\end_inset
+
+kb, while for H3K27me3 it was 2.5
+\begin_inset space ~
+\end_inset
+
+kb.
+ These distances were used as an 
+\begin_inset Quotes eld
+\end_inset
+
+effective promoter radius
+\begin_inset Quotes erd
+\end_inset
+
+ for each mark.
+ The promoter region for each gene was defined as the region of the genome
+ within this distance upstream or downstream of the gene's annotated TSS.
+ For genes with multiple annotated TSSs, a promoter region was defined for
+ each TSS individually, and any promoters that overlapped (due to multiple
+ TSSs being closer than 2 times the radius) were merged into one large promoter.
+ Thus, some genes had multiple promoters defined, which were each analyzed
+ separately for differential modification.
+\end_layout
+
 \begin_layout Standard
 \begin_inset Float figure
 wide false
 sideways false
-status open
+status collapsed
 
 \begin_layout Plain Layout
 \begin_inset Float figure
@@ -1852,132 +1999,7 @@ PCoA plots of ChIP-seq sliding window data, before and after subtracting
 \end_layout
 
 \begin_layout Standard
-\begin_inset Flex TODO Note (inline)
-status open
-
-\begin_layout Plain Layout
-Be consistent about use of 
-\begin_inset Quotes eld
-\end_inset
-
-differential binding
-\begin_inset Quotes erd
-\end_inset
-
- vs 
-\begin_inset Quotes eld
-\end_inset
-
-differential modification
-\begin_inset Quotes erd
-\end_inset
-
- throughout this chapter.
- The latter is usually preferred.
-\end_layout
-
-\end_inset
-
-
-\end_layout
-
-\begin_layout Standard
-\begin_inset Flex TODO Note (inline)
-status open
-
-\begin_layout Plain Layout
-Forgot to mention effective promoter radius determination.
-\end_layout
-
-\end_inset
-
-
-\end_layout
-
-\begin_layout Standard
-Sequence reads were retrieved from SRA 
-\begin_inset CommandInset citation
-LatexCommand cite
-key "Leinonen2011"
-literal "false"
-
-\end_inset
-
-.
- ChIP-seq (and input) reads were aligned to GRCh38 genome assembly using
- Bowtie 2 
-\begin_inset CommandInset citation
-LatexCommand cite
-key "Langmead2012,Schneider2017,gh-hg38-ref"
-literal "false"
-
-\end_inset
-
-.
- Artifact regions were annotated using a custom implementation of the GreyListCh
-IP algorithm, and these 
-\begin_inset Quotes eld
-\end_inset
-
-greylists
-\begin_inset Quotes erd
-\end_inset
-
- were merged with the published ENCODE blacklists 
-\begin_inset CommandInset citation
-LatexCommand cite
-key "greylistchip,Amemiya2019,Dunham2012,gh-cd4-csaw"
-literal "false"
-
-\end_inset
-
-.
- Any read or called peak overlapping one of these regions was regarded as
- artifactual and excluded from downstream analyses.
- Figure 
-\begin_inset CommandInset ref
-LatexCommand ref
-reference "fig:CCF-master"
-plural "false"
-caps "false"
-noprefix "false"
-
-\end_inset
-
- shows the improvement after blacklisting in the strand cross-correlation
- plots, a common quality control plot for ChIP-seq data.
- Peaks were called using epic, an implementation of the SICER algorithm
- 
-\begin_inset CommandInset citation
-LatexCommand cite
-key "Zang2009,gh-epic"
-literal "false"
-
-\end_inset
-
-.
- Peaks were also called separately using MACS, but MACS was determined to
- be a poor fit for the data, and these peak calls are not used in any further
- analyses 
-\begin_inset CommandInset citation
-LatexCommand cite
-key "Zhang2008"
-literal "false"
-
-\end_inset
-
-.
- Consensus peaks were determined by applying the irreproducible discovery
- rate (IDR) framework 
-\begin_inset CommandInset citation
-LatexCommand cite
-key "Li2006,gh-idr"
-literal "false"
-
-\end_inset
-
- to find peaks consistently called in the same locations across all 4 donors.
- Reads in promoters, peaks, and sliding windows across the genome were counted
+Reads in promoters, peaks, and sliding windows across the genome were counted
  and normalized using csaw and analyzed for differential modification using
  edgeR 
 \begin_inset CommandInset citation
@@ -2013,21 +2035,37 @@ noprefix "false"
 .
 \end_layout
 
-\begin_layout Subsection
-Promoter neighborhood analysis
-\end_layout
-
 \begin_layout Standard
-\begin_inset Flex TODO Note (inline)
-status open
+To investigate whether the location of a peak within the promoter region
+ was important, 
+\begin_inset Quotes eld
+\end_inset
 
-\begin_layout Plain Layout
-Forgot I need to document the methods for this as well.
-\end_layout
+relative coverage profiles
+\begin_inset Quotes erd
+\end_inset
 
+ were generated.
+ First, 500-bp sliding windows were tiled around each annotated TSS: one
+ window centered on the TSS itself, and 10 windows each upstream and downstream,
+ thus covering a 10.5-kb region centered on the TSS with 21 windows.
+ Reads in each window for each TSS were counted in each sample, and the
+ counts were normalized and converted to log CPM as in the differential
+ modification analysis.
+ Then, the logCPM values within each promoter were normalized to an average
+ of zero, such that each window's normalized abundance now represents the
+ relative read depth of that window compared to all other windows in the
+ same promoter.
+ The normalized abundance values for each window in a promoter are collectively
+ referred to as that promoter's 
+\begin_inset Quotes eld
 \end_inset
 
+relative coverage profile
+\begin_inset Quotes erd
+\end_inset
 
+.
 \end_layout
 
 \begin_layout Subsection
@@ -3719,7 +3757,7 @@ t
 \end_inset
 
 ).
- The difference in average FPKM values when a peak overlaps the promoter
+ The difference in average log FPKM values when a peak overlaps the promoter
  is about 
 \begin_inset Formula $+5.67$
 \end_inset
@@ -5768,7 +5806,7 @@ This was where I defined interesting expression patterns and then looked
  at initial relative promoter coverage for each expression pattern.
  Negative result.
  I forgot about this until recently.
- Worth including?
+ Worth including? Remember to also write methods.
 \end_layout
 
 \end_inset
@@ -5786,7 +5824,7 @@ status open
 
 \begin_layout Plain Layout
 I forgot until recently about the work I did on this.
- Worth including?
+ Worth including? Remember to also write methods.
 \end_layout
 
 \end_inset