RSA-tools - Tutorials - Collecting peak sequences from the Galaxy Web server

Contents

  1. Introduction
  2. Accessing ChIP-seq read and peak coordinates in the GEO database
  3. Opening a custom session in Galaxy (optional)
  4. Uploading mapped reads to Galaxy
  5. Peak calling with MACS
  6. Retrieving peak sequences with Galaxy
  7. Next step
  8. References

Prerequisite

This tutorial assumes that you are familiar with the concepts related to High-Throughput sequencing (reads, read mapping) and ChIP-seq technology (peaks).


Introduction

This tutorial does not directly use RSAT tools, but explains how to obtain datasets that can be used as input for the RSAT program peak-motifs.

The Galaxy server (http://main.g2.bx.psu.edu/) combines a wide variety of programs for accessing and analyzing genomic sequences. These tools are powerful and efficient, and they come with excellent documentation, including training videos.

The goal of this tutorial is to briefly explain the succession of operations required to retrieve peak sequences from the Galaxy server, starting from a set of reads or peaks annotated in the Gene Expression Omnibus database (GEO, http://www.ncbi.nlm.nih.gov/geo/).


Obtaining ChIP-seq read and peak coordinates in the GEO database

Goal: identify the dataset corresponding to the article by Chen et al., 2008 (PubMed ID: 18555785) in the GEO datasets and retrieve the data for the Sox2 experiment.

Protocol

  1. Open a connection to the Pubmed database.

  2. In the text box, enter the title of the article:

      Integration of external signaling pathways with the core transcriptional network in embryonic stem cells

    This should give a single result. If this is not the case, you can select the publication on the basis of its PubMed ID (18555785).

  3. On the right side of the Pubmed record, under the title All links from this record, click the link GEO DataSets. This opens the record GSE11431 with the title Mapping of transcription factor binding sites in mouse embryonic stem cells.

      Note: the title of the record differs from the title of the article, which makes it somewhat difficult to identify a record by browsing the GEO datasets alone. The easiest way to go from an article to the corresponding records is generally to use the direct link from PubMed to GEO DataSets, as we did.

  4. In the GEO database, the identifiers with prefix GSE denote series of experiments. Chen et al. (2008) published ChIP-seq results for various transcription factors, so the series associated with this article contains 16 samples in total.

  5. The bottom of the GSE record provides the list of samples (identifiers starting with GSM). Click on the link corresponding to an experiment of your choice (e.g. ES_Oct4, sample ID GSM288346).

  6. Read the information available about this sample. The bottom of the record provides links to the data sets at various processing stages:

    • The SRA files contain the raw sequences of the reads produced by the massively parallel sequencer. For the Oct4 sample, there are 4 files, each containing more than 5 million reads (~150 Megabases of sequence per file). These files are quite heavy, and we don't need to download them for the purpose of this tutorial.

    • The txt file GSM288346_ES_Oct4.txt.gz contains a set of peaks selected by Chen et al. (2008). These peaks are provided in a tab-delimited text file: the first column indicates the chromosome, the second column the start and the third column the end of each peak. This file, which is actually in BED format, can be uploaded to Galaxy in order to directly retrieve the sequences of the peaks published by Chen and colleagues in their original paper. However, in this tutorial we will perform one additional step: we will run a peak calling program (MACS) in order to identify the regions enriched in reads (the peaks). For this, we need to access another type of data: the coordinates of the reads.
    • The bed file GSM288346_Oct4.bed.gz contains the location of the reads. This is the file we will use as input to identify the significant peaks (see section Peak calling with MACS below).
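
The three-column layout of the peak file described above can be read with a few lines of Python. This is only a minimal sketch, not part of the original protocol: the function name parse_peaks and the example lines are hypothetical, and the code assumes one peak per line with tab-separated chromosome, start and end.

```python
def parse_peaks(lines):
    """Parse tab-delimited peak lines into (chrom, start, end) tuples."""
    peaks = []
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and typical header or comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        peaks.append((fields[0], int(fields[1]), int(fields[2])))
    return peaks


# Hypothetical example lines (not taken from the GEO file).
example = ["chr1\t1000\t1250\n", "chr2\t500\t620\n"]
peaks = parse_peaks(example)
widths = [end - start for _, start, end in peaks]
print(peaks)
print(widths)
```

Inspecting the distribution of peak widths in this way can be a quick sanity check before the coordinates are used to fetch sequences.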

Opening a custom session in Galaxy

The Galaxy Web server offers a wide variety of tools for analyzing genomic data.

To take full advantage of Galaxy, you can open an account on the server, which will allow you to keep track of previous analyses and to store the data and results on the server.

  1. Open a connection to the Galaxy server (http://main.g2.bx.psu.edu/).

  2. At the top of the window, open the command User > Login and provide your email address and password (at the first connection, you must fill in a form to obtain a login).


Uploading mapped reads to Galaxy

  1. In the menu at the left of the window, click Get Data > Upload File.

  2. You can either upload a file from your computer (click Browse beside the File text box) or from a Web server (type a link to the file in the URL/Text box).

      For this exercise, we will upload the read location file from GEO (.bed file described above). In the URL/Text box, paste the URL of the Oct4 sample:
      ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/samples/GSM288nnn/GSM288346/GSM288346%5FOct4%2Ebed%2Egz

  3. In the Genome pop-up menu, select Mouse Feb 2006 (NCBI36/mm8). Tip: the genome is selected automatically if you simply type mm8 when the menu is selected.

  4. Leave the other options at their default values, and click Execute. The upload may take several minutes.

    • Note: the upload speed depends on the availability of the two servers. Indeed, the coordinates of the ChIP-seq reads are transferred directly from GEO to Galaxy, without passing through your computer.

  5. Once the file is uploaded, the yellow box on the right side will turn green. Click on this box and check that the format is BED and the genome ("database") is mm8.

  6. Always check the content of a dataset before using it. For this, Galaxy provides a very convenient interface: click on the eye icon in the green box. The page should look like this:
    chr1	100000123	100000148	0	0	+
    chr1	100000387	100000412	0	0	-
    chr1	100001969	100001994	0	0	-
    chr1	100002597	100002622	0	0	+
    chr1	100002637	100002662	0	0	+
    chr1	10000261	10000286	0	0	-
    chr1	100003474	100003499	0	0	-
    chr1	100004023	100004048	0	0	+
    chr1	100004191	100004216	0	0	+
    chr1	100005158	100005183	0	0	-
    chr1	100005335	100005360	0	0	+
    ...
  7. We recommend renaming the Galaxy entry for the sake of readability.

    • In the green box representing your uploaded data, click on the pencil (Edit attributes).
    • The field Name currently contains the ftp location from which the bed file was retrieved. Cut this information and paste it in the Info box.
    • Type a more informative content in the Name box, for example Chen 2008 ES_Oct4 reads.
    • Don't forget to click Save, and check that the information has been updated in the history panel at the right of the window.

  8. By clicking on the title of the green box ("1: Chen 2008 ES_Oct4 reads"), you obtain some quantitative information about the dataset. Check that the uploaded dataset contains the expected number of reads. It should display the following information.
     ~4,700,000 regions
    format: bed, database: mm8
    Info: ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/samples/GSM288nnn/GSM288346/GSM288346%5FOct4%2Ebed%2E
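
The visual check of the uploaded reads can also be scripted. The sketch below is an illustration under stated assumptions, not a Galaxy feature: the function name check_reads is hypothetical, and the code assumes six tab-separated columns with the strand in the sixth column, as in the lines displayed above (the first three of which are reused as the example).

```python
from collections import Counter


def check_reads(lines):
    """Count reads per strand, raising ValueError on a malformed line."""
    strands = Counter()
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            raise ValueError(f"line {number}: expected 6 columns, got {len(fields)}")
        start, end, strand = int(fields[1]), int(fields[2]), fields[5]
        if start >= end or strand not in ("+", "-"):
            raise ValueError(f"line {number}: invalid coordinates or strand")
        strands[strand] += 1
    return strands


# First three read lines shown in the protocol above.
reads = [
    "chr1\t100000123\t100000148\t0\t0\t+\n",
    "chr1\t100000387\t100000412\t0\t0\t-\n",
    "chr1\t100001969\t100001994\t0\t0\t-\n",
]
counts = check_reads(reads)
print(dict(counts))
```

A strongly unbalanced number of reads between the two strands would be a reason to look more closely at the dataset before calling peaks.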

Exercise

Run the same protocol to download the sample GSM288358 from GEO. This sample was obtained by immunoprecipitating green fluorescent protein (GFP). In principle, it should thus not contain any specific peak. It can optionally be used as control ("mock") for peak-calling programs. After download, rename the dataset "Chen 2008 ES_GFP reads".
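
The GEO URL used above follows a regular pattern, which can help to locate the directory of another sample. The following sketch is inferred from the Oct4 URL only: the helper geo_sample_dir is hypothetical, the server layout may have changed since this tutorial was written, and the name of the file inside the GFP directory is not given here, so it has to be read from the GEO record. The call to unquote shows how the percent-encoded characters in the URL (%5F, %2E) map back to the file name.

```python
from urllib.parse import unquote

# Base location taken from the Oct4 URL used in the protocol above.
GEO_SAMPLES = "ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/samples"


def geo_sample_dir(gsm_id):
    """Return the assumed supplementary-file directory for a GSM accession."""
    # The parent directory replaces the last three digits with "nnn".
    return f"{GEO_SAMPLES}/{gsm_id[:-3]}nnn/{gsm_id}/"


gfp_dir = geo_sample_dir("GSM288358")
oct4_name = unquote("GSM288346%5FOct4%2Ebed%2Egz")
print(gfp_dir)
print(oct4_name)
```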


Peak calling with MACS

  1. In the left frame of the Galaxy window, you can see a set of specialized tools for analyzing Next Generation Sequencing data (NGS TOOLBOX BETA).

  2. Click NGS:Peak Calling > MACS Model-based Analysis of ChIP-Seq.

  3. Enter an Experiment Name (e.g. OCT4 Chen-2008 peaks MACS no input).

  4. For the ChIP-seq tag file, select the file you uploaded in the previous step (if you performed all the steps above, it should appear as "Chen 2008 ES_Oct4 reads" in the pop-up menu).

  5. Effective genome size: this is the size of the genome considered "usable" for peak calling. This value is given by the MACS developers on their website. It is smaller than the complete genome because many regions are excluded (telomeres, highly repeated regions...). The default value is for human (2700000000.0); since we work on mouse, choose 1870000000.0.

  6. Set the Tag size to 26bp (the default is 25).

  7. Optionally, you can activate the option Save shifted raw tag count at every bp into a wiggle file (by default, it is set to Do not create a wig file (faster)). Activating this option (select Save) produces a wig file, which indicates the density of reads at regular intervals along the chromosomes. Wig files are convenient for visual inspection of the peak-calling results (e.g. visualizing the density of reads under the peaks).
  8. Leave all other options to their default values and click Execute.

  9. While the program is running, two yellow boxes should appear in the "History" frame at the right of the Galaxy window. After completion of the job, the boxes will turn green. The first box contains an HTML page with links to the results in various formats. The second box contains a BED file with the coordinates of the peaks. How many peaks ("regions") were detected by MACS?

  10. Once the result is available, click on the pencil to change the information. Rename the dataset (e.g. Oct4 peaks from MACS).

  11. Optionally, you can upload the peak coordinates to the UCSC genome browser to visualize them on the mouse chromosomes. For this, you can simply click on the link display at UCSC main in the green box Oct4 peaks from MACS of the History frame.
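
For readers who prefer to run MACS outside Galaxy, the parameters chosen in the form above roughly correspond to a command line. The sketch below only builds the argument list and does not run anything. It assumes a local MACS 1.4 installation exposing a macs14 executable; the mapping between the Galaxy form and these options is an assumption, and the file names are hypothetical.

```python
def macs_command(treatment, name, gsize=1.87e9, tsize=26, control=None, wig=False):
    """Build an argument list for a MACS 1.4 run (assumed 'macs14' executable)."""
    cmd = [
        "macs14",
        "--treatment", treatment,
        "--name", name,
        "--format", "BED",
        "--gsize", str(int(gsize)),   # effective genome size (mouse value from step 5)
        "--tsize", str(tsize),        # tag size from step 6
    ]
    if control is not None:
        cmd += ["--control", control]  # optional control ("mock") reads
    if wig:
        cmd.append("--wig")            # also write wiggle files
    return cmd


# Hypothetical local file name for the reads uploaded earlier.
cmd = macs_command("Oct4_reads.bed", "Oct4_Chen2008", wig=True)
print(" ".join(cmd))
```

The resulting list could be passed to subprocess.run on a machine where MACS is installed; check the options against the help message of your MACS version first.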

Exercises

In the protocol above, we used the simplest approach to detect peaks with MACS: we entered a single file (the "test" reads) and adapted some parameters to the particularity of our experiment (e.g. genome size, tag size).

Some peak-calling programs (including MACS) allow users to submit a second set of reads as control. Typical controls are "mock" datasets, i.e. genomic sequences obtained from a non-immunoprecipitated protein, or genomic DNA. In Chen's experiment, the control consisted of performing the ChIP-seq with the Green Fluorescent Protein (GFP) instead of a transcription factor.

  1. Compute peaks with MACS, using the Oct4 reads as test dataset, and the GFP experiment as control. Compare the number of peaks with the result of Oct4 when no control set was provided.
  2. Run the MACS peak-calling procedure using GFP as test dataset, and no dataset as control. Count the number of peaks. Does it correspond to your expectation?
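
Beyond comparing the raw numbers of peaks, it can be informative to count how many peaks of one run overlap a peak of the other. The following sketch illustrates one simple way to do this. It is not part of MACS or Galaxy: the function name and the example coordinates are hypothetical, peaks are assumed to be (chromosome, start, end) tuples with half-open intervals, and the comparison is quadratic per chromosome, which is fine for an illustration but slow for very large peak sets.

```python
def count_overlapping(peaks_a, peaks_b):
    """Count peaks in peaks_a that overlap at least one peak in peaks_b."""
    by_chrom = {}
    for chrom, start, end in peaks_b:
        by_chrom.setdefault(chrom, []).append((start, end))
    count = 0
    for chrom, start, end in peaks_a:
        # Two half-open intervals overlap when each starts before the other ends.
        if any(start < b_end and b_start < end
               for b_start, b_end in by_chrom.get(chrom, [])):
            count += 1
    return count


# Hypothetical peak sets standing in for the two MACS runs.
without_control = [("chr1", 100, 300), ("chr1", 1000, 1200), ("chr2", 50, 150)]
with_control = [("chr1", 250, 400), ("chr2", 400, 500)]
shared = count_overlapping(without_control, with_control)
print(shared)
```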

Retrieving peak sequences with Galaxy

The Galaxy Web server makes it possible to quickly retrieve sequences from a coordinate file (e.g. a BED file). The coordinates can come from various sources:

Option A: retrieving sequences from the peaks identified by MACS

  1. The BED file retrieved in the previous section indicates the chromosomal coordinates of the peaks, but in the next section (motif discovery) we will need to analyze the peak sequences. In the Tools frame at the left of the Galaxy window, click Fetch Sequences - Extract Genomic DNA. Select the Oct4 peaks from MACS dataset and click Execute.

  2. Once the box becomes green in the History frame, click on the pencil icon and rename the data set (for example Oct4 peaks from GEO).

  3. Open the green box and click on the disk icon to store the result on your computer (for example in a file Oct4_MAC_peak_sequences.fasta).

You can skip Options B and C.
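
To make explicit what the sequence retrieval in Option A does, the sketch below extracts the subsequence corresponding to one BED interval from a toy chromosome and prints it in FASTA format. This is only an illustration under stated assumptions: the genome, the function names and the header format are hypothetical (Galaxy writes its own headers), and the code relies on the BED convention of 0-based starts and excluded ends.

```python
def extract(genome, chrom, start, end):
    """Return the subsequence for a 0-based, half-open BED interval."""
    return genome[chrom][start:end]


def to_fasta(genome, peaks):
    """Format the sequences of (chrom, start, end) peaks as FASTA text."""
    records = []
    for chrom, start, end in peaks:
        # Hypothetical header format; Galaxy uses its own.
        records.append(f">{chrom}_{start}_{end}\n{extract(genome, chrom, start, end)}")
    return "\n".join(records) + "\n"


# Toy 10-bp "chromosome" standing in for a real genome.
toy_genome = {"chrT": "ACGTACGTAC"}
fasta = to_fasta(toy_genome, [("chrT", 2, 6)])
print(fasta, end="")
```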

Option B: retrieving sequences from a bed file stored on your computer

If you have a bed file at your disposal (e.g. produced by a stand-alone peak calling program running on your computer), you can upload this bed file to the Galaxy server and proceed as for Option A.

Option C: retrieving the sequences from the peak coordinates stored in the GEO database

In this section, we will directly upload the peak coordinates published by Chen in GEO (the above-mentioned "txt" file, which is actually in BED format).

  1. In the menu at the left of the Galaxy page, click Get Data - Upload File.

  2. In the URL/Text box, paste the URL of the Oct4 sample:

    ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/samples/GSM288nnn/GSM288346/GSM288346%5FES%5FOct4%2Etxt%2Egz

  3. In the Genome pop-up menu, select Mouse Feb 2006 (NCBI36/mm8).

    • Tip: the item will be automatically selected if you simply type mm8 after clicking on the pop-up menu.
    • Note: the choice of the genome should be adapted to the particular data set you are analyzing. As indicated in the field "Data processing" of the GEO record GSM288346, Chen's study was performed with the 2006 version of the genome.

  4. Leave the other options at their default values, and click Execute. The upload may take several minutes. Once the file is uploaded, the yellow box on the right side will turn green.

  5. Click on this box and make sure that the format is BED and the genome is mm8. How many ChIP-seq "regions" are present in this file?

  6. Click on the disk icon in this box to download this sequence file. Save it on your computer.
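
The number of regions reported by Galaxy can be cross-checked locally on a gzipped copy of the file. The sketch below is a minimal illustration: the function name count_regions is hypothetical, and a small in-memory gzip stream with made-up lines stands in for the file downloaded from GEO.

```python
import gzip
import io


def count_regions(handle):
    """Count data lines (regions) in a text handle of BED-like lines."""
    return sum(
        1 for line in handle
        if line.strip() and not line.startswith(("#", "track", "browser"))
    )


# Made-up content compressed in memory; replace with a real file path in practice.
raw = gzip.compress(b"chr1\t10\t50\nchr1\t80\t120\n\nchr2\t5\t40\n")
with gzip.open(io.BytesIO(raw), "rt") as handle:
    n_regions = count_regions(handle)
print(n_regions)
```

With a file on disk, the same function can be applied to the handle returned by gzip.open(path, "rt").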


Next step

We obtained a sequence file that can now be used as input for motif discovery and TF binding site prediction. For this, we will use the RSAT workflow peak-motifs, whose usage is explained in the protocol peak-motifs: motif detection in full-size datasets of ChIP-seq peak sequences.
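
Before submitting the sequence file to peak-motifs, it can be worth checking that it contains the expected number of sequences. The following sketch counts the sequences and the total number of residues in FASTA-formatted lines. The function name and the example records are hypothetical, and the code assumes that every header line starts with ">".

```python
def fasta_summary(lines):
    """Return (number of sequences, total sequence length) for FASTA lines."""
    count, total = 0, 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            count += 1          # header line: one more sequence
        else:
            total += len(line)  # sequence line: add its residues
    return count, total


# Hypothetical records; in practice, iterate over the downloaded FASTA file.
example = [">peak1", "ACGT", "AC", ">peak2", "GGG"]
summary = fasta_summary(example)
print(summary)
```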


References

