RSA-tools - Tutorials - matrix-clustering


Prerequisite

This tutorial assumes that you are familiar with the concepts developed in the following parts of the theoretical course.

  1. PSSM theory
  2. Motif comparison

We recommend following the corresponding tutorials before this one.

  1. Position-specific scoring matrices.
  2. peak-motifs: discover cis-regulatory motifs and predict putative TFBS from a set of peak sequences identified by high-throughput methods such as ChIP-seq.


Introduction

The program matrix-clustering compares and aligns motifs, identifies groups of similar motifs within one or several motif collections, and displays the results in several motif-representation formats.

Transcription factor binding motifs (TFBMs) are classically represented either as consensus strings (strict consensus, IUPAC or regular expressions), or as position-specific scoring matrices (PSSMs).
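As a minimal sketch of these two representations, the example below derives a strict consensus and a frequency matrix from a small count matrix. The counts are made up for illustration and do not come from any database; the function names and the pseudocount correction are assumptions of this sketch, not the RSAT implementation.

```python
# Hypothetical count matrix: one row per motif position,
# counts given in the order A, C, G, T.
ALPHABET = "ACGT"
counts = [
    [12, 0, 1, 2],
    [0, 1, 0, 14],
    [1, 0, 13, 1],
    [13, 1, 0, 1],
]


def strict_consensus(matrix):
    """Return the most frequent residue at each position."""
    return "".join(ALPHABET[col.index(max(col))] for col in matrix)


def to_frequencies(matrix, pseudo=1.0):
    """Convert counts to frequencies.

    The pseudocount is spread evenly over the four residues
    (an assumed, simple correction chosen for this sketch).
    """
    freqs = []
    for col in matrix:
        total = sum(col) + pseudo
        freqs.append([(c + pseudo / 4) / total for c in col])
    return freqs


print(strict_consensus(counts))  # ATGA
```

The consensus keeps a single residue per position, whereas the matrix retains the full residue distribution at each position.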

Thousands of curated TFBMs are available in specialized databases (JASPAR, RegulonDB, TRANSFAC, etc.). These PSSMs were traditionally built from collections of transcription factor binding sites (TFBS) obtained by various experimental methods (e.g. ChIP-seq, SELEX-seq, PBM).

TFBMs can also be discovered ab initio from genome-scale datasets: promoters of co-expressed genes, ChIP-seq peaks, phylogenetic footprints, etc.

Motif collections (databases as well as ab initio motif discovery results) sometimes contain groups of similar motifs, for different reasons: curation of alternative motifs for the same TF; homologous proteins sharing a particular DNA-binding domain; motifs discovered with analytic workflows combining several algorithms (e.g. RSAT peak-motifs or MEME-ChIP).

The tool RSAT matrix-clustering handles motif redundancy and includes several features to explore the motifs, which will be illustrated in this tutorial.

  1. Support for a large series of alternative metrics to compute inter-matrix distances (Pearson correlation (Ncor), Euclidean distance (dEucl), SSD, Sandelin-Wasserman, logo dot product, and length-normalized versions of these scores).
  2. Possibility to cluster multiple motif collections in the same analysis. This makes it possible to compute the inter-collection similarity and the motif richness of each collection.
  3. Possibility to select a custom combination of several of these similarity metrics, in order to compute an integrative threshold.
  4. The set of input motifs is split into separate clusters, each of which can be displayed in user-interactive ways.
  5. User-friendly display of motif trees with aligned logos. The trees are interactive and the user can expand/collapse the branches at will.
  6. At each level of the hierarchical tree, all the descendent matrices are aligned (multiple alignment), and a merged motif is computed (branch motif).
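The correlation-based metrics listed above can be sketched as follows. This is an illustration under stated assumptions, not the RSAT code: matrices are lists of columns of A, C, G, T frequencies, the two matrices are compared at one fixed offset, and the normalized score is assumed to be the correlation multiplied by the fraction of the total alignment length covered by overlapping columns. The function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def compare_at_offset(m1, m2, offset):
    """Compare two matrices with m2 shifted by `offset` columns relative to m1.

    Returns the number of overlapping columns (w), the correlation over
    those columns (cor), and a length-normalized correlation (Ncor),
    or None if the matrices do not overlap at this offset.
    """
    start = max(0, offset)
    end = min(len(m1), offset + len(m2))
    w = end - start
    if w <= 0:
        return None
    flat1 = [v for col in m1[start:end] for v in col]
    flat2 = [v for col in m2[start - offset:end - offset] for v in col]
    cor = pearson(flat1, flat2)
    total_length = len(m1) + len(m2) - w  # assumed normalization width
    return {"w": w, "cor": cor, "Ncor": cor * w / total_length}


# Hypothetical 3-column frequency matrix compared with itself.
motif = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
print(compare_at_offset(motif, motif, 0))
print(compare_at_offset(motif, motif, 1))
```

With this normalization, a high correlation restricted to a short overlap is penalized, which is why a normalized score is a convenient basis for clustering motifs of different widths.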

In this tutorial, we explain how to tune the parameters and interpret the results of matrix-clustering.

  1. Motif redundancy: examples in motif-discovery results and in motif databases.
  2. Thresholds: setting a combination of similarity measure values as a threshold to define the groups of similar motifs.
  3. Impact of parameters: some examples showing how changing the values of the parameters can affect cluster composition or tree topology.


Study cases

Study case 1

Goal: clustering a set of partly redundant motifs discovered by various algorithms (RSAT peak-motifs, MEME-ChIP, HOMER).

Data set: To illustrate the use of motif clustering to filter out redundancy, we will analyze a set of motifs discovered with RSAT peak-motifs, MEME-ChIP, and HOMER.

These motifs were discovered in a set of ChIP-seq peaks bound by the transcription factor Oct4 in mouse ES cells. This experiment had been performed in the context of a wider study, where Chen and colleagues characterized the binding location of 12 transcription factors involved in mouse embryonic stem cell differentiation (Chen et al., 2008).

  1. Connect to RSAT and click on the matrix-clustering button.

  2. In the Analysis Title box, set the name of your analysis (e.g. Oct motifs from several tools).

  3. In the Input Matrices box, paste the matrices discovered by RSAT peak-motifs.

  4. In the Motif Collection Name box, set the name for the RSAT peak-motifs motifs (e.g. RSAT).
  5. Set the matrix format to transfac.

  6. In the section 'Thresholds to define the clusters', you can set the parameters to separate the clusters.
    In the lower threshold column, set the parameters to: w = 5, cor = 0.75, Ncor = 0.55.

  7. In the menu 'Metric to build the trees', you can select the motif comparison metric that will be used to compute the motif similarity and build the hierarchical tree.
    Select Ncor.

  8. In the menu 'Agglomeration rule', you can select the linkage rule used to build the hierarchical tree.
    Select average.

  9. In the menu 'Merge matrices', you can control whether the counts of the aligned matrices are summed or averaged.
    Select sum.

  10. In the section 'Output file options', check the heatmap option.

  11. In the section 'Labels displayed in the logo tree', check the name and ic (Information Content) options. They will be displayed in the motif trees.

  12. Click on GO and wait for the results.
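The parameters chosen in steps 6 to 9 can be illustrated with a small sketch. It is not the RSAT implementation, and all names and numbers below are hypothetical. It assumes that a pair of motifs is grouped only when every lower threshold is satisfied at once, that the average linkage rule takes the mean of all pairwise distances between two clusters, and that the matrices to merge are already aligned to the same width.

```python
# Lower thresholds from step 6 (assumed to be combined with a logical AND).
LOWER_THRESHOLDS = {"w": 5, "cor": 0.75, "Ncor": 0.55}


def passes_thresholds(scores, lower=LOWER_THRESHOLDS):
    """True if every score reaches its lower threshold."""
    return all(scores[name] >= limit for name, limit in lower.items())


def average_linkage(cluster_a, cluster_b, distance):
    """Mean pairwise distance between the members of two clusters."""
    pairs = [(a, b) for a in cluster_a for b in cluster_b]
    return sum(distance[frozenset(pair)] for pair in pairs) / len(pairs)


def merge_matrices(aligned, how="sum"):
    """Merge aligned count matrices column by column, by sum or by mean."""
    n = len(aligned)
    merged = []
    for columns in zip(*aligned):
        summed = [sum(values) for values in zip(*columns)]
        merged.append(summed if how == "sum" else [v / n for v in summed])
    return merged


# Hypothetical comparison scores for two pairs of motifs.
print(passes_thresholds({"w": 8, "cor": 0.91, "Ncor": 0.70}))  # True
print(passes_thresholds({"w": 8, "cor": 0.91, "Ncor": 0.40}))  # False

# Hypothetical distances between three motifs (e.g. derived from 1 - Ncor).
distance = {
    frozenset(("m1", "m2")): 0.2,
    frozenset(("m1", "m3")): 0.6,
    frozenset(("m2", "m3")): 0.4,
}
print(average_linkage(["m1", "m2"], ["m3"], distance))

# Two hypothetical aligned count matrices (columns in A, C, G, T order).
a = [[4, 0, 0, 0], [0, 4, 0, 0]]
b = [[2, 0, 0, 2], [0, 6, 0, 0]]
print(merge_matrices([a, b], how="sum"))
print(merge_matrices([a, b], how="mean"))
```

Summing the counts gives more weight to matrices built from many sites, whereas averaging gives each matrix the same weight in the merged motif.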

Interpreting the results