RSA-tools - Tutorials - matrix-clustering

Prerequisite

This tutorial assumes that you are familiar with the concepts developed in the following parts of the theoretical course.

It is better to follow the corresponding tutorials before this one.

Position-specific scoring matrices.
peak-motifs: discover cis-regulatory motifs and predict putative TFBS from a set of peak sequences identified by high-throughput methods such as ChIP-seq.

Introduction

The program matrix-clustering compares and aligns motifs, identifies groups of similar motifs within one or several motif collections, and displays the results in several motif-representation formats.

Transcription factor binding motifs (TFBMs) are classically represented either as consensus strings (strict consensus, IUPAC or regular expressions), or as position-specific scoring matrices (PSSMs).
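To make the two representations concrete, here is a minimal Python sketch that derives a strict consensus and an IUPAC consensus from a small count matrix. The toy counts, the function names and the 0.25 frequency cutoff are illustrative assumptions, not the rules applied by RSAT.

```python
# IUPAC degenerate nucleotide codes, keyed by the sorted set of allowed bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AC": "M", "AG": "R", "AT": "W", "CG": "S", "CT": "Y", "GT": "K",
    "ACG": "V", "ACT": "H", "AGT": "D", "CGT": "B", "ACGT": "N",
}


def strict_consensus(counts):
    """Return the most frequent base of each column."""
    return "".join(max(column, key=column.get) for column in counts)


def iupac_consensus(counts, min_freq=0.25):
    """Return one IUPAC code per column.

    A base is kept when its relative frequency is >= min_freq
    (an arbitrary cutoff chosen for this illustration).
    """
    consensus = []
    for column in counts:
        total = sum(column.values())
        kept = sorted(b for b, n in column.items()
                      if total > 0 and n / total >= min_freq)
        # Fall back to N when no base passes the cutoff.
        consensus.append(IUPAC.get("".join(kept), "N"))
    return "".join(consensus)


# Hypothetical 4-column count matrix (one dict per position).
toy_counts = [
    {"A": 18, "C": 1, "G": 1, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 20},
    {"A": 12, "C": 0, "G": 8, "T": 0},
    {"A": 6, "C": 5, "G": 5, "T": 4},
]

print(strict_consensus(toy_counts))  # ATAA
print(iupac_consensus(toy_counts))   # ATRV
```

The strict consensus hides the variability of the last two positions, whereas the count matrix itself keeps the full information. This is why PSSMs are preferred for motif comparison.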

Thousands of curated TFBMs are available in specialized databases (JASPAR, RegulonDB, TRANSFAC, etc.). These PSSMs were traditionally built from collections of transcription factor binding sites (TFBS) obtained by various experimental methods (e.g. ChIP-seq, SELEX-seq, PBM).

TFBMs can also be discovered ab initio from genome-scale datasets: promoters of co-expressed genes, ChIP-seq peaks, phylogenetic footprints, etc.

Motif collections (databases as well as ab initio motif discovery results) sometimes contain groups of similar motifs, for different reasons: curation of alternative motifs for the same TF; homologous proteins sharing a particular DNA-binding domain; motifs discovered with analytic workflows combining several algorithms (e.g. RSAT peak-motifs or MEME-ChIP).

The tool RSAT matrix-clustering handles motif redundancy and includes several features to explore the motifs, which are illustrated in this tutorial.

Support for a large series of alternative metrics to compute inter-matrix distances: Pearson correlation (cor), Euclidean distance (dEucl), SSD, Sandelin-Wasserman, logo dot product, and length-normalized versions of these scores (e.g. Ncor).
Possibility to cluster multiple motif collections in the same analysis. This makes it possible to compute the inter-collection similarity and the motif richness of each collection.
Possibility to select a custom combination of several similarity metrics, in order to compute an integrative threshold.
The set of input motifs is split into separate clusters, each of which can be displayed interactively.
User-friendly display of motif trees with aligned logos. The trees are interactive and the user can expand/collapse the branches at will.
At each level of the hierarchical tree, all the descendant matrices are aligned (multiple alignment), and a merged motif is computed (branch motif).
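As an illustration of the correlation-based metrics, the sketch below computes the Pearson correlation (cor) over the aligned columns of two frequency matrices, together with a width-normalized version of it. The normalization used here, Ncor = cor * w / (w1 + w2 - w), where w is the number of aligned columns and w1, w2 are the matrix widths, is stated as an assumption about how Ncor is defined. The function names and toy matrices are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant vector
    return cov / sqrt(var_x * var_y)


def cor_and_ncor(m1, m2, offset):
    """Compare two matrices for one given alignment offset.

    m1, m2: lists of columns, each column a tuple of (A, C, G, T) frequencies.
    offset: position of the first column of m2 relative to the first column of m1.
    Returns (cor, Ncor, w), where w is the number of aligned columns.
    """
    start = max(0, offset)
    end = min(len(m1), offset + len(m2))
    w = end - start
    if w <= 0:
        raise ValueError("the two matrices do not overlap at this offset")
    x = [f for col in m1[start:end] for f in col]
    y = [f for col in m2[start - offset:end - offset] for f in col]
    cor = pearson(x, y)
    total_width = len(m1) + len(m2) - w  # aligned + non-aligned columns
    return cor, cor * w / total_width, w


# Toy matrices: m2 shares its first two columns with the end of m1.
m1 = [(0.9, 0.03, 0.03, 0.04), (0.05, 0.05, 0.05, 0.85),
      (0.1, 0.1, 0.7, 0.1), (0.1, 0.7, 0.1, 0.1)]
m2 = m1[2:] + [(0.25, 0.25, 0.25, 0.25)]

cor, ncor, w = cor_and_ncor(m1, m2, offset=2)
print(f"w={w} cor={cor:.2f} Ncor={ncor:.2f}")
```

In this toy case the two aligned columns are identical, so cor is close to 1, but the alignment covers only 2 of the 5 columns of the total alignment, which lowers Ncor to about 0.4. This shows why a length-normalized score helps avoid clustering motifs that match over only a short overlap.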

In this tutorial, we explain how to tune the parameters and interpret the results of matrix-clustering.

Motif redundancy: examples in motif-discovery results and in motif databases.
Thresholds: setting a combination of similarity measure values as a threshold to define the groups of similar motifs.
Impact of parameters: some examples showing how changing the parameter values can affect cluster composition or tree topology.
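The idea of a combined threshold can be sketched in a few lines of Python. The example assumes that a pair of motifs is considered similar only when every lower threshold is satisfied at the same time; the function name and the score dictionaries are hypothetical, and the threshold values are the ones used in Study case 1 of this tutorial.

```python
# Lower thresholds (values used in Study case 1).
LOWER_THRESHOLDS = {"w": 5, "cor": 0.75, "Ncor": 0.55}


def passes_thresholds(scores, lower=LOWER_THRESHOLDS):
    """Return True when every score reaches its lower threshold.

    Assumption for this sketch: the thresholds are combined with a logical AND.
    """
    return all(scores[name] >= minimum for name, minimum in lower.items())


# Hypothetical comparison results for two pairs of motifs.
pair_a = {"w": 8, "cor": 0.82, "Ncor": 0.60}
pair_b = {"w": 8, "cor": 0.90, "Ncor": 0.40}

print(passes_thresholds(pair_a))  # True
print(passes_thresholds(pair_b))  # False: Ncor is below 0.55
```

The second pair has a high correlation but a low normalized correlation, so it is rejected. This is the kind of case where combining several metrics is more selective than using one metric alone.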

Study cases

Study case 1

Goal: clustering a set of partly redundant motifs discovered by various algorithms (RSAT peak-motifs, MEME-ChIP, HOMER).

Data set: To illustrate the use of motif clustering to filter out redundancy, we will analyze a set of motifs discovered with RSAT peak-motifs, MEME-ChIP and HOMER.

These motifs were discovered in a set of ChIP-seq peaks bound by the transcription factor Oct4 in mouse ES cells. This experiment was performed in the context of a wider study, in which Chen and colleagues characterized the binding locations of 12 transcription factors involved in mouse embryonic stem cell differentiation (Chen et al., 2008).

Connect to RSAT and click on the matrix-clustering button.
In the Analysis Title box, set the name of your analysis (e.g. Oct motifs from several tools).
In the Input Matrices box, paste the matrices discovered by RSAT peak-motifs.
In the Motif Collection Name box set the name for the RSAT peak-motifs motifs (e.g. RSAT)
Set the matrix format to transfac.
In the section 'Thresholds to define the clusters', you can set the parameters to separate the clusters.
Set the parameters, in the lower threshold column to: w = 5, cor = 0.75, Ncor = 0.55
In the menu 'Metric to build the trees', you can select the motif comparison metric that will be used to compute the motif similarity and build the hierarchical tree.
Select Ncor.
In the menu 'Agglomeration rule', you can select the linkage rule used to build the hierarchical tree.
Select average.
In the menu 'Merge matrices', you can control if the counts of the aligned matrices will be summed or averaged.
Select sum.
In the section 'Output file options', check the heatmap option.
In the section 'Labels displayed in the logo tree', check the name and ic (Information Content) options. They will be displayed in the motif trees.
Click on GO and wait for the results.
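The role of the agglomeration rule chosen in the steps above can be sketched with a small pure-Python hierarchical clustering. The distance matrix is a toy example; deriving distances as 1 - Ncor is an assumption made for illustration, and the function names are hypothetical rather than part of RSAT.

```python
def linkage_distance(c1, c2, dist, rule):
    """Distance between two clusters under a given agglomeration rule."""
    pairs = [dist[i][j] for i in c1 for j in c2]
    if rule == "single":
        return min(pairs)               # closest pair of members
    if rule == "complete":
        return max(pairs)               # most distant pair of members
    if rule == "average":
        return sum(pairs) / len(pairs)  # mean over all pairs
    raise ValueError(f"unknown rule: {rule}")


def agglomerate(dist, rule="average"):
    """Bottom-up clustering; returns the list of merges (left, right, height)."""
    clusters = [(i,) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        # Find the two closest clusters under the selected rule.
        height, ia, ib = min(
            (linkage_distance(a, b, dist, rule), ia, ib)
            for ia, a in enumerate(clusters)
            for ib, b in enumerate(clusters)
            if ia < ib
        )
        merges.append((clusters[ia], clusters[ib], height))
        merged = clusters[ia] + clusters[ib]
        clusters = [c for k, c in enumerate(clusters) if k not in (ia, ib)]
        clusters.append(merged)
    return merges


# Toy distances between 4 motifs (assumed here to be 1 - Ncor).
dist = [
    [0.0, 0.1, 0.5, 0.6],
    [0.1, 0.0, 0.4, 0.7],
    [0.5, 0.4, 0.0, 0.2],
    [0.6, 0.7, 0.2, 0.0],
]

for rule in ("single", "average", "complete"):
    left, right, height = agglomerate(dist, rule)[-1]
    print(rule, left, right, round(height, 2))
```

With these toy values the three rules produce the same tree topology, but the final merge happens at a different height (0.4, 0.55 and 0.7 for single, average and complete linkage). Since clusters are defined by thresholds, the rule can therefore change how many clusters are obtained from the same data.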

Interpreting the results

Study case 1

In this example we will study a set of motifs discovered in a ChIP-seq experiment for the TF Oct4 (Pou5f1), an essential TF in cell fate decision, ES cells and early embryonic development. It binds the canonical sequence 5'-ATGCAAAT-3'.

In ES cells, Oct4 often interacts with another TF, Sox2, which binds to an adjacent Sox motif 5'-CATTGTA-3'. Together, both TFs coregulate specific genes.

When analyzing Oct4 or Sox2 binding peaks, the so-called SOCT motif is usually found. It is a composite motif encompassing both the Oct and Sox motifs (Figure 1).

The results page is divided into sections; you can expand or collapse each section by clicking on it.
1. The section Results Summary contains the specified parameters and the number of input motifs and collections.
2. The section Clusters Summary shows, for each cluster, the cluster size, the collections the motifs come from, and the logo of the root motif. The table can be sorted by clicking on the column headers, and the last column provides a download link for the root motif.
3. The section Logo Forest shows, for each cluster, a hierarchical tree with the aligned logos at each branch. This tree is dynamic: by clicking on a node, you can collapse or expand the tree at will to manually control the cluster visualization. The red buttons at the end of each tree allow you to change the motif orientation and to show or hide the IC.
4. Can you manually reduce cluster_1 to 6 non-redundant motifs?
5. The section Individual Motif View is a dynamic table with the information about each motif.
6. The section Individual Cluster View shows each cluster separately and the order in which the motifs were incorporated. You can click on each node to select its corresponding branch motif.
7. The section Heatmap View shows a matrix with all the motifs compared against each other. The color scale ranges from low (yellow) to high (red) similarity between motifs. The colored bars on both sides indicate the cluster to which each motif belongs.
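The branch motifs mentioned above are obtained by merging aligned matrices, with counts either summed or averaged (the 'Merge matrices' option). The Python sketch below illustrates the difference on two small count matrices. It assumes that positions not covered by a matrix contribute zero counts and that averaging divides by the number of merged matrices; the function name, the offsets and the counts are hypothetical.

```python
def merge_aligned(matrices, offsets, mode="sum"):
    """Merge aligned count matrices into a single matrix.

    matrices: list of matrices, each a list of [A, C, G, T] count columns.
    offsets:  start position of each matrix in the alignment.
    mode:     "sum" adds the counts, "mean" divides the sums by the
              number of matrices (assumption made for this sketch).
    """
    if mode not in ("sum", "mean"):
        raise ValueError(f"unknown mode: {mode}")
    width = max(off + len(m) for m, off in zip(matrices, offsets))
    merged = [[0.0] * 4 for _ in range(width)]
    for matrix, off in zip(matrices, offsets):
        for pos, column in enumerate(matrix):
            for base in range(4):
                merged[off + pos][base] += column[base]
    if mode == "mean":
        n = len(matrices)
        merged = [[value / n for value in column] for column in merged]
    return merged


# Two toy matrices; the second one starts one column after the first.
motif_1 = [[10, 0, 0, 0], [0, 10, 0, 0]]
motif_2 = [[0, 6, 0, 0], [0, 0, 6, 0]]

print(merge_aligned([motif_1, motif_2], offsets=[0, 1], mode="sum"))
print(merge_aligned([motif_1, motif_2], offsets=[0, 1], mode="mean"))
```

Summing gives more weight to matrices built from many sites, whereas averaging keeps the merged counts on the same scale as the input matrices.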

Figure 1. 3D model showing the cooperative binding between the Sox2 and Oct4 TFs, which closely interact to bind DNA. Together, they recognize a composite motif called the SOCT motif (SOx+OCT).
Figure 2. Table with the 21 motifs discovered by peak-motifs in the Oct4 ChIP-seq peaks and analyzed with matrix-clustering. Thresholds: Ncor >= 0.4; cor >= 0.6.
Figure 3. Logo tree of cluster 1 found in the Oct4 ChIP-seq motifs. The hierarchical tree displays the logo alignment in both orientations. A branch motif is computed for each branch.
Figure 4. Branch-motif table for cluster 1. You can download the motif in TRANSFAC format or the logo in both orientations by clicking on them.
Figure 5. Logo Forest with the 21 motifs discovered by peak-motifs in the Oct4 ChIP-seq peaks. Using a combination of values as threshold (cor = 0.6; Ncor = 0.4), these motifs were separated into 6 different clusters, each displayed as a tree.
Figure 6. The logo tree of cluster 1 showing the branch motifs.
Figure 7. Three examples of consensus trees obtained with the same data (21 motifs discovered by peak-motifs in the Oct4 ChIP-seq peaks) and the same threshold values (cor >= 0.4; Ncor >= 0.6).
In this picture we only change the hierarchical clustering agglomeration rule. From top to bottom: average, complete, single linkage. Each cluster is represented with a different color. Observe how the number of clusters and the tree topology change depending on the selected method.
Figure 8. Logo trees for clusters 1 and 3, which correspond to the SOCT and Oct4 motifs respectively. The threshold parameters used were Ncor >= 0.55 and cor >= 0.625.
Figure 9. Summary table of an example of matrix-clustering results with randomized motifs.
References
1. Chen, X., Xu, H., Yuan, P., Fang, F., Huss, M., Vega, V. B., Wong, E., Orlov, Y. L., Zhang, W., Jiang, J., Loh, Y. H., Yeo, H. C., Yeo, Z. X., Narang, V., Govindarajan, K. R., Leong, B., Shahab, A., Ruan, Y., Bourque, G., Sung, W. K., Clarke, N. D., Wei, C. L. and Ng, H. H. (2008). Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell 133, 1106-17. [Pubmed 18555785].
2. Thomas-Chollier, M., Herrmann, C., Defrance, M., Sand, O., Thieffry, D. and van Helden, J. (2011). RSAT peak-motifs: motif analysis in full-size ChIP-seq datasets. Nucleic Acids Research, doi:10.1093/nar/gkr1104. [Open access]
3. Mathelier, A., Zhao, X., Zhang, A. W., Parcy, F., Worsley-Hunt, R., Arenillas, D. J., Buchman, S., Chen, C.-y., Chou, A., Ienasescu, H., Lim, J., Shyr, C., Tan, G., Zhou, M., Lenhard, B., Sandelin, A. and Wasserman, W. W. (2013). JASPAR 2014: an extensively expanded and updated open-access database of transcription factor binding profiles. Nucleic Acids Research. [Open access]
For suggestions, please post an issue on GitHub or contact the RSAT team by email.
