This tutorial assumes that you are familiar with the concepts developed in the following parts of the theoretical course.
It is better to follow the corresponding tutorials before this one.
The program matrix-clustering compares the motifs of one or several collections, groups similar motifs into clusters, aligns them, and displays the results in different motif-representation formats.
Transcription factor binding motifs (TFBM) are classically represented either as consensus strings (strict consensus, IUPAC or regular expressions), or as position-specific scoring matrices (PSSM).
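To make the two representations concrete, here is a minimal sketch that derives a strict consensus and an IUPAC consensus from a small count matrix. The counts are invented for the example, and the 25% cut-off used to decide which residues enter the IUPAC code is an arbitrary assumption, not necessarily the rule applied by RSAT.

```python
# Toy count matrix, one dict per position (counts invented for illustration).
COUNTS = [
    {"A": 18, "C": 1, "G": 0, "T": 1},
    {"A": 0, "C": 1, "G": 1, "T": 18},
    {"A": 1, "C": 0, "G": 19, "T": 0},
    {"A": 9, "C": 10, "G": 1, "T": 0},
]

# IUPAC ambiguity codes, keyed by the sorted residues they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AC": "M", "AG": "R", "AT": "W", "CG": "S", "CT": "Y", "GT": "K",
    "ACG": "V", "ACT": "H", "AGT": "D", "CGT": "B", "ACGT": "N",
}


def strict_consensus(counts):
    """Keep only the most frequent residue at each position."""
    return "".join(max(col, key=col.get) for col in counts)


def iupac_consensus(counts, min_fraction=0.25):
    """Use the IUPAC code covering every residue whose relative
    frequency reaches min_fraction (hypothetical cut-off)."""
    letters = []
    for col in counts:
        total = sum(col.values())
        kept = sorted(r for r, n in col.items() if n / total >= min_fraction)
        letters.append(IUPAC["".join(kept)])
    return "".join(letters)


print(strict_consensus(COUNTS))  # strict consensus string
print(iupac_consensus(COUNTS))   # degenerate (IUPAC) consensus string
```

The last position shows what a consensus string loses compared with the matrix: the strict consensus keeps a single residue, while the matrix retains the near-equal counts of A and C.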
Thousands of curated TFBMs are available in specialized databases (JASPAR, RegulonDB, TRANSFAC, etc.). These PSSMs were traditionally built from collections of transcription factor binding sites (TFBS) obtained by various experimental methods (e.g. ChIP-seq, SELEX-seq, PBM).
TFBMs can also be discovered ab initio from genome-scale datasets: promoters of co-expressed genes, ChIP-seq peaks, phylogenetic footprints, etc.
Motif collections (databases as well as ab initio motif discovery results) sometimes contain groups of similar motifs, for different reasons: curation of alternative motifs for the same TF; homologous proteins sharing a particular DNA-binding domain; motifs discovered with analytic workflows combining several algorithms (e.g. RSAT peak-motifs or MEME-ChIP).
The tool RSAT matrix-clustering handles this motif redundancy and includes several features to explore the motifs, which are illustrated in this tutorial.
In this tutorial, we explain how to tune the parameters and interpret the results of matrix-clustering.
Goal: clustering a set of partly redundant motifs discovered by various algorithms (RSAT peak-motifs, MEME-ChIP, HOMER).
Data set: to illustrate the use of motif clustering to filter out redundancy, we will analyze a set of motifs discovered with RSAT peak-motifs, MEME-ChIP and HOMER.
These motifs were discovered in a set of ChIP-seq peaks bound by the transcription factor Oct4 in mouse ES cells. This experiment was performed in the context of a wider study, where Chen and colleagues characterized the binding location of 12 transcription factors involved in mouse embryonic stem cell differentiation (Chen et al., 2008).
Connect to RSAT and click on the matrix-clustering button.
In the Analysis Title box, set the name of your analysis (e.g. Oct motifs from several tools).
In the Input Matrices box, paste the matrices discovered by RSAT peak-motifs.
Set the matrix format to transfac.
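If you want to check what you are pasting, the sketch below reads a TRANSFAC-style matrix into Python. It assumes the usual layout (an AC line with the identifier, a P0 or PO header naming the residues, one numbered line per position, and `//` as terminator). The example matrix is invented, and `parse_transfac` is a hypothetical helper, not part of RSAT.

```python
# Invented example in TRANSFAC-like layout (not a real peak-motifs result).
EXAMPLE = """\
AC  toy_motif_1
XX
ID  toy_motif_1
XX
P0       A     C     G     T
01      18     1     0     1
02       0     1     1    18
03       1     0    19     0
XX
//
"""


def parse_transfac(text):
    """Return {motif_id: [ {residue: count} for each position ]}."""
    motifs = {}
    name, residues, rows = None, None, []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        key = fields[0]
        if key == "AC" and len(fields) > 1:
            name = fields[1]                      # motif identifier
        elif key in ("P0", "PO"):
            residues = fields[1:5]                # header: A C G T
        elif key.isdigit() and residues:
            # Numbered line: one position; ignore any trailing consensus letter.
            values = [float(x) for x in fields[1:1 + len(residues)]]
            rows.append(dict(zip(residues, values)))
        elif key == "//":
            if name and rows:
                motifs[name] = rows               # end of the current motif
            name, residues, rows = None, None, []
    return motifs


motifs = parse_transfac(EXAMPLE)
for motif_id, positions in motifs.items():
    print(motif_id, "width:", len(positions))
```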
In the section 'Thresholds to define the clusters', you can set the parameters to separate the clusters.
Set the following values in the lower threshold column: w = 5, cor = 0.75, Ncor = 0.55.
In the menu 'Metric to build the trees', you can select the motif comparison metric that will be used to compute the motif similarity and build the hierarchical tree.
Select Ncor.
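To get an intuition for w, cor and Ncor, the sketch below compares two toy frequency matrices at a fixed offset. It assumes that cor is the Pearson correlation over the aligned cells and that Ncor is cor multiplied by the fraction of the total alignment length covered by the overlap, w / (w1 + w2 - w). Check the RSAT compare-matrices documentation for the exact definitions. The helper names and the toy matrices are hypothetical, and the sketch does not scan offsets or strands as the real program does.

```python
from math import sqrt


def column(residue, main=0.7):
    """Toy frequency column [A, C, G, T] favouring one residue."""
    rest = (1 - main) / 3
    return [main if r == residue else rest for r in "ACGT"]


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def compare_at_offset(m1, m2, offset):
    """Return (w, cor, ncor) when column 0 of m2 is placed at
    position `offset` of m1 (same strand only)."""
    start = max(0, offset)
    end = min(len(m1), offset + len(m2))
    w = end - start                                # number of aligned columns
    if w <= 0:
        return 0, 0.0, 0.0
    xs = [v for i in range(start, end) for v in m1[i]]
    ys = [v for i in range(start, end) for v in m2[i - offset]]
    cor = pearson(xs, ys)
    ncor = cor * w / (len(m1) + len(m2) - w)       # assumed normalisation
    return w, cor, ncor


def passes_thresholds(w, cor, ncor, min_w=5, min_cor=0.75, min_ncor=0.55):
    """Lower thresholds used in this tutorial's form."""
    return w >= min_w and cor >= min_cor and ncor >= min_ncor


M1 = [column(r) for r in "ATGCAA"]
M2 = [column(r) for r in "TGCAA"]   # same motif, one column shorter
M3 = [column(r) for r in "CAA"]     # only a 3-column overlap

print(compare_at_offset(M1, M2, 1))
print(compare_at_offset(M1, M3, 3))
```

The second comparison shows why the thresholds are combined: a short overlap can reach a perfect correlation while covering only part of the alignment, which Ncor and w penalize.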
In the menu 'Agglomeration rule', you can select the linkage rule used to build the hierarchical tree.
Select average.
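The three common linkage rules differ only in how the distance between two groups of motifs is summarized. The sketch below illustrates this with invented pairwise distances. How matrix-clustering converts a similarity such as Ncor into a distance is not covered here, and `cluster_distance` is a hypothetical helper.

```python
# Invented pairwise distances between three motifs.
DIST = {
    frozenset(("m1", "m2")): 0.1,
    frozenset(("m1", "m3")): 0.2,
    frozenset(("m2", "m3")): 0.6,
}


def cluster_distance(a, b, dist, rule="average"):
    """Distance between clusters a and b (iterables of motif names)."""
    pairs = [dist[frozenset((i, j))] for i in a for j in b]
    if rule == "single":      # closest pair of members
        return min(pairs)
    if rule == "complete":    # most distant pair of members
        return max(pairs)
    if rule == "average":     # mean over all pairs of members
        return sum(pairs) / len(pairs)
    raise ValueError(f"unknown linkage rule: {rule}")


for rule in ("single", "average", "complete"):
    print(rule, cluster_distance({"m1", "m2"}, {"m3"}, DIST, rule))
```

With the same pairwise distances, the three rules give different distances between {m1, m2} and m3, which is why the tree topology and the number of clusters can change with the selected rule.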
In the menu 'Merge matrices', you can control if the counts of the aligned matrices will be summed or averaged.
Select sum.
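The difference between the two options can be sketched as follows for two aligned count matrices. The matrices and offsets are invented, and dividing by the total number of matrices in the 'mean' case (even for columns covered by a single matrix) is an assumption of this sketch, not a documented behaviour of matrix-clustering.

```python
def merge_aligned(matrices, offsets, how="sum"):
    """Merge count matrices (lists of [A, C, G, T] columns) whose first
    columns start at the given offsets in the alignment."""
    if how not in ("sum", "mean"):
        raise ValueError(f"unknown merge operator: {how}")
    width = max(off + len(m) for m, off in zip(matrices, offsets))
    merged = [[0.0] * 4 for _ in range(width)]
    for m, off in zip(matrices, offsets):
        for i, col in enumerate(m):
            for k, value in enumerate(col):
                merged[off + i][k] += value        # column-wise sum of counts
    if how == "mean":
        n = len(matrices)                          # assumed denominator
        merged = [[v / n for v in col] for col in merged]
    return merged


A = [[10, 0, 0, 0], [0, 0, 0, 10]]   # toy motif "AT", built from 10 sites
B = [[0, 0, 0, 20], [0, 0, 20, 0]]   # toy motif "TG", built from 20 sites

print(merge_aligned([A, B], [0, 1], how="sum"))
print(merge_aligned([A, B], [0, 1], how="mean"))
```

With 'sum', a matrix built from many sites weighs more in the merged motif than one built from few sites; with 'mean', the counts stay on the scale of the input matrices.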
In the section 'Output file options', check the heatmap option.
In the section 'Labels displayed in the logo tree', check the name and ic (Information Content) options. They will be displayed in the motif trees.
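As a reminder of what the ic label measures, here is a minimal sketch of the information content of a count matrix. It assumes a uniform background (0.25 per residue), log base 2 and no pseudo-counts. RSAT may use a different background model, log base or pseudo-count, so the values shown in the trees need not match this calculation exactly.

```python
from math import log2


def column_ic(counts, background=0.25):
    """Information content (bits) of one column of counts [A, C, G, T]."""
    total = sum(counts)
    ic = 0.0
    for n in counts:
        if n > 0:                       # a zero count contributes nothing
            f = n / total
            ic += f * log2(f / background)
    return ic


def matrix_ic(matrix):
    """Total information content: sum over the columns."""
    return sum(column_ic(col) for col in matrix)


print(column_ic([10, 0, 0, 0]))   # fully conserved position
print(column_ic([5, 5, 0, 0]))    # two equally frequent residues
print(column_ic([5, 5, 5, 5]))    # uninformative position
```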
In this example we will study a set of motifs discovered in a ChIP-seq experiment for the TF Oct4 (Pou5f1), an essential TF in cell fate decision, ES cells and early embryonic development. It binds the canonical sequence 5'-ATGCAAAT-3'.
In ES cells, Oct4 often interacts with another TF, Sox2, which binds to an adjacent Sox motif 5'-CATTGTA-3'. Together, both TFs coregulate specific genes.
When Oct4 or Sox2 binding peaks are analyzed, the so-called SOCT motif is usually found: a composite motif encompassing both the Oct and Sox motifs (Figure 1).
The results page is divided into sections; you can expand or collapse each section by clicking on it.
The section Results Summary contains the parameters specified and the number of input motifs and collections.
The section Clusters Summary shows, for each cluster, the cluster size, the collections its motifs come from and the logo of the root motif. The table can be sorted by clicking on a column header. The last column lets you download the root motif.
The section Logo Forest shows, for each cluster, a hierarchical tree with the aligned logos at each branch. The tree is dynamic: by clicking on a node, you can collapse or expand it to manually control the cluster visualization. The red buttons at the end of each tree let you change the motif orientation and show or hide the IC.
Can you manually reduce the cluster_1 to 6 non-redundant motifs?
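Changing the orientation of a motif in the Logo Forest amounts to taking its reverse complement. The sketch below shows the operation on a matrix stored as a list of [A, C, G, T] columns. The one-hot matrix built from the canonical Oct sequence is a toy example, and the helper names are hypothetical.

```python
def reverse_complement(matrix):
    """Reverse the column order and swap A<->T and C<->G.
    With rows ordered [A, C, G, T], the swap is a reversal of each column."""
    return [col[::-1] for col in reversed(matrix)]


def one_hot(sequence):
    """Toy matrix with a single count for the residue at each position."""
    return [[1 if r == s else 0 for r in "ACGT"] for s in sequence]


def consensus(matrix):
    """Most frequent residue at each position."""
    return "".join("ACGT"[col.index(max(col))] for col in matrix)


oct_motif = one_hot("ATGCAAAT")
print(consensus(oct_motif))
print(consensus(reverse_complement(oct_motif)))
```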
The section Individual Motif View is a dynamic table with the information of each motif.
The section Individual Cluster View shows each cluster separately, along with the order in which the motifs were incorporated. You can click on each node to select its corresponding branch-motif.
The section Heatmap View shows a matrix with all the motifs compared against each other. The color scale ranges from low (yellow) to high (red) similarity between motifs. The colored lines on both sides indicate the cluster of each motif.
Figure 1. 3D model showing the cooperative binding of the Sox2 and Oct4 TFs, which interact closely to bind DNA. Together, they recognize a composite motif called the SOCT motif (SOx+OCT).
Oct4 ChIP-seq discovered motifs table WITH thresholds
Figure 2. Table with the 21 motifs discovered by peak-motifs in the Oct4 ChIP-seq peaks and analyzed with matrix-clustering. Thresholds: Ncor >= 0.4; cor >= 0.6.
Cluster 1 logo tree
Figure 3. Logo tree of cluster 1 found in the Oct4 ChIP-seq motifs. The hierarchical tree displays the logo alignment in both orientations. A branch-motif is computed for each branch.
Cluster 1 branch-motifs table
Figure 4. Branch-motif table for cluster 1. You can download the motif in TRANSFAC format or the logo in both orientations by clicking on them.
Logo Forest
Figure 5. Logo Forest with the 21 motifs discovered by peak-motifs in the Oct4 ChIP-seq peaks. Using a combination of threshold values (cor = 0.6; Ncor = 0.4), these motifs were separated into 6 clusters, each displayed as a tree.
Branch-motif analysis
Figure 6. The logo tree of cluster 1 showing the branch-motifs.
Figure 7. Three examples of consensus trees obtained with the same data (the 21 motifs discovered by peak-motifs in the Oct4 ChIP-seq peaks) and the same threshold values (cor >= 0.4; Ncor >= 0.6).
In this picture, only the hierarchical clustering agglomeration rule changes. From top to bottom: average, complete, single linkage. Each cluster is represented with a different color. Observe how the number of clusters and the tree topology change depending on the selected method.
Logo tree for cluster_1
Logo tree for cluster_3
Figure 8. Logo trees for clusters 1 and 3, which correspond to the SOCT and Oct4 motifs respectively. The threshold parameters used were Ncor >= 0.55 and cor >= 0.625.
Summary table
Figure 9. Summary table of example matrix-clustering results obtained with randomized motifs.