NAME

matrix-clustering

DESCRIPTION

Taking as input one or several sets of position-specific scoring matrices (PSSM), this program applies hierarchical clustering to identify clusters of similar motifs. It produces a set of trees (one per cluster) and builds branch motifs at each node of each tree by merging the matrices of all descendant nodes.

DEPENDENCIES

Several R packages are required by matrix-clustering to convert the hierarchical tree into different output formats, to manipulate the exported dendrogram, and to produce heatmaps.

    RJSONIO : http://cran.r-project.org/web/packages/RJSONIO/index.html
    ctc : http://www.bioconductor.org/packages/release/bioc/html/ctc.html
    dendextend : http://cran.r-project.org/web/packages/dendextend/index.html
    gplots : http://cran.r-project.org/web/packages/gplots/index.html
    

Visualizing the logo forest requires the JavaScript library D3 (Data-Driven Documents). The user can select an option to connect directly to the server and load the functions of this library (see option -d3_base).

    D3 : http://d3js.org/
    

As matrix-clustering produces many files, we created a dynamic website showing the complete list of results. This website is built with the JavaScript library JQuery.

    JQuery: https://jquery.com/

AUTHORS

Implementation

Jacques.van-Helden@univ-amu.fr
Jaime Castro-Mondragon <j.a.c.mondragon@ncmm.uio.no>

Conception

Jacques van Helden
Jaime Castro-Mondragon

The following collaborators contributed to the definition of requirements for this program.

Carl Herrmann
Denis Thieffry
Morgane Thomas-Chollier

CATEGORY

util

USAGE

matrix-clustering [-matrix inputfile] [-o outputfile] [-v ] [...]

OUTPUT FORMAT

SEE ALSO

compare-matrices

The program compare-matrices is used by matrix-clustering to measure pairwise similarities and define the best alignment (offset, strand) between each pair of matrices.

WISH LIST

OPTIONS

-v #

Level of verbosity (detail in the warning messages during execution)

-h

Display full help message

-help

Same as -h

-matrix matrix_title input_matrix_file

The input file contains a set of position-specific scoring matrices.

Example: -matrix Oct_motifs Oct_motifs_peakmotifs.tf tf

The matrix_title will be concatenated to each motif ID in order to create unique motif IDs. The collection label is displayed in the results.

This label is useful when two motifs for the same TF come from different files and the user wants to know which collection each motif comes from.

Supported matrix formats

Since the program takes several matrices as input, it only accepts matrices in formats supporting several matrices per file (transfac, tf, tab, cluster-buster, cb, infogibbs, meme, stamp, uniprobe).

For a description of these formats, see the help of convert-matrix.

-matrix_file_table matrix_file_table

This option is recommended when the number of input files is large (> 20), which would result in a very long command line that some programs are not able to read.

The input file contains a tab-delimited table with two columns:

    1) The motif file: the path to the file with the motifs
    2) The collection label

Example:

    Sox_pssms.tf	Sox
    Oct_pssms.tf	Oct
    Nanog_pssms.tf	Nanog
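A minimal sketch, in Python, of how such a table could be written and read back. The file name matrix_table.tab and the two helper functions are hypothetical; only the two-column, tab-delimited layout comes from the description above.

```python
import csv
import os
import tempfile

# Collections from the example above: (motif file, collection label).
rows = [
    ("Sox_pssms.tf", "Sox"),
    ("Oct_pssms.tf", "Oct"),
    ("Nanog_pssms.tf", "Nanog"),
]

def write_matrix_file_table(path, rows):
    """Write one 'motif_file<TAB>collection_label' line per collection."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)

def read_matrix_file_table(path):
    """Return a list of (motif_file, collection_label) tuples."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        return [tuple(fields) for fields in reader if fields]

# Hypothetical file name, written to a temporary directory.
table_path = os.path.join(tempfile.mkdtemp(), "matrix_table.tab")
write_matrix_file_table(table_path, rows)
parsed = read_matrix_file_table(table_path)
```

The resulting file can then be passed to the option -matrix_file_table.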

-matrix_format matrix_format

Specify the input matrix format.

Supported matrix formats

Since the program takes several matrices as input, it only accepts matrices in formats supporting several matrices per file (transfac, tf, tab, clusterbuster, cb, infogibbs, meme, stamp, uniprobe).

For a description of these formats, see the help of convert-matrix.

This option adds a link to any website specified by the user. It can be used to visualize complete databases (e.g. Jaspar): each motif in the logo tree then points to its respective page on the Jaspar website.

Format: a tab-separated file

    Column 1: Motif ID (the same as in the input motif file)
    Column 2: Link to any website
    Column 3: Color in hexadecimal code

This option may be combined with -radial_tree_only. By default the motif names are displayed in black, unless a color is specified in the third column.
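The following Python sketch shows one way to read such a three-column table, falling back to black when no color is given. The function name, the motif IDs and the example.org links are hypothetical and only serve as an illustration.

```python
def parse_link_table(text, default_color="#000000"):
    """Parse 'motif_id<TAB>link[<TAB>hex_color]' lines into a dictionary."""
    links = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore empty lines
        fields = line.split("\t")
        motif_id, url = fields[0], fields[1]
        # The third column is optional: use black when it is missing or empty.
        color = fields[2] if len(fields) > 2 and fields[2] else default_color
        links[motif_id] = {"link": url, "color": color}
    return links

# Hypothetical table content.
table_text = (
    "motif_1\thttp://example.org/motif_1\t#FF0000\n"
    "motif_2\thttp://example.org/motif_2\n"
)
links = parse_link_table(table_text)
```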

-title title

Title displayed on top of the report page.

-radial_tree_only

Generates a radial motif tree (option recommended for visualization purposes). When this option is indicated, all the input motifs are forced to be aligned in a single alignment displayed in a radial tree (this tree is not interactive). This option skips the generation of branch-motifs and the generation of the dynamic output (e.g., heatmaps).

-o output_prefix

Prefix for the output files.

Mandatory option, since matrix-clustering returns a list of output files (pairwise matrix comparisons, matrix clusters).

-heatmap_position_tree [row,col,both,none]

The position in the heatmap where the hierarchical tree will be displayed.

-task tasks

Specify one or several tasks to be run. If this option is not specified, all the tasks are run.

Note that some tasks depend on others. This option should thus be used with caution, by advanced users only.

Supported tasks: (all, comparison, clustering, report)

all

Execute all the parts of the program (default)

comparison

Run the motif comparison step. The input motifs are compared against themselves. The output is the pairwise comparison between the input motifs and a description table showing the main features of each motif (name, id, consensus, width).

clustering

Skip the matrix comparison step and only execute the clustering step.

Assumes the user already has the description table and the comparison table exported by the program compare-matrices.

This option is ideal for saving time once all comparisons between the input motifs have been done.

-label_in_tree

Option to select the labels displayed in the logo tree.

Supported labels

 (name, consensus, id)
 
-label_motif

Option to select the labels displayed in the cluster summary.

Supported labels

 (name, consensus, collection)

Default: name::collection

-quick

With this option the motif comparison is done with the program compare-matrices-quick (implemented in C) rather than the program compare-matrices (implemented in Perl). The quick version runs 100 times faster, but does not implement all the options of the Perl version.

We suggest using this option for large sets of input motifs (> 300 motifs).

NOTE: for the moment only a few thresholds can be used with this option (cor, Ncor, w).

-no_clone_input

Deactivate the option to copy the input file. If more than one collection of motifs is provided, they are exported in a single file.

NOTE: take into account the input file size

-rand

When this option is selected, the columns of the input motifs are randomly permuted (thus conserving the information content), and the new motifs are used as input for the pairwise comparison and clustering.
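As an illustration, the column permutation can be sketched in Python as follows. This is not the program's own code: the function name and the matrix layout (one row per residue, one column per position) are assumptions made for the example.

```python
import random

def permute_columns(matrix, seed=None):
    """Return a copy of the matrix with its columns (positions) shuffled.

    Assumed layout: a list of rows, one per residue (A, C, G, T),
    each row holding one count per position.
    """
    rng = random.Random(seed)
    order = list(range(len(matrix[0])))
    rng.shuffle(order)
    # Apply the same column order to every row, so each column stays intact.
    return [[row[j] for j in order] for row in matrix]

# Hypothetical count matrix with 4 positions.
matrix = [
    [10, 0, 2, 5],   # A
    [0, 10, 3, 5],   # C
    [0, 0, 3, 5],    # G
    [0, 0, 2, 5],    # T
]
permuted = permute_columns(matrix, seed=1)
```

Because whole columns are moved and their content is left untouched, the information content of each column, and hence of the matrix, is unchanged.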

-heatmap_color_palette Color_Palette

Select the color palette used in the heatmaps (sequential scales). The color palettes (and their names) are taken from the ColorBrewer2 website (http://colorbrewer2.org/).

Supported: YlOrRd,YlOrBr,YlGnBu,YlGn,PuRd,PuBuGn,PuBu,OrRd,GnBu,BuPu,BuGn,Reds,Purples,Oranges,Greys,Greens,Blues

Default: YlOrRd

-heatmap_color_classes X

This option specifies in how many color classes the color palette will be divided.

    For sequential color palettes: max 9
    For diverging color palettes: max 11

If the user specifies a number of classes greater than the maximum allowed, the program uses this maximum value.

For more information see ColorBrewer2 website (http://colorbrewer2.org/)
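The rule on the maximum number of classes amounts to a simple clamp, sketched here in Python. The function name and the palette_type argument are hypothetical.

```python
# Maximum number of color classes per palette type, as stated above.
MAX_CLASSES = {"sequential": 9, "diverging": 11}

def effective_color_classes(requested, palette_type="sequential"):
    """Return the number of classes actually used for the heatmap."""
    return min(requested, MAX_CLASSES[palette_type])

sequential_classes = effective_color_classes(12)               # clamped to 9
diverging_classes = effective_color_classes(12, "diverging")   # clamped to 11
unchanged_classes = effective_color_classes(5)                 # kept at 5
```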

-max_matrices X

This option specifies how many matrices can be clustered in the same analysis. If there are more matrices than the specified number, the program restricts the analysis to the first X matrices, and issues a warning.

This parameter can be useful to prevent submission of excessive datasets to the Web server, or for running quick tests before starting the analysis of a large matrix collection.

-hclust_method

Option to select the agglomeration rule for hierarchical clustering.

Supported agglomeration rules:

complete

Compute inter-cluster distances based on the two most distant nodes.

average (default)

Compute inter-cluster distances as the average distance between nodes belonging to the respective clusters (UPGMA).

single

Compute inter-cluster distances based on the closest nodes.
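The three agglomeration rules differ only in how the pairwise distances between two clusters are summarized. A minimal Python sketch, with a hypothetical function name and a made-up distance table:

```python
def inter_cluster_distance(dist, cluster_a, cluster_b, method="average"):
    """Distance between two clusters, given a table of pairwise distances.

    dist is a square table (list of lists); clusters are lists of indices.
    """
    pairwise = [dist[a][b] for a in cluster_a for b in cluster_b]
    if method == "complete":
        return max(pairwise)                  # two most distant nodes
    if method == "single":
        return min(pairwise)                  # two closest nodes
    if method == "average":
        return sum(pairwise) / len(pairwise)  # mean over all pairs (UPGMA)
    raise ValueError("unsupported agglomeration rule: " + method)

# Made-up distances between four motifs (indices 0 to 3).
dist = [
    [0.0, 0.2, 0.6, 0.8],
    [0.2, 0.0, 0.5, 0.7],
    [0.6, 0.5, 0.0, 0.1],
    [0.8, 0.7, 0.1, 0.0],
]
d_complete = inter_cluster_distance(dist, [0, 1], [2, 3], "complete")  # 0.8
d_single = inter_cluster_distance(dist, [0, 1], [2, 3], "single")      # 0.5
d_average = inter_cluster_distance(dist, [0, 1], [2, 3], "average")    # about 0.65
```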

-top_matrices X

Only analyze the first X motifs of the input file. This option is convenient for quick testing before starting the full analysis.

If several motif files are specified, the selection of top motifs is performed independently for each motif collection (the max number of motifs will thus be X * the number of input files).
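The per-collection selection can be pictured with this short Python sketch. The function name and the motif identifiers are hypothetical.

```python
def select_top_matrices(collections, top):
    """Keep the first `top` motifs of each collection, independently."""
    return {label: motifs[:top] for label, motifs in collections.items()}

# Hypothetical collections with 3, 2 and 1 motifs.
collections = {
    "Sox": ["Sox_m1", "Sox_m2", "Sox_m3"],
    "Oct": ["Oct_m1", "Oct_m2"],
    "Nanog": ["Nanog_m1"],
}
selected = select_top_matrices(collections, top=2)
total_selected = sum(len(motifs) for motifs in selected.values())
```

Here 5 motifs are selected, which stays below the upper bound of 2 * 3 = 6.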

-skip_matrices X

Skip the first X motifs of the input file. This option is convenient for testing the program on a subset of the motifs before starting the full analysis.

If several motif files are specified, the option is applied to each file independently.

-metric_build_tree metric

Select the metric used to measure motif similarity and to cluster the motifs. This metric can be a similarity or a distance; in both cases the values are converted to a distance table, which is used as input for the hierarchical clustering.

Supported metrics : cor, Ncor, dEucl, NdEucl, logocor, logoDP, Nlogocor, Icor, NIcor, SSD, rank_mean, mean_zscore

Default: Ncor
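This help page does not state how similarities are converted to distances. The Python sketch below assumes the common conversion d = 1 - s for a similarity table and leaves a distance table unchanged; both the conversion and the function name are assumptions, not a description of the actual implementation.

```python
def to_distance_table(table, is_similarity):
    """Return a distance table from a square similarity or distance table.

    Assumption: similarities are converted with d = 1 - s.
    """
    if is_similarity:
        return [[1.0 - value for value in row] for row in table]
    return [row[:] for row in table]  # already distances: plain copy

# Made-up Ncor values between two motifs.
ncor_table = [
    [1.0, 0.4],
    [0.4, 1.0],
]
distance_table = to_distance_table(ncor_table, is_similarity=True)
```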

-lth param lower_threshold
-uth param upper_threshold

Threshold on some parameters (-lth: lower, -uth: upper threshold).

Once the hierarchical tree is built, it is traversed bottom-up. On each branch, the descendant motifs are evaluated according to the agglomeration rule selected by the user (average, complete, single).

In this algorithm, thresholds on several metrics can be combined.

If the descendant motifs of a particular branch do not satisfy the thresholds, a new cluster is created.

For a complete description of the thresholds and the motif comparison metrics, see the help of compare-matrices.

Suggested thresholds:

    cor >= 0.7

    Ncor >= 0.4

    w >= 5
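The evaluation of a branch against combined lower thresholds can be sketched as follows in Python. This is an illustration under stated assumptions, not the program's algorithm: the function name is hypothetical, only lower thresholds on similarity-like values are handled, and the agglomeration rule is assumed to summarize the pairwise values as their mean (average), their minimum (complete, the least similar pair) or their maximum (single, the most similar pair).

```python
def branch_satisfies_thresholds(pair_stats, lower_thresholds, method="average"):
    """Check the descendant motifs of a branch against lower thresholds.

    pair_stats: one dict of metric values per pair of descendant motifs.
    lower_thresholds: dict mapping a metric name to its lower threshold.
    """
    for metric, threshold in lower_thresholds.items():
        values = [stats[metric] for stats in pair_stats]
        if method == "average":
            summary = sum(values) / len(values)
        elif method == "complete":
            summary = min(values)   # least similar pair
        elif method == "single":
            summary = max(values)   # most similar pair
        else:
            raise ValueError("unsupported agglomeration rule: " + method)
        if summary < threshold:
            return False            # the branch is split into separate clusters
    return True

# Suggested thresholds from above, and made-up values for three motif pairs.
thresholds = {"cor": 0.7, "Ncor": 0.4, "w": 5}
pair_stats = [
    {"cor": 0.90, "Ncor": 0.60, "w": 8},
    {"cor": 0.75, "Ncor": 0.35, "w": 6},
    {"cor": 0.80, "Ncor": 0.50, "w": 7},
]
ok_average = branch_satisfies_thresholds(pair_stats, thresholds, "average")    # True
ok_complete = branch_satisfies_thresholds(pair_stats, thresholds, "complete")  # False
ok_single = branch_satisfies_thresholds(pair_stats, thresholds, "single")      # True
```

With the complete rule the branch is rejected because one pair has Ncor = 0.35, below the threshold of 0.4.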

-calc merging_stat

Specify the operator used to merge matrices (argument passed to merge-matrices).

Supported:

mean (default)

Each cell of the output matrix contains the mean of the values found in the corresponding cell of the input matrices.

sum

Each cell of the output matrix contains the sum of the values found in the corresponding cell of the input matrices.

Note: the option diff, supported by merge-matrices, is not accepted for matrix-clustering.
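A minimal Python sketch of the two supported operators, assuming the input matrices are already aligned and have the same dimensions. The function name is hypothetical and the matrices are made up.

```python
def merge_matrices(matrices, calc="mean"):
    """Merge aligned matrices cell by cell with the 'mean' or 'sum' operator."""
    if calc not in ("mean", "sum"):
        raise ValueError("unsupported operator: " + calc)
    n_rows, n_cols = len(matrices[0]), len(matrices[0][0])
    merged = []
    for i in range(n_rows):
        row = []
        for j in range(n_cols):
            total = sum(matrix[i][j] for matrix in matrices)
            row.append(total / len(matrices) if calc == "mean" else total)
        merged.append(row)
    return merged

# Two made-up 2 x 2 matrices.
m1 = [[2, 0], [0, 4]]
m2 = [[4, 2], [2, 0]]
merged_sum = merge_matrices([m1, m2], calc="sum")    # [[6, 2], [2, 4]]
merged_mean = merge_matrices([m1, m2], calc="mean")  # [[3.0, 1.0], [1.0, 2.0]]
```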

-trim_threshold #

Trimming threshold.

Left- and right-most columns whose information content is smaller than this threshold will be trimmed, to avoid exporting large matrices with non-informative flanks.

Beware: in some cases the trimming can deteriorate the motifs, by cutting moderately informative positions.
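The trimming of non-informative flanks can be sketched in Python as below. The information content is computed here in bits, with a uniform background and no pseudo-counts; these choices, the function names and the example matrix are assumptions, and the actual computation may differ.

```python
import math

def column_information(counts):
    """Information content (bits) of one column, uniform background assumed."""
    total = sum(counts)
    if total == 0:
        return 0.0
    information = math.log2(len(counts))
    for count in counts:
        if count > 0:
            frequency = count / total
            information += frequency * math.log2(frequency)
    return information

def trim_matrix(columns, threshold):
    """Remove flanking columns whose information content is below threshold.

    columns: list of per-position count lists (A, C, G, T).
    Internal columns are kept even if they are poorly informative.
    """
    start, end = 0, len(columns)
    while start < end and column_information(columns[start]) < threshold:
        start += 1
    while end > start and column_information(columns[end - 1]) < threshold:
        end -= 1
    return columns[start:end]

# Made-up matrix: uniform flanks around an informative core.
columns = [
    [5, 5, 5, 5],
    [20, 0, 0, 0],
    [6, 5, 5, 4],
    [0, 0, 20, 0],
    [5, 5, 5, 5],
]
trimmed = trim_matrix(columns, threshold=0.5)
```

Only the two uniform flanking columns are removed; the poorly informative column in the middle is kept because trimming stops at the first informative column on each side.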

-return return_fields

List of fields to return.

Supported fields:

 heatmap,json,newick,root_matrices

clone_input: Copy input file.

When this field is selected, the input motif database is copied and exported in the results folder.

NOTE: take into account the input file size.

heatmap: Heatmap with similarities.

When this field is selected, exports a heatmap showing the similarities, the clusters and the hierarchical tree of the input motifs.

The heatmap is exported in JPEG and PDF format.

We recommend using this option when the number of motifs is lower than 300.

json: Hierarchical tree in JSON format.

File format used by the D3 library to visualize the logo forest in HTML.

The hierarchical tree in JSON format is always exported, since it is required to display the logo tree with the d3 library.

newick: Hierarchical tree in newick format.

When this field is specified, the hierarchical tree is converted and exported in Newick format, a widely used text format to represent phylogenetic trees.
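To illustrate the relation between a nested tree and its Newick representation, here is a Python sketch. The nested structure with "name" and "children" keys is an assumption (a layout commonly used for hierarchies); the function name and the motif names are hypothetical, and branch lengths are omitted.

```python
def to_newick(tree):
    """Convert a nested tree (dicts with 'name' and 'children') to Newick."""
    def format_node(node):
        children = node.get("children")
        if children:
            inner = ",".join(format_node(child) for child in children)
            return "(" + inner + ")" + node.get("name", "")
        return node["name"]  # leaf: the motif name only
    return format_node(tree) + ";"

# Hypothetical tree with three motifs.
tree = {
    "name": "",
    "children": [
        {"name": "node_1", "children": [{"name": "Sox_m1"}, {"name": "Sox_m2"}]},
        {"name": "Oct_m1"},
    ],
}
newick = to_newick(tree)  # "((Sox_m1,Sox_m2)node_1,Oct_m1);"
```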

root_matrices: Return only the root motif of each cluster.

When this field is specified, matrix-clustering runs the minimal analysis and returns a text file with the root motifs of each cluster.

This option is useful when the user wants to explore the data and to avoid the computation of the visual elements.