RSAT - info-gibbs manual

Name

info-gibbs

2009 by Matthieu Defrance

Description

info-gibbs is a motif discovery program based on a Gibbs sampling strategy. Given a set of sequences, a motif length, and a background model, it searches for motifs (PSSMs, position-specific scoring matrices) with the highest relative entropy (information content).
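To make the scoring criterion concrete, the relative entropy of a count matrix against a background can be sketched as follows. This is a minimal illustration, not the info-gibbs implementation: the function name, the order-0 (Bernoulli) background, the pseudo-count scheme, and the use of base-2 logarithms (bits) are all assumptions made for the example.

```python
import math

def information_content(counts, background, pseudo=1.0):
    """Relative entropy (in bits) of a count matrix against a background.

    counts: list of dicts, one per motif column, mapping residue -> count.
    background: dict mapping residue -> prior probability (order-0 model,
        an assumption of this sketch).
    pseudo: total pseudo-count added per column, shared out according to
        the background priors (also an assumption).
    """
    total = 0.0
    for column in counts:
        n = sum(column.values())
        for residue, prior in background.items():
            freq = (column.get(residue, 0) + pseudo * prior) / (n + pseudo)
            if freq > 0:  # 0 * log(0) is taken as 0 by convention
                total += freq * math.log2(freq / prior)
    return total

uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
conserved = [{"A": 10}, {"C": 10}]            # two perfectly conserved columns
flat = [{"A": 5, "C": 5, "G": 5, "T": 5}]     # one column matching the background
```

A column whose frequencies match the background contributes nothing to the score, while conserved columns contribute the most, which is why a sampler that maximizes this quantity favors well-conserved motifs.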

Reference

Defrance M. and van Helden J. info-gibbs: a motif discovery algorithm that directly optimizes information content during sampling, Bioinformatics. 2009;25:2715-2722.

Options

Input sequence:
The sequences to be analyzed. Multiple sequences can be entered at once, in any of the supported sequence formats.

Format:
Input sequence format. Various standards are supported.

Sequence type:
Only A, C, G, and T residues are accepted. Oligomers that contain undefined (N) or partly defined (IUPAC code) nucleotides are discarded.
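The filtering rule can be sketched as a sliding window that keeps only fully defined oligomers. The function name is hypothetical, and treating lower-case input as equivalent to upper-case is an assumption of this example rather than documented behavior.

```python
def valid_oligomers(sequence, length):
    """Yield (position, oligomer) for each window made only of A, C, G, T."""
    allowed = set("ACGT")
    seq = sequence.upper()  # assumption: case is not significant
    for start in range(len(seq) - length + 1):
        oligo = seq[start:start + length]
        if set(oligo) <= allowed:
            yield start, oligo

# Windows overlapping the N at position 3 are discarded.
kept = list(valid_oligomers("ACGNTTA", 3))
```

Note that a single undefined residue removes every window that overlaps it, so long runs of N can noticeably reduce the number of candidate sites.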

Purge sequences (highly recommended)
When checked, large duplicated regions (>= 40 bp alignment with fewer than 3 mismatches) are filtered out before analysis. Purging is essential for any motif discovery process, to avoid a bias due to non-independence of sequences. Purging is performed with the programs mkvtree and vmatch developed by Stefan Kurtz (kurtz@zbh.uni-hamburg.de).

Search both strands: (single or both strands)
When "search both strands" is selected, occurrences of the motif are searched on both strands. This makes it possible to detect elements that act in an orientation-insensitive way (as is generally the case for yeast upstream elements).
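Searching both strands amounts to considering, at each position, both the word as read and its reverse complement. The sketch below illustrates the idea; the function names and the (position, strand, word) output layout are hypothetical and chosen only for this example.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(site):
    """Return the reverse complement of a DNA word made of A, C, G, T."""
    return site.translate(_COMPLEMENT)[::-1]

def candidate_sites(sequence, length, both_strands=True):
    """Yield (position, strand, word) for every window of the sequence.

    With both_strands=True each window is reported twice: once as read
    ("+") and once as its reverse complement ("-").
    """
    for start in range(len(sequence) - length + 1):
        word = sequence[start:start + length]
        yield start, "+", word
        if both_strands:
            yield start, "-", reverse_complement(word)

sites = list(candidate_sites("ACGT", 3))
```

With single-strand search the candidate set is halved, which is appropriate only when the orientation of the element is expected to matter.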

Matrix length:
The length of the motif. For the detection of regulatory sites, we recommend starting with a length between 6 and 16.

Expected number of matches per sequence:
This option specifies the number of motif occurrences expected in each input sequence.

Number of motifs to extract:
This option makes it possible to search for more than one motif in the input sequences.

Maximum number of iterations:
This option sets the maximum number of iterations of the algorithm.

Number of runs:
Because of its stochastic behavior, info-gibbs can return different results each time it is run. The core algorithm should be repeated a sufficient number of times to produce useful results.
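The reason for repeating runs can be illustrated with a generic "best of N" loop around any stochastic search. This is only a sketch of the general pattern: best_of_runs, run_once, and toy_run are hypothetical names, and the toy search simply returns a random score in place of a real sampling run.

```python
import random

def best_of_runs(run_once, n_runs, seed=0):
    """Repeat a stochastic search and keep the highest-scoring result.

    run_once: callable taking a random.Random instance and returning
        a (score, result) pair (hypothetical interface).
    """
    best = None
    for i in range(n_runs):
        rng = random.Random(seed + i)  # independent, reproducible stream per run
        score, result = run_once(rng)
        if best is None or score > best[0]:
            best = (score, result)
    return best

def toy_run(rng):
    """Stand-in for one sampling run: a random score and a dummy result."""
    return rng.random(), "motif"

single = best_of_runs(toy_run, 1)
repeated = best_of_runs(toy_run, 20)
```

Since each run may end in a different local optimum, the best score over many runs can only equal or exceed the score of the first run, at the cost of a proportionally longer computation.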

Background:
The background model used to compute expected frequencies can be computed from the input sequences or loaded from a predefined Markov model. In the latter case, a target organism and a Markov order should be selected.
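As an illustration of what computing a background model from the input sequences involves, the sketch below estimates the transition probabilities of a Markov model of a given order by counting. The function name, the pseudo-count scheme, and the decision to skip words containing residues other than A, C, G, T are assumptions of this example, not a description of the RSAT implementation.

```python
from collections import Counter, defaultdict

def estimate_markov_model(sequences, order=1, pseudo=1.0):
    """Estimate P(residue | preceding `order` residues) from sequences.

    Returns a dict mapping each observed prefix to a dict of residue
    probabilities. Prefixes never observed are absent from the result.
    """
    alphabet = "ACGT"
    counts = defaultdict(Counter)
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            prefix, residue = seq[i - order:i], seq[i]
            # Skip words containing N or other IUPAC codes (assumption).
            if set(prefix + residue) <= set(alphabet):
                counts[prefix][residue] += 1
    model = {}
    for prefix, residue_counts in counts.items():
        total = sum(residue_counts.values()) + pseudo * len(alphabet)
        model[prefix] = {r: (residue_counts[r] + pseudo) / total
                         for r in alphabet}
    return model

model = estimate_markov_model(["ACACAC"], order=1, pseudo=0.0)
```

A model estimated from a small input set is based on few observations per prefix, especially at higher orders, which is one reason a predefined organism-specific model may be preferable.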