RSAT - footprint-scan manual



NAME

footprint-scan


DESCRIPTION

Scan promoters of orthologous genes with one or several position-specific scoring matrices (PSSM) in order to detect motifs showing a higher number of hits than expected by chance (over-represented motifs).


AUTHORS

Jacques van Helden <Jacques.van-Helden@univ-amu.fr>
Alejandra Medina-Rivera <amedina@lcg.unam.mx>


CATEGORY

comparative genomics


USAGE

footprint-scan [-m matrix_inputfile] [-o outputfile] [-v #] [...]


INPUT FORMAT

Query gene(s)

The analysis can be performed on a single gene, on several genes separately (option -sep_genes), or on a group of genes all together.

Query genes can be entered on the command line (option -q) or in a text file (option -genes). Alternatively, the option -all_genes will run the analysis on all the genes of a genome.

Position-specific scoring matrices (PSSMs)

footprint-scan requires a collection of at least one position-specific scoring matrix (PSSM).

All the formats supported by matrix-scan can be used to enter the matrices. However, we recommend using the TRANSFAC format, which supports multiple matrices (we usually want to scan promoters with a full collection of matrices), and associates an identifier with each matrix (e.g. the name of the transcription factor).

Example of TRANSFAC format

The following example shows a text file describing two matrices, representing the binding motifs annotated in RegulonDB for AgaR and AraC, respectively. Motifs must be separated by a line containing a double slash (//).

The complete file can be downloaded from RegulonDB (http://regulondb.ccg.unam.mx/).

 AC  ECK12_ECK120012515_AgaR.24
 XX
 ID  ECK12_ECK120012515_AgaR.24
 XX
 P0       A     T     C     G
 1        5     0     1     5
 2        6     1     4     0
 3        4     0     5     2
 4        5     4     0     2
 5        4     6     0     1
 6        1     5     3     2
 7        0     2     8     1
 8        4     1     1     5
 9        4     5     1     1
 10       3     8     0     0
 11       5     6     0     0
 12       1     8     1     1
 13       2     0     4     5
 14       4     5     2     0
 15       3     8     0     0
 16       3     8     0     0
 17       0     2     9     0
 18       0     2     2     7
 19       3     7     1     0
 20       4     7     0     0
 21       3     8     0     0
 22       3     4     0     4
 23       3     4     0     4
 24       3     4     3     1
 25       3     8     0     0
 XX
 //
 AC  ECK12_ECK120012316_AraC.18
 XX
 ID  ECK12_ECK120012316_AraC.18
 XX
 P0       A     T     C     G
 1        0    10     0     3
 2        7     4     1     1
 3        0     6     5     2
 4        2     2     3     6
 5        0     0     6     7
 6        9     0     0     4
 7        0     2     9     2
 8        2     7     3     1
 9        9     3     0     1
 10       7     4     0     2
 11       4     8     0     1
 12       3     3     5     2
 13       2    10     0     1
 14       2     7     1     3
 15       6     1     6     0
 16       0    11     2     0
 17       1     0     3     9
 18       1     5     5     2
 19       5     2     0     6
 XX
 //
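
The layout shown above (AC/ID lines, a P0 header naming the residue columns, one numbered row per position, and a // separator) can be read with a few lines of code. The following Python sketch is only an illustration of the file structure, not the parser used by matrix-scan; the function name parse_transfac and the identifier demo_matrix_1 are hypothetical.

```python
def parse_transfac(text):
    """Parse TRANSFAC-like text into a list of matrices (dicts)."""
    matrices = []
    ident, columns, rows = None, None, []

    def flush():
        # Store the current matrix if it has an identifier and count rows
        if ident is not None and rows:
            matrices.append({"id": ident, "columns": columns, "counts": rows})

    for raw in text.splitlines():
        fields = raw.split()
        if not fields:
            continue
        tag = fields[0]
        if tag == "//":                      # matrix separator
            flush()
            ident, columns, rows = None, None, []
        elif tag in ("AC", "ID") and len(fields) > 1:
            ident = fields[1]
        elif tag in ("P0", "PO"):            # header naming the residue columns
            columns = fields[1:]
        elif tag.isdigit() and columns is not None:
            rows.append([int(x) for x in fields[1:1 + len(columns)]])
    flush()                                  # tolerate a missing final "//"
    return matrices


# Two-position excerpt in the same layout (hypothetical identifier)
example = """
AC  demo_matrix_1
XX
ID  demo_matrix_1
XX
P0       A     T     C     G
1        5     0     1     5
2        6     1     4     0
XX
//
"""
parsed = parse_transfac(example)
print(parsed[0]["id"], parsed[0]["columns"], parsed[0]["counts"])
```

Note that the column order is taken from the P0 line (A T C G in the example above) rather than assumed.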


OUTPUT FORMAT

The result comprises several files: orthologs, upstream sequences, matrix-scan results, and feature-maps. By default, a directory is created for each query gene, with a name indicating the parameters:

 footprints/[taxon]/[Organism]/[gene]

Alternatively, the output folder can be specified manually with the option -o.
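
As an illustration of the default layout, the following Python sketch assembles the directory path from its three components. It only mirrors the pattern given above; the helper name default_output_dir is hypothetical, and the files that the program writes inside the directory are not covered here.

```python
from pathlib import PurePosixPath


def default_output_dir(taxon, organism, gene):
    """Build footprints/[taxon]/[Organism]/[gene] from its components."""
    return PurePosixPath("footprints") / taxon / organism / gene


output_dir = default_output_dir(
    "Enterobacteriaceae",
    "Escherichia_coli_GCF_000005845.2_ASM584v2",
    "sodA",
)
print(output_dir)
```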


EXAMPLES OF UTILIZATION

Detecting trans-acting factors for a single gene, with a collection of known motifs

Let us assume that we have a collection of PSSMs annotated for a given organism (e.g. the matrices for all the Escherichia coli transcription factors annotated in RegulonDB). We would like to scan the promoters of orthologs of a given gene, in order to predict the transcription factors that might be involved in its regulation. The program will count the hits for each matrix, and report those showing a significant enrichment in the promoters of its orthologs.

In this example, we use a slightly higher verbosity than usual (-v 2) in order to keep track of the progress of the analysis. This also reports the commands that are executed, and allows us to examine all their parameters.

 footprint-scan -v 2  -org Escherichia_coli_GCF_000005845.2_ASM584v2 \
    -taxon Enterobacteriaceae -q sodA -q lexA -q araC \
    -bgfile ${RSAT}/public_html/data/taxon_frequencies/Enterobacteriaceae/dyads_3nt_sp0-20_upstream-noorf_Enterobacteriaceae-noov-1str.freq.gz \
    -m RegulonDB_matrices_transfac_format.txt \
    -matrix_format transfac \
    -matrix_suffix RegulonDB \
    -sep_genes
 footprint-scan -v 2  -org Escherichia_coli_GCF_000005845.2_ASM584v2 \
    -taxon Enterobacteriaceae -q sodA  \
    -bgfile ${RSAT}/public_html/data/taxon_frequencies/Enterobacteriaceae/dyads_3nt_sp0-20_upstream-noorf_Enterobacteriaceae-noov-1str.freq.gz \
    -m RegulonDB_matrices.tab \
    -matrix_format tab \
    -matrix_suffix RegulonDB \
    -sep_genes

Detecting all putative target genes for a given transcription factor

Given a PSSM, we would like to detect new putative binding sites for a given transcription factor. The usual approach would be to retrieve all the upstream sequences of the organism of interest and then search for high-scoring sites with matrix-scan. However, a high score in one sequence does not imply that the site is a real binding site.

As we know, sequences with a functional relevance might be conserved through some branches of the phylogeny. We thus expect binding sites with a functional relevance to be conserved in a group of closely related orthologous sequences. footprint-scan can search for putative binding sites in the whole set of upstream regions of an organism while evaluating whether the detected binding sites are conserved (over-represented) in the respective orthologous sequences.

 footprint-scan -v 2  -org Escherichia_coli_GCF_000005845.2_ASM584v2 \
    -taxon Enterobacteriaceae -all_genes \
    -bgfile ${RSAT}/public_html/data/taxon_frequencies/Enterobacteriaceae/dyads_3nt_sp0-20_upstream-noorf_Enterobacteriaceae-noov-1str.freq.gz \
    -m MetJ_Regulon_matrix.tab \
    -matrix_format tab \
    -matrix_suffix RegulonDB \
    -sep_genes


SEE ALSO

footprint-discovery

The difference between footprint-scan and footprint-discovery is that footprint-scan requires prior knowledge of the motifs (in the form of position-specific matrices), whereas footprint-discovery performs ab initio motif discovery.


WISH LIST

Options to be added

-rand

When the option -rand is activated, footprint-scan scans random selections of promoters rather than promoters of orthologs.

This option serves to perform negative controls in order to estimate empirically the rate of false predictions and check its correspondence with the theoretical estimation of the significance.

The random selections are done by passing the option -rand to the program get-orthologs.

-crer

Return Cis-Regulatory elements Enriched-Regions (CRER).

Calculate the statistical significance of the number of hits in windows of variable sizes. The number of hits is the sum of matches above a predefined threshold set on hit p-values, for all matrices and on both strands (if -2str). The maximum size for a CRER is defined by the option -crer_max. The prior probability to find an instance of the motif is the same for all matrices, and corresponds to the chosen pval threshold. Within a region of maximal CRER size, subwindows are defined between hits, and the observed number of matches in a subwindow is the sum of hits above the threshold. The significance of the observed number of matches in a subwindow is estimated by calculating a P-value using the binomial distribution (Aerts et al., 2003).

-lth_crer_size

Minimal CRER size in bps

-crer_pval

Pval cutoff for selecting CRERs

-uth_crer_size

Maximal CRER size in bps
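
The binomial test mentioned in the description of -crer can be sketched as follows. This is a minimal illustration in Python, not the implementation used by matrix-scan: the function name is hypothetical, and the way the number of trials is derived here (positions x matrices x strands) is an assumption made for the example.

```python
from math import comb


def binomial_upper_tail(n_trials, n_hits, p_hit):
    """Return P(X >= n_hits) for X ~ Binomial(n_trials, p_hit)."""
    return sum(
        comb(n_trials, k) * p_hit**k * (1 - p_hit) ** (n_trials - k)
        for k in range(n_hits, n_trials + 1)
    )


# Hypothetical subwindow: 100 positions, 2 matrices, both strands
# (assumed trial count), with a site p-value threshold of 1e-3
n_trials = 100 * 2 * 2
crer_pvalue = binomial_upper_tail(n_trials, 3, 1e-3)
print(f"P(X >= 3) = {crer_pvalue:.3g}")
```

The prior probability p_hit plays the role of the chosen pval threshold, which is the same for all matrices.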

Revise the manual

The manual is still very incomplete; Jacques van Helden needs to revise and complete it.

Support as Web services

On the basis of the existing Web service for footprint-discovery.

Web interface

Alejandra Medina-Rivera will implement the Web interface. It would be more convenient to program the Web page after the Web services, in order to benefit from the support of Web services (including the token). To be checked with Morgane Thomas-Chollier & Olivier Sand.

Tutorial

It would be worth preparing a tutorial (or a chapter in Methods in Molecular Biology) to explain in detail the interpretation of the result.

The tutorial could cover the 3 interfaces (command-line, Web services and Web form).

Motif co-occurrences

After having detected the motifs in the different sequences, analyze their co-occurrences in order to report the factors having sites in the same sequences (putatively interacting factors). Actually, this option should be implemented in matrix-scan rather than footprint-scan, because it applies to any type of analysis.

Neighbour gene name in output table

Add the name of the upstream neighbour to the synthetic tables, in order to detect pairs of genes sharing the same promoter.


OPTIONS

-m matrix_file

Matrix file. This argument is mandatory.

This argument can be used iteratively to scan the sequence with multiple matrices.

-matrix_format matrix_format

Matrix format. Default is tab. This argument is mandatory.

-matrix_suffix matrix_suffix

Matrix suffix. This argument is mandatory.

The matrix suffix indicates the nature of the matrix file. For example, if your matrix file contains a single matrix for a transcription factor (say LexA), you can indicate it with

-matrix_suffix LexA

whereas if your matrix file contains all the matrices from the RegulonDB database, you can specify

-matrix_suffix RegulonDB

The matrix suffix will be concatenated to the output prefix, in order to maintain separate output files for distinct analyses performed on the same promoter sequences. For example, if you run the analysis successively with the matrix LexA and then with the matrix CRP, you don't want to lose the results of the first scan when running the second one.

-tf transcription_factor

Most matrices are derived from specific TFBS, so they represent the preferential sequence where a TF binds. This option will search for all the genomes in the given taxon where there is an ortholog for the specified TF. Orthologs for the query genes will only be retrieved if the organism has an ortholog for the TF.

-tf gene_name

If the option -matrix_table is used, then instead of giving the name of a specific TF, indicate that the TF names are in the file using:

-tf file
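
The taxonomic filter applied by -tf can be pictured with the following Python sketch. It is a toy illustration of the selection logic only: the genome names, gene identifiers and the function select_genomes are hypothetical, and the actual ortholog detection is done by the program itself.

```python
def select_genomes(taxon_genomes, tf_orthologs):
    """Keep only the genomes in which an ortholog of the TF was found."""
    return [genome for genome in taxon_genomes if tf_orthologs.get(genome)]


# Hypothetical taxon with three genomes; the TF has no ortholog in genome_B
taxon_genomes = ["genome_A", "genome_B", "genome_C"]
tf_orthologs = {"genome_A": "geneA_123", "genome_C": "geneC_987"}

selected = select_genomes(taxon_genomes, tf_orthologs)
print(selected)  # orthologs of the query genes are retrieved only in these
```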

-pseudo #

Pseudo-count for the matrix (default: 1). See matrix-scan for details.

-bgfile background_file

Background model file.

-bg_format background_format

Format of background model file. For supported formats see: convert-background-model -h

-bginput

Calculate background model from the input sequence set.

-markov

Order of the Markov chain for the background model.

This option is incompatible with the option -bgfile.

-window

Size of the sliding window for the background model calculation. When this option is specified, the matrix pseudo-count is equally distributed.

The background model is calculated locally at each step of the scan, by computing transition frequencies from a sliding window centred around the considered segment. The model is thus updated at each scanned position. This model is called "adaptive". Note that the sliding window must be large enough to train the local Markov model. The required sequence length increases exponentially with the Markov order. This option is thus usually suitable for low order models only (-markov 0 to 1).

This option is incompatible with the option -bgfile.
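
To see why the required sequence length grows exponentially with the Markov order, one can count the transition frequencies that have to be estimated within each window. The Python sketch below is only an illustration (the function name is hypothetical): for a DNA alphabet, a model of order k has 4^k possible prefixes, each followed by one of 4 residues.

```python
def transition_count(markov_order, alphabet_size=4):
    """Number of prefix -> residue transition frequencies to estimate."""
    return alphabet_size ** (markov_order + 1)


for order in range(4):
    print(f"order {order}: {transition_count(order)} transitions")
```

A short sliding window contains too few observations to estimate the transitions of a high-order model reliably, which is why low orders are preferred with -window.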

-bg_pseudo #

Pseudo-frequency for the background model. The value must be a real number between 0 and 1.

If this option is not specified, the pseudo-frequency value depends on the background calculation.

For -bginput and -window, the pseudo frequency is automatically calculated from the length (L) of the sequence following this formula:

sqrt(L)/(L+sqrt(L))

For -bgfile, default value is 0.05.

In other cases, if the length (L) of the training sequence is known (e.g. all promoters for the considered organism), the value can be set manually by using the option -bg_pseudo. In such a case, the background pseudo-frequency might be set, as suggested by Thijs et al., to the following value:

sqrt(L)/(L+sqrt(L))
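
As a worked example of this formula, the following Python sketch computes the value to pass to -bg_pseudo for a training sequence of known length. The function name and the length used are hypothetical.

```python
from math import sqrt


def bg_pseudo_from_length(length):
    """Return sqrt(L) / (L + sqrt(L)) for a training sequence of length L."""
    if length <= 0:
        raise ValueError("sequence length must be positive")
    root = sqrt(length)
    return root / (length + root)


# Hypothetical training set of 10,000 bp: 100 / (10000 + 100)
pseudo = bg_pseudo_from_length(10_000)
print(f"-bg_pseudo {pseudo:.4f}")
```

The value decreases as the training sequence gets longer, and always lies between 0 and 1.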

-filter

Filter out TF-interactions that are not present in the query organism. The option -filter_pval can be used to set the threshold for the detected sites.

-filter_bgfile

Background model file for the scanning of query sequences for filtering.

-filter_pval

Set the threshold to filter out TF-interactions that are not present in the query organism.

-pval

Set the threshold on site p-value. The over-representation of binding sites is evaluated and reported only for the individual sites that pass this threshold. The default is set to 1e-4.

-occ_th

Threshold set on the occurrence significance (over-representation) for scores that have a p-value equal to or smaller than the one given as threshold in the option -pval.

All genes are analyzed completely, but only the genes that pass both thresholds (on pvalue and occ_sig) will be included in the synthesis table and HTML page of the matrix.

Default is set to 5.

-occ_sig_opt

Additional options passed to matrix-scan for the test of over-representation of matrix hits.

Supported threshold fields for the matches: score pval eval sig normw proba_M proba_B rank crer_sites crer_size

Supported threshold fields for score distributions: occ occ_sum inv_cum exp_occ occ_pval occ_eval occ_sig occ_sig_rank

Examples:

To return only the "best" score for each gene:

 -occ_sig_opt '-uth rank 1'

To analyze the distribution only above a weight threshold of 7:

 -occ_sig_opt '-lth score 7'

To analyze the distribution for sites having a P-value threshold of 1e-3:

 -occ_sig_opt '-uth pval 1e-3'

Note: the argument passed to matrix-scan is delimited by single quotes, and can thus not contain any quote.
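
When footprint-scan commands are assembled by a script, the quoting constraint above can be checked before the command is run. The Python sketch below is one possible way to do so; the helper name passthrough is hypothetical, and the command shown is a fragment (the mandatory matrix options are omitted for brevity).

```python
import shlex


def passthrough(option, value):
    """Return [option, value], refusing values that contain quotes."""
    if "'" in value or '"' in value:
        raise ValueError(f"{option} argument must not contain quotes: {value!r}")
    return [option, value]


# Fragment of a command line using a documented example value
args = ["footprint-scan"] + passthrough("-occ_sig_opt", "-uth rank 1")

# shlex.join adds the single quotes needed to keep the value as one argument
command_line = shlex.join(args)
print(command_line)
```

Building the command as a list of arguments and quoting it at the last step avoids nesting quotes by hand.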

-info_lines

Draw reference lines on the significance profile plots, to highlight some particular values.

- horizontal axis (Y=0), in violet

- vertical axis (X=0), in violet

- the weight value associated with maximal significance (only weights >=0 are considered), in red

-occ_sig_graph_opt

Additional options passed to XYgraph for drawing the occurrence significance graph.

Note: the argument passed to XYgraph is delimited by single quotes, and can thus not contain any quote.

-plot_format

Format for the occurrence plots (occurrence frequencies, occurrence significance). Supported: all formats supported by the program XYgraph.

-scan_opt

Additional options passed to matrix-scan for site detection and feature-map drawing.

Examples:

Scan sequences with an upper threshold of 0.001 on pval:

 -scan_opt '-uth pval 0.001'

Note: the argument passed to matrix-scan is delimited by single quotes, and can thus not contain any quote.

Default: sites are filtered with a threshold on p-value of 1e-4.

-map_opt

Additional options passed to feature-map for feature-map drawing.

Examples:

Change the thickness of the maps:

 -map_opt '-mapthick 12'

Write the weight score above each site (this also activates the automatic adjustment of map thickness to ensure there is enough space for drawing the labels):

 -map_opt '-label score -mapthick auto'

Note: the argument passed to feature-map is delimited by single quotes, and can thus not contain any quote.

Default: " -mlen 300 -mspacing 2"

-rand

When the option -rand is activated, the program replaces each ortholog by a gene selected at random in the genome where this ortholog was found.

This option is used (for example by footprint-scan and footprint-discovery) to perform negative controls, i.e. to check the rate of false positives in randomly selected promoters of the reference taxon.

-matrix_table matrix_table_file.tab

A table providing the paths to matrix files (one file per row) + optional columns to specify parameters (factor name, format) for each matrix.

The matrix list is provided as a tab-delimited text file, where each row specifies one matrix.

- The first column indicates the path to a file containing a single matrix (in the format specified with -matrix_format).

- The second column (optional) indicates a common name for the matrix (e.g. transcription factor name) which will be displayed in the synthetic report tables. If the option '-tf file' is used, this column must indicate the name of the transcription factor on which the taxonomic filter will be applied (i.e. the analysis will only be carried out in species of the taxon where an ortholog has been found for the factor).

- The third column (optional) indicates the format of each matrix, in case the search would be done with matrices obtained from different sources (e.g. TRANSFAC, consensus, meme). Note that if the file contains a third column, the option -matrix_format cannot be used.
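
The three-column layout described above can be produced with any tool that writes tab-delimited text. The following Python sketch is one way to do it; the file paths and factor names are hypothetical, and the format names (transfac, tab) are those used with -matrix_format elsewhere in this manual.

```python
import csv
import io


def write_matrix_table(rows, handle):
    """Write one matrix per row: path, factor name, matrix format."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for row in rows:
        writer.writerow(row)


# Hypothetical matrix files, one single-matrix file per row
rows = [
    ("matrices/LexA.tf", "LexA", "transfac"),
    ("matrices/CRP.tab", "CRP", "tab"),
]

buffer = io.StringIO()          # replace with open("matrix_table.tab", "w")
write_matrix_table(rows, buffer)
table_text = buffer.getvalue()
print(table_text, end="")
```

Since this table has a third column, the option -matrix_format should not be given together with it.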

-batch_matrix

Generate one footprint-scan command per matrix and post it on the queue of a PC cluster.

-skip_m #

Skip the first # matrices in the matrix_table (useful for quick testing and for resuming interrupted tasks when using a matrix_table or when several matrices are entered with the option -m).

-last_m #

Stop after having treated the first # matrices in the matrix table (useful for quick testing when using a matrix_table or when several matrices are entered with the option -m).
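
The effect of -skip_m and -last_m on the list of matrices can be pictured as a simple slice. The Python sketch below is an illustration only; the function name is hypothetical, and the behaviour when both options are combined (here, -last_m counts from the start of the original list) is an assumption rather than something stated in this manual.

```python
def select_matrices(matrices, skip_m=0, last_m=None):
    """Drop the first skip_m matrices and stop after the first last_m."""
    return matrices[skip_m:last_m]


matrix_names = ["m1", "m2", "m3", "m4", "m5"]   # hypothetical matrix list

resumed = select_matrices(matrix_names, skip_m=2)              # like -skip_m 2
quick_test = select_matrices(matrix_names, last_m=3)           # like -last_m 3
both = select_matrices(matrix_names, skip_m=2, last_m=4)       # assumed combination
print(resumed, quick_test, both)
```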