NAME
DESCRIPTION
ALGORITHM
AUTHORS
CATEGORY
USAGE
INPUT FORMAT
OUTPUT FORMAT
OPTIONS
SEE ALSO

NAME

infer-operons

DESCRIPTION

Given a list of input genes, infer the operon to which each of these genes belongs.

The inference is based on a very simple distance-based method, inspired by the Salgado-Moreno method (Proc Natl Acad Sci U S A. 2000;97:6652-7). The Salgado-Moreno method classifies intergenic distances as TUB (transcription unit border) or OP (inside operon), and infers operons by iteratively collecting genes until a TUB is found. In the original method, the TUB or OP assignment relies on a log-likelihood score calculated from a training set.

The difference is that we do not use the log-likelihood score (which carries a risk of over-fitting), but a simple threshold on distance. Thus, we infer that the region upstream of a gene is a TUB if its size is larger than a given distance threshold, and OP otherwise. Our validations (Rekin's Janky and Jacques van Helden, unpublished results) show that a simple distance threshold achieves an accuracy similar to that of the log-likelihood score (Acc ~ 78% for a threshold t=55).
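The threshold rule described above can be illustrated with a minimal Python sketch. This is not the program's own code: the function name classify_intergenic is hypothetical, and the default of 55 is taken from the validation figure quoted above, which is not necessarily the default of the -dist option.

```python
def classify_intergenic(distance: int, threshold: int = 55) -> str:
    """Classify the region upstream of a gene from its size.

    Returns "TUB" (transcription unit border) when the distance is
    strictly larger than the threshold, and "OP" (inside operon)
    otherwise. Overlapping genes give a negative distance, which
    falls under "OP".
    """
    return "TUB" if distance > threshold else "OP"


if __name__ == "__main__":
    for d in (-4, 12, 55, 56, 300):
        print(d, classify_intergenic(d))
```

The comparison is strict, so a distance exactly equal to the threshold is still classified as OP, in line with the "dist <= threshold" condition used for tandem genes.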

ALGORITHM

The algorithm is based on three simple rules, depending on the relative orientation of the adjacent genes.

Rule 1: divergently transcribed genes: If the gene found upstream of a query gene is transcribed in the opposite direction, then the intergenic region is considered as a TUB, and the two flanking genes are labelled as operon leaders. This prediction is reliable (provided that the genome annotation is correct), since operons only contain genes on the same strand.
Rule 2: convergently transcribed genes: If the gene found downstream of a query gene is transcribed in the opposite direction, then the intergenic region is considered as a TUB, and the two flanking genes are labelled as operon trailers. This prediction is reliable (provided that the genome annotation is correct), since operons only contain genes on the same strand.
Rule 3: tandem genes (adjacent genes on the same strand): If two adjacent genes are on the same strand, then a distance threshold (option -dist) is applied to decide whether they belong to the same operon (dist <= threshold) or not (dist > threshold). If they are predicted to be in distinct operons, the upstream gene is labelled as operon trailer, and the downstream gene as leader of the next operon.
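The three rules can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used by infer-operons: the Gene class, the function name infer_operons, the "+"/"-" strand encoding, the 1-based inclusive coordinates, and the toy gene set are all hypothetical, and the threshold of 55 is simply the value quoted in the validation above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Gene:
    name: str
    start: int   # leftmost coordinate (assumed 1-based, inclusive)
    end: int     # rightmost coordinate (assumed 1-based, inclusive)
    strand: str  # "+" or "-" (assumed encoding)


def infer_operons(genes: List[Gene], dist_threshold: int = 55) -> List[List[str]]:
    """Group the genes of one chromosome into predicted operons.

    A border (TUB) is placed between two adjacent genes when they lie
    on opposite strands (rules 1 and 2) or when they are in tandem and
    separated by more than dist_threshold (rule 3).
    Each operon is returned from leader to trailer.
    """
    ordered = sorted(genes, key=lambda g: g.start)
    groups: List[List[Gene]] = []
    current: List[Gene] = []
    for gene in ordered:
        if current:
            prev = current[-1]
            gap = gene.start - prev.end - 1  # intergenic distance
            if gene.strand != prev.strand or gap > dist_threshold:
                groups.append(current)       # TUB: close the operon
                current = []
        current.append(gene)
    if current:
        groups.append(current)

    operons = []
    for group in groups:
        names = [g.name for g in group]
        if group[0].strand == "-":
            names.reverse()  # reverse strand: leader has the highest coordinates
        operons.append(names)
    return operons


if __name__ == "__main__":
    toy = [
        Gene("a", 100, 200, "+"),
        Gene("b", 220, 300, "+"),  # gap 19 after a: same operon
        Gene("c", 400, 500, "+"),  # gap 99 after b: new operon
        Gene("d", 510, 600, "-"),  # strand change: border
        Gene("e", 610, 700, "-"),  # gap 9 after d: same operon
    ]
    print(infer_operons(toy))
```

In each returned list the first gene plays the role of the operon leader and the last gene that of the trailer, so a single-gene operon is both at once.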

AUTHORS

Jacques.van-Helden@univ-amu.fr

USAGE

infer-operons [-i inputfile] [-o outputfile] [-v] [options]

Example 1: With the following command, we infer the operon for a set of input genes.
    infer-operons -v 1 -org Escherichia_coli_str._K-12_substr._MG1655_GCF_000005845.2_ASM584v2 -q hisD -q mhpR -q mhpA -q mhpD
Example 2: We now specify different return fields.
    infer-operons -v 1 -org Escherichia_coli_str._K-12_substr._MG1655_GCF_000005845.2_ASM584v2 -q hisD -q lacI \
        -return leader,trailer,up_info,down_info,operon
Example 3: Infer operons for all the genes of an organism.
    infer-operons -v 1 -org Escherichia_coli_str._K-12_substr._MG1655_GCF_000005845.2_ASM584v2 -all -return up_info,leader,operon
Example 4: Infer operons from a set of query genes, and retrieve the upstream sequence of each inferred leader gene. Note that two of the input genes (lacZ, lacY) belong to the same operon. To avoid including their leader twice, we use the unix command sort -u (unique).
    infer-operons -org Escherichia_coli_str._K-12_substr._MG1655_GCF_000005845.2_ASM584v2 -return leader,operon \
        -q lacI -q lacZ -q lacY | sort -u \
        | retrieve-seq -org Escherichia_coli_str._K-12_substr._MG1655_GCF_000005845.2_ASM584v2 -noorf
Example 5: Note that operons can contain non-coding genes. For example, the metT operon contains a series of tRNA genes for methionine, leucine and glutamine, respectively.
    infer-operons -org Escherichia_coli_str._K-12_substr._MG1655_GCF_000005845.2_ASM584v2 -q glnV -q metU -q ileV \
        -return q_info,up_info,operon

INPUT FORMAT

Each row of the input file specifies one query gene. The first word of each row is the query; the rest of the row is ignored.
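As an illustration of this format, the following Python sketch extracts the queries from a set of input rows. The function name read_queries is hypothetical, and skipping empty rows is an assumption: the text above does not say how the program treats blank or comment lines.

```python
from typing import Iterable, List


def read_queries(rows: Iterable[str]) -> List[str]:
    """Return the first whitespace-delimited word of each non-empty row."""
    queries = []
    for row in rows:
        words = row.split()
        if words:  # assumption: empty rows are skipped
            queries.append(words[0])
    return queries


if __name__ == "__main__":
    rows = ["hisD\thistidinol dehydrogenase", "", "mhpR free text that is ignored"]
    print(read_queries(rows))
```

The same function accepts an open file handle or sys.stdin, since both iterate over rows.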

OUTPUT FORMAT

OPTIONS

-v #: Level of verbosity (detail in the warning messages during execution)
-h: Display full help message
-help: Same as -h
-i inputfile: If no input file is specified, the standard input is used. This allows the command to be used within a pipe.
-org organism: Organism name.
-all: Infer operons for all the genes of the query organism.
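-q query_gene: Query gene. This option can be used repeatedly on the same command line to specify several query genes. Example: infer-operons -org Escherichia_coli_K12 -q LACZ -q hisA
-o outputfile: If no output file is specified, the standard output is used. This allows the command to be used within a pipe.
-dist #: Distance threshold applied to the intergenic region between tandem genes (see ALGORITHM, rule 3): the two genes are assigned to the same operon if the distance is lower than or equal to the threshold.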
-sep: Specify the separator for multi-value fields (e.g.: genes) in the output table. By default, multi-value fields are exported in a single column with a semicolon (";") as separator.
-min_gene_nb #: Specify a threshold on the number of genes in the operon. This option is generally used when predicting all operons (option -all), in order to return only predicted polycistronic transcription units (-min_gene_nb 2) or to restrict the output to operons predicted to contain at least a given number of genes (e.g. -min_gene_nb 4).
-return return_fields: List of fields to return. Supported fields: leader,trailer,operon,query,q_info,up_info,down_info
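For downstream processing, a multi-value field can be split on the separator, and the size filter of -min_gene_nb can be reproduced on the result. The Python sketch below is only an illustration: the function names are hypothetical, and the sample value "lacZ;lacY;lacA" is an assumed example of an operon field, since the exact layout of the output table is not described in this section.

```python
from typing import List


def split_members(field: str, sep: str = ";") -> List[str]:
    """Split a multi-value field into its gene names (default separator ";")."""
    return [name for name in field.split(sep) if name]


def passes_min_gene_nb(members: List[str], min_gene_nb: int) -> bool:
    """True when the operon contains at least min_gene_nb genes."""
    return len(members) >= min_gene_nb


if __name__ == "__main__":
    members = split_members("lacZ;lacY;lacA")  # assumed field value
    print(members, passes_min_gene_nb(members, 2))
```

If a different separator is chosen with -sep, the same value has to be passed as the sep argument.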