NAME

infer-operons


DESCRIPTION

Given a list of input genes, infer the operon to which each of these genes belongs.

The inference is based on a very simplistic distance-based method, inspired by the Salgado-Moreno method (Proc Natl Acad Sci U S A. 2000;97:6652-7). The Salgado-Moreno method classifies intergenic distances as TUB (transcription unit border) or OP (inside operon), and infers operons by iteratively collecting genes until a TUB is found. In the original method, the TUB or OP assignment relies on a log-likelihood score calculated from a training set.

The difference is that we do not use the log-likelihood (which presents risks of over-fitting), but a simple threshold on distance. Thus, we infer that the region upstream of a gene is a TUB if its size is larger than a given distance threshold, and OP otherwise. Our validations (Rekin's Janky and Jacques van Helden, unpublished results) show that a simple threshold on distance achieves an accuracy similar to that of the log-likelihood score (Acc ~ 78% for a threshold t=55).
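
The threshold rule described above can be illustrated with a minimal Python sketch. This is not the program's own code: the function name classify_intergenic is hypothetical, and the default value of 55 is simply the threshold quoted in the validation above, not necessarily the default of the -dist option.

```python
def classify_intergenic(distance, threshold=55):
    """Classify an intergenic distance (in bp) as 'TUB' or 'OP'.

    Assumption: 'threshold' defaults to the value t=55 quoted in the
    validation; the actual default of the tool may differ.
    """
    # A region strictly larger than the threshold is a transcription
    # unit border (TUB); otherwise it lies inside an operon (OP).
    return "TUB" if distance > threshold else "OP"


for d in (-4, 20, 55, 56, 300):
    print(d, classify_intergenic(d))
```

Negative distances (overlapping genes) fall below any positive threshold and are therefore classified as OP in this sketch.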


ALGORITHM

The algorithm is based on three simple rules, depending on the relative orientation of the adjacent genes.

Rule 1: divergently transcribed genes

If the gene found upstream of a query gene is transcribed in the opposite direction, then the intergenic region is considered as a TUB, and the two flanking genes are labelled as operon leaders. This prediction is reliable (provided that the genome annotation is correct), since operons only contain genes on the same strand.

Rule 2: convergently transcribed genes

If the gene found downstream of a query gene is transcribed in the opposite direction, then the intergenic region is considered as a TUB, and the two flanking genes are labelled as operon trailers. This prediction is reliable (provided that the genome annotation is correct), since operons only contain genes on the same strand.

Rule 3: tandem genes (adjacent genes on the same strand)

If two adjacent genes are on the same strand, then a distance threshold (option -dist) is applied to decide whether they belong to the same operon (dist <= threshold) or not (dist > threshold). If they are predicted to be in distinct operons, the upstream gene is labelled as operon trailer, and the downstream gene as leader of the next operon.
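
The three rules above can be combined into a short Python sketch. It is an illustration under stated assumptions, not the program's implementation: the Gene record, the function names and the toy coordinates are hypothetical, the chromosome is treated as linear (no wrap-around for circular genomes), and the query is assumed to be present in the gene list.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    name: str
    start: int   # leftmost coordinate
    end: int     # rightmost coordinate
    strand: str  # "+" or "-"


def same_operon(left: Gene, right: Gene, threshold: int) -> bool:
    """Decide whether two genes adjacent on the chromosome share an operon."""
    if left.strand != right.strand:
        return False  # rules 1 and 2: opposite strands imply a TUB
    distance = right.start - left.end - 1
    return distance <= threshold  # rule 3: tandem genes


def infer_operon(genes: List[Gene], query: str, threshold: int = 55) -> List[str]:
    """Return the operon of 'query', ordered from leader to trailer."""
    genes = sorted(genes, key=lambda g: g.start)
    i = next(k for k, g in enumerate(genes) if g.name == query)
    lo = hi = i
    # Extend to the left and to the right until a TUB is found.
    while lo > 0 and same_operon(genes[lo - 1], genes[lo], threshold):
        lo -= 1
    while hi < len(genes) - 1 and same_operon(genes[hi], genes[hi + 1], threshold):
        hi += 1
    members = genes[lo:hi + 1]
    if genes[i].strand == "-":
        members = members[::-1]  # on the reverse strand the leader is rightmost
    return [g.name for g in members]


# Hypothetical toy annotation (not real coordinates).
toy = [
    Gene("a", 100, 400, "+"),
    Gene("b", 420, 900, "+"),
    Gene("c", 1100, 1500, "+"),
    Gene("d", 1520, 2000, "-"),
    Gene("e", 2010, 2500, "-"),
]
for q in ("a", "c", "d"):
    print(q, infer_operon(toy, q))
```

In this toy example, genes a and b are tandem and 19 bp apart, so they form one operon; c is isolated by a 199 bp gap on one side and a convergent neighbour on the other; d and e form a reverse-strand operon whose leader is e.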


AUTHORS

Jacques.van-Helden@univ-amu.fr


CATEGORY

genomics


USAGE


infer-operons [-i inputfile] [-o outputfile] [-v] [options]


EXAMPLES

Example 1

With the following command, we infer the operon for a set of input genes.

infer-operons -v 1 -org Escherichia_coli_GCF_000005845.2_ASM584v2 -q hisD -q mhpR -q mhpA -q mhpD

Example 2

We now specify different return fields.

infer-operons -v 1 -org Escherichia_coli_GCF_000005845.2_ASM584v2 -q hisD -q lacI \
  -return leader,trailer,up_info,down_info,operon

Example 3

Infer operons for all the genes of an organism.

infer-operons -v 1 -org Escherichia_coli_GCF_000005845.2_ASM584v2 -all -return up_info,leader,operon

Example 4

Infer operons from a set of query genes, and retrieve the upstream sequence of the inferred leader genes. Note that two of the input genes (lacZ, lacY) belong to the same operon. To avoid including their leader twice, we use the unix command sort -u (unique).

infer-operons -org Escherichia_coli_GCF_000005845.2_ASM584v2 -return leader,operon \
  -q lacI -q lacZ -q lacY | sort -u \
  | retrieve-seq -org Escherichia_coli_GCF_000005845.2_ASM584v2 -noorf

Example 5

Note that operons can contain non-coding genes. For example, the metT operon contains a series of tRNA genes for methionine, leucine and glutamine.

infer-operons -org Escherichia_coli_GCF_000005845.2_ASM584v2 -q glnV -q metU -q ileV \
  -return q_info,up_info,operon


INPUT FORMAT

Each row of the input file specifies one query gene. The first word of each row is the query; the rest of the row is ignored.
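
As an illustration of this format, the following Python sketch extracts the query from each row. The function name read_queries is hypothetical, and skipping empty rows is an assumption of this sketch rather than a documented behaviour of the program.

```python
def read_queries(lines):
    """Return the first word of each row; the rest of the row is ignored."""
    queries = []
    for line in lines:
        words = line.split()
        if words:  # assumption: rows without any word are skipped
            queries.append(words[0])
    return queries


rows = ["hisD  histidinol dehydrogenase\n", "\n", "lacZ\tsome other column\n"]
print(read_queries(rows))
```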


OUTPUT FORMAT


OPTIONS

-v #

Level of verbosity (detail in the warning messages during execution)

-h

Display full help message

-help

Same as -h

-i inputfile

If no input file is specified, the standard input is used. This allows the command to be used within a pipe.

-org organism

Organism name.

-all

Infer operons for all the genes of the query organism.

-q query_gene

Query gene. This option can be used iteratively on the same command line to specify several query genes. Example:

infer-operons -org Escherichia_coli_K12 -q LACZ -q hisA

-o outputfile

If no output file is specified, the standard output is used. This allows the command to be used within a pipe.

-dist #

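Distance threshold (in base pairs) applied to the intergenic region between two adjacent genes on the same strand (see Rule 3 in the ALGORITHM section). Tandem genes separated by a distance larger than this threshold are assigned to distinct operons.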

-sep

Specify the separator for multi-value fields (e.g.: genes) in the output table. By default, multi-value fields are exported in a single column with a semicolon (";") as separator.

-min_gene_nb #

Specify a threshold on the number of genes in the operon. This option is generally used when predicting all operons (option -all), in order to only return predicted polycistronic transcription units (-min_gene_nb 2) or to restrict the output to operons predicted to contain at least a given number of genes (e.g. -min_gene_nb 4).

-return return_fields

List of fields to return.

Supported fields: leader,trailer,operon,query,q_info,up_info,down_info,gene_nb

leader

Predicted operon leader.

trailer

Predicted operon trailer.

operon

Full composition of the operon. The names of member genes are separated by a semicolon ";" (note that the gene separator can be changed using the option -sep).

q_info

Detailed info on the query gene(s).

up_info

Detailed info on the upstream gene.

down_info

Detailed info on the downstream gene.

gene_nb

Number of genes in the predicted operon.
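
The combined effect of the options -min_gene_nb and -sep on the operon field can be sketched in Python as follows. This is only an illustration: the function name filter_operons and its in-memory list-of-lists input are hypothetical and do not reflect how the program stores its predictions.

```python
def filter_operons(operons, min_gene_nb=2, sep=";"):
    """Keep operons with at least 'min_gene_nb' genes and join their members."""
    return [sep.join(members) for members in operons if len(members) >= min_gene_nb]


# Hypothetical predictions, each given as a list of member genes.
predicted = [["lacZ", "lacY", "lacA"], ["lacI"]]
print(filter_operons(predicted))               # polycistronic units only
print(filter_operons(predicted, 1, sep=","))   # all units, comma-separated
```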


SEE ALSO

retrieve-seq
neighbour-genes
add-gene-info