RSA-tools - Tutorials - random-genes
Prerequisite
This tutorial does not require any prior knowledge, but to get the most out of it, you should first follow the tutorials on sequence retrieval and oligo-analysis.
Introduction
In the context of transcriptome analysis, we are confronted with huge amounts of data, which can be processed in different ways to obtain clusters of co-expressed genes. A routine approach is to perform clustering on gene expression profiles, and then to apply a motif discovery program to each cluster of genes, in order to predict their specific regulatory elements.
It is well known that microarray data are noisy and, in addition, there are dozens of ways to post-process these data (choice of a normalization method, filtering, clustering algorithm, distance metric, other clustering options, ...). It is thus expected that many of our supposedly "co-expressed" gene clusters will be artifactual, resulting from the biases of the data treatments, and will not actually correspond to sets of co-regulated genes (i.e. direct targets of the same transcription factor).
Selecting a random set of genes is a simple but effective way to check the rate of false positives of pattern discovery programs.
Example of utilization
We will use the tool random-genes to draw a random selection of 20 genes from the yeast genome, and submit their promoters to oligo-analysis.
- In the left frame, click on the link random gene selection.
- Select Saccharomyces cerevisiae as organism.
- Set the number of genes to 20.
- Leave the number of groups at 1 (this option is used for generating multiple random gene clusters).
- Select CDS as feature type.
- Click GO.
The result is a list of gene identifiers.
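To make the selection step concrete, here is a minimal Python sketch of what drawing random gene groups amounts to. It is not the random-genes implementation: the function name `pick_random_genes`, the placeholder identifiers, and the choice to sample without replacement within a group (and independently between groups) are assumptions made for illustration.

```python
import random


def pick_random_genes(gene_ids, n_genes, n_groups=1, seed=None):
    """Draw n_groups random selections of n_genes identifiers each.

    Hypothetical helper: a gene cannot appear twice in the same group,
    but groups are drawn independently of each other (assumption).
    """
    rng = random.Random(seed)
    return [rng.sample(gene_ids, n_genes) for _ in range(n_groups)]


# Placeholder identifiers standing in for the annotated CDS of a genome.
all_genes = [f"GENE{i:04d}" for i in range(1, 6001)]

groups = pick_random_genes(all_genes, n_genes=20, n_groups=1, seed=42)
for gene_id in groups[0]:
    print(gene_id)
```

Fixing the seed makes a given selection reproducible; for the repeated controls discussed later, each trial should use a different seed (or none).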
You can now send this list to the program retrieve-seq to retrieve the upstream sequences of these genes, and then pass the sequences to different motif discovery programs (oligo-analysis, consensus, gibbs, ...) in order to test the response of these programs when the genes are not supposed to be co-regulated.
Interpreting the results
In principle, we expect the program to return, most of the time, a negative answer, i.e. not a single significant motif. In some cases, a motif with slight significance should emerge by chance, and highly significant motifs should appear only exceptionally.
More precisely, if we repeat the experiment a large number of times, we expect to find approximately
- one motif with E-value <= 1 (i.e. sig >= 0) per gene selection;
- a motif with E-value <= 0.1 (sig >= 1) every 10 gene selections;
- a motif with E-value <= 0.01 (sig >= 2) every 100 gene selections;
- ... and so on.
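The arithmetic behind this list can be sketched in a few lines of Python, assuming that sig = -log10(E-value) and that the E-value is read as the expected number of false-positive motifs per analysis. The function names are illustrative and not part of RSAT.

```python
import math


def evalue_to_sig(e_value):
    """Convert an E-value to a significance score, sig = -log10(E-value)."""
    # Adding 0.0 turns the -0.0 obtained for E-value = 1 into a plain 0.0.
    return -math.log10(e_value) + 0.0


def expected_hits(e_value, n_selections):
    """Expected number of chance motifs at this E-value or better
    over n_selections random gene selections."""
    return e_value * n_selections


def selections_per_hit(e_value):
    """Average number of random selections needed to see one such motif."""
    return 1.0 / e_value


for e_value in (1, 0.1, 0.01, 0.001):
    print(
        f"E-value <= {e_value}: sig >= {evalue_to_sig(e_value):.0f}, "
        f"about one motif every {selections_per_hit(e_value):.0f} selection(s), "
        f"{expected_hits(e_value, 100):.1f} expected in 100 selections"
    )
```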
The analysis of random gene selections is thus a very practical way to estimate the rate of false positives of motif discovery programs, on the basis of real biological sequences.
Additional exercises
- In order to check the rate of false positives of a program, you need to repeat the random control several times, and see how often a motif is selected by chance when there is supposedly no co-regulation.
  - Repeat the test a few times.
  - At each trial, take note of the top-scoring motif and of the score associated with it.
  - Analyze the results of the successive trials, and check whether the behaviour more or less corresponds to the theoretical expectation.
- Perform the same test as above with human genes. How does it compare with yeast in terms of false positives?
- You can also use the upstream sequences of random gene selections to test other motif discovery programs available on the web (see the link page).
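For the first exercise, the bookkeeping can be sketched as follows, assuming you have noted the sig score of the top-scoring motif of each trial. The function `compare_with_expectation` is hypothetical, and the scores in the example are made-up placeholders, not results of an actual analysis.

```python
def compare_with_expectation(top_sigs, thresholds=(0, 1, 2)):
    """For each sig threshold, count the trials whose top motif reaches it.

    Returns (threshold, observed_trials, expected_motifs) tuples, where
    expected_motifs = n_trials * 10**(-threshold), assuming that
    sig = -log10(E-value).
    """
    n_trials = len(top_sigs)
    rows = []
    for threshold in thresholds:
        observed = sum(1 for sig in top_sigs if sig >= threshold)
        expected = n_trials * 10.0 ** (-threshold)
        rows.append((threshold, observed, expected))
    return rows


# Made-up placeholder scores: replace them with the values you noted.
top_sigs = [0.4, -0.8, 1.3, 0.1, -0.2, 0.7, -1.1, 0.0, 2.1, -0.5]

for threshold, observed, expected in compare_with_expectation(top_sigs):
    print(f"sig >= {threshold}: {observed} trial(s) observed, "
          f"{expected:.2f} motif(s) expected")
```

Since only the top-scoring motif of each trial is recorded, the observed column counts trials rather than motifs. It can therefore fall below the expected number of motifs, especially at sig >= 0, where a single trial may return several motifs.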
You can now come back to the tutorial main page and follow the next tutorials.