JXB Advance Access originally published online on December 13, 2004
Journal of Experimental Botany 2005 56(410):245-254; doi:10.1093/jxb/eri043
Journal of Experimental Botany, Vol. 56, No. 410, © Society for Experimental Biology 2004; all rights reserved

RESEARCH PAPER

Making sense of the metabolome using evolutionary computation: seeing the wood with the trees

Royston Goodacre*

School of Chemistry, The University of Manchester, PO Box 88, Sackville Street, Manchester M60 1QD, UK

* E-mail: Roy.Goodacre{at}manchester.ac.uk

Received 14 May 2004; Accepted 1 January 2004


    Abstract
 
One should perhaps start off by asking the question, ‘But what wood is it we want to see?’ There are so many trees that make up the wood; within a post-genomics context, genes, transcripts, proteins, and metabolites are the more tangible ones. Rather than studying these components in isolation, a more holistic approach is to unravel the interactions between the myriad of subcellular components and this is vital to systems biology. Moreover, this will help define the phenotype of the organism under investigation. Metabolomics is complementary to transcriptomics and proteomics, and despite the immense metabolite diversity observed in plants, metabolomics has been embraced by the plant community and in particular for studying metabolic networks. Whilst post-genomic science is producing vast data torrents, it is well known that data do not equal knowledge and so the extraction of the most meaningful parts of these data is key to the generation of useful new knowledge. A metabolomics experiment is guaranteed to generate thousands of data points (e.g. samples multiplied by the levels of particular metabolites) of which only a handful might be needed to describe the problem adequately. Evolutionary computational-based methods such as genetic algorithms and genetic programming are ideal strategies for mining such high-dimensional data to generate useful relationships, rules, and predictions. This article describes these techniques and highlights their usefulness within metabolomics.

Key words: Evolutionary computation, metabolomics, subcellular components, systems biology


    Introduction
 
Whole genome sequence projects continue to remind us that our knowledge is really rather scant when it comes to putting some function to orphan genes. The recent completion of the human genome sequence (The International Human Genome Mapping Consortium, 2001; Venter et al., 2001) and the plant genome sequences of Arabidopsis thaliana (Arabidopsis Genome Initiative, 2000) and Oryza sativa (rice) (Goff et al., 2002; Yu et al., 2002), amongst others (http://wit.integratedgenomics.com/GOLD/), has accelerated demand for determining the biochemical function of orphan genes and for validating them as molecular targets for therapeutic intervention. The search for biomarkers (real or surrogate) that can serve as indicators of disease progression or response to therapeutic intervention has also increased (Harrigan and Goodacre, 2003). Functional analyses (Fig. 1) have thus emphasized analyses at the level of gene expression (transcriptomics), protein translation (proteomics; including post-translational modifications), and the metabolite network (metabolomics), with a view, within a systems biology approach, to defining the phenotype and bridging the genotype-to-phenotype gap (Fiehn, 2002).



Fig. 1. The phenotype-to-genotype gap, here depicted as the Clifton Suspension Bridge, Bristol, UK finally bridged by Isambard Kingdom Brunel on 8 December 1864. The need for discovering the biochemical function of orphan genes is essential for validating them as molecular targets for therapeutic intervention, or using their products as biomarkers that can serve as indicators of disease progression or response. In post-genome science this will be realized by an integrative analysis of the genome, transcriptome, proteome, and metabolome.

 
The ‘metabolome’ is the quantitative complement of all the low-molecular-weight molecules present in cells in a particular physiological or developmental state (Oliver et al., 1998). Whilst metabolomics is complementary to transcriptomics and proteomics, it does have special advantages, particularly with respect to the paradigm shift from thinking of linear metabolic pathways to more global integrated metabolic networks and neighbourhoods (Barabási and Oltvai, 2004; Kell, 2004), since it is obviously better to measure these networks directly. This is no easy task, and despite the immense metabolite diversity observed in plants, with up to 200 000 different metabolites in the plant kingdom (Fiehn, 2002), it is highly encouraging to see that metabolomics has been readily embraced by plant biologists.

Generally there are two sorts of data that are generated by a metabolomics experiment. The first is a list of metabolites with their levels, usually obtained via some hyphenated technique involving chromatography, and the second type of data are fingerprint-like traces from an FT-IR or NMR analysis. Irrespective of the type of data generated, these metabolomic strategies generate large amounts of data, and it is obvious (Fiehn, 2001; Mendes, 2002) that current informatic approaches need to adapt and grow in order to make the most of these data. In particular, good robust databases (Hardy and Fuell, 2003; Mendes, 2002), very good data, excellent visualization methods (Li et al., 2003), and even better algorithms (Goodacre and Kell, 2003) are needed with which to turn these data into knowledge.

Consider a hypothetical experiment where data on a modest 250 metabolites have been collected using GC-MS from two plant populations: control plants, and test plants exposed to drought. So that any natural variation in plant growth stages is accounted for, 100 plants of each are measured. Once the data have been collected it is necessary to be able to discriminate reproducibly between the two different populations (or classes), and this has classically been achieved by using supervised learning algorithms such as discriminant analysis, partial least squares or artificial neural networks (Fig. 2). Each of those metabolites could be used in the construction of a discriminatory model to differentiate between the plants. However, in addition to important metabolites many will be non-relevant, and it is known from the statistical literature that better (i.e. more robust) predictions can often be obtained when only the most relevant input variables are considered (Seasholtz and Kowalski, 1993), i.e. that ‘parsimonious’ models tend to generalize better.



Fig. 2. Supervised machine learning. When the desired responses (target output(s)) associated with each of the inputs (metabolome data) are known then the system may be supervised. The goal is to find a mathematical transformation (model) that will correctly associate all or some of the inputs with the targets. In the explanatory sense this model will also allow an assessment of which inputs are correlated in some way with the targets.

 
An exhaustive search of whether a metabolite is used or not in the model would be computationally prohibitive, since the number of possible subsets (to use or not use each of 250 metabolites, i.e. two choices 250 times) is 2²⁵⁰ or 1.8×10⁷⁵. This number is so big that if a computer could check 10 million subsets every second it would still take more than 5×10⁶⁰ years to check them all! And the lifetime of the Universe is after all only ~10¹⁷ seconds (Barrow and Silk, 1995). Thus, even though the way to solve this problem is known, it cannot be done. These problems are NP-complete (Garey and Johnson, 1979); that is to say, no algorithm is known that solves them in a time that is a polynomial of the number of variables, and as the number of variables in the search space goes up linearly, the number of possible solutions goes up exponentially. Therefore, to find the global optimum by exhaustive searching requires time which is exponential in the problem size, and this is computationally impossible. Thus route A in Fig. 3 is unfeasible and an alternative strategy needs to be found. The premise is that a ‘good’ solution is acceptable and so an alternative method is needed to search the huge spaces of possible solutions. Importantly, however, if the search space is large but the solution space is small, i.e. the problem can be solved with just a small number of variables, the effective search space becomes much narrower. Thus the number of combinations of five variables from 250 metabolites is 250!/[(250–5)!5!] or just 7.8×10⁹. Evolutionary computational-based methods offer such an approach. These are classified as heuristic algorithms which are considered to work ‘reasonably well’; that is to say a ‘good’ solution is acceptable, but for many cases there is no proof that they are faster or that they find the best solution.
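The size of these search spaces can be checked with a few lines of arithmetic. The following is a minimal sketch in Python; the constant names are illustrative, and the search rate of 10 million checks per second is simply the assumption made in the text above.

```python
import math

N_METABOLITES = 250                      # size of the hypothetical GC-MS data set
CHECKS_PER_SECOND = 10**7                # assumed search rate (10 million per second)
SECONDS_PER_YEAR = 365.25 * 24 * 3600    # Julian year, an assumption for the conversion

# Each metabolite is either used or not used, so there are 2**250 candidate subsets.
all_subsets = 2 ** N_METABOLITES

# Time needed to test every subset at the assumed rate, expressed in years.
years_exhaustive = all_subsets // CHECKS_PER_SECOND / SECONDS_PER_YEAR

# If only five metabolites are needed, the order of selection is irrelevant,
# so the count is the binomial coefficient 250! / ((250 - 5)! 5!).
five_subsets = math.comb(N_METABOLITES, 5)

# For comparison: the number of ordered selections, 250! / (250 - 5)!.
five_ordered = math.perm(N_METABOLITES, 5)

print(f"all subsets:        {all_subsets:.2e}")
print(f"years to enumerate: {years_exhaustive:.2e}")
print(f"5-metabolite sets:  {five_subsets:.2e}")
print(f"ordered 5-tuples:   {five_ordered:.2e}")
```

The distinction between unordered and ordered selections matters here, because a variable-selection model is the same whichever order its five metabolites were picked in.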



Fig. 3. (A) The complex problem which it is hoped to solve but cannot, and (B) the strategy used by evolutionary computing methods.

 

    Evolutionary computation
 
It is commonly accepted that life originated during the early Archaean period; biogenic signatures discovered in metasediments from Akilia Island, West Greenland, have been dated to ~3.85 billion years ago (Mojzsis et al., 1996). This proto-organism (or precellular stage), positioned at the root of the tree of life, gave rise to the immense species diversity that is present on the planet today. This rich diversity has been created from DNA mutation, DNA crossover (including horizontal gene transfer), and the survival of the fittest, which has led to a plethora of different organisms occupying different ecological niches. What nature has achieved in 4×10⁹ years can be mimicked in silico and this forms the basis of evolutionary computation.

In evolutionary computation a population of individuals, each representing the parameters of the problem to be optimized as a string of numbers or binary digits, undergoes a process analogous to evolution in order to derive an optimal or near-optimal (a good) solution (Fig. 3B). The parameters stored by each individual are used to assign it a fitness, a single numerical value indicating how well the solution using that set of parameters performs. New individuals are generated from members of the current population by processes analogous to asexual and sexual reproduction (Fig. 4); these populations can then be ‘pruned’ based on the concepts of Darwinian selection (Bäck et al., 1997; Darwin, 1859). This process might enrich the population for false positives or negatives, and so to avoid this an independent test set is used to validate the model during, as well as after, evolution.



Fig. 4. The overall procedure employed by evolutionary computing methods. The criterion for a good solution is typically based on setting a threshold error between the known target and the model's response.

 
Evolutionary algorithms are particularly popular inductive reasoning and optimization methods (Corne et al., 1999; Michalewicz and Fogel, 2000) and include genetic algorithms (GAs; Goldberg, 1989; Holland, 1992; Mitchell, 1995), evolution strategies (Beyer, 2001; Schwefel, 1995), evolutionary programming (Fogel, 2000; Fogel and Fogel, 1996), genetic programming (GP; Banzhaf et al., 1998; Koza, 1992, 1994; Koza et al., 1999), and genomic computing (GC; Kell, 2002; Kell et al., 2001). These methods are attractive because the evolved models can be expressed in plain English and, by penalizing complex expressions, may be made comparatively simple. Evolutionary computational-based algorithms are thus explanatory supervised learning techniques (Fig. 2) (Beavis et al., 2000; Goodacre and Kell, 2003; Kell and King, 2000; Mitchell, 1997) where answers to questions of biological interest are sought, such as ‘What metabolites have I measured in my metabolome that make plants exposed to drought different from the same isogenic plants that have been adequately watered?’

The overall evolutionary procedure employed by GAs and GP is depicted in Fig. 4. However, whilst there are many analogies between how GAs and GP operate, when it comes to using these methods for variable (or metabolite) selection, each method should be considered separately. What follows is a brief description of the salient features of GAs and GP.

Genetic algorithms (GAs)
In a GA each individual in the population, representing the parameters of the problem to be optimized, is encoded as a string of numbers or binary digits (Fig. 5A). In the GA representation each individual in the population contains a string of ‘1’s and ‘0’s, the length of which is the total number of metabolites to choose from. Each input variable represented by a ‘1’ is selected to be used in the model, whilst each ‘0’ is not selected (Broadhurst et al., 1997; Horchner and Kalivas, 1995). Other GA variants based on the selection of spectral windows for FT-IR and Raman spectroscopy are also popular (Leardi et al., 2002; Roger and Bellon-Maurel, 2000; Taylor et al., 1998; Williams and Paradkar, 1997). In genetic terms each variable is called a gene and a set of variables is called a chromosome. For example, as depicted in Fig. 5A representing the selection or otherwise of seven metabolites, one possible chromosome would be 1101010, which can be translated as a variable selection filter such that variables 1, 2, 4, and 6 are to be used in the modelling process and variables 3, 5, and 7 are to be omitted. The model used would employ a supervised learning algorithm such as linear discriminant analysis (LDA), multiple linear regression (MLR), partial least squares (PLS), or an artificial neural network (ANN).
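The chromosome-as-filter idea can be illustrated with a short Python sketch. The function names and the metabolite levels below are invented for illustration only; the chromosome 1101010 is the one used in the example above.

```python
def selected_variables(chromosome):
    """Return the 1-based indices of the variables switched on by a chromosome."""
    return [i for i, gene in enumerate(chromosome, start=1) if gene == "1"]


def apply_mask(chromosome, sample):
    """Keep only the measurements whose gene is '1'; these feed the supervised model."""
    return [value for gene, value in zip(chromosome, sample) if gene == "1"]


chromosome = "1101010"
# Made-up levels of seven metabolites for a single plant.
sample = [0.8, 1.9, 0.1, 4.2, 0.0, 2.7, 3.3]

print(selected_variables(chromosome))   # variables passed on to LDA, MLR, PLS or an ANN
print(apply_mask(chromosome, sample))
```

The reduced sample would then be handed to whichever supervised learning method is being used, and the quality of the resulting predictions becomes the fitness of that chromosome.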



Fig. 5. The ‘language’ structure showing how a model is encoded in (A) GA and (B) GP. In GA a string of binary numbers is used to select variables prior to them being used by a supervised learning method, whilst in GP a parse tree is used which encodes the full model.

 
The output from the model would be used to assign a fitness to each of the individuals in the population. For example, to differentiate a control plant from one exposed to drought, the fitness would be a measure of how well the model predicted the two groups. New individuals are generated from members of the current population by processes analogous to asexual and sexual reproduction (Fig. 6).



Fig. 6. The GA reproduction processes. Three strategies are used including (A) cloning, (B) mutation, and (C) crossover events.

 
Asexual reproduction would simply be cloning the chromosome into the next population, whilst mutation is performed by randomly selecting a parent with a probability related to its fitness, then randomly changing one or more of the parameters it encodes. The new individual then replaces a less-fit member of the population, if one exists. Sexual reproduction, or crossover, is achieved by selecting two parents with a frequency related to their fitnesses, and generating two new individuals by copying parameters from one parent, and switching to the other parent after a randomly-selected point; alternatively, as depicted in Fig. 6C a double crossover could be implemented. The two new individuals then replace less-fit members of the population as before. The above procedure is repeated, with the overall fitness of the population improving at each generation, until an acceptably-fit individual is produced.
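The reproduction operators described above can be sketched in Python as follows. This is an illustrative sketch rather than the implementation used in any of the cited studies: the function names, the per-gene mutation rate, and the use of fitness-proportional (roulette wheel) selection are assumptions.

```python
import random


def select_parent(population, fitnesses, rng):
    """Pick one parent with a probability proportional to its fitness."""
    return rng.choices(population, weights=fitnesses, k=1)[0]


def mutate(chromosome, rate, rng):
    """Flip each gene independently with probability `rate`."""
    return "".join(
        ("1" if gene == "0" else "0") if rng.random() < rate else gene
        for gene in chromosome
    )


def crossover(parent_a, parent_b, points):
    """Copy from one parent and switch to the other at every cut point.

    One cut point gives a single crossover, two give a double crossover.
    """
    child_1, child_2 = [], []
    source_1, source_2 = parent_a, parent_b
    previous = 0
    for point in list(points) + [len(parent_a)]:
        child_1.append(source_1[previous:point])
        child_2.append(source_2[previous:point])
        source_1, source_2 = source_2, source_1   # switch parents after the cut
        previous = point
    return "".join(child_1), "".join(child_2)


rng = random.Random(0)                        # seeded only to make the sketch repeatable
population = ["1101010", "0011100", "1000001"]
fitnesses = [0.9, 0.5, 0.1]                   # made-up fitness values

mother = select_parent(population, fitnesses, rng)
father = select_parent(population, fitnesses, rng)
cut = rng.randrange(1, len(mother))           # randomly selected crossover point
print(crossover(mother, father, [cut]))
print(mutate(mother, 0.1, rng))
```

Cloning needs no function of its own, since it simply copies a chromosome unchanged into the next population.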

However, whilst GAs are very successful search algorithms for tackling NP-hard problems, the disadvantage is that, with the GA variable selection approach, the relationship between one variable and another is not evident, only whether they contribute to a model or not. Therefore, a richer language is needed.

Genetic programming (GP)
A GP is an application of the GA approach to derive mathematical equations, logical rules or program functions automatically (Banzhaf et al., 1998; Koza, 1992, 1994; Koza et al., 1999; Langdon, 1998). Rather than representing the solution to the problem as a string of parameters, as in a conventional GA, a GP usually (cf. Banzhaf et al., 1998) uses a tree structure. The leaves of the tree, or terminals, represent input variables or numerical constants. Their values are passed to nodes, at the junctions of branches in the tree, which perform some numerical or program operation before passing on the result further towards the root of the tree (Fig. 5B).
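A parse tree of this kind can be represented and evaluated very compactly. The sketch below, in Python, uses nested tuples of the form (operator, child, child, ...); this representation, the small function set, and the metabolite names m1, m4, and m6 are assumptions made purely for illustration.

```python
import math
import operator

# A deliberately small function set; real GP systems offer many more operators.
FUNCTIONS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "tanh": math.tanh,
}


def evaluate(node, inputs):
    """Evaluate a parse tree from the leaves (terminals) towards the root."""
    if isinstance(node, tuple):            # internal node: (operator, child, ...)
        op, *children = node
        args = [evaluate(child, inputs) for child in children]
        return FUNCTIONS[op](*args)
    if isinstance(node, str):              # terminal: an input variable
        return inputs[node]
    return float(node)                     # terminal: a numerical constant


# Encodes the expression (m1 * 2) + (m4 - m6).
tree = ("+", ("*", "m1", 2.0), ("-", "m4", "m6"))
inputs = {"m1": 1.5, "m4": 4.0, "m6": 1.0}   # made-up metabolite levels

print(evaluate(tree, inputs))
```

Because the whole model is held in the tree, reading off which terminals appear in it shows directly which inputs the model uses and how they are combined.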

As with GAs an initial (commonly random) population of individuals, each encoding a function or expression, is generated and their fitness to produce the desired output is assessed. In the second population three reproduction strategies are adopted (see Fig. 7 for pictorial details). (1) Cloning: some of the original individuals are allowed to survive unmodified. (2) Mutation: new individuals are generated by introducing one or more random changes to a single parent individual. A node may be randomly chosen and modified either by giving it a different operator with the same number of arguments, or by replacing it with a new random subtree. Terminals can be mutated by slightly perturbing their numerical values, or by randomly choosing a different input variable. (3) Crossover: new children are generated by random rearrangement of functional components between two or more parent individuals. Two parents are chosen with a probability related to their fitness. A node is randomly chosen on each parent tree, and the selected sub-trees are then swapped. At each reproduction stage, because of the use of these trees to encode mathematical equations, the new trees are still syntactically correct. The fitness of the new individuals in population 2 is assessed and the best individuals from the total population become the parents of the next generation. An individual's fitness is usually assessed as the root mean squared (RMS) error of the difference between expected values and the GP's estimated values for the training set. In order to reduce ‘bloat’, a phenomenon in which the GP function trees get so large that they lack explanatory power (Langdon and Poli, 1998; Podgorelec and Kokol, 2000), penalties on the number of nodes and the depth of the individual's function tree can be applied. This overall process is repeated until either the desired result is achieved or the rate of improvement in the population becomes zero.
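Sub-tree crossover and a bloat-penalized fitness can be sketched as follows, again in Python with trees held as nested tuples of the form (operator, child, child, ...). The helper names and the penalty weight are hypothetical, and for brevity the penalty here is applied to the node count only, not to the depth of the tree.

```python
import math


def paths(node, prefix=()):
    """Yield the path (a tuple of child indices) to every node in the tree."""
    yield prefix
    if isinstance(node, tuple):
        for i, child in enumerate(node[1:], start=1):
            yield from paths(child, prefix + (i,))


def get_subtree(node, path):
    """Follow a path of child indices down from the root."""
    for i in path:
        node = node[i]
    return node


def replace_subtree(node, path, new):
    """Return a copy of the tree with the sub-tree at `path` replaced by `new`."""
    if not path:
        return new
    i = path[0]
    return node[:i] + (replace_subtree(node[i], path[1:], new),) + node[i + 1:]


def subtree_crossover(parent_a, parent_b, path_a, path_b):
    """Swap the chosen sub-trees; both children remain syntactically valid trees."""
    child_a = replace_subtree(parent_a, path_a, get_subtree(parent_b, path_b))
    child_b = replace_subtree(parent_b, path_b, get_subtree(parent_a, path_a))
    return child_a, child_b


def tree_size(node):
    """Number of nodes (operators plus terminals) in the tree."""
    return sum(1 for _ in paths(node))


def penalized_rms(predictions, targets, tree, penalty_per_node=0.01):
    """RMS error on the training set plus a size penalty to discourage bloat."""
    squared = sum((p - t) ** 2 for p, t in zip(predictions, targets))
    return math.sqrt(squared / len(targets)) + penalty_per_node * tree_size(tree)


parent_a = ("+", "m1", ("*", "m2", 3.0))
parent_b = ("-", ("tanh", "m5"), "m7")
print(subtree_crossover(parent_a, parent_b, (2,), (1,)))
```

In a full GP the two paths would be drawn at random from `paths(parent_a)` and `paths(parent_b)`; they are fixed here so that the outcome is easy to follow.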



Fig. 7. The GP reproduction processes. Three strategies are used including (A) cloning, (B) mutation, and (C) crossover events.

 

    Application of genetic search algorithms to metabolome analysis
 
GAs and GPs are very efficient search algorithms and can be used to produce models that allow the deconvolution of metabolome data in chemical terms. Detailed below are three published plant metabolome examples illustrating this.

Metabolic fingerprinting in salt-stressed tomatoes
Samples from Edkawy tomato fruit grown hydroponically under both high- and low-salt conditions were analysed using FT-IR, with the aim of identifying biochemical features linked to salinity in the growth environment. Examination of the GP-derived trees showed that there were a small number of spectral regions that were consistently being used. In particular, the spectral region containing absorbances potentially due to a cyanide/nitrile functional group was identified as discriminatory (Johnson et al., 2000).

More recently (Johnson et al., 2003), the same authors applied FT-IR to fingerprint Edkawy and Simge F1 tomato varieties. It was observed that the exposure of the plants to salinity significantly reduced the relative growth rate of Simge F1 but had no significant effect on Edkawy. By contrast, whilst with both tomato varieties there was little effect on total fruit number, the salt treatment significantly reduced the mean fruit fresh weight and size class in both Edkawy and Simge F1. In this study, rather than using GP, GA was used as a variable selection method prior to discriminant MLR and this approach was able to discriminate accurately between control and salt-treated fruit. It was encouraging that this different genetic search algorithm on two tomato varieties also identified a cyanide/nitrile functional group as being discriminatory.

It is known that cyanide is formed in plants during ethylene biosynthesis, and that ethylene production is enhanced in plants under stress conditions. Therefore, it may be proposed that plants grown under saline conditions have enhanced levels of cyanide as a result of enhanced ethylene biosynthesis. Thus inductive reasoning via GP and GA has allowed a pathway turned on in tomatoes exposed to salinity to be highlighted as potentially important. This pathway can now be subjected to conventional biochemical analysis.

Analysis of defence in tobacco plants
Within functional genomics the potential power of evolutionary methods has been shown for the analysis of metabolites from transgenic tobacco plants (Kell et al., 2001). Tobacco is a model organism for the study of salicylate biology in plant defence, but despite a considerable amount of research, little is known regarding its synthesis, catabolism, and mode of action. Six-week-old control plants and a transgenic expressing a bacterial gene encoding the enzyme salicylate hydroxylase (SH-L), which is known to block salicylic acid accumulation in transgenic tobacco (Darby et al., 2000), were inoculated with tobacco mosaic virus and leaf samples were analysed by HPLC. GP analysis of these metabolome profiles identified three peaks as highly discriminatory for detecting the presence of the SH-L genotype in the transgenic. One of the peaks was indeed salicylate, but the other two were unknown and are now the subject of further investigation.

Analysis of the photoperiodic floral induction in Pharbitis nil
Metabolic fingerprints were obtained from unfractionated P. nil leaf sap samples by direct infusion into an electrospray ionization mass spectrometer using flow-injection (Vaidyanathan et al., 2002). Analyses took less than 30 s per sample and yielded complex mass spectra. Various chemometric methods, including cluster analysis, ANNs, and GP, could discriminate the metabolic fingerprints of plants subjected to different photoperiod treatments. A GP was evolved to discriminate plants 1 week after a short day (SD) exposure from all the other plants (24 h and 48 h after SD exposure and controls) and generated rules predominantly involving m/z 520, 229, and 143. Although these m/z values represented very minor peaks in the sap spectra, when they were plotted as a pseudo-3D plot (data not shown), 1-week plants could indeed be separated from the others. Whilst the identity of these peaks has not yet been established, ESI-MS with GP has suggested which analytes are potential metabolite markers, and thus this approach has generated new directions of research and potentially new knowledge (Goodacre et al., 2003).


    Metabolic fingerprinting of a fermentation model
 
The ability to control an industrial bioprocess is paramount for product yield optimization, and it is imperative, therefore, that the concentration of the fermentation product (the determinand) is assessed accurately. Whilst real industrial bioprocesses have been modelled in the past (McGovern et al., 2002, 1999) and GA and GP used for spectral interpretation (McGovern et al., 2002), for the present purpose a simple model of a bioprocess, previously investigated using mixtures of ampicillin and E. coli as a model system, will be examined (Winson et al., 1997).

Sample preparation and metabolic fingerprint generation
The bacterium used was E. coli HB101 (Maniatis et al., 1982); this is ampicillin-sensitive, indicating that any spectral features observed are not due, for instance, to β-lactamase activity. The mixtures were prepared as described previously (Winson et al., 1997). The strain was grown in 4.0 l liquid medium: glucose (BDH) 10.0 g; peptone (LabM) 5.0 g; beef extract (LabM) 3.0 g; per litre water, for 16 h at 37 °C in a shaker. After growth, the cultures were harvested by centrifugation, washed, and resuspended in physiological saline (0.9% NaCl). Ampicillin (desiccated D[-]-aminobenzylpenicillin sodium salt, ≥98% (titration), Sigma) was prepared in the bacterial suspensions to give concentration ranges of 0–5000 µg ml⁻¹ in 250 µg ml⁻¹ steps (0–13.46 mM) in 40 mg ml⁻¹ E. coli (dry weight), corresponding to ~1–2×10⁷ cells ml⁻¹.

Aliquots (20 µl) of the above samples were evenly applied to the wells of an aluminium plate (measuring 10×10 cm) in triplicate and dried at 50 °C for 30 min. The FT-IR instrument used was a Bruker IFS28 FT-IR spectrometer (Bruker Spectrospin Ltd., Banner Lane, Coventry, UK) equipped with an MCT (mercury–cadmium–telluride) detector (cooled with liquid N₂) and a motorized stage of a reflectance accessory, onto which the Al plate was loaded. Spectra were collected over the wavenumber range 4000 cm⁻¹ to 600 cm⁻¹. Spectra were acquired as described previously (Goodacre et al., 2000; Timmins et al., 1998) at a rate of 20 s⁻¹, the spectral resolution used was 4 cm⁻¹, and to improve the signal-to-noise ratio 256 spectra were co-added and averaged.

For the IR map the biomass from Escherichia coli HB101 was applied evenly to the surface of a flat 7×7 cm Al plate at a concentration of ~200 µg cm⁻² (dry weight); a β-lactam ring was then drawn with ~100 µl of a 5 mg ml⁻¹ solution of ampicillin. Data were acquired at a resolution of 1×1 mm (therefore these maps are 71×71 pixels × 882 wavenumbers). Spectra were acquired as above, but for speed (since 5041 spectra were collected) the number of co-adds was only 16.

Quantifying the level of ampicillin
Initially PLS was used to predict the level of ampicillin in the mixtures. The training data were those triplicate spectra with ampicillin concentrations 0, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, and 5000 µg ml⁻¹, whilst the test data were those mixtures containing 250, 750, 1250, 1750, 2250, 2750, 3250, 3750, 4250, and 4750 µg ml⁻¹ ampicillin. The model was test set validated and whilst 10 latent variables gave the lowest RMS error (168 µg ml⁻¹), the most parsimonious model used four latent variables with an acceptable RMS error of 288 µg ml⁻¹. The results for this are shown in Fig. 8A, where it is clear that PLS was able to predict the concentration of this secondary metabolite accurately. PLS is a linear regression analysis and it is possible to inspect the regression coefficients from the PLS model (Fig. 8B); whilst many areas of the 882 inputs were selected as positively contributing to the model (and hence correlated with ampicillin), it is clear that a vibration at 1767 cm⁻¹ was dominant.
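The interleaved training and test design, and the RMS error used to compare models, can be written down in a few lines of Python. This is only a sketch of the bookkeeping: the variable and function names are illustrative, and no PLS model is fitted here.

```python
import math

# Ampicillin levels from 0 to 5000 µg/ml in 250 µg/ml steps.
levels = list(range(0, 5001, 250))

# Alternate levels go to the training and test sets, so that every test
# concentration lies between two concentrations seen during training.
train_levels = levels[0::2]    # 0, 500, ..., 5000
test_levels = levels[1::2]     # 250, 750, ..., 4750

REPLICATES = 3                 # each mixture was measured in triplicate
n_train_spectra = len(train_levels) * REPLICATES
n_test_spectra = len(test_levels) * REPLICATES


def rms_error(predicted, true):
    """Root mean squared error between predicted and true concentrations."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, true)) / len(true))


print(train_levels)
print(test_levels)
print(n_train_spectra, n_test_spectra)
```

Because the test concentrations are never seen in training, a low RMS error on them indicates that the model interpolates between calibration levels and has not simply memorized them.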



Fig. 8. (A) The PLS estimates versus the true ampicillin titre (µg ml⁻¹) in E. coli HB101; asterisks represent the training data and circles the test data. Also shown is the expected y=x line. (B) The PLS latent variable loadings used in the model shown in (A).

 
The same training and test sets were analysed by GP using the Genomic Computing software Gmax-bio™ (Aber Genomic Computing, Aberystwyth, UK) which runs under Microsoft Windows NT on an IBM-compatible PC. An introduction to Gmax-bio™ is given elsewhere (Kell et al., 2001). The default parameter settings for population size (1000), mutation and recombination rates were used throughout, and the fitness was assessed using OLS (ordinary least squares). The operators that were used were the arithmetical ones +, −, ÷, ×, *, log(x), 10ˣ, the hyperbolic tanh(x), and numerical inputs of 0.1, 1, 3, 5, and random. Ten GPs were run and the GP results were equivalent to the PLS predictions (data not shown). The number of times each input (wavenumber) was used in the 10 evolved populations was calculated and plotted against the wavenumber of the infrared light (Fig. 9), where it was clear that a single dominating area of the spectra was chosen, which was a vibration at 1767 cm⁻¹. This peak was also clearly evident in the E. coli+5000 µg ml⁻¹ ampicillin mix, but was absent from E. coli alone (Fig. 9).
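Counting how often each input is chosen across independent runs is straightforward. In the minimal Python sketch below, each evolved model is reduced to the list of wavenumbers appearing in it; the function name and the example models are invented, and counting an input once per model (not once per occurrence) is an assumption, not a description of how Gmax-bio reports usage.

```python
from collections import Counter


def input_usage(models):
    """Count, for each input, how many evolved models use it at least once."""
    counts = Counter()
    for used_inputs in models:
        counts.update(set(used_inputs))   # set(): one vote per model
    return counts


# Hypothetical wavenumbers (cm^-1) used by four evolved models.
models = [
    [1767, 1662, 1767],
    [1767, 1240],
    [1767],
    [1662, 1767],
]

usage = input_usage(models)
for wavenumber, count in usage.most_common():
    print(wavenumber, count)
```

Plotting these counts against wavenumber gives a frequency plot of the kind described above, in which inputs that are repeatedly selected by independent runs stand out.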



Fig. 9. Summed frequency plot from GP analysis of the number of times each input (wavenumber) was used for the 10 evolved populations. Also shown are the normalized FT-IR spectra from E. coli (blue trace) and E. coli+5000 µg ml⁻¹ ampicillin (red trace), and the structure of ampicillin.

 
Spectral interpretation
That the vibration at 1767 cm⁻¹ was selected as characteristic of ampicillin is highly encouraging since it is known (Winson et al., 1997) that this corresponds to the constrained carbonyl bond in the β-lactam ring of penicillins in general (the structure of ampicillin is shown in Fig. 9). To highlight the usefulness of this spectral deconvolution an FT-IR image was collected from an E. coli background with an image of the β-lactam ring drawn with ampicillin, and slices of this hypercube were taken and colour maps generated (Fig. 10). The first image shows the level of protein on the plate by simple integration under the Amide I band at 1662 cm⁻¹, and since the level of protein is constant the picture can essentially not be seen. By contrast, integrating under the molecular vibration from the constrained carbonyl on the β-lactam ring at 1767 cm⁻¹ allows the hidden chemical image to be clearly seen.



Fig. 10. Chemical maps of (A) the integration under the Amide I band at 1662 cm–1 from 1674 cm–1 to 1651 cm–1, and (B) the integration under the constrained carbonyl vibration on the β-lactam ring at 1767 cm–1 from 1778 cm–1 to 1755 cm–1.
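The band integration used to produce such chemical maps can be sketched as follows. This is a minimal illustration, assuming that the hypercube is held as a nested list of spectra (one per pixel) sharing a common wavenumber axis and that a simple trapezoidal rule is adequate; the function names and the synthetic spectra are hypothetical, and only the integration limits (1778 to 1755 cm–1) are taken from the text.

```python
def band_area(wavenumbers, absorbances, low, high):
    """Trapezoidal area under one spectrum between low and high (cm^-1)."""
    pts = sorted((w, a) for w, a in zip(wavenumbers, absorbances) if low <= w <= high)
    return sum((w2 - w1) * (a1 + a2) / 2.0 for (w1, a1), (w2, a2) in zip(pts, pts[1:]))


def chemical_map(hypercube, wavenumbers, low, high):
    """hypercube[row][col] is a spectrum; returns a 2-D list of band areas."""
    return [[band_area(wavenumbers, spec, low, high) for spec in row] for row in hypercube]


# Synthetic data for illustration only (not measured spectra).
wavenumbers = [1750, 1755, 1760, 1765, 1770, 1775, 1780]
background = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]  # flat baseline pixel
peak = [0.1, 0.1, 0.3, 0.5, 0.3, 0.1, 0.1]        # pixel with a band near 1767 cm^-1

hypercube = [[background, peak],
             [background, background]]

# Integration window for the constrained carbonyl band, as in Fig. 10B.
cmap = chemical_map(hypercube, wavenumbers, 1755, 1778)
print(cmap)
```

In this sketch the pixel containing the band gives a larger integrated area than its neighbours, which is the contrast that a colour map would display. A band of roughly constant intensity across the image, such as Amide I on a uniform bacterial lawn, would give a near-uniform map.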

 

    General conclusions
In the early stages of functional genomics programmes, current knowledge is often minute: there are no ideas about the role of an orphan ORF and few, if any, hypotheses to test (Brent, 2000; Kell and King, 2000). However, experiments can be designed based, for example, on gene knockouts and controlled over-expression, and the effect on the phenotype of the organism observed. Alternatively, an isogenic organism might be exposed to different abiotic and biotic stresses to assess how it adapts to these new environments.

Metabolomics is one ‘omics approach with which data floods can be generated from these genetic manipulations and environmental stimuli (as indeed are transcriptomics and proteomics, and the general conclusions given here apply equally to these methods). However, deconvolution of these data, in terms of which metabolites are of key importance to the genetic manipulations and environmental stimuli, is essential to generate new knowledge. Evolutionary algorithms can be considered to be rule induction methods that are entirely data-driven and are thus especially appropriate for problems that are data-rich but hypothesis/information-poor. As described above, rule induction can be used to generate rules, and hence hypotheses, from suitable examples. Of course, these new theories will not necessarily be correct, but testing them will generate new knowledge, leading to an increased understanding of the function of the orphan gene or of how an organism responds to different environmental conditions.


    Acknowledgements
 
I am indebted to the UK BBSRC (Engineering and Biological Systems Committee), the UK EPSRC and the Royal Society of Chemistry for supporting our research within metabolomics.


    References
Arabidopsis Genome Initiative. 2000. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796–815.

Bäck T, Fogel DB, Michalewicz Z. 1997. Handbook of evolutionary computation. Oxford: IOP Publishing/Oxford University Press.

Banzhaf W, Nordin P, Keller RE, Francone FD. 1998. Genetic programming: an introduction. San Francisco: Morgan Kaufmann.

Barabási A-L, Oltvai ZN. 2004. Network biology: understanding the cell's functional organization. Nature Reviews Genetics 5, 101–113.

Barrow JD, Silk J. 1995. The left hand of creation: the origin and evolution of the expanding universe. London: Penguin.

Beavis RC, Colby SM, Goodacre R, Harrington PB, Reilly JP, Sokolow S, Wilkerson CW. 2000. Artificial intelligence and expert systems in mass spectrometry. In: Meyers RA, ed. Encyclopedia of Analytical Chemistry, 11558–11597.

Beyer H-G. 2001. The theory of evolution strategies. Berlin: Springer.

Brent R. 2000. Genomic biology. Cell 100, 169–183.

Broadhurst D, Goodacre R, Jones A, Rowland JJ, Kell DB. 1997. Genetic algorithms as a method for variable selection in PLS regression, with application to pyrolysis mass spectra. Analytica Chimica Acta 348, 71–86.

Corne D, Dorigo M, Glover F. 1999. New ideas in optimization. London: McGraw Hill.

Darby RM, Maddison A, Mur LAJ, Bi Y-M, Draper J. 2000. Cell specific expression of salicylate hydroxylase in an attempt to separate localised HR and systemic signalling establishing SAR in tobacco. Plant Molecular Pathology 1, 115–124.

Darwin C. 1859. On the origin of species by means of natural selection. John Murray.

Fiehn O. 2001. Combining genomics, metabolome analysis, and biochemical modelling to understand metabolic networks. Comparative and Functional Genomics 2, 155–168.

Fiehn O. 2002. Metabolomics—the link between genotypes and phenotypes. Plant Molecular Biology 48, 155–171.

Fogel DB. 2000. Evolutionary computation: toward a new philosophy of machine intelligence. Piscataway: IEEE Press.

Fogel DB, Fogel LJ. 1996. An introduction to evolutionary programming. Artificial Evolution 1063, 21–33.

Garey M, Johnson D. 1979. Computers and intractability: a guide to the theory of NP-completeness. San Francisco: Freeman.

Goff SA, Ricke D, Lan TH, et al. 2002. A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 296, 92–100.

Goldberg DE. 1989. Genetic algorithms in search, optimization and machine learning. Reading, MA: Addison-Wesley.

Goodacre R, Kell DB. 2003. Evolutionary computation for the interpretation of metabolomic data. In: Harrigan GG, Goodacre R, eds. Metabolic profiling: its role in biomarker discovery and gene function analysis. Boston: Kluwer Academic Publishers, 239–256.

Goodacre R, Shann B, Gilbert RJ, Timmins ÉM, McGovern AC, Alsberg BK, Kell DB, Logan NA. 2000. The detection of the dipicolinic acid biomarker in Bacillus spores using Curie-point pyrolysis mass spectrometry and Fourier transform infrared spectroscopy. Analytical Chemistry 72, 119–127.

Goodacre R, York EV, Heald JK, Scott IM. 2003. Chemometric discrimination of unfractionated plant extracts profiled by flow-injection electrospray mass spectrometry. Phytochemistry 62, 859–863.

Hardy N, Fuell H. 2003. Databases, data modelling and schemas: database development in metabolomics. In: Harrigan GG, Goodacre R, eds. Metabolic profiling: its role in biomarker discovery and gene function analysis. Boston: Kluwer Academic Publishers, 277–291.

Harrigan GG, Goodacre R. 2003. Metabolic profiling: its role in biomarker discovery and gene function analysis. Boston: Kluwer Academic Publishers.

Holland JH. 1992. Adaptation in natural and artificial systems. Cambridge, MA: MIT Press.

Horchner U, Kalivas JH. 1995. Further investigation on a comparative study of simulated annealing and genetic algorithm for wavelength selection. Analytica Chimica Acta 311, 1–13.

Johnson HE, Broadhurst D, Goodacre R, Smith AR. 2003. Metabolic fingerprinting in salt-stressed tomatoes. Phytochemistry 62, 919–928.

Johnson HE, Gilbert RJ, Winson MK, Goodacre R, Smith AR, Rowland JJ, Hall MA, Kell DB. 2000. Explanatory analysis of the metabolome using genetic programming of simple, interpretable rules. Genetic Programming and Evolvable Machines 1, 243–258.

Kell DB. 2002. Genotype:phenotype mapping: genes as computer programs. Trends in Genetics 18, 555–559.

Kell DB. 2004. Metabolomics and systems biology: making sense of the soup. Current Opinion in Microbiology 7, 296–307.

Kell DB, Darby RM, Draper J. 2001. Genomic computing. Explanatory analysis of plant expression profiling data using machine learning. Plant Physiology 126, 943–951.

Kell DB, King RD. 2000. On the optimization of classes for the assignment of unidentified reading frames in functional genomics programmes: the need for machine learning. Trends in Biotechnology 18, 93–98.

Koza JR. 1992. Genetic programming: on the programming of computers by means of natural selection. Cambridge, MA: MIT Press.

Koza JR. 1994. Genetic programming II: automatic discovery of reusable programs. Cambridge, MA: MIT Press.

Koza JR, Bennett FH, Keane MA, Andre D. 1999. Genetic programming III: Darwinian invention and problem solving. San Francisco: Morgan Kaufmann.

Langdon WB. 1998. Genetic programming and data structures: genetic programming+data structures=automatic programming! Boston: Kluwer.

Langdon WB, Poli R. 1998. Fitness causes bloat: mutation. In: Banzhaf W, Poli R, Schoenauer M, Fogarty TC, eds. Proceedings of the first european workshop on genetic programming, Vol. 1391. Berlin: Springer-Verlag, 37–48.

Leardi R, Seasholtz MB, Pell RJ. 2002. Variable selection for multivariate calibration using a genetic algorithm: prediction of additive concentrations in polymer films from Fourier transform-infrared spectral data. Analytica Chimica Acta 461, 189–200.

Li XJ, Brazhnik O, Kamal A, Guo D, Lee C, Hoops S, Mendes P. 2003. Databases and visualization for metabolomics. In: Harrigan GG, Goodacre R, eds. Metabolic profiling: its role in biomarker discovery and gene function analysis. Boston: Kluwer Academic Publishers, 293–309.

Maniatis T, Fritsch EF, Sambrook J. 1982. Molecular cloning: a laboratory manual. Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press.

McGovern AC, Broadhurst D, Taylor J, Kaderbhai N, Winson MK, Small DA, Rowland JJ, Kell DB, Goodacre R. 2002. Monitoring of complex industrial bioprocesses for metabolite concentrations using modern spectroscopies and machine learning: application to gibberellic acid production. Biotechnology and Bioengineering 78, 527–538.

McGovern AC, Ernill R, Kara BV, Kell DB, Goodacre R. 1999. Rapid analysis of the expression of heterologous proteins in Escherichia coli using pyrolysis mass spectrometry and Fourier transform infrared spectroscopy with chemometrics: application to α2-interferon production. Journal of Biotechnology 72, 157–167.

Mendes P. 2002. Emerging bioinformatics for the metabolome. Briefings in Bioinformatics 3, 134–145.

Michalewicz Z, Fogel DB. 2000. How to solve it: modern heuristics. Heidelberg: Springer-Verlag.

Mitchell M. 1995. An introduction to genetic algorithms. Boston: MIT Press.

Mitchell TM. 1997. Machine learning. New York: McGraw Hill.

Mojzsis S, Arrhenius G, McKeegan KD, Harrison TM, Nutman AP, Friend CPL. 1996. Evidence for life on earth before 3800 million years ago. Nature 385, 55–59.

Oliver SG, Winson MK, Kell DB, Baganz F. 1998. Systematic functional analysis of the yeast genome. Trends in Biotechnology 16, 373–378.

Podgorelec V, Kokol P. 2000. Fighting program bloat with the fractal complexity measure. Lecture Notes in Computer Science; Genetic Programming Proceedings 1802, 326–337.

Roger JM, Bellon-Maurel V. 2000. Using genetic algorithms to select wavelengths in near-infrared spectra: application to sugar content prediction in cherries. Applied Spectroscopy 54, 1313–1320.

Schwefel H-P. 1995. Evolution and optimum seeking. New York: Wiley.

Seasholtz MB, Kowalski B. 1993. The parsimony principle applied to multivariate calibration. Analytica Chimica Acta 277, 165–177.

Taylor J, Rowland JJ, Gilbert RJ, Jones A, Winson MK, Kell DB. 1998. Genetic algorithm decoding for the interpretation of infra red spectra in analytical biotechnology. Birmingham: University of Birmingham.

The International Human Genome Mapping Consortium. 2001. A physical map of the human genome. Nature 409, 934–941.

Timmins ÉM, Howell SA, Alsberg BK, Noble WC, Goodacre R. 1998. Rapid differentiation of closely related Candida species and strains by pyrolysis mass spectrometry and Fourier transform infrared spectroscopy. Journal of Clinical Microbiology 36, 367–374.

Vaidyanathan S, Kell DB, Goodacre R. 2002. Flow-injection electrospray ionization mass spectrometry of crude cell extracts for high-throughput bacterial identification. Journal of the American Society for Mass Spectrometry 13, 118–128.

Venter JC, Adams MD, Myers EW, et al. 2001. The sequence of the human genome. Science 291, 1304–1351.

Williams RR, Paradkar RP. 1997. Correcting fluctuating baselines and spectral overlap with genetic regression. Applied Spectroscopy 51, 92–100.

Winson MK, Goodacre R, Woodward AM, Timmins ÉM, Jones A, Alsberg BK, Rowland JJ, Kell DB. 1997. Diffuse reflectance absorbance spectroscopy taking in chemometrics (DRASTIC). A hyperspectral FT-IR-based approach to rapid screening for metabolite overproduction. Analytica Chimica Acta 348, 273–282.

Yu J, Hu SN, Wang J, et al. 2002. A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296, 79–92.

