Organism                      Genome size (Mb)
Helicobacter pylori           1.66
Mycobacterium leprae          3.26
Mycobacterium tuberculosis    4.4
Saccharomyces cerevisiae      13.0
Candida albicans              15.0
genome size is in the region of 3.2 gigabases (Gb), approximately 1000 times larger than a typical bacterial genome (Table 2.1). Less than one-third of the genome is transcribed into RNA. Only 5% of that RNA is believed to encode polypeptides, and the number of polypeptide-encoding genes is estimated to be of the order of 30 000, well below the initial estimates of 100 000-120 000.
From a drug discovery/development perspective, the significance of genome data is that it provides full sequence information for every protein the organism can produce. This should result in the identification of previously undiscovered proteins with potential therapeutic application, i.e. the process should help identify new potential biopharmaceuticals. The greatest pharmaceutical impact of sequence data, however, will almost certainly be the identification of numerous additional drug targets. It has been estimated that all drugs currently on the market target one (or more) of a maximum of 500 targets. The majority of such targets are proteins (mainly enzymes, hormones, ion channels and nuclear receptors). The human genome sequence data is believed to conceal anywhere between 3000 and 10 000 new protein-based drug targets. Additionally, the genome sequences of many human pathogens (e.g. Helicobacter pylori, Mycobacterium tuberculosis and Vibrio cholerae; Table 2.1) encode hundreds, perhaps thousands, of pathogen proteins that could serve as drug targets against those pathogens (e.g. gene products essential for pathogen viability or infectivity).
While genome sequence data undoubtedly harbours new drug leads/drug targets, the problem now has become one of specifically identifying such genes. Impeding this process is the fact that (at the time of writing) the biological function of between one-third and half of sequenced gene products remains unknown. The focus of genome research is therefore now shifting towards elucidating the biological function of these gene products, i.e. shifting towards ‘functional genomics’.
Assessment of function is critical to understanding the relationship between genotype and phenotype and, of course, for the direct identification of drug leads/targets. The term ‘function’ traditionally has been interpreted in the narrow sense of what isolated biological role/activity the gene product displays (e.g. is it an enzyme and, if so, what specific reaction does it catalyse?). In the context of genomics, gene function is assigned a broader meaning, incorporating not only the isolated biological function/activity of the gene product, but also relating to:
• where in the cell that product acts and, in particular, what other cellular elements it influences or interacts with;
• how such influences/interactions contribute to the overall physiology of the organism.
THE DRUG DEVELOPMENT PROCESS
The assignment of function to the products of sequenced genes can be pursued via various approaches, including:
• sequence homology studies;
• phylogenetic profiling;
• Rosetta Stone method;
• gene neighbourhood method;
• knock-out animal studies;
• DNA array technology (gene chips);
• proteomics approach;
• structural genomics approach.
With the exception of knock-out animals, these approaches rely, at least in part, on the interrogation and comparison of sequence or structural data. The availability of powerful computer programs renders them ‘high-throughput’. Even so, applying these methodologies will not immediately reveal the function of every sequenced gene product.
Sequence homology studies depend upon computer-based (bioinformatic) sequence comparison between a gene of unknown function (or, more accurately, of unknown gene product function) and genes whose product has previously been assigned a function. High homology suggests likely related functional attributes. Sequence homology studies can assist in assigning a putative function to 40-60% of all new gene sequences.
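The principle can be illustrated with a minimal Python sketch. It assumes the sequences have already been aligned to equal length and scores them by simple percentage identity; real homology searches rely on dedicated alignment tools and statistical scoring. The function names, the 40% threshold, the sequences and the annotations are all hypothetical, chosen only for illustration.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of identical positions in two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    if not a:
        return 0.0
    # Gap characters ('-') never count as a match.
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * matches / len(a)


def putative_function(query, annotated, threshold=40.0):
    """Return (function, identity) for the best-scoring annotated sequence,
    or None if no sequence reaches the (hypothetical) identity threshold."""
    best = None
    for seq, function in annotated.items():
        score = percent_identity(query, seq)
        if best is None or score > best[1]:
            best = (function, score)
    if best is not None and best[1] >= threshold:
        return best
    return None


# Hypothetical toy data: sequences of known function and one uncharacterized query.
annotated = {
    "MKTAYIAKQR": "kinase (hypothetical annotation)",
    "GAVLIPFWMC": "transporter (hypothetical annotation)",
}
query = "MKTAYLAKQR"
hit = putative_function(query, annotated)
print(hit)
```

A high-scoring hit yields only a putative function, which still requires experimental confirmation.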
Phylogenetic profiling entails establishing a pattern of the presence or absence of the particular gene coding for a protein of unknown function across a range of different organisms whose genomes have been sequenced. If it displays an identical presence/absence pattern to an already characterized gene, then in many instances it can be inferred that both gene products have a related function.
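As a rough sketch of this idea, each gene can be represented by a presence/absence vector across a set of sequenced genomes, and genes with identical vectors can then be grouped. The organism names, gene names and helper functions below are hypothetical, and the sketch assumes that the presence or absence of each gene in each genome has already been determined.

```python
def build_profile(gene, genomes):
    """Presence/absence pattern of a gene across the genomes, in insertion order."""
    return tuple(gene in gene_set for gene_set in genomes.values())


def matching_genes(unknown, characterized, genomes):
    """Characterized genes whose profile is identical to that of the unknown gene."""
    target = build_profile(unknown, genomes)
    return [g for g in characterized if build_profile(g, genomes) == target]


# Hypothetical gene content of four sequenced genomes.
genomes = {
    "organism_A": {"geneU", "gene1", "gene2"},
    "organism_B": {"gene2"},
    "organism_C": {"geneU", "gene1"},
    "organism_D": {"geneU", "gene1", "gene2"},
}

# geneU is of unknown function; gene1 and gene2 are already characterized.
matches = matching_genes("geneU", ["gene1", "gene2"], genomes)
print(matches)
```

Here geneU and gene1 are present in, and absent from, exactly the same organisms, so gene1 would be the candidate for a related function.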
The Rosetta Stone approach is dependent upon the observation that sometimes two separate polypeptides (i.e. gene products X and Y) found in one organism occur in a different organism as a single fused protein, XY. In such circumstances, the two protein parts (domains), X and Y, often display linked functions. Therefore, if gene X is discovered in a newly sequenced genome and is of unknown function, but a gene XY of known function has previously been discovered in a different genome, a likely function for X can be inferred from that of XY.
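The logic can be sketched in a few lines of Python. The sketch assumes that each protein has already been broken down into its constituent domains, which in practice requires sequence comparison, and simply looks for fused proteins that contain the domain of unknown function. All protein names, domain labels and annotations are hypothetical.

```python
def rosetta_stone_links(unknown_domain, fused_proteins):
    """Map each partner domain to the fused proteins (and their known functions)
    in which it occurs together with unknown_domain."""
    partners = {}
    for name, (domains, function) in fused_proteins.items():
        # Only multi-domain (fused) proteins containing the unknown domain are informative.
        if unknown_domain in domains and len(domains) > 1:
            for domain in domains:
                if domain != unknown_domain:
                    partners.setdefault(domain, []).append((name, function))
    return partners


# Hypothetical proteins from other, previously annotated genomes:
# name -> (domain composition, known function)
fused_proteins = {
    "protXY": (("X", "Y"), "two-step biosynthetic enzyme (hypothetical)"),
    "protZ": (("Z",), "unrelated protein (hypothetical)"),
}

links = rosetta_stone_links("X", fused_proteins)
print(links)
```

The result links the unknown X to domain Y through the fused protein, suggesting that X participates in the same process as Y; it indicates a functional association rather than proving a specific activity.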