December 2009

MEDICAL SCIENCE:

Errors in Biomedical Databases May Threaten Public Health

Extremely valuable medical information can be gained from knowing that a person has too many, too few, or defective proteins of a specific type. For example, type 2 diabetes is associated with defective tyrosine kinase receptor proteins, and people with defective vitamin K epoxide reductase proteins are at risk for adverse reactions to warfarin, a blood thinner.

The success of medical efforts based on protein analysis critically depends upon unraveling the function of the many proteins relevant to human health. There are far too many proteins to characterize experimentally.

Consequently, software has been developed to predict protein function from protein composition (more specifically, from the amino acid sequence, that is, the identity and order of the protein's subunits). The results of these computational predictions are submitted to public databases, where they are intermixed with functional assignments based on direct experiments.

These public databases of protein function will be useful only if they are accurate. Among many other medical consequences, inaccurate information on protein function could easily lead scientists down the wrong path in drug discovery.

Patricia Babbitt (University of California, San Francisco) and coworkers have found that only one of the four public databases they studied provides relatively accurate protein function data. For some protein types, in some of the databases, the error rate in reported protein function exceeds 60%.

Investigating error.

The scientists' study focused on four public databases of enzyme function. Enzymes were selected in preference to other proteins because their function is often precise and easier to define than that of other proteins.

These databases list both the amino acid sequence of each enzyme and its function.

The scientists investigated whether the functions attributed to the listed enzymes were accurate. Their protocol was a multi-step procedure that used "standard" enzymes, ones that many scientists have investigated experimentally in detail, as points of comparison.

An enzyme passes the attribution test if it meets three conditions: its amino acid sequence matches those of related enzymes that perform the same function via the same mechanism, it possesses the amino acid units that are critical for the attributed function, and its structure is similar to that of related enzymes. Failure at any of these steps strongly suggests that the enzyme in question does not perform the function attributed to it.
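The logic of this test can be sketched in a few lines of Python. This is only an illustration of the pass/fail rule described above, not the scientists' actual software: the class name, field names, and function names are all hypothetical, and each check is assumed to have been reduced to a simple yes/no outcome beforehand.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class AttributionChecks:
    """Yes/no outcome of each check for one database entry (hypothetical fields)."""
    sequence_matches_family: bool      # sequence matches related enzymes with the same function
    has_critical_residues: bool        # amino acid units critical for the function are present
    structure_resembles_family: bool   # structure is similar to related enzymes


def failed_steps(checks: AttributionChecks) -> List[str]:
    """Return the names of the checks that did not pass, in definition order."""
    return [f.name for f in fields(checks) if not getattr(checks, f.name)]


def passes_attribution_test(checks: AttributionChecks) -> bool:
    """An entry passes only if every check passes; a single failure is enough to fail."""
    return not failed_steps(checks)
```

Listing the failed steps, rather than returning only a yes or no, makes it possible to see why an entry was flagged.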

The manually managed public database is the most accurate.

The scientists found that three of the four public databases often have high error rates. The average error rate is 60% for some classes of enzymes in one of the databases.

The most accurate public database (typical average error rates of around 5% or less) is the one in which submissions are screened manually by the managers of the database. The scientists found that by far the most common type of error (85%) was attributing more functions to a specific enzyme than is scientifically justified.

One would hope that more recent software would reduce the attribution problem. However, the scientists found that the error rate in enzyme functional attribution increased over time, rising from around 15% in 1993 to nearly 40% in 2005.

Implications.

The scientists recommend that public databases of enzyme function enable users to search for enzymes whose function has been experimentally validated, screening out all others. They further recommend that a function be attributed to an enzyme only when strong evidence supports it.

This latter recommendation would reduce the error rate, but may also hinder scientists' ability to classify enzymes into groups and deduce common physiological relationships among enzymes. The scientists further praise the accuracy of the manually managed public database, while acknowledging that maintaining such databases is labor-intensive and expensive.
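The first recommendation, a search restricted to experimentally validated enzymes, amounts to a simple filter. The sketch below assumes each database entry carries a label saying how its function was assigned. The `Entry` record, its `evidence` field, and the label values "experimental" and "predicted" are hypothetical and do not correspond to the format of any real database.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Entry:
    """One database record (hypothetical layout)."""
    enzyme_id: str
    function: str
    evidence: str  # assumed label: "experimental" or "predicted"


def experimentally_validated(entries: Iterable[Entry]) -> List[Entry]:
    """Keep only entries whose function was assigned by direct experiment."""
    return [e for e in entries if e.evidence == "experimental"]


# Usage with placeholder records (not real data):
records = [
    Entry("ENZ-A", "placeholder function 1", "experimental"),
    Entry("ENZ-B", "placeholder function 2", "predicted"),
]
validated = experimentally_validated(records)  # contains only ENZ-A
```

Such a filter is only as reliable as the evidence labels themselves, which is one reason the manual screening discussed above matters.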

This research should not be taken to suggest that public biomedical databases of protein function are worthless. The results do suggest that more care needs to be taken before submitting entries, and that more advanced software may need to be developed for predicting a protein's function from its amino acid sequence.

Scientists who scour these databases for hints on how to design a drug to target a particular protein should keep in mind that the databases are not fully accurate. If at all possible, they should base their drug design on information derived from proteins that have undergone rigorous experimental investigation.

For more information:
Schnoes, A. M., Brown, S. D., Dodevski, I., & Babbitt, P. C. (2009). Annotation Error in Public Databases: Misannotation of Molecular Function in Enzyme Superfamilies. PLoS Computational Biology, 5(12). DOI: 10.1371/journal.pcbi.1000605