Showing posts with label openbabel. Show all posts

Tuesday, December 18, 2007

How good are biological data - II: Thrombin, GSK, GPCR

Many binding affinity prediction methods, such as scoring functions and QSAR models, rely on the availability of accurate information on binding constants. The figure on the left is the result of applying our sdf-file parser to thrombin (blue) and GSK (yellow) binding data from the BindingDB database. The parser is written in Python and uses pybel to extract unique molecules from a given multi-molecule sdf file.
The parser not only finds identical compounds (identical in the Tanimoto-similarity sense), but also prints the binding constants from the sdf records. The graph shows the correlation of the reported -log(binding constant) values for the same molecules from different entries (sources).
The result is in fact fairly impressive (the blue points): the discrepancies, or "errors", are quite large and are especially pronounced for good (or, better to say, very good) binders.
The yellow points represent the result of running the same script over GSK kinase activity data. Although the total number of molecules in BindingDB is much larger, almost all of them are unique. The differences between sources are not as large as for thrombin.
The figure on the right is the visualized script output for a GPCR (5-HT2B) from the PDSP Ki database. The situation is roughly the same: the accuracy of a typical biological experiment reported in the literature amounts to roughly a single unit of pKd.
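The pairing behind such a plot can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the parser used for the figures: the function names, the molecule keys and the sample values are hypothetical, and the binding constants are assumed to be given in nM.

```python
import math
from collections import defaultdict
from itertools import combinations


def pk_from_nm(k_nm):
    """Convert a binding constant in nM to pK = -log10(K in mol/L)."""
    return 9.0 - math.log10(k_nm)


def paired_pk(records):
    """Return a (pK_a, pK_b) tuple for every pair of records that share a molecule key."""
    by_molecule = defaultdict(list)
    for key, k_nm in records:
        by_molecule[key].append(pk_from_nm(k_nm))
    pairs = []
    for values in by_molecule.values():
        # every combination of two reported values for the same molecule
        pairs.extend(combinations(values, 2))
    return pairs


# Hypothetical records: (molecule key, binding constant in nM)
records = [("mol-A", 1.0), ("mol-A", 10.0), ("mol-B", 100.0)]
pairs = paired_pk(records)
```

Each tuple in `pairs` is one point on a scatter plot of this kind. Molecules reported only once, like the hypothetical `mol-B`, contribute nothing.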

This and the previously reported correlation for the HERG ion channel should serve as a reference point when the results of binding affinity calculations are compared to experimental data.

Friday, December 7, 2007

How good are biological experiments? HERG binding data analysis


A correlation between predicted and experimentally measured values of biological activity is a natural measure of model quality. For instance, the QUANTUM docking software calculates binding free energies, which are directly comparable with experimental values of -pKd, where Kd is the binding constant. The root mean squared error between the measured and the calculated quantities is a quantitative measure of the software's performance.
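As a rough illustration of that comparison, the sketch below converts a binding free energy to pKd using the standard relation dG = RT ln Kd = -RT ln(10) pKd and then computes a root mean squared error. The function names, the choice of kcal/mol and the default temperature of 298.15 K are assumptions made for this example. They are not taken from the QUANTUM software.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def pkd_from_dg(dg_kcal, temperature=298.15):
    """Convert a binding free energy (kcal/mol) to pKd via dG = -RT ln(10) pKd."""
    return -dg_kcal / (R_KCAL * temperature * math.log(10))


def rmse(predicted, measured):
    """Root mean squared error between two equally long sequences of values."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared) / len(squared))
```

At room temperature one pKd unit corresponds to roughly 1.36 kcal/mol, so `pkd_from_dg(-1.364)` is close to 1.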
Whatever correlation is presented to prove the validity of a model, another important issue is the quality of the experimental data itself. The reported values for binding constants (or activities) often vary because of different measurement strategies, experimental errors or interpretation uncertainties. To visualize the situation we investigated a few datasets for HERG binding taken from the QSAR World website.
The downloaded files were saved in the source folder and processed with the following simple Python script (thanks to openbabel):
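import os
from openbabel import pybel  # Open Babel 3.x; older releases use "import pybel"

files = os.listdir('source/')
molecules = []  # unique molecules seen so far
for file in files:
    molfile = pybel.readfile("sdf", 'source/' + file)
    for mol in molfile:
        molfp = mol.calcfp()
        present = False
        for savedmol in molecules:
            savedmolfp = savedmol.calcfp()
            # "|" on two pybel fingerprints gives their Tanimoto coefficient
            if (molfp | savedmolfp) == 1:
                present = True
                # same molecule in two records: print both data blocks
                print(mol.data, savedmol.data)
        if not present:
            molecules.append(mol)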

The results were analyzed in a spreadsheet program and are shown in the graph above. Many molecules occur multiple times in the datasets. While in many cases the activities coincide to within 0.01 (which most probably indicates citation from a single source), the remaining values, though correlated with each other, differ by roughly a single pKd unit.
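The same summary can be reproduced without a spreadsheet. The following is a minimal sketch, assuming the duplicate measurements have already been collected as pairs of pKd values. The function name, the 0.01 tolerance argument and the sample pairs are hypothetical and serve only to illustrate the bookkeeping.

```python
import math


def summarize_pairs(pairs, tolerance=0.01):
    """Count near-identical pairs and report the RMS difference of the rest."""
    diffs = [abs(a - b) for a, b in pairs]
    differing = [d for d in diffs if d > tolerance]
    coinciding = len(diffs) - len(differing)
    if differing:
        rms = math.sqrt(sum(d * d for d in differing) / len(differing))
    else:
        rms = 0.0
    return coinciding, len(differing), rms


# Hypothetical pKd pairs for molecules that occur twice
pairs = [(6.5, 6.5), (7.0, 8.0), (5.0, 4.0)]
coinciding, differing, rms = summarize_pairs(pairs)
```

Separating the near-identical pairs first keeps values that were probably copied from a single source from artificially lowering the estimated spread.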