
The correlation between predicted and experimentally measured values of biological activity is a natural measure of a model's quality. For instance, the QUANTUM docking software calculates binding free energies, which are directly comparable with experimental values of -pKd, where pKd is the negative decimal logarithm of the binding constant Kd. The root mean squared error between the measured and the calculated quantities is then a quantitative measure of the software's performance.
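As a minimal sketch of these two measures, the snippet below computes the Pearson correlation and the root mean squared error for a pair of value lists, assuming both are already expressed on the pKd scale. The function names and the numbers are hypothetical, chosen only for illustration; they are not QUANTUM output or measured data.

```python
import math

def pkd_from_kd(kd_molar):
    """Convert a binding constant Kd given in mol/L to pKd = -log10(Kd)."""
    return -math.log10(kd_molar)

def rmse(predicted, measured):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(measured) or len(predicted) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = sum((p - m) ** 2 for p, m in zip(predicted, measured))
    return math.sqrt(squared / len(predicted))

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values for illustration only (not real measurements).
measured = [pkd_from_kd(kd) for kd in (1e-6, 1e-7, 1e-8)]
predicted = [6.5, 6.5, 8.5]

print("RMSE:", rmse(predicted, measured))
print("Pearson r:", pearson_r(predicted, measured))
```

The two numbers answer different questions: a constant offset between prediction and experiment leaves the correlation untouched but increases the root mean squared error, so it is worth reporting both.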
Whatever correlation is presented to prove the validity of a model, another important issue is the quality of the experimental data itself. Reported values for binding constants (or activities) often vary because of different measurement strategies, experimental errors, or interpretation uncertainties. To visualize the situation, we investigated a few datasets for HERG binding taken from the QSAR World website.
The downloaded files were saved in the source folder and processed with the following simple Python script (thanks to Open Babel):
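import os
from openbabel.pybel import readfile  # Open Babel 2.x: from pybel import readfile

files = os.listdir('source/')
molecules = []  # unique molecules seen so far
for file in files:
    molfile = readfile("sdf", 'source/' + file)
    for mol in molfile:
        molfp = mol.calcfp()
        present = 0
        for savedmol in molecules:
            savedmolfp = savedmol.calcfp()
            # "|" on two pybel fingerprints returns their Tanimoto coefficient;
            # a value of 1 means the fingerprints are identical (a duplicate)
            if (molfp | savedmolfp) == 1:
                present = 1
                print(mol.data, savedmol.data)
        if not present:
            molecules.append(mol)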
The results were analyzed in a spreadsheet program and are shown in the graph above. Many molecules occur multiple times in the datasets. While in many cases the activities coincide to within 0.01 (which most probably indicates citation from a single source), the remaining values, though correlated with each other, differ by roughly one pKd unit.
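The same comparison can be sketched in Python instead of a spreadsheet. The snippet below groups activities by molecule, counts pairs that agree to within 0.01, and collects the differences for the remaining pairs. The function name, the molecule identifiers, and the activity values are all hypothetical; this is an illustration of the bookkeeping, not the analysis actually performed.

```python
from itertools import combinations

def compare_duplicates(records, tol=0.01):
    """Compare repeated activity values reported for the same molecule.

    records: iterable of (molecule_id, activity) pairs.
    Returns (number of pairs agreeing within tol,
             list of absolute differences for the other pairs).
    """
    by_molecule = {}
    for mol_id, activity in records:
        by_molecule.setdefault(mol_id, []).append(activity)

    agreeing = 0
    differences = []
    for values in by_molecule.values():
        # Every pair of reported values for one molecule is compared once.
        for a, b in combinations(values, 2):
            diff = abs(a - b)
            if diff <= tol:
                agreeing += 1
            else:
                differences.append(diff)
    return agreeing, differences

# Made-up records for illustration only (not taken from the datasets).
records = [
    ("mol_A", 5.30), ("mol_A", 5.30),  # likely cited from a single source
    ("mol_B", 6.10), ("mol_B", 7.00),  # independent measurements
    ("mol_C", 4.50),                   # reported only once
]

agreeing, differences = compare_duplicates(records)
print("pairs agreeing within 0.01:", agreeing)
if differences:
    print("mean difference of the rest:", sum(differences) / len(differences))
```

In practice the molecule identifier would have to come from the duplicate detection step, for example a canonical structure string, which is an assumption here rather than something the original script produces.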