nails .- cheminformatics tools

Introduction
Molecular Formats
Descriptor Calculation
Fingerprints
Pharmacophore
Maximum Overlapping Sets
Structure Modification
Synergies
Challenges and TODO
References

Introduction

Nails is a Chemical Descriptors Library (CDL) client for Unix environments. It supports conversion between molecular formats, substructure search, fingerprints, 2D similarity functions, topological pharmacophore algorithms, structure modification, molecule growing, and calculation of maximum common subgraphs.
Nails is used through a command-line interface.

Molecular Formats

Nails supports the two most popular molecular formats: MDL's mol format and Daylight's SMILES. In addition, nails works with its two internal formats: juice and nails. The difference between these last two formats will become clear soon.

Descriptor Calculation

Please refer to the special link

Fingerprints

A fingerprint (FP) is a useful tool for fast pruning of substructure searches. The idea is to calculate a fingerprint for each molecule in a database and then compare these fingerprints to the one generated for a substructure. Because different substructures can generate the same FP, a match at the FP level only tells us that the fragment may be contained in the complete structure, so a full substructure search is then performed to confirm it.
There are two special clients for the fingerprint implementation: fingernails, which generates and reads FPs and calculates Tanimoto similarities on them; and subnails, which performs substructure search on juice's FPs.
Please read more about these clients in the README file.
Suppose we have a small set of 6 fragments that we want to search for in a database of 650 compounds. The average molecular weight of the database is 196 g/mol. Table 2 shows the results of the substructure search for the fragments in the database. The test was performed on a 1 GHz Pentium III machine running Linux. Nails was compiled using gcc 3.2 with no optimization options. The worst-case complexity of the substructure search algorithm is O( n^3 ).

**Table 2:** Substructure search results

| Fragment SMILES | Full search time (s) | Fingerprint-pruned search time (s) | Improvement (%) |
|---|---|---|---|
| C1CCC2CCCCC2C1 | 5.32 | 1.81 | 66 |
| c1ccccc1 | 0.71 | 0.49 | 31 |
| OC=O | 0.36 | 0.06 | 86 |
| C1CCCCC1 | 2.09 | 1.23 | 41 |
| CCCCC1CCCNC1 | 1.77 | 0.01 | 99 |
| CC1CC(CN(C)C1)OCCC2CCCC2 | 0.74 | 0.0 | 100 |

As expected, the improvement is largest when the fingerprint of the fragment is not present in the molecule fingerprints, because most full searches are then skipped; the improvement is independent of molecular size.
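The pruning step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the subnails implementation: fingerprints are modelled as plain integers used as bitsets, and `may_contain`, `pruned_search` and `full_match` are hypothetical names introduced here. The full substructure matcher is passed in as a callback, since its details are outside the scope of this page.

```python
from typing import Callable, Iterable, List, Tuple


def may_contain(fragment_fp: int, molecule_fp: int) -> bool:
    """Screen: every bit set in the fragment FP must also be set in the molecule FP."""
    return (fragment_fp & molecule_fp) == fragment_fp


def pruned_search(
    fragment_fp: int,
    database: Iterable[Tuple[str, int]],
    full_match: Callable[[str], bool],
) -> List[str]:
    """Run the expensive full match only on molecules that pass the FP screen."""
    hits = []
    for mol_id, mol_fp in database:
        if not may_contain(fragment_fp, mol_fp):
            continue  # cheap rejection: the fragment cannot be in this molecule
        if full_match(mol_id):  # the FP screen can give false positives
            hits.append(mol_id)
    return hits


# Toy data (made up for illustration): three molecules with 4-bit fingerprints.
database = [("m1", 0b1110), ("m2", 0b0110), ("m3", 0b1011)]
fragment_fp = 0b1010

full_search_calls = []


def full_match(mol_id: str) -> bool:
    """Stand-in for a real substructure search; here only m1 truly matches."""
    full_search_calls.append(mol_id)
    return mol_id == "m1"


hits = pruned_search(fragment_fp, database, full_match)
print(hits, full_search_calls)
```

In this toy run m2 is rejected by the screen without a full search, while m3 passes the screen but fails the full match, which is why the confirmation step is still needed.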

Another common application of FPs is the calculation of 2D topological similarity. The fingernails implementation uses Tanimoto distances on the bitsets to calculate a similarity value between two molecules. The Tanimoto distance exhibits all the properties required of a distance measure, and its value lies within the range [0,1].
As an example, Tanimoto distances were calculated on fingernails' FPs for a database of 7205 random pairs and the same number of bioisosteric pairs.
Figure 1 shows the binned distribution of both the random and the bioisosteric pairs.

**Figure 1:** `fingernails` similarity.

Figure 1 shows a clear separation of the bioisosteres from the random pairs. However, there is still a large overlapping region. The abstraction achieves a statistically satisfactory separation with a simple calculation. Other methods are intended to provide a better separation.
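For reference, the Tanimoto coefficient on two bitsets is the number of bits set in both divided by the number of bits set in either, and the corresponding distance is one minus that value. The following is a minimal sketch assuming fingerprints are stored as non-negative Python integers; the function names are illustrative and are not taken from fingernails, and the convention for two empty fingerprints is an assumption.

```python
def tanimoto_similarity(fp_a: int, fp_b: int) -> float:
    """Bits set in both fingerprints divided by bits set in either one."""
    common = bin(fp_a & fp_b).count("1")
    union = bin(fp_a | fp_b).count("1")
    if union == 0:
        return 1.0  # assumed convention: two empty fingerprints count as identical
    return common / union


def tanimoto_distance(fp_a: int, fp_b: int) -> float:
    """Distance in [0, 1]: 0 for identical bitsets, 1 for disjoint ones."""
    return 1.0 - tanimoto_similarity(fp_a, fp_b)


# Two made-up 6-bit fingerprints: 3 bits in common, 5 bits set in total.
fp_a = 0b110110
fp_b = 0b100111
similarity = tanimoto_similarity(fp_a, fp_b)
distance = tanimoto_distance(fp_a, fp_b)
print(similarity, distance)
```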

Pharmacophore

Nails implements a very simple but comprehensible description of pharmacophores, taken from ideas expressed in [Schneider et al., 1999]. Please refer to the paper for a thorough description of the technique.
As a proof of concept, the algorithm was tested with the same dataset used to test fingernails. Figure 2 shows the binned distribution of both the random and the bioisosteric pairs.

**Figure 2:** Tanimoto distance on the CATS vector.

Figure 2 shows a distribution similar to that of fingernails.
But how orthogonal are the two methods? Are they assigning the same compounds to the same classes? Figure 3 shows the Tanimoto distance on the CATS vector plotted against the Tanimoto distance on the fingerprints:

**Figure 3:** Methods comparison.

Note the accumulation of points on the right edge of the plot. Suppose we combine both methods with a weighting function. It would be a good idea to create a function that shifts the weight towards the edges of the bounds: when one of the methods gives a value near one of the edges, more weight is given to that result. Have fun.
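One possible reading of that suggestion, as a sketch only: give each method a weight that grows with the distance of its value from the midpoint 0.5, then take the weighted mean. The weight function, the `floor` parameter and the names `edge_weight` and `combine` are all assumptions made for illustration; nails does not provide this function.

```python
def edge_weight(value: float, floor: float = 0.01) -> float:
    """Weight that grows as `value` approaches 0 or 1, the edges of [0, 1].

    `floor` keeps the weight positive when value == 0.5 (assumed parameter).
    """
    return abs(2.0 * value - 1.0) + floor


def combine(fp_value: float, cats_value: float) -> float:
    """Weighted mean of two values in [0, 1], favouring the one nearer an edge."""
    w_fp = edge_weight(fp_value)
    w_cats = edge_weight(cats_value)
    return (w_fp * fp_value + w_cats * cats_value) / (w_fp + w_cats)


# The fingerprint value is near an edge, the CATS value is undecided (made-up numbers).
combined = combine(0.95, 0.50)
plain_mean = (0.95 + 0.50) / 2
print(combined, plain_mean)
```

With these inputs the combined value stays close to the confident method instead of being pulled to the plain average. When both methods agree, the result equals their common value.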

Maximum Overlapping Sets

The maximum overlapping set (MOS) of two graphs (their maximum common edge subgraph) is calculated with a stand-alone client, moils, and is used to provide 2D similarity values. You can also print the resulting MOS(s) of two structures. Please read the README text file or call the client with the -h option for a synopsis on how to use it.
The algorithm was implemented with ideas given in [Raymond, 2002].
The basic idea in similarity calculations with MOS is that molecules that share the majority of their edges should be 2D similar.
The bad news is that the calculation of the maximum common edge subgraph is an NP-complete problem. Some heuristics are on their way, but so far, if you're going to calculate the MOS, be careful with the size of the molecules you're inputting: as the molecular size grows, the problem becomes intractable.
Figure 4 shows MOS distance and substructure of two molecules.

**Figure 4:** MOS distance and substructure.
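Once the MOS is known, a similarity value can be derived from the sizes of the two graphs and of their common subgraph. The sketch below uses a coefficient of the kind found in the maximum common edge subgraph literature: the squared size of the common subgraph divided by the product of the sizes of the two graphs, where size means atoms plus bonds. Whether moils uses exactly this coefficient is an assumption; the function name and the example counts are illustrative.

```python
def mos_similarity(
    atoms_1: int, bonds_1: int,
    atoms_2: int, bonds_2: int,
    atoms_common: int, bonds_common: int,
) -> float:
    """Similarity in [0, 1] from graph sizes, counting atoms plus bonds."""
    size_1 = atoms_1 + bonds_1
    size_2 = atoms_2 + bonds_2
    size_common = atoms_common + bonds_common
    return size_common ** 2 / (size_1 * size_2)


# Heavy-atom graphs: toluene (7 atoms, 7 bonds) against benzene (6 atoms, 6 bonds).
# Their common edge subgraph is the benzene ring itself (6 atoms, 6 bonds).
similarity = mos_similarity(7, 7, 6, 6, 6, 6)
distance = 1.0 - similarity
print(similarity, distance)
```

The expensive part is finding the common subgraph; this final step is a constant-time calculation.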

Structure Modification

Structure modification has its own client as well: nailgrow. Please refer to the README file, or run it with the -h option to get a synopsis on how to use it.
Functionalities are provided to remove fragments, add fragments, replace fragments, etc. You can have a lot of fun with it...

Synergies

Nails offers nice possibilities for working with data mining and/or classification algorithms. Along these lines, we can use nails to describe molecules (whether by descriptors, pharmacophores, or 2D fingerprints) and use data-mining engines to cluster, classify or predict biological or physico-chemical activities.
Some examples of classification algorithms follow:

Support Vector Machines

We use the Support Vector Machines Template Library to generate support vector machines on data obtained from the calculation of CATS pharmacophore vectors of molecules.
The idea is to calculate hyperplanes that will hopefully separate active molecules from non-actives. The information obtained from a collection of molecules is more general than that obtained from just one molecule: when we calculate distances from a template to a set of molecules, only the information of one molecule is used.
The next example was performed on a database of 41 compounds (20 actives and 21 non-actives). We used this database to generate the SVMs, and later tested their performance by classifying 43 compounds we had left out for testing. Sadly, we cannot show the structures of the molecules, much less the target.

**Table 3:** Support Vector Machines classification

| Real cluster | Separation from the calculated hyperplane |
|---|---|
| 1 | 2.16815 |
| 1 | 1.10605 |
| 1 | 1.70249 |
| 1 | 0.852959 |
| 1 | 2.29406 |
| 1 | 1.34863 |
| 1 | 0.966051 |
| 1 | 0.566114 |
| 1 | 0.666675 |
| 1 | 0.844608 |
| 1 | 0.949025 |
| 1 | 0.757625 |
| 1 | 2.41355 |
| 1 | 0.844608 |
| 1 | -0.628612 |
| 1 | 1.2351 |
| 1 | 1.10605 |
| 1 | 2.31583 |
| 1 | 0.647329 |
| 1 | 0.289167 |
| 1 | -0.548536 |
| 1 | 2.19804 |
| -1 | 1.81983 |
| -1 | 0.838507 |
| -1 | -0.957572 |
| -1 | 1.17329 |
| -1 | -0.115009 |
| -1 | -0.52718 |
| -1 | -2.149 |
| -1 | -1.57006 |
| -1 | 0.183763 |
| -1 | 2.42272 |
| -1 | -0.29059 |
| -1 | 3.35241 |
| -1 | -0.916957 |
| -1 | -0.117745 |
| -1 | 0.121313 |
| -1 | -0.366695 |
| -1 | -0.616971 |
| -1 | -0.813379 |
| -1 | -1.15801 |
| -1 | -0.667753 |
| -1 | -1.40077 |

The SVMs correctly classified 34 of the 43 compounds (79%).
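The 34-out-of-43 figure can be checked directly from Table 3 by predicting class 1 when the separation from the hyperplane is positive and class -1 otherwise. The short script below does only that bookkeeping; it assumes the sign of the separation is the decision rule and does not reproduce the SVM training itself.

```python
# Real clusters from Table 3: 22 actives (1) followed by 21 non-actives (-1).
real = [1] * 22 + [-1] * 21

# Separations from the calculated hyperplane, in the same order as Table 3.
separation = [
    2.16815, 1.10605, 1.70249, 0.852959, 2.29406, 1.34863, 0.966051,
    0.566114, 0.666675, 0.844608, 0.949025, 0.757625, 2.41355, 0.844608,
    -0.628612, 1.2351, 1.10605, 2.31583, 0.647329, 0.289167, -0.548536,
    2.19804, 1.81983, 0.838507, -0.957572, 1.17329, -0.115009, -0.52718,
    -2.149, -1.57006, 0.183763, 2.42272, -0.29059, 3.35241, -0.916957,
    -0.117745, 0.121313, -0.366695, -0.616971, -0.813379, -1.15801,
    -0.667753, -1.40077,
]

# Assumed decision rule: the sign of the separation gives the predicted cluster.
predicted = [1 if value > 0 else -1 for value in separation]

correct = sum(1 for r, p in zip(real, predicted) if r == p)
actives_correct = sum(1 for r, p in zip(real, predicted) if r == p == 1)
inactives_correct = sum(1 for r, p in zip(real, predicted) if r == p == -1)
accuracy = correct / len(real)

print(f"{correct} of {len(real)} correct ({accuracy:.0%})")
print(f"actives: {actives_correct} of 22, non-actives: {inactives_correct} of 21")
```

Broken down by class, most of the errors are non-actives that fall on the active side of the hyperplane.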

Neural Networks

Only your imagination limits the use of neural networks in conjunction with nails. Good examples are offered in the MolBrain link. Nevertheless, a small example follows that shows the synergy.
Some topological descriptors of a database were calculated, and a neural network (NN) was trained using these as the input variables. The output to fit was the molecular weight of the compounds. The dataset was composed of 256 compounds, 36 of which were randomly selected as a test set; the rest were used to train the NN.
Figure 5 shows the predicted values plotted against the real values of the molecular weight.

**Figure 5:** NN prediction of molecular weight

The correlation obtained was 0.98.

Challenges and TODO

Heuristics to calculate the MOS, so that it hopefully becomes possible to calculate the MOS of large molecules
Similarity calculations for 3D structures as described in [Raymond, 2003]
Grow pharmacophore on semi-random positions given by the CATS matrix

References

Schneider, G.; Neidhart, W.; Giller, T.; Schmid, G. "'Scaffold-Hopping' by Topological Pharmacophore Search: A Contribution to Virtual Screening". Communications of Angew. Chem. Int. Ed., 1999, 38, No. 19.
Raymond, J.; Gardiner, E.; Willett, P. "Heuristics for similarity searching of chemical graphs using a maximum common edge subgraph algorithm". J. Chem. Inf. Comput. Sci., 2002, 42, 305-316.
Raymond, J.; Willett, P. "Similarity Searching in Databases of Flexible 3D Structures Using Smoothed Bounded Distance Matrices". J. Chem. Inf. Comput. Sci., 2003, 43, 908-916.

Copyright (c) Vladimir Josef Sykora and Morphochem AG 2003
