Chemical Descriptors Library: Fingerprints

Client Algorithms


Fingerprints
Molecular fingerprinting is a common algorithm which basically produce a
bitset that contains information on the molecular connectivity and
its associated atomic elements.
For example, ethane (C2H6) is represented as a molecular graph (h depleted)
which has two connected vertices. To generate a fingerprint, we will
consider pathdistances for each vertex, begining in 0, to a pathlenght that
is userdefined, and we will annotate the atomic symbol (or other atomic property)
of each vertex along the path. In the case of ethane: distance cero (0) will give
C and C, distance one will give CC. Each one of these strings are converted to
a numerical value with the use of a hash function. The next step will be to
switch ON a bit of the fingerprint (bitset) that corresponds to this generated
value. Usually, we want to generate a fingerprint of a given size, hence the
hash function must be bounded, or we have to apply other trick.
There are plenity of flavours for hash functions out there: bounded hash functions,
truncated, projected, and so on. CDL uses an unbounded hash function. Now you
might be asking: gee, then what happens if the generated hash value is greater
than the size of the fingherprint I want to generate? Then the hash value will
seed a pseudo random number generator that produces numbers between a well
defined range: normally from cero to the size of the bitset.
Now you see that the fingerprint algorithm gives a bitset with infomation on the
connectivity of the molecule. This is 2d information that can express molecular
similarity, 2d similarity.
The most common distance function used for this case (a distace function has
three properties: given 'a' and 'b' points, i. d(a,b)==0 iff a==b; ii.
d(a,b)==d(b,a) for every a and b; iii. for every a, b, d(a,b)>=0 and assigns
one and only one real value) is the Tanimoto distance: The Tanimoto distance is
calculated as follows: given two sets of values 'a' and 'b': TD = SUM( ((ai + bi)  abs(ai  bi))/(ai + bi)) / num i.
Which in binary format translates to: TD = bits(a and b) / (bits(a) + bits(b)  bits(a and b)).
CDL fingerprints are used for both: substructure search and 2d similarity.
Following documentation is given to generate a fingerprint with CDL. Similarity
functions documentation are provided somewhere
else.
Practical Issues concerning the use of Fingerprints for substructure search screening :
The Fingerprints algorithm generates linear (noncyclic) paths between any two vertices
of the molecular graph. The algorithm only considers the shortest path between
two vertices. This has an important practical issue, better described with an example:
Consider you are searching for CCCCCCCC (a chain of 8 Carbon atoms) in Cc1c(C)cccc1.
You will get a hit using full substructure search, whereas you won't using FP pruning.
This is because the FP algorithm will consider only the shortest path between the
two substituent Carbons of the cyclic compound (shortest path will be CCCC
instead of the longest path CCCCCCCC).
This is an intrinsic feature of the FP algorithm (not a bug). Consider using
a full substructure search in situations when you want to consider all paths
in rings to search for a linear substructure.
Prototype
Is an overloaded function:
template <class Molecule, class Block, class Allocator>
void
generate_fingerprint(const Molecule& m, boost::dynamic_bitset<Block,Allocator>& bitset,
const size_t max_path_lenght = 7);
template <class Molecule>
inline
boost::dynamic_bitset<>
generate_fingerprint(const Molecule& m, const size_t max_path_lenght = 7,
const size_t fingerprint_size = 1024);
Arguments
Parameter  Description  Models 
m 
The molecule 
CDL's molecule 
bitset 
The bitset 
boost::dynamic_bitset<>, or object that provides the same signature 
max_path_lenght 
The max lenght of the path to consider 
size_t 
fingerprint_size 
The lenght of the fingerprint to generate 
size_t 
Description
The first version of the function inserts values into the passed bitset,
the second version returns the bitset.
Definition
#include <morpho/cdl/fingerprints/fingerprints.hpp>
Preconditions
For the first version, the size of the bitset is a valid range.
Complexity
O(E.V + V^2). Application of a BFS for every atom of the molecule.
Example
Consider that you want to calculate the tanimoto distance on the fingerprints
of two molecules:
//  std inclusions :
#include <iostream>
//  morpho inclusions :
#include <morpho/cdl/molecule/molecule.hpp>
#include <morpho/cdl/parsers/parsers.cpp>
#include <morpho/cdl/fingerprints/fingerprints.hpp>
#include <morpho/cdl/similarity/similarity.hpp>
//  boost inclusions :
#include <boost/dynamic_bitset.hpp>
int main() {
using namespace std;
using namespace morpho::cdl;
typedef molecule<> M;
nail_juice<M> j1, j2;
get_juice_from_stream(std::cin,j1,0,smiles_formatT());
get_juice_from_stream(std::cin,j2,1,smiles_formatT());
M m1(j1,false,true), m2(j2,false,true);
boost::dynamic_bitset<> fp1 = generate_fingerprint(m1);
boost::dynamic_bitset<> fp2 = generate_fingerprint(m2);
std::cout << "The taminoto distance between : ";
std::cout << get_can_smile(m1) << " and " << get_can_smile(m2);
std::cout << " is : " << bin_tanimoto(fp1,fp2) << '\n';
return 0;
}
References
Copyright (c) Vladimir Josef Sykora & Morphochem AG 2003