Chemical Descriptors Library: Fingerprints

Client Algorithms

Fingerprints

Molecular fingerprinting is a common algorithm which basically produce a bitset that contains information on the molecular connectivity and its associated atomic elements.
For example, ethane (C2H6) is represented as a molecular graph (h depleted) which has two connected vertices. To generate a fingerprint, we will consider path-distances for each vertex, begining in 0, to a path-lenght that is user-defined, and we will annotate the atomic symbol (or other atomic property) of each vertex along the path. In the case of ethane: distance cero (0) will give C and C, distance one will give CC. Each one of these strings are converted to a numerical value with the use of a hash function. The next step will be to switch ON a bit of the fingerprint (bitset) that corresponds to this generated value. Usually, we want to generate a fingerprint of a given size, hence the hash function must be bounded, or we have to apply other trick.
There are plenity of flavours for hash functions out there: bounded hash functions, truncated, projected, and so on. CDL uses an unbounded hash function. Now you might be asking: gee, then what happens if the generated hash value is greater than the size of the fingherprint I want to generate? Then the hash value will seed a pseudo random number generator that produces numbers between a well defined range: normally from cero to the size of the bitset.
Now you see that the fingerprint algorithm gives a bitset with infomation on the connectivity of the molecule. This is 2d information that can express molecular similarity, 2d similarity.
The most common distance function used for this case (a distace function has three properties: given 'a' and 'b' points, i. d(a,b)==0 iff a==b; ii. d(a,b)==d(b,a) for every a and b; iii. for every a, b, d(a,b)>=0 and assigns one and only one real value) is the Tanimoto distance: The Tanimoto distance is calculated as follows: given two sets of values 'a' and 'b': TD = SUM( ((ai + bi) - abs(ai - bi))/(ai + bi)) / num i. Which in binary format translates to: TD = bits(a and b) / (bits(a) + bits(b) - bits(a and b)).
CDL fingerprints are used for both: substructure search and 2d similarity.
Following documentation is given to generate a fingerprint with CDL. Similarity functions documentation are provided somewhere else.

Practical Issues concerning the use of Fingerprints for substructure search screening :

The Fingerprints algorithm generates linear (non-cyclic) paths between any two vertices of the molecular graph. The algorithm only considers the shortest path between two vertices. This has an important practical issue, better described with an example:
Consider you are searching for CCCCCCCC (a chain of 8 Carbon atoms) in Cc1c(C)cccc1. You will get a hit using full substructure search, whereas you won't using FP pruning. This is because the FP algorithm will consider only the shortest path between the two substituent Carbons of the cyclic compound (shortest path will be CCCC instead of the longest path CCCCCCCC).
This is an intrinsic feature of the FP algorithm (not a bug). Consider using a full substructure search in situations when you want to consider all paths in rings to search for a linear substructure.

Prototype

Is an overloaded function:

  template <class Molecule, class Block, class Allocator>
  void
  generate_fingerprint(const Molecule& m, boost::dynamic_bitset<Block,Allocator>& bitset,
  	const size_t max_path_lenght = 7);

  template <class Molecule>
  inline
  boost::dynamic_bitset<>
  generate_fingerprint(const Molecule& m, const size_t max_path_lenght = 7,
  	const size_t fingerprint_size = 1024);

Arguments

Parameter	Description	Models
`m`	The molecule	CDL's molecule
`bitset`	The bitset	boost::dynamic_bitset<>, or object that provides the same signature
`max_path_lenght`	The max lenght of the path to consider	size_t
`fingerprint_size`	The lenght of the fingerprint to generate	size_t

Description

The first version of the function inserts values into the passed bitset, the second version returns the bitset.

Definition

#include <morpho/cdl/fingerprints/fingerprints.hpp>

Preconditions

For the first version, the size of the bitset is a valid range.

Complexity

O(E.V + V^2). Application of a BFS for every atom of the molecule.

Example

Consider that you want to calculate the tanimoto distance on the fingerprints of two molecules:


// --- std inclusions :
#include <iostream>
// --- morpho inclusions :
#include <morpho/cdl/molecule/molecule.hpp>
#include <morpho/cdl/parsers/parsers.cpp>
#include <morpho/cdl/fingerprints/fingerprints.hpp>
#include <morpho/cdl/similarity/similarity.hpp>
// --- boost inclusions :
#include <boost/dynamic_bitset.hpp>

int main() {
  using namespace std;
  using namespace morpho::cdl;
  
  typedef molecule<>    M;
  
  nail_juice<M>         j1, j2;
  get_juice_from_stream(std::cin,j1,0,smiles_formatT());
  get_juice_from_stream(std::cin,j2,1,smiles_formatT());
  M   m1(j1,false,true), m2(j2,false,true);
  boost::dynamic_bitset<>   fp1 = generate_fingerprint(m1);
  boost::dynamic_bitset<>   fp2 = generate_fingerprint(m2);
  std::cout <<  "The taminoto distance between : ";
  std::cout << get_can_smile(m1) << " and " << get_can_smile(m2);
  std::cout << " is : " << bin_tanimoto(fp1,fp2) << '\n';
  return 0;  
}

References

Copyright (c) Vladimir Josef Sykora & Morphochem AG 2003