Chemical Descriptors Library: Quick Tour of the CDL
A Quick Tour of the Chemical Descriptors Library

CDL main component is the molecular class. It is designed in a way that decouples graph structure from the atomic and bond properties; that is, the graph representation and the molecular properties are on different layers of abstraction: they aren't coupled. This provides large flexibility since the user is not required to have a particular atomic and bond properties class with specific interfaces.
The basic component of the molecular class is a graph class, represented as an adjacency list, and is contained as a member. This graph in turn has properties attached to its vertices and edges. Vertex properties represent "atoms" and edge properties "bonds". There's also some properties which respect specifically to the molecule as a whole; these are called "molecule properties".
CDL provides interfaces to navigate over the graph representation: vertex, edge, and other iterators. Since many of the algorithms you'd like to write require bonds and atoms properties, these interfaces are of no great interest. What's of outmost importance are the interfaces to navigate over the properties directly.
Iterators that navigate over atom and bond properties are nothing else that adaptors from the graph's vertex and edge iterators.

You can convert an iterator from the graph representation to an iterator that navigates their underlying properties, and the other way around. What you can not do is convert an atom property class (structure you get when you derreference and atom iterator) to a vertex; the same is true for the bond and edge. This is a limitation intrinsic of the type of abstraction, but which you won't ever care about because it's possible to fully navigate the graph properties providing an atom or bond.

Property Accessors

As the molecular properties (atom, bond and molecule properties) are parameterized, there is the need to access the members of this properties in a generic way. The property selectors were written to address this requirenment. They provide a generic interface between CDL's algorithms and the molecular properties.

There is one property accessor per property of the molecule: three in total: AtomProperty<Selector,Atom>, BondProperty<Selector,Bond>, and MolProperty<Selector,Molecule>.
The Selector parameter is a tag type to select the desired property. Partial specializations of the property accessors (via the Selector parameter) provides the mechanism to access the property value and type. All property accessors inherit from std::unary_function<>, in this way, argument and result type are exposed. The only requirenment that the CDL places on the accessors, is the name of the accessor itself, and the name of the selectors. Also they must provide the get() and set() methods. There are some properties of the atom that never change once the atom is specified (as the atomic number, or symbol); this properties do not have the requirement to provide a set() method. Hence, the AtomProperties parameter for the molecule<> class provides two classes of properties: static (readeable) and read-write. I'll come back on this later.

For example, atomic number is a basic property of the atom. When the user provides the AtomProperties parameter for the molecule<> class, CDL assumes that AtomProperties parameter has a property of arbitrary type which represents the atomic number. There are no requirements that CDL places on this property, other than that it must be readable. The details of how the user provides this property lies on the property accessor: for this reason, property accessors are user-defined. When CDL requires this property, it uses the accessor with the specific Selector to access it. For the atomic number case, the Selector is atomic_numberS. CDL assumes that there is an AtomProperty<atomic_numberS,Atom> specialization, and that it can be accessed via the get() method. Also, the type of this property can be accessed via AtomProperty<atomic_numberS,Atom>::result_type.
Assume that you want to write an algorithm to calculate the molweight of your drug. In a generic programming paradigm (the conceptual fundation of the CDL) you just need to put your hands on two iterators : they point to some arbitrary object that somehow knows the value of some atomic mass. This arbitrary object is the AtomProperties parameter that you provided for the molecule<> class. Upon construction of the molecule, you need to specify which "atom" each vertex of the molecular graph is; in this way CDL constructs a conceptual molecular graph: it is a graph at the end, but with properties attached to each vertex and edge. Also, CDL knows which "physical" atom each vertex is. In this way:

template <class AtomIterator>
typename AtomProperty<atomic_massS, typename AtomIterator::value_type>::result_type
mol_weight(AtomIterator first, AtomIterator last) {
  typedef typename AtomProperty<atomic_massS, typename AtomIterator::value_type>::result_type atomic_mass_type;
  atomic_mass_type  rvo=atomic_mass_type();
  while(first!=last) {
    rvo+= AtomProperty<atomic_massS, typename AtomIterator::value_type>::get(*first++);
  }
  return rvo;
}
To be even more generic, we could write an algorithm to work on all properties, providing a property Iterator, and a UnaryOperation, in this way:
  template <class InputIterator, class UnaryOperation>
  typename UnaryOperation::result_type
  accumulate_prop(InputIterator first, InputIterator last, UnaryOperation op) {
    typedef typename UnaryOperation::result_type prop_type;
    prop_type rvo=prop_type();
    while (first!=last) {
      rvo+= op(*first++);
    }
    return rvo;
  }

With this algorithm, we can use whichever property accessors, being for atoms or bonds. The only hack we have to perform is to convert the member function get() of the property accessor, to a functor. We can do this using std::mem_fun(). I'll come back to this function later.
Note that if your properties have the same name for their members as the default properties CDL provide, then you don't have to write your own property accessors. This is because the property accessors are also templated on the type of the property : AtomProperty<Selector, Property>.

Constructing a molecule

There are many formats for molecular representations that are commonly used: smiles, mol, mol2, etc. CDL provides a structure that encapsulates all the information these formats carry: the 'nail_juice'. Once you have filled the nail_juice with information about the molecule, you can construct a CDL's molecule class.
The information nail_juice provide is:

With this information you can construct a CDL's molecule.

Streaming molecules in and out :

There are free functions available to convert different formats to the nail_juice. To construct a molecule then, you have first to get the nail_juice from a molecular format. The main function to format conversion is get_juice_from_stream(). You have to call the function with 4 arguments: the stream containing the molecule, a constructed (and empty) nail_juice in which the information is going to be stored, a counter to indicate the molecule number you're processing, and the tag for the type of format you're converting from.
The format tags provided are : sdf_formatT, juice_formatT, nail_formatT and smiles_formatT. You have to construct one of these formats upon calling.

The following example illustrates how to use the tags to produce different molecular formats. You can find it in examples/streaming_mols.cpp


// --- std inclusions :
#include <fstream>

// --- morpho inclusions :
#include <morpho/cdl/molecule/molecule.hpp>
#include <morpho/cdl/parsers/parsers.cpp>

using namespace morpho::cdl;

int main() {

  // Suppose that we want to stream-in a molecule on sdf
  // format from the standard input, and output it as
  // smile to a file. Also, we want to get a molecule
  // on smiles format from a file named "bioiso.3d.smi"
  // and output it on sdf format to the standard output.

// lets process all the stream :
while( !std::cin.eof() ) {

  // first check that the stream has a valid sdf :
  std::basic_istream<char, std::char_traits<char> >::pos_type  pos;
  pos=std::cin.tellg();
  if ( !morpho::look_ahead(std::cin,4)) break;
  std::cin.seekg(pos);


  // So, lets get the first molecule from standard input:
  // We need to get the information contained in the
  // character string format (this case sdf) using the
  // juice. Declare a juice variable:
  nail_juice<>  j_from_sdf;
  // Now use get_juice_from_stream() function to get the
  // molecule from a stream :
  // The first argument is the stream, the second the
  // juice, the third a counting variable to track error
  // structures, and the last argument we specify the
  // format we want:
  get_juice_from_stream(std::cin, j_from_sdf, 0, sdf_formatT());

  // Now we construct the molecule:
  molecule<>    mol_from_sdf(j_from_sdf, false, true);

  // Now lets output it to a file :

  // First we declare the stream :
  std::ofstream  out_stream("structures.smi");

  // Now stream it out as a smile :
  out_stream << write_nail( mol_from_sdf , smiles_formatT() ) << std::endl;

} // while()


  // Now get a smile from a file named "bioiso.3d.smi" and
  // output it as sdf to standard output :

  // Declare the input stream :
  std::ifstream  in_stream("bioiso.3d.smi");

  // check if its good :
  if(!in_stream.good()) {
    std::cerr << "You must give bioiso.3d.smi" << std::endl;
    exit(-1);
  }

// let's go through all stream :
while( !in_stream.eof() ) {
  // get the juice :
  nail_juice<>  j_from_smi;
  get_juice_from_stream(in_stream, j_from_smi, 0, smiles_formatT());

  // the molecule :
  molecule<>    mol_from_smi(j_from_smi, false, true);

  // Now streamed to the standard output as sdf file :

  std::cout << write_nail( mol_from_smi , sdf_formatT() ) << std::endl;

} // while()

  return 0;
}


You have to take care of implicit hydrogens. The constructor of the molecule takes two flags: if to add implicit hydrogens, and if to initialize the properties. Both flags are defaulted: add implicit hydrogens if defaulted to false, and initialize is defaulted to true. If you add implicit hydrogens, then you have to remove them before canonizing a smile. You can remove the hydrogens calling the free function remove_cdl_implicit_hydrogens(molecule). With the other formats, handling hydrogens is totally left to the user. All hydrogens that come from any format are considered explicit!.

Vetices, Edges, Atoms, Bonds, and Iterators

Revisiting:
vertices maps to atomic properties and edges maps to bond properties.
You can get an atom from a vertex, but not a vertex from an atom : the map is unidirectional (except when you have the atom iterator, then you can map back to the vertex iterator). In the same way, you can get a bond from an edge, but not an edge from a bond (except when you have the bond iterator).
The vertex iterator points to a structure called "vertex descriptor", which is of std::size_t, and represents the index of the vertex in the graph. In the same way, the edge iterators points to a structure called "edge descriptor" which is of a type of no interest to us, however is important to know that it contains a pair of vertex descriptors that makes the edge.
Vertex and Edge structures are always non-const, but the atoms and edges can be const depending on the qualification of the molecule that contains them.
For example, if we accept a reference of a molecule as an argument to a function, the atoms and edges will be therefore const. In this case, we must use the constant types versions of our iterators: iterators that points to const structures: const_atom_iterator, const_bond_iterator, adjacency_const_atom_iterator and incident_const_bond_iterator

You can find the following example in examples/accessing_props.cpp


// Suppose that we are streaming in a smiles:
// i.e. from the command line: "accessing_props < smile.smi"

// --- std inclusions :
#include <iostream>

// --- morpho inclusions :
#include <morpho/cdl/molecule/molecule.hpp>
#include <morpho/cdl/parsers/parsers.cpp>

using namespace morpho::cdl;
using namespace std;

int main() {
  typedef molecule<>    molecule_t;
// let go through all stream :
while( !std::cin.eof() ) {

  // get the molecule :
  nail_juice<>  juice;
  get_juice_from_stream(std::cin, juice, 0, smiles_formatT());
  
  // Now we construct the molecule:
  molecule_t    mol(juice, false, true);
  
  // First access the vertex Set :
  // declare the iterators :
  molecule_t::vertex_iterator  vi, vi_end;
  // get the iteretors of the molecule :
  tie(vi,vi_end)=mol.vertices();
  while(vi,vi_end) {
    // print the vertex descriptor :
    cout << "==========================\n";
    cout << "        Vertex : " << *vi << endl;
    cout << "==========================\n";
    
    // access the adjacent vertices :
    molecule_t::adjacency_iterator  adjacent_i, adjacent_i_end;
    tie(adjacent_i,adjacent_i_end)=mol.adjacent_vertices(*vi);
    cout << "-- adjacent vertices --\n";
    while(adjacent_i!=adjacent_i_end) {
      cout << "   " << *adjacent_i << " ";
      ++adjacent_i;
    }
    cout << endl;
    
    // access the incident edges :
    molecule_t::out_edge_iterator  inc_e_i, inc_e_i_end;
    tie(inc_e_i,inc_e_i_end)= mol.incident_edges(*vi);
    cout << "-- incident edges --\n";
    while(inc_e_i!=inc_e_i_end) {
      cout << "   source : " << mol.source(*inc_e_i) << ", target: " << mol.target(*inc_e_i) << endl;
      ++inc_e_i;
    }
    cout << endl;
    
    // Now get the properties that this vertex maps to :
    molecule_t::atom_type& atom = mol.get_atom(*vi);
    // Now print one of the members :
    cout << "  Atomic property: atomic number : " << 
      AtomProperty<atomic_numberS,molecule_t::atom_type>::get(atom) << endl;
    cout << endl;
    ++vi;
  }
  
  // Now let's access the edges :
  molecule_t::edge_iterator  ei, ei_end;
  tie(ei,ei_end)=mol.edges();
  while(ei,ei_end) {
    cout << "==========================\n";
    cout << "        Edge : (" << mol.source(*ei) << "," << mol.target(*ei) << ")" << endl;
    cout << "==========================\n";
    
    // Get the bond that maps to this edge :
    molecule_t::bond_type&  bond = mol.get_bond(*ei);
    // Print a property of the bond :
    cout << "  is bond type : " << BondProperty<type_of_bondS,molecule_t::bond_type>::get(bond) << endl;
    ++ei;
  }
  cout << endl;
  
  // Now let's access the atom set :
  molecule_t::atom_iterator ai, ai_end;
  tie(ai,ai_end)=mol.atoms();
  while(ai!=ai_end) {
    cout << "==========================\n";
    cout << "    Atom of symbol : " << AtomProperty<symbolS,molecule_t::atom_type>::get(*ai) << endl;
    cout << "==========================\n";
    // get the incident bonds :
    molecule_t::incident_bond_iterator ibi, ibi_end;
    tie(ibi,ibi_end)=mol.incident_bonds(ai);
    while(ibi!=ibi_end) {
      cout << "   has incident bond of type : " << 
        BondProperty<type_of_bondS,molecule_t::bond_type>::get(*ibi) << endl;
      ++ibi;
    }
    ++ai;
  }
  
} // while()
  
  return 0;
}


Algorithms

You can find the following example in examples/mol_weight.cpp


//----- cdl
#include <morpho/cdl/molecule/molecule.hpp>
#include <morpho/cdl/parsers/parsers.cpp>
//----- std
#include <iostream>
#include <functional>  // for std::mem_fun()


using namespace morpho::cdl;

int main(int argc, char* argv[]) {
  // first get the juice
  nail_juice<>  j;
  get_juice_from_stream(std::cin, j, 0, sdf_formatT());
  // now construct the molecule
  typedef   molecule<>  M;
  M  m(j, false, true);

  // now call the previously written algorithm

  // first let's use atoms_begin() and atoms_end() functions
  std::cout << " The molecular weight is : ";
  std::cout << mol_weight(m.atoms_begin(), m.atoms_end()) << std::endl;

  // now lets use tie() and atoms()
  M::atom_iterator  ai, ai_end;
  tie(ai, ai_end)= m.atoms();
  std::cout << " The molecular weight using other iterators : ";
  std::cout << mol_weight(ai, ai_end) << std::endl;

  // Now lets use the more general algorithm we wrote:
  std::cout << " The molecular weight using other algorithm : ";
  std::cout << accumulate_prop(ai,ai_end,
    std::mem_fun(&AtomProperty<atomic_massS,M::atom_type>::get))
  return 0;
}

See that to get atom iterators we use the function atoms_begin(), atoms_end() and atoms(). This last one, returns a std::pair<> containing the begin and end of the atom iterators respectively.

Fingerprints

The following example shows how to calculate the fingerprints of molecules streamed in as a smiles. You can find this example here : examples/fingerprints.cpp

// --- std inclusions :
#include <ostream>
#include <string>

// --- morpho inclusions :
#include <morpho/cdl/molecule/molecule.hpp>
#include <morpho/cdl/parsers/parsers.cpp>
#include <morpho/fingerprints/fingerprints.hpp>

// --- boost inclusions :
#include <boost/dynamic_bitset.hpp>

int main() {
  using namespace std;
  using namespace morpho::cdl;
  
  while( !cin.eof() ) {
    nail_juice<>  j;
    get_juice_from_stream(cin, j, 0, smiles_formatT());
    // check if the mol in non empty :
    if (!j.good()) continue;
    // get the molecule :
    molecule<>	m(j, false, true);
    
    // parameter to the fingerprint calculation :
    unsigned int max_path_lenght = 7;
    unsigned int fp_size = 1024;
    // calculate the fingerprint :
    boost::dynamic_bitset<>   fingerprint = generate_fingerprint(m,max_path_lenght,fp_size);
    
    // now stream the FP out :
    std::string  str_bitset;
    boost::to_string(fingernail,str_bitset);
    cout << str_bitset << endl;
    
  }

  return 0;
}

Structure modification

Care should be taken when modifying the structure of a molecule. The reason for this is that many of the functions for structure modification invalidate indices and/or iterators of the molecule. Following is an example that shows how to deal with the reposition of the indices of a molecule when deleting fragments. Consider that you've been told from your medicinal chemist that somehow an alcoholic fragment makes the molecules carring it magically pass the BBB, and causes neural destruction. Now you want to remove all those alcoholic fragments from the library that you've been working on. Usually you'd use your favorite editor to draw the fragment, then convert it to your favorite format. The next step is to get the vertex indices of the molecules of the library that are part of the fragment. You can find 'em with the use of a substructure search algorithm. CDL provides the Ullmann algorithm for subgraph isomorphism matching, but this is an advance subject, so documentation is provided to deal with that particuar issue. For the moment just consider you know the indices, and are strored in a container. Because all the indices of the molecule are repositioned after the removal of an atom, you have to reposition the indices of the vertices to remove as well (the ones you have in the container,) so you write something like:
  template <class InputIterator>
  void reposition_indices(InputIterator first, InputIterator last,
  typename InputIterator::value_type  pos) {
    while(first!=last) {
      if(*first>pos) *first -= 1;
      ++first;
    }
  }
To substract 1 to every index that is greater than the index of the vertex that you're removing. Then:
  template <class Molecule, class ContOfVertices>
  void remove_vertices(Molecule& m, ContOfVertices& vert2remove) {
    typename ContOfVertices::iterator vi = vert2remove.begin(), vi_end = vert2remove.end();
    while(vi!=vi_end) {
      m.remove_atom(*vi);
      detail::reposition_indices(vert2remove.begin(),vert2remove.end(),*vi);
      ++vi;
    }
  }
To get rid of the atoms that form part of the fragment.

Copyright © Vladimir Josef Sykora 2003-2006
SourceForge.net Logo