Skip to main content


Showing posts from 2010

Final example for today : bowtieane

The simplest-yet-most-complex example I could find is this one from another previous post (I like to recycle :)

I should mention again that I don't know if this is an actual chemical. I call it 'bowtieane', but maybe it has a proper name. In any case, it makes a nice small test case. The ring equivalence classes shown (A, B, C) are defined by the signatures as usual.
What's really strange though is SSSRFinder's partition of the rings. It divides the two A-rings into separate classes. This seems like the opposite behaviour to the fullerene-26 case where there were fewer equivalence classes than for the signature method. I really, really should read the papers referenced in the code.

Fullerene-26 revisited : more vertices, less complexity

There is an example graph that makes for a better comparison between my code and the SSSRFinder. It is the 'fullerene-26' molecule that I used in a previous post. Unfortunately the diagram is wrong in that post; the one below has the correct vertex equivalence classes (colors):

The difference is just a pair of yellows that should have been pink (and v.v.) in the topmost C ring. Anyway, the graph is complex enough that there are different vertex symmetry classes without multiple bonds. Dodecahedrane, on the other hand, is so symmetric that without double bonds all vertices are the same.
So how does the SSSRfinder do? Well it considers all the rings to be different apart from two of the B rings. It is not reporting the final B ring, which could be considered as the bounding face of the map. In short, there are two equivalence classes (5-ring and 6-ring) which makes some kind of sense but isn't terribly informative.

Dodecahedrane has 12 faces, right?

So there was this guy called Euler, and he had a formula that goes something like F = E - V + 2. Well, actually it is χ = V - E + F, where χ is the Euler characteristic, and this is equal to 2 for polyhedra. Anyway, the point is that dodecahedrane has 12 faces (cycles).
For the SSSRFinder, however, it has only 11; which is annoying. Moreover the ring equivalence class method only distinguishes based on the underlying simple graph - in other words it ignores bond order. In some applications this might be exactly what is needed, but I'm glad that my method gives a more detailed result:

So, apart from being a ridiculously detailed image, the above shows the face (ring, cycle) equivalence classes for dodecahedrane with a particular double bond network. Clearly any face could be 'glued' to another along one of the edges, following the vertex classes. All possible combinations of faces are shown in the 'face quotient graph' at the bottom right.

SSSRFinder and Ring Equivalence Classes

With hubris and arrogance, I implemented my own ring equivalence class finder. With humility and grace, the SSSRFinder method getRingEquivalenceClasses does better for at least one case:

This is the test case used in SimpleCycleBasisTest, and probably in the paper on the subject. I guess I should read that instead of other papers on cycle bases. I do get the same answer for cuneane as when I do it by hand. That is to say, 3 equivalence classes with [1, 2, 2] elements.
For the 'Bauer' graph above (as I am currently calling it - Ulrich Bauer is the author of the SSSRFinder) the signature method puts almost every vertex in its own equivalence class. Now my method makes equivalence classes from cycles with the vertices in the same vertex equivalence class. Naturally enough, this makes a lot of ring equivalence classes.
The graph on the left has vertices colored by degree, to show how dissimilar the rings are. Is the SSSRFinder really getting the right answer?

Herschel graph with different planar embeddings

Another example from a paper that I have mentioned before. This time the Herschel graph which is another of these crazy graphs thought up to prove or disprove some conjecture. The spring-layout paper gives two different layouts:
These really are the same graph :) One way to see this is to look at the vertex colors that I have applied. Each face is given a label (A or B) based on the key shown below the two graphs. The 'A' face is {Black, White, Grey, White} for example. The left-hand embedding (or layout) has such an A-face for its border, while the right-hand one has a B-face for a border.Presumably, it is possible to convert a vertex partition (into equivalence classes) into a partition of faces. It seems easy for examples - like this - that have a planar embedding. More difficult for graphs that don't have one. On the other hand, not all planar layouts look very informative:This is twistane (again) but not looking as symmetric as it can. However, the faces show the regul…

The many faces of fused cycles

Although many ring systems in molecules are quite easy to layout as 2D diagrams, there are some that are inherently 3D. Bridged rings are usually in this class; consider my favourite example molecule, cuneane:
Each of (A, B, C) is a particular layout of the same molecule, but with a different boundary (hexagon, pentagon, er...kind of fused squares). It would be nice to have a layout method that picked the same choice each time - regardless of the permutation of atoms and bonds. Even better if it could allow enumeration of the alternative possibilities.
As another example, consider a series based on twistane (which is a molecule) to two other graphs that may well not be actual molecules:

Twistane itself is in the middle, surrounded by five- and seven- ring equivalents. The upper layouts emphasise one ring in the graph while the lower ones emphasise the dual rings in each.

Why modular decomposition is not very useful for chemical graphs

It is difficult to publish negative results in a journal, but a blog post seems like a good place to record the experience. Especially situations like this, where it probably should have been obvious not to try in the first place...
So; what is modular decomposition? Briefly, a module is a little like a connected component in a graph - indeed, a connected component is made up from one or more modules, but modules can overlap. Decomposition of a graph into its modules is, therefore, like finding the connected components of the graph. An example is shown here:

Two modules in the graph are circled, there may be others. The definition is a set of vertices that have the same neighbours outside the set. So, there was no need for me to make them complete graphs, but it looked nicer. Anyway, already looking at this example it is clear that these are not very 'chemical' graphs. They look more like networks (for example, see : [1]).
Indeed, I tried out some code made by the authors of a pa…

Muhahahaha! Things can always be more complex...

A good measure of how right a model or an implementation is can be how quickly it extends to more complex situations:
This is essentially the same except that 'SSE' has been added (no big deal) but also the leaf list has been generalised to AbstractLeafCollection. :)
I'm sure there are better ways to do this, but it fits neatly with some existing ideas I had on searching through lists vs searching through sets.

Rose Forests

Carl Masak blogged about tree data structures, which caught my interest because of a pet-project of mine (tailor; a structure description and measurement tool) where I found myself using trees a lot. An awful lot. Perhaps ... too much.
Anyway, a related tweet by AudreyT mentioned an article called "Origami Programming" by Jeremy Gibbons. Which is in haskell (perhaps not surprisingly), a language I don't speak very well. However, while reading - and not understanding it - I did get one thing which was the idea of having a tree datatype where the node (called a 'rose') references a forest (a list of roses). I think that's right.
In any case, it solves a object-modelling problem for me that I had. The difficulty was that protein structures are hierarchical, yes, but have a strange mixed hierarchy of types. Perhaps this is obvious to haskell programmers and compiler-code writers, but this makes it very difficult to use the 'simple' tree datatype, where a No…

Molecule Layouts

I've been doing experimental work on layouts for the CDK. Not for atoms, exactly, for which the StructureDiagramGenerator is doing a pretty good job - could be better, of course, but what couldn't?
No, layout of MoleculeSets, and Reactions. Well actually IMoleculeSets and IReactions. With an ILayout class - my apologies to anyone who doesn't like generics, but it can be quite useful. Anyway, here is an example of what it is looking like at the moment:
Hmmm. Well, it is a grid I suppose. The problems with the ring bonds are known to me, please do not mention them >:|
The code for this is quite short: IMoleculeSet moleculeSet = makeMolSet();
ILayout<IMoleculeSet> gridLayout = new GridMoleculeSetLayout(3, 3);
makeImage(moleculeSet, gridLayout, "three_by_three", 500, 500);
where the methods 'makeMolSet' and 'makeImage' do what you might expect (I hope :). Similarly:IMoleculeSet molSet = makeMolSet();
AxisOrientation o = AxisOrientation.PLUS_X_PLUS_Y;

Consistent Zoom with Models of Different Scales

So there is a way to get the zoom to work:
(to zoom on the picture, click for bigger :)
The approach taken here is to create graphical objects (LineElement, RectangleElement, etc) that are scaled at the origin, but not zoomed or translated to the center of the draw area. These last two parts of the transform are then added to the graphics transform.
One downside of altering the transform in the graphics is that if we want to draw extra stuff on the panel (like the detail string "Zoom = x, Scale = y" in the picture above) the original transform has to be captured before drawing, then restored after.
For example, see this commit:

Scaling and Text

An obvious question about the CDK rendering code is : "Why not scale text with AffineTransform?" So, of course this is possible, and works quite nicely - but there is a cost.
One of the goals of the rendering code is to start from models of any scale, and render them as consistently sized diagrams on screen. By 'scale' here I simply mean the average distance between points. So the CDK layout code might use a distance of (say) 1 between two carbon atoms, but a file with a structure made in some other chemical editor might have an average atom-atom distance of 100. These are unitless values, by the way.
Now, what we could have done was transform the coordinates in the model to a consistent scale, then rendered these transformed coordinates. What we chose to do, however, was to calculate a single transform for the model and draw with this. If you use this transform to scale the graphics object before drawing you get this for a model scale of 10:

and this for a model scale…

Generic Rendering

Egon++ is continuing the process of merging the CDK-JCP rendering core into CDK master. Some proposed generification of the classes was made on the mailing list, and here is a sketch of some of the classes and interfaces:
I realise that this looks horribly complex, but the question is : "Is it just complex enough, or too complex?". One of the things missing from the diagram is layout - there may be a need for classes like LinearMoleculeSetLayout or GridMoleculeSetLayout. Oh, and yes (you guessed it!) an IMoleculeSetLayout and ChemObjectLayout classes :)
The goal here is not to make convoluted code, but to avoid repeating stuff. A reaction renderer should know how to layout and paint molecule sets, and then pass on the task to the molecule set renderer, and so on. Some key things to avoid will be a) not to relayout on each paint and b) generate the diagrams in the right places, at the correct scales. I think that this will be possible.

How (not) to remove items from a (CDK) list

So there I was, trying to remove all mappings from a Reaction like this:
for (int i = 0; i < reaction.getMappingCount(); i++) { reaction.removeMapping(i); }and found that only half the mappings were being removed ... can you see why? :)
In fact, this is not some obscure CDK bug, but a logic error on my part. Equivalent code is this: List list = getListSomehow();
for (int i = 0; i < list.size(); i++) { list.remove(i) }using for example a java.util.ArrayList. The problem is that the index (i) is being tested against a changing number (the size). Once half the items have been removed, i is at the half-way point, so on the next pass it stops.
One way to 'solve' this is to go backwards: for (int i = list.size(); i > 0; i--) { reaction.removeMapping(i); }but this is slightly less clear than just :
List list = getListSomehow();
int size = list.size();
for (int i = 0; i < size; i++) { list.remove(i) }which is clearer. Of course, even better is the List method removeAll(). It would …

Combinations and Filters

So there is now the beginning of a possible re-write of the DBST that uses basically the same approach, but is a bit more flexible. The code is here, but it's still a bit rough.
The original idea seems to have been to encode arrangements of double bonds for different ring sizes as a kind of 'library'. For each ring, a particular arrangement is picked until all possible combinations are generated. As a concrete example, see this example for a napthalene skeleton:
Here, the arrangements (1, 2) are applied to each ring (A, B) and then these are combined. Of the four combinations (A1B1, A2B1, A1B2, A2B2) only three are valid. The A1B2 combination has two atoms highlighted in red that have two double bonds and one single bond.
So one way to filter the combinations is to try and type the atoms, and reject any structure that has untypeable atoms. Another possible filter rejects structures that don't have atoms that are SP2 hybridized. Both of these are from the original code, b…

1,4-Benzoquinone and the DeduceBondSystemsTool

Once upon a time, there was a DeduceBondSystemsTool, and...
Er, anyway. Further to a patch made on the tool (patch ID : 3040138), there is a failing test for 1,4-benzoquinone:
The tool generates A, and the test wants B. Now, the problem is not that the tool is not trying B as a possibility, but that it generates A first and the final step doesn't remove it or rank it as better than A.
Understanding this requires an understanding of the algorithm. This is (roughly):For each ring, generate a list of possible positions for all numbers of double bonds.Generate a set of molecules by combining these positions together.Remove 'bad' solutions and pick a solution with the least number of 'bad' N/S atoms.where the definition of 'bad' is based on chemical rules like atom types.
Now, neither A nor B are bad solutions, and they don't contain N or S atoms, so they both have a rank of zero, and the first one generated will be returned. So, there is really no particular r…

Line Graphs and Double Bonding Systems

After looking at a CDK tool for fixing bond orders for aromatic systems (DeduceBondSystemTool in the smiles package) I wondered if there was a more general approach. That is, the problem is to take a molecular graph with no double bonds and generate all possible double bonded systems.
One possibility might be to first convert the graph (G) into a form known as a line graph (lg(G)) where every vertex in lg(G) is an edge in G. If these vertices are labelled to represent the bond order, then an aromatic system has a particular line graph. For example, here is benzene:
The dashed lines show the construction of the line graph, and the labels '-' and '=' mean single and double. Now obviously, the two resulting graphs are essentially the same, so it would be nice to remove this redundancy. An example of two different bonding systems comes from phenanthrene:
Which is great, but how to generate all non-redundant colorings of the line graphs? Since a line graph is just a graph, it…

CDK Export

A short post just to get a picture in place. Someone else's code, my diagram..

Click for bigger, as usual.

Formula Debugger

(Please click for bigger - there is quite a bit of detail :)

The left panel shows the search tree of structures; each circle is a structure, the red ones at the tips of some branches are fully connected and saturated. The blue branch has been selected, and is shown in the middle, with the structure at the tip at the top.
This solution is also highlighted (in red - colors are not very consistent) and shown in detail on the right. The right/middle panel is a traditional molecule layout, but with numbers in place of atom symbols. The lower right panel is the spanning tree of the atom highlighted in red in the upper right panel.
Clearly, there are still a lot of unnecessary structures generated since most branches are dead ends. However, this version does at least avoid duplicates - sadly it also misses a few structures :( Clearly refining partitions based on element symbols doesn't totally work...

CDK Signature implementation now in review

What is it?
I have written about them quite a bit, but here is a quick summary : signatures are a little like SMILES, but also somewhat like HOSE codes. They are a description of the connectivity of a molecule, or an atom in the molecule. A more detailed description can be found in these papers by Faulon et al: [1], [3] or in this blog post by me.
The java implementation of this algorithm is a collaboration between Lars Carlsson (who wrote a C++ version) and me (who ported this version to java). However, I was also influenced by my previous attempt at a port from the c implementation by Faulon's group. There is an online service for using their program called "sscan" here. It also deals with stereochemistry.
What is it used for?
So, what can be done with all this new code? Here are some possibilities: Smiles-like canonical strings that represent molecules. Note that signatures are considerably longer than smiles, but are guaranteed to work for cuneane, and indeed a broad rang…

Non-chemical example : Grinberg's graph

While looking for an algorithm to lay out fullerenes as 2D graphs, I came across this paper (PDF). It describes an annealing/spring layout method for Schlegel diagrams. I don't think I can spare the time to implement it at the moment, but one of the graphs in the paper is this:

known as the Grinberg graph.

C60 double bonding networks

C60 (or buckminsterfullerene) has no hydrogens, so it must have quite a few double bonds. I am beginning to understand that bond orders are in some sense a simplification, and I suppose that the bonding is in some way delocalised across the sphere. However, here is a picture of two different bonding patterns:

(click for bigger, as usual). The 'ChEBI' bonding pattern on the left is from the molfile in a ChEBI entry while the 'radial' bonding one on the right is bonded according to schemes from this site which has an interesting graph-theory perspective on fullerenes.
The radial bonding version has a simpler, layered structure like this:

Ok, so that's a slightly comical picture of the slices. What is also nice is the quotient graph for the ChEBI structure:

I didn't color this, but what is great is that it looks like a subgraph of a fullerene! Apart from the loops along the top and bottom, of course.

Fullerene symmetries

Continuing the theme of colored graphs, some of the more interesting examples are fused ring structures, especially those with some symmetries, but not completely symmetrical. Fullerenes fit this description, for example this 26 atom example:

the distribution of colors might look a little odd, but the dark blue atom surrounded by three cyan atoms is actually repeated at the top - which is really the other side of the sphere. These kind of 3D molecules don't lay out very well with the CDK's layout code, so I used JChempaint instead to make a more symmetric 28-fullerene:

This is actually a screengrab of a crude viewer I put together (commit) that takes a molfile and calculates the signatures. Selecting a signature from the right hand list highlights it on the graph. Anyway, it's easier than making images like this by hand:

Except that the little graph at the bottom (which is a quotient graph) can't be drawn by the renderer as it has loop-edges.

Chemicals as colored graphs

The interface between maths and chemistry can be tricky when it comes to terminology - sets (maths) have elements, chemistry has a different kind of element; graphs have colors which are usually just numbers, diagrams of chemicals have colors which usually relate to the element type of the atom, and so on.

So, for maximum confusion, here are two pictures of graphs (that could represent chemical connectivity) colored by equivalence class (determined by signature). The signature trees are also drawn with graphical colors, but these represent the integer colors in the signature, which are not the same as the colors used to indicate equivalence class. Firstly, a structure that the smiles algorithm is meant to have trouble with (but may not exist):

It looks quite strained, so I expect that it may not be possible to synthesise. Another multi-ring system is this one:

I don't even know what this one would be called, even if it did exist. Annoyingly, this structure triggers a bug if the two d…

Orienting a Pyrene diagram

Another example from Symmmetry in Chemistry, of drawing pyrene:

the standard (IUPAC?) orientation on the left doesn't show the symmetries of the molecule as well as the rotated version on the right. The letters indicate equivalent atom positions - oh, and aromatic indicators are missing :)
It seems like there should be a way to discover symmetry axis from the graph - without coordinates. In a similar way that some ring perception algorithms work by reducing the graph:

However, it is not obvious that this would be better than laying out the structure, then using the 2D coordinates to determine symmetries. Also not clear to me is how to choose which vertices to merge, and which to duplicate. On the left hand side of the image above, the fragments have duplicate (a), (d), and (e) vertices, but (e) has also been merged before being duplicated.

1,2-dichlorocyclopropane and a spiran

As I am reading a book called "Symmetry in Chemistry" (H. H. Jaffé and M. Orchin) I thought I would try out a couple of examples that they use. One is 1,2-dichlorocylopropane :

which is, apparently, dissymmetric because it has a symmetry element (a C2 axis) but is optically active. Incidentally, wedges can look horrible in small structures - this is why:

The box around the hydrogen is shaded in grey, to show the effect of overlap. A possible fix might be to shorten the wedge, but sadly this would require working out the bounds of the text when calculating the wedge, which has to be done at render time. Oh well.
Another interesting example is this 'spiran', which I can't find on ChEBI or ChemSpider:
Image again courtesy of JChempaint. I guess the problem marker (the red line) on the N suggests that it is not a real compound? In any case, some simple code to determine potential chiral centres (using signatures) finds 2 in the cyclopropane structure, and 4 in the spiran…

Stuck : Detailed Description

Ok, so this is the detailed version of the previous post.
To recap; the structure generation code is still missing a vital piece - the canonical checking. I have been implementing Jean-Loup Faulon's algorithm for generation, but there is no precise algorithm given for canonical checking. Here the relevant paragraph from the enumeration paper:
"Checking for canonicity is a common procedure of orderly enumeration algorithms, the procedure guarantees that the graphs generated are nonisomorphic. ...To verify that a graph is canonical, one labels the vertices of the graph in all possible ways. The graph is canonical if the initial labeling leads to a list of edges that is lexicographically smaller than the lists obtained with all other labelings. In the present paper, we have implemented two algorithms to verify canonicity, Tarjan tree canonization algorithm if the tested graph is acyclic and McKay’s Nauty technique otherwise" Okay, so I don't understand this for a couple of…

Stuck : The Summary

A couple of people have asked how the structure generation stuff is going, and the short answer is that I am stuck. This post will give a short summary of the problem, and the next will give a much more detailed description.
So the problem is this : given an elemental formula (like C6H12) or a list of fragments plus a formula (like {2 * CH, 2 * CH2, 2 * CH3 : C6H12}) return all possible connected structures.
There are simple ways to do this, such as connecting every atom to every other atom, and removing duplicates. The downside is that this takes forever, because this procedure will make many, many isomorphic copies of each solution. At the final filtering step, an all-v-all comparison would have to be done on these many copies.
A better solution is to check each structure each time a bond is made, to see if it is canonical. Although I know how to do this in theory, it turns out to be more difficult in practice. For simple graphs, I have a solution that seems to work. Chemicals are not …