Skip to main content

Posts

Showing posts from 2009

Compatibility Table

With the canonical code in place (thanks to open source :) the structure generation goal is much nearer. The first thing to improve is the bond compatibility . This image shows cuneane ( again !) and the bond compatibility table. The table is tricky to calculate, but relatively easy to understand; there is only one bond between atoms of type A - so there is a one in the cell (A, A). Conversely, no atom of type C is connected to an atom of type A ( see this for more detail), so there is a O in both (C, A) and (A, C). Note that the table is not symmetric, as can be seen with (A, B) = 1 and (B, A) = 2. This makes sense, in that an atom of type A is connected to two of type B, and yet an atom of type B is only connected to one of type A.

Crude port of Faulon's C-implementation

I finally did the sensible thing and just ported over the c code. I don't particularly like doing this, as the resulting code is probably quite fragile, and unreadable - java spoken with a c 'accent' usually is. However, it does reproduce the same signatures as the c code, so that's good. It only took 1 day to port, but 2 days to debug... Check it out here : http://github.com/gilleain/generation/tree/master

Nicer tree pictures

Here are some pictures of proper trees, or rather proper signatures, to celebrate getting the translator program to compile (it involved replacing some 'local' keywords with 'define's - who knew?). Also, thought I would post the code for tree layout I'm using: public int layout(Node node) { node.y = node.depth * ySep; if (node.isLeaf()) { leafCount += 1; node.x = leafCount * xSep; return node.x; } else { int min = 0; int max = 0; for (Node child : node.children) { int childCenter = layout(child); if (min == 0) { min = childCenter; } max = childCenter; } if (min == max) { node.x = min; } else { node.x = min + (max - min) / 2; } return node.x; } } basically, it lays out the leaves fi

Tree Canonization Simplified

While debugging the methods to make canonical signatures, I learned something about tree isomorphism from various sources, including Prof. Valiente 's excellent looking book on trees . One way of checking isomorphism is canonisation, since two trees are only isomorphic if they have the same canonical form. For simple labelled trees, it looks like there is an almost trivial way to get a canonical string representation. Say we have two trees: The are rooted, labelled trees. So the conversion to a canonised string proceeds as follows; for each node, lexicographically sort the string form of the labels of its children, and return the concatenated string. In python this looks like: which...is unreadable. hmmm. Wish there was a better way to get marked-up code into blog posts. Perhaps there is one, and I don't know of it. Anyway, the point is that it is very short. edit : the code is... def printSorted(node): if len(node.children) > 0: childStrings = [printSorted(child) for ch

Square Grids, Cylinders, Spheres, and Toruses

This is straying from the point; but if any graph can be described by canonized trees made from its subgraphs, then what are the properties of very large (regular) graphs? A grid, for example? Starting with a square, and fusing squares together results in this situation: two and three fused squares are similar, at the height-two tree level. With four rings, a new type c appears ( b becomes b' with three rings). Beyond this point, any number of rings fused in a row like this has the same structure. Moving to a second dimension of growth, a square grid (G nn ) has the following structure: On the left are 'snapshots' of the advancing wave of the tree. Looks a lot like a breadth-first search, I suppose. On the right are the trees for each snapshot, with increasing heights. Although this is not shown, 8 of the 20 leaves of the third tree are duplicates. This should be clear from the third snapshot, which has only 12 (filled) circles on the 'wavefront'. These trees can e

Warning : Abstraction!

This is a throwaway mathematical point, that I am not qualified to make, but it looks like three of the previous examples (diamantane, twistane, and cuneane) have a very abstract connection when colored by signature: what I mean by this diagram is that diamantane has atoms colored by (a) connected to both other (a) atoms, and to (b) atoms. Its (c) atoms are only connected to (b)s; the arrows could well be double-headed, by the way. The most complex situation is cuneane, where each 'type' of atom is connected to another in its type and to two in another type. Adamantane would just look like : (a)-(b). Interesting, but it doesn't get the signature canonization methods debugged any faster...

Adamantane, Diamantane, Twistane

After cubane, the thought occurred to look at other regular hydrocarbons. If only there was some sort of classification of chemicals that I could use look up similar structures. Oh wate, there is . Anyway, adamantane is not as regular as cubane, but it is highly symmetrical, looking like three cyclohexanes fused together. The vertices fall into two different types when colored by signature: The carbons with three carbon neighbours (degree-3, in the simple graph) have signature (a) and the degree-2 carbons have signature (b). Atoms of one type are only connected to atoms of another - the graph is bipartite . Adamantane connects together to form diamondoids (or, rather, this class have adamantane as a repeating subunit). One such is diamantane , which is no longer bipartite when colored by signature: It has three classes of vertex in the simple graph (a and b), as the set with degree-3 has been split in two. The tree for signature (c) is not shown. The graph is still bipartite accordin

Cuneane

While taking a look at the InChI discussion mail archives, I came across a discussion on graphs that are difficult to canonize (so called 'isospectral' graphs - I've heard the term before, but I haven't worked with eigenvalues of adjacency matrices, so did not pay them much attention). Anyway, one such was this structure, cuneane : which shows 3D and 2D representation of the molecule (reproduced from the wiki page), and an arbitrary numbering in the center. The three signatures A, B, and C are shown labelled by this numbering. What's interesting about this particular example is the number of times that the same atom is represented in the signatures. For the signature called 'B', which is rooted at atom 2, the atom numbered 5 appears four times in the lowest layer of the tree. This naturally follows from the large number of rings of different sizes that make up cuneane - two 5-membered rings, two 4-membered rings, and two three-membered rings.

Two pass rendering

So, there was a question on the cdk-devel mailing list about bounding boxes, reactions, and text. An unfortunate consequence of the new design is that the renderer will not calculate bounding boxes that can fully contain the text. Concretely, this is what it would look like (not made in JCP!) The blue box is the bounds that would be created, which is minimal with respect to the atom centers. The black box is the bounds that should be created, if we respected the text size. The problem is, the size of text is not known until the point it is drawn. Or, more precisely, until we have some sort of GraphicsContext to ask about the width in a particular font. So, a two-pass system was suggested. When this was mentioned before, I was dismissive - perhaps even rude. Sorry about that Egon, Sam. I still think it is better avoided; in the case of transparency, I don't know why alpha values can't be used for fill colours. I understand there was some SWT problems..? Anyway, here is a sketch

More chemical signature example

A neat and tidy example of generating a structure from atoms and signatures occurred to me : adenine. It is known that this molecule forms readily from HCN under conditions similar to those thought to exist on the early Earth. Anyway, the point is that the structure is almost exactly 6 C-N units (ignoring hydrogens and multiple bonds). A C-N is very similar to a height-1 signature, and the height-1 signatures for adenine are shown in the figure below: Each signature ( a - e ) is mapped to an atom in the original structure, and then a table is shown of the maximum number of compatible bonds possible between each pair of signatures. Note that the table is not symmetric, as the compatibility operation is not commutative. Also, notice that the diagonal is not all 0 - bonds can form between atoms that both have b as a target signature. Although these target signatures are smaller than the diameter of the adenine molecular graph, they still seem to define it completely. However, they are n

Signature bond compatibility

So, given my previous posts on what Faulon's signatures are, here is an explanation of how they are used in the structure enumeration algorithm that I am almost finished implementing . The core test in this algorithm is for compatible bonds. Two atoms are only joined if : a) they have compatible target signatures and b) there are less than the target number of bonds already. A target signature here is just a signature that is set on the atom for it to match, like a pattern. The first of these tests is illustrated here: Another (overly) complex diagram! But the formula here is a bit difficult to interpret otherwise. In the top left corner is a graph G (slightly resembling hexane without the hydrogens) which is, by convention, composed of vertices ( V ) and edges ( E ). The equation to the right of the graph defines part of the condition for a compatible bond. The tau terms are just target signatures, as shown on the upper right. The tricky term is h-1 σ τ(y) (z) which means 

The Voynich blog?

I apologise to any readers. I do tend to make complex diagrams. Perhaps some distant historian will find them in an archive and a new voynich-style mystery will result.. This does mean something, but you would have to read the previous post to discover what, exactly.

Faulon's Signatures : A Possible Interpretation

Several recent papers by Faulon concern an idea he calls 'signatures'. This post is just a record of what I understood them to be. Firstly, a signature is a subgraph of a molecular graph. There is a distinction between atomic signatures - which is a tree rooted at a particular atom - and a molecular signature, which is the set of atomic signatures for each atom in a molecule. A tree is a graph with no cycles, so an atomic signature is not just a subgraph. Like a path, a signature has a length - or rather a height . Here is a picture of signatures of heights 1-4 for a fused ring structure: The graph G on the left has one of its atoms labelled (a), and each of the trees in the center is a signature rooted at that atom. On the right, is the simple string form of the tree, as a nested list. I should point out that the signatures in these images may not be canonical, as I worked them out by hand (as I have not yet fully implemented the canonization algorithm). Signatures of the sam

Automorphism groups and fragment graphs

Structure generation involves not just graph theory, but group theory. Or, I should say, it does in some of the papers I have read. For example, in this paper by J.L.Faulon, there is the sentence: "The two main steps are to compute the orbits of the automorphism group of G and to saturate all the atoms of a chosen orbit which may well be incomprehensible to many readers, except if the reader is a mathematician. I am no mathematician, but thanks to some books on groups, I now understand both what an automorphism group is and what an orbit is. On the other hand, I also believe that this definition of how the algorithm works is overly complex. A more simple term might just be "fragment sets" - as it is fairly clear, if not mathematically exact. So, for the fragment graph [CH 3 , CH 3 , CH 2 , CH 2 , CH, CH] the fragment set is [CH 3 , CH 2 , CH]. Anyway, here is a short analysis of the automorphism group of the fragment graph [CH 2 , CH 2 ]. This first image shows the t

Simplest example : 1-methylcyclobutane

This is the simplest possible hydrocarbon example (there being none with four carbons) of multiple isomorphic solutions: The only difference in numbering is the swap of 2 and 4. Oh well.

Doesn't quite work

Sadly, using the degree sequence as a kind of 'mask' on canonically numbered matrices isn't enough. I should have guessed, it was too simple to be true :( The smiles for all matrices in the (2, 2, 2, 2, 2, 3, 3) partition are: C1CC2CC(C1)C2 C1CC2CC(C1)C2 C1C2CC12.C1CC1 C1CC2CCC1C2 C1CC2CCC1C2 C1CC1CC2CC2 C1CCC2CC2(C1) C1CCC2CC2(C1) and (ignoring the third one, which is disconnected) there are only 3 or 4 different simple graphs in the list. Perhaps I would have known all this if I knew what automorphisms were, or orbits of elements in sets, or any of that...

Bridged Cyclohexanes - same partition, different PVR sequences

I wanted to find an example of two non-isomorphic molecules with the same degree sequence, but different PVR sequences. The simplest I could think of was this pair: although there might be a similar pair of bridged 5-membered rings, I suppose. The meaning of this is quite simple - for a particular degree sequence (partition), there are multiple (different) molecules with a valid PVR sequence. I thought that this must be true, but there are none in the G4 set.

Rows and Columns

Aha! So, a comment by Egon (on my last post ) showed the benefits of showing people what you do. He suggested summing the columns - not converting from binary to integers, as with the rows - to remove the last traces of redundancy. So this seems to work for graphs with 4 vertices: Sorry for the extremely detailed diagram, but it is necessary to show my point. These matrix/graph pairs are all the PVR numbered adjacency matrices for n=4. There are isomorphic structures here, but note the column sums along the bottom of each matrix. These column sums form another sequence - which can be used to select only one of the isomorphs. Arbitrarily, we choose sequences that are partially ordered - i.e. no number in the sequence is less than the previous number in the sequence. This seems to work...

PVR numbering scheme not solution to all woes : film at 11

On a whim, I decided to try generating all adjacency matrices with the property that they are PVR numbered . The short summary of that link is that a matrix can be expressed as a sequence of positive integers by considering each row of the matrix as a binary number. The point of doing this (I thought) was that you can number a molecule in such a way that the adjacency matrix is PVR-numbered, and that this is canonical. So my cunning plan was to generate all sequences of n numbers that are partially ordered, choosing them from [1, 2 n ] to give all non-redundant (simple) graphs with n vertices. Unfortunately, it seems like this can't work: This image shows all adjacency matrices for n = 3 which are PVR-numbered. They are made by backtracking through all sequences of integers with a partial order, pruning the solutions using the symmetry of the matrix as a constraint. Anyway, the point is that the first two graphs are clearly isomorphic! More simply, they both represent propane. Mayb

Generation : Overview

To sum up the previous post flood; generation of constitutional isomers from the elemental formula can be done by generating all partitions of the total 'free' valence of the heavy atoms. The overall scheme is shown here: (click for bigger, as usual). So, for each formula, multiple partitions can be made, and each of these makes multiple sub-partitions, and each of these correspond to one or more molecules. Now, I won't pretend that any of this is particularly novel. I am no doubt re-expressing the problem of generating all possible molecules in a slightly different way. Having tried (and failed) to implement published methods, this was the best I could come up with. I suspect that there are many improvements that could be made to the algorithm, and the implementation of it. Getting something that works, even in a limited way, seems like progress, however :)

Final step : nested partition thingies to actual molecules

So, the last step: One thing to note about this is that the algorithm again has to backtrack to get all the molecules for any sub-partition list (another name for the things like [[3, 1], [3], [1, 1], [1]]). Anyway, on the right hand side is the final (only) molecule made for this nested partition. It has the [4, 3, 2, 1] partition structure, naturally, and the correct constitutional formula. Now what would be nice, would be to combine the second and third steps, so that only those nested partitions that produce valid molecules were tried. However, as the saying goes : "First, make it work, then make it work fast".

Partitions to er..'nested partitions'

So I don't have a good name for the objects that I create half-way through this process : the code uses 'solution', which is confusing. Anyway, here is the process: The left hand side is clear enough, I think, and follows on from the image in the previous post. Conceptually, it is similar to 'gathering' the attachment points into half-bonds in all possible ways. So, the 4 attachment points on the bare carbon fragment can become a triple (half) bond, and a single half bond. This is shown, as [3, 1]. Of course, there are other possibilities, and each combination at each atom fragment has to be paired with each other possibility at each other fragment! If this sounds like a backtracking problem, then you might understand why I did exactly this in the code. What would be nice, would be to prune the solutions - for example, [[3, 1], [2, 1], [1, 1], [1]] is generated, but is clearly impossible. Neither the triple half-bond, nor the double half-bond have partners. Pruning

Formula to Partitions

For the benefit of my (2) readers, here is the process of going from the chemical formula to the partitions, and what this actually means. So, the list of numbers at the bottom (the partition) is the simplest possible representation of the attachment points for each atom fragment. Oh, and the python code for generating partitions for is here - it is basically a straight copy of the code from the book " Combinatorial Algorithms : Generation, Enumeration, and Search ", which I highly recommend - a good balance of maths and computer science.

Uniqified

Hmm. Not my favourite solution, but I think that the isomorphism checks can be done in batches, resulting in actual isomorph spaces, like this (for C5H10): I should probably check that these are right...

Step one : catch them all

Well, this is the C4H6 space again. Or <10,4> as I have taken to calling it, for no obvious reason. There are duplicates, yes. But there are nine (unique) structures here, which is good.

Exploring the wild beasts of the layout jungle

There was a bug submitted to the CDK sourceforge tracker (bug number 2783741) with a list of molecules that are laid out badly. I had a look at some of them with the help of bioclipse. For example, this calix[4]arene: or this: which is a clearer case of something going wrong. More difficult is structures that are fully 3D, like: Can you guess what it is? :) Try the 3D version (also made with bioclipse, using the CDK 3D layout): It's a paracyclophane! The phenyl rings are lost in the 2D layout because there is a bigger 'ring'. Perhaps a chemist would look at the 3D structure and think that those chains are linkers, not parts of a ring, but the algorithm doesn't know this. I think that it is difficult to have general rules for this. Of course, any fully 3D structure will be difficult to lay out in 2D (if it is not embeddable in the plane then it is impossible) so things like this: are truly awful.

Bioclipse : Safe When Used As Directed

Finally used bioclipse for a real purpose, and to good effect, too: what this shows (the images do get larger if you click on them! :) is the following basic workflow: 1) Exploring manager functions in the js console (bottom). 2) Writing a script in the js editor (top left). 3) Running and getting feedback in the rhino console (far right). 4) Viewing the results in the sdf viewer (top right). What I was doing was searching through an sdf file (C10H16_filtered.sdf) for all structures with a cyclohexane ring as a substructure, then writing those out to a file. Probably could have been done 5 other ways, but, well, it was more fun this way. Oh, and it is a gist here .

ChEBI in Bioclipse

Says it all, really. The scrolling can hang if you move too fast, which might be due to garbage collection? It is a very large file....

Portable whiteboard : deployed

Proposed CDK changes related to PDBReader and BioPolymer

This is expanding on one of the points that Rajarshi made in his blog (which he followed up here ) on the PDB file handling capabilities of the CDK. There are two related topics : reading of PDB format files (the ancient, fixed column-width ATOM files) and the model that these are read into. PDBReader The old PDB format is being replaced with mmCIF and/or PDBML formats. Only there are lots of programs that write out this format, so it makes sense to still support it for a while at least. However, it is a quite nasty format, in some ways. Not so much the fixed column width, but the fact that crystallographers abuse the file format in all sorts of ways. Even simple things like expecting that atom numbers will always increase, may not be true. So it is not easy making a good reader for PDB files. The current CDK one won't read a file with just ATOM records, for example. Think that's reasonable? Well, tough luck for people that made programs that produce simple files like this. A

Generated Wallpaper, anyone?

Heh. This is exhaustive generation of all C4Hn (where n=any number of hydrogens), badly smashed together onto one canvas.

Structure zoo

So, I'm getting better at generating structures : but, not quite there yet. Oh, and the [1.1.1] propellane is drawn oddly (with one of its carbons at 0,0) due to a mac-specific bug in the layout. Irritating, but difficult to fix. edit: Oh, and for interests sake, here is an auto-generated tree of graphs Even more complex looking is this! : which is a set of structures produced by descending to depth 6 for 5-carbon graphs.

Numbering atoms, numbering vertices

Further to the similarities between numbering atoms in a structure, and generating unique graphs here is this: which shows the same molecule with two different numberings on the left, and the resulting graphs on the right. The double bond is not shown on the graphs; but it would probably have to be a labelling of the edge, rather than an actual multiple edge, to still be a simple graph. So, this quickly shows how - if you start with the vertices of the graph and connect 'all possible ways' - you get molecules that are isomorphic, but numbered differently. Therefore (perhaps) the numbering of the vertices and edges is one of he keys to not creating all the isomorphs and then having to expensively check them all.

Step 3 : Extend all possible ways

The title of this post refers to the tendency of algorithms in papers to have detailed explanation of every step except the most crucial one. Right now I am rediscovering this peculiar pleasure in structure generation. As an example - or more as visual decoration - have this image: which looks nice, but needs some explanation. The diagrams in boxes that look like parachutes (as one of my colleagues put it :) are simple representations of atoms connected by bonds. Each point is an atom, and a curved line connecting them is a bond. These are grouped together by what structure they correspond to; which is shown on the right of each set of diagrams. What this shows, then, is the redundancy you get from a simple generator. If you connect all atoms like this: for atomA in atoms: for atomB in atoms greater than atomA: connect(atomA, atomB) You quickly get a very large number of isomorphic structures. And then your process runs out of memory, in my experience. Oh, and the numbers below

Symmetric Generations

I've been trying to work on an implementation of a structure generator algorithm due to Faulon (its a paper in this list somewhere). One problem I have difficulty with is reduction of the number of isomers generated. For example: It may be hard to read, but the idea is that the full tree of possible ways to attach {Br, Cl, F, I} to c1ccc1 is 4 * 3 * 2 * 1 (you can attach Bromine to all four of the carbons, leaving 3 places to attach chlorine, and so on). Of course c1(Br)c(Cl)cc1 is the same as c1(Cl)c(Br)cc1 [never mind that C1(Br)=C(Cl)C=C1 is not the same as C1=C(Br)C(Cl)=C1]. Or, in other words, there is a high degree of symmetry in the tree. There is a way to solve this, perhaps even described in the paper - if I could just understand them...

Multi-view

A quick example of a mini-application made using the new JChempaint renderer: the picture is of an isomer space (C 3 H 7 NO), with compact mode on and atoms rendered as circles. The code is here, for now: http://gist.github.com/70342 which is a kind of dumb way to achieve this, as it creates an instance of a renderer for each molecule, instead of adding them to a molecule set, and then laying that out... edit: just realised; it's rendering the hydrogens as compact black circles :( edit2: Ahhh. that's better: edit3: Ha! This was a mistake, but it looks kind of cool: