Some Stuff

Posts

Showing posts from 2012

Comparing the EquivalentClassPartitioner and PartitionRefiner : Fullerenes

A comment on my previous post reminded me of the "EquivalentClassPartitioner" already in the CDK, written back in 2003 by Junfeng Hao and based on this article by Chang-Yu Hu and Lu Xu. After some testing, it seems they give the same results on various molecules - although, if you only want the equivalence classes the Hu/Xu method is much, much faster. Like 10-100 times faster. The test molecules I used for comparing speeds are a library of fullerenes that range in size from 20 carbons up to 720. Naturally, I started with the smaller ones, but even there the difference was clear. For instance, here is a table of numbers from the C40 run: The left-hand column is just the name of the cc1 file, the next two columns are the times for the AtomPartitionRefiner and EquivalentClassPartitioner, and the last two are the order of the automorphism group and the number of equivalence classes. Times are in milliseconds, so clearly the HuXu method is far faster at only one or t

Using the CDK's group module

There is a new module and package for the CDK that is currently under review. This is a short guide to how to use it - to help both reviewers and users. The basic idea is that molecules can be considered as a kind of graph , and that one useful thing to calculate about such graphs is the automorphism group that preserves element labels and/or bond labels. To put it another way, calculating the symmetries of the molecule - although I should point out that it's not quite the same as the crystallographic symmetry groups. As a simple example, consider these two molecules (1,4-cylohexadiene and 4h-pyran) : They are numbered from 0-5 for programming convenience; on the right each molecule has a table of automorphisms written as permutations in cycle notation. It should be fairly obvious that - for example - the H-Flip sends atom 0 to atom 4, 1 to 3, and fixes 2 and 5. Only the H-Flip is an automorphism for 4h-pyran, due to the oxygen atom. The code to do this is fairly s

Graphs of Trees of Graphs (ok, just timings again)

Line -graphs of search -trees of molecular -graphs, that is. I should also point out that I am not using the cutting-edge development version of OMG, but rather the commit 30b08250efa4.... - sadly, I don't have java7 on this machine, so I can't run the latest versions. Anyway, it doesn't make much difference for the alkanes and alkenes. Here are the times in miliseconds, and log(t in ms) for CnH2n and CnH2n+2. Click for bigger, as always: Clearly AMG (in blue) is significantly slower than OMG (in red), roughly by 10 times. On the other hand, the picture is surprisingly different if we add in a few oxygens: Weird, but I suspect that this kind of problem has been fixed in more recent versions of OMG. User "mmajid" seems to have been doing some interesting experiments with using bliss instead of nauty, multithreading, and semicanonical checking.

Overcoming the Problem of Crown Ethers in Generating

Due to an overly-optimistic algorithm, AMG was not generating crown ethers properly. In fact it was missing out on a whole bunch of structures, but these particular ones are a good example: The familiar structure is on the left, but the one on the right is the key to the problem that AMG had. Carbons are connected only to oxygens, and oxygens to carbons - in other words the molecule graph is elementally bipartite . Well, ok that's not really a phrase anyone uses - but it should be clear from the image. The flaw in AMG's algorithm was (again) in failing to try disconnected structures. It is not obviously not possible to reach a (CO)n structure from a C-C bond, only from C.C - that is, from two disconnected carbons. Surprisingly, after a fruitless attempt to use partition refinement on the connected parts and then 'add back' the disconnected atoms it turned out that refinement works just great on disconnected structures. So (for now) results for C3H6O3 and simila

Speed Optimization of AMG

Now that AMG 's results are getting better, I've started to look at improvements to the speed. At the moment, it's about 10 times as slow as OMG - probably due to my use of an inferior canonical checker (ie: not nauty). One obvious improvement has been to avoid extending molecules with too many or too few hydrogens. Previously, the hydrogens were only checked for the leaves - that is, at the final step. It is possible to prune the generation tree earlier, if you make two simple assumptions (see image). The min extension checks the smallest possible way to add all the remaining atoms to the molecule - which is just a tree. This gives the maximum number of hydrogens that could be added at this point. The max extension does the opposite, checking the case where all remaining atoms are added (at their maximum valence) which has the effect of effectively removing hydrogens. So, the number of implicit hydrogens for a partial structure is added or removed by these two bo

Tests that Pass, Tests that Fail

The AMG (alternative molecule generator) is now good enough to run proper tests on, with help from Tobias Kind who has long promised - or threatened, perhaps :) - to test a structure generators. It should lead to software that is of more than theoretical interest. Currently, there is a download available from github , or it can be built from the project directory if you are familiar with ant and are willing to change the build.properties file to point to a CDK directory. There is an instructions.txt file, with some examples of usages; the -h flag also works as might be expected. As for passing tests, it currently does better with hydrocarbons - CnH2n + x for x in {-2, 0, 2}. However, it's starting to improve on the more mixed formulae, with oxygen, nitrogen, and so on. The two child-listing methods (filter/symmetric) have different behaviour, annoyingly. Looking at one of the two pairs of duplicates in the set of C6H4 structures shows why it fails. The method here is the sym

Alternative Molecule Generation Implementation using the CDK and Signatures

Over the weekend, I cobbled together some components that I've been developing for a while (the past four years in fact) to make a molecule structure generator. As described in recent posts, OMG is one new available solution; now here is a proof-of-concept for a quite similar one. So what are the differences? Well, firstly OMG uses NautY to check candidates for canonicity while this implementation uses signatures , so is currently slower. The major difference, though is the algorithm. While OMG uses bond-augmentation of a parent structure to make children, this one uses atom-augmentation. Here is a small example (augmenting ethane): Of course, these are only the immediate children of the C-C parent; for OMG the unique, canonical ones will themselves be augmented further until they are the right size and have the correct number of hydrogens. The atom-augmentation algorithm, by contrast, produces a next generation with exactly one new atom, but a different number of bonds.

More Augmentation Timings and Conclusions

The previous post showed some rough timings for generating all graphs on n vertices. I've repeated this for graphs where the vertex degree is limited to 4, which is more relevant to molecule generation. Here is the data in the same format: The series have also been expanded to graphs on 9 vertices, as it takes less time overall to generate degree-limited graphs. However, the main conclusion that I draw - at the moment - from these two timing data charts is this; the choice of canonical augmentation method doesn't make a huge difference. A lot of variables are unaccounted for here, of course. Each of the five variants of this method could be implemented more efficiently (especially the edge-augmentation methods). Also, there may be other issues that only occur with multigraphs (as with molecules). It seems unlikely that these will overcome the difference between OMG's 18-45 ms per structure and Molgen's 0.008-0.009 ms (from the paper, page 11).

Timing Four Augmentation Algorithms

There are many possible ways to canonically augment graphs, but here I'm picking two pairs of possibilities - vertex vs edge augmentation, and filtering duplicates vs picking a representative from symmetrically equivalent positions. So, the four algorithms are vertex/filter (V/Fil), vertex/symmetric (V/Sym), edge/filter (E/Fil), and edge/symmetric (E/Sym). Here is a graph of log-average timings (in milliseconds) of the four implementations running on graphs of 4-8 vertices. One very important caveat is that the graph counts for 7 and 8 vertices are not 100% correct. The rows in blue on the table (4, 5, 6, 7, 8) are the log-averages of rows abov; so 4 = log(average(4a, 4b, 4c)), etc. The full spreadsheet is available here on github (as a .numbers file), or the code is here . I'm not particularly confident in the crude System.getTimeMilliseconds() as a timing method. However, the striking thing to me is that these numbers suggest that for larger input sizes (n > 6),

Two Different Ways to Handle Disconnected Graphs when Generating Graphs with Signatures

Disconnected graphs were mentioned in this post on OMG's algorithm. I also said that signatures could not handle such graphs, but this is not totally true. Here are a couple of approaches to handle them. Firstly, why is this a problem at all? Well, a signature of a graph is a bit like a spanning tree - and the canonical signature is the maximal string form of the set of trees. However, for disconnected graphs you get a spanning forest - which is just a set of spanning trees. The maximum tree from the forest will no longer span the whole graph. The image shows an example of this for a very simple graph. The upper panel shows a disconnected graph and its corresponding forest. The lower panel shows two different labelings of the same graph (A, B) and their canonical form. The algorithm for canonicalization simply has to 'paste' together canonical labels or signature strings, and has to make sure that a canonical signature is generated for each component. The compone

Schmidt's Bridge Finding Algorithm

While out looking for an algorithm to check for bridges in graphs , I found this paper by Jens Schmidt . It's very simple, and claims to be as fast as Tarjan's classic algorithm. The basic idea is to make something he calls a 'chain decomposition' of the graph, where a 'chain' is either a cycle or a path. This is quite similar to the 'parts' of a graph I described here , except more formal than my slightly ad-hoc definitions. Bridges, then are just those edges that are not in a chain. For example, take the graph from the paper: Click for bigger, as always. First make a spanning tree from the graph, and store the back-edges (curved edges in the picture). Then go through the back-edges in order of the depth-first index, and make chains from them and the tree edges. A chain that ends up back at the start is a cycle, otherwise it is an edge. The chains A-E are shown in the middle, along with a bridge F. The author also claims that the chain de

Open Molecule Generator's Algorithm

With the recent release of the Open Molecule Generator (OMG) I thought it would be nice to add to (or augment ) the description of the algorithm in the paper . The description here will be in terms of graphs, but the principle is largely the same. OMG's algorithm is a variant of McKay's canonical path augmentation algorithm , as mentioned before here . However, instead of augmenting by vertex it augments by edges. To illustrate this, consider a diagram for graphs with up to 4 vertices: The graphs are grouped by vertex count (boxes), and each box is sorted into columns. Along the top are the number of edges for graphs in that column, and along the bottom are number of vertices for graphs in the box. Two graphs are connected by an arrow if the larger can be made from the smaller by adding a single edge. A couple of important things to note about edge-addition are : 1) two paths can lead to the same graph, and 2) at least one of the graphs is disconnected. The first of th

How many isomers of C4H11N are there?

One of the most popular queries that lands people at this blog is about the isomers of C4H11N - which I suspect may be some kind of organic chemistry question on student homework. In any case, this post will describe how to find all members of a small space like this by hand rather than using software. Firstly, lets connect all the hydrogens to the heavy atoms (C and N, in this case). For example: Now eleven hydrogens can be distributed among these five heavy atoms in various ways. In fact this is the problem of partitioning a number into a list of other numbers which I've talked about before . These partitions and (possible) fragment lists are shown here: One thing to notice is that all partitions have to have 5 parts - even if one of those parts is 0. That's not strictly a partition anymore, but never mind. The other important point is that some of the partitions lead to multiple fragment lists - [3, 3, 2, 2, 1] could have a CH+NH2 or an NH+CH2. The final s

The Gale-Ryser Theorem

This is a small aside. While reading a paper by Grüner, Laue, and Meringer on generation by homomorphism they mentioned the Gale-Ryser (GR) theorem. As it turns out, this is a nice small theorem closely related to the better known Erdős-Gallai (EG). So, GR says that given two partitions of an integer ( p and q) there exists a (0, 1) matrix A iff p* dominates q such that the row sum vector r(A) = p and the column sum vector c(A) = q . As with most mathematics, that's quite terse and full of terminology like 'dominates' : but it's relatively simple. Here is an example: The partitions p and q are at the top left, they both sum to 10. Next, p is transposed to get p* = [5, 4, 1] and this is compared to q at the bottom left. Since the sum at each point in the sequence is greater (or equal) for p* than q , the former dominates. One possible matrix is at the top left with the row sum vector to the right, and the column sum vector below. Finally,

Canonical Augmentation : How not to do it

In the interest of not publishing false information (even on a blog), this is a brief outline of why the approach outlined in the previous post does not work. When constructing a set of graphs - from their degree sequences or otherwise - it is usual to describe the process as traversing a tree. The root of the tree is the empty set, and the leaves are the completed graphs. Internal nodes in this tree are partially completed graphs, and parent nodes are connected to children by augmentations . An augmentation to a graph can be made in various ways: adding a single edge, adding multiple edges to a single new node, etc. However, the important thing is to avoid duplicating solutions at the leaves. One easy way to do this is to check all newly generated leaves against all the previously generated ones. However, this requires not only storage of all the solutions so far, but also a lot of isomorphism checks. A better way is to check if a solution is canonical by some means. Even b

Orbit Saturation of Degree Sequences

Degree sequences and signatures are related, so it seems reasonable that Faulon's orbit-saturation algorithm for enumerating molecules should also work for just degree sequences. For example, height-1 signatures for a carbon skeleton are practically identical to a degree sequence. See this image for details: On the left is a target signature, with its corresponding carbon skeleton below. On the right is the equivalent degree sequence, and its simple graph. Of course, signatures can handle multiple bonds, and different elements, but the principle is the same. So the major problem with Király's method for enumerating graphs from their degree sequences is that it produces isomorphic solutions. The problem I had with Faulon's method was that I couldn't get it to work ! A functional implementation for the (much simpler) degree sequence problem should help. Encouragingly, the initial crude implementation does work for simple examples. The key seems to be to check

Király's Method for Generating All Graphs from a Degree Sequence

After posting about the Hakimi-Havel theorem, I received a nice email suggesting various relevant papers. One of these was by Zoltán Király called " Recognizing Graphic Degree Sequences and Generating All Realizations ". I have now implemented a sketch of the main idea of the paper, which seems to work reasonably well, so I thought I would describe it. See the paper for details, of course. One focus of Király's method is to generate graphs efficiently , by which I mean that it has polynomial delay. In turn, an algorithm with 'polynomial delay' takes a polynomial amount of time between outputs (and to produce the first output). So - roughly - it doesn't take 1s to produce the first graph, 10s for the second, 2s for the third, 300s for the fourth, and so on. Central to the method is the tree that is traversed during the search for graphs that satisfy the input degree sequence. It's a little tricky to draw, but looks something like this: At the top

Havel-Hakimi Algorithm for Generating Graphs from Degree Sequences

A degree sequence is an ordered list of degree s for the vertices of a graph. For example, here are some graphs and their degree sequences: Clearly, each graph has only one degree sequence, but the reverse is not true - one degree sequence can correspond to many graphs. Finally, an ordered sequence of numbers (d1 >= d2 >= ... >= dn > 0) may not be the degree sequence of a graph - in other words, it is not graphical . The Havel-Hakimi (HH) theorem gives us a way to test a degree sequence to see if it is graphical or not. As a side-effect, a graph is produced that realises the sequence. Note that it only produces one graph, not all of them. It proceeds by attaching the first vertex of highest degree to the next set of high-degree vertices. If there are none left to attach to, it has either used up all the sequence to produce a graph, or the sequence was not graphical. The image above shows the HH algorithm at work on the sequence [3, 3, 2, 2, 1, 1]. Unfortunat

General Graph Layout : Putting the Parts Together

An essential tool for graph generation is surely the ability to draw graphs. There are, of course, many methods for doing so along with many implementations of them. This post describes one more (or perhaps an existing method - I haven't checked). Firstly, lets divide a graph up into two parts; a) the blocks, also known as ' biconnected components ', and b) trees connecting those blocks. This is illustrated in the following set of examples on 6 vertices: Trees are circled in green, and blocks in red; the vertices in the overlap between two circles are articulation points. Since all trees are planar, a graph need only have planar blocks to be planar overall. The layout then just needs to do a tree layout on the tree bits and some other layout on the embedding of the blocks. One slight wrinkle is shown by the last example in the image above. There are three parts - two blocks and a tree - just like the one to its left, but sharing a single articulation point. I had

Radial Tree Layout

Fairly simple post here : radial tree layout of trees. Here is an image for trees on 8 vertices: The center vertices are colored red, similar to the top-down image in this post . Oh, and degree-2 vertices are arranged as if they were degree-3, which looks nicer I think (similar to chemical layouts).

Double Bonds and Edge Colorings

There is an effort going on to improve the double bond assignment machinery in the CDK, which is great. Of interest to me, however, is how many possible arrangements of double bonds can you have in fused ring systems. This was mentioned in at least one previous post - or perhaps two . However, lets get a very rough upper bound; how many ways are there to color the N edges of a graph with two colors? This is 2 to the power N, or the set of all subsets of the edges. Of course, many of these are chemically meaningless, where atoms have too high a valence. So filter out those where adjacent edges have the same color - or more exactly, where adjacent edges are colored with the 'double bond' color (let's call it '2'). The image shows a sketch of the simple procedure (above) and a slightly better approach (below). The better way of doing things is similar to the k-independent chessboard solution (sorry to link to my own pages so much - but it is relevant!). The

Expensive and Exhaustive Ring Finding

As the title states - this is a computationally expensive way to get all rings in a graph, but it's fairly simple, and illustrates some nice principles. For a better way to do things, perhaps Rich Apodaca's description of the Hanser, Jauffret, and Kaufmann algorithm would suit. Anyway, back to the expensive way. The set of cycles in a graph form what is called a 'cycle space' - which I didn't understand at all for a while, but is not actually that hard. For example, here is a basis set for the cycle space on a 3x3 hexagonal lattice: Looks like a bunch of cycles, really. The important thing is that it is possible to combine any subset of these cycles to get another cycle (or one of the other cycles in the basis). By 'combine' we mean XOR or the symmetric difference of the edge vectors. This sounds more complicated than necessary, so it's useful to consider a simple example. Well, the example is simple - the picture is not: On the left here

Cycles on Lattices

So, first a small correction to the last post : a 'honeycomb' layout is a cycle on a hexagonal lattice, while the other layouts there are not on-lattice. I've renamed them 'flower' layouts, as I'm not sure if they have a name - they're not at all new, I just haven't looked! To be clear: these are two different layouts of the same 12-cycle. As it happens, the honeycomb layout is also a [5, 5, 5]-flower layout with straight edges on the first edge of the outer cycle. However, not every honeycomb is a flower. There might be many ways to do it, but one way to make honeycomb layouts is to find cycles on a hexagonal lattice. Assuming, of course, that you have such a lattice already - it's not terribly hard to make one, but connecting it up properly is a bit fiddly. Luckily, it turned out that making the dual of a triangular lattice is slightly easier. To see how this works, consider the three possible (regular) planar lattices: The grey dots ar

Honeycombs and Macrocyles

Small rings in chemistry are usually laid out as polygons; 5-membered rings as a pentagon, 6-membered as a hexagon, and so on. Once you get beyond about 9-10, this tends to look a little nasty. Or at least, unconventional. So for 'macrocycles', it makes sense to make a less circular, and more wavy outline. Or, to be more exact, there is an inner cycle and several outer cycles, like this: To make it clearer, I have used a chemical-like structure with oxygens in the inner ring. These crown ether s are a particularly clear-cut case, as the ethylene linkages force the particular geometry of the drawing. However, it is not so obvious for other sizes of rings - what possible arrangements are there? Well, it occurred to me today that there is a simple formula for these macrocycle drawings. For a ring of size n with an inner ring of size i and outer rings or size o , you have to have n = (i * o) - i . The formula can be rearranged, but the idea is that you add up all the oute

Duals and Inner Duals of Planar Graphs

Graphs that can be embedded in the plane are called planar graphs - that is, they can be drawn without crossings. These drawings (or 'embeddings') have a dual which also corresponds to a graph, with a vertex in the dual for every face in the original, and an edge in the dual for every pair of faces that share an edge. For example, for each of the embeddings above (in black) of the same graph there is a different dual drawing corresponding to different graphs. If the vertex that represents the outer face is deleted from a dual, you get an inner dual. These inner duals are what Brinkmann et al use for generating fusanes, as I mentioned . The basic idea is to generate the duals, and then expand those into fusanes. Since the (inner) duals necessarily have less vertices than the full graph, this is quite efficient. However, I admit that I couldn't be bothered to implement the algorithm properly, even though the technique is partly based on McKay's method which I kn