Posts

Showing posts from October, 2012

Speed Optimization of AMG

Now that AMG's results are getting better, I've started to look at improving its speed. At the moment, it's about 10 times slower than OMG - probably due to my use of an inferior canonical checker (i.e., not nauty). One obvious improvement has been to avoid extending molecules with too many or too few hydrogens. Previously, the hydrogens were only checked for the leaves - that is, at the final step. It is possible to prune the generation tree earlier if you make two simple assumptions (see image). The min extension checks the smallest possible way to add all the remaining atoms to the molecule - which is just a tree. This gives the maximum number of hydrogens that could be added at this point. The max extension does the opposite, checking the case where all remaining atoms are added at their maximum valence, which effectively removes hydrogens. So, the number of implicit hydrogens for a partial structure is added or removed by these two bo
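
A minimal sketch of how such a pair of bounds could be used to prune, written in Java. This is not AMG's actual code: the class name `HydrogenBounds`, its method names, and the representation of a partial structure as a single "free valence" total are all assumptions made for illustration. The upper bound assumes every remaining atom is attached by exactly one single bond (the tree case); the lower bound assumes every remaining atom spends its whole valence on bonds.

```java
// Hypothetical helper, not part of AMG or the CDK.
final class HydrogenBounds {

    // Sum of the maximum valences of the atoms still to be added.
    private static int sum(int[] valences) {
        int total = 0;
        for (int v : valences) {
            total += v;
        }
        return total;
    }

    // Min extension: each remaining atom is attached by one single bond,
    // so each one uses up two valence units (one at each end of the bond).
    static int maxHydrogens(int freeValence, int[] remainingValences) {
        return freeValence + sum(remainingValences) - 2 * remainingValences.length;
    }

    // Max extension: each remaining atom spends all of its valence on bonds,
    // removing at most that many hydrogens from the partial structure.
    static int minHydrogens(int freeValence, int[] remainingValences) {
        return Math.max(0, freeValence - sum(remainingValences));
    }

    // A partial structure is worth extending only if the target lies in range.
    static boolean canReach(int targetHydrogens, int freeValence, int[] remainingValences) {
        return targetHydrogens >= minHydrogens(freeValence, remainingValences)
            && targetHydrogens <= maxHydrogens(freeValence, remainingValences);
    }
}
```

For a C-C partial structure (free valence 6) with one carbon left to add, `maxHydrogens(6, new int[]{4})` gives 8, which matches propane, so a target of C3H10 would be pruned at this point rather than at the leaves. The bounds are necessary conditions only; they are not guaranteed to be tight.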

Tests that Pass, Tests that Fail

The AMG (alternative molecule generator) is now good enough to run proper tests on, with help from Tobias Kind, who has long promised - or threatened, perhaps :) - to test structure generators. It should lead to software that is of more than theoretical interest. Currently, there is a download available from github, or it can be built from the project directory if you are familiar with ant and are willing to change the build.properties file to point to a CDK directory. There is an instructions.txt file, with some examples of usage; the -h flag also works as might be expected. As for passing tests, it currently does better with hydrocarbons - CnH2n+x for x in {-2, 0, 2}. However, it's starting to improve on the more mixed formulae, with oxygen, nitrogen, and so on. The two child-listing methods (filter/symmetric) have different behaviour, annoyingly. Looking at one of the two pairs of duplicates in the set of C6H4 structures shows why it fails. The method here is the sym

Alternative Molecule Generation Implementation using the CDK and Signatures

Over the weekend, I cobbled together some components that I've been developing for a while (the past four years, in fact) to make a molecule structure generator. As described in recent posts, OMG is one new available solution; now here is a proof-of-concept for a quite similar one. So what are the differences? Well, firstly, OMG uses NautY to check candidates for canonicity while this implementation uses signatures, so it is currently slower. The major difference, though, is the algorithm. While OMG uses bond-augmentation of a parent structure to make children, this one uses atom-augmentation. Here is a small example (augmenting ethane): Of course, these are only the immediate children of the C-C parent; for OMG, the unique, canonical ones will themselves be augmented further until they are the right size and have the correct number of hydrogens. The atom-augmentation algorithm, by contrast, produces a next generation with exactly one new atom, but a different number of bonds.
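
To make the atom-augmentation step concrete, here is a rough Java sketch that lists every way of bonding one new atom to an existing partial structure. It is an illustration only, not the AMG implementation: the class `AtomAugmentation`, the method `extensions`, and the use of plain `int[]` free-valence arrays instead of CDK atom containers are assumptions. No canonical check or symmetry handling is done, so the result still contains duplicates.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of AMG or the CDK.
final class AtomAugmentation {

    private static final int MAX_BOND_ORDER = 3; // single, double or triple

    // Returns every assignment of bond orders (0 = no bond) from one new atom
    // to the existing atoms, given each existing atom's free valence.
    static List<int[]> extensions(int[] freeValences, int newAtomValence) {
        List<int[]> results = new ArrayList<>();
        fill(freeValences, newAtomValence, 0, new int[freeValences.length], 0, results);
        return results;
    }

    private static void fill(int[] free, int budget, int index,
                             int[] current, int used, List<int[]> results) {
        if (index == free.length) {
            // Keep only connected children: the new atom needs at least one bond.
            if (used > 0) {
                results.add(current.clone());
            }
            return;
        }
        // A bond is limited by the old atom's free valence, the maximum bond
        // order, and whatever valence the new atom still has left.
        int max = Math.min(Math.min(free[index], MAX_BOND_ORDER), budget - used);
        for (int order = 0; order <= max; order++) {
            current[index] = order;
            fill(free, budget, index + 1, current, used + order, results);
        }
        current[index] = 0;
    }
}
```

Calling `AtomAugmentation.extensions(new int[]{3, 3}, 4)` for a carbon added to the C-C parent yields 12 raw bond-order assignments; many are mirror images of each other (for example `{1, 0}` and `{0, 1}`), which is exactly what the canonical check or symmetry-based child listing has to remove.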

More Augmentation Timings and Conclusions

The previous post showed some rough timings for generating all graphs on n vertices. I've repeated this for graphs where the vertex degree is limited to 4, which is more relevant to molecule generation. Here is the data in the same format: The series have also been expanded to graphs on 9 vertices, as it takes less time overall to generate degree-limited graphs. However, the main conclusion that I draw - at the moment - from these two timing data charts is this: the choice of canonical augmentation method doesn't make a huge difference. A lot of variables are unaccounted for here, of course. Each of the five variants of this method could be implemented more efficiently (especially the edge-augmentation methods). Also, there may be other issues that only occur with multigraphs (as with molecules). It seems unlikely that these will overcome the difference between OMG's 18-45 ms per structure and Molgen's 0.008-0.009 ms (from the paper, page 11), a gap of roughly three orders of magnitude.

Timing Four Augmentation Algorithms

There are many possible ways to canonically augment graphs, but here I'm picking two pairs of possibilities - vertex vs edge augmentation, and filtering duplicates vs picking a representative from symmetrically equivalent positions. So, the four algorithms are vertex/filter (V/Fil), vertex/symmetric (V/Sym), edge/filter (E/Fil), and edge/symmetric (E/Sym). Here is a graph of log-average timings (in milliseconds) of the four implementations running on graphs of 4-8 vertices. One very important caveat is that the graph counts for 7 and 8 vertices are not 100% correct. The rows in blue on the table (4, 5, 6, 7, 8) are the log-averages of the rows above; so 4 = log(average(4a, 4b, 4c)), etc. The full spreadsheet is available here on github (as a .numbers file), or the code is here. I'm not particularly confident in the crude System.currentTimeMillis() as a timing method. However, the striking thing to me is that these numbers suggest that for larger input sizes (n > 6),
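
As a small illustration of the log-average rows and of a less crude timer, here is a Java sketch. The class `TimingStats` and its methods are hypothetical and not taken from the linked code, and the logarithm base is an assumption (base 10), since the post does not say which base the spreadsheet uses. `System.nanoTime()` is the standard Java call for measuring elapsed time.

```java
// Hypothetical helper for illustration only.
final class TimingStats {

    // log(average(a, b, c, ...)) as described for the blue rows.
    // Base 10 is an assumption; the original base is not stated.
    static double logAverage(double... millis) {
        double sum = 0.0;
        for (double m : millis) {
            sum += m;
        }
        return Math.log10(sum / millis.length);
    }

    // Elapsed time of one run in milliseconds, measured with the monotonic
    // nanosecond clock rather than the wall clock.
    static double timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000.0;
    }
}
```

For three runs of 10, 100 and 190 ms the average is 100 ms, so `logAverage(10, 100, 190)` is 2. A single timed run is still noisy on the JVM because of JIT warm-up and garbage collection, so repeated runs (as in the 4a, 4b, 4c rows) remain a sensible precaution.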