Skip to main content


Showing posts from May, 2009

Simplest example : 1-methylcyclobutane

This is the simplest possible hydrocarbon example (there being none with four carbons) of multiple isomorphic solutions:

The only difference in numbering is the swap of 2 and 4. Oh well.

Doesn't quite work

Sadly, using the degree sequence as a kind of 'mask' on canonically numbered matrices isn't enough. I should have guessed, it was too simple to be true :(
The smiles for all matrices in the (2, 2, 2, 2, 2, 3, 3) partition are:
and (ignoring the third one, which is disconnected) there are only 3 or 4 different simple graphs in the list.
Perhaps I would have known all this if I knew what automorphisms were, or orbits of elements in sets, or any of that...

Bridged Cyclohexanes - same partition, different PVR sequences

I wanted to find an example of two non-isomorphic molecules with the same degree sequence, but different PVR sequences. The simplest I could think of was this pair:

although there might be a similar pair of bridged 5-membered rings, I suppose.

The meaning of this is quite simple - for a particular degree sequence (partition), there are multiple (different) molecules with a valid PVR sequence. I thought that this must be true, but there are none in the G4 set.

Rows and Columns

Aha! So, a comment by Egon (on my last post) showed the benefits of showing people what you do. He suggested summing the columns - not converting from binary to integers, as with the rows - to remove the last traces of redundancy. So this seems to work for graphs with 4 vertices:

Sorry for the extremely detailed diagram, but it is necessary to show my point. These matrix/graph pairs are all the PVR numbered adjacency matrices for n=4. There are isomorphic structures here, but note the column sums along the bottom of each matrix.
These column sums form another sequence - which can be used to select only one of the isomorphs. Arbitrarily, we choose sequences that are partially ordered - i.e. no number in the sequence is less than the previous number in the sequence. This seems to work...

PVR numbering scheme not solution to all woes : film at 11

On a whim, I decided to try generating all adjacency matrices with the property that they are PVR numbered. The short summary of that link is that a matrix can be expressed as a sequence of positive integers by considering each row of the matrix as a binary number.
The point of doing this (I thought) was that you can number a molecule in such a way that the adjacency matrix is PVR-numbered, and that this is canonical. So my cunning plan was to generate all sequences of n numbers that are partially ordered, choosing them from [1, 2n] to give all non-redundant (simple) graphs with n vertices.
Unfortunately, it seems like this can't work:

This image shows all adjacency matrices for n = 3 which are PVR-numbered. They are made by backtracking through all sequences of integers with a partial order, pruning the solutions using the symmetry of the matrix as a constraint.
Anyway, the point is that the first two graphs are clearly isomorphic! More simply, they both represent propane. Maybe this…

Generation : Overview

To sum up the previous post flood; generation of constitutional isomers from the elemental formula can be done by generating all partitions of the total 'free' valence of the heavy atoms. The overall scheme is shown here:

(click for bigger, as usual). So, for each formula, multiple partitions can be made, and each of these makes multiple sub-partitions, and each of these correspond to one or more molecules.
Now, I won't pretend that any of this is particularly novel. I am no doubt re-expressing the problem of generating all possible molecules in a slightly different way. Having tried (and failed) to implement published methods, this was the best I could come up with.
I suspect that there are many improvements that could be made to the algorithm, and the implementation of it. Getting something that works, even in a limited way, seems like progress, however :)

Final step : nested partition thingies to actual molecules

So, the last step:

One thing to note about this is that the algorithm again has to backtrack to get all the molecules for any sub-partition list (another name for the things like [[3, 1], [3], [1, 1], [1]]).
Anyway, on the right hand side is the final (only) molecule made for this nested partition. It has the [4, 3, 2, 1] partition structure, naturally, and the correct constitutional formula.
Now what would be nice, would be to combine the second and third steps, so that only those nested partitions that produce valid molecules were tried. However, as the saying goes : "First, make it work, then make it work fast".

Partitions to er..'nested partitions'

So I don't have a good name for the objects that I create half-way through this process : the code uses 'solution', which is confusing. Anyway, here is the process:

The left hand side is clear enough, I think, and follows on from the image in the previous post. Conceptually, it is similar to 'gathering' the attachment points into half-bonds in all possible ways. So, the 4 attachment points on the bare carbon fragment can become a triple (half) bond, and a single half bond. This is shown, as [3, 1].
Of course, there are other possibilities, and each combination at each atom fragment has to be paired with each other possibility at each other fragment! If this sounds like a backtracking problem, then you might understand why I did exactly this in the code.
What would be nice, would be to prune the solutions - for example, [[3, 1], [2, 1], [1, 1], [1]] is generated, but is clearly impossible. Neither the triple half-bond, nor the double half-bond have partners. Pruning th…

Formula to Partitions

For the benefit of my (2) readers, here is the process of going from the chemical formula to the partitions, and what this actually means.
So, the list of numbers at the bottom (the partition) is the simplest possible representation of the attachment points for each atom fragment.
Oh, and the python code for generating partitions for is here - it is basically a straight copy of the code from the book "Combinatorial Algorithms : Generation, Enumeration, and Search", which I highly recommend - a good balance of maths and computer science.


Hmm. Not my favourite solution, but I think that the isomorphism checks can be done in batches, resulting in actual isomorph spaces, like this (for C5H10):
I should probably check that these are right...

Step one : catch them all

Well, this is the C4H6 space again. Or <10,4> as I have taken to calling it, for no obvious reason.
There are duplicates, yes. But there are nine (unique) structures here, which is good.