Showing posts from April, 2009

Exploring the wild beasts of the layout jungle

A bug was submitted to the CDK SourceForge tracker (bug number 2783741) with a list of molecules that are laid out badly. I had a look at some of them with the help of Bioclipse. For example, this calix[4]arene:
or this:
which is a clearer case of something going wrong. More difficult are structures that are fully 3D, like:

Can you guess what it is? :) Try the 3D version (also made with bioclipse, using the CDK 3D layout):
It's a paracyclophane! The phenyl rings are lost in the 2D layout because there is a bigger 'ring'. Perhaps a chemist would look at the 3D structure and think that those chains are linkers, not parts of a ring, but the algorithm doesn't know this.

I think that it is difficult to have general rules for this. Of course, any fully 3D structure will be difficult to lay out in 2D (if it is not embeddable in the plane then it is impossible) so things like this:

are truly awful.

Bioclipse : Safe When Used As Directed

I finally used Bioclipse for a real purpose, and to good effect, too:
What this shows (the images do get larger if you click on them! :) is the following basic workflow:
1) Exploring manager functions in the js console (bottom).
2) Writing a script in the js editor (top left).
3) Running and getting feedback in the rhino console (far right).
4) Viewing the results in the sdf viewer (top right).
What I was doing was searching through an sdf file (C10H16_filtered.sdf) for all structures with a cyclohexane ring as a substructure, then writing those out to a file.
Probably could have been done 5 other ways, but, well, it was more fun this way.
Oh, and it is a gist here.

ChEBI in Bioclipse

Says it all, really. The scrolling can hang if you move too fast, which might be due to garbage collection? It is a very large file....

Portable whiteboard : deployed

Proposed CDK changes related to PDBReader and BioPolymer

This is expanding on one of the points that Rajarshi made in his blog (which he followed up here) on the PDB file handling capabilities of the CDK. There are two related topics: reading of PDB format files (the ancient, fixed column-width ATOM files) and the model that these are read into.
The old PDB format is being replaced with the mmCIF and/or PDBML formats. However, lots of programs still write out the old format, so it makes sense to keep supporting it for a while at least.
It is quite a nasty format, though, in some ways. Not so much the fixed column width, but the fact that crystallographers abuse the file format in all sorts of ways. Even a simple expectation, such as atom numbers always increasing, may not hold.
So it is not easy making a good reader for PDB files. The current CDK one won't read a file with just ATOM records, for example. Think that's reasonable? Well, tough luck for people that made programs that produce simple files like this.
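To make the fixed-column point concrete, here is a rough sketch (in Python, not the CDK's Java, and not the CDK reader's actual logic) of a forgiving reader for coordinate records. The column positions are those of the published PDB format for ATOM/HETATM records; the names `Atom`, `parse_atom_line` and `read_atoms` are made up for this illustration. It accepts a file containing nothing but ATOM records and keeps atoms in file order rather than assuming that serial numbers increase.

```python
from typing import List, NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> Atom:
    """Slice one ATOM/HETATM record by its fixed columns (0-based slices)."""
    return Atom(
        serial=int(line[6:11]),        # columns 7-11
        name=line[12:16].strip(),      # columns 13-16
        res_name=line[17:20].strip(),  # columns 18-20
        chain_id=line[21].strip(),     # column 22
        res_seq=int(line[22:26]),      # columns 23-26
        x=float(line[30:38]),          # columns 31-38
        y=float(line[38:46]),          # columns 39-46
        z=float(line[46:54]),          # columns 47-54
    )


def read_atoms(text: str) -> List[Atom]:
    """Collect coordinate records in file order, ignoring everything else.

    No header records are required, and serial numbers are not assumed
    to be increasing or even unique.
    """
    return [
        parse_atom_line(line)
        for line in text.splitlines()
        if line.startswith(("ATOM  ", "HETATM"))
    ]
```

A real reader would also have to cope with truncated lines, alternate locations, insertion codes and the various creative abuses mentioned above; this sketch will simply raise an error on a record that is too short or has non-numeric coordinates.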
A more …

Generated Wallpaper, anyone?

This is exhaustive generation of all C4Hn (where n=any number of hydrogens), badly smashed together onto one canvas.

Structure zoo

So, I'm getting better at generating structures :
but, not quite there yet. Oh, and the [1.1.1] propellane is drawn oddly (with one of its carbons at 0,0) due to a Mac-specific bug in the layout. Irritating, but difficult to fix.
edit: Oh, and for interest's sake, here is an auto-generated tree of graphs:
Even more complex looking is this! :
which is a set of structures produced by descending to depth 6 for 5-carbon graphs.

Numbering atoms, numbering vertices

Further to the similarities between numbering atoms in a structure and generating unique graphs, here is this:

which shows the same molecule with two different numberings on the left, and the resulting graphs on the right. The double bond is not shown on the graphs; but it would probably have to be a labelling of the edge, rather than an actual multiple edge, to still be a simple graph.
So, this quickly shows how, if you start with the vertices of the graph and connect them 'all possible ways', you get molecules that are isomorphic but numbered differently. Therefore (perhaps) the numbering of the vertices and edges is one of the keys to not creating all the isomorphs and then having to expensively check them all.
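As a small illustration of the point about numbering, here is a Python sketch that takes two numberings of the same four-atom chain, shows that their edge lists differ, and then finds a renumbering that maps one onto the other. It is a brute-force check over all vertex permutations, which is exactly the expensive approach one would like to avoid, and the function names are invented for this example. Bond orders are ignored, as in the graphs above.

```python
from itertools import permutations


def normalise(edges):
    """Store each undirected edge as a sorted pair, so (1, 0) equals (0, 1)."""
    return frozenset(tuple(sorted(edge)) for edge in edges)


def find_renumbering(edges_a, edges_b, n):
    """Return a vertex permutation mapping graph A onto graph B, or None.

    Tries all n! renumberings, so it is only sensible for tiny graphs.
    """
    a, b = normalise(edges_a), normalise(edges_b)
    if len(a) != len(b):
        return None
    for perm in permutations(range(n)):
        if normalise((perm[u], perm[v]) for u, v in a) == b:
            return perm
    return None


# The same four-atom chain, numbered two different ways.
chain_numbering_1 = [(0, 1), (1, 2), (2, 3)]
chain_numbering_2 = [(0, 2), (2, 1), (1, 3)]
# A genuinely different structure: one central atom with three neighbours.
star = [(0, 1), (0, 2), (0, 3)]

print(normalise(chain_numbering_1) == normalise(chain_numbering_2))  # edge sets differ
print(find_renumbering(chain_numbering_1, chain_numbering_2, 4))     # a mapping exists
print(find_renumbering(chain_numbering_1, star, 4))                  # no mapping
```

The two chain numberings have different edge sets but are related by a permutation of the vertex numbers, whereas no permutation turns the chain into the star.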

Step 3 : Extend all possible ways

The title of this post refers to the tendency of algorithms in papers to have a detailed explanation of every step except the most crucial one. Right now I am rediscovering this peculiar pleasure in structure generation.
As an example - or more as visual decoration - have this image:

which looks nice, but needs some explanation. The diagrams in boxes that look like parachutes (as one of my colleagues put it :) are simple representations of atoms connected by bonds. Each point is an atom, and a curved line connecting them is a bond.
These are grouped together by what structure they correspond to; which is shown on the right of each set of diagrams. What this shows, then, is the redundancy you get from a simple generator. If you connect all atoms like this:
    for atomA in atoms:
        for atomB in atoms greater than atomA:
            connect(atomA, atomB)
You quickly get a very large number of isomorphic structures. And then your process runs out of memory, in my experience.
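To put a number on that redundancy, here is a minimal Python sketch. It assumes that 'all possible ways' means every subset of the possible atomA < atomB bonds, and it ignores valence, hydrogens, bond orders and whether the result is connected. Duplicates are detected with a brute-force canonical form (the smallest edge list over all renumberings), which is a stand-in for illustration and not how a real generator should do it; `canonical_form` and `count_redundancy` are names invented here.

```python
from itertools import combinations, permutations


def canonical_form(edges, n):
    """Smallest sorted edge list over all n! renumberings of the vertices."""
    best = None
    for perm in permutations(range(n)):
        relabelled = tuple(sorted(
            tuple(sorted((perm[u], perm[v]))) for u, v in edges
        ))
        if best is None or relabelled < best:
            best = relabelled
    return best


def count_redundancy(n):
    """Return (graphs generated, distinct graphs up to isomorphism)."""
    # Every pair with atomB > atomA, as in the nested loop above.
    possible_bonds = list(combinations(range(n), 2))
    generated = 0
    distinct = set()
    for k in range(len(possible_bonds) + 1):
        for edges in combinations(possible_bonds, k):
            generated += 1
            distinct.add(canonical_form(edges, n))
    return generated, len(distinct)


print(count_redundancy(4))  # (64, 11)
```

For four atoms the naive approach already produces 64 numbered graphs for only 11 distinct ones, and the gap widens quickly as atoms are added, since the number of bond subsets doubles with every extra possible bond.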
Oh, and the numbers below each g…