This blog post won "Best Title" in the "Random Combinations of Words" award category.
Anyway, the atom types that were not found by the CDKAtomMatcher are now down to just 1,400. There were various errors on my part (such as using element symbols like "CL" rather than "Cl"), and CIF files are trickier to parse than I thought (an atom name can be "HOH'" or 'P"' - look closely!).
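A minimal sketch of the kind of cleanup those two problems call for, assuming each CIF field has already been split out on whitespace. The class and method names are hypothetical and are not part of CDK or of the parser described here.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not part of CDK.
final class CifAtomFields {

    // Normalize an element symbol to standard capitalization, e.g. "CL" -> "Cl".
    static String normalizeSymbol(String raw) {
        String s = raw.trim();
        if (s.isEmpty()) {
            return s;
        }
        return s.substring(0, 1).toUpperCase(Locale.ROOT)
                + s.substring(1).toLowerCase(Locale.ROOT);
    }

    // Strip one pair of matching outer quotes and leave any inner quote
    // characters alone, so "HOH'" gives HOH' and 'P"' gives P".
    static String unquote(String field) {
        int n = field.length();
        if (n >= 2) {
            char first = field.charAt(0);
            char last = field.charAt(n - 1);
            if ((first == '"' || first == '\'') && first == last) {
                return field.substring(1, n - 1);
            }
        }
        // Unquoted names such as O5' fall through unchanged.
        return field;
    }

    public static void main(String[] args) {
        System.out.println(normalizeSymbol("CL"));
        System.out.println(unquote("\"HOH'\""));
        System.out.println(unquote("'P\"'"));
    }
}
```

The point of checking that the first and last characters match is that a trailing prime or double prime on an unquoted atom name is part of the name, not a delimiter.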
Just to make life even harder, and because Egon suggested it, I tried to get the output in some kind of RDF using Jena. In practice, this seems to mean a sort of vague tree of nodes, some of which are data and some are types.
Well, it's in the same repository as before, with source code. Although looking at the N3 file, it still looks like it would be impossible to work out which atom belongs to which hetgroup...
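One way the atom-to-hetgroup link could be made explicit is to give every atom its own URI and a triple that points at its group. The sketch below is plain Java that writes N-Triples lines by hand rather than going through Jena, purely to show the shape of the data. The `http://example.org/pdb#` namespace, the `partOf`, `name` and `atomType` properties, and the class name are all assumptions, not what the repository actually contains.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: hand-written N-Triples, not the Jena-based output.
final class HetGroupTriples {

    // Made-up namespace for illustration.
    static final String NS = "http://example.org/pdb#";
    static final String RDF_TYPE =
            "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";

    static String uri(String s) {
        return "<" + s + ">";
    }

    // Escape backslashes and double quotes so names like P" stay valid literals.
    static String literal(String value) {
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Describe one atom and tie it to its hetgroup with an explicit partOf triple.
    // The atom URI uses an index, since atom names may contain quote characters.
    static List<String> describeAtom(String hetId, int atomIndex,
                                     String atomName, String atomType) {
        String het = uri(NS + "het/" + hetId);
        String atom = uri(NS + "het/" + hetId + "/atom" + atomIndex);
        List<String> triples = new ArrayList<>();
        triples.add(het + " " + uri(RDF_TYPE) + " " + uri(NS + "HetGroup") + " .");
        triples.add(atom + " " + uri(RDF_TYPE) + " " + uri(NS + "Atom") + " .");
        triples.add(atom + " " + uri(NS + "partOf") + " " + het + " .");
        triples.add(atom + " " + uri(NS + "name") + " " + literal(atomName) + " .");
        triples.add(atom + " " + uri(NS + "atomType") + " " + literal(atomType) + " .");
        return triples;
    }

    public static void main(String[] args) {
        for (String line : describeAtom("HOH", 1, "O", "O.sp3")) {
            System.out.println(line);
        }
    }
}
```

With a triple like `partOf` in place, finding the atoms of a hetgroup is a matter of matching on that one property instead of guessing from the position of nodes in the tree.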