Showing posts from February, 2011

Null Atomtypes in PDB Het Dictionary now in some sort of RDF thing

This blog post won "Best Title" in the "Random Combinations of Words" award category. Anyway, the atom types that were not found by the CDKAtomMatcher are now down to just 1,400. There were various errors on my part (such as using element symbols like "CL" rather than "Cl"), and CIF files are trickier to parse than I thought (an atom name can be "HOH'" or 'P"' - look closely!). Just to make life even harder, and because Egon suggested it, I tried to get the output into some kind of RDF using Jena. In practice, this seems to mean a sort of vague tree of nodes, some of which are data and some of which are types. Well, it's in the same repository as before, with source code. Although, looking at the N3 file, it still looks like it would be impossible to work out which atom belongs to which hetgroup...
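For the curious, the two traps above (quote characters inside atom names, and all-caps element symbols) can be sketched in a few lines of Java. This is only a rough illustration, not the parser from the repository: the class and method names are made up, the sample line is invented rather than taken from the HET dictionary, and it assumes the CIF 1.1 quoting rule that a quoted value only ends when the matching quote character is followed by whitespace or the end of the line. Comments ('#') and multi-line text fields are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of CDK or of the code in the repository.
final class CifLineTokenizer {

    /** Splits one CIF data line into values, honouring the quoting rule described above. */
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        int n = line.length();
        int i = 0;
        while (i < n) {
            char c = line.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
                continue;
            }
            if (c == '\'' || c == '"') {
                // The value ends at the same quote character, but only when that
                // quote is followed by whitespace or is the last character.
                int j = i + 1;
                while (j < n) {
                    boolean closes = line.charAt(j) == c
                            && (j + 1 == n || Character.isWhitespace(line.charAt(j + 1)));
                    if (closes) {
                        break;
                    }
                    j++;
                }
                tokens.add(line.substring(i + 1, j)); // drop the delimiting quotes
                i = j + 1;
            } else {
                // Unquoted value: runs to the next whitespace character.
                int j = i;
                while (j < n && !Character.isWhitespace(line.charAt(j))) {
                    j++;
                }
                tokens.add(line.substring(i, j));
                i = j;
            }
        }
        return tokens;
    }

    /** Turns an all-caps symbol such as "CL" into the conventional "Cl". */
    static String normalizeElementSymbol(String symbol) {
        if (symbol.isEmpty()) {
            return symbol;
        }
        return symbol.substring(0, 1).toUpperCase(Locale.ROOT)
                + symbol.substring(1).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Invented sample line containing both awkward atom names.
        String line = "HOH \"HOH'\" 'P\"' CL";
        for (String token : tokenize(line)) {
            System.out.println(token);
        }
        System.out.println(normalizeElementSymbol("CL"));
    }
}
```

The point of scanning for "quote followed by whitespace" rather than just "next quote" is that an atom name may itself contain a quote character, so a naive split on quotes would cut the name short.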

Atom-typing the Hetgroup Dictionary

So I posted this to CDK-devel, but this is probably the better place... I've been trying to make a map between the atom IDs used in the HET dictionary (which is in CIF format) and atom types of some sort. To see what this looks like, here is a tail of the file:

ZZZ.O6A:O.sp2
ZZZ.H7C1:H
ZZZ.H7C2:H
ZZZ.H8:H
ZZZ.H2N1:H
ZZZ.H2N2:H
ZZZ.H3:H
ZZZ.H5:H
ZZZ.H6:H
ZZZ.H6A:H

'Zzzz', you may be thinking, but although many atom IDs are quite obvious (H8 is a hydrogen, for example), some are probably not.

One annoying aspect of this process was that the CIF file format is not especially friendly. In particular, the file has 'loop_'s that don't terminate in octothorpes ('#'), as I thought they would. The parser (an IteratingCIFReader) could probably be much better written - in fact, it will probably only parse this one CIF!

So my initial estimate of 13,000 typing failures is now down to only 3,508. What are the atoms that fail? There are some that are bound to like TBR , w