Posts

Showing posts from August, 2008

CDK dependency analysis again

Took another look at the CDK using classcycle, but this time I used the complete package that Taverna downloaded as part of the CDK plugin (which isn't working at the moment, for some reason). The easiest way to look at the results, really, is to export them as a CSV file and look at it with a spreadsheet. Interestingly, the results show that it is CML that has the most layers. If I am understanding the idea of layers in classcycle correctly, it means something like the longest dependency path that ends in that class. The top hit is the DictionaryTool from org.xmlcml.cml.tools, with a layer index of 32. The first CDK class with a high layer index is CMLWriter (29), then RSSWriter, but most of the top 50 are CML classes. I guess it's not a bad thing, it's just a thing. A measurement. Edit: heh. The eclipse plugin for classcycle made this image for CDK:

Bioclipse and Taverna

After a short time with Taverna, I'm starting to see how it compares and contrasts with Bioclipse. As is my wont, here is a diagram illustrating this: What is meant here is that mostly Bioclipse is used to open-edit-save documents while Taverna is mostly used to get-process-save data. There are plugins for Bioclipse that do more processing style stuff, while you can sort of edit single documents in a workflow. Still. In the Taverna docs it talks about using Bioclipse as a results viewer for the output of a workflow. I suppose it would be nice to do the reverse and send documents to Taverna from Bioclipse. Perhaps that's already possible; I'm still learning :) EDIT: ola pointed out to me this page in the bioclipse wiki, although apparently it never worked right.

Taverna-cdk

The Taverna project is very interesting, in my not so humble opinion, because of the potential that workflows have. A workflow is a complete description of an experiment, one that can now be shared through the myexperiment site. The central point of a scientific experiment is that it should be repeatable, by the researcher and by others. Many bioinformatics journal papers describe experiments of a sort that will not be repeatable years down the line, by anyone. A concrete example is this paper by CH Robert and PS Ho, which describes an analysis of water bridges in proteins. A crucial line in the methods section is this: "All programs were written with FORTRAN 77 on a Silicon Graphics Iris workstation or with MATHEMATICA...on a Macintosh IIci computer." Which is really great if, 10 years later, you want to re-run their method on more than the 100 high-resolution structures that were available at the time. Do their programs still exist? Do I have access to an Iris machine (I us...

Classfile family trees

Looking into dependency analysis tools, these are three that I've seen:

eUML2 - a very slick tool with a free version and a commercial version.
JDepend - a free, open source tool for numerically analysing dependencies.
classcycle - another open source (and free) tool for reporting on cycles/dependencies.

I like the last one the best, as eUML2 adds @UML annotations to your code, and JDepend's stats are a bit opaque. Especially nice is that classcycle's report tells you the 'Layer' of a class. This is, roughly, the number of layers of other classes that it builds on top of. So, in cdk, there are various classes in layer 14 (some of the undoredo Edit classes). StabilizationCharges is in layer 13. ChemObject is layer 15! Perhaps some sort of diagram would be in order...
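To make the 'Layer' idea a bit more concrete, here is a minimal sketch of one way such a number could be computed. It assumes the layer of a class is the length of the longest dependency chain beneath it, and that the dependency graph has no cycles (a real tool has to deal with cycles somehow first). The class names in the toy graph are made up for illustration and are not the real CDK dependencies; this is not classcycle's own code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LayerIndexSketch {

    // Layer of a class = length of the longest dependency chain below it.
    // Assumption: the graph is acyclic. Classes with no dependencies get layer 0.
    static int layer(String cls, Map<String, List<String>> dependsOn, Map<String, Integer> memo) {
        Integer cached = memo.get(cls);
        if (cached != null) {
            return cached;
        }
        int best = 0;
        for (String dep : dependsOn.getOrDefault(cls, List.of())) {
            best = Math.max(best, 1 + layer(dep, dependsOn, memo));
        }
        memo.put(cls, best);
        return best;
    }

    // A toy dependency graph (hypothetical, for illustration only).
    static Map<String, List<String>> exampleGraph() {
        Map<String, List<String>> g = new HashMap<>();
        g.put("Writer", List.of("Molecule", "Atom"));
        g.put("Molecule", List.of("Atom", "Bond"));
        g.put("Bond", List.of("Atom"));
        g.put("Atom", List.of());
        return g;
    }

    public static void main(String[] args) {
        Map<String, List<String>> g = exampleGraph();
        Map<String, Integer> memo = new HashMap<>();
        for (String cls : g.keySet()) {
            System.out.println(cls + " -> layer " + layer(cls, g, memo));
        }
    }
}
```

In the toy graph, Writer depends directly on Atom, but its layer comes from the longer chain through Molecule and Bond, which is why a class near the top of a deep stack ends up with a high number.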

Current status: Working

I've been 'roundtripping' molecule-spectra correlation from the nmrshiftdb to a database of theoretical spectra generated for PubChem data. Something like this: Where 'experimental' is the nmrshiftdb data, and 'theoretical' is the PubChem data. Currently, my laptop is running 7,000 or so searches for the best spectrum match, and then calculating the Tanimoto similarity of the fingerprints of the search and hit molecules. Phew!
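For reference, the Tanimoto similarity of two bit fingerprints is the number of bits set in both divided by the number of bits set in either. A minimal sketch using plain java.util.BitSet rather than any CDK class; the helper names are hypothetical, and returning 0.0 for two empty fingerprints is just a convention chosen here.

```java
import java.util.BitSet;

class TanimotoSketch {

    // Tanimoto = |A AND B| / |A OR B| for two bit fingerprints.
    static double tanimoto(BitSet a, BitSet b) {
        BitSet both = (BitSet) a.clone();
        both.and(b);
        BitSet either = (BitSet) a.clone();
        either.or(b);
        int union = either.cardinality();
        // Assumption: two empty fingerprints are scored as 0.0.
        return union == 0 ? 0.0 : (double) both.cardinality() / union;
    }

    // Convenience helper (hypothetical) to build a fingerprint from bit positions.
    static BitSet bits(int... positions) {
        BitSet set = new BitSet();
        for (int p : positions) {
            set.set(p);
        }
        return set;
    }

    public static void main(String[] args) {
        BitSet search = bits(1, 2, 3);
        BitSet hit = bits(2, 3, 4);
        System.out.println(tanimoto(search, hit));
    }
}
```

Here the two fingerprints share two bits out of four distinct set bits, so the similarity is 2/4.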

A rolling stone

So there's a plugin in bioclipse called "moss" (or, I suppose, MoSS). It has an Atom class. It has a Bond class. It has graphs for chemicals, SMILES parsers, ... Just like the CDK, in fact. Perhaps I should check one of my projects (tailor) which also has Atom, Bond, Chain, PDBReader. You can't have too many implementations of the same thing :)

On models

So, after chatting about the PDBReader in CDK versus the Biojava PDBReader on CDK-devel (devel archive), I realised that I don't actually know how these different frameworks model biopolymers. This is my attempt to understand them: it shows a partial hierarchy for each framework, without the detail of which are classes and which are interfaces. Notably, biojava has an interface and an implementing class for each of the things shown. One thing I should point out is that "Strand" in CDK should really (really) be called "Chain", as in Biojava and Jmol (and probably every other known framework). A kind of spelling mistake, I think. I also don't understand what the PhosporousMonomer and PhosporousPolymer are in Jmol.

Turns out I don't remember SMILES format very well

Testing out the Java2DRenderer (in the org.openscience.cdk.renderer package), I tried to make this: Using this SMILES string: c1(c2ccc(c8ccccc8)cc2) c(c3ccc(c9ccccc9)cc3) c(c4ccc(c%10ccccc%10)cc4) c(c5ccc(c%11ccccc%11)cc5) c(c6ccc(c%12ccccc%12)cc6) c1(c7ccc(c%13ccccc%13)cc7) except that I didn't have the % signs, so it came out as: This is because of multiple ring closures, of course. I should have used: C1CCC(CC1) C2CCC(CC2) C13C(C3CCC(CC3)C4CCCCC4) C(C5CCC(CC5)C6CCCCC6) C(C7CCC(CC7)C8CCCCC8) C(C9CCC(CC9)C10CCCCC10) C13(C11CCC(CC11)C12CCCCC12) which is much clearer. Oh yes.
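The % business is easier to see with a tiny tokenizer. In SMILES, a bare digit is a one-digit ring-closure label, and a two-digit label has to be written as % followed by two digits, so "10" without the % reads as closure 1 followed by closure 0. A minimal sketch that only pulls out the ring-closure labels (it skips bracket atoms such as [13C], and is in no way a SMILES parser); the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class RingClosureSketch {

    // Lists the ring-closure labels in a SMILES string, in order of appearance.
    static List<Integer> ringClosureLabels(String smiles) {
        List<Integer> labels = new ArrayList<>();
        boolean inBracket = false;
        for (int i = 0; i < smiles.length(); i++) {
            char ch = smiles.charAt(i);
            if (ch == '[') {
                inBracket = true;
            } else if (ch == ']') {
                inBracket = false;
            } else if (inBracket) {
                // Digits inside [...] are isotopes, charges or H counts, not ring closures.
                continue;
            } else if (ch == '%' && i + 2 < smiles.length()
                    && Character.isDigit(smiles.charAt(i + 1))
                    && Character.isDigit(smiles.charAt(i + 2))) {
                // %nn is a single two-digit label.
                labels.add(Integer.parseInt(smiles.substring(i + 1, i + 3)));
                i += 2;
            } else if (Character.isDigit(ch)) {
                // A bare digit is always a one-digit label.
                labels.add(ch - '0');
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        System.out.println(ringClosureLabels("c%10ccccc%10"));
        System.out.println(ringClosureLabels("c10ccccc10"));
    }
}
```

With the % signs the fragment opens and closes one ring labelled 10; without them the same characters are read as labels 1 and 0, which is how rings end up closed onto the wrong atoms.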

Fixed a bug in PDBReader

When trying to convert PDB files to CML files (to test the CMLFileDescriber...) the underlying jumbo classes complained about adding bonds that had already been added (cdk bug 204663). Turns out, it was a problem with the PDBReader, which wasn't handling CONECT records properly. Oddly, CONECT records are fully redundant, so a bond between atom 10 and atom 50 appears as both "CONECT 10 50" and "CONECT 50 10". The fix I used was to store the bonds from CONECT in a list, checking each time to see if the bond was already there. It wasn't possible to use "bondA == bondB", as that compares references, and each bond is a new object. I should say, though, that maybe it's not worth worrying about problems with reading PDB files (an old format) in CDK (dedicated to chemical data). PDB files are horrible to read, anyway.
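The idea of comparing bonds by value rather than by reference can be sketched like this. This is not the actual PDBReader fix: it swaps the list-with-a-check for a set keyed on the unordered pair of atom serial numbers, and it splits the records on whitespace, which is a simplification (real CONECT records are fixed-width columns). BondKey and bondsFrom are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class ConectDedupSketch {

    // An unordered pair of atom serials, so 10-50 and 50-10 compare as equal.
    static final class BondKey {
        final int low;
        final int high;

        BondKey(int a, int b) {
            this.low = Math.min(a, b);
            this.high = Math.max(a, b);
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof BondKey)) {
                return false;
            }
            BondKey that = (BondKey) other;
            return low == that.low && high == that.high;
        }

        @Override
        public int hashCode() {
            return 31 * low + high;
        }

        @Override
        public String toString() {
            return low + "-" + high;
        }
    }

    // Collects each bond once, however many CONECT records mention it.
    // Assumption: fields are whitespace separated, which real PDB files do not guarantee.
    static List<BondKey> bondsFrom(List<String> conectLines) {
        Set<BondKey> seen = new LinkedHashSet<>();
        for (String line : conectLines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length < 3 || !fields[0].equals("CONECT")) {
                continue;
            }
            int from = Integer.parseInt(fields[1]);
            for (int i = 2; i < fields.length; i++) {
                seen.add(new BondKey(from, Integer.parseInt(fields[i])));
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        System.out.println(bondsFrom(List.of("CONECT 10 50", "CONECT 50 10")));
    }
}
```

The set does the "is it already there" check through equals and hashCode, so the second, mirrored record is simply ignored.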

How to set IOSettings on cdk Readers

So I wanted to set readConnect and useRebondTool to false on the org.openscience.cdk.io.PDBReader. Turns out, the way to do this is:

    PDBReader reader = new PDBReader(new FileReader(input));
    Properties properties = new Properties();
    properties.setProperty("ReadConnectSection", "false");
    properties.setProperty("UseRebondTool", "false");
    PropertiesListener listener = new PropertiesListener(properties);
    reader.addChemObjectIOListener(listener);
    reader.customizeJob();

Hmm. Not so easy. Maybe reflection in the base DefaultChemObjectReader class could allow the Readers to have set(String propertyName, String value) methods?

Initial Import

I've started working on the Bioclipse project, as part of a post-doc at the EBI in the Steinbeck group. This will be a place to put some notes on what I'm doing there.