Skip to main content

Posts

Showing posts from 2016

Tailor : Descriptions

Tailor (https://github.com/gilleain/tailor) is a project that grew out of my attempts to search for catmats   niches  with Prof. Milner-White. The goal of the project was to allow users to define protein structural patterns (called 'descriptions') along with a set of associated measures . More on measures later, but first what is a description? Here is a very simple example: The lines don't have arrowheads, but this is implicitly a tree/DAG rather than a graph. There is a root ProteinDescription  and the leaves are AtomDescriptions  - the DistanceCondition  is referencing the two atoms. Basically, this just defines a pattern of two amino acids (GLY, ALA) with a distance of less than 3Ã… between the N and O atoms. There are still a lot of details to be worked out here. Can the groups be separated along the chain? If they can, should that require the description to be explicit as to the relationship between sibling nodes? How do we define any number of matches (as wit

WTF is a Number Bond?

Not chemistry, as it happens. I was searching for similar images to one of my line drawings (always fun) and came across these 'number bond things' : The one on the left - hilariously - is just "1 + 1 = 2". Ok, so that's a deliberately jokey example; real ones have larger numbers and one of the three numbers is for the student to fill in. On the right is a more complex example, drawn as a DAG  (directed acyclic graph) although at least one of the example I saw had a node at the bottom with three parents! In any case, what these things really are representing is partitions of numbers - which are usually drawn as Ferrer's diagrams (or Young tableaux) which I'll refer to as "Ferrer's-Young diagrams". These have a superior feature as shown here: So one FY-diagram can represent two  different number bonds. Note that I've made the crazy leap of making number bonds with more than two parts (or 'addends'). Clearly 1+1+3+4 = 9 =

Submultisets and Graphs

The previous post mentioned the restricted weak composition (RWC), but didn't expand on it at all! Basically, I found this excellent paper : " Generalized Algorithm for Restricted Weak Composition Generation " by Daniel R. Page. It even gave some java code in an appendix - good stuff :) Anyway, a RWC is a composition (which is a partition where order matters) that is weak  - has zeros in it - and the parts are restricted . So [1, 0, 1, 1] is a weak composition of 3 into 4 parts and lets say we have restricted the parts to {0, 1, 2}. Here is an overview of the scheme: where we take a degree sequence, convert to a multiset and use a RWC to get a particular sub-multiset. This allows us to take some count of some subset of elements from the multiset. Doing this for all sub-multisets at each round should then allow us to list graphs - although not without redundant examples:  This shows all starting points for 3 -> [3, 2, 2, 1, 1] and

Restricted Weak Compositions, Labelled Partitions, and Trees

So in the last post about listing trees  I outlined a slightly cumbersome method to list trees from degree sequences. Thinking about it a bit more, it would probably be far easier to just list all trees on some number of vertices and filter out by degree sequence. I talked a little about the WROM algorithm in this old post  which is a constant time generator of 'free' (unlabelled) trees. Anyway, that's boring so I was trying out the more complicated approach. It looks like generating a single  tree from a degree sequence is as simple as the Havel-Hakimi method. Connect the largest degree (dn) to the dn next largest degrees.  Also maintain a list of vertices that have already been connected to, and then at the next step connect only to those not already connected to. So, for [3, 3, 2, 1, 1, 1, 1] we get: You might notice that trees a) and c) are isomorphic. Below the trees labelled by degree are the same trees labelled by DFS discovery order, and below that the '

Listing Degree Restricted Trees

Although stack overflow is generally just an endless source of questions on the lines of "HALP plz give CODES!? ... NOT homeWORK!! - don't close :(" occasionally you get more interesting ones. For example this one  that asks about degree-restricted trees. Also there's some stuff about vertex labelling, but I think I've slightly missed something there. In any case, lets look at the simpler problem : listing non-isomorphic trees with max degree 3. It's a nice small example of a general approach that I've been thinking about. The idea is to: Given N  vertices, partition 2(N - 1) into N parts of at most 3 -> D = {d0, d1, ... } For each d_i in D, connect the degrees in all possible ways that make trees. Filter out duplicates within each set generated by some d_i. Hmm. Sure would be nice to have maths formatting on blogger.... Anyway, look at this example for partitioning 12 into 7 parts: At the top are the partitions, in the middle the trees

Equitable Partition Refinement with List Invariants

So the bug with C19H14 and C10H16 formulae seems to have been due to the partition refinement not correctly labelling structures with particular arrangements of multiple bonds. The underlying problem is in the equitable  partition refinement process. This is a short note about the problem. Equitable refinement of a partition for a graph is the formation of a vertex partition where each element of each block of the partition has an equal number of neighbours in the other blocks. This is a little difficult to imagine, but it is - roughly - a generalisation of the Morgan number algorithm which attempts to find labels for sets of vertices which are stable with respect to splitting them by the labels of the neighbours. For example: This image shows a cub-2-ene like molecule (or a cube graph with two of the edges colored). Clearly the orange and green vertices are in 'different' sets in some sense. Precisely, they are in different blocks of the equitable partition ([0,2,5,7|1

From Seed To Leaf

So the previous post  pointed out the problem with a simple extension from a seed : you miss some. In detail, two of the problems are: Difficulties with growth from a seed Firstly, A shows the - slightly obvious - idea that for some (seed, leaf) pairs you can only get from one to the other by adding edges and not vertices. This problem is easy enough. As for B , I show here a detailed (if made up) example of the main problem : augmentation of a seed is not necessarily canonical. Or, to put it another way, the canonical deletion can lead to a sibling of the seed, rather than the seed itself. I think the way round this might be to restrict the candidate atoms (or bonds, even) for canonical deletion to those outside the original seed. In other words, canonically label the augmentation to give the ordering of atoms/bonds then choose the largest labelled one that is not in the seed.

Seeds and Weeds : Good/Bad Lists in Structure Generation

With the recent revival of the moleculegen (AMG) project, I've started to properly think beyond just simple generation of spaces from a formula. For example, there are 4 trillion or so C30H62 structures - which might take ... a while. In any case, one useful feature would be to have good/bad lists of substructures (or 'nice' vs 'naughty' as I've been thinking of them. One simple approach I thought might work is to just start with the largest good substructure and generate from there. This can  work : C9H16 from 6-cycle (By-the-way : all images done with John May's new Depict utility! It's wonderful to use :) Which is great! Except that it doesn't always work. The problem is that leaves on the tree with a particular substructure might not have that substructure as a common parent in the tree. Consider these C6H8 structures generated from a 4-cycle: C6H10 from 4-cycle These are not all of them. Now we filter out using subgraph i

Pictures of Very Wide Trees

Visualising canonical augmentation trees of molecules is not quite as good as I thought: Or perhaps I should use a different tree layout. Probably a circular one. Edit. Like this: