Seeds and Weeds : Good/Bad Lists in Structure Generation

With the recent revival of the moleculegen (AMG) project, I've started to think properly beyond simple generation of spaces from a formula. For example, there are 4 billion or so C30H62 structures - which might take ... a while.

In any case, one useful feature would be to have good/bad lists of substructures (or 'nice' vs 'naughty', as I've been thinking of them). One simple approach I thought might work is to just start with the largest good substructure and generate from there. This can work:

C9H16 from 6-cycle

(By the way: all images were made with John May's new Depict utility! It's wonderful to use :)

Which is great! Except that it doesn't always work. The problem is that leaves of the generation tree that contain a particular substructure do not necessarily all descend from that substructure in the tree. Consider these C6H10 structures generated from a 4-cycle:

C6H10 from 4-cycle

These are not all of them. Now we filter the whole space using subgraph isomorphism (Asad's SMSD code):

Whole C6H10 space filtered by 4-cycle

So we get half of them by growing from a 'seed'. Now to think about filtering out the weeds...
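
To make the filtering step concrete, here is a minimal sketch of a 4-cycle subgraph check in plain Python. It is not SMSD and not part of moleculegen. It assumes structures are given as heavy-atom skeletons (lists of bonded atom index pairs, with hydrogens and bond orders ignored), and the function name `contains_cycle` and the four example structures are made up for illustration.

```python
from itertools import permutations


def contains_cycle(edges, n_atoms, k=4):
    """Return True if the skeleton has a k-cycle as a subgraph.

    Brute force: try every ordered choice of k distinct atoms and check
    that consecutive atoms (and the last back to the first) are bonded.
    Only sensible for small molecules.
    """
    bonds = {frozenset(e) for e in edges}
    for path in permutations(range(n_atoms), k):
        if all(frozenset((path[i], path[(i + 1) % k])) in bonds
               for i in range(k)):
            return True
    return False


# A few hand-picked C6H10 skeletons (assumed examples, bond orders dropped).
space = {
    "bicyclo[2.2.0]hexane": [(0, 1), (1, 2), (2, 3), (3, 0),
                             (3, 4), (4, 5), (5, 0)],
    "cyclohexene": [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)],
    "1,2-dimethylcyclobutene": [(0, 1), (1, 2), (2, 3), (3, 0),
                                (0, 4), (1, 5)],
    "hexa-1,3-diene": [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)],
}

# Keep only the structures that contain the 'good' substructure.
kept = [name for name, edges in space.items() if contains_cycle(edges, 6)]
print(kept)
```

Filtering the whole space like this finds every structure with the substructure, but only after generating everything, which is exactly the cost that growing from a seed was meant to avoid.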