Now that AMG's results are getting better, I've started to look at improvements to the speed. At the moment, it's about 10 times as slow as OMG - probably due to my use of an inferior canonical checker (ie: not nauty).
One obvious improvement has been to avoid extending molecules with too many or too few hydrogens. Previously, the hydrogens were only checked for the leaves - that is, at the final step. It is possible to prune the generation tree earlier, if you make two simple assumptions (see image).
The min extension checks the smallest possible way to add all the remaining atoms to the molecule - which is just a tree. This gives the maximum number of hydrogens that could be added at this point. The max extension does the opposite, checking the case where all remaining atoms are added (at their maximum valence) which has the effect of effectively removing hydrogens.
So, the number of implicit hydrogens for a partial structure is added or removed by these two bounds, and the target hydrogen count is checked to see if it lies in this range. The effect is to prune branches of the generation tree before they even need to be checked for canonicity or other expensive tests:
The tree here is arranged so that the number of hydrogens decreases from left to right - although this may not be the order the structures are actually visited. The dashed lines indicate how the side branches are pruned away.
One obvious improvement has been to avoid extending molecules with too many or too few hydrogens. Previously, the hydrogens were only checked for the leaves - that is, at the final step. It is possible to prune the generation tree earlier, if you make two simple assumptions (see image).
The min extension checks the smallest possible way to add all the remaining atoms to the molecule - which is just a tree. This gives the maximum number of hydrogens that could be added at this point. The max extension does the opposite, checking the case where all remaining atoms are added (at their maximum valence) which has the effect of effectively removing hydrogens.
So, the number of implicit hydrogens for a partial structure is added or removed by these two bounds, and the target hydrogen count is checked to see if it lies in this range. The effect is to prune branches of the generation tree before they even need to be checked for canonicity or other expensive tests:
The tree here is arranged so that the number of hydrogens decreases from left to right - although this may not be the order the structures are actually visited. The dashed lines indicate how the side branches are pruned away.
Comments