David Thornley, Stavros Petridis
DNA sequence basecalling is commonly regarded as a solved problem, even though significant error rates persist and are reflected in inaccuracies in databases and genome annotations. This makes confidence measures for basecalls important, and fuzzy methods have recently been used to approximate confidence by responding to data quality at the calling position. We have demonstrated that variation in the peak heights of contextual sequencing trace data encodes novel information that can be used for basecalling and confidence estimation. Using neuro-fuzzy classifiers, we are able to decode much of this hidden contextual information in two fuzzy rules per base and partially reveal its underlying behaviour. These two fuzzy rules can satisfactorily explain over 74% of data samples. The error rate on individual bases is 6-7% higher than with classification trees, but the number of rules is reduced by a factor of 100. SANFIS yields a compact, comprehensible knowledge representation that allows the embedded knowledge to be interpreted easily. Finally, we propose a hybrid architecture based on SANFIS which achieves slightly better performance than a classification tree with significantly improved knowledge representation.
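To make the idea of "two fuzzy rules per base" concrete, the following is a minimal sketch of how such a rule base might be evaluated. It is not the authors' SANFIS implementation. The Gaussian membership functions, the three-value feature vector of normalised peak heights, the rule parameters in `RULES`, and the function names are all assumptions chosen for illustration, and only two bases are shown to keep it short.

```python
import math

def gaussian(x, centre, width):
    """Gaussian membership value of x for a fuzzy set (assumed shape)."""
    return math.exp(-0.5 * ((x - centre) / width) ** 2)

def firing_strength(features, rule):
    """Product of the memberships of each feature in the rule's antecedents."""
    strength = 1.0
    for x, (centre, width) in zip(features, rule["antecedents"]):
        strength *= gaussian(x, centre, width)
    return strength

def base_score(features, rules):
    """Weighted average of rule consequents, weighted by firing strength."""
    strengths = [firing_strength(features, rule) for rule in rules]
    total = sum(strengths)
    if total == 0.0:
        return 0.0  # no rule fires at all
    weighted = sum(s * rule["consequent"] for s, rule in zip(strengths, rules))
    return weighted / total

def call_base(features, rule_base):
    """Return the highest-scoring base and the score of every base."""
    scores = {base: base_score(features, rules) for base, rules in rule_base.items()}
    return max(scores, key=scores.get), scores

# Hypothetical rule base: two rules per base, one supporting the call
# (consequent 1.0) and one opposing it (consequent 0.0). Each antecedent is
# a (centre, width) pair for one normalised peak height in the context.
RULES = {
    "A": [
        {"antecedents": [(0.8, 0.2), (0.9, 0.2), (0.8, 0.2)], "consequent": 1.0},
        {"antecedents": [(0.2, 0.2), (0.1, 0.2), (0.2, 0.2)], "consequent": 0.0},
    ],
    "C": [
        {"antecedents": [(0.2, 0.2), (0.1, 0.2), (0.2, 0.2)], "consequent": 1.0},
        {"antecedents": [(0.8, 0.2), (0.9, 0.2), (0.8, 0.2)], "consequent": 0.0},
    ],
}

called, scores = call_base((0.8, 0.9, 0.8), RULES)
```

In this sketch the score of the called base doubles as a rough confidence value, and each rule can be read directly from its centres and widths, which is the kind of interpretability a small rule base offers over a large classification tree.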
Information from pubs.doc.ic.ac.uk/neuro-fuzzy-dna-basecaller.