longest common sequence
I needed a way to match bank statement entries, exported from my bank and imported into hledger, to the correct accounts. The idea was to find strings common to a few statements of the same account and test them against statements that go into a different account. Spoiler: I was overthinking the problem. A simple "Aldi" in the cascade of IFs is enough. I am proofreading these automatic matches either way, so some errors are OK and can be fixed in the matching code later.
Still, I implemented a longest common sequence algorithm and a small web service to find the common strings and test them against a negative list. The full code is at https://github.com/mfa/longest-common-sequence and is currently hosted at https://lcs.madflex.de/. I will probably undeploy it in a few weeks.
Things I learned:
There is a SequenceMatcher readily available in Python's difflib. It compares two strings, and calling get_matching_blocks returns the blocks they have in common as (a, b, size) triples.
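A minimal sketch of that call, not the code from the repository. The two statement texts are made up for illustration, and turning the triples back into substrings is my own addition:

```python
from difflib import SequenceMatcher

# Hypothetical statement texts, invented for this example.
a = "ALDI SAGT DANKE 1234"
b = "ALDI SUED SAGT DANKE 5678"

# The first argument is the optional "junk" function; None disables it.
matcher = SequenceMatcher(None, a, b)

# Each block is a named tuple (a, b, size): a[a:a+size] == b[b:b+size].
# The last block is always a dummy entry with size 0, so it is skipped here.
blocks = matcher.get_matching_blocks()
common = [a[m.a:m.a + m.size] for m in blocks if m.size > 0]

print(common)
```

Every entry in common is a substring of both inputs, which is what makes it a candidate for a matching rule.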
I used combinations from itertools (also part of the Python standard library) to find the matches for every pair of strings.
The whole idea is to get a list of common strings from all string pair combinations, then test these match strings first against the positive and then against the negative examples given.
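Collecting the candidates could look roughly like the sketch below. The function name common_strings, the min_length cutoff, and the example texts are assumptions of mine for illustration; the actual implementation is in the repository.

```python
from difflib import SequenceMatcher
from itertools import combinations


def common_strings(examples, min_length=3):
    """Collect matching blocks from every pair of example strings.

    min_length is a hypothetical cutoff to drop very short, noisy matches.
    """
    found = set()
    # combinations(examples, 2) yields every unordered pair exactly once.
    for left, right in combinations(examples, 2):
        matcher = SequenceMatcher(None, left, right)
        for block in matcher.get_matching_blocks():
            if block.size >= min_length:
                found.add(left[block.a:block.a + block.size])
    return found


# Hypothetical statements that should all land in the same account.
positives = [
    "ALDI SAGT DANKE 1234",
    "ALDI SUED SAGT DANKE 5678",
    "Kartenzahlung ALDI 42",
]
candidates = common_strings(positives)
print(sorted(candidates))
```

A set is used so that a string found in several pairs only shows up once.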
The positive matches are reduced by checking every match_string with all(map(lambda x: match_string in x, positives_examples)) and the negatives with an any(...).
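Put together, the filtering step might look like this sketch. I am assuming here that a candidate is dropped as soon as any negative example contains it; the function name filter_candidates and the sample data are hypothetical.

```python
def filter_candidates(candidates, positive_examples, negative_examples):
    """Keep candidates found in all positives and in none of the negatives."""
    kept = []
    for match_string in candidates:
        # Must occur in every positive example ...
        in_all_positives = all(match_string in x for x in positive_examples)
        # ... and must not occur in any negative example (assumed rule).
        in_any_negative = any(match_string in x for x in negative_examples)
        if in_all_positives and not in_any_negative:
            kept.append(match_string)
    return kept


# Hypothetical data, invented for this example.
positives = ["ALDI SAGT DANKE 1234", "ALDI SUED SAGT DANKE 5678", "Kartenzahlung ALDI 42"]
negatives = ["Kartenzahlung REWE 99", "LIDL SAGT DANKE 7"]
candidates = ["ALDI", " SAGT DANKE ", "Karte"]

result = filter_candidates(candidates, positives, negatives)
print(result)
```

Only "ALDI" survives here: the other two candidates are missing from at least one positive example.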
The full code is in algorithm.py.
The website looks like this: