longest common sequence

I needed a way to match bank statements -- exported from my bank and inserted into hledger -- to the correct accounts. The idea was to find common strings between a few statements of the same account and test them against statements that go into a different account. Spoiler: I was overthinking the problem. A simple "Aldi" in the cascade of IFs is enough. I am proofreading these automatic matches either way, so some errors are ok and can be fixed in the matching code later.

Still, I implemented a longest common sequence algorithm and a small web service to find the common strings and test them against a negative list. The full code is at https://github.com/mfa/longest-common-sequence and is currently hosted at https://lcs.madflex.de/. I will probably undeploy it in a few weeks.

Things I learned:

There is a SequenceMatcher readily available in difflib in the Python standard library. It compares two strings, and calling get_matching_blocks returns the blocks that match between them. I used combinations from itertools (also in the standard library) to run the matcher on every pair of strings. The whole idea is to collect the common strings from all pair combinations, then test each match_string first against the positive and then against the negative examples given. A match_string is kept only if it occurs in every positive example, which is checked with all(map(lambda x: match_string in x, positives_examples)), and it is dropped if it occurs in any negative example, which is checked with an any(...). The full code is in algorithm.py.
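The steps described above can be sketched roughly like this. This is not the code from algorithm.py, only a minimal illustration of the idea. The function names, the min_length cutoff, the whitespace stripping, the autojunk=False setting and the example strings are all my assumptions for the sketch.

```python
from difflib import SequenceMatcher
from itertools import combinations


def common_strings(examples, min_length=3):
    """Collect substrings shared by at least one pair of examples.

    min_length is a hypothetical cutoff to skip very short matches.
    """
    found = set()
    for a, b in combinations(examples, 2):
        # autojunk=False is an assumption; it disables the popularity
        # heuristic that only matters for long inputs anyway.
        matcher = SequenceMatcher(None, a, b, autojunk=False)
        for block in matcher.get_matching_blocks():
            # Each block is (a, b, size); the last one is a size-0 sentinel.
            candidate = a[block.a:block.a + block.size].strip()
            if len(candidate) >= min_length:
                found.add(candidate)
    return found


def filter_candidates(candidates, positive_examples, negative_examples):
    """Keep candidates found in all positives and in none of the negatives."""
    kept = [
        c for c in candidates
        if all(c in p for p in positive_examples)
        and not any(c in n for n in negative_examples)
    ]
    # Longest candidates first, then alphabetical for a stable order.
    return sorted(kept, key=lambda c: (-len(c), c))


# Made-up statement texts, not real bank data.
positives = ["ALDI SUED 1234", "ALDI SUED 5678", "KARTE ALDI SUED 99"]
negatives = ["LIDL 1234", "ALDI NORD 77"]

result = filter_candidates(common_strings(positives), positives, negatives)
print(result)  # ['ALDI SUED']
```

The negative list is what keeps a too-generic string out: a candidate like "ALDI" on its own would also match "ALDI NORD 77" and would be dropped by the any(...) check.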

The website looks like this:

img1