The package is registered in the General registry, so it can be installed at the REPL with:
    ] add StringDistances
Distances are defined for
AbstractStrings, and any iterator that define
The available distances are:
q in each string.
The package also defines Distance "modifiers" that can be applied to any distance.
TokenSet modifiers, with penalty terms depending on string lengths. This is a good distance to match strings composed of multiple words, like addresses.
TokenMax(Levenshtein()) corresponds to the distance defined in fuzzywuzzy.
You can always compute a certain distance between two strings using the following syntax:
    evaluate(dist, s1, s2)
    dist(s1, s2)
For instance, with the Levenshtein distance:
    evaluate(Levenshtein(), "martha", "marhta")
    Levenshtein()("martha", "marhta")
pairwise returns the matrix of distances between two AbstractVectors of AbstractStrings:
    pairwise(Jaccard(3), ["martha", "kitten"], ["marhta", "sitting"])
It is particularly fast for QGram-distances (each element is processed once).
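To make the shape of the result concrete, the following sketch compares pairwise with an explicit comprehension. It assumes the package is installed and that entry [i, j] of the result holds the distance between the i-th element of the first vector and the j-th element of the second; the variable names are illustrative only.

```julia
using StringDistances  # assumes the package is installed

xs = ["martha", "kitten"]
ys = ["marhta", "sitting"]
dist = Jaccard(3)

# One call computes every pair at once.
R = pairwise(dist, xs, ys)

# Assumed equivalent layout: rows follow xs, columns follow ys.
R_manual = [dist(x, y) for x in xs, y in ys]

println(size(R))        # one row per element of xs, one column per element of ys
println(R ≈ R_manual)
```

The comprehension evaluates each pair independently, which is the work pairwise is described as avoiding for QGram distances.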
compare returns the similarity score, defined as 1 minus the normalized distance between two strings. It always returns a Float64. A value of 0.0 means completely different and a value of 1.0 means completely similar.
    Levenshtein()("martha", "martha") #> 0.0
    compare("martha", "martha", Levenshtein()) #> 1.0
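The relation between a distance and its similarity score can be sketched in plain Julia, without the package. The helper names levenshtein_sketch and similarity_sketch are hypothetical, and dividing by the length of the longer string is an assumption about how the Levenshtein distance is normalized, not a statement of the package's implementation.

```julia
# Hypothetical helper: classic dynamic-programming Levenshtein distance.
function levenshtein_sketch(s1::AbstractString, s2::AbstractString)
    a, b = collect(s1), collect(s2)
    prev = collect(0:length(b))      # distances from the empty prefix of s1
    for i in eachindex(a)
        curr = similar(prev)
        curr[1] = i                  # delete the first i characters of s1
        for j in eachindex(b)
            cost = a[i] == b[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1,   # deletion
                              curr[j] + 1,       # insertion
                              prev[j] + cost)    # substitution or match
        end
        prev = curr
    end
    return prev[end]
end

# Hypothetical helper: 1 minus the distance divided by the longer length
# (the normalization is an assumption made for this sketch).
function similarity_sketch(s1::AbstractString, s2::AbstractString)
    maxlen = max(length(s1), length(s2))
    maxlen == 0 && return 1.0        # two empty strings are treated as identical
    return 1.0 - levenshtein_sketch(s1, s2) / maxlen
end

println(levenshtein_sketch("martha", "marhta"))  # 2: two substitutions
println(similarity_sketch("martha", "martha"))   # 1.0
println(similarity_sketch("martha", "marhta"))   # 1 - 2/6
```

Under this assumption, identical strings score 1.0 and the score drops as more edits are needed relative to the string length.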
findnearest returns the value and index of the element in itr with the highest similarity score with s. Its syntax is:
    findnearest(s, itr, dist::StringDistance)
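A short usage sketch, assuming the package is installed and the signature shown above. The candidate list is illustrative data, not part of the package.

```julia
using StringDistances  # assumes the package is installed

candidates = ["NewYork", "Newark", "San Francisco"]  # illustrative data

# The result is destructured into the best-matching value and its index in `candidates`.
value, index = findnearest("New York", candidates, Levenshtein())

println(value)
println(index)
```

Here "NewYork" is one edit away from "New York" while "Newark" is three edits away, so the first candidate is expected to be returned.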
findall returns the indices of all elements in itr whose similarity score with s is higher than a minimum value (defaults to 0.8). Its syntax is:
    findall(s, itr, dist::StringDistance; min_score = 0.8)
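The effect of min_score can be sketched as follows, assuming the package is installed and the signature shown above. The candidate list and the stricter threshold of 0.9 are illustrative choices.

```julia
using StringDistances  # assumes the package is installed

candidates = ["NewYork", "Newark", "San Francisco"]  # illustrative data

# Default threshold (min_score = 0.8).
hits = findall("New York", candidates, Levenshtein())

# A stricter, illustrative threshold keeps fewer candidates.
strict = findall("New York", candidates, Levenshtein(); min_score = 0.9)

println(hits)
println(strict)
```

If the Levenshtein score is normalized by the longer string's length (an assumption), "NewYork" scores 1 - 1/8 = 0.875 and "Newark" scores 1 - 3/8 = 0.625, so only the first candidate passes the default threshold and none pass 0.9.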
findall are particularly optimized for
DamerauLevenshtein distances (as well as their modifications via