Hi Graydon,
Gerrit has already mentioned fingerprinting techniques. If your time is limited, it may be sufficient to apply full-text tokenization and Soundex to your strings:
let $get-fuzzy-match-value := function($x) {
  $x
  => ft:tokenize(map { 'stemming': true() })
  => distinct-values()
  => string-join()
  => strings:soundex()
}
for $x in //p
group by $key := $get-fuzzy-match-value($x)
return <similar-paragraphs key='{ $key }'>{ $x }</similar-paragraphs>
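To check which key each paragraph receives before grouping, a minimal sketch along the same lines could list the keys next to the text. It uses only the functions from the query above and assumes a BaseX version in which they are available under the ft and strings prefixes; the <key> element and its value attribute are made up for illustration.

```xquery
(: Sketch only: list each paragraph with its fuzzy-match key instead of grouping. :)
(: <key> and @value are hypothetical names, not part of any BaseX API. :)
for $p in //p
(: Tokenize with stemming, then drop repeated tokens. :)
let $tokens := distinct-values(ft:tokenize($p, map { 'stemming': true() }))
(: Join the tokens into one string and reduce it to a Soundex code. :)
let $key := strings:soundex(string-join($tokens))
return <key value='{ $key }'>{ string($p) }</key>
```

A classic Soundex code is only a letter followed by three digits, so unrelated paragraphs may end up with the same key. Inspecting the list this way helps to judge whether the grouping is fine-grained enough for the content at hand.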
Cheers, Christian
On Thu, Nov 12, 2020 at 12:53 AM Graydon Saunders graydonish@gmail.com wrote:
Hi Christian --
The content set of interest is some documentation which is being re-written to improve it. The idea is to identify paragraphs which are similar enough that they should have the same standard wording when re-written.
So with input of:
<document>
  <p>Under no circumstances should you rig an antenna during a thunderstorm.</p>
  <p>It is important to dis-connect the device from all power.</p>
  <p>You will need a number two phillips screwdriver.</p>
  <p>It is important to disconnect the devices from all power.</p>
  <p>You will need a #2 Phillips screwdriver.</p>
  <p>It is important to disconnect the devices from ALL power.</p>
  <p>Graphics card; do not eat.</p>
</document>
I'd want to be able to get output like:
<bucket>
  <similar-paragraphs>
    <p>It is important to dis-connect the device from all power.</p>
    <p>It is important to disconnect the devices from all power.</p>
    <p>It is important to disconnect the devices from ALL power.</p>
  </similar-paragraphs>
  <similar-paragraphs>
    <p>You will need a number two phillips screwdriver.</p>
    <p>You will need a #2 Phillips screwdriver.</p>
  </similar-paragraphs>
  <similar-paragraphs>
    <p>Under no circumstances should you rig an antenna during a thunderstorm.</p>
  </similar-paragraphs>
  <similar-paragraphs>
    <p>Graphics card; do not eat.</p>
  </similar-paragraphs>
</bucket>
Thanks! Graydon
On Wed, Nov 11, 2020 at 6:38 PM Christian Grün christian.gruen@gmail.com wrote:
Hi Graydon,
Could you add some sample input and the output you’d expect?
Thanks in advance,
Christian
On Thu, Nov 12, 2020 at 12:00 AM Graydon Saunders graydonish@gmail.com wrote:
Hello --
Is there some way to assign the abstraction of a fuzzy match to a variable, so that something like
for $x in //p
let $key := get-fuzzy-match-value($x)
group by $key
return <similar-paragraphs>{$x}</similar-paragraphs>
would be possible?
I'm supposing this is one of those things that's either easy or impossible.
Thanks! Graydon