Hello --
Is there some way to assign the abstraction of a fuzzy match to a variable, so that something like
    for $x in //p
    let $key := get-fuzzy-match-value($x)
    group by $key
    return <similar-paragraphs>{$x}</similar-paragraphs>
would be possible?
I'm supposing this is one of those things that's either easy or impossible.
Thanks! Graydon
Hi Graydon,
Could you add some example input and the output you’d expect?
Thanks in advance Christian
Hi Christian --
The content set of interest is some documentation which is being re-written to improve it. The idea is to identify paragraphs which are similar enough that they should have the same standard wording when re-written.
So with input of:
    <document>
      <p>Under no circumstances should you rig an antenna during a thunderstorm.</p>
      <p>It is important to dis-connect the device from all power.</p>
      <p>You will need a number two phillips screwdriver.</p>
      <p>It is important to disconnect the devices from all power.</p>
      <p>You will need a #2 Phillips screwdriver.</p>
      <p>It is important to disconnect the devices from ALL power.</p>
      <p>Graphics card; do not eat.</p>
    </document>
I'd want to be able to get output like:
    <bucket>
      <similar-paragraphs>
        <p>It is important to dis-connect the device from all power.</p>
        <p>It is important to disconnect the devices from all power.</p>
        <p>It is important to disconnect the devices from ALL power.</p>
      </similar-paragraphs>
      <similar-paragraphs>
        <p>You will need a number two phillips screwdriver.</p>
        <p>You will need a #2 Phillips screwdriver.</p>
      </similar-paragraphs>
      <similar-paragraphs>
        <p>Under no circumstances should you rig an antenna during a thunderstorm.</p>
      </similar-paragraphs>
      <similar-paragraphs>
        <p>Graphics card; do not eat.</p>
      </similar-paragraphs>
    </bucket>
Thanks! Graydon
Hi Graydon,
Gerrit has already mentioned fingerprinting techniques. If your time is limited, it may be sufficient to apply full-text tokenization and Soundex to your strings:
    let $get-fuzzy-match-value := function($x) {
      $x
      => ft:tokenize(map { 'stemming': true() })
      => distinct-values()
      => string-join()
      => strings:soundex()
    }
    for $x in //p
    group by $key := $get-fuzzy-match-value($x)
    return <similar-paragraphs key='{ $key }'>{ $x }</similar-paragraphs>
Cheers, Christian
On Thu, Nov 12, 2020 at 11:58:29AM +0100, Christian Grün wrote:
> Gerrit has already mentioned fingerprinting techniques. If your time is limited, it may be sufficient to apply full-text tokenization and Soundex to your strings:
>
>     let $get-fuzzy-match-value := function($x) {
>       $x
>       => ft:tokenize(map { 'stemming': true() })
>       => distinct-values()
>       => string-join()
>       => strings:soundex()
>     }
>     for $x in //p
>     group by $key := $get-fuzzy-match-value($x)
>     return <similar-paragraphs key='{ $key }'>{ $x }</similar-paragraphs>
I shall certainly give this a try!
Thank you, Christian! I continue to be astonished by the power and utility of this tool you've built.
-- Graydon
Graydon, spread the word!
This is probably difficult since in BaseX, fuzzy matching is implemented using the Levenshtein distance between two strings [1]. Therefore similarity is a relation between pairs of paragraphs rather than an intrinsic property of an individual paragraph.
You should look for content fingerprinting/clustering techniques.
[1] https://docs.basex.org/wiki/Full-Text#Fuzzy_Querying
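To illustrate the point about pairwise similarity, here is a minimal sketch in plain XQuery 3.1. It swaps in a word-set overlap (Jaccard) score for the Levenshtein distance, only because that can be written without any BaseX-specific functions. The helpers `local:words` and `local:jaccard`, the sample paragraphs, and the 0.5 threshold are all assumptions made for illustration.

```xquery
(: Hypothetical helper: the distinct lower-cased words of a string. :)
declare function local:words($s as xs:string) as xs:string* {
  distinct-values(tokenize(lower-case($s), '[^\p{L}\p{N}]+')[. ne ''])
};

(: Hypothetical helper: Jaccard overlap of two word sets, from 0.0 to 1.0.
   This is a stand-in for Levenshtein-based fuzzy matching, not the same measure. :)
declare function local:jaccard($a as xs:string, $b as xs:string) as xs:double {
  let $wa := local:words($a)
  let $wb := local:words($b)
  let $shared := count($wa[. = $wb])
  let $total := count(distinct-values(($wa, $wb)))
  return if ($total eq 0) then 1e0 else xs:double($shared) div $total
};

let $ps := (
  <p>It is important to dis-connect the device from all power.</p>,
  <p>It is important to disconnect the devices from all power.</p>,
  <p>Graphics card; do not eat.</p>
)
(: A score only exists for a pair, so every pair has to be compared. :)
for $a at $i in $ps, $b at $j in $ps
where $i lt $j
let $sim := local:jaccard($a, $b)
where $sim ge 0.5  (: arbitrary threshold, chosen for illustration :)
return <pair similarity="{ format-number($sim, '0.00') }">{ $a, $b }</pair>
```

Because the score belongs to a pair rather than to a single paragraph, there is no one value to hand to `group by`; turning the resulting pairs into groups would need a separate clustering step.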
Useful keywords; thank you!
Also more of a development effort than this project will support, alas. (Unless someone's willing to provide a pointer to their public release of such a solution, free for commercial use? Which doesn't seem a whole lot more likely than someone throwing a gold brick through my window.)
Maybe OpenRefine and particularly its clustering feature [1] can be useful. I don't have any first-hand experience with it though.
[1] https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
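The key-collision idea behind OpenRefine's "fingerprint" method can be sketched in plain XQuery 3.1, since it yields the kind of per-paragraph key that `group by` needs. This is a simplified approximation, not OpenRefine's implementation: it lower-cases the text, strips punctuation, and sorts the distinct words, and it skips the ASCII-folding step. The function name `local:fingerprint` and the sample input are assumptions.

```xquery
(: Hypothetical helper: a simplified fingerprint key for a string.
   Lower-case, drop punctuation and control characters, then sort
   the distinct words and join them with single spaces. :)
declare function local:fingerprint($s as xs:string) as xs:string {
  let $clean := replace(lower-case(normalize-space($s)), '[\p{P}\p{C}]', '')
  return string-join(sort(distinct-values(tokenize($clean, ' ')[. ne ''])), ' ')
};

let $doc :=
  <document>
    <p>It is important to disconnect the devices from all power.</p>
    <p>You will need a #2 Phillips screwdriver.</p>
    <p>It is important to disconnect the devices from ALL power.</p>
  </document>
return
  <bucket>{
    (: Paragraphs whose fingerprints collide end up in the same group. :)
    for $p in $doc/p
    group by $key := local:fingerprint($p)
    return <similar-paragraphs key="{ $key }">{ $p }</similar-paragraphs>
  }</bucket>
```

With this sample, the two "disconnect" paragraphs share a key because they differ only in case. Variants such as "device" versus "devices", or "number two" versus "#2", would still land in separate groups, which is where stemming or a phonetic code could be layered on top.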
On Thu, Nov 12, 2020 at 01:21:56AM +0100, Imsieke, Gerrit, le-tex wrote:
> Maybe OpenRefine and particularly its clustering feature [1] can be useful. I don't have any first-hand experience with it though.
> [1] https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
I shall be careful about standing close to the window!
Thank you; that looks like a good candidate for a long term solution.
-- Graydon
On Wed, 2020-11-11 at 18:57 -0500, Graydon Saunders wrote:
> Useful keywords; thank you!
The late Gerard Salton of Cornell (I think Cornell) pioneered a lot of ideas in text similarity and clustering, using vector cosines. His idea was to consider each text as a point in an n-dimensional space, where the dimensions are given by the set of distinct words in the corpus, and then to measure the angle between the lines from the origin to any two given texts.
Similarity done this way has a lot of problems, one of which is that "dictionary.txt" turns out to be "similar" to every other document.
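As a rough sketch of the vector-space idea, the following plain XQuery 3.1 treats each text as a map from word to count and computes the cosine of the angle between two such vectors. Using raw word counts as weights is an assumption made for brevity, and `local:tf` and `local:cosine` are hypothetical helper names.

```xquery
(: Hypothetical helper: word-count vector of a string, as a map. :)
declare function local:tf($s as xs:string) as map(xs:string, xs:integer) {
  map:merge(
    for $w in tokenize(lower-case($s), '[^\p{L}\p{N}]+')[. ne '']
    group by $word := $w
    (: After grouping, $w holds every occurrence of $word. :)
    return map:entry($word, count($w))
  )
};

(: Hypothetical helper: cosine of the angle between two word-count vectors. :)
declare function local:cosine(
  $a as map(xs:string, xs:integer),
  $b as map(xs:string, xs:integer)
) as xs:double {
  (: Dot product over the words of $a; words missing from $b count as 0. :)
  let $dot := sum(for $k in map:keys($a) return $a($k) * ($b($k), 0)[1])
  let $norm := function($m as map(xs:string, xs:integer)) as xs:double {
    math:sqrt(sum(map:for-each($m, function($k, $v) { $v * $v })))
  }
  let $len := $norm($a) * $norm($b)
  return if ($len eq 0) then 0e0 else $dot div $len
};

let $a := 'You will need a number two phillips screwdriver.'
let $b := 'You will need a #2 Phillips screwdriver.'
return local:cosine(local:tf($a), local:tf($b))
```

A result near 1 means the two texts use nearly the same words in similar proportions; a result of 0 means they share no words at all.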
In the past I've done something similar to your problem using an algorithm like:
    for each text t_i
      for each word w in t_i (in order)
        for each document d in the collection that contains w
          link { from: t_i, to: d, value: 1 }
Then repeat for phrases of two words, three words, four words, where value is the square of the number of words in the phrase, and then add the values for each (t_i, d) pair, and take the biggest.
But this is not a fast algorithm.
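The scoring loop described above could be sketched in plain XQuery 3.1 roughly as follows. The helper names, the word-splitting rule, and the choice to report only the single best-scoring partner for each paragraph are assumptions; the squared phrase length as the weight follows the description.

```xquery
(: Hypothetical helper: lower-cased words of a string, in order. :)
declare function local:words($s as xs:string) as xs:string* {
  tokenize(lower-case($s), '[^\p{L}\p{N}]+')[. ne '']
};

(: Hypothetical helper: all runs of $n consecutive words, joined by spaces. :)
declare function local:phrases($words as xs:string*, $n as xs:integer) as xs:string* {
  for $i in 1 to count($words) - $n + 1
  return string-join(subsequence($words, $i, $n), ' ')
};

(: Score of text $t against document $d: every 1- to 4-word phrase of $t
   that also occurs in $d contributes the square of its length. :)
declare function local:score($t as xs:string, $d as xs:string) as xs:integer {
  let $tw := local:words($t)
  let $dw := local:words($d)
  return xs:integer(sum(
    for $n in 1 to 4
    let $in-d := local:phrases($dw, $n)
    for $phrase in local:phrases($tw, $n)
    where $phrase = $in-d
    return $n * $n
  ))
};

let $ps := (
  <p>You will need a number two phillips screwdriver.</p>,
  <p>You will need a #2 Phillips screwdriver.</p>,
  <p>Graphics card; do not eat.</p>
)
for $t at $i in $ps
(: Compare $t with every other paragraph and keep the highest score. :)
let $best := (
  for $d at $j in $ps
  let $score := local:score($t, $d)
  where $i ne $j
  order by $score descending
  return <match score="{ $score }">{ $d }</match>
)[1]
return <candidate>{ $t, $best }</candidate>
```

Every paragraph is compared with every other one, and the phrase lists are rebuilt for each comparison, so the cost grows quadratically with the number of paragraphs.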
Faster might be just to take each of your input paragraphs as an "all words" query - "Candidate similar paragraphs: ... [see more]"
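In BaseX, which implements XQuery Full Text, that "all words" idea might look like the sketch below: each paragraph's text becomes a query, and every other paragraph containing all of its words is listed as a candidate. The `using stemming` option, the element names `query` and `candidates`, and the sample data are assumptions, and how well this works depends on the tokenizer and stemmer in use.

```xquery
let $doc :=
  <document>
    <p>It is important to disconnect the devices from all power.</p>
    <p>It is important to disconnect the devices from ALL power.</p>
    <p>Graphics card; do not eat.</p>
  </document>
for $query in $doc/p
(: Every other paragraph that contains all words of the current one. :)
let $candidates :=
  $doc/p[not(. is $query)]
        [. contains text { string($query) } all words using stemming]
return <query>{ $query, <candidates>{ $candidates }</candidates> }</query>
```

The match is one-directional: a candidate must contain every word of the query paragraph, so a longer paragraph can turn up as a candidate for a shorter one without the reverse being true.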
Liam
Hello Graydon,
These blog posts discuss various algorithms for finding near-duplicate documents, their performance, and XQuery (MarkLogic dialect) implementations:
https://stuartmyles.blogspot.com/2012/10/longest-common-substring-in-xquery-... https://stuartmyles.blogspot.com/2012/10/longest-common-substring-in-xquery-...
Depending on your constraints, maybe some of these ideas could help?
Victor
On Thu, Nov 12, 2020 at 09:30:47AM +0100, Victor / tokiop wrote:
> Hello Graydon,
> These blog posts discuss various algorithms for finding near-duplicate documents, their performance, and XQuery (MarkLogic dialect) implementations:
> https://stuartmyles.blogspot.com/2012/10/longest-common-substring-in-xquery-... https://stuartmyles.blogspot.com/2012/10/longest-common-substring-in-xquery-...
> Depending on your constraints, maybe some of these ideas could help?
Thank you; I'll take a look at those.
-- Graydon
basex-talk@mailman.uni-konstanz.de