grouping by fuzzy match?

Graydon Saunders

11 Nov 2020

6 p.m.

Hello --

Is there some way to assign the abstraction of a fuzzy match to a variable, so that something like

for $x in //p
   let $key := get-fuzzy-match-value($x)
   group by $key
   return <similar-paragraphs>{$x}</similar-paragraphs>

would be possible?

I'm supposing this is one of those things that's either easy or impossible.

Thanks!
Graydon

Christian Grün

11 Nov 2020

6:37 p.m.

Hi Graydon,

Could you add some sample input and the output you’d expect?

Thanks in advance,
Christian

Graydon Saunders

6:52 p.m.

Hi Christian --

The content set of interest is some documentation which is being re-written to improve it. The idea is to identify paragraphs which are similar enough that they should have the same standard wording when re-written.

So with input of:

<document>
  <p>Under no circumstances should you rig an antenna during a thunderstorm.</p>
  <p>It is important to dis-connect the device from all power.</p>
  <p>You will need a number two phillips screwdriver.</p>
  <p>It is important to disconnect the devices from all power.</p>
  <p>You will need a #2 Phillips screwdriver.</p>
  <p>It is important to disconnect the devices from ALL power.</p>
  <p>Graphics card; do not eat.</p>
</document>

I'd want to be able to get output like:

<bucket>
  <similar-paragraphs>
    <p>It is important to dis-connect the device from all power.</p>
    <p>It is important to disconnect the devices from all power.</p>
    <p>It is important to disconnect the devices from ALL power.</p>
  </similar-paragraphs>
  <similar-paragraphs>
    <p>You will need a number two phillips screwdriver.</p>
    <p>You will need a #2 Phillips screwdriver.</p>
  </similar-paragraphs>
  <similar-paragraphs>
    <p>Under no circumstances should you rig an antenna during a thunderstorm.</p>
  </similar-paragraphs>
  <similar-paragraphs>
    <p>Graphics card; do not eat.</p>
  </similar-paragraphs>
</bucket>

Thanks!
Graydon

Christian Grün

12 Nov 2020

5:58 a.m.

Hi Graydon,

Gerrit has already mentioned fingerprinting techniques. If your time is limited, it may be sufficient to apply full-text tokenization and Soundex to your strings:

let $get-fuzzy-match-value := function($x) {
  $x
  => ft:tokenize(map { 'stemming': true() })
  => distinct-values()
  => string-join()
  => strings:soundex()
}
for $x in //p
group by $key := $get-fuzzy-match-value($x)
return <similar-paragraphs key='{ $key }'>{ $x }</similar-paragraphs>

Cheers, Christian
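
For readers who want to try the idea outside BaseX, here is a rough Python analogue of the tokenize, de-duplicate, join, Soundex pipeline. It is a sketch under assumptions, not a port: it does no stemming, the names `soundex`, `fuzzy_key` and `group_by_key` are made up for illustration, and the Soundex here is a plain American Soundex written by hand rather than the BaseX function.

```python
import re
from collections import defaultdict

# Letter-to-digit table for American Soundex (vowels, h, w and y have no digit).
_SOUNDEX_DIGITS = {
    letter: digit
    for digit, letters in {
        "1": "bfpv", "2": "cgjkqsxz", "3": "dt",
        "4": "l", "5": "mn", "6": "r",
    }.items()
    for letter in letters
}


def soundex(text: str) -> str:
    """American Soundex of the ASCII letters in `text` ("" if there are none)."""
    letters = [c for c in text.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _SOUNDEX_DIGITS.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate letters that share a digit
        digit = _SOUNDEX_DIGITS.get(c, "")
        if digit and digit != prev:
            code += digit
            if len(code) == 4:
                break
        prev = digit
    return code.ljust(4, "0")


def fuzzy_key(paragraph: str) -> str:
    """Hypothetical grouping key: distinct lower-cased tokens, joined, Soundex-ed."""
    tokens = re.findall(r"[a-z0-9]+", paragraph.lower())
    distinct = dict.fromkeys(tokens)  # drop repeats, keep first-seen order
    return soundex("".join(distinct))


def group_by_key(paragraphs):
    """Bucket paragraphs by their fuzzy key."""
    groups = defaultdict(list)
    for p in paragraphs:
        groups[fuzzy_key(p)].append(p)
    return dict(groups)


PARAGRAPHS = [
    "Under no circumstances should you rig an antenna during a thunderstorm.",
    "It is important to dis-connect the device from all power.",
    "You will need a number two phillips screwdriver.",
    "It is important to disconnect the devices from all power.",
    "You will need a #2 Phillips screwdriver.",
    "It is important to disconnect the devices from ALL power.",
    "Graphics card; do not eat.",
]
```

On the sample paragraphs `group_by_key(PARAGRAPHS)` yields four buckets. Note that a Soundex code is only four characters long, so in this sketch the key is decided by the first letter and the next few consonants of the paragraph; paragraphs that merely open the same way will land in the same bucket.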

Graydon

7:59 a.m.

I shall certainly give this a try!

Thank you, Christian! I continue to be astonished by the power and utility of this tool you've built.

-- Graydon

Hans-Juergen Rennau

8:38 a.m.

Graydon, spread the word!

Imsieke, Gerrit, le-tex

11 Nov 2020

6:42 p.m.

This is probably difficult since in BaseX, fuzzy matching is implemented using the Levenshtein distance between two strings [1]. Therefore similarity is a relation between pairs of paragraphs rather than an intrinsic property of an individual paragraph.

You should look for content fingerprinting/clustering techniques.

[1] https://docs.basex.org/wiki/Full-Text#Fuzzy_Querying
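
To make the "relation between pairs" point concrete, here is a minimal Python sketch, assuming a hand-written Levenshtein distance and a greedy clustering pass. The 0.75 threshold and the function names are assumptions chosen for illustration. Because similarity is pairwise, each paragraph has to be compared against existing clusters; there is no key that can be computed from a paragraph on its own.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; 1.0 means identical."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b)) or 1
    return 1 - levenshtein(a, b) / longest


def cluster(paragraphs, threshold=0.75):
    """Greedy pass: join the first cluster whose first member is close enough."""
    clusters = []
    for p in paragraphs:
        for c in clusters:
            if similarity(p, c[0]) >= threshold:
                c.append(p)
                break
        else:
            clusters.append([p])  # nothing close enough: start a new cluster
    return clusters


PARAGRAPHS = [
    "Under no circumstances should you rig an antenna during a thunderstorm.",
    "It is important to dis-connect the device from all power.",
    "You will need a number two phillips screwdriver.",
    "It is important to disconnect the devices from all power.",
    "You will need a #2 Phillips screwdriver.",
    "It is important to disconnect the devices from ALL power.",
    "Graphics card; do not eat.",
]
```

With this threshold `cluster(PARAGRAPHS)` produces four groups for the sample input. The result depends on input order and on the threshold, and the pairwise comparisons scale poorly on large collections, which is why fingerprinting is attractive.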

Graydon Saunders

6:57 p.m.

Useful keywords; thank you!

Also more of a development effort than this project will support, alas. (Unless someone's willing to provide a pointer to their public release of such a solution, free for commercial use? Which doesn't seem a whole lot more likely than someone throwing a gold brick through my window.)

Imsieke, Gerrit, le-tex

7:21 p.m.

Maybe OpenRefine and particularly its clustering feature [1] can be useful. I don't have any first-hand experience with it though.

[1] https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
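
As a point of comparison, a key-collision "fingerprint" of the general kind used for clustering can be sketched in a few lines of Python. This is an illustration only and not OpenRefine's code; the function name and the exact normalisation steps are assumptions.

```python
import re
import unicodedata


def fingerprint(text: str) -> str:
    """Hypothetical key: ASCII-folded, lower-cased, punctuation-free,
    with the distinct tokens sorted and joined by single spaces."""
    text = unicodedata.normalize("NFKD", text.strip().lower())
    text = text.encode("ascii", "ignore").decode("ascii")  # drop accents
    text = re.sub(r"[^\w\s]", "", text)                    # drop punctuation
    return " ".join(sorted(set(text.split())))


A = "It is important to dis-connect the device from all power."
B = "It is important to disconnect the devices from all power."
C = "It is important to disconnect the devices from ALL power."
```

Here `fingerprint(B)` and `fingerprint(C)` collide, since they differ only in case, while `fingerprint(A)` stays distinct because "device" and "devices" are different tokens; some form of stemming would be needed to merge them.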

Graydon

12 Nov 2020

7:56 a.m.

I shall be careful about standing close to the window!

Thank you; that looks like a good candidate for a long term solution.

-- Graydon

Liam R. E. Quin

11 Nov 2020

9:25 p.m.

The late Gerard Salton of Cornell (I think Cornell) pioneered a lot of ideas in text similarity and clustering using vector cosines. His idea was to consider each text as a point in an n-dimensional space, where the dimensions are given by the set of distinct words in the corpus, and then to measure the hypothetical angle between lines from the origin to any two given texts.

Similarity done this way has a lot of problems, one of which is that "dictionary.txt" turns out to be "similar" to every other document.
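
The vector-cosine idea can be sketched in Python as follows. This is a minimal bag-of-words version with raw word counts and no term weighting; the tokenisation rule and the function names are assumptions for illustration.

```python
import math
import re
from collections import Counter


def word_vector(text: str) -> Counter:
    """Bag of words: each distinct lower-cased token is one dimension."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: str, b: str) -> float:
    """Cosine of the angle between the word-count vectors of two texts."""
    va, vb = word_vector(a), word_vector(b)
    dot = sum(count * vb[word] for word, count in va.items())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


P1 = "Under no circumstances should you rig an antenna during a thunderstorm."
P2 = "It is important to dis-connect the device from all power."
P4 = "It is important to disconnect the devices from all power."
P6 = "It is important to disconnect the devices from ALL power."
P7 = "Graphics card; do not eat."
```

Texts with the same words score close to 1 (`cosine(P4, P6)`), texts with no words in common score 0 (`cosine(P1, P7)`), and near-duplicates such as `P2` and `P4` fall in between.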

In the past I've done something similar to your problem using an algorithm like:

  for each text t_i
    for each word w in t_i (in order)
      for each document d in the collection that contains w
        link { from: t_i, to: d, value: 1 }

Then repeat for phrases of two words, three words, four words, where value is the square of the number of words in the phrase, and then add the values for each t_i, d pair, and take the biggest.

But this is not a fast algorithm.
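
One way to read the phrase-linking procedure above is sketched here in Python. It is an interpretation, not the original implementation: the tokenisation, the function names and the choice to count every matching phrase occurrence are assumptions.

```python
import re


def tokens(text: str):
    """Lower-cased word tokens, in order."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(words, n):
    """All runs of n consecutive words, as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def link_score(text: str, doc: str, max_words: int = 4) -> int:
    """Sum of n*n for every n-word phrase of `text` that also occurs in `doc`."""
    t, d = tokens(text), tokens(doc)
    total = 0
    for n in range(1, max_words + 1):
        doc_phrases = set(ngrams(d, n))
        hits = sum(1 for phrase in ngrams(t, n) if phrase in doc_phrases)
        total += n * n * hits  # longer shared phrases weigh more
    return total


def best_match(index: int, paragraphs) -> int:
    """Index of the other paragraph with the highest link score."""
    scores = {
        j: link_score(paragraphs[index], d)
        for j, d in enumerate(paragraphs)
        if j != index
    }
    return max(scores, key=scores.get)


PARAGRAPHS = [
    "Under no circumstances should you rig an antenna during a thunderstorm.",
    "It is important to dis-connect the device from all power.",
    "You will need a number two phillips screwdriver.",
    "It is important to disconnect the devices from all power.",
    "You will need a #2 Phillips screwdriver.",
    "It is important to disconnect the devices from ALL power.",
    "Graphics card; do not eat.",
]
```

For example, `best_match(3, PARAGRAPHS)` picks index 5, the paragraph that differs only in case. Every paragraph is scored against every other one, which illustrates why this approach gets slow on a large collection.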

Faster might be just to take each of your input paragraphs as an "all words" query - "Candidate similar paragraphs: ... [see more]"

Liam

-- Liam Quin, https://www.delightfulcomputing.com/ Available for XML/Document/Information Architecture/XSLT/ XSL/XQuery/Web/Text Processing/A11Y training, work & consulting. Barefoot Web-slave, antique illustrations: http://www.fromoldbooks.org

Victor / tokiop

12 Nov 2020

3:30 a.m.

Hello Graydon,

These blog posts discuss various algorithms for finding near-duplicate documents, their performance, and XQuery (MarkLogic dialect) implementations:

https://stuartmyles.blogspot.com/2012/10/longest-common-substring-in-xquery-... https://stuartmyles.blogspot.com/2012/10/longest-common-substring-in-xquery-...

Depending on your constraints, some of these ideas might help.

Victor

Graydon

7:57 a.m.

Thank you; I'll take a look at those.

-- Graydon

basex-talk@mailman.uni-konstanz.de

12 comments

7 participants

tags (0)

participants (7)

Christian Grün
Graydon
Graydon Saunders
Hans-Juergen Rennau
Imsieke, Gerrit, le-tex
Liam R. E. Quin
Victor / tokiop