Hello --
Is there some way to iterate full-text matches to mark up in the node every found member of a sequence of phrases?
I have a use case where I have a long list of phrases which may appear in the content set; if they do appear in the content set, these should be marked.
The order of the list is significant; the longest phrases should be marked first.
The idea would be to iterate through the list, marking up the node with any matches.
However, so far as I can tell, full text only works directly on a database node. If I try to pass the node in a function, I get "No database node" errors from attempting full-text operations.
(Making a copy of the DB with a full text index, applying the changes, updating the indexes, and continuing seems possible but also inherently inefficient.)
Is there a straightforward way to do this that I'm missing?
Thanks! Graydon
On Fri, 2020-05-08 at 14:52 -0400, Graydon Saunders wrote:
The idea would be to iterate through the list, marking up the node with any matches.
Can you instead use standoff markup? E.g. store positions of start and end as word counts, and then merge them later?
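A minimal sketch of the standoff idea above, assuming plain whitespace-separated words and non-empty phrases. The names local:hits and the hit element are made up for illustration; punctuation attached to a word (e.g. "feet,") would defeat the match here, which is exactly the kind of detail a real tokenizer would handle.

```xquery
(: Standoff hits for one phrase: 1-based word positions of its first and last word.
   Assumes $phrase is non-empty; matching is case-insensitive. :)
declare function local:hits(
  $text   as xs:string,
  $phrase as xs:string
) as element(hit)* {
  let $words  := tokenize(lower-case(normalize-space($text)), ' ')
  let $wanted := tokenize(lower-case(normalize-space($phrase)), ' ')
  let $n      := count($wanted)
  for $start in 1 to count($words) - $n + 1
  where deep-equal(subsequence($words, $start, $n), $wanted)
  return <hit phrase="{ $phrase }" start="{ $start }" end="{ $start + $n - 1 }"/>
};

let $text    := 'I saw his naked hooves and his bare feet'
let $phrases := ('naked hooves', 'hooves', 'bare feet')
let $hits    := $phrases ! local:hits($text, .)
(: Longest spans first, then document order. :)
let $ordered :=
  for $h in $hits
  order by xs:integer($h/@end) - xs:integer($h/@start) descending,
           xs:integer($h/@start)
  return $h
(: Merge step: keep a hit only if it overlaps nothing kept so far. :)
return fold-left($ordered, (), function($kept, $h) {
  if (some $k in $kept satisfies
        xs:integer($k/@start) le xs:integer($h/@end) and
        xs:integer($h/@start) le xs:integer($k/@end))
  then $kept
  else ($kept, $h)
})
```

With this input, the hit for "hooves" (word 5) overlaps the longer "naked hooves" (words 4 to 5) and is dropped, leaving "naked hooves" and "bare feet" (words 8 to 9). Turning the surviving positions back into markup is the positional arithmetic that still has to be done separately.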
On Sun, May 10, 2020 at 03:35:45AM -0400, Liam R. E. Quin scripsit:
Can you instead use standoff markup? E.g. store positions of start and end as word counts, and then merge them later?
In principle, yes. But then I would have to be smart and extract the positions correctly somehow and then get all the positional arithmetic correct.
The attraction of the full-text index was a combination of speed and being able to let some other smarter person handle the "does the match still work if there's a line break? bunches of tabs?" issues.
I now think this just isn't a full-text use case; I was trying to think of a way to use something optimized for single-pass search to support recursion on the changed content and that loses all the attractive optimizations. Nothing says I can't use analyze-string and recursion.
Thanks!
-- Graydon
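The analyze-string and recursion approach mentioned above could look roughly like the following sketch. It uses only standard XQuery 3.1 functions, no full-text index. The function name local:mark-phrase, the mark element, and the sample phrases are assumptions for illustration; sorting by length stands in for "the order of the list is significant". Note that it matches substrings rather than whole words and assumes non-empty phrases.

```xquery
(: Wrap every occurrence of $phrase in the text nodes of $nodes in <mark>.
   Content already inside a <mark> is copied untouched, so a shorter phrase
   applied later cannot re-mark part of a longer match. :)
declare function local:mark-phrase(
  $nodes  as node()*,
  $phrase as xs:string
) as node()* {
  (: Escape regex metacharacters, then let any run of whitespace
     (spaces, tabs, line breaks) match each space in the phrase. :)
  let $escaped := replace(normalize-space($phrase), '[\\.\[\]{}()*+?^$|\-]', '\\$0')
  let $regex   := replace($escaped, ' ', '\\s+')
  for $node in $nodes
  return typeswitch ($node)
    case element(mark) return $node
    case element() return
      element { node-name($node) } {
        $node/@*,
        local:mark-phrase($node/node(), $phrase)
      }
    case text() return
      (: 'i' makes the match case-insensitive :)
      for $part in analyze-string($node, $regex, 'i')/*
      return
        if ($part instance of element(fn:match))
        then <mark>{ string($part) }</mark>
        else text { string($part) }
    default return $node
};

let $phrases := ('walsingham', 'sir francis walsingham')
let $input :=
  <p>Sir Francis
     Walsingham wrote to Walsingham's clerk.</p>
let $longest-first :=
  for $p in $phrases
  order by string-length($p) descending
  return $p
(: Each pass takes the output of the previous one as its input. :)
return fold-left($longest-first, $input, local:mark-phrase#2)
```

The fold does the iteration: the long phrase is marked first (across the line break), and the second pass only marks the remaining, unmarked "Walsingham".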
Take a look at exist-stanford-nlp on my GitHub, in particular the code for the named entity recognition:
https://github.com/lcahlander/exist-stanford-nlp/blob/master/src/main/xquery...
Loren Cahlander
On Sun, 2020-05-10 at 10:12 -0400, Graydon wrote:
I now think this just isn't a full-text use case;
In the past i used a text retrieval package i wrote to solve the problem of inserting links automatically, choosing the longest & avoiding overlaps.
I use some multi-threaded procedural code i wrote years ago in Perl to do it on e.g. https://words.fromoldbooks.org/Chalmers-Biography/w/walsingham-sir-francis.h...
Recently i was thinking about rewriting this, perhaps in XSLT and/or XQuery, to try and keep the most "relevant" link rather than the longest, with a different UI. The Perl script takes maybe two minutes to run on approx. 200 MBytes of HTML (10,000 files). But i'd need a good definition of relevant.
I regret that my efforts to get more full text researchers interested in joining the XQuery full text work failed - but then i think one of them may have been Sergey Brin, and he had other interests :) - as markup-informed ranking of results ought to be really interesting. On the other hand maybe Full Text would have become even more complex :)
Liam
Hi Graydon,
Thanks for sharing your use case.
However, so far as I can tell, full text only works directly on a database node. If I try to pass the node in a function, I get "No database node" errors from attempting full-text operations.
You can convert an XML node to the internal “database node” representation by applying a dummy operation:
  let $xml := <xml>hello world</xml> update {}
  return ft:mark($xml[text() contains text 'hello'])
Does this already help? See the Wiki articles [1,2] for some revised information.
I have already asked myself in the past whether we shouldn’t include a function that exposes internal result positions to the user. Suggestions are welcome.
Best, Christian
[1] https://docs.basex.org/wiki/Full-Text_Module#ft:mark
[2] https://docs.basex.org/wiki/Database_Module#Database_Nodes
On Mon, May 11, 2020 at 07:51:02PM +0200, Christian Grün scripsit:
Hi Christian --
Does this already help? See the Wiki articles [1,2] for some revised information.
That helps and I will check, but not this week. (This part of the current deadline has been addressed by clubbing the problem with a rock, er, xsl:iterate, and that'll do for now. The general pattern is something I need to do a lot so I'll be coming back to this.)
Thank you very much for the docs update!
I have already asked myself in the past whether we shouldn’t include a function that exposes internal result positions to the user. Suggestions are welcome.
The thing that I would most want to see is some way to capture multi-word matches using full-text search; "full phrase search", in effect. I can see that as the start and end of a range of internal result positions but will admit to wanting something less at risk of my arithmetic errors.
-- Graydon
The thing that I would most want to see is some way to capture multi-word matches using full-text search; "full phrase search", in effect. I can see that as the start and end of a range of internal result positions but will admit to wanting something less at risk of my arithmetic errors.
Providing access to the starts and ends may be difficult due to all the logical operators that can be used (ftor, ftand, ftnot, not in). A simple example:
  let $xml := <_>a b c d</_> update {}
  return ft:mark($xml[text() contains text 'b c' ftand 'c d'])
We could possibly make available the full data structures that need to be generated internally. I fear people wouldn’t really work with them, as they are fairly complex (a look into the specification may give you an impression of that [1]).
But thanks for your thoughts, I’ll let them grow.
[1] https://www.w3.org/TR/xpath-full-text-10/#FTOperatorsSemanticsSec
On Mon, 2020-05-11 at 22:29 +0200, Christian Grün wrote:
Providing access to the starts and ends may be difficult due to all the logical operators that can be used
A way to go from ($input, $phrases) to an $input augmented with db:milestone elements each containing starts="0 7 23" ends="2 6 18" attributes (where the numbers are positional in the sequence of phrases) might be good. Or the milestone element could include the phrase,
  I saw his <db:milestone> <db:start ref="3">naked hooves</db:start> <db:start ref="6">unshod</db:start> </db:milestone>bare<db:milestone> <db:end ref="6" /></db:milestone> feet....
as two problems are (1) overlapping results, and (2) query expansion using a thesaurus and/or stemming.
Liam
basex-talk@mailman.uni-konstanz.de