I’m searching for short phrases where I may want to respect order or not and where the phrases may cross element boundaries.
For example, I have the phrase “Amazon Alexa Spoke” and I want to find any DITA topic whose title text includes “Amazon Alexa Spoke” in that order, or maybe I want those words in any order, depending on my search requirements.
When I run this query against my database I find occurrences where all three words are in the same parent element, i.e.:
<title>Create a connection record for the <ph>Amazon Alexa spoke</ph> </title>
<title>Create a credential record for the <ph>Amazon Alexa spoke</ph> </title>
<title>Set up the <ph>Amazon Alexa spoke</ph> </title>
But I do not find titles where one of the words sits in a different parent element. For example, this title is *not* found, even though it is the one I actually want:
<title><ph id="alexa">Amazon Alexa</ph> Spoke</title>
Reading the docs on ft:search(), it is clear that it is searching on text nodes:
“Returns all text nodes from the full-text index…”
So I think the behavior here is as documented.
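A minimal sketch of that per-text-node behavior, using the standard `contains text` expression on inline sample data rather than ft:search() against a database (the `$titles` variable and the sample elements are illustrative only). It compares matching each text node separately with matching the whole string value of the title element. As far as I know, the element-level form is evaluated on the string value and so is not expected to be answered from the full-text index.

```xquery
(: Inline sample data; no database or index is involved here. :)
let $titles := (
  <title>Set up the <ph>Amazon Alexa spoke</ph> </title>,
  <title><ph id="alexa">Amazon Alexa</ph> Spoke</title>
)
return (
  (: Per text node: the phrase must sit inside a single text node. :)
  count($titles[.//text() contains text 'Amazon Alexa Spoke']),
  (: Per element: the title's whole string value is tokenized. :)
  count($titles[. contains text 'Amazon Alexa Spoke'])
)
```

The first count should cover only the title whose three words share one text node; the second should cover both titles, because the string value of the second title reads "Amazon Alexa Spoke" once the element boundary is ignored.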
Short of creating a separate database that removes the subelements within <title> elements, is there a way to use full text indexing to do the search I want? In particular, I want to be able to turn the ordered/unordered check on or off.
If I always wanted an ordered match I could just use a regular expression. That would not be very efficient, but efficiency is not a concern in this particular case (though I can see it mattering in a more general search support situation).
Or am I missing a more obvious solution to this requirement?
Note that in this case I don’t care about finding different word forms—for this particular search I only care about exact word matches.
Cheers,
E.
_____________________________________________
Eliot Kimber
Sr Staff Content Engineer
O: 512 554 9368
M: 512 554 9368
servicenow.com (https://www.servicenow.com)
LinkedIn (https://www.linkedin.com/company/servicenow) | Twitter (https://twitter.com/servicenow) | YouTube (https://www.youtube.com/user/servicenowinc) | Facebook (https://www.facebook.com/servicenow)
I found at least a partial solution, which is to search for the phrase using “any word” to get a set of candidate title elements and then use string matching on the candidates to find the ones that match. This works for ordered matches but an unordered match requires more sophisticated regex fu or some other way of matching the string. It’s not as fast as a pure full text search would be but is acceptably fast for my current application (which is an ad-hoc text analysis rather than a query done through a web application).
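The two-step approach can be sketched roughly as follows. This is an illustration under assumptions, not the exact query I ran: the database name 'dita' and the function local:matches are hypothetical, and the database is assumed to have a full-text index. Instead of a regex, the second step compares token sequences produced by ft:tokenize(), which also handles the unordered case.

```xquery
(: Hypothetical helper: does the title's full string value contain the
   words of $phrase, either as a contiguous ordered run or in any order? :)
declare function local:matches(
  $title   as element(),
  $phrase  as xs:string,
  $ordered as xs:boolean
) as xs:boolean {
  let $want := ft:tokenize($phrase)
  let $have := ft:tokenize(string($title))
  return if ($ordered) then (
    (: contiguous run of the wanted tokens, in the given order :)
    some $i in 1 to count($have) - count($want) + 1
    satisfies deep-equal(subsequence($have, $i, count($want)), $want)
  ) else (
    (: every wanted token occurs somewhere in the title :)
    every $w in $want satisfies $w = $have
  )
};

let $phrase := 'Amazon Alexa Spoke'
(: Step 1: the index finds text nodes containing any of the words;
   'dita' is an assumed database name. :)
let $candidates :=
  ft:search('dita', $phrase, map { 'mode': 'any word' })/ancestor::title
(: Step 2: check each candidate title as a whole. :)
for $title in $candidates
where local:matches($title, $phrase, true())
return $title
```

Passing false() as the third argument switches to the unordered check. With its default options ft:tokenize() lowercases tokens and does no stemming, so the comparison is case-insensitive but otherwise matches exact word forms.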
In the case of searching over DITA content specifically, it probably makes sense to have a dedicated title index or maybe a “title and block elements” index. I’ll have to think on that more.
Cheers,
E.
From: Eliot Kimber eliot.kimber@servicenow.com
Date: Monday, December 4, 2023 at 6:00 PM
To: basex-talk@mailman.uni-konstanz.de
Subject: Using ft:search() across element boundaries: possible?