Hi,
Sorry if this is too basic, but I’m trying to get the positions of the matched tokens in a full-text query, and I can’t find the way to do it. I imagine something like:
  for $sentence in //sentence
  where $sentence[text() contains text { 'DNA', 'oxidation' }]
  return <positions>{ ft:SOME-FUNCTION-FOR-TOKENS-POSITIONS($sentence[text() contains text { 'DNA', 'oxidation' }]) }</positions>
Is this possible?
Thank you in advance,
Javier
Hi Javier,
Thanks for your mail.
It's currently not possible to directly access the position information that is internally used for computing the results. The reasons are manifold:
* The positions do not reflect the actual substring anymore. Instead, we enumerate all tokens that remain after normalizing the input (i.e., after the removal of stopwords, stemming, etc.). So, in practice, it is difficult to map this positional information back to the original input.
* The positions can stretch over several elements (for example, the following query yields true: <x>X<y/>Z</x> contains text "XZ")
* The data structures containing the positions can potentially consume lots of space, so they are usually discarded after the result is returned.
What would you like to do with the information? Maybe you have seen the ft:mark and ft:extract functions [1]; do they help at all?
Christian
[1] http://docs.basex.org/wiki/Full-Text_Module#ft:mark
(In reply to Javier Couto's message of Wed, Nov 26, 2014, 12:35 PM.)
Hi Christian,
Many thanks for your answer. I already use ft:mark to bold the terms, and it works great, but I need to sort the results (the "sentence" elements) by the distance between the match and the beginning of the sentence (or a specific word position). So if I search for "DNA" in the following sentences:
1. <sentence id="1.1.122.1.122">The translated protein showed weak DNA binding with a specificity for the kappa B binding motif.</sentence>
2. <sentence id="54.1.5.1.698">Using this assay system, we have evaluated the contributions of ligand binding and heat activation to DNA binding by these glucocorticoid receptors.</sentence>
3. <sentence id="2.1.17.1.79">2.5 Mesocosm DNA extraction and purification</sentence>
I need the results in the order 3, 1, 2. A sentence element always contains only text. I was going to implement a function to do something like:
  for $sentence in //sentence
  where $sentence[text() contains text 'DNA']
  order by local:distance($sentence, 'DNA')
  return $sentence
The distance function could also be called as local:distance($sentence, 'DNA', position_to_compare), with position_to_compare defaulting to 1. If there are several matches, I take the minimum distance.
Do you have any idea whether there is a way to do this with BaseX?
Thank you again.
Best,
Javier
(In reply to Christian Grün's message of 26/11/2014, 14:02.)
Hi Javier,
One function you could try is ft:tokenize. Please have a look at the attached example.
Hope this helps?
Christian
________________________________________
let $term := ft:tokenize('DNA')
for $sentence in <sentences>
  <sentence id="1.1.122.1.122">The translated protein showed weak DNA binding with a specificity for the kappa B binding motif.</sentence>
  <sentence id="54.1.5.1.698">Using this assay system, we have evaluated the contributions of ligand binding and heat activation to DNA binding by these glucocorticoid receptors.</sentence>
  <sentence id="2.1.17.1.79">2.5 Mesocosm DNA extraction and purification</sentence>
</sentences>/sentence
order by index-of(ft:tokenize($sentence), $term)[1]
return $sentence
Hi Christian,
Yes, it helps, thank you! I will try this approach. Two last questions:
1. Does the ft:tokenize function tokenize on the fly, or are the tokens read from the full-text index? It seems that tokens are stored for the whole document, but are they also stored for each text element? I'm wondering whether I can improve performance by pre-computing the tokenized version of each sentence and storing it in the database.
2. I guess that if I search for something like { 'DNA', 'oxidation' }, I need to compute the distance for each term using index-of, right?
Best,
Javier
(In reply to Christian Grün's message of 26/11/2014, 16:18.)
Hi Javier,
- The ft:tokenize function tokenizes on-the-fly or tokens are stored in the full text index?
Tokenization is done on the fly. It would actually take much longer to find the corresponding tokens for a text in the index. Moreover, you can tokenize arbitrary input strings. The following examples return true:
ft:tokenize("Naïve") = "naive"
deep-equal( ft:tokenize(<div><b>H</b>ello! (Everyone)</div>), ('hello', 'everyone') )
Tokenization is very fast in BaseX. The following query takes approx. 200 ms on my machine:
prof:time(prof:void(
  for $i in 1 to 1000000
  return ft:tokenize("Amidst the vogue enjoyed by existentialism and positivism in early 20th-century Europe, Adorno advanced a dialectical conception of natural history that critiqued the twin temptations of ontology and empiricism through studies of Kierkegaard and Husserl.")
))
But you are completely right that the post-processing may be too slow if you need to order thousands or millions of index results. In this case, you could play around with the internally computed score value:
for $sentence score $score in //sentence[text() contains text { 'DNA', 'oxidation' }]
order by $score descending
return $sentence
The scoring model of BaseX takes into consideration the number of found terms, their frequency in a text, and the length of a text. The shorter the input text is, the higher scores will be (cited from [1]). Distances between words are not considered so far, though (volunteer implementors are welcome ;).
- I guess that if I search something like { 'DNA', 'oxidation' }, I need to compute the distance for each term using index-of, isn't it?
Exactly, that's one way. You can do all kinds of things with the returned tokens and, consequently, their positions in the sequence. Please check out the attached example for some more complex distance computations (it uses fold-left etc., none of which may actually be required, so don't be frightened ;).
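As a rough sketch of the idea (not the attached example itself), the minimum distance for several search terms could be computed along the following lines. The function name local:distance and its parameters are hypothetical and simply follow the naming used in the question; only ft:tokenize, index-of, abs and min are existing functions. Sentences without any match are assumed to sort last.

```xquery
(: Hypothetical helper: smallest distance between any occurrence of any
   search term and a reference token position (1 = start of the sentence). :)
declare function local:distance(
  $sentence as xs:string,
  $terms    as xs:string*,
  $position as xs:integer
) as xs:double {
  let $tokens := ft:tokenize($sentence)
  let $distances :=
    for $term in $terms
    (: normalize each term the same way as the sentence text :)
    for $token in ft:tokenize($term)
    for $pos in index-of($tokens, $token)
    return abs($pos - $position)
  return
    (: assumption: sentences without any match get an infinite distance :)
    if (empty($distances)) then xs:double('INF')
    else xs:double(min($distances))
};

let $sentences :=
  <sentences>
    <sentence id="1.1.122.1.122">The translated protein showed weak DNA binding with a specificity for the kappa B binding motif.</sentence>
    <sentence id="54.1.5.1.698">Using this assay system, we have evaluated the contributions of ligand binding and heat activation to DNA binding by these glucocorticoid receptors.</sentence>
    <sentence id="2.1.17.1.79">2.5 Mesocosm DNA extraction and purification</sentence>
  </sentences>
for $sentence in $sentences/sentence
order by local:distance($sentence, 'DNA', 1)
return $sentence
```

With the three sentences from this thread and the single term 'DNA', this should yield the order 3, 1, 2. Passing ('DNA', 'oxidation') as the second argument would take the minimum over the positions of both terms.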
If you only want to retrieve results in which the queried words occur within a maximum distance of each other, you could also try the 'distance' and 'window' keywords [2].
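A minimal sketch of such a positional filter is shown below. The two sample sentences are made up for illustration, and the limit of 2 words is an arbitrary choice:

```xquery
(: Made-up sample data, only to illustrate the positional filter. :)
let $sentences :=
  <sentences>
    <sentence id="a">Oxidation of DNA was observed.</sentence>
    <sentence id="b">DNA was extracted, purified, stored for weeks and only later subjected to oxidation.</sentence>
  </sentences>
for $sentence in $sentences/sentence
(: keep only sentences in which both terms occur close to each other :)
where $sentence/text() contains text { 'DNA', 'oxidation' } all words distance at most 2 words
return $sentence
```

This should return only the first sentence, as the two terms are far apart in the second one. Note that the filter only restricts the result set; it does not order the results by distance.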
Hope this helps,
Christian
[1] http://docs.basex.org/wiki/Full-Text#Scoring
[2] http://docs.basex.org/wiki/Full-Text#Positional_Filters
Hi Christian,
I have a problem and I cannot understand what’s happening.
I have set UPDINDEX to true and then created a database with some XML files in the BaseX GUI. When I look at the DB properties, both Up-to-date and UPDINDEX are true (this is what I want). But if I add some more XML files via Add Resources in the Properties window, the Up-to-date flag becomes false (!), and I cannot understand why.
Any hint?
I first ran into this problem using the Java API, so I decided to test it in the GUI, and I got the same behavior.
Thank you in advance,
Javier
PS: I am using BaseX 7.9. Since everything is working fine (besides the UPDINDEX issue), I am hesitant to upgrade.
Hi Javier,
But if I add some more XML files, using the Add Resources at the Properties window, the Up-to-date flag becomes false (!) and I cannot understand why.
The reason is that UPDINDEX does not update all available index structures and statistics in BaseX. I have added an explanatory line in [1].
However, I assume that your queries will still take advantage of the value index structures. Have you checked the query info?
Christian
[1] http://docs.basex.org/wiki/Index#Updates
Hi Christian,
Thank you for your quick answer, and sorry for the question; I had read the documentation, though clearly not very carefully.
I have a full-text index. Now I can see why the flag was set to false after each update. And, indeed, the queries still take advantage of the other indexes.
Best,
Javier
Hi,
Sorry if this is too basic but I can’t find specific information to resolve it.
I am trying to install BaseX version 7.9 on Debian (wheezy) using apt-get or aptitude, but the only version available seems to be 7.6. Is this correct?
Thank you in advance,
Javier
Hi Javier,
You can find version 8.1.1 (as well as 7.9) in testing (https://packages.debian.org/source/testing/basex); 7.9 should be in stable (https://packages.debian.org/source/stable/basex).
Please note that, due to the package management, the versions in the repositories will always lag behind our own releases. We just released version 8.2, so you might simply want to start with that. Just download the ZIP distribution and start bin/basexgui (or bin/basex for CLI only); it really does not require much setup.
Cheers,
Dirk
(In reply to Javier Couto's message of Fri, May 22, 2015, 3:19 PM.)
basex-talk@mailman.uni-konstanz.de