Hi,
I would like to set up a collection of TEI-annotated texts (novels, dramas, poems, etc.). In total, it would be around 3 GB of XML data in some 1000 files; file sizes vary from 29 KB to 94 MB. I have a server running Java 1.6.0_07 on CentOS 5.7 in a virtual machine with 1 GB RAM.
I started to add files to the database and wrote a preliminary query interface (http://oldphras.unibas.ch/cgi-bin/basex-client.pl). Since we want to look for examples of multi-word units, I would like to use queries like:
//(p|l) [text() contains text "Korb geben" using stemming using language "de"]
(In the end, queries will be more complex to allow users to search for several words in different word order within a sentence using stemming or fuzzy)
To make inspection of results easier, I added ft:mark. A collection with only a dozen of texts of about 71 MB with full text index for German, optimized, etc. works quite well. However, the example query needs more than 9s, which is rather slow.
What is worse: Adding more files, resulting in about 323 MB, causes a timeout when running the query. I already set the memory for the Java VM to 1024, but it does not help.
I tried it with the GUI on my iMac with 4 GB RAM and got a timeout once the collection size exceeded 900 MB (which is still only a small part of my data).
Is there any recommendation for size of RAM or specific settings when processing collections of about 3 GB?
Is there a better way to write queries when looking for inflected forms of several words and allowing for spelling errors?
Thank you in advance
Cerstin
Dear Cerstin,
thanks for your e-mail, and the detailed information on your use case.
To make inspection of results easier, I added ft:mark. A collection with only a dozen of texts of about 71 MB with full text index for German, optimized, etc. works quite well. However, the example query needs more than 9s, which is rather slow.
First of all, it might be interesting to hear what the query compiler does. Have you looked at the QueryInfo panel to check out if the full-text index is applied? If yes, you should find something like..
Compiling:
- ...
- applying full-text index
- ...
..in the info panel.
Another hint: to enable index optimizations, your query...
//(p|l) [text() contains text "Korb geben" using stemming using language "de"]
..may have to be rewritten as follows:
//*[text() contains text "Korb geben"][self::p or self::l]
Please note that the available main memory shouldn't make a big difference. During the execution of a query, you can click on the memory indicator in the lower right corner of the GUI in order to get some feedback how much memory is currently needed.
Feel free to ask for more, Christian
Dear Christian,
thanks for your quick answer.
Quoting Christian Grün christian.gruen@gmail.com:
To make inspection of results easier, I added ft:mark. A collection with only a dozen of texts of about 71 MB with full text index for German, optimized, etc. works quite well. However, the example query needs more than 9s, which is rather slow.
First of all, it might be interesting to hear what the query compiler does. Have you looked at the QueryInfo panel to check out if the full-text index is applied? If yes, you should find something like..
Compiling:
- ...
- applying full-text index
- ...
..in the info panel.
The interesting thing is:
for my original query
//(p|l) [text() contains text "Korb geben" using stemming using language "de"]
there is no information on compiling, only information on Timing, Result (number of results) and Queryplan.
If I change '(p|l)' to 'p', I get information on Compiling, but only:
Compiling:
- optimizing descendant-or-self step(s)
Result: root()/descendant::{http://www.tei-c.org/ns/1.0}p[text() contains text "Korb geben"]
Apparently, the index is not used.
Another hint: to enable index optimizations, your query...
//(p|l) [text() contains text "Korb geben" using stemming using language "de"]
..may have to be rewritten as follows:
//*[text() contains text "Korb geben"][self::p or self::l]
OK, this results in reducing total time by half, but I see only:
Compiling:
- optimizing descendant-or-self step(s)
Result: root()/descendant::*[text() contains text "Korb geben"][self::{http://www.tei-c.org/ns/1.0}p or self::{http://www.tei-c.org/ns/1.0}l]
Also memory used is reduced a bit, so this definitely helps. However, if I include 'using stemming using language "de"', total time is almost the same.
I see no possibility to enforce using the index. I use BaseX 7.0.2, maybe this is a bug? I will try the Beta 7.1.
Best regards
Cerstin
Quoting Cerstin Mahlow cerstin.mahlow@unibas.ch:
I will try the Beta 7.1.
And now everything runs smoothly: I had to use the bigger machine for creating and indexing the collection (now consisting of 677 documents with 1.9 GB input size resulting in a 2.1 GB collection).
Opening the collection in the GUI on my small server takes 208941 ms -- probably because it has 1 GB RAM only.
But now I can run queries in reasonable time (http://oldphras.unibas.ch/cgi-bin/basex-client.pl)
I would be interested -- mainly for development and debugging -- to access information concerning query processing (i.e., information displayed in the "Query Info" buffer in the gui) with the client. I use the Perl API.
Best Regards
Cerstin
And now everything runs smoothly: I had to use the bigger machine for creating and indexing the collection (now consisting of 677 documents with 1.9 GB input size resulting in a 2.1 GB collection).
Fine; does it mean that the index is now recognized by the optimizer? As Maximilian stated (thanks!), the index won't be used if the query specifies match options that differ from the options the index was created with. Since version ~6.5, a query without explicit match options takes over the index options of the database, so you should be fine if you simply omit them from the query.
I would be interested -- mainly for development and debugging -- to access information concerning query processing (i.e., information displayed in the "Query Info" buffer in the gui) with the client. I use the Perl API.
That can be done by activating the QUERYINFO option and calling info(). A little example:
print $session->execute("set queryinfo on")."\n";
...
print $session->execute("xquery 1")."\n";
print $session->info()."\n";
...
Hope this helps, Christian
On 14.01.2012 at 00:50, Christian Grün wrote:
And now everything runs smoothly: I had to use the bigger machine for creating and indexing the collection (now consisting of 677 documents with 1.9 GB input size resulting in a 2.1 GB collection).
Fine; does it mean that the index is now recognized by the optimizer?
It's a bit strange: I tried to index and re-index and remove the index etc. with BaseX 7.0.2 several times, without changing my original query, and maybe one time out of 20 the index was used. Then I switched to BaseX 7.1 and, without changing anything, the index is always used. I tried CentOS Linux and Mac OS X with the same result.
I would be interested -- mainly for development and debugging -- to access information concerning query processing (i.e., information displayed in the "Query Info" buffer in the gui) with the client. I use the Perl API.
That can be done by activating the QUERYINFO option and calling info(). A little example:
print $session->execute("set queryinfo on")."\n";
...
print $session->execute("xquery 1")."\n";
print $session->info()."\n";
...
I have to admit, I don't quite get it and I didn't find relevant things in the examples.
The use case is this: the user types something into a text field, which is then used as the query text. I will extend this part to allow users to select several options, but for now it's only one. As a result I would like to see
- the number of results
- the time needed for executing the query and
- the results themselves (preferably one by one for adding additional information, reformatting, etc.; later I will add a second application to annotate each match and update the collection)
Here is my example code used on http://oldphras.unibas.ch/cgi-bin/basex-client.pl:
if ($querytext) {
    eval {
        # create session
        my $session = Session->new("localhost", 1984, "admin", "admin");
        # open database
        $session->execute("open Digibib-DTA");
        print $session->info()."\n";
        # create query instance
        $querytext = 'declare default element namespace "http://www.tei-c.org/ns/1.0"; ft:mark(//*[text() contains text "'.$querytext.'" using stemming using language "de"][self::p or self::l])';
        my $xquery = $session->query($querytext);

        # loop through all results
        my $count = 0;
        while ($xquery->more()) {
            $count++;
            my $find = $xquery->next();
            # merge marks separated only by whitespace; braces are used as
            # delimiters because the pattern itself contains a slash
            $find =~ s{</mark>(\s*)<mark.*?>}{$1}g;
            print "<div><b>$count</b>: ".$find."</div>\n";
        }

        # close query
        $xquery->close();
        # close session
        $session->close();
    };
}
I tried 'print $xquery->info()', but this seems to work with execute() only, not with more().
$session->execute("set queryinfo on");
my $xquery = $session->query($querytext);
print $xquery->execute();
print $xquery->info();
I would be very thankful for any hint.
Is there a possibility to store the namespace information somewhere else and not have to write it into every query?
Best regards
Cerstin
It's a bit strange: I tried to index and re-index and remove the index etc. with BaseX 7.0.2 several times, without changing my original query, and maybe one time out of 20 the index was used. Then I switched to BaseX 7.1 and, without changing anything, the index is always used. I tried CentOS Linux and Mac OS X with the same result.
Good to hear that the problems are resolved in the latest version.
print $session->execute("set queryinfo on")."\n";
...
print $session->execute("xquery 1")."\n";
print $session->info()."\n";
...
I have to admit, I don't quite get it and I didn't find relevant things in the examples.
No problem, I'll try to give some more details: by enabling the QUERYINFO option [1], you get detailed information on the query process, such as the compilation steps or the number of results. If you only want to know the time needed for evaluating the query, you can call $session->info() or $query->info() without setting the option mentioned above.
As a result I would like to see
- the number of results
- the time needed for executing the query and
- the results themselves (preferable one by one for adding additional information, reformating, etc., later I will add a second application to annotate each match and update the collection)
The attached perl client may give you the requested results. It creates the result representation directly within XQuery. This way, you'll get better performance, and may be more flexible when working with the evaluated results. Next, I'm also using $xquery->bind() for assigning the query terms, as this reduces the danger of risky query strings (e.g. including quotes) that could break your query.
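The attached client is not reproduced in this archive. As a rough sketch only, and not the attached code: the three requested items (number of results, elapsed time, the results themselves) can also be gathered on the Perl side by draining the query iterator. The names collect_results and MockQuery are hypothetical; the only assumption about the real query object is that it offers more() and next(), as in the client code quoted earlier in this thread. The time measured this way is wall-clock time in the client, so it includes transfer from the server.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Drain any iterator object that offers more() and next() and report
# the number of results, the elapsed wall-clock time in milliseconds,
# and the results themselves.
sub collect_results {
    my ($iter) = @_;
    my $start = time();
    my @results;
    while ($iter->more()) {
        push @results, $iter->next();
    }
    my $elapsed_ms = (time() - $start) * 1000;
    return { count => scalar(@results), ms => $elapsed_ms, results => \@results };
}

# Hypothetical stand-in for a query object, so that the sketch runs
# without a BaseX server.
package MockQuery;
sub new  { my ($class, @items) = @_; return bless { items => [@items] }, $class }
sub more { my ($self) = @_; return scalar @{ $self->{items} } }
sub next { my ($self) = @_; return shift @{ $self->{items} } }

package main;

my $report = collect_results(MockQuery->new('<p>eins</p>', '<l>zwei</l>'));
printf "%d results in %.1f ms\n", $report->{count}, $report->{ms};
print "$_\n" for @{ $report->{results} };
```

With a real session, the object returned by $session->query(...) would be passed in place of the mock.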
I tried 'print $xquery->info()', but this seems to work with execute() only, not with more().
True, $xquery->info() won't give you the compilation steps. In the future, we may include an extra database command that does nothing other than return information on query execution.
Is there a possibility to store the namespace information somewhere else and not have to write it into every query?
Currently, the easiest way is to use wildcards instead of explicitly specifying the namespace. (... /*:element).
Hope this helps, Christian
Hi Christian,
thanks for your quick answer.
On 16.01.2012 at 14:04, Christian Grün wrote:
As a result I would like to see
- the number of results
- the time needed for executing the query and
- the results themselves (preferable one by one for adding additional information, reformating, etc., later I will add a second application to annotate each match and update the collection)
The attached perl client may give you the requested results. It creates the result representation directly within XQuery. This way, you'll get better performance, and may be more flexible when working with the evaluated results.
Thanks. I guess, I cannot do everything directly within XQuery, e.g., extending marked elements to continuous marking, to make "<mark>Korb</mark> <mark>geben</mark>" to be "<mark>Korb geben</mark>" -- it will be more important for queries with ftand or ftor.
Next, I'm also using $xquery->bind() for assigning the query terms, as this reduces the danger of risky query strings (e.g. including quotes) that could break your query.
Thanks for this example. I would like to ask some further questions:
How do I create alternatives for the query? If a user types "A B C" and ticks "process as STRING", the query would be:
ft:mark(//*[text() contains text "A B C" using stemming using language "de"][self::*:p or self::*:l])
If the user ticks "process as AND", the query should be:
ft:mark(//*[text() contains text ("A" ftand "B" ftand "C") using stemming using language "de" distance at most 10 words][self::*:p or self::*:l])
I don't know how to process the input string to create the correct string for 'terms' and how to toggle queries (I would need the "distance at most 10 words" part for discontinuous queries only). Can I handle this by binding other variables, say 'distance', to a value dependent on user input, like:
ft:mark(//*[text() contains text { $term } using stemming using language "de" { $distance }][self::*:p or self::*:l])
How do I display the whole xquery?
Is there a possibility to store the namespace information somewhere else and not have to write it into every query?
Currently, the easiest way is to use wildcards instead of explicitly specifying the namespace. (... /*:element).
Ah, great, this makes everything a lot shorter.
Best
Cerstin
Thanks. I guess, I cannot do everything directly within XQuery, e.g., extending marked elements to continuous marking, to make "<mark>Korb</mark> <mark>geben</mark>" to be "<mark>Korb geben</mark>" -- it will be more important for queries with ftand or ftor.
Currently, the ft:mark() and ft:extract() functions are mainly used to highlight hits in search results, but we are always interested in extending our XQuery modules with helpful functions/additional arguments, so feel free to suggest new features (..but I cannot give any guarantee when a particular request will be implemented). For example, the latest snapshot contains two new functions ft:tokens() and ft:tokenize() [1], which have recently been requested.
How do I create alternatives for the query? If a user types "A B C" and ticks "process as STRING", the query would be:
ft:mark(//*[text() contains text "A B C" using stemming using language "de"][self::*:p or self::*:l])
If the user ticks "process as AND", the query should be:
ft:mark(//*[text() contains text ("A" ftand "B" ftand "C") using stemming using language "de" distance at most 10 words][self::*:p or self::*:l])
A query could look as follows:
declare variable $mode := 'STRING';
declare variable $input := 'This is a c b text';
declare variable $terms := 'A B C';

if($mode = 'STRING') then
  $input contains text { $terms } phrase
else if($mode = 'AND') then
  $input contains text { $terms } all words
else
  error((), 'Unknown search mode')
I don't know how to process the input string to create the correct string for 'terms' and how to toggle queries (I would need the "distance at most 10 words" part for discontinuous queries only). Can I handle this by binding other variables, say 'distance', to a value dependent on user input, like:
ft:mark(//*[text() contains text { $term } using stemming using language "de" { $distance }][self::*:p or self::*:l])
You can use variables to dynamically choose a distance; see e.g. here:
let $dist := 0 return 'a b c' contains text 'a' ftand 'c' distance at most $dist words
All the best, Christian
Hi Christian,
thank you very much for your helpful answers, I will split my answer:
Quoting Christian Grün christian.gruen@gmail.com:
A query could look as follows:
declare variable $mode := 'STRING';
declare variable $input := 'This is a c b text';
declare variable $terms := 'A B C';

if($mode = 'STRING') then
  $input contains text { $terms } phrase
else if($mode = 'AND') then
  $input contains text { $terms } all words
else
  error((), 'Unknown search mode')
In principle, this works very nicely. I chose to put together the query (i.e., the part starting with ft:mark) in perl and use the xquery for defining the result display.
There are two reasons: I will offer users to search by using stemming or by using fuzzy and for the "all words" using a certain distance, but for "phrase" this option would not be set. For me it is easier to handle these cases in perl :)
The second reason includes a further question:
I noticed, that it makes no difference for searching if the query is:
contains text {"A B C"} all words
or
contains text ("A" ftand "B" ftand "C")
However, when ft:mark is applied, the first (using "all words") marks only occurrences of "C", whereas the second (using "ftand") marks occurrences of "A", "B", and "C". Is this a feature or a bug of ft:mark?
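A minimal sketch of how the Perl-side query assembly described above might look; it is an illustration, not the code behind the query interface. The helper names xq_string and build_query and the parameter names are hypothetical. The AND mode is written with ftand rather than "all words", since that is the form reported above to highlight every term. The "using fuzzy" option is BaseX-specific, and whichever options are chosen still have to agree with the options of the full-text index for the index to be used.

```perl
use strict;
use warnings;

# Escape user input for use inside a double-quoted XQuery string literal:
# '&' would start an entity reference and '"' would end the literal.
sub xq_string {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/"/""/g;
    return qq{"$text"};
}

# Build the ft:mark() query for one search request.
#   input    => the words typed by the user
#   mode     => 'STRING' (phrase) or 'AND' (all words, any order)
#   match    => 'stemming' (default) or 'fuzzy'
#   distance => maximum distance in words, applied in AND mode only
sub build_query {
    my (%arg) = @_;
    my $mode  = $arg{mode}  // 'STRING';
    my $match = $arg{match} // 'stemming';
    my @words = grep { length } split /\s+/, ($arg{input} // '');
    die "empty search input\n" unless @words;

    my $selection;
    if ($mode eq 'STRING') {
        $selection = xq_string(join ' ', @words);
    } elsif ($mode eq 'AND') {
        $selection = '(' . join(' ftand ', map { xq_string($_) } @words) . ')';
    } else {
        die "unknown search mode: $mode\n";
    }

    my $options = $match eq 'fuzzy'
        ? 'using fuzzy'
        : 'using stemming using language "de"';

    my $filter = '';
    if ($mode eq 'AND' and defined $arg{distance}) {
        die "distance must be a non-negative integer\n"
            unless $arg{distance} =~ /\A\d+\z/;
        $filter = " distance at most $arg{distance} words";
    }

    return 'ft:mark(//*[text() contains text ' . $selection . ' ' . $options
         . $filter . '][self::*:p or self::*:l])';
}

print build_query(input => 'Korb geben', mode => 'STRING'), "\n";
print build_query(input => 'Kopf Sand stecken', mode => 'AND', distance => 10), "\n";
```

Escaping the input by hand is only a fallback for queries assembled as strings; binding the terms with $xquery->bind(), as suggested earlier in the thread, avoids the problem altogether.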
Best regards
Cerstin
I noticed, that it makes no difference for searching if the query is:
contains text {"A B C"} all words or contains text ("A" ftand "B" ftand "C")
However, when ft:mark is applied, the first (using "all words") marks only occurrences of "C", whereas the second (using "ftand") marks occurrences of "A", "B", and "C". Is this a feature or a bug of ft:mark?
It might be surprising that the internal position representation of both queries is indeed different, but I completely agree that it's irritating that the "A" and "B" tokens are not highlighted by the first query. I've added a GitHub issue in order to remember this issue:
https://github.com/BaseXdb/basex/issues/337
To be continued, Christian
Hi Christian,
Quoting Christian Grün christian.gruen@gmail.com:
Thanks. I guess, I cannot do everything directly within XQuery, e.g., extending marked elements to continuous marking, to make "<mark>Korb</mark> <mark>geben</mark>" to be "<mark>Korb geben</mark>" -- it will be more important for queries with ftand or ftor.
Currently, the ft:mark() and ft:extract() functions are mainly used to highlight hits in search results, but we are always interested in extending our XQuery modules with helpful functions/additional arguments, so feel free to suggest new features (..but I cannot give any guarantee when a particular request will be implemented). For example, the latest snapshot contains two new functions ft:tokens() and ft:tokenize() [1], which have recently been requested.
I noticed the tokenizing features and will probably use them as well.
Highlighting occurrences of search terms is probably the perfect solution for most XML data. However, I use BaseX as a substitute for a corpus query workbench: the texts I have to deal with are TEI-annotated but lack linguistic annotation. Most of the texts are non-modern German, so applying state-of-the-art NLP tools is impossible. Therefore I cannot apply queries based on part of speech, syntactic structures, or lemmas.
The users will look for evidence of idiomatic phrases by trying to search for the main parts of such phrases. "den Kopf (nicht) in den Sand stecken" results in a query like "Kopf ftand Sand ftand stecken" -- since I don't have information on sentence boundaries, I use the "distance" option for controlling that the query terms probably appear within a sentence. For this usecase I would be interested to highlight the potential "phrase", i.e., starting with the first match until the last match.
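Until something like this exists in ft:mark itself, the span from the first to the last match could be approximated on the client side. The following is only a sketch under loud assumptions: the function name mark_span is hypothetical, the marks are assumed to be plain <mark> elements without attributes, and all of them are assumed to sit directly in the text of one element. If a mark is nested inside another element, or if one paragraph contains several independent hits, the result would be wrong or not well-formed.

```perl
use strict;
use warnings;

# Replace all <mark> elements in a serialized result by a single <mark>
# reaching from the start of the first match to the end of the last one.
sub mark_span {
    my ($xml) = @_;
    my ($open, $close) = ('<mark>', '</mark>');
    my $first = index($xml, $open);
    my $last  = rindex($xml, $close);
    # nothing to do if there is no complete mark
    return $xml if $first < 0 or $last < $first;

    my $before = substr($xml, 0, $first);
    my $inside = substr($xml, $first + length($open),
                        $last - $first - length($open));
    my $after  = substr($xml, $last + length($close));

    # drop the inner mark tags, keep their text
    $inside =~ s{</?mark>}{}g;
    return $before . $open . $inside . $close . $after;
}

print mark_span('<p>den <mark>Kopf</mark> in den <mark>Sand</mark> <mark>stecken</mark>.</p>'), "\n";
```

Working on the serialized string keeps the sketch short; a robust version would operate on the XML tree instead.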
I don't program in Java, so I cannot help in implementing such functionality, but I could help specifying and testing. The project I am working for, is located at the University of Basel, if there is a need to have an "official" cooperation, we could do this ;-)
Best
Cerstin
Hi Cerstin,
for testing purposes you could use the ft:search function (see http://docs.basex.org/wiki/Full-Text_Module#ft:search ). This automatically applies the correct options.
@Christian: I could not find it on the wiki, but if I remember correctly, the full-text index would not be used if the options used in the query do not match the options used when creating the database (wildcards, stemming, etc.).
Regards,
Maximilian
On 13 January 2012 at 16:05, Cerstin Mahlow cerstin.mahlow@unibas.ch wrote:
-- Dr. phil. Cerstin Mahlow
Universität Basel Deutsches Seminar Nadelberg 4 4051 Basel Schweiz
Tel: +41 61 267 07 65 Fax: +41 61 267 34 40 Mail: cerstin.mahlow@unibas.ch Web: http://www.oldphras.net