Hi,
I would like to set up a collection of TEI-annotated texts (novels, dramas, poems, etc.). In total, it would be around 3 GB of XML data in some 1000 files; file sizes range from 29 KB to 94 MB. I have a server running Java 1.6.0_07 on CentOS 5.7 in a virtual machine with 1 GB RAM.
I started to add files to the database and wrote a preliminary query interface (http://oldphras.unibas.ch/cgi-bin/basex-client.pl). Since we want to look for examples of multi-word units, I would like to use queries like:
//(p|l) [text() contains text "Korb geben" using stemming using language "de"]
(In the end, queries will be more complex, so that users can search for several words in any order within a sentence, using stemming or fuzzy matching.)
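To give an idea of where this is heading, here is a rough sketch of such a query. It is only an illustration, not something I have benchmarked: it assumes the XQuery Full Text "all" option and the "same sentence" scope behave in BaseX as described in the specification, and the two search words are just the ones from the example above.

```xquery
(: Sketch only: both words must occur in the same sentence, in any order.
   The scope filter "same sentence" is an assumption about how the final
   queries will be restricted; no "ordered" filter is given, so word order is free. :)
//(p|l)[text() contains text { "Korb", "geben" } all
        using stemming using language "de"
        same sentence]
```

For spelling variants, I would presumably replace the stemming options with the BaseX-specific "using fuzzy" option, but I have not checked how the two interact.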
To make inspection of results easier, I added ft:mark. A collection of only a dozen texts (about 71 MB), with a full-text index for German, optimized, etc., works quite well. However, the example query takes more than 9 s, which is rather slow.
What is worse: adding more files, for a total of about 323 MB, causes a timeout when running the query. I already set the memory for the Java VM to 1024, but it does not help.
I also tried it with the GUI on my iMac with 4 GB RAM and got a timeout once the collection size exceeded 900 MB (which is still only a small part of my data).
Is there any recommendation for RAM size or specific settings when processing collections of about 3 GB?
Is there a better way to write queries when looking for inflected forms of several words and allowing for spelling errors?
Thank you in advance
Cerstin