[basex-talk] performance question

12 Jan 2012


Hi,
I would like to set up a collection of TEI-annotated texts (novels,
dramas, poems, etc.). In total, it would be around 3 GB of XML data in
some 1000 files; the file size varies from 29 KB to 94 MB. I have a
server running Java 1.6.0_07 on CentOS 5.7 in a virtual machine
with 1 GB RAM.
I started to add files to the database and wrote a preliminary query  
interface (http://oldphras.unibas.ch/cgi-bin/basex-client.pl). Since  
we want to look for examples of multi-word units, I would like to use  
queries like:
//(p|l) [text() contains text "Korb geben" using stemming using language "de"]
(In the end, queries will be more complex, to allow users to search for
several words in any order within a sentence, using stemming
or fuzzy matching.)
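A minimal sketch of what such a multi-word query might look like, using the standard XQuery Full Text operators `ftand` and `same sentence`. It assumes a database with a German full-text index is already opened as the query context, and it is only meant to illustrate the intent, not a tuned or tested query; whether the index can be used for this form is not something I have verified. Replacing `using stemming` with `using fuzzy` (a BaseX-specific match option) would be the variant for tolerating spelling errors.

```xquery
(: Sketch only: both words must occur in the same sentence, in any order. :)
(: Assumes the opened database is the context and was indexed for German. :)
//(p|l)[text() contains text ("Korb" ftand "geben")
          using stemming using language "de"
          same sentence]
```

Adding the `ordered` keyword after the match options would restrict hits to the given word order, which is not what is wanted here.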
To make inspection of results easier, I added ft:mark. A collection
of only a dozen texts (about 71 MB) with a full-text index for
German, optimized, etc. works quite well. However, the example query
needs more than 9 s, which is rather slow.
What is worse: adding more files, resulting in about 323 MB, causes a
timeout when running the query. I already set the memory for the Java
VM to 1024, but it did not help.
I tried it with the GUI on my iMac with 4 GB RAM and got a timeout
when the collection size was above 900 MB (which is still only a small
part of my data).
Is there any recommendation for the amount of RAM or for specific settings
when processing collections of about 3 GB?
Is there a better way to write queries when looking for inflected
forms of several words and allowing for spelling errors?
Thank you in advance,
Cerstin
-- 
Dr. phil. Cerstin Mahlow

Universität Basel
Deutsches Seminar
Nadelberg 4
4051 Basel
Schweiz

Tel:  +41 61 267 07 65
Fax: +41 61 267 34 40
Mail: cerstin.mahlow@unibas.ch
Web: http://www.oldphras.net

