Re: [basex-talk] big data performance

9 Sep 2020


      Hello list, hello Christian,
since I "definitely should" build a BaseX database from millions of TEI-XML files, I did so!
My first one consists of about 3.8 million files, roughly 25 GB in total.
Creating this first database took about 70 minutes, including the full-text index.
Searching for "Konstanz" in this dataset yields 6200 hits in 400 ms.
Wow, quite impressive! Really.
BTW, this is the corresponding XQuery I tried:
    declare variable $b := 'Konstanz';
    for $t in collection("Korpus01")//*[./text() contains text {$b}] 
    return
    <p>{ft:extract($t[./text() contains text {$b}]/text(), 'b', 155)}</p>
OK, this is promising indeed. So I tried to meet my next goal: 10 million files, ~70 GB of disk space.
Bad luck: creating the database fails because of insufficient memory while building the full-text index.
Since memory is limited, I did not try to increase the Java memory option further (it is currently
"-Xmx3g"). Instead I tried the other way round: creating additional databases. For each of them, this
process was as fast as in the first step. BaseX is fun...
But now, at this point, the hurdles are too high, at least for me. 
According to https://docs.basex.org/wiki/Databases#Access_Resources [1],
I modified the XQuery:
    declare variable $b := 'Konstanz';
    for $c in ('Korpus01', 'Korpus02')
    for $t in collection($c)//*[./text() contains text {$b}] 
    return
    <p>{ft:extract($t[./text() contains text {$b}]/text(), 'b', 155)}</p>
This gives results, but takes orders of magnitude longer than for just one database:
14000 hits in 690000 ms.
What's wrong with my approach: the XQuery I applied? Or my expectation of getting comparably
fast results from full-text searches across multiple databases?
Thanks again
Matthias
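
A possible variation on the two-database query above, offered as an unverified sketch. It assumes the slowdown comes from the database name being supplied through the variable $c, which may keep the optimizer from rewriting the path to a full-text index lookup. Here each database name is written as a string literal, exactly as in the single-database query; nothing else changes. Whether this restores the earlier timings has not been measured.

```xquery
declare variable $b := 'Konstanz';
(: Each collection() call names its database literally, as in the
   single-database query, instead of receiving it from a loop variable.
   Assumption: a literal name lets the optimizer use the full-text index. :)
for $t in (
  collection('Korpus01')//*[./text() contains text {$b}],
  collection('Korpus02')//*[./text() contains text {$b}]
)
return
  <p>{ft:extract($t[./text() contains text {$b}]/text(), 'b', 155)}</p>
```

Comparing the optimized query shown in the BaseX query info for both variants should indicate whether the full-text index is actually being used.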
...
Hi Matthias,
...
Can I give BaseX a try?
You definitely should ;) Maybe you can simply start off, download
BaseX and import your TEI directories. Some database limits are listed
here [1]. If you encounter problems with creating the full-text index
for your XML data, documents can also be split across multiple
databases.
What’s the total file size of your initial TEI documents?
Best,
Christian
[1] https://docs.basex.org/wiki/Statistics
On Thu, Sep 3, 2020 at 7:05 PM Matthias Schütze
matthias.schuetze@web.de wrote:
...
Hello BaseX list,
I'm completely new to BaseX and a bit overwhelmed by the resources I have found so far in the wiki.
So please forgive a novice asking for advice.
My question:
Is BaseX capable of handling TEI-XML files under the following circumstances?
  # of TEI-files: ~10^7
  # of directories in which these files are stored: ~10^5
  # of words in TEI/body to be indexed: ~5*10^9
  yearly increment: 10^9 words in about 10^6 files
The main concern is full-text search within TEI/body which must be performant:
users interact with the database searching full text.
Indexing the aforementioned amount of data should be achievable in
reasonable time, say:

initial indexing may take some days, if necessary
incremental(?) indexing of new data should be an overnight job

Can I give BaseX a try? Or should I look elsewhere?
Cheers,
Matthias
--------
[1] https://docs.basex.org/wiki/Databases#Access_Resources
