Hello list, hello Christian,
since I "definitely should" build a BaseX database from millions of TEI-XML files, I did so! My first one consists of about 3.8 million files in roughly 25 GB.
Creating this first database took about 70 minutes, including the full-text index. Searching for "Konstanz" in this dataset yields 6200 hits in 400 ms.
Wow, quite impressive! Really.
BTW, this is the corresponding XQuery I tried:

  declare variable $b := 'Konstanz';
  for $t in collection("Korpus01")//*[./text() contains text {$b}]
  return <p>{ft:extract($t[./text() contains text {$b}]/text(), 'b', 155)}</p>
OK, this is promising indeed. So I tried to meet my next goal: 10 million files, ~70 GB of disk space. Bad luck: creating the database fails because there is not enough memory while building the full-text index. Since memory is limited, I did not try to increase the Java memory option further (it is currently "-Xmx3g"). Instead I tried the other way round: creating additional databases. For each of them, this process was as fast as in the first step. BaseX is fun...
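For reference, splitting the corpus across several databases, each with its own full-text index, could be scripted roughly as follows. This is only a sketch: the directory paths are hypothetical placeholders, and it assumes that db:create accepts a directory as input together with the 'ftindex' option.

```xquery
(: Sketch only: the directory paths below are hypothetical placeholders. :)
let $parts := map {
  'Korpus01': '/data/tei/part01/',
  'Korpus02': '/data/tei/part02/'
}
for $name in map:keys($parts)
(: one database per part, each built with a full-text index :)
return db:create($name, $parts($name), (), map { 'ftindex': true() })
```

Keeping each part at roughly the size of the first database (about 25 GB) should keep index creation within the memory that was sufficient there.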
But now, at this point, the hurdles are too high, at least for me. According to https://docs.basex.org/wiki/Databases#Access_Resources [1], I modified the XQuery:

  declare variable $b := 'Konstanz';
  for $c in ('Korpus01', 'Korpus02')
  for $t in collection($c)//*[./text() contains text {$b}]
  return <p>{ft:extract($t[./text() contains text {$b}]/text(), 'b', 155)}</p>

This gives results, but takes orders of magnitude longer than for just one database: 14000 hits in 690000 ms.
What's wrong with my approach: the XQuery I applied? Or my expectation of getting comparably fast results from full-text searches across multiple databases?
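One possible explanation, stated here as an assumption rather than a confirmed diagnosis: when the database name is only known at runtime, as in collection($c), the optimizer may not be able to rewrite the path to use the full-text index, so the documents are scanned sequentially. A sketch that addresses the index explicitly through ft:search from the BaseX Full-Text Module:

```xquery
declare variable $b := 'Konstanz';

for $c in ('Korpus01', 'Korpus02')
(: ft:search returns the text nodes of database $c that match $b,
   looked up through that database's full-text index :)
for $hit in ft:search($c, $b)
(: the full-text predicate is re-applied so that ft:extract sees the
   match positions; whether this is required is an assumption :)
return <p>{ ft:extract($hit[. contains text { $b }], 'b', 155) }</p>
```

Note that this variant returns one <p> per matching text node, whereas the query above returns one <p> per element with a matching text child.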
Thanks again, Matthias
Hi Matthias,
> Can I give BaseX a try?
You definitely should ;) Maybe you can simply start off, download BaseX and import your TEI directories. Some database limits are listed here [1]. If you encounter problems with creating the full-text index for your XML data, documents can also be split across multiple databases.
What’s the total file size of your initial TEI documents?
Best, Christian
[1] https://docs.basex.org/wiki/Statistics
On Thu, Sep 3, 2020 at 7:05 PM Matthias Schütze matthias.schuetze@web.de wrote:
Hello BaseX list,
I'm completely new to BaseX and a bit overwhelmed by the resources I have found so far in the wiki. So please forgive me for asking for novice advice.
My question: Is BaseX capable of handling TEI-XML files under the following circumstances?
  # of TEI files: ~10^7
  # of directories in which these files are stored: ~10^5
  # of words in TEI/body to be indexed: ~5*10^9
  yearly increment: 10^9 words in about 10^6 files
The main concern is full-text search within TEI/body which must be performant: users interact with the database searching full text.
Indexing the aforementioned amount of data should be achievable in reasonable time, say:
- initial indexing may last some days, if necessary
- incremental(?) indexing of new data should be an overnight job
Can I give BaseX a try? Or should I look elsewhere?
Cheers, Matthias
-------- [1] https://docs.basex.org/wiki/Databases#Access_Resources