Hello list, hello Christian,

 

Since I "definitely should" build a BaseX database from millions of TEI-XML files, I did so!

My first one consists of about 3.8 million files, roughly 25 GB in total.

 

Creating this first database took about 70 minutes, including the full-text index.

Searching for "Konstanz" in this dataset yields 6200 hits in 400ms.

 

Wow, quite impressive! Really.

 

BTW, this is the corresponding XQuery I tried:

declare variable $b := 'Konstanz';

for $t in collection("Korpus01")//*[./text() contains text {$b}]

return

<p>{ft:extract($t[./text() contains text {$b}]/text(), 'b', 155)}</p>

 

OK, this is promising indeed. So I went for my next goal: 10 million files, ~70 GB of disk space.

Bad luck: creating the database fails because there is too little memory while building the full-text index.

Since memory is limited, I did not try to increase the Java memory option further (which is currently

"-Xmx3g"). Instead, I went the other way round: creating additional databases. For each of them,

this was as fast as in the first step. BaseX is fun...

 

But now, at this point, the hurdles are too high, at least for me.

According to https://docs.basex.org/wiki/Databases#Access_Resources

I modified the XQuery:

declare variable $b := 'Konstanz';

for $c in ('Korpus01', 'Korpus02')

for $t in collection($c)//*[./text() contains text {$b}]

return

<p>{ft:extract($t[./text() contains text {$b}]/text(), 'b', 155)}</p>

This gives results, but takes orders of magnitude longer than for just one database:

14000 hits in 690000ms.
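
One variant I have not benchmarked, as a sketch only: it rests on my assumption (not confirmed by the wiki page above) that the full-text index can only be applied when the database name is a literal known at compile time, rather than a value bound to $c at runtime. It uses nothing beyond the expressions from my two queries:

```xquery
declare variable $b := 'Konstanz';

(: Assumption: writing the database names as string literals lets the
   optimizer apply the full-text index in each branch, as it apparently
   does in the single-database query. :)
for $t in (
  collection('Korpus01')//*[./text() contains text { $b }],
  collection('Korpus02')//*[./text() contains text { $b }]
)
return
  <p>{ ft:extract($t[./text() contains text { $b }]/text(), 'b', 155) }</p>
```

If that assumption holds, each branch should behave like the single-database query; the obvious drawback is that the list of databases is hard-coded.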

 

What's wrong with my approach: the XQuery I applied? Or my expectation of getting comparably

fast results from full-text searches across multiple databases?

 

Thanks again

Matthias

 

 

> Hi Matthias,

>

> > Can I give BaseX a try?

>

> You definitely should ;) Maybe you can simply start off, download

> BaseX and import your TEI directories. Some database limits are listed

> here [1]. If you encounter problems with creating the full-text index

> for your XML data, documents can also be split across multiple

> databases.

>

> What’s the total file size of your initial TEI documents?

>

> Best,

> Christian

>

> [1] https://docs.basex.org/wiki/Statistics

>

>

>

> On Thu, Sep 3, 2020 at 7:05 PM Matthias Schütze

> <matthias.schuetze@web.de> wrote:

> >

> > Hello BaseX list,

> >

> > I'm completely new to BaseX and a bit overwhelmed by the resources found so far in the wiki.

> > So, please forgive me for asking for novice-level advice.

> >

> > My question:

> > Is BaseX capable of handling TEI-XML files under the following circumstances?

> > # of TEI-files: ~10^7

> > # of directories in which these files are stored: ~10^5

> > # of words in TEI/body to be indexed: ~5*10^9

> > yearly increment: 10^9 words in about 10^6 files

> >

> > The main concern is full-text search within TEI/body which must be performant:

> > users interact with the database searching full text.

> >

> > Indexing the aforementioned amount of data should be achievable in

> > reasonable time, say:

> > - initial indexing may last some days, if necessary

> > - incremental(?) indexing of new data should be an overnight job

> >

> > Can I give BaseX a try? Or should I look elsewhere?

> >

> > Cheers,

> > Matthias

> >