Hi Christian - I've built a new database, using the same data except that this time I stripped out the OCR'd word elements (called <wd/>).
My estimate of the <wd/> elements representing 85% of the data was wrong - they actually represent 96.5%. This means the database files have shrunk from 40GB to 1.5GB.
Instead of having ~1.5 billion nodes, the database now has ~78 million.
Reducing the problem space means the following XQuery - run in the BaseX 8.5 GUI - has gone from an average of 148,000 ms to 3,900 ms:
let $start := prof:current-ns()
let $void := prof:void(
  for $book in //book
  return
    <result>
      <book id="{$book/id/text()}"/>
      {
        for $page in $book/page
        return
          <page id="{$page/id/text()}">
          {
            for $article in $page/article
            return <article id="{$article/id/text()}"/>
          }
          </page>
      }
    </result>
)
let $end := prof:current-ns()
let $ms := ($end - $start) div 1000000
return $ms || ' ms'
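For what it's worth, the pattern here is: sample prof:current-ns() before and after, and wrap the construction in prof:void() so only evaluation is timed and serialisation is excluded. If I've understood the Profiling Module correctly, prof:time() can report roughly the same thing directly in the Info view - this is just a sketch of the navigation without the element construction, not something I actually ran:

prof:time(
  prof:void(
    for $book in //book
    return $book/page/article
  )
)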
This is good news. However, doesn't this point to an issue in how BaseX maintains its indexes? What I mean is that the <wd/> elements sit two levels below each <article/> - i.e. an <article/> contains <p/> elements, which in turn contain the <wd/> elements. If my XQuery doesn't care about the <wd/> and <p/> elements, why is it still affected by them?
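To make the layout concrete, here's a rough sketch of the nesting I mean (the element names are the real ones, the id values are just placeholders):

<book>
  <id>1</id>
  <page>
    <id>1-1</id>
    <article>
      <id>1-1-1</id>
      <p>
        <wd>...</wd>
        <wd>...</wd>
      </p>
    </article>
  </page>
</book>

The query only touches book, page, article and their id children - it never descends into <p/> or <wd/>.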
Thanks.
From: christian.gruen@gmail.com
Date: Tue, 5 Jul 2016 17:52:40 +0200
Subject: Re: [basex-talk] Improving performance in a 40GB database
To: james.hn.sears@outlook.com
CC: basex-talk@mailman.uni-konstanz.de
Hi James,
Individual OCR'd words on pages maybe comprise around 85% of the data - and I don't actually care about this data. So maybe if I just don't load these OCR'd words it will help? I haven't tried that yet, but ideally I'd like not to have to do it.
Some (more or less obvious) questions back:
- How large is the resulting XML document (around 15% of the original document)?
- How do you measure the time?
- Do you store the result on disk?
- How long does your query take if you wrap it into a count(...) or prof:void(...) function?
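As a rough illustration of that last point, something like

  count(for $book in //book return $book/page/article)

returns a single number, so the time you see excludes building and serialising the full result.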
Thanks in advance, Christian