Hi Christian - I've built a new database, using the same data except that this time I stripped out the OCR'd word elements (called <wd/>).
My estimate of the <wd/> elements representing 85% of the data was wrong - they actually represent 96.5%. This means the database files have shrunk from 40GB to 1.5GB.
Instead of having ~1.5 billion nodes, the database now has ~78 million.
Reducing the problem space means the following XQuery - run in the BaseX 8.5 GUI - has gone from an average of 148,000 ms to 3,900 ms:
let $start := prof:current-ns()
let $void := prof:void(
  for $book in //book
  return
    <result>
      <book id="{$book/id/text()}"/>
      {
        for $page in $book/page
        return
          <page id="{$page/id/text()}">
          {
            for $article in $page/article
            return <article id="{$article/id/text()}"/>
          }
          </page>
      }
    </result>
)
let $end := prof:current-ns()
let $ms := ($end - $start) div 1000000
return $ms || ' ms'
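For what it's worth, the pattern here is: sample prof:current-ns() before and after, and wrap the construction in prof:void() so only evaluation is timed and serialisation is excluded. If I've understood the Profiling Module correctly, prof:time() can report roughly the same thing directly in the Info view - this is just a sketch of the navigation without the element construction, not something I actually ran:

prof:time(
  prof:void(
    for $book in //book
    return $book/page/article
  )
)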
This is good news. However, doesn't this point to an issue in how BaseX maintains its indexes? What I mean is that the <wd/> elements sit two levels below each <article/> - i.e. an <article/> contains <p/> elements, which in turn contain the <wd/> elements. If my XQuery doesn't care about the <wd/> and <p/> elements, why is it still affected by them?
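To make the layout concrete, here's a rough sketch of the nesting I mean (the element names are the real ones, the id values are just placeholders):

<book>
  <id>1</id>
  <page>
    <id>1-1</id>
    <article>
      <id>1-1-1</id>
      <p>
        <wd>...</wd>
        <wd>...</wd>
      </p>
    </article>
  </page>
</book>

The query only touches book, page, article and their id children - it never descends into <p/> or <wd/>.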
Thanks.
From: christian.gruen@gmail.com
Date: Tue, 5 Jul 2016 17:52:40 +0200
Subject: Re: [basex-talk] Improving performance in a 40GB database
To: james.hn.sears@outlook.com
CC: basex-talk@mailman.uni-konstanz.de
Hi James,
Individual OCR'd words on pages maybe comprise around 85% of the data - and I don't actually care about this data. So maybe if I just don't load these OCR'd words it will help? I haven't tried that yet, but ideally I'd like not to have to do it.
Some (more or less obvious) questions back:
- How large is the resulting XML document (around 15% of the original document)?
- How do you measure the time?
- Do you store the result on disk?
- How long does your query take if you wrap it into a count(...) or prof:void(...) function?
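As a rough illustration of that last point, something like

  count(for $book in //book return $book/page/article)

returns a single number, so the time you see excludes building and serialising the full result.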
Thanks in advance, Christian