[basex-talk] Problem with Wikipedia database (or a more general namespace efficiency problem?)

27 Feb 2011


As both a stress test and an experiment, I created a database from a recent
complete (current pages) dump of English Wikipedia, a hefty 30.5 GB file.
I apparently don't have enough memory to build a full-text index over all of
that text, so I created the DB without one.
My first queries returned nothing until I realized that I needed to deal with
the namespace (ugh). Then I tried:
declare default element namespace "http://www.mediawiki.org/xml/export-0.4/";
//siteinfo
This contains a small amount of data and occurs only once in the document
(at /mediawiki/siteinfo). However, it's extremely slow (~33 seconds on my
system). The query plan is:
<IterPath>
  <Root/>
  <IterStep axis="child" test="*:mediawiki"/>
  <IterStep axis="child" test="*:siteinfo"/>
</IterPath>
Timing:
 - Parsing:  0.35 ms
 - Compiling:  0.22 ms
 - Evaluating:  33316.32 ms
 - Printing:  0.3 ms
 - Total Time:  33317.19 ms
My surmise is that millions of node names are being checked, rather than a
path index being used to jump straight to the matching node(s). Such a
simple query ought to be optimized properly. Another surmise is that it's
related to namespaces not being indexed (?). While personally I very much
dislike namespaces, they are common, and they have to be handled
efficiently.
To see if it made a difference, I also tried an explicitly named namespace
test:
declare namespace w="http://www.mediawiki.org/xml/export-0.4/";
//w:siteinfo
This results in:
<IterPath>
  <Root/>
  <IterStep axis="descendant" test="w:siteinfo"/>
</IterPath>
Timing:
 - Parsing:  0.33 ms
 - Compiling:  0.07 ms
 - Evaluating:  54288.51 ms
 - Printing:  0.3 ms
 - Total Time:  54289.23 ms
So performance is even worse.
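
Since siteinfo is known to sit at /mediawiki/siteinfo, the same lookup can
also be written with explicit child steps, so that nothing depends on how
the descendant step gets rewritten. A minimal sketch, reusing the w prefix
from the second query and assuming the database is the current context; no
timing is available for this variant, so whether it evaluates any faster is
an assumption, not a measurement:

```xquery
(: Bind the MediaWiki export namespace to the prefix w, as in the query above. :)
declare namespace w = "http://www.mediawiki.org/xml/export-0.4/";

(: Two explicit child steps from the document root, no descendant axis. :)
/w:mediawiki/w:siteinfo
```

Comparing its query plan and timings with the two above would show whether
the cost comes from the descendant step or from the name tests themselves.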
