As both a stress test and an experiment, I created a database from a recent complete (current page) dump of English Wikipedia, a hefty 30.5 GB file. I apparently don't have enough memory to build a full-text index over all of that text, so I created the DB without one.
My first tests came up empty until I realized that I needed to deal with the namespace (ugh). Then I tried:
declare default element namespace "http://www.mediawiki.org/xml/export-0.4/";
//siteinfo
The siteinfo element contains a small amount of data and occurs only once in the document (at /mediawiki/siteinfo). However, the query is extremely slow (~33 seconds on my system). The query plan is:
<IterPath>
  <Root/>
  <IterStep axis="child" test="*:mediawiki"/>
  <IterStep axis="child" test="*:siteinfo"/>
</IterPath>
Timing:
- Parsing: 0.35 ms
- Compiling: 0.22 ms
- Evaluating: 33316.32 ms
- Printing: 0.3 ms
- Total Time: 33317.19 ms
My surmise is that millions of node names are being checked rather than a path index being used to rapidly access the appropriate node(s). I don't think such a simple query should fail to be properly optimized. Another surmise is that it's related to namespaces not being indexed (?). While personally I very much dislike namespaces, they are common, and they have to be efficiently handled.
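For anyone who hits the same empty result, here is a minimal, self-contained sketch of how name tests interact with namespaces in standard XQuery. The inline document is a made-up miniature of the dump (the sitename value is only illustrative), and the prefix w is an arbitrary choice. It says nothing about speed; it only shows which name tests match.

```xquery
declare namespace w = "http://www.mediawiki.org/xml/export-0.4/";

(: Hypothetical miniature stand-in for the real dump. :)
let $doc := document {
  <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/">
    <siteinfo><sitename>Wikipedia</sitename></siteinfo>
  </mediawiki>
}
return (
  count($doc//siteinfo),    (: 0: an unprefixed name means "no namespace" here :)
  count($doc//w:siteinfo),  (: 1: prefix bound to the MediaWiki export namespace :)
  count($doc//*:siteinfo)   (: 1: wildcard matches the local name in any namespace :)
)
```

The first count is zero because, without a default element namespace declaration in the prolog, an unprefixed name test only matches elements in no namespace.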
To see whether the way the namespace is declared makes a difference, I also tried a query with an explicit namespace prefix:
declare namespace w="http://www.mediawiki.org/xml/export-0.4/";
//w:siteinfo
This results in:
<IterPath>
  <Root/>
  <IterStep axis="descendant" test="w:siteinfo"/>
</IterPath>

Timing:
- Parsing: 0.33 ms
- Compiling: 0.07 ms
- Evaluating: 54288.51 ms
- Printing: 0.3 ms
- Total Time: 54289.23 ms
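So performance is even worse: roughly 54 seconds versus 33, and this time the plan uses a descendant step instead of two child steps.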
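One possible follow-up, sketched here but not run against this database, would be to combine the explicit prefix with child-only steps, so that the query asks for neither a descendant traversal nor a wildcard name test. No timing is claimed for it; it might simply help show whether the axis or the name test accounts for the cost.

```xquery
declare namespace w = "http://www.mediawiki.org/xml/export-0.4/";

(: Child steps only, with fully qualified names.
   Evaluated against the database as the context, like the queries above. :)
/w:mediawiki/w:siteinfo
```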