I have a database with the following characteristics:
Database Properties Name: merge Size: 1165 MB Nodes: 45217101 Resources: 9 Timestamp: 30.10.2011 18:15:04
On my database, executing a query such as count(//summary[@totalDirs=0]) executes in about 30 seconds, returning a count of about 777,000 items. Executing count(//file[@length=0] also takes about 30 seconds, returning about 12,000.
The query plans look like:
<FNAggr name="count(item)"> <AxisPath> <RangeAccess data="merge" min="0" max="0" type="ATTRIBUTE"/> <IterStep axis="self" test="*:totalDirs"/> <IterStep axis="parent" test="*:summary"/> </AxisPath> </FNAggr>
It's clear that BaseX goes to the attribute index, scans every entry there for the 0-0 range, then checks for the appropriate attribute and parent types. What this means is that very common attribute values can take a much longer time to run, even when the number of results, when discriminated by attribute name, is relatively small.
The keys for the attribute index are simple values. Since this is an ordered/balanced structure, you can compound the key for free (sort of). Instead of using the attribute value as the key, use the index as a prefix map by compounding attribute value+attribute name (something like index.index(Token.concat(data.text(pre, text), data.name(pre)), pre);). I'm wondering if you've tried and rejected compounding the indexes like this. It's a common technique to use when indexing in triple stores (a few compound indexes allow you to have complete one-step indexing). With the additional discrimination provided you'd probably get a big jump in speed for the vast majority of queries, which won't be searching for an unknown attribute with a certain value.
The query plan would become something like this:
<FNAggr name="count(item)"> <AxisPath> <RangeAccess data="merge" min="0.self" max="0.self" type="ATTRIBUTE"/> <IterStep axis="parent" test="*:summary"/> </AxisPath> </FNAggr>
where 0.self is the binary compound key. There's no reason to stop at compounding in the attribute name. You could also compound in the name of the element that owns the attribute, resulting in:
<FNAggr name="count(item)"> <AxisPath> <RangeAccess data="merge" min="0.self.summary" max="0.self.summary" type="ATTRIBUTE"/> </AxisPath> </FNAggr>
The down side to such an index would be increased key sizes in the attribute index, increased complexity handling numeric data, and some additional processing time. The up side would be an index with much more ability to discriminate across common attribute values.