I have a database with the following characteristics:
Database Properties
Name: merge
Size: 1165 MB
Nodes: 45217101
Resources: 9
Timestamp: 30.10.2011 18:15:04
On my database, a query such as count(//summary[@totalDirs=0]) takes about 30 seconds to execute, returning a count of about 777,000 items. Executing count(//file[@length=0]) also takes about 30 seconds, returning about 12,000.
The query plans look like:
<FNAggr name="count(item)">
  <AxisPath>
    <RangeAccess data="merge" min="0" max="0" type="ATTRIBUTE"/>
    <IterStep axis="self" test="*:totalDirs"/>
    <IterStep axis="parent" test="*:summary"/>
  </AxisPath>
</FNAggr>
It's clear that BaseX goes to the attribute index, scans every entry in the 0-0 range, and then checks for the appropriate attribute name and parent element. This means that queries on very common attribute values can take much longer to run, even when the number of results, once discriminated by attribute name, is relatively small.
The keys for the attribute index are simple values. Since this is an ordered/balanced structure, you can compound the key for free (sort of). Instead of using the attribute value alone as the key, use the index as a prefix map by compounding attribute value + attribute name (something like index.index(Token.concat(data.text(pre, text), data.name(pre)), pre);). I'm wondering if you've tried and rejected compounding the indexes like this. It's a common technique when indexing in triple stores (a few compound indexes allow you to have complete one-step indexing). With the additional discrimination provided, you'd probably get a big jump in speed for the vast majority of queries, which won't be searching for an unknown attribute with a certain value.
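To make the idea concrete, here is a minimal, self-contained sketch in plain Java (not BaseX's actual index code; all class and method names are purely illustrative), with a simple ordered map standing in for the attribute index:

import java.util.*;

// Hypothetical illustration of the compound-key idea, not BaseX code:
// the ordered attribute index is keyed on value + '\0' + attribute name
// instead of the value alone, so a lookup for a very common value can be
// narrowed to a single attribute name.
public class CompoundKeyDemo {
  // Simulated ordered index: compound key -> pre values of matching attributes.
  private static final TreeMap<String, List<Integer>> INDEX = new TreeMap<>();

  private static String key(final String value, final String name) {
    return value + '\0' + name; // '\0' keeps value and name separable
  }

  private static void add(final String value, final String name, final int pre) {
    INDEX.computeIfAbsent(key(value, name), k -> new ArrayList<>()).add(pre);
  }

  public static void main(final String[] args) {
    add("0", "totalDirs", 3);
    add("0", "length", 7);      // same value, different attribute name
    add("0", "totalDirs", 11);
    add("1", "totalDirs", 15);

    // Only the entries for @totalDirs="0" are touched, instead of every
    // attribute whose value happens to be "0".
    System.out.println(INDEX.get(key("0", "totalDirs"))); // [3, 11]
  }
}

The point lookup touches only the entries whose key begins with the requested value and attribute name, no matter how many other attributes share that value.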
The query plan would become something like this:
<FNAggr name="count(item)">
  <AxisPath>
    <RangeAccess data="merge" min="0.self" max="0.self" type="ATTRIBUTE"/>
    <IterStep axis="parent" test="*:summary"/>
  </AxisPath>
</FNAggr>
where 0.self is the binary compound key. There's no reason to stop at compounding in the attribute name. You could also compound in the name of the element that owns the attribute, resulting in:
<FNAggr name="count(item)">
  <AxisPath>
    <RangeAccess data="merge" min="0.self.summary" max="0.self.summary" type="ATTRIBUTE"/>
  </AxisPath>
</FNAggr>
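Under the same illustrative assumptions (again a standalone sketch rather than BaseX internals), the element-name extension and the "prefix map" property would look roughly like this: a fully qualified lookup when value, attribute name and element name are all known, and a value-only range scan when the attribute name isn't:

import java.util.*;

// Hypothetical sketch, not BaseX code: keys of the form
// value + '\0' + attribute name + '\0' + element name still act as a prefix
// map, so a value-only range scan remains possible for an unknown name.
public class CompoundPrefixDemo {
  private static final char SEP = '\0';
  private static final TreeMap<String, List<Integer>> INDEX = new TreeMap<>();

  private static String key(final String value, final String attr, final String elem) {
    return value + SEP + attr + SEP + elem;
  }

  private static void add(final String value, final String attr, final String elem,
      final int pre) {
    INDEX.computeIfAbsent(key(value, attr, elem), k -> new ArrayList<>()).add(pre);
  }

  public static void main(final String[] args) {
    add("0", "totalDirs", "summary", 3);
    add("0", "length", "file", 7);
    add("1", "totalDirs", "summary", 15);

    // Fully qualified lookup: value, attribute name and element name known.
    System.out.println(INDEX.get(key("0", "totalDirs", "summary"))); // [3]

    // Prefix scan: all attributes with value "0", whatever their name.
    // Keys of the form "0\0..." sort between "0\0" (inclusive) and "0\1" (exclusive).
    SortedMap<String, List<Integer>> anyZero =
      INDEX.subMap("0" + SEP, "0" + (char) (SEP + 1));
    System.out.println(anyZero.values()); // [[7], [3]]
  }
}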
The downside of such an index would be larger keys in the attribute index, more complexity in handling numeric data, and some additional processing time. The upside would be an index that is much better at discriminating across common attribute values.
Dear Ross,
I'll be bold and start with some advertising: we are currently looking for sponsors to prioritize the implementation of a real range index, which will be much faster than the current solution. 60% of the estimated costs have already been covered, so if anyone is interested in the prioritization of this feature, you are very welcome to give us personal feedback!
But back to your actual concern: thanks for your suggestion on how we might speed up our existing index! Indeed, we hadn't thought of such an extension yet; if we tried to realize it, we would have to clarify some XML-specific issues, e.g. how namespaces would have to be handled. As an example, the following XML document contains two attributes whose names look identical but refer to different namespace URIs:
<a>
  <b xmlns:x='http://ns1' x:id='0'/>
  <b xmlns:x='http://ns2' x:id='0'/>
</a>
In your particular case, you might speed up your queries a lot by just looking for the string representation...
count(//summary[@totalDirs="0"])
...but this will only work if the string representation of the numbers is always identical (e.g. a lookup for "0" will not match a value written as "00" or "0.0").
Thanks again,
Christian