I have a database with the following characteristics:
Database Properties
Name: merge
Size: 1165 MB
Nodes: 45217101
Resources: 9
Timestamp: 30.10.2011 18:15:04
On my database, a query such as count(//summary[@totalDirs=0]) takes about 30 seconds to execute, returning a count of about 777,000 items. Executing count(//file[@length=0]) also takes about 30 seconds, returning about 12,000.
The query plans look like:
<FNAggr name="count(item)">
  <AxisPath>
    <RangeAccess data="merge" min="0" max="0" type="ATTRIBUTE"/>
    <IterStep axis="self" test="*:totalDirs"/>
    <IterStep axis="parent" test="*:summary"/>
  </AxisPath>
</FNAggr>
It's clear that BaseX goes to the attribute index, scans every entry in the 0-0 range, and then checks for the appropriate attribute name and parent element. This means that queries on very common attribute values can take much longer to run, even when the number of results, once discriminated by attribute name, is relatively small.
The keys for the attribute index are simple values. Since this is an ordered/balanced structure, you can compound the key for free (sort of). Instead of using the attribute value alone as the key, use the index as a prefix map by compounding attribute value + attribute name (something like index.index(Token.concat(data.text(pre, text), data.name(pre)), pre);). I'm wondering if you've tried and rejected compounding the indexes like this. It's a common technique when indexing in triple stores (a few compound indexes allow you to have complete one-step indexing). With the additional discrimination provided, you'd probably get a big jump in speed for the vast majority of queries, which won't be searching for an unknown attribute with a certain value.
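To make the idea concrete, here is a minimal, self-contained sketch in plain Java (not BaseX's actual index code; all class and method names are purely illustrative), with a simple ordered map standing in for the attribute index:

import java.util.*;

// Hypothetical illustration of the compound-key idea, not BaseX code:
// the ordered attribute index is keyed on value + '\0' + attribute name
// instead of the value alone, so a lookup for a very common value can be
// narrowed to a single attribute name.
public class CompoundKeyDemo {
  // Simulated ordered index: compound key -> pre values of matching attributes.
  private static final TreeMap<String, List<Integer>> INDEX = new TreeMap<>();

  private static String key(final String value, final String name) {
    return value + '\0' + name; // '\0' keeps value and name separable
  }

  private static void add(final String value, final String name, final int pre) {
    INDEX.computeIfAbsent(key(value, name), k -> new ArrayList<>()).add(pre);
  }

  public static void main(final String[] args) {
    add("0", "totalDirs", 3);
    add("0", "length", 7);      // same value, different attribute name
    add("0", "totalDirs", 11);
    add("1", "totalDirs", 15);

    // Only the entries for @totalDirs="0" are touched, instead of every
    // attribute whose value happens to be "0".
    System.out.println(INDEX.get(key("0", "totalDirs"))); // [3, 11]
  }
}

The point lookup touches only the entries whose key begins with the requested value and attribute name, no matter how many other attributes share that value.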
The query plan would become something like this:
<FNAggr name="count(item)">
  <AxisPath>
    <RangeAccess data="merge" min="0.self" max="0.self" type="ATTRIBUTE"/>
    <IterStep axis="parent" test="*:summary"/>
  </AxisPath>
</FNAggr>
where 0.self is the binary compound key. There's no reason to stop at compounding in the attribute name. You could also compound in the name of the element that owns the attribute, resulting in:
<FNAggr name="count(item)">
  <AxisPath>
    <RangeAccess data="merge" min="0.self.summary" max="0.self.summary" type="ATTRIBUTE"/>
  </AxisPath>
</FNAggr>
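Under the same illustrative assumptions (again a standalone sketch rather than BaseX internals), the element-name extension and the "prefix map" property would look roughly like this: a fully qualified lookup when value, attribute name and element name are all known, and a value-only range scan when the attribute name isn't:

import java.util.*;

// Hypothetical sketch, not BaseX code: keys of the form
// value + '\0' + attribute name + '\0' + element name still act as a prefix
// map, so a value-only range scan remains possible for an unknown name.
public class CompoundPrefixDemo {
  private static final char SEP = '\0';
  private static final TreeMap<String, List<Integer>> INDEX = new TreeMap<>();

  private static String key(final String value, final String attr, final String elem) {
    return value + SEP + attr + SEP + elem;
  }

  private static void add(final String value, final String attr, final String elem,
      final int pre) {
    INDEX.computeIfAbsent(key(value, attr, elem), k -> new ArrayList<>()).add(pre);
  }

  public static void main(final String[] args) {
    add("0", "totalDirs", "summary", 3);
    add("0", "length", "file", 7);
    add("1", "totalDirs", "summary", 15);

    // Fully qualified lookup: value, attribute name and element name known.
    System.out.println(INDEX.get(key("0", "totalDirs", "summary"))); // [3]

    // Prefix scan: all attributes with value "0", whatever their name.
    // Keys of the form "0\0..." sort between "0\0" (inclusive) and "0\1" (exclusive).
    SortedMap<String, List<Integer>> anyZero =
      INDEX.subMap("0" + SEP, "0" + (char) (SEP + 1));
    System.out.println(anyZero.values()); // [[7], [3]]
  }
}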
The downside of such an index would be larger keys in the attribute index, more complexity in handling numeric data, and some additional processing time. The upside would be an index that is much better at discriminating across common attribute values.
Dear Ross,
I'll be bold and start with some advertising: we are currently looking for sponsors to prioritize the implementation of a real range index, which will be much faster than the current solution. 60% of the estimated costs have already been covered, so if anyone is interested in the prioritization of this feature, you are very welcome to give us personal feedback!
But back to your actual concern: thanks for your suggestion on how we might speed up our existing index! Indeed, we hadn't thought of such an extension yet; if we tried to realize it, we would have to clarify some XML-specific issues, e.g. how namespaces would have to be handled. As an example, the following XML document contains two attributes whose names look identical but refer to different namespace URIs:
<a>
  <b xmlns:x='http://ns1' x:id='0'/>
  <b xmlns:x='http://ns2' x:id='0'/>
</a>
In your particular case, you might speed up your queries a lot by just looking for the string representation...
count(//summary[@totalDirs="0"])
...but this will only work if the string representation of the numbers is always identical (e.g. a lookup for "0" will not match a value written as "00" or "0.0").
Thanks again,
Christian