New subject: very common attribute values/custom indexing

4 Nov 2011


      Ross,
...
The DSI Utilities (http://dsiutils.dsi.unimi.it/) are a very> interesting set of classes.
Thanks; I'll soon have a look at that package.
Just some quick feedback, regarding the extension of index entries:
we'd get even better results by extending the index entries with id's
to the unique location paths.. e.g.:
<a>
    <b>dig</b>
    <c>dig</c>
    <c>diz</c>
  </a>
Path Index:
  0: a/b
  1: a/c
Value Index:
  dig.0 -> /a/b/text()
  dig.1 -> /a/c/text()[.='bla']
  diz.1 -> /a/c/text()[.='blu']
This way, we could completely give up path inversions, at least in a
number of trivial cases. In both cases, I guess that most effort has
to be put into the adaptation of the query optimizer. Volunteers are
welcome.. ;)
Christian___________________________
On Wed, Nov 2, 2011 at 11:29 PM, Ross Judson rossjudson@gmail.com wrote:
...
You guys are wizards at making the JVM run fast, without a doubt. I
haven't looked to see how namespaces are handled right now, within the
indexing scheme. In my particular case I do not have namespaces, but
clearly they need to be handled, generally.
The DSI Utilities (http://dsiutils.dsi.unimi.it/) are a very
interesting set of classes. I have previously used these to construct
indexes of RDF-type triples. They are _extremely_ high performance,
able to process at average rates of hundreds of thousands of tuples
per second.
The ImmutableExternalPrefixMap was very useful. I created compound
strings (subject-property-object) and stored them. I could then do a
very fast lookup on any prefix, like subject, or subject-property. The
prefix map would supply the applicable interval. By creating another
index compounding (object-property-subject), I could lookup all usages
of a given property value (object), restricting by property name, or
all the way to a specific subject.
The ImmutableExternalPrefixMap uses an in-memory structure to compute
approximate intervals on any given query; it then reads disk to refine
and answer the query. The net result is that you generally have access
to the results of any given query in one sector read (more or less
like a hash structure), but the data is _ordered_, which means you can
perform range queries, at the cost of reading additional sectors as
necessary.
Within BaseX you have already converted tokens to byte arrays.
ImmutableExternalPrefixMap does this as well.
A useful experiment would be to add a new type of index to BaseX that
incorporates one of these prefix maps.
Creating the compound index looks something like this (altering
ValueBuilder.build()). Replace:
index.index(data.text(pre, text), pre);
with:
byte [] attrName = data.name(pre, Data.ATTR);
     int parentElement = data.parent(pre, Data.ATTR);
     byte [] elementName = data.name(parentElement, Data.ELEM);
     TokenBuilder tb = new TokenBuilder(data.text(pre, text));
     tb.add(';');
     tb.add(attrName);
     tb.add(';');
     tb.add(elementName);
     index.index(tb.finish(), pre);
Not sure if I have all of that right, but it shows the idea. When I
run with this the indexes on my data do expand a little, but not too
much (may a 10% growth?). Of course, it will be different for
everyone, depending on their content.
By checking the database properties you can see that the index has
more keys in it. Next step would be to add a prefix-scanning mode to
the index iteration, in addition to equality scanning. In the example
above I use the ';' character to separate the composite token; (byte)0
might be more appropriate.

Re: [basex-talk] very common attribute values/custom indexing