Ross,
The DSI Utilities (http://dsiutils.dsi.unimi.it/) are a very> interesting set of classes.
Thanks; I'll soon have a look at that package. Just some quick feedback, regarding the extension of index entries: we'd get even better results by extending the index entries with id's to the unique location paths.. e.g.:
<a> <b>dig</b> <c>dig</c> <c>diz</c> </a>
Path Index: 0: a/b 1: a/c
Value Index: dig.0 -> /a/b/text() dig.1 -> /a/c/text()[.='bla'] diz.1 -> /a/c/text()[.='blu']
This way, we could completely give up path inversions, at least in a number of trivial cases. In both cases, I guess that most effort has to be put into the adaptation of the query optimizer. Volunteers are welcome.. ;)
Christian___________________________
On Wed, Nov 2, 2011 at 11:29 PM, Ross Judson rossjudson@gmail.com wrote:
You guys are wizards at making the JVM run fast, without a doubt. I haven't looked to see how namespaces are handled right now, within the indexing scheme. In my particular case I do not have namespaces, but clearly they need to be handled, generally.
The DSI Utilities (http://dsiutils.dsi.unimi.it/) are a very interesting set of classes. I have previously used these to construct indexes of RDF-type triples. They are _extremely_ high performance, able to process at average rates of hundreds of thousands of tuples per second.
The ImmutableExternalPrefixMap was very useful. I created compound strings (subject-property-object) and stored them. I could then do a very fast lookup on any prefix, like subject, or subject-property. The prefix map would supply the applicable interval. By creating another index compounding (object-property-subject), I could lookup all usages of a given property value (object), restricting by property name, or all the way to a specific subject.
The ImmutableExternalPrefixMap uses an in-memory structure to compute approximate intervals on any given query; it then reads disk to refine and answer the query. The net result is that you generally have access to the results of any given query in one sector read (more or less like a hash structure), but the data is _ordered_, which means you can perform range queries, at the cost of reading additional sectors as necessary.
Within BaseX you have already converted tokens to byte arrays. ImmutableExternalPrefixMap does this as well.
A useful experiment would be to add a new type of index to BaseX that incorporates one of these prefix maps.
Creating the compound index looks something like this (altering ValueBuilder.build()). Replace:
index.index(data.text(pre, text), pre);
with:
byte [] attrName = data.name(pre, Data.ATTR); int parentElement = data.parent(pre, Data.ATTR); byte [] elementName = data.name(parentElement, Data.ELEM); TokenBuilder tb = new TokenBuilder(data.text(pre, text)); tb.add(';'); tb.add(attrName); tb.add(';'); tb.add(elementName); index.index(tb.finish(), pre);
Not sure if I have all of that right, but it shows the idea. When I run with this the indexes on my data do expand a little, but not too much (may a 10% growth?). Of course, it will be different for everyone, depending on their content.
By checking the database properties you can see that the index has more keys in it. Next step would be to add a prefix-scanning mode to the index iteration, in addition to equality scanning. In the example above I use the ';' character to separate the composite token; (byte)0 might be more appropriate.
Interesting -- since the path index already provides a unique, ordered constraint over the set of available paths, it makes perfect sense to use this as the second part of the compound index on attribute values. It will both provide additional discrimination information AND be smaller than storing the symbolic information.
Agreed that the large part of the effort is in the optimizer. I haven't looked at that part of your code yet; I should be able to look at it this week.
On Fri, Nov 4, 2011 at 12:16 AM, Christian Grün christian.gruen@gmail.com wrote:
Just some quick feedback, regarding the extension of index entries: we'd get even better results by extending the index entries with id's to the unique location paths.. e.g.:
<a> <b>dig</b> <c>dig</c> <c>diz</c> </a>
Path Index: 0: a/b 1: a/c
Value Index: dig.0 -> /a/b/text() dig.1 -> /a/c/text()[.='bla'] diz.1 -> /a/c/text()[.='blu']
basex-talk@mailman.uni-konstanz.de