Hi Wendell,

On 28.01.2013 at 22:27, Wendell Piez wrote:

> 1. Unless I learn better, I'm going to prefer [B] or [C], because in
> my world, mixed content is common; is there any reason (performance or
> otherwise) to prefer [A] in cases where I know it will be robust? Is
> there any reason to prefer [B] or prefer [C]?

My world is a world of mixed content, too.  So with queries like [A], you miss a lot of the things you want to retrieve.  However, [A] is the only way to make use of the index.  So with [B] or [C] you could in principle get all the hits you are interested in, but in practice you will never get them because of performance issues.
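
To make the "missed hits" point concrete, here is a minimal sketch in Python (not XQuery, and not how BaseX evaluates anything).  It rests on an assumption of mine: that a query like [A] tests each text node on its own, while [B] and [C] test the string value of the whole element.  The sample paragraph, the `hi` element and the function names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a paragraph with inline markup (mixed content).
SAMPLE = "<p>The <hi>quick</hi> brown fox</p>"


def text_nodes(elem):
    """Return the element's own direct text nodes, excluding descendants' text."""
    nodes = []
    if elem.text:
        nodes.append(elem.text)
    for child in elem:
        if child.tail:
            nodes.append(child.tail)
    return nodes


def hit_in_text_node(elem, phrase):
    """Node-level test (assumed to resemble [A]): phrase must lie within one text node."""
    return any(phrase in node for node in text_nodes(elem))


def hit_in_string_value(elem, phrase):
    """String-value test (assumed to resemble [B]/[C]): search the concatenated text."""
    return phrase in "".join(elem.itertext())


p = ET.fromstring(SAMPLE)
print(hit_in_text_node(p, "quick brown"))     # phrase spans the <hi> boundary
print(hit_in_string_value(p, "quick brown"))  # phrase found in "The quick brown fox"
```

The node-level test fails for any phrase that crosses an inline element, which is exactly the kind of hit that gets lost in mixed content; the string-value test finds it, but has to assemble the text of every candidate element first.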

Flattening the structure first, i.e., getting rid of all non-structural information that is not really relevant for your particular query, and then applying [A] is a bad idea when your user scenario involves inspecting the hits in their original context, i.e., with all formatting intact, and annotating the hits back into the original text.

As I see it, the handling of mixed content is the biggest obstacle when working with BaseX in the Humanities.  

For some reason, eXist seems capable of handling mixed content AND using the index.  But when I experimented with it, it wasn't that stable, so I went back to BaseX, and my users know that some hits will very likely be missed when querying the corpus.  However, for every "query" they are interested in, they formulate several XQuery expressions with different search terms; this way they get hold of almost everything eXist was able to find.  I can show some examples at the BaseX user meeting in Prague.

Best regards

Cerstin
--
Dr. phil. Cerstin Mahlow

Universität Basel
Departement Sprach- und Literaturwissenschaften
Fachbereich Deutsche Sprach- und Literaturwissenschaft
Nadelberg 4
4051 Basel
Schweiz

Tel:  +41 61 267 07 65
Fax: +41 61 267 34 40
Mail: cerstin.mahlow@unibas.ch
Web: http://www.oldphras.net