Re: [basex-talk] Full-text search and mixed content

14 May 2012


On 2012-05-13, Christian Grün christian.gruen@gmail.com wrote:
...
...
If I want to get whitespaces back, do I have to re-create the collection?
Yes; sorry for that. The database does not contain any information on
chopped whitespaces, which is why you'll indeed have to reimport the
documents.
...
Would this result in any change concerning the node-ids?  We already have
some data depending on node-ids.  Is there some other way to get the
original whitespaces back?
The node ids will change if the documents include pure whitespace
texts. The following example represents such a document; it contains
three text nodes ("X", and two text nodes with a single newline
character):
<hello>
<world>X</world>
</hello>
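
To make the effect on node IDs concrete, here is a small Python sketch. It is not BaseX's actual storage or ID scheme: the function number_nodes is made up for illustration, and it simply numbers elements and text nodes in document order (ignoring the document node and attributes), which is an assumption that only mimics a fresh import.

```python
from xml.dom import minidom

# The example document from above: "X" plus two newline-only text nodes.
DOC = "<hello>\n<world>X</world>\n</hello>"

def number_nodes(xml, chop):
    """Number element and text nodes in document order (toy stand-in for node IDs).

    With chop=True, whitespace-only text nodes are skipped, mimicking an
    import that chops whitespace. This is an illustration, not BaseX code.
    """
    numbered = []

    def visit(node):
        for child in node.childNodes:
            if child.nodeType == child.TEXT_NODE:
                if chop and not child.data.strip():
                    continue  # dropped on import, so it never gets a number
                numbered.append((len(numbered), "text", child.data))
            elif child.nodeType == child.ELEMENT_NODE:
                numbered.append((len(numbered), "element", child.tagName))
                visit(child)

    visit(minidom.parseString(xml))
    return numbered

kept = number_nodes(DOC, chop=False)
chopped = number_nodes(DOC, chop=True)

for label, nodes in (("whitespace kept", kept), ("whitespace chopped", chopped)):
    print(label)
    for number, kind, value in nodes:
        print("  %d %-7s %r" % (number, kind, value))
```

In this toy numbering the world element sits at position 2 when whitespace is kept and at position 1 when it is chopped, which is the kind of shift described above.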
I'll be working with Cerstin on this issue, so here's a brief comment.
Thanks for the example, that's what I feared ... I think we're lucky
that we're only dealing with node IDs of elements, so we can annotate
the elements with ID attributes, associate the node IDs with the XML
IDs, and then translate them again to the node IDs of the "unchopped"
database.  If we were dealing with node IDs of text nodes, we'd be hosed ...
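
The translation described above could look roughly like the following Python sketch. It rests on assumptions: the attribute name "id", the values "e1" and "e2", and the function element_ids are all made up, and the node IDs are again just document-order positions rather than real BaseX node IDs.

```python
from xml.dom import minidom

# Hypothetical annotated source: every element carries a stable "id" attribute.
ANNOTATED = '<hello id="e1">\n<world id="e2">X</world>\n</hello>'

def element_ids(xml, chop):
    """Map each element's "id" attribute to a toy node ID.

    The toy node ID is the position among elements and text nodes in
    document order; whitespace-only text nodes are skipped when chop=True.
    """
    mapping = {}
    counter = 0

    def visit(node):
        nonlocal counter
        for child in node.childNodes:
            if child.nodeType == child.TEXT_NODE:
                if chop and not child.data.strip():
                    continue
                counter += 1
            elif child.nodeType == child.ELEMENT_NODE:
                mapping[child.getAttribute("id")] = counter
                counter += 1
                visit(child)

    visit(minidom.parseString(xml))
    return mapping

old_ids = element_ids(ANNOTATED, chop=True)    # "chopped" database
new_ids = element_ids(ANNOTATED, chop=False)   # "unchopped" database

# Old node ID -> new node ID, joined through the stable "id" attribute.
translate = {old_ids[key]: new_ids[key] for key in old_ids}
print(translate)
```

This only works because elements can carry attributes; text nodes cannot, so there would be nothing stable to join on for them.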
...
...
How would I display the selected text snippet to the user, when I store the
node-id and the text (as mixed content)?  ft:mark will not work, I think.
I'm not quite sure what you refer to here; could you attach a small example?
Christian
I *think* what she means is: Since
ft:mark(//p[. contains text 'real'])
will not highlight anything if . contains mixed content with multiple
text nodes, what is the best approach to highlight the results of a
search, given a query and a matching node?
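
To illustrate what is being asked for, here is a rough Python sketch of highlighting done by hand, one text node at a time. It is only one conceivable approach and says nothing about how BaseX does or should do it; the function mark_term and the sample paragraph are made up, matching is plain case-insensitive whole-word matching without full-text tokenization or stemming, and a term that straddles two text nodes would not be found.

```python
import re
from xml.dom import minidom

def mark_term(xml, term):
    """Wrap each whole-word occurrence of `term` in a <mark> element.

    Every text node is processed separately, so matches inside nested
    inline elements (mixed content) are highlighted as well.
    """
    doc = minidom.parseString(xml)
    pattern = re.compile(r"\b%s\b" % re.escape(term), re.IGNORECASE)

    def visit(node):
        # Iterate over a copy, because the child list is modified below.
        for child in list(node.childNodes):
            if child.nodeType == child.ELEMENT_NODE:
                visit(child)
            elif child.nodeType == child.TEXT_NODE:
                pos = 0
                for match in pattern.finditer(child.data):
                    if match.start() > pos:
                        before = doc.createTextNode(child.data[pos:match.start()])
                        node.insertBefore(before, child)
                    mark = doc.createElement("mark")
                    mark.appendChild(doc.createTextNode(match.group()))
                    node.insertBefore(mark, child)
                    pos = match.end()
                if pos:  # at least one match: replace the original text node
                    rest = child.data[pos:]
                    if rest:
                        node.insertBefore(doc.createTextNode(rest), child)
                    node.removeChild(child)

    visit(doc)
    return doc.documentElement.toxml()

print(mark_term("<p>This is a <b>real</b> example of real text.</p>", "real"))
```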
...
PS@Michael and Gerrit: thanks for your opinion. One of the reasons for
chopping whitespace by default is that whitespace texts in
structured documents consume a lot of space in a database, although
they will never need to be processed.
Yes, I figured that it was intended for data-oriented documents.
...
However, I see that this solution may be more confusing than
helpful, which is why we'll think about switching the default
behavior.
This would be very welcome!  Your example above also nicely illustrates
the problem.  As the significance of whitespace in XML can only be
determined when there's a schema, chopping whitespace by default means
that, strictly speaking, documents are altered semantically on import
unless you take special precautions--it should definitely be the other
way round.
Best regards
-- 
Dr.-Ing. Michael Piotrowski, M.A. mxp@cl.uzh.ch
Institute of Computational Linguistics, University of Zurich
Phone +41 44 63-54313 | OpenPGP public key ID 0x1614A044
* OUT NOW: Systems and Frameworks for Computational Morphology
*          http://www.springeronline.com/978-3-642-23137-7
