On 2012-05-13, Christian Grün christian.gruen@gmail.com wrote:
If I want to get whitespaces back, do I have to re-create the collection?
Yes; sorry for that. The database does not contain any information on chopped whitespaces, which is why you'll indeed have to reimport the documents.
Would this result in any change concerning the node-ids? We already have some data depending on node-ids. Is there some other way to get the original whitespaces back?
The node ids will change if the documents include pure whitespace texts. The following example represents such a document; it contains three text nodes ("X", and two text nodes with a single newline character):
<hello>
<world>X</world>
</hello>
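To make the effect on numbering concrete, here is a minimal Python sketch. It assumes, purely for illustration, that ids are handed out sequentially in document order and that only elements and text nodes are counted; the real id assignment inside the database may differ. `number_nodes` and its `chop` flag are hypothetical names, not part of any BaseX API.

```python
from xml.dom import minidom

# The example document, with one newline before and one after <world>.
DOC = "<hello>\n<world>X</world>\n</hello>"

def number_nodes(xml_text, chop):
    """Return (id, description) pairs for nodes in document order.

    Assumption: ids are sequential in document order. With chop=True,
    whitespace-only text nodes are skipped, mimicking whitespace
    chopping on import.
    """
    dom = minidom.parseString(xml_text)
    numbered = []

    def visit(node):
        if node.nodeType == node.TEXT_NODE:
            if chop and not node.data.strip():
                return  # chopped: this node never receives an id
            numbered.append((len(numbered), "text " + repr(node.data)))
        elif node.nodeType == node.ELEMENT_NODE:
            numbered.append((len(numbered), "element " + node.tagName))
            for child in node.childNodes:
                visit(child)

    visit(dom.documentElement)
    return numbered

chopped = number_nodes(DOC, chop=True)
unchopped = number_nodes(DOC, chop=False)

for label, nodes in (("chopped", chopped), ("unchopped", unchopped)):
    print(label)
    for node_id, description in nodes:
        print("  ", node_id, description)
```

Under these assumptions the chopped variant has three nodes and the unchopped one has five, so `world` moves from position 1 to position 2 once the whitespace is kept.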
I'll be working with Cerstin on this issue, so here's a brief comment. Thanks for the example, that's what I feared ... I think we're lucky that we're only dealing with node IDs of elements, so we can annotate the elements with ID attributes, associate the node IDs with the XML IDs, and then translate them again to the node IDs of the "unchopped" database. If we were dealing with node IDs of text nodes, we'd be hosed ...
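The translation step can be sketched roughly as follows, in Python rather than against the database itself. The sketch assumes that node ids follow document order, that attributes are not counted, and that every element already carries a plain `id` attribute; `element_ids` is a hypothetical helper, not a BaseX function.

```python
from xml.dom import minidom

# Same document as in the example, with an ID attribute on each element.
ANNOTATED = '<hello id="e1">\n<world id="e2">X</world>\n</hello>'

def element_ids(xml_text, chop):
    """Map each element's 'id' attribute to a sequential node id.

    Assumption: node ids follow document order; whitespace-only text
    nodes are skipped when chop=True; attributes are not counted.
    """
    dom = minidom.parseString(xml_text)
    mapping = {}
    counter = 0

    def visit(node):
        nonlocal counter
        if node.nodeType == node.TEXT_NODE:
            if not (chop and not node.data.strip()):
                counter += 1  # a kept text node consumes an id
        elif node.nodeType == node.ELEMENT_NODE:
            mapping[node.getAttribute("id")] = counter
            counter += 1
            for child in node.childNodes:
                visit(child)

    visit(dom.documentElement)
    return mapping

old = element_ids(ANNOTATED, chop=True)    # ids in the chopped database
new = element_ids(ANNOTATED, chop=False)   # ids in the unchopped database

# Old node id -> XML ID -> new node id.
translate = {old[xml_id]: new[xml_id] for xml_id in old}
print(translate)
```

The XML ID acts as the stable key here: stored references to old node ids are looked up in `translate` and rewritten once, after which the ID attributes are no longer needed for this purpose.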
How would I display the selected text snippet to the user, when I store the node-id and the text (as mixed content)? ft:mark will not work, I think.
I'm not quite sure what you refer to here; could you attach a small example? Christian
I *think* what she means is: Since
ft:mark(//p[. contains text 'real'])
will not highlight anything if . contains mixed content with multiple text nodes, what is the best approach to highlight the results of a search, given a query and a matching node?
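To illustrate what kind of result is wanted, here is a small Python sketch that works outside the database: it walks every text node below the matching element and wraps each occurrence of the term in a `mark` element. This is only one conceivable approach, not what `ft:mark` does internally; it uses a plain case-sensitive substring match instead of full-text tokenization, it does not catch a term that is split across two text nodes, and `mark_term` and `wrap_matches` are hypothetical names.

```python
from xml.dom import minidom

def wrap_matches(dom, text_node, term):
    """Replace one text node by text/mark pieces around each match."""
    pieces = text_node.data.split(term)
    if len(pieces) == 1:
        return  # term does not occur in this text node
    parent = text_node.parentNode
    for index, piece in enumerate(pieces):
        if index > 0:
            # Every gap between two pieces is one occurrence of the term.
            mark = dom.createElement("mark")
            mark.appendChild(dom.createTextNode(term))
            parent.insertBefore(mark, text_node)
        if piece:
            parent.insertBefore(dom.createTextNode(piece), text_node)
    parent.removeChild(text_node)

def mark_term(xml_text, term):
    """Return the serialized element with all matches wrapped in <mark>."""
    dom = minidom.parseString(xml_text)

    def visit(node):
        # Copy the child list, because wrap_matches modifies it.
        for child in list(node.childNodes):
            if child.nodeType == child.TEXT_NODE:
                wrap_matches(dom, child, term)
            elif child.nodeType == child.ELEMENT_NODE:
                visit(child)

    visit(dom.documentElement)
    return dom.documentElement.toxml()

result = mark_term("<p>A <b>real</b> example, really.</p>", "real")
print(result)
```

Because the match is a substring match, the "real" inside "really" is marked as well; a token-based full-text search would treat that case differently.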
PS@Michael and Gerrit: thanks for your opinion. One of the reasons for chopping whitespace by default is that whitespace texts in structured documents consume a lot of space in a database, although they never need to be processed.
Yes, I figured that it was intended for data-oriented documents.
However, I see that this solution may be more confusing than helpful, which is why we'll think about switching the default behavior.
This would be very welcome! Your example above also nicely illustrates the problem. As the significance of whitespace in XML can only be determined when there's a schema, chopping whitespace by default means that, strictly speaking, documents are altered semantically on import unless you take special precautions--it should definitely be the other way round.
Best regards