Hi,
Thomas Goossens wrote:
By default, whitespace nodes are chopped by the BaseX XML parser; that's why snippets like <SPEAKER>ROMEO</SPEAKER><LINE>Is the day so young?</LINE> are tokenized to "romeois", "the", "day", etc. This may look pretty weird, but it makes sense if you look at examples like this one: "<b>T</b>his is funny" contains text "This is funny"
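To make the quoted behaviour concrete, here is a small Python sketch that imitates it with the standard library. It is not BaseX code: the `tokens` helper, the `chop` flag, the wrapping <SPEECH> and <p> elements, and the newline between the two play elements are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

def tokens(xml, chop=True):
    """Concatenate all text nodes and split the result into lowercase words.

    chop=True drops whitespace-only text nodes first, imitating (by
    assumption) the chopping described in the quoted message.
    """
    root = ET.fromstring(xml)
    parts = list(root.itertext())
    if chop:
        # Discard text nodes that consist of whitespace only.
        parts = [p for p in parts if p.strip()]
    return re.findall(r"\w+", "".join(parts).lower())

# Hypothetical inputs, wrapped in a root element so that they parse.
play = "<SPEECH><SPEAKER>ROMEO</SPEAKER>\n<LINE>Is the day so young?</LINE></SPEECH>"
inline = "<p><b>T</b>his is funny</p>"

print(tokens(play))              # whitespace chopped: speaker and line merge
print(tokens(play, chop=False))  # whitespace kept: "romeo" is its own token
print(tokens(inline))            # the inline tag does not split "This"
```

In this sketch the only thing separating "ROMEO" from "Is" is the whitespace-only node, so removing it fuses the two words, while "This" is found either way.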
Well, this is funny indeed. If I am not mistaken, that means that BaseX would find "This" in the second example but not "Romeo" in the first. I would guess that a word spanning an element boundary is very rare. In other words, BaseX works well in a very uncommon situation but fails in much more likely cases... Well, it is your business.
Perhaps it would be better if an option let the user decide which behaviour the BaseX XML parser should apply.
To go further, an adaptive behaviour would be useful for widely used XML languages such as XHTML or DocBook:
<p>, <div>, and other block-level elements: tokenize with those boundaries
<b>, <i>, and other inline-level elements: tokenize without those boundaries
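A rough Python sketch of what I mean, using only the standard library. Nothing here is a BaseX API: `BLOCK_TAGS`, `flatten` and `tokenize` are names I made up, the tag set is just an example, and namespaces (which real XHTML would have) are ignored.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical set of elements whose boundaries should also split tokens.
BLOCK_TAGS = {"p", "div"}

def flatten(elem, block_tags=BLOCK_TAGS):
    """Return the text of elem, padding block-level elements with spaces."""
    is_block = elem.tag in block_tags
    parts = [" "] if is_block else []
    if elem.text:
        parts.append(elem.text)
    for child in elem:
        parts.append(flatten(child, block_tags))
        if child.tail:
            # Text following the child still belongs to the current element.
            parts.append(child.tail)
    if is_block:
        parts.append(" ")
    return "".join(parts)

def tokenize(xml, block_tags=BLOCK_TAGS):
    """Split the flattened text of an XML string into lowercase words."""
    root = ET.fromstring(xml)
    return re.findall(r"\w+", flatten(root, block_tags).lower())

doc = "<div><p>First</p><p><b>S</b>econd one</p></div>"
print(tokenize(doc))         # block boundaries split, inline ones do not
print(tokenize(doc, set()))  # no block tags: adjacent paragraphs merge
```

With the block set in place, "First" and "Second" stay separate while the <b> inside "Second" does not break the word; with an empty set the two paragraphs run together, which is the case I am complaining about.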