Re: [basex-talk] Full-text speed

16 Feb 2010

Hi,
Thomas Goossens wrote:
...
By default, whitespace nodes are chopped by the BaseX XML parser;
that's why snippets like...

<SPEAKER>ROMEO</SPEAKER><LINE>Is the day so young?</LINE>

..are tokenized to "romeois", "the", "day", etc. This may look pretty
weird, but it makes sense if you look at examples like..

 "<b>T</b>his is funny" contains text "This is funny"

Well, this is funny indeed. If I am not mistaken, that means that BaseX
would find "This" in the 2nd example but not "Romeo" in the first example.
I guess that words crossing an element tag are very rare.
In other words, BaseX works well in a very uncommon situation but
fails in much more likely cases... Well, it is your business.
Perhaps it would be better if an option let the user decide which
behaviour the BaseX XML parser should apply.
To go further, an adaptive behaviour would be useful for widely used
XML languages, such as XHTML or DocBook:
<p>, <div>, and other block-level elements: tokenize at those boundaries
<b>, <i>, and other inline-level elements: tokenize across those boundaries
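
A rough sketch of what I mean, written in Python rather than inside BaseX. It is only an illustration of the idea, not how the BaseX parser works: the BLOCK_ELEMENTS set and both function names are made up for this example, and treating SPEAKER and LINE as block-level is my assumption.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical list of block-level element names; every other element
# is treated as inline. SPEAKER and LINE are assumed to be block-level.
BLOCK_ELEMENTS = {"p", "div", "SPEAKER", "LINE"}


def text_with_boundaries(elem, block_elements):
    """Collect the text below elem, putting a space around block-level
    elements only, so that inline tags never split a word."""
    parts = []

    def walk(node):
        is_block = node.tag in block_elements
        if is_block:
            parts.append(" ")
        if node.text:
            parts.append(node.text)
        for child in node:
            walk(child)
            if child.tail:
                parts.append(child.tail)
        if is_block:
            parts.append(" ")

    walk(elem)
    return "".join(parts)


def tokenize(xml_fragment, block_elements=BLOCK_ELEMENTS):
    """Lower-case word tokens of an XML fragment."""
    # Wrap the fragment so that it has a single root element.
    root = ET.fromstring("<root>" + xml_fragment + "</root>")
    text = text_with_boundaries(root, block_elements)
    return re.findall(r"\w+", text.lower())


play = "<SPEAKER>ROMEO</SPEAKER><LINE>Is the day so young?</LINE>"
print(tokenize(play))                        # block tags act as boundaries
print(tokenize(play, block_elements=set()))  # no boundaries: "romeois" appears
print(tokenize("<b>T</b>his is funny"))      # inline tag does not split "This"
```

With the block-level list in place, "romeo" and "is" come out as separate tokens while "this" stays in one piece; with an empty list the first token collapses to "romeois", as described above.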
-- 
Cordialement,

               ///
              (. .)
  --------ooO--(_)--Ooo--------
|      Philippe Poulard       |
  -----------------------------
  http://reflex.gforge.inria.fr/
        Have the RefleX !
