Hello
I encountered some strange things when tokenizing text. Sample runnable code is added below. Here is my list of problems:
1) The regular expression "(.){3}" doesn't produce the same result as "(...)". Shouldn't they be equivalent?
2) A very annoying whitespace character is placed next to the newline produced by out:nl(). It is placed before the newline if out:nl() is called at the beginning of an element, and after the newline if out:nl() is called at the end of an element.
E.g. the serialized output is either "<s> this" or ". </s>".
Sample runnable code:
for $text in ("this one... is the first.", "this one is second.")
return <s>{
  out:nl(),
  string-join(
    analyze-string($text, '(.){3}|[\W]')//text()[not(.=" ")],
    out:nl()
  ),
  out:nl()
}</s>,
"--------- VERSUS --------- ",
for $text in ("this one... is the first.", "this one is second.")
return <s>{
  out:nl(),
  string-join(
    analyze-string($text, '(...)|[\W]')//text()[not(.=" ")],
    out:nl()
  ),
  out:nl()
}</s>
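A reduced test case for problem 1 might look like the sketch below. It returns the raw analyze-string result instead of only its text nodes, so the match and group structure for the two patterns can be compared side by side. The input string 'abcdef' and the <result> wrapper element are only illustrative and are not part of my real data.

```xquery
(: Reduced test for problem 1: show the full analyze-string result
   for both patterns, so the fn:match / fn:group structure can be
   compared directly. The input 'abcdef' is an illustrative value. :)
for $regex in ('(.){3}', '(...)')
return <result regex="{ $regex }">{
  analyze-string('abcdef', $regex)
}</result>
```

If the two results differ in how the capturing group is wrapped inside each match, that would also explain why //text() returns different text nodes for the two patterns.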