Different interpretation of regex in eXist, Saxon and BaseX

Andreas Mixich

7 Aug 2018

3:38 p.m.

[RFC 3986](https://tools.ietf.org/html/rfc3986#appendix-B) defines a nice regular expression that splits any URI, including a URN, into its components.

What is interesting about this regex is its use of the '?' quantifier, which makes every preceding group (component) optional. It therefore matches either a URI or any other(!) string: anything that does not match one of the special groups falls into a catch-all group (no. 5), which holds either the path or the full, arbitrary string. This is negligible, since the input to this regex is guaranteed to be of the right type (a/@href/string()).

Here is the relevant part from the RFC.

Appendix B

    ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
     12            3  4          5       6  7        8 9

The numbers in the second line above are only to assist readability; they indicate the reference points for each subexpression (i.e., each paired parenthesis). We refer to the value matched for subexpression <n> as $<n>. For example, matching the above expression to

    http://www.ics.uci.edu/pub/ietf/uri/#Related

results in the following subexpression matches:

    $1 = http:
    $2 = http
    $3 = //www.ics.uci.edu
    $4 = www.ics.uci.edu
    $5 = /pub/ietf/uri/
    $6 = <undefined>
    $7 = <undefined>
    $8 = #Related
    $9 = Related

where <undefined> indicates that the component is not present, as is the case for the query component in the above example. Therefore, we can determine the value of the five components as

    scheme    = $2
    authority = $4
    path      = $5
    query     = $7
    fragment  = $9

Going in the opposite direction, we can recreate a URI reference from its components by using the algorithm of Section 5.3.
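
The subexpression matches quoted from the RFC can be reproduced with a short script. This is only a minimal sketch: it uses Python's `re` module as a stand-in for the XQuery processors discussed in this thread, the helper name `split_uri` is made up for illustration, and Python reports a group that did not take part in the match as `None` where the RFC writes <undefined>.

```python
import re

# The regular expression from RFC 3986, Appendix B.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_uri(uri):
    """Return the five RFC 3986 components of `uri` as a dict.

    Hypothetical helper, not taken from the thread. Components that are
    absent from the input come back as None.
    """
    # The match cannot fail on single-line input: every part is optional.
    match = URI_RE.match(uri)
    return {
        "scheme": match.group(2),
        "authority": match.group(4),
        "path": match.group(5),
        "query": match.group(7),
        "fragment": match.group(9),
    }


if __name__ == "__main__":
    print(split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related"))
```

A URN such as `urn:isbn:0451450523` goes through the same function: the scheme is `urn`, there is no authority, and the rest lands in the path group.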

I tested this regex with Saxon, eXist and BaseX. eXist successfully parsed all the test cases I threw at it into the right groups; Saxon and BaseX did not. The failure is:

[FORX0003] Pattern matches empty string..

And that got me baffled, since all three processors use Java underneath and since the definition of the '?' quantifier, when used like this, seems to be:

> Makes the preceding item optional. Greedy, so the optional item is included in the match if possible.

Which means that *if* any of the group's contents match, they should be included, rather than producing an empty string.

Why is it like that? And what can I do about it? I found no other URI parsing regex that splits a URI into components this way and would be compatible with XQuery.

See the attached test case.

-- Goody Bye, Minden jót, Mit freundlichen Grüßen, Andreas Mixich

Attachments:

rfc-rx-test.xq (application/xquery — 3.2 KB)

Bridger Dyson-Smith

7 Aug

9:31 p.m.

Hi Andreas -

Wow, that is a pretty nice regex :). I'm not nearly caffeinated enough right now to pick it apart, so I can only ask a question, not provide any answers or help. Unless I'm reading the spec and Walmsley's coverage wrong, isn't the '?' a reluctant quantifier, so that given two choices it will always match the shorter one? Or does the hash/octothorpe give extra significance to the '?' quantifier?

In any event, thank you for the neat brain teaser!

Best,
Bridger

Liam R. E. Quin

10:18 p.m.

On Tue, 2018-08-07 at 21:31 -0400, Bridger Dyson-Smith wrote:

> isn't the '?' a reluctant quantifier - given two choices it will always match the shorter choice?

b? matches zero or one "b".

b* matches zero or more "b" using the longest match possible

b+ matches one or more "b" using the longest match possible

b*? matches zero or more "b" using the shortest match possible.

b+? matches one or more "b" using the shortest match possible.

See https://www.w3.org/TR/xpath-functions-31/#regex-syntax for examples and more text.

? inside a character class matches a ? so that [#?] matches either "#" or "?".
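
The quantifier rules listed above can be checked directly. This is a minimal sketch, assuming Python's `re` module as a stand-in (its greedy and reluctant quantifiers behave the same way for these simple cases); the variable names are illustrative only.

```python
import re

text = "bbb"

# Greedy forms take as many "b" characters as they can.
greedy_star = re.match(r"b*", text).group()   # "bbb"
greedy_plus = re.match(r"b+", text).group()   # "bbb"

# Reluctant forms take as few as the rest of the pattern allows.
lazy_star = re.match(r"b*?", text).group()    # ""
lazy_plus = re.match(r"b+?", text).group()    # "b"

# A plain "?" means "zero or one" and is itself greedy.
optional = re.match(r"b?", text).group()      # "b"

# Inside a character class, "?" is a literal question mark.
literal = re.findall(r"[#?]", "a?b#c")        # ["?", "#"]

if __name__ == "__main__":
    print(greedy_star, greedy_plus, repr(lazy_star), lazy_plus, optional, literal)
```

So a bare `?` is not a reluctant quantifier; it only becomes one when it follows another quantifier, as in `*?` or `+?`.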

> ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?

This can indeed match the empty string. Adding spaces for clarity:

    ^                  -- start of string
    (([^:/?#]+):)?     -- optional because of ?
    (//([^/?#]*))?     -- optional because of ?
    ([^?#]*)           -- can match the empty string because of *
    (\?([^#]*))?       -- optional because of ?
    (#(.*))?           -- optional because of ?

[no $ to match the end of the string included]
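
That every piece of the pattern can match nothing at all is easy to confirm mechanically. A minimal sketch, again using Python's `re` as a stand-in engine; `matches_empty_string` is a hypothetical helper name, not a function from any of the processors mentioned here.

```python
import re

RFC3986_RE = r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"

# The five top-level pieces of the expression, in order.
PIECES = [
    r"(([^:/?#]+):)?",
    r"(//([^/?#]*))?",
    r"([^?#]*)",
    r"(\?([^#]*))?",
    r"(#(.*))?",
]


def matches_empty_string(pattern):
    """Return True if `pattern` can match the zero-length string."""
    return re.fullmatch(pattern, "") is not None


whole = matches_empty_string(RFC3986_RE)
per_piece = [matches_empty_string(p) for p in PIECES]

# For contrast: the scheme piece without its trailing "?" needs input.
scheme_required = matches_empty_string(r"([^:/?#]+):")

if __name__ == "__main__":
    print(whole, per_piece, scheme_required)
```

Here `whole` and every entry of `per_piece` are true, while `scheme_required` is false, which is the situation the FORX0003 message describes.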

It's actually hard to construct a string that isn't a valid URI according to the specs, and harder still to determine this from reading the specs.

In XQuery I'd just do something like xs:anyURI($string) and let the XQuery engine work it out; use try/catch if necessary. It's rare that it makes sense to be more restrictive than, say, fn:doc() or than Web browsers.
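
The same idea, handing the string to a library parser and catching the failure, might look as follows. This is a sketch in Python rather than XQuery: `urllib.parse.urlsplit` stands in for `xs:anyURI`, and `parse_or_none` is a hypothetical helper name. Like most such parsers, `urlsplit` is lenient and rejects only a few inputs.

```python
from urllib.parse import urlsplit


def parse_or_none(text):
    """Split `text` into URI components, or return None if the parser rejects it."""
    try:
        return urlsplit(text)
    except ValueError:
        # urlsplit raises ValueError for inputs it cannot parse,
        # for example an unbalanced "[" in the host part.
        return None


ok = parse_or_none("http://www.ics.uci.edu/pub/ietf/uri/#Related")
bad = parse_or_none("http://[::1/path")

if __name__ == "__main__":
    print(ok)
    print(bad)
```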

Liam

-- Liam Quin, https://www.holoweb.net/liam/cv/ Web slave for vintage clipart http://www.fromoldbooks.org/ Available for XML/Document/Information Architecture/ XSL/XQuery/Web/Text Processing/A11Y work & consulting.

Andreas Mixich

11:55 p.m.

Bridger Dyson-Smith wrote:

> wow, that is a pretty nice regex :).

Indeed, I found that, too! :-)

> coverage wrong, isn't the '?' a reluctant quantifier - given two choices it will always match the shorter choice? Or does the hash/octothorp give extra significance to the '?' quantifier?

I found https://www.regular-expressions.info/reference.html to be a brilliant and very complete reference. It even covers [XSD](https://www.regular-expressions.info/xml.html) and [XPath](https://www.regular-expressions.info/xpath.html) regular expressions.

And while this may sound like an advertisement, which it is not, the site *is* just *that* good. For a small tip of around 5 dollars, you can download the whole website as a formatted PDF. It is the best regex reference I have read so far. The author really knows this stuff and is very passionate about it.

Now, if you go to https://www.regular-expressions.info/floatingpoint.html , you will see a very similar problem to ours.

And since I am already in recommendation mode: http://regex101.com. Just saying... Sadly, it has no XPath coverage. Oh, and also http://rexegg.com, which is less of a reference and more tutorial/anecdotal.

-- Goody Bye, Minden jót, Mit freundlichen Grüßen, Andreas Mixich

Omar Siam

8 Aug

6:58 a.m.

I think the problem is this: there are numerous implementations of regular expressions, which share a common subset but differ in the more advanced features.

Using the Java regular expression implementation you can use greedy quantifiers and some other things. The XSL and XQuery implementation, following the standards, does not allow this and so misinterprets the regular expression. See here: https://www.w3.org/TR/xpath-functions-31/#regex-syntax

You can tell Saxon to use a different regexp engine such as the standard Java one: https://www.saxonica.com/html/documentation/functions/fn/matches.html

Best regards

Omar

Andreas Mixich

9 Aug

10:32 a.m.

Omar Siam wrote:

> Using the java regular expression implementation you can use greedy and some other things. The XSL and XQuery implementation according to the standards does not allow this and so misinterprets the regular expression. See here: https://www.w3.org/TR/xpath-functions-31/#regex-syntax

I checked that document and also https://www.w3.org/TR/xmlschema-2/#regexs, but did not find any mention of greediness. Then again, I am not sure whether I understood this passage from the latter document:

> A ·regular expression· R is a sequence of characters that denote a set of strings L(R). When used to constrain a ·lexical space·, a regular expression R asserts that only strings in L(R) are valid literals for values of that type.

    For all ·atom·s S and non-negative integers n, m such that n <= m,
    valid ·piece·s R are:     Denoting the set of strings L(R) containing:
    S?                        the empty string, and all strings in L(S).

Now I am not quite sure what L(S) means.

> You can tell Saxon to use a different regexp engine such as the standard Java one: https://www.saxonica.com/html/documentation/functions/fn/matches.html

The hint is much appreciated, though BaseX is my actual development target. I only mentioned Saxon and eXist because I cross-checked them and found the result interesting enough to take to the list (and I still hope that Christian chimes in and may find a good reason to do it the other way around, as opposed to the way it is now).

-- Goody Bye, Minden jót, Mit freundlichen Grüßen, Andreas Mixich

Omar Siam

11:59 a.m.

Hi!

My point was that greediness is *not* part of the XQuery RegExp standard. Java, on the other hand, has this feature: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#greed... and others. And I don't know about Perl, PHP, Python and so on.

What I want to stress is: A beautiful RegExp from the internet may or may not work with a particular RegExp implementation.

Nevertheless, as Saxon is well integrated in BaseX, you can use it to do some RegExp work. However, getting data to and from Saxon may not be possible, depending on the size of what you want to process. As far as I know, Saxon always works on an in-memory representation of the data, and that is not an option with a 2.5 GB XML file, for example.

Best regards

Omar

Murray, Gregory

12:17 p.m.

In https://www.w3.org/TR/xpath-functions-31/#regex-syntax you won't find the words "greedy" or "greediness" because the term used is "reluctant quantifiers." See section 5.6.1.2.

Omar Siam

12:21 p.m.

Sorry, I got that wrong. I meant that XQuery has greedy (the default) and reluctant quantifiers, but not possessive ones.

Christian Grün

10:35 a.m.

Thanks, Omar, for the hint to the 'j' flag in Saxon. Sounds enticing; I think we can include it in BaseX as well.

Andreas Mixich

12:16 p.m.

On 09.08.2018 at 16:35, Christian Grün wrote:

> Thanks, Omar, for the hint to the 'j' flag in Saxon. Sounds enticing; I think we can include it in BaseX as well.

Very good news! Thanks a lot!

-- Goody Bye, Minden jót, Mit freundlichen Grüßen, Andreas Mixich

Andy Bunce

12:35 p.m.

+1 for the Java flag as this enables \b for word boundaries as mentioned here [1]

/Andy

[1] https://stackoverflow.com/questions/25446314/in-saxon-9-he-java-xml-parser-word-boundaries-b-in-regular-expressions-are-n/25464233#25464233
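
For readers who have not used `\b`: it matches the position between a word character and a non-word character, so it can pin a pattern to whole words. The sketch below uses Python's `re` module, which supports `\b` much as `java.util.regex` does; the sample text and variable names are made up for illustration.

```python
import re

text = "cat concat cat."

# Without \b the pattern also hits the "cat" inside "concat".
anywhere = re.findall(r"cat", text)

# With \b on both sides only the standalone words remain.
whole_words = re.findall(r"\bcat\b", text)

if __name__ == "__main__":
    print(len(anywhere), len(whole_words))
```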

Christian Grün

1:02 p.m.

> +1 for the Java flag as this enables \b for word boundaries as mentioned here [1]

True, I missed that one as well more than once.

I’ve just added support for Java’s default parser [1,2]. Apart from 'j' (which doesn’t need to be prefixed with a semicolon, as in Saxon), '!' is available as an alternative. As it’s not officially documented in Saxon, just keep this one a secret :)

A new snapshot will be available later tonight.

[1] https://github.com/BaseXdb/basex/issues/1608
[2] http://docs.basex.org/wiki/XQuery_Extensions#Regular_expressions

Christian Grün

1:54 p.m.

> A new snapshot will be available later tonight.

…which is now.

Andy Bunce

3:57 p.m.

Great! I believe the "!" option is best ignored...:)

> Note: On the Java platform, this can also be achieved using the flag "!"; this was never formally supported and is likely to be withdrawn in a future Saxon version. [1]

/Andy

[1] https://www.saxonica.com/html/documentation/functions/fn/matches.html

Andreas Mixich

10 Aug

9:20 a.m.

On 09.08.2018 at 21:57, Andy Bunce wrote:

> Great! I believe the "!" option is best ignored...:)

I wonder why Saxon had it there, in the first place!?

-- Goody Bye, Minden jót, Mit freundlichen Grüßen, Andreas Mixich

Andreas Mixich

9:16 a.m.

On 09.08.2018 at 19:54, Christian Grün wrote:

>> A new snapshot will be available later tonight.
>
> …which is now.

Installed the new snapshot, all went fine, but later I stumbled upon the following issue:

    Error: Improper use? Potential bug? Your feedback is welcome:
    Contact: basex-talk@mailman.uni-konstanz.de
    Version: BaseX 9.1 beta
    Java: Oracle Corporation, 9.0.1
    OS: Windows 10, amd64
    Stack Trace:
    java.util.regex.PatternSyntaxException: Unmatched closing ')' near index 13
    ([^:]*)://)?(?:([^:@]*)(?::([^@]*))?@)?(?:([^/:]*))?(?::([0-9]*))?/(/[^?#]*(?=.*?/)/)?([^?#]*)?(?:?([^#]*))?(?:#(.*))?/
                 ^
        at java.base/java.util.regex.Pattern.error(Unknown Source)
        at java.base/java.util.regex.Pattern.compile(Unknown Source)
        at java.base/java.util.regex.Pattern.<init>(Unknown Source)
        at java.base/java.util.regex.Pattern.compile(Unknown Source)
        at org.basex.query.util.regex.parse.RegExParser.parse(RegExParser.java:61)
        ...

To me it looks like the Java regex circumvents the BaseX error catcher. Full error log and test-case attached.
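
The general shape of a remedy for this kind of leak is to catch the regex engine's own exception at the point of compilation and re-raise it as the application's error type. The sketch below shows that pattern in Python; it is not BaseX's actual code, and `RegexSyntaxError` and `compile_checked` are hypothetical names.

```python
import re


class RegexSyntaxError(Exception):
    """Hypothetical application-level error for malformed patterns."""


def compile_checked(pattern):
    """Compile `pattern`, converting the engine's own exception into ours."""
    try:
        return re.compile(pattern)
    except re.error as exc:
        # Wrap the low-level error so callers only ever see one error type.
        raise RegexSyntaxError(f"invalid pattern {pattern!r}: {exc}") from exc


if __name__ == "__main__":
    try:
        compile_checked("([^:]*)://)?")  # unmatched closing parenthesis
    except RegexSyntaxError as err:
        print(err)
```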

-- Goody Bye, Minden jót, Mit freundlichen Grüßen, Andreas Mixich

Christian Grün

12:08 p.m.

> Installed the new snapshot, all went fine, but later I stumbled upon the following issue:

Confirmed and fixed, thanks (the new snapshot is available in around 5 min).

> I wonder why Saxon had it there, in the first place!?

Feel free to ask Michael Kay.
