Thanks, Omar, for the hint about the 'j' flag in Saxon. Sounds enticing; I think we can include it in BaseX as well.

Omar Siam <Omar.Siam@oeaw.ac.at> wrote on Wed, 8 Aug 2018, 12:58:

Hi
I think the problem is: There are numerous implementations of regular
expressions which share a common subset but differ in the more
advanced features.
Using the Java regular expression implementation you can use greedy
matching and some other things. The XSL and XQuery implementation
according to the standards does not allow this and so misinterprets the regular
expression. See here: https://www.w3.org/TR/xpath-functions-31/#regex-syntax
You can tell Saxon to use a different regexp engine such as the standard
Java one:
https://www.saxonica.com/html/documentation/functions/fn/matches.html
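
For illustration, here is a minimal sketch of what the standard Java engine does with the RFC 3986 pattern when it is used directly through java.util.regex, not through Saxon or BaseX. The class name `UriSplit` and the method `split` are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UriSplit {
    // Pattern copied from RFC 3986, Appendix B.
    static final Pattern URI = Pattern.compile(
        "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?");

    // Hypothetical helper: returns {scheme, authority, path, query, fragment}.
    // Components that are not present come back as null.
    static String[] split(String ref) {
        Matcher m = URI.matcher(ref);
        if (!m.matches()) {
            // Can happen, e.g. if the input contains a line break after '#'.
            throw new IllegalArgumentException("no match: " + ref);
        }
        return new String[] {
            m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)
        };
    }

    public static void main(String[] args) {
        String[] names = { "scheme", "authority", "path", "query", "fragment" };
        String[] parts = split("http://www.ics.uci.edu/pub/ietf/uri/#Related");
        for (int i = 0; i < names.length; i++) {
            System.out.println(names[i] + " = " + parts[i]);
        }
    }
}
```

With the example URI from the RFC, the query component comes back as null, which corresponds to the `<undefined>` entries in the RFC text.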
Best regards
Omar
On 07.08.2018 at 21:38, Andreas Mixich wrote:
> Hi
>
> [rfc3986](https://tools.ietf.org/html/rfc3986#appendix-B) defines a nice
> regular expression, which groups any URI, including URN, by URI component.
>
> What is interesting about this regex is its use of the '?' quantifier, which
> makes every preceding group/component optional. It therefore matches either a
> URI or any other(!) string, since anything that does not match one of
> the special groups goes into a catch-all group (no. 5), which holds
> either the path or the full, arbitrary string. This is negligible,
> since the input to this regex is guaranteed to be of the right type
> (a/@href/string()).
>
> Here is the relevant part from the RFC.
>
> Appendix B
>
> ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
>  12            3  4          5       6  7        8 9
>
> The numbers in the second line above are only to assist
> readability; they indicate the reference points for each
> subexpression (i.e., each paired parenthesis). We refer to the
> value matched for subexpression <n> as $<n>. For example, matching
> the above expression to
>
> http://www.ics.uci.edu/pub/ietf/uri/#Related
>
> results in the following subexpression matches:
>
> $1 = http:
> $2 = http
> $3 = //www.ics.uci.edu
> $4 = www.ics.uci.edu
> $5 = /pub/ietf/uri/
> $6 = <undefined>
> $7 = <undefined>
> $8 = #Related
> $9 = Related
>
> where <undefined> indicates that the component is not present,
> as is the case for the query component in the above example.
> Therefore, we can determine the value of the five components as
>
> scheme = $2
> authority = $4
> path = $5
> query = $7
> fragment = $9
>
> Going in the opposite direction, we can recreate a URI reference
> from its components by using the algorithm of Section 5.3.
>
>
> I tested this regex with Saxon, eXist and BaseX. eXist successfully
> parsed all the test cases I threw at it into the right groups; Saxon
> and BaseX did not. The failure is:
>
> [FORX0003] Pattern matches empty string..
>
> And that got me baffled, since all three processors use Java underneath
> and since the definition of the '?' quantifier, when used like this,
> seems to be:
>
> Makes the preceding item optional. Greedy, so the optional item
> is included in the match if possible.
>
> Which means that *if* any of the groups' contents match, they should be
> included, rather than producing an empty string.
>
> Why is it like that? And what can I do about it? I found no other URI
> parsing regex that splits a URI into components this way and would be
> compatible with XQuery.
>
> See the attached test case.
>
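
A note on the FORX0003 message itself: as far as I read the XPath functions spec, fn:replace, fn:tokenize and fn:analyze-string raise this error whenever the pattern is able to match a zero-length string. Since every group in the RFC pattern is optional, it can. The sketch below only checks that property with plain java.util.regex; it does not reproduce what Saxon or BaseX do internally, and the class name `EmptyMatchCheck` and the method `matchesEmpty` are made up for this example.

```java
import java.util.regex.Pattern;

class EmptyMatchCheck {
    // Pattern copied from RFC 3986, Appendix B.
    static final String RFC3986 =
        "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?";

    // Hypothetical helper: true if the regex matches the empty string.
    static boolean matchesEmpty(String regex) {
        return Pattern.compile(regex).matcher("").matches();
    }

    public static void main(String[] args) {
        // Every group is optional, so the empty string is a match.
        System.out.println(matchesEmpty(RFC3986));       // true
        // A pattern with a mandatory scheme part does not match "".
        System.out.println(matchesEmpty("^[^:/?#]+:"));  // false
    }
}
```

So the error is less about how greedy '?' is and more about the fact that the pattern as a whole is allowed to match nothing at all.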