Hi Andreas -

Wow, that is a pretty nice regex :). I'm not nearly caffeinated enough right now to pick it apart, so I can only ask a question, not provide any answers or help. Unless I'm reading the spec and Walmsley's coverage wrong, isn't the '?' a reluctant quantifier, meaning that given two choices it will always match the shorter one? Or does the hash/octothorpe give extra significance to the '?' quantifier?

In any event, thank you for the neat brain teaser!
Best,
Bridger

On Tue, Aug 7, 2018 at 3:38 PM Andreas Mixich <mixich.andreas@gmail.com> wrote:
Hi

[rfc3986](https://tools.ietf.org/html/rfc3986#appendix-B) defines a nice
regular expression, which groups any URI, including URN, by URI component.

What is interesting about this regex is its use of the '?' quantifier, which
makes every preceding group/component optional. It therefore matches either a
URI or any other(!) string, since anything that does not match one of
the special groups goes into a catch-all group (no. 5), which holds
either the path or the full, arbitrary string. This is negligible,
since the input to this regex is guaranteed to be of the right type
(a/@href/string()).

Here is the relevant part from the RFC.

  Appendix B

  ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
   12            3  4          5       6  7        8 9

     The numbers in the second line above are only to assist
     readability; they indicate the reference points for each
     subexpression (i.e., each paired parenthesis).  We refer to the
     value matched for subexpression <n> as $<n>.  For example, matching
     the above expression to

        http://www.ics.uci.edu/pub/ietf/uri/#Related

     results in the following subexpression matches:

        $1 = http:
        $2 = http
        $3 = //www.ics.uci.edu
        $4 = www.ics.uci.edu
        $5 = /pub/ietf/uri/
        $6 = <undefined>
        $7 = <undefined>
        $8 = #Related
        $9 = Related

     where <undefined> indicates that the component is not present,
     as is the case for the query component in the above example.
     Therefore, we can determine the value of the five components as

        scheme    = $2
        authority = $4
        path      = $5
        query     = $7
        fragment  = $9

     Going in the opposite direction, we can recreate a URI reference
     from its components by using the algorithm of Section 5.3.
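
To make the group numbering concrete, here is a minimal sketch that applies the Appendix B expression to the RFC's own example. It uses Python's `re` module purely for illustration, on the assumption that it treats this particular pattern the same way as the XQuery processors; `split_uri` is a made-up helper name, not something from the RFC or from any of the processors mentioned.

```python
import re

# Regular expression copied verbatim from RFC 3986, Appendix B.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def split_uri(uri):
    """Return the five URI components; None marks a component that is absent.

    Hypothetical helper for illustration. The match cannot fail, because
    every part of the pattern is optional or may match zero characters.
    """
    m = URI_RE.match(uri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

parts = split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related")
for name, value in parts.items():
    print(name, "=", value)
```

The query comes back as `None` here, which corresponds to `<undefined>` in the RFC's listing.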


I tested this regex with Saxon, eXist and BaseX. eXist successfully
parsed all the test cases I threw at it into the right groups; Saxon
and BaseX did not. The failure is:

    [FORX0003] Pattern matches empty string..

That baffled me, since all three processors use Java underneath
and since the definition of the '?' quantifier, when used like this,
seems to be:

    Makes the preceding item optional. Greedy, so the optional item
    is included in the match if possible.

Which means that *if* any of the group's contents match, they should be
included, rather than producing an empty string.
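
The wording of the error suggests the complaint may be less about greediness on a given input than about what the pattern is able to match at all: since every group is optional or starred, the expression as a whole also matches a zero-length string. As far as I know, XPath functions such as fn:replace and fn:tokenize raise FORX0003 for any such pattern, whatever the input; whether that is what happens here is an assumption on my part. A small sketch of the property itself, again in Python's `re` as a stand-in (`matches_empty` is a hypothetical helper name):

```python
import re

# Regular expression copied verbatim from RFC 3986, Appendix B.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def matches_empty(pattern):
    """True if the compiled pattern can match the zero-length string."""
    return pattern.fullmatch("") is not None

# Every group is optional or may match zero characters, so this holds.
print(matches_empty(URI_RE))

# An arbitrary non-URI string still matches; it all lands in group 5.
m = URI_RE.match("not a uri at all")
print(repr(m.group(2)), repr(m.group(5)))
```

On a real URI the greedy '?' does include the optional groups, as the quoted definition says; the empty match only shows that the pattern is capable of matching nothing.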

Why is it like that? And what can I do about it? I found no other
URI-parsing regex that componentizes this way and would be compatible
with XQuery.

See, attached, a test-case.

--
Goody Bye, Minden jót, Mit freundlichen Grüßen,
Andreas Mixich