Using BaseX 11 (but I think the code is the same in BaseX 10).
I’m trying to understand how base-uri() behaves relative to how it should behave when the database path of a document is not a valid URI, i.e., it has a space in it.
First I have this test:
let $doc as document-node() := document { <root><child xml:base="temp/child-uri%20with%20space.xml">child</child></root> }
return $doc/*/child ! base-uri(.)
Which produces:
file:///data/basex/data/.dba/temp/child-uri%20with%20space.xml
Which is the correct result: it’s the value of @xml:base resolved against the base URI of the query, and the escaped spaces make it a valid URI.
Replacing %20 with “ “ in the @xml:base value results in this error:
Invalid URI: Illegal character in path at index 14: temp/child-uri with space.xml.
Also correct as the spaces have to be escaped.
This verifies that base-uri() applied to nodes with explicit @xml:base attributes works per the spec. But this test does not involve database paths.
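A minimal sketch that runs both variants in one query, so the success case and the error case show up side by side. The $base variable and the try/catch wrapper are additions for illustration; only the two @xml:base values come from the tests above.

```xquery
(: Sketch: compare an escaped and an unescaped @xml:base value. :)
for $base in (
  'temp/child-uri%20with%20space.xml',  (: escaped spaces: a valid URI reference :)
  'temp/child-uri with space.xml'       (: unescaped spaces: raised the error above :)
)
let $doc as document-node() := document {
  <root><child xml:base="{$base}">child</child></root>
}
return
  try {
    (: base-uri() resolves @xml:base against the base URI in scope :)
    string(base-uri($doc/*/child))
  } catch * {
    (: report the failure instead of aborting the whole query :)
    'Error: ' || $err:description
  }
```

The absolute part of the first result depends on where the query runs, so it will differ from the file:///data/basex/... prefix shown above.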
To try to test things with database paths I then created this pair of test scripts:
Script to put docs in a database:
let $db := 'temp'
let $filename as xs:string := 'with space.xml'
let $doc1 as document-node() := document {<root><child>No xml:base</child></root>}
let $doc2 as document-node() := document { <root><child xml:base="{'/temp/xmlbase/doc2_' || $filename}">With xml:base unescaped</child></root> }
let $doc3 as document-node() := document { <root><child xml:base="{iri-to-uri( '/temp/xmlbase/doc3_' || $filename)}">With xml:base escaped</child></root> }
return (()
,db:put($db, $doc1, 'doc1_' || $filename)
,db:put($db, $doc2, 'doc2_' || $filename)
,db:put($db, $doc3, 'doc3_' || $filename)
)
Script to report on them:
let $db := 'temp'
let $filenameBase as xs:string := 'with space.xml'
return
  for $i in 1 to 3
  let $filename := 'doc' || $i || '_' || $filenameBase
  let $doc := db:get($db, $filename)
  let $child as element() := $doc/*/child
  let $dbPath := db:path($doc)
  let $baseUriDoc := base-uri($doc)
  let $baseUriChild :=
    try {
      base-uri($child)
    } catch * {
      $err:description
    }
  return (()
    ,``[
Doc "`{$dbPath}`":]``
    ,$doc
    ,``[xml:base att: "`{$child/@xml:base}`"]``
    ,``[base URI of doc: "`{$baseUriDoc}`"]``
    ,``[base URI of child: "`{$baseUriChild}`"]``
  )
Which returns this result:
Doc "doc1_with space.xml":
<root>
<child>No xml:base</child>
</root>
xml:base att: ""
base URI of doc: "/temp/doc1_with space.xml"
base URI of child: "/temp/doc1_with space.xml"
Doc "doc2_with space.xml":
<root>
<child xml:base="/temp/xmlbase/doc2_with space.xml">With xml:base unescaped</child>
</root>
xml:base att: "/temp/xmlbase/doc2_with space.xml"
base URI of doc: "/temp/doc2_with space.xml"
base URI of child: "Invalid URI: Illegal character in path at index 23: /temp/xmlbase/doc2_with space.xml."
Doc "doc3_with space.xml":
<root>
<child xml:base="/temp/xmlbase/doc3_with%20space.xml">With xml:base escaped</child>
</root>
xml:base att: "/temp/xmlbase/doc3_with%20space.xml"
base URI of doc: "/temp/doc3_with space.xml"
base URI of child: "Invalid URI: Illegal character in path at index 15: /temp/doc3_with space.xml."
Note the result for doc3: the error message names the base URI of the document (/temp/doc3_with space.xml), not the base URI of the child (/temp/xmlbase/doc3_with%20space.xml). Why? I think the answer is that under the covers it’s doing resolve-uri(), which also checks the validity of both the base and relative parts.
One observation is that base-uri() is treating the db-provided base URI differently from an xml:base-provided base URI, but only when there is no @xml:base attribute.
In doc 1, the database path has a space but base-uri() does not fail when returning it even though it’s not a valid URI. Why not?
In doc 2, the xml:base-supplied base URI is correctly reported as invalid, but the database-supplied base URI of the root is not reported as invalid.
My expectation would be that the behavior is consistent: either all URIs must be valid, including those coming from database paths, or all are automatically escaped (as though iri-to-uri() had been applied).
Finally, why do I get the result for doc 3, where it’s reporting the database path as the base URI of the child rather than the @xml:base-defined base URI (which is correctly escaped)?
My code, which depends on @xml:base to do DITA link resolution for “resolved” DITA maps, now escapes URIs in @xml:base values, and as far as I can tell everything works as it should. But I’m still concerned about the inconsistency in the behavior of base-uri().
I tried to trace through the code that handles base-uri() but it’s pretty twisty and does different things for files and nodes.
It would obviously be very disruptive to have base-uri() start failing on database paths with spaces. I think the current behavior dates back to the very start of BaseX, but it’s still an inconsistency that can cause trouble for the unwary.
For example, consider this code:
let $topicref := db:get('maps', 'map with space.ditamap')/*/topicref[@href][1]
let $target as element()? := local:resolve-href($topicref)
let $baseUri as xs:string := base-uri($target) ! string(.)
let $newElem as element() := <submap xml:base="{$baseUri}"/>
return base-uri($newElem)
The value of $baseUri will be “map with space.ditamap”, not “map%20with%20space.ditamap”, so $newElem gets xml:base="map with space.ditamap", and base-uri($newElem) will throw an invalid URI exception.
My expectation would be either that base-uri($target) also throws an exception or, more usefully, that it returns the iri-to-uri() result, ensuring that the values will always be treated as valid URIs, consistent with how the database paths are treated.
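As a stopgap, the escaping can be applied on the caller's side before the value is written to @xml:base. This is a minimal sketch and not anything BaseX does itself: local:escape-base is a hypothetical helper, and the literal $raw value only stands in for the kind of unescaped string that base-uri() returns for a database document with a space in its path.

```xquery
(: Hypothetical helper: escape a base URI string so it is safe in @xml:base. :)
declare function local:escape-base($uri as xs:string?) as xs:string? {
  (: iri-to-uri() escapes spaces but leaves existing %XX sequences as they are :)
  $uri ! iri-to-uri(.)
};

(: Assumed stand-in for string(base-uri($target)) on a database document :)
let $raw as xs:string := '/maps/map with space.ditamap'
let $newElem as element() := <submap xml:base="{local:escape-base($raw)}"/>
return string($newElem/@xml:base)
```

Because iri-to-uri() does not touch the % character, applying the helper to a value that is already escaped should be harmless.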
Cheers,
E.