Hello!
In brief, I’m looking for any advice folks have on accepting XML of various encodings into a single database. Are there any considerations I should take into account with, say, RESTXQ form parameters? Or with storage, indexing, querying, etc.?
For context, I’m currently (re)building a RESTXQ API for a service that publishes TEI. The site, TAPAS, is open to anyone with TEI that might need a place to store it and show it off online. We do some minimal testing on uploaded files [1], but up to this point we haven’t applied any limitations on the character encoding that folks use. While I expect that many people use UTF-8 encoding for TEI, I would like to try to ensure that folks using other encodings can use the service as well. The previous version of the TAPAS-xq API was not tested in this regard (and in other ways as well). I’m trying to do better in writing this new version.
One problem that I’m running into is simply retrieving UTF-16 XML from a multipart form parameter.[2] I think BaseX is serializing the file as a string, but when I try to parse it as XML, the file is flagged as ill-formed (“Content is not allowed in prolog”). Is there a way to be flexible about uploaded XML while giving BaseX any parsing/serialization hints it might need?
I’m about at the limit of my ability to figure out how different encodings might be working within BaseX; I appreciate any and all information or advice!
Warmly, Ash
[1]: Our minimal testing: Is this well-formed XML? Is it in the TEI namespace? Is the outermost element <TEI> rather than <teiCorpus> or anything else? We also try to make sure the XML doesn’t contain JavaScript that might make it into a reader’s browser.
[2]: I first discovered this might be a problem when creating a unit test (https://github.com/NEU-DSG/tapas-xq/blob/d66066e65a661a4b909c21ae852d6479f0ae6274/modules/test-suite.xql#L120-L155) for this function (https://github.com/NEU-DSG/tapas-xq/blob/d66066e65a661a4b909c21ae852d6479f0ae6274/modules/tapas-api.xql#L225-L273): the UTF-16 file-as-string could not be parsed as XML. I then tested a UTF-16 file against the RESTXQ endpoint with curl, and found the same problem (though I do not know if the string has been read as UTF-8 or -16). In contrast, when I tried to create a standalone test module (https://gist.github.com/amclark42/9f8d8135a30e4659774a673627a263ac), a UTF-16 file-as-string could be parsed as XML, but not when encoded as binary and then decoded again.
Ash Clark (my pronouns are e/em/eir)
XML Applications Developer
Digital Scholarship Group
Northeastern University Libraries
as.clark@northeastern.edu
(617) 373-5983
Hi Ash,
The easiest way to accept an XML document, no matter which encoding it has, looks as follows:
(: your RESTXQ endpoint :)
declare
  %rest:POST('{$xml}')
  %rest:path('/test')
function local:xml($xml) {
  $xml//text()
};

# a command-line call to send the request
curl -H"Content-Type:application/xml" -XPOST -Ttest.xml "http://localhost:8080/test"
Here is a slightly more complex example to upload the document via an HTML form [1]:
declare
  %rest:GET
  %rest:path('/test')
  %output:method('html')
function local:f() {
  <form action="/test" method="POST" enctype="multipart/form-data">
    <input type="file" name="files" multiple="multiple"/>
    <input type="submit"/>
  </form>
};

declare
  %rest:POST
  %rest:path("/test")
  %rest:form-param("files", "{$files}")
function local:upload($files) {
  <results>{
    (: $files maps each uploaded file name to its binary content :)
    for $name in map:keys($files)
    let $content := $files($name)
    return <result name='{ $name }'>{
      try {
        fetch:binary-doc($content)
      } catch * {
        attribute error { $err:description }
      }
    }</result>
  }</results>
};
The uploaded binary is converted to an XML document with fetch:binary-doc [2]. An error is output if the input cannot be parsed as XML. Again, the encoding doesn’t matter: it is derived dynamically from the byte order mark, the first bytes, or the embedded XML declaration.
It seems that’s the solution that would also help you to get the unit test running, right? Maybe the standard function will be enhanced to also accept xs:base64Binary in XQuery 4, but that’s still under discussion [3].
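For the unit test, the same idea could be sketched roughly as below. This is only an illustration under a few assumptions: the module namespace, the helper t:parse-tei and the test name are hypothetical, and the inline sample stands in for a real UTF-16 test file. The helper parses the raw bytes with fetch:binary-doc and keeps the document only if the outermost element is <TEI> in the TEI namespace; the test encodes a small document as UTF-16 with convert:string-to-base64 and checks that it survives the round trip.

```xquery
module namespace t = 'http://example.org/encoding-test'; (: hypothetical namespace :)

declare namespace tei = 'http://www.tei-c.org/ns/1.0';

(: Hypothetical helper: parse the uploaded bytes without decoding them to a
   string first. Returns the document if its outermost element is tei:TEI,
   and the empty sequence if parsing fails or the root is something else. :)
declare function t:parse-tei(
  $binary as xs:base64Binary
) as document-node()? {
  try {
    let $doc := fetch:binary-doc($binary)
    return if ($doc/tei:TEI) then $doc else ()
  } catch * {
    ()
  }
};

(: Encode a sample document as UTF-16 and make sure it is still parsed. :)
declare %unit:test function t:accepts-utf16() {
  let $xml :=
    '<?xml version="1.0" encoding="UTF-16"?>' ||
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text/></TEI>'
  let $binary := convert:string-to-base64($xml, 'UTF-16')
  let $doc := t:parse-tei($binary)
  return unit:assert-equals(local-name($doc/*), 'TEI')
};
```

Saved as a library module, this should be runnable with the BaseX TEST command. The point of the design is that the bytes are never turned into a string before parsing, so the parser can still see whatever encoding hints the file carries.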
Hope this helps, Christian
[1] https://docs.basex.org/wiki/RESTXQ#File_Uploads
[2] https://docs.basex.org/wiki/Fetch_Module
[3] https://github.com/qt4cg/qtspecs/issues/748
Dear Christian,
Thank you so much, this is really clear and helpful!
Warmly, Ash
________________________________
From: Christian Grün christian.gruen@gmail.com
Sent: Wednesday, October 18, 2023 8:58 AM
To: Clark, Ash as.clark@northeastern.edu
Cc: BaseX basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Advice on accepting UTF-16 XML (and others) from RESTXQ?