Hi,
I came across this the other day:
(: results in � (some (unknown?) binary char) :)
declare function local:test1($string) {
  xs:hexBinary($string)
};

(: results in *[convert:string] String conversion: Decoding error: xff.* :)
declare function local:test2($string) {
  convert:binary-to-string(xs:hexBinary($string))
};

(: results in � :)
declare function local:test3($string) {
  xs:hexBinary($string),
  convert:binary-to-string(xs:hexBinary($string), "UTF-8", true())
};

let $input1 := "c3"
let $input2 := "28"
return (
  local:test3($input1),
  local:test3($input2)
)
I ran into this when I wanted to unescape an IRI by converting the two hex digits after the '%' to their character representation. What baffles me most is that in local:test1#1 I get the unrecognizable binary char for the xs:hexBinary call. If, however, as done in local:test3#1, the very same expression becomes part of a sequence, then I get back the desired character. And if I use convert:binary-to-string#1 I get an error, while with the 3-arity version I do not get the error, but the unreadable binary char.
How can I simply get back any character, readable by a human, from a hexadecimal value?
Hi Andreas,
I think what you are observing is the following:
In a UTF-8 encoded string, a character can span several bytes; the number of leading 1 bits in the first byte defines the length of the multi-byte sequence. Cf. https://en.wikipedia.org/wiki/UTF-8
In your example, c3 has the following bit pattern:
xs:hexBinary("c3") => convert:binary-to-integers() => for-each(convert:integer-to-base(?,2)) (: returns: 11000011 :)
The two leading 1s tell the UTF-8 decoder to read a second byte, which is missing. Hence decoding fails with an error or, if you use the fallback option, returns a �.
When decoding ASCII, where only 7 bits are used (code points 0 to 127), this is no problem, as UTF-8 shares these character positions with the ASCII table.
Your "C3" character, however, is not in ASCII but most probably ISO-8859-1 or CP1252? So while a glance at https://tools.ietf.org/html/rfc3986 says URI characters should be encoded in UTF-8, in practice chances are you will encounter values that are encoded using some "local" encoding.
If your string is not UTF-8 encoded, you can only guess what the correct encoding is.
You could send a predefined string that is known to be two bytes long in UTF-8, such as ä. It will be converted either to "%C3%A4" if it is Unicode (UTF-8), or to a well-known single byte such as "E4" in ISO-8859-1. Depending on what you receive from your client for that string, you can assume it encodes its data as either UTF-8 or Latin-1.
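To make the difference concrete, here is a minimal sketch, assuming BaseX with its built-in convert module. It decodes a single ASCII byte, a complete two-byte sequence, and a lone lead byte. The last expression is only an illustration of how an unescape step could hand a whole run of %XX escapes to the decoder at once; the replace() call is my assumption, not something from your code.

```xquery
(
  (: plain ASCII byte: decodes on its own, here to "(" :)
  convert:binary-to-string(xs:hexBinary("28"), "UTF-8"),

  (: complete two-byte sequence 11000011 10100100: decodes to "ä" :)
  convert:binary-to-string(xs:hexBinary("C3A4"), "UTF-8"),

  (: lone lead byte: with the fallback flag set, a replacement
     character is returned instead of a decoding error :)
  convert:binary-to-string(xs:hexBinary("C3"), "UTF-8", true()),

  (: illustrative unescape step: strip the % signs of a whole run
     of escapes and decode the bytes together, not one by one :)
  convert:binary-to-string(xs:hexBinary(replace("%C3%A4", "%", "")), "UTF-8")
)
```

So if the IRI really is UTF-8, the bytes of consecutive escapes should be collected first and decoded together.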
You can check what your string would be encoded to:
string(convert:string-to-hex('ä',"latin1"))
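If you want to automate the guess, one possible heuristic (my assumption, not a rule from any spec) is to try strict UTF-8 first and fall back to ISO-8859-1 when decoding raises an error. A rough sketch, where local:decode-guess is a hypothetical helper name:

```xquery
(: hypothetical helper: try strict UTF-8, otherwise read the bytes as ISO-8859-1 :)
declare function local:decode-guess($hex as xs:string) as xs:string {
  let $bytes := xs:hexBinary($hex)
  return try {
    convert:binary-to-string($bytes, "UTF-8")
  } catch * {
    (: reached when the bytes are not valid UTF-8 :)
    convert:binary-to-string($bytes, "ISO-8859-1")
  }
};

(
  local:decode-guess("C3A4"),  (: valid UTF-8 :)
  local:decode-guess("E4")     (: not valid UTF-8 on its own, read as ISO-8859-1 :)
)
```

Keep in mind that this can still guess wrong: the two bytes C3 A4 are valid in both encodings, and in ISO-8859-1 they would stand for two separate characters.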
Sorry for the long mail, hope the explanation is useful for you, even though the solution is not sooo simple and involves guessing :-)
Best Michael
On 09.06.2019 at 17:09, Andreas Mixich mixich.andreas@gmail.com wrote:
How can I simply get back any character, readable by a human, from a hexadecimal value?
basex-talk@mailman.uni-konstanz.de