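In UTF-8, the code points 128 to 255 are each encoded using 2 octets, so a single octet in that range is not valid UTF-8 on its own. Your query decodes every octet separately (each step after "!" runs once per item), so any octet above 127 that survives the substitution will raise a decoding error.
If you can, map the input octets to an 8-bit character set such as CP-1252 and decode the whole result once, using bin:decode-string(., 'cp-1252').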
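
A minimal sketch of that approach, reusing the map from your query. It is untested against your files and rests on a few assumptions: $localPath below is a placeholder, the source bytes are close enough to Windows-1252 for that decoding to be meaningful, and the encoding name 'windows-1252' is accepted by your BaseX/Java runtime (adjust the spelling if it is not).

```xquery
(: Placeholder directory, not taken from the original query :)
let $localPath as xs:string := '/path/to/mainframe/files/'

let $charMap as map(xs:integer, xs:integer) := map {
  33: 93,   (: exclamation point ! to close bracket ] :)
  162: 91,  (: octet 0xA2 (cent sign in CP-1252) to open bracket [ :)
  124: 33,  (: pipe character | to exclamation point ! :)
  160: 32,  (: octet 0xA0 (non-breaking space) to plain space :)
  26: 32    (: U+001A SUBSTITUTE at the end of the file to plain space :)
}

for $x in file:children($localPath)
(: Substitute octet by octet; octets without a map entry pass through unchanged :)
let $octets as xs:integer* :=
  bin:to-octets(file:read-binary($x)) ! ($charMap(.), .)[1]
(: Rebuild ONE binary item per file and decode it once with a single-byte encoding :)
return bin:decode-string(bin:from-octets($octets), 'windows-1252')
```

The difference from the original query is that bin:from-octets() and bin:decode-string() are called once per file rather than once per octet, and that the decoding uses a single-byte encoding, so octets above 127 no longer have to form valid UTF-8 sequences. If a file contains other control characters that are not allowed in XML strings, they would still need an entry in the map.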
Gerrit
On 19.07.2023 05:48, Graydon Saunders wrote:
> Hello --
>
> I have some mainframe files which start off in no-known-encoding. Using BaseX 10.6, I'm trying to use the bin module to make some character substitutions so the content of these files can be UTF-8.
>
> let $charMap as map(*) := map {
> 33: 93, (: exclamation point ! to close bracket ] :)
> 162: 91, (: cent-sign ¢ to open bracket [ :)
> 124: 33, (: pipe character | to exclamation point ! :)
> 160: 32, (: non-breaking space to plain space :)
> 26: 32 (: U+001A SUBSTITUTION CHARACTER at the end of the file; do not want :)
> }
> let $fromList as xs:integer+ := map:keys($charMap)
>
> let $fileList as xs:string+ := file:children($localPath)
>
> for $x in $fileList
> return (file:read-binary($x) ! bin:to-octets(.) ! (if (. = $fromList) then $charMap(.) else .) ! bin:from-octets(.) ! bin:decode-string(.,'UTF-8')) => string-join('')
>
> Four of the five sample files work; one of them returns "Decoding error: xff"
>
> If I restrict the process to the problematic file and use
> return (file:read-binary($x) ! bin:to-octets(.) ! (if (. = $fromList) then $charMap(.) else .)) => distinct-values() => sort()
>
> I don't find a 255 value. And I'm pretty sure all the codepoints I do have are simple, less than 255, single octet UTF-8 characters.
>
> Any suggestions for what I ought to be looking at?
>
> Thanks!
> Graydon
>
> --
> Graydon Saunders | graydonish@fastmail.com
> Þæs oferéode, ðisses swá mæg.
> -- Deor ("That passed, so may this.")
--
Gerrit Imsieke
Geschäftsführer / Managing Director
le-tex publishing services GmbH