Hello --
I have some mainframe files which start off in no known encoding. Using BaseX 10.6, I'm trying to use the bin module to make some character substitutions so the content of these files can be UTF-8.
let $charMap as map(*) := map {
  33: 93,   (: exclamation point ! to close bracket ] :)
  162: 91,  (: cent-sign ¢ to open bracket [ :)
  124: 33,  (: pipe character | to exclamation point ! :)
  160: 32,  (: non-breaking space to plain space :)
  26: 32    (: U+001A SUBSTITUTION CHARACTER at the end of the file; do not want :)
}
let $fromList as xs:integer+ := map:keys($charMap)
let $fileList as xs:string+ := file:children($localPath)
for $x in $fileList
return (
  file:read-binary($x)
  ! bin:to-octets(.)
  ! (if (. = $fromList) then $charMap(.) else .)
  ! bin:from-octets(.)
  ! bin:decode-string(., 'UTF-8')
) => string-join('')
Four of the five sample files work; one of them returns "Decoding error: xff"
If I restrict the process to the problematic file and use

return (
  file:read-binary($x)
  ! bin:to-octets(.)
  ! (if (. = $fromList) then $charMap(.) else .)
) => distinct-values() => sort()

I don't find a 255 value. And I'm pretty sure all the codepoints I do have are simple, less than 255, single octet UTF-8 characters.
Any suggestions for what I ought to be looking at?
Thanks! Graydon
In UTF-8, characters with code points in the range 128 to 255 are encoded using 2 octets, so a single octet in that range cannot be decoded as UTF-8 on its own.
If you can, map the input chars to an 8-bit character set such as CP-1252 and use bin:decode-string(., 'cp-1252').
Gerrit
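To make the point about octet counts concrete, here is a minimal sketch. It assumes the EXPath Binary Module functions that BaseX binds to the bin prefix, and it assumes the processor accepts the encoding name 'ISO-8859-1', which is used here for illustration in place of CP-1252 (the two agree for 160 to 255 but differ for 128 to 159). The variable names are made up for the example.

```xquery
(: The cent sign, U+00A2, built from its code point so the example
   does not depend on the encoding of the query file itself. :)
let $cent := codepoints-to-string(162)

(: Encoded as UTF-8 it occupies two octets, 194 and 162. :)
let $utf8-octets := bin:to-octets(bin:encode-string($cent, 'UTF-8'))

(: A lone octet 162 is not valid UTF-8 on its own, but it is a complete
   character in a single-byte encoding such as ISO-8859-1. :)
let $from-latin1 := bin:decode-string(bin:from-octets(162), 'ISO-8859-1')

return (
  $utf8-octets,          (: 194, 162 :)
  $from-latin1 = $cent   (: true() :)
)
```

Since the ! operator in the original query passes one octet at a time to bin:from-octets and bin:decode-string, any octet of 128 or above that survives the substitution would presumably reach the UTF-8 decoder alone and fail there.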
On 19.07.2023 05:48, Graydon Saunders wrote:
-- Graydon Saunders | graydonish@fastmail.com
Þæs oferéode, ðisses swá mæg. -- Deor ("That passed, so may this.")
Hi Gerrit -
That was a useful hint; thank you!
I do have characters legitimately past 128, but thankfully I think I can be confident they're all below 256. I realized I was falling into an expectation of symmetry: once I've got the codepoints, there's no need to treat the data as binary, and

for $x in $fileList
return (
  file:read-binary($x)
  ! bin:to-octets(.)
  ! (if (. = $fromList) then $charMap(.) else .)
  ! codepoints-to-string(.)
) => string-join('')

works.
If the full source DOES have anything past 256 in it I might be in trouble, but so far, so good.
Thank you! Graydon
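The working approach can be sketched as a small self-contained query, assuming a hypothetical helper named local:remap and a hard-coded octet sequence standing in for file:read-binary. Because bin:to-octets only ever yields integers from 0 to 255, codepoints-to-string treats each remaining octet as the Latin-1 character with the same number; whether that is the right reading of the mainframe bytes is an assumption about the data.

```xquery
declare variable $charMap as map(xs:integer, xs:integer) := map {
  33: 93,   (: ! to ] :)
  162: 91,  (: cent sign to [ :)
  124: 33,  (: | to ! :)
  160: 32,  (: non-breaking space to space :)
  26: 32    (: U+001A to space :)
};

(: Hypothetical helper: substitute mapped octets, pass the rest through,
   and build the string in one call instead of one call per octet. :)
declare function local:remap($octets as xs:integer*) as xs:string {
  codepoints-to-string(
    for $o in $octets
    return ($charMap($o), $o)[1]
  )
};

(: Sample octets: H i ! NBSP cent SUB :)
local:remap((72, 105, 33, 160, 162, 26))
(: returns "Hi] [ " :)
```

For real files, file:read-binary($x) => bin:to-octets() => local:remap() would replace the sample sequence. Octets that are not valid XML characters, such as 0, would make codepoints-to-string raise an error, so such values would need their own entry in the map.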
On Wed, Jul 19, 2023 at 1:52 AM Imsieke, Gerrit, le-tex <gerrit.imsieke@le-tex.de> wrote:
-- Gerrit Imsieke, Managing Director
le-tex publishing services GmbH
Weissenfelser Str. 84, 04229 Leipzig, Germany
Phone +49 341 355356 110, Fax +49 341 355356 510
gerrit.imsieke@le-tex.de, http://www.le-tex.de
Commercial Register: Amtsgericht Leipzig, Registration Number: HRB 24930
Managing Directors: Gerrit Imsieke, Svea Jelonek, Thomas Schmidt
Visit us at the Frankfurt Book Fair in Hall 4.0, G94.
Bad error messages are worse than no error messages at all [1]. I’ve tackled that. If you now provide the single hex value F0 as input, the latest snapshot gives a more intuitive error message:
"Invalid UTF-8 character encoding: F0, ??."
Best, Christian
[1] The 0xff (-1) byte is an internal indicator that no bytes are left to construct a valid Unicode character. The old error message wrongly suggested that it was provided by the input.
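The F0 case described above can be reproduced with a short query. This is a sketch assuming bin:hex from the EXPath Binary Module and the standard try/catch error variables; the exact error code and wording depend on the BaseX version, so no expected output is shown.

```xquery
(: F0 announces a four-octet UTF-8 sequence, but no further octets follow,
   so decoding is expected to fail. :)
try {
  bin:decode-string(bin:hex('F0'), 'UTF-8')
} catch * {
  (: Report whatever the processor raised instead of asserting a message. :)
  'caught: ' || $err:description
}
```

Wrapping the decode step of a larger query in the same try/catch, and including the offending octet in the returned string, is one way to find which input value triggers the error.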
On Wed, Jul 19, 2023 at 4:39 PM Graydon Saunders <graydonish@gmail.com> wrote:
basex-talk@mailman.uni-konstanz.de