Update in case someone finds this helpful:
The mangled special characters were caused by the CSV file not actually being in the expected encoding. I had assumed that when I exported the Excel sheet to CSV and selected UTF, the CSV file would be in UTF-8; sadly, that is not the case. Running the file command on the CSV told me that it was ISO-8859 encoded. So I opened the CSV file in Notepad and saved it again with UTF-8 encoding. When I uploaded it into BaseX, the special characters now display properly.
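For anyone who would rather script the check and the conversion than go through Notepad, here is a minimal Python sketch. It assumes the source encoding is ISO-8859-1 (the file command only reported "ISO-8859", so the exact part is a guess), and the function names and the sample row are hypothetical, made up for illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def reencode_to_utf8(data: bytes, source_encoding: str = "iso-8859-1") -> bytes:
    """Decode with the assumed source encoding, then re-encode as UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


# Hypothetical sample row: in ISO-8859-1 the ñ is the single byte 0xF1,
# which is not a valid UTF-8 sequence on its own.
latin1_row = "Cañelas".encode("iso-8859-1")
utf8_row = reencode_to_utf8(latin1_row)

print(is_valid_utf8(latin1_row))  # the ISO-8859-1 bytes fail the UTF-8 check
print(is_valid_utf8(utf8_row))    # the re-encoded bytes pass it
```

For a real file, the same two functions could be applied to the bytes read with open(path, "rb"), writing the result back with open(path, "wb").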
Bit
To make sure the file is actually converted to UTF-8, open the file in Notepad and
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On May 18, 2018 8:14 PM, Lizzi, Vincent Vincent.Lizzi@taylorandfrancis.com wrote:
Hi Bit,
The problem may have to do with the character encoding. Try providing the “encoding” option, e.g.
csv:parse($file, map{ "encoding": "windows-1252" })
I’d also like to call your attention to this module which provides a way to read Excel files directly from XQuery without the intermediary step of saving to CSV.
https://github.com/eliudmeza/OOXML-Library-XQuery-BaseXdb
I hope this is of some help.
Vincent
From: basex-talk-bounces@mailman.uni-konstanz.de [mailto:basex-talk-bounces@mailman.uni-konstanz.de] On Behalf Of BitRider001
Sent: Thursday, May 17, 2018 10:56 PM
To: Eliot Kimber ekimber@contrext.com
Cc: BaseX basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] about special characters
Update:
I found a way to export the Excel sheet into XML then created a new database and pointed to the XML file. This returned the results with the correct special characters.
My guess is it may have something to do with the CSV Parser.
Thanks,
Bit
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On May 18, 2018 10:11 AM, BitRider001 bit.rider.001@pm.me wrote:
Hi Eliot,
I loaded it by first creating a new database and pointing to the CSV file as input. The default encoding as far as I can tell is UTF-8 as shown in the attached screenshot. The CSV file was exported from Excel in UTF-8 encoding.
Perplexed,
Bit
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On May 18, 2018 9:53 AM, Eliot Kimber ekimber@contrext.com wrote:
That mangled string is the result of reading UTF-8 byte sequences as single-byte characters, e.g. ASCII or some Windows code page.
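As a rough illustration of that failure mode, here is a small Python sketch. It is not a claim about what BaseX does internally; the sample string comes from the original report, and windows-1252 is just one example of a single-byte code page.

```python
original = "Cañelas"

# In UTF-8 the ñ is stored as two bytes: 0xC3 0xB1.
utf8_bytes = original.encode("utf-8")

# Reading those bytes with a single-byte code page turns each byte
# into its own character, so the ñ becomes two unrelated characters.
mangled = utf8_bytes.decode("windows-1252")
print(mangled)

# If nothing was lost along the way, reversing the two steps restores the text.
repaired = mangled.encode("windows-1252").decode("utf-8")
print(repaired)
```

The mangled string is one character longer than the original, which is a quick way to recognise this kind of mix-up.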
How are you loading it into BaseX? It seems unlikely that BaseX-provided code would make this kind of basic mistake in reading text but it’s possible it applied the incorrect encoding for some reason.
Cheers,
Eliot
--
Eliot Kimber
From: basex-talk-bounces@mailman.uni-konstanz.de on behalf of BitRider001 bit.rider.001@pm.me
Reply-To: BitRider001 bit.rider.001@pm.me
Date: Thursday, May 17, 2018 at 8:34 PM
To: Bridger Dyson-Smith bdysonsmith@gmail.com
Cc: "basex-talk@mailman.uni-konstanz.de" basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] about special characters
Bridger,
Indeed the file was exported from Excel in UTF-8 encoding. I've tried opening the CSV file using Notepad/Wordpad and in Linux with vi in a terminal and in both situations it displays the correct special character.
It's only when I load it into a BaseX db and query it that it shows up, as you said, as "mangled". Saving the results into a text file also produces the "mangled" string.
Strange.
Bit
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On May 18, 2018 9:21 AM, Bridger Dyson-Smith bdysonsmith@gmail.com wrote:
Bit -
that's odd; it looks like the characters are being decomposed (or whatever the term is) and mangled but I'm not sure, unfortunately. Was the CSV an export from Excel? If so, I suppose this could be a Windows character set problem (cp-1252 or iso-8859-1 or something?).
Bridger
On Thu, May 17, 2018 at 9:11 PM BitRider001 bit.rider.001@pm.me wrote:
Hi Bridger,
Yes that is right. I'm on the latest (9.0.1). Attaching a screenshot here for anyone to take a look.
Bit
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On May 18, 2018 8:41 AM, Bridger Dyson-Smith bdysonsmith@gmail.com wrote:
Hi Bit - are you using the latest version? There was a problem with 9.0 and some Unicode characters. Christian and co. have a fix in v9.0.1.
HTH,
Bridger
On Thu, May 17, 2018, 7:54 PM BitRider001 bit.rider.001@pm.me wrote:
> Hi,
>
> I just joined the mailing list due to a problem I'm having displaying and storing special characters.
>
> I started with a CSV and created a database from it and the CSV is in UTF-8. However, when I query the special characters become garbled. I'm using the GUI in Windows 10.
>
> It starts with this in the CSV:
>
> <name>Cañelas</name>
>
> Then ends up with this when I export the query result into a text file:
>
> <name>Ca�las</name>
>
> Help please.
>
> Bit