Hi Kristian,
With HTML, there are various ways to specify the document encoding (e.g., a byte order mark, an XML declaration, or a Content-Type meta element). With text files, if fetch:text or file:read-text is used, only the byte order mark (e.g., EF BB BF for UTF-8) will be considered, as it is the only indicator that uniquely identifies the file encoding.
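For completeness, that BOM check can also be done by hand with the File and Binary modules. A minimal sketch (the path is a placeholder, and the file is assumed to have at least three bytes):

    (: read the first three bytes and compare them with the UTF-8 byte order mark :)
    let $file := '/path/to/file.csv'
    let $head := file:read-binary($file, 0, 3)
    return
      if ($head = bin:hex('EFBBBF'))
      then fetch:text($file, 'UTF-8')
      else fetch:text($file)  (: fall back to the default encoding :)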
As you may know, it’s often impossible to guess the exact encoding of a text file. But you can always use external tools for that, such as chardetect, which performs statistical analysis on the input (it’s based on Mozilla’s charset detector [1]). The guessed encoding can then be passed on to fetch:text:
(: chardetect prints e.g. "/path/to/file.csv: utf-8 with confidence 0.99",
   so the encoding name needs to be extracted from its output :)
let $file := '/path/to/file.csv'
let $output := proc:system('chardetect', $file)
let $encoding := normalize-space(
  substring-before(substring-after($output, ': '), ' with')
)
let $string := fetch:text($file, $encoding)
return csv:parse($string)
Hope this helps, Christian
[1] https://www-archive.mozilla.org/projects/intl/chardet.html
On Mon, May 24, 2021 at 9:23 AM Kristian Kankainen <kristian@keeleleek.ee> wrote:
Hi folks,
I am aware that the HTML module can guess a file's encoding by itself if the input is supplied in binary format:
If the input encoding is unknown, the data to be processed can be passed on in its binary representation. The HTML parser will automatically try to detect the correct encoding:
Query
html:parse(fetch:binary("https://en.wikipedia.org"))
But is there a way to guess the encoding of CSV files? So far, I have tried the Fetch and CSV modules without success. I have a huge bunch of CSV files, and they are all in different encodings. Maybe it is possible to pipe the content of fetch:binary to a system command that guesses the encoding, and then use the result to read in the CSV?
Best regards, Kristian Kankainen