Hi Kristian,
With HTML, there are various ways to specify the document encoding
(e.g., a byte order mark, an XML declaration, or a Content-Type meta
element). With plain text files, if fetch:text or file:read-text is
used, only the byte order mark (e.g., EF BB BF for UTF-8) will be
considered, as it's the only indicator that identifies the file
encoding unambiguously.
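For example, to check whether a file starts with the UTF-8 byte order
mark, something along the following lines could be used (an untested
sketch; it assumes the EXPath Binary Module is available, and the path
is just a placeholder):
(: returns true() if the first three bytes are EF BB BF :)
let $bytes := fetch:binary('/path/to/file.csv')
return string(xs:hexBinary(bin:part($bytes, 0, 3))) = 'EFBBBF'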
As you may know, it’s often impossible to determine the exact encoding
of a text file with certainty. But you can always use external tools to
make an educated guess, such as chardetect, which performs a
statistical analysis of the input (it’s based on Mozilla’s charset
detector [1]). The guessed encoding can then be passed on to fetch:text:
(: sample code, untested: chardetect is assumed to be on the PATH and to
   print a line such as "/path/to/file.csv: utf-8 with confidence 0.99" :)
let $file := '/path/to/file.csv'
let $output := proc:system('chardetect', $file)
(: extract the encoding name from the tool's output :)
let $encoding := replace(normalize-space($output), '^.*: (\S+) with confidence.*$', '$1')
let $string := fetch:text($file, $encoding)
return csv:parse($string)
Hope this helps,
Christian
[1] https://www-archive.mozilla.org/projects/intl/chardet.html

On Mon, May 24, 2021 at 9:23 AM Kristian Kankainen
<kristian@keeleleek.ee> wrote:
Hi folks,
I am aware that the HTML module can guess a file's encoding by itself if the input is provided in binary format:
If the input encoding is unknown, the data to be processed can be passed on in its binary representation. The HTML parser will automatically try to detect the correct encoding:
Query
html:parse(fetch:binary("https://en.wikipedia.org"))
But is there a way to guess the encoding of CSV files? So far I have tried with the fetch and CSV modules with no results. I have a huge bunch of CSV files, and they are all in different encodings. Maybe it is possible to pipe the content of fetch:binary to a system command that guesses the encoding, and use that to read in the CSV?
Best regards,
Kristian Kankainen