Re: [basex-talk] automatic encoding

11 May 2021


Hi Philippe,
...
> However, I have a subsidiary question: I sometimes have dates (for example 1600) […]
For dates, you could choose another regex. I used '\p{L}+' to match
letters; you could try '[\p{L}\p{N}]+'. See [1] for more information
on Unicode classes.
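As a quick check of the extended pattern, here is a small sketch. The
sample string is made up for illustration, and local-name() is used so
that the test does not depend on a namespace prefix:

```xquery
(: split a sample string into matches (letters/digits) and non-matches :)
for $token in analyze-string('Paris 1600, Lyon', '[\p{L}\p{N}]+')/*
return local-name($token) || ': "' || $token || '"'
```

This should report 'Paris', '1600' and 'Lyon' as matches, and the space
and ', ' in between as non-matches.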
...
> […] or groups of words.
If you want to support groups of words, you may first need to think
about how you want to handle conflicting states. For example, we could
have the following CSV input…
A,persName
A B,placeName
…and the following XML input:
<p>A B</p>
What should the result look like?
You could try the following approach, which iterates over all CSV
records and repeatedly generates modified copies of the document:
let $csv := csv:doc('exemple.csv')
let $doc := doc('exemple1.xml')
(: process one CSV record at a time; each step works on the previous result :)
return fold-left($csv//record, $doc, function($result, $record) {
  $result update {
    for $text in .//text()
    return replace node $text with (
      (: entry[1]: search string, entry[2]: name of the wrapping element :)
      for $token in analyze-string($text, $record/entry[1], 'iq')/*
      return if (name($token) = 'match') then (
        element { data($record/entry[2]) } { data($token) }
      ) else (
        text { $token }
      )
    )
  }
})
The third argument of analyze-string, 'iq', makes the search
case-insensitive (i) and literal (q), i.e., the search string is not
interpreted as a regex pattern. For the example above, the following
result will be created:
<p><persName>A</persName> B</p>
If you switch the two CSV records…
A B,placeName
A,persName
…you will get:
<p><placeName><persName>A</persName> B</placeName></p>
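To illustrate the effect of the 'q' flag mentioned above, here is a
minimal sketch with made-up strings (not taken from your data). The
search string 'a.b' contains a dot, which is a regex metacharacter:

```xquery
(
  (: regex search: "." matches any character, so 'AxB' is found :)
  matches('AxB', 'a.b', 'i'),
  (: literal search: the dot must occur as such, so 'AxB' is not found :)
  matches('AxB', 'a.b', 'iq'),
  (: literal search, still case-insensitive: 'A.B' is found :)
  matches('A.B', 'a.b', 'iq')
)
```

This returns true, false, true. If your CSV entries may contain
characters such as '.', '(' or '+', the literal mode is probably the
safer choice.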
Salutations,
Christian
[1] https://www.regular-expressions.info/unicode.html
