Hi Philippe,
However, I have a subsidiary question: I sometimes have dates (for example 1600) […]
For dates, you could choose another regex. I used '\p{L}+' to match letters; you could try '[\p{L}\p{N}]+'. See [1] for more information on Unicode classes.
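As a quick check of the extended pattern, here is a minimal sketch with a made-up input string (the string 'Paris, 1600' is only an illustration, not taken from your data):

```xquery
(: Tokens consisting of letters and/or digits are reported as matches;
   everything in between (punctuation, whitespace) as non-matches. :)
for $token in analyze-string('Paris, 1600', '[\p{L}\p{N}]+')/*
return local-name($token) || ': ' || $token
```

This should list "Paris" and "1600" as matches and the ", " in between as a non-match, so a date such as 1600 is treated like any other word.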
[…] or groups of words.
If you want to support groups of words, you may first need to think about how you want to handle conflicting matches. For example, we could have the following CSV input…
A,persName
A B,placeName
…and the following XML input:
<p>A B</p>
What should the result look like?
You could try the following approach, which iterates over all CSV records and repeatedly generates modified copies of the document:
let $csv := csv:doc('exemple.csv')
let $doc := doc('exemple1.xml')
(: apply one CSV record at a time; each pass works on the result of the previous one :)
return fold-left($csv//record, $doc, function($result, $record) {
  $result update {
    for $text in .//text()
    return replace node $text with (
      for $token in analyze-string($text, $record/entry[1], 'iq')/*
      return if (local-name($token) = 'match') then (
        (: wrap each match in an element named after the second CSV column :)
        element { data($record/entry[2]) } { data($token) }
      ) else (
        text { $token }
      )
    )
  }
})
The third argument of analyze-string, 'iq', makes the search case-insensitive (i) and literal (q), i.e., the search string will not be interpreted as a regex pattern. For the example above, the following result will be created:
<p><persName>A</persName> B</p>
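To see what the 'q' flag changes, here is a small sketch with a hypothetical search string that contains a regex metacharacter (both the input text and the search string are invented for illustration):

```xquery
(: With 'q', the dot is taken literally; without it, the dot matches any character. :)
let $text := 'A.B and AxB'
return (
  'literal: ' || string-join(analyze-string($text, 'a.b', 'iq')/fn:match, ', '),
  'regex:   ' || string-join(analyze-string($text, 'a.b', 'i')/fn:match, ', ')
)
```

The literal search should only find "A.B", whereas the regex search also finds "AxB". This matters as soon as your CSV entries contain characters such as dots or parentheses.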
If you switch the two CSV records…
A B,placeName
A,persName
…you will get:
<p><placeName><persName>A</persName> B</placeName></p>
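If nested annotations like this are not what you want, one possible policy (only a sketch, and an assumption about your requirements) is to process longer search strings first and to skip text that is already wrapped in one of the CSV element names. The inline $csv and $doc values below are stand-ins for exemple.csv and exemple1.xml, and the update syntax is BaseX-specific:

```xquery
(: Inline stand-ins for the CSV and XML inputs (hypothetical data). :)
let $csv :=
  <csv>
    <record><entry>A</entry><entry>persName</entry></record>
    <record><entry>A B</entry><entry>placeName</entry></record>
  </csv>
let $doc := <p>A B</p>
(: all element names that the CSV can generate :)
let $names := distinct-values($csv//record/entry[2])
(: longest search string first, so that 'A B' wins over 'A' :)
let $records :=
  for $record in $csv//record
  order by string-length($record/entry[1]) descending
  return $record
return fold-left($records, $doc, function($result, $record) {
  $result update {
    (: ignore text that is already inside an annotation element :)
    for $text in .//text()[not(ancestor::*[name() = $names])]
    return replace node $text with (
      for $token in analyze-string($text, $record/entry[1], 'iq')/*
      return if (local-name($token) = 'match') then (
        element { data($record/entry[2]) } { data($token) }
      ) else (
        text { $token }
      )
    )
  }
})
```

With this policy, the example should yield <p><placeName>A B</placeName></p> regardless of the order of the CSV records. Whether "longest match wins" is the right rule depends on your data.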
Salutations, Christian