Dear Christian,
Thank you very much for this solution! It works perfectly. I use it on fairly short texts, so it is very fast.
However, I have a follow-up question: I sometimes have dates (for example 1600) or groups of words. In these two cases only, the automatic encoding does not work.
Do you have any tips?
Thank you so much. Philippe
On 06/05/2021 at 16:37, Christian Grün wrote:
Hi Philippe,
here’s one way to solve your challenge:
let $csv := csv:doc('exemple.csv')
let $doc := doc('exemple1.xml')
return $doc update {
  for $text in .//text()
  return replace node $text with (
    for $token in analyze-string($text, '\p{L}+')/*
    let $record := $csv//record[entry[1] = $token]
    return if ($record) then (
      element { data($record/entry[2]) } { data($token) }
    ) else (
      text { $token }
    )
  )
}
In the update block, all text nodes are processed and replaced by a sequence of new element and text nodes. The text is tokenized with fn:analyze-string, so that the non-letter characters are not lost. If a CSV entry exists for a token, the token is wrapped in a new element; otherwise, the original token is kept.
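To illustrate the tokenization step, here is a minimal sketch of what fn:analyze-string returns for a short input (the match/non-match element names are part of the function's standard result format, in the fn namespace):

```xquery
analyze-string('Cras vitae, lectus.', '\p{L}+')
(: yields (fn namespace omitted):
   <analyze-string-result>
     <match>Cras</match><non-match> </non-match>
     <match>vitae</match><non-match>, </non-match>
     <match>lectus</match><non-match>.</non-match>
   </analyze-string-result>
:)
```

Because punctuation and whitespace are preserved as non-match nodes, the rebuilt text stays identical apart from the newly wrapped tokens.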
If large texts need to be processed, it may be advisable to organize all CSV entries in a map:
let $words := map:merge(
  csv:doc('exemple.csv')/csv/record !
    map:entry(data(entry[1]), data(entry[2]))
)
return doc('exemple1.xml') update {
  .//text() ! (replace node . with (
    for $token in analyze-string(., '\p{L}+')/*
    let $name := $words($token)
    return if ($name) then (
      element { $name } { data($token) }
    ) else (
      text { $token }
    )
  ))
}
Alternative solutions are appreciated.
Hope this helps, salutations, Christian
On Thu, May 6, 2021 at 3:40 PM Philippe Pons philippe.pons@college-de-france.fr wrote:
Hi,
This may be more of an XQuery question than a BaseX question, but I'll try asking it here anyway; I can of course remove it if necessary.
I have an XML document with a sequence of paragraphs, and a CSV spreadsheet (examples 1 and 2 below). The CSV lists terms in its first column (which occur in the XML) and, in its second, the name of the index element in which each term should be wrapped.
I use BaseX and XQuery a lot, and I have been looking for a way to automatically encode the terms in the TEI file, without success so far.
I've tried to rely on the ft:mark function to achieve this, but it doesn't completely fit my needs.
Maybe you could give me some advice on how to get the result I'm looking for (example 3)?
With kind regards, Philippe
Example 1, XML:
<div>
  <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc quis nisl ultrices, viverra erat id, volutpat magna. Mauris cursus tellus non nisi commodo faucibus. Pellentesque condimentum feugiat sem quis interdum. Vestibulum tempus lectus a augue viverra molestie. Ut condimentum vehicula nisi, vel tincidunt mauris accumsan malesuada. Aliquam quis facilisis justo. Proin convallis eget enim vel eleifend. Nullam faucibus ultricies diam, iaculis feugiat odio condimentum non. Cras vitae dignissim lectus, in pellentesque est. Vivamus pharetra semper magna, sed sodales dui porttitor in. Pellentesque eget sodales quam, et dignissim velit. Aliquam vulputate pulvinar cursus. Phasellus commodo nibh a diam imperdiet cursus. Maecenas dui orci, aliquet quis porttitor non, auctor fringilla leo.</p>
  <p>Vestibulum venenatis velit in imperdiet iaculis. Vivamus consectetur mollis augue, ac efficitur nisl ultrices eu. Cras commodo eleifend mi a luctus. Vivamus sed odio maximus, laoreet mauris quis, tincidunt lorem. Duis non elementum tortor. Nam hendrerit dolor ac interdum condimentum. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia curae; Aenean vitae massa commodo, condimentum quam in, malesuada massa. Maecenas finibus convallis erat at aliquet. Aliquam erat volutpat. Praesent ligula nisi, tempus id arcu id, scelerisque condimentum dolor. Sed quis tincidunt sem. Nulla ac ex hendrerit, ullamcorper leo et, sodales nulla.</p>
</div>
Example 2, CSV:
Vestibulum,placeName
Cras,persName
condimentum,persName
Example 3, Results:
<div>
  <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc quis nisl ultrices, viverra erat id, volutpat magna. Mauris cursus tellus non nisi commodo faucibus. Pellentesque <persName>condimentum</persName> feugiat sem quis interdum. <placeName>Vestibulum</placeName> tempus lectus a augue viverra molestie. Ut <persName>condimentum</persName> vehicula nisi, vel tincidunt mauris accumsan malesuada. Aliquam quis facilisis justo. Proin convallis eget enim vel eleifend. Nullam faucibus ultricies diam, iaculis feugiat odio <persName>condimentum</persName> non. <persName>Cras</persName> vitae dignissim lectus, in pellentesque est. Vivamus pharetra semper magna, sed sodales dui porttitor in. Pellentesque eget sodales quam, et dignissim velit. Aliquam vulputate pulvinar cursus. Phasellus commodo nibh a diam imperdiet cursus. Maecenas dui orci, aliquet quis porttitor non, auctor fringilla leo.</p>
  <p><placeName>Vestibulum</placeName> venenatis velit in imperdiet iaculis. Vivamus consectetur mollis augue, ac efficitur nisl ultrices eu. <persName>Cras</persName> commodo eleifend mi a luctus. Vivamus sed odio maximus, laoreet mauris quis, tincidunt lorem. Duis non elementum tortor. Nam hendrerit dolor ac interdum <persName>condimentum</persName>. <placeName>Vestibulum</placeName> ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia curae; Aenean vitae massa commodo, <persName>condimentum</persName> quam in, malesuada massa. Maecenas finibus convallis erat at aliquet. Aliquam erat volutpat. Praesent ligula nisi, tempus id arcu id, scelerisque <persName>condimentum</persName> dolor. Sed quis tincidunt sem. Nulla ac ex hendrerit, ullamcorper leo et, sodales nulla.</p>
</div>
Hi Philippe,
However, I have a subsidiary question: I sometimes have dates (for example 1600) […]
For dates, you could choose another regex. I used '\p{L}+' to match letters; you could try '[\p{L}\p{N}]+'. See [1] for more information on Unicode classes.
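As a quick sketch of the extended character class: with '[\p{L}\p{N}]+', a year such as 1600 is returned as a match rather than ending up in a non-match node.

```xquery
analyze-string('anno 1600', '[\p{L}\p{N}]+')
(: "anno" and "1600" are both matches;
   the space between them becomes a non-match :)
```

The same pattern can simply replace '\p{L}+' in the queries above.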
[…] or groups of words.
If you want to support groups of words, you may first need to think about how you want to handle conflicting matches. For example, we could have the following CSV input…
A,persName
A B,placeName
…and the following XML input:
<p>A B</p>
What should the result look like?
You could try the following approach, which iterates over all CSV records and repeatedly generates modified copies of the document:
let $csv := csv:doc('exemple.csv')
let $doc := doc('exemple1.xml')
return fold-left($csv//record, $doc, function($result, $record) {
  $result update {
    for $text in .//text()
    return replace node $text with (
      for $token in analyze-string($text, $record/entry[1], 'iq')/*
      return if (name($token) = 'match') then (
        element { data($record/entry[2]) } { data($token) }
      ) else (
        text { $token }
      )
    )
  }
})
With the third argument of analyze-string, the search will be case-insensitive and literal (i.e., the search string will not be interpreted as a regex pattern). For the example above, the following result will be created:
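A minimal sketch of what the two flags do in isolation: 'q' switches off regex metacharacters, and 'i' ignores case.

```xquery
analyze-string('1+1 and A+B', 'a+b', 'iq')
(: with 'q', "a+b" is matched as a literal string, not as the
   regex "one or more a, then b"; with 'i', case is ignored,
   so "A+B" is the only match and "1+1 and " is a non-match :)
```

Without the 'q' flag, a CSV entry containing characters such as '+', '(' or '.' would be misread as a pattern.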
<p><persName>A</persName> B</p>
If you switch the two CSV records…
A B,placeName
A,persName
…you will get:
<p><placeName><persName>A</persName> B</placeName></p>
Salutations, Christian