Hi all,
I’ve been working for some years on a digital edition of the works of a German author of an earlier period. My transcription of these works contains many Gothic characters, such as the old German long s (Unicode: LATIN SMALL LETTER LONG S). For example: Büchſe (more precisely: Buͤchſe).
The goal of my full-text search is that a user who asks for „Büchse“ gets „Büchse“ AND „Büchſe“ (with long s); ideally, she should also get „Buͤchſe“. How can I make //text[. contains text { }] treat s and ſ, and ü and uͤ, as the same character?
Thanks a lot for any help.
Best regards, Guenter
Hi Guenter,
you could have a look at the fn:matches [1] function and work with regular expressions to perform this task.
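As a rough sketch (not tested; the //text path and the way the pattern is built are just assumptions for illustration), you could expand the plain search term into a regular expression that accepts both character variants:

let $term := 'Büchse'
let $pattern := $term
  => replace('s', '[sſ]')
  => replace('ü', '(ü|uͤ)')
return //text[matches(., $pattern)]

Note that matches() will most likely not make use of the full-text index, so this can get slow on larger collections.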
Best regards,
Markus
[1] http://www.xqueryfunctions.com/xq/fn_matches.html
Hi Günter,
You can take advantage of the Unicode normalization features of XQuery:
declare function local:normalize($string) {
  $string
  => normalize-unicode('NFKD')
  => replace('\p{IsCombiningDiacriticalMarks}', '')
};

for $text in ('Büchſe', 'Buͤchſe')
return local:normalize($text) contains text 'Büchse'
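(Under NFKD, the long s ſ is compatibility-decomposed to a plain s, and both ü and uͤ end up as u followed by a combining mark, which the replace() call then strips, so both test strings normalize to the same text.)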
In a future version of BaseX, we want to incorporate Unicode decomposition into the XQuery Full Text tokenizer. For now, if you want to speed up your queries with an index, you can create a custom index structure in which all text strings are stored in a normalized representation [1].
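A minimal sketch of such a structure, assuming your texts live in a database called 'edition' and reusing the local:normalize function from above (the database name 'edition-normalized', the //text path and the id attribute are only placeholders for illustration):

declare function local:normalize($string) {
  $string
  => normalize-unicode('NFKD')
  => replace('\p{IsCombiningDiacriticalMarks}', '')
};

(: create a second, full-text indexed database that stores a normalized
   copy of every text element plus the node id of the original :)
db:create(
  'edition-normalized',
  <texts>{
    for $t in db:open('edition')//text
    return <text id="{ db:node-id($t) }">{ local:normalize(string($t)) }</text>
  }</texts>,
  'texts.xml',
  map { 'ftindex': true() }
)

Searches can then be run against 'edition-normalized' with contains text, and the id attribute leads back to the original nodes via db:open-id.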
Hope this helps,
Christian
[1] http://docs.basex.org/wiki/Indexes#Custom_Index_Structures