Re: [basex-talk] Full-text lemmatizing and xml:lang

30 Jun 2017

Hello,
Sorry for the slow reply; being a full-time father of two kids
is my only excuse.
Thank you for the enlightening answers. At first, creating a separate
database felt wrong and stupid, but after a while it felt just right: it
helps to organize the different language elements via aggregation instead
of composition.
Here is what I came up with:
(:~
This function takes a list of database names and optionally a list of 
language codes.
It creates separate full-text indexed databases for lemmatized searching 
of each language contained in the original database.
If the list of language codes is empty, all existing values of xml:lang
found in the database are used.
The full-text databases are named 'dbname-ft-langcode'.
Another function normalizes the texts, removes duplicate entries and
inserts xml:id attributes.
:)
declare updating function keeleleek:create-ft-indices-for-each-lang(
   $db-names as xs:string*,
   $lang-codes as xs:string*
) {
   for $db-name in $db-names
   let $langs :=
     if (exists($lang-codes))
     then $lang-codes
     else distinct-values(db:open($db-name)//@xml:lang)
   for $lang in $langs
   let $lang-group := db:open($db-name)//*[@xml:lang = $lang]
   let $ft-db-name := concat($db-name, '-ft-', $lang)
   (: create full-text db for each language :)
   return
     db:create(
       $ft-db-name,
       <texts>{$lang-group}</texts>,
       $ft-db-name,
       map { 'ftindex': true(), 'language': $lang }
     )
};
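
A minimal sketch of how one of the resulting databases might then be queried. It assumes the function above was run on a source database called 'corpus' containing elements with xml:lang="en", so that 'corpus-ft-en' exists; the database name and the search term are hypothetical. The match options are spelled out in the query itself, using standard XQuery Full Text syntax.

```xquery
(: Assumption: 'corpus-ft-en' was created by the function above.
   The name 'corpus' and the term 'house' are only for illustration. :)
for $hit in db:open('corpus-ft-en')//*[
  text() contains text 'house' using stemming using language 'en'
]
return $hit
```

The options map above sets only 'ftindex' and 'language'. As far as I know, stemming is switched off by default in BaseX, so adding 'stemming': true() to the map is probably needed if the index itself should hold stemmed tokens.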
Cheers
Kristian K
On 28.06.2017 09:45, Xavier-Laurent SALVADOR wrote:
...
Hi,
After reading Christian's answer ( :-) ), I thought it could be
interesting to sort your docs according to @xml:lang and create a new
DB next to your corpus:

(: 'input-dir' is the directory that holds the XML files :)
let $files := file:children('input-dir')[matches(., 'xml$')]
for $lang in distinct-values($files ! doc(.)//@xml:lang)
return db:create(
  'db-' || $lang,
  <root xml:lang="{$lang}">
  {
    for $file in $files
    return
    (: compare against $lang; inside the predicate, '.' would be the element itself :)
    <text src='{$file}'>{doc($file)//*[@xml:lang = $lang]//text()}</text>
  }
  </root>,
  "myfile",
  map { 'ftindex': true(), 'language': $lang }
)
----------------------------------
2017-06-27 20:49 GMT+02:00 Christian Grün <christian.gruen@gmail.com>:
Hi Kristian,

It is currently not possible to work with different languages in a
single database. This is mostly because all normalized tokens will end
up in the same internal index, and it would be a lot of effort to
diversify this software behavior.

As Xavier pointed out (thanks!), the best way indeed is to create
different databases, one per language. The following example has been
inspired by Xavier’s proposal; it groups all files by their language
and adopts the language in the name of the database:

  for $path-group in file:children('input-dir')
  where ends-with($path-group, '.xml')
  (: the path is a string, so the document must be opened to read xml:lang :)
  group by $lang := (doc($path-group)//@xml:lang)[1]
  return db:create(
    'db-' || $lang,
    $path-group,
    (),
    map { 'ftindex': true(), 'language': $lang }
  )
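
Before creating anything, the grouping can be previewed with a read-only query. This is only a sketch: it assumes the same 'input-dir' directory as above, and the element names `group` and `file` are made up for the report.

```xquery
(: Read-only preview: which files would end up in which language database.
   Files without any xml:lang attribute are grouped under the empty string. :)
for $path in file:children('input-dir')
where ends-with($path, '.xml')
group by $lang := string((doc($path)//@xml:lang)[1])
return
  <group lang="{$lang}" count="{count($path)}">{
    $path ! <file>{.}</file>
  }</group>
```

A group with an empty lang attribute would point to files that need a default language before the databases are built.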

Hope this helps,
Christian

On Tue, Jun 27, 2017 at 5:19 PM, Xavier-Laurent SALVADOR
<xavierlaurent.salvador@gmail.com> wrote:
> Hi Kristian,
>
> This is useful for automatically creating databases according to the
> xml:lang attribute:
>
> let $dir := '/Users/me/myDesktop/'
> for $file in file:list($dir)[matches(., 'xml$')]
> let $flag := data(doc($dir || $file)/div/@xml:lang)
> return
>   (: the language is part of the name; a fixed name would be created repeatedly :)
>   db:create('DB-' || $flag, $dir || $file, (),
>     map { 'ftindex': true(), 'language': $flag })
>
> Or you can call ft:tokenize on your string in your query, passing
> map { 'language': $flag } as options.
>
> Hope I understood the problem :) Else return 'sorry'
>
> 2017-06-27 16:57 GMT+02:00 Kristian Kankainen <kristian@keeleleek.ee>:
>>
>> Hello
>>
>> I have documents with text in several languages. When creating a database
>> in BaseX I can choose *one* language for stemming for the full-text search
>> index. Is there a way BaseX could lemmatize according to each element's
>> xml:lang attribute?
>>
>> Best regards
>> Kristian K
>>

