Fabrice,

For the time being, this sounds quite nice. I’d have to split up the files into a common part and a set of “satellites”, one satellite for each language present in the document.
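Something along the following lines might do the per-language extraction (an untested XQuery sketch; the database names, the LOCALIZED_PART element and the @LANG attribute are only placeholders, not the actual schema):

(: build one language "satellite" per source document :)
declare variable $lang external := 'de';

for $doc in db:open('SOURCE')
let $path := db:path($doc)
return db:add(
  'SATELLITE_' || upper-case($lang),        (: target database for this language :)
  <satellite src="{$path}" xml:lang="{$lang}">{
    $doc//LOCALIZED_PART[@LANG = $lang]     (: keep only this language's parts :)
  }</satellite>,
  $path                                     (: keep the original document path :)
)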

 

Thanks!

 

Kind regards,

 

Goetz

 

From: Fabrice Etanchaud [mailto:fetanchaud@questel.com]
Sent: Wednesday, 22 April 2015 11:04
To: Goetz Heller; basex-talk@mailman.uni-konstanz.de
Subject: RE: [basex-talk] multi-language full-text indexing

 

Dear Goetz,

 

I have the same requirement (patent documents containing text in different languages).

I ended up splitting/filtering each original document into localized parts, inserted into different collections (each collection having its own full-text index configuration).

BaseX is as flexible as our data!
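If it helps, the per-language collections can be created with their own full-text configuration directly from XQuery, roughly like this (untested sketch; database names and option values are only examples):

(: one database per language, each with its own full-text index configuration :)
for $lang in ('de', 'en', 'fr')
return db:create(
  'PATENTS_' || upper-case($lang),
  (), (),                  (: the localized parts are added later, e.g. with db:add :)
  map {
    'ftindex'  : true(),   (: build a full-text index for this database :)
    'language' : $lang,    (: language used by the full-text tokenizer :)
    'stemming' : true()    (: stemming and stopword settings can differ per language :)
  }
)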

 

Best regards,

 

 

From: basex-talk-bounces@mailman.uni-konstanz.de [mailto:basex-talk-bounces@mailman.uni-konstanz.de] On behalf of Goetz Heller
Sent: Wednesday, 22 April 2015 10:50
To: basex-talk@mailman.uni-konstanz.de
Subject: [basex-talk] multi-language full-text indexing

 

I’m working with documents destined to be consumed anywhere in the European Community. Many of them contain the same tags several times, but with different language attributes. It therefore makes no sense to build a single full-text index over these documents as a whole. Instead, it would be desirable to index the documents by their locale-specific parts, e.g.

 

CREATE FULL-TEXT INDEX ON DATABASE XY STARTING WITH (
  (path_a)/LOCALIZED_PART_A[@LANG=$lang],
  (path_b)/LOCALIZED_PART_B[@LG=$lang], …
) FOR LANGUAGE $lang IN (
  BG,
  DN,
  DE WITH STOPWORDS filepath_de WITH STEM = YES,
  EN WITH STOPWORDS filepath_en,
  FR, …
) [USING language_code_map]

and then to write full-text retrieval queries with a clause such as ‘FOR LANGUAGE BG’. The index parts would be much smaller, and full-text retrieval therefore much faster. In the language_code_map file, the language codes used in the documents would be mapped to standard values recognized by BaseX.
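At the moment, the closest I can get is to restrict both the path and the match options by hand in every query, for instance (only a sketch; the database name and element names are placeholders):

(: full-text search restricted to the German parts of the documents :)
declare variable $term external := 'some term';

for $part in db:open('XY')//LOCALIZED_PART_A[@LANG = 'de']
where $part contains text { $term } using language 'de' using stemming
return db:path($part)

A dedicated clause would keep these settings in one place instead of repeating them in every query.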

Are there any efforts towards such a feature?