Thanks, Fabrice!

I’ll work it out.

Kind regards,

Goetz

From: Fabrice Etanchaud [mailto:fetanchaud@questel.com]
Sent: Wednesday, 22 April 2015 11:32
To: Goetz Heller; basex-talk@mailman.uni-konstanz.de
Subject: RE: [basex-talk] multi-language full-text indexing

Great, Goetz!

One last thing:

If you need to rebuild the original document from parts, be sure to have a way to retrieve them all (by document path, attribute index, or separate index collection with node-id/pre values).

If disk space is not an issue, you could store the original document as it is, and create localized collections for full-text indexing purposes.

Hope this helps,
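To illustrate the "by document path" option, here is a minimal Python sketch of a naming convention under which every part of a document can be found from its identifier alone. The layout `<lang>/<doc_id>.xml`, the function names, and the identifier `EP1234` are assumptions made for this example; they are not taken from BaseX or from the setup described in this thread.

```python
from pathlib import PurePosixPath


def part_path(doc_id, lang):
    """Path of one localized part, using the assumed layout <lang>/<doc_id>.xml."""
    return str(PurePosixPath(lang.lower()) / f"{doc_id}.xml")


def all_part_paths(doc_id, langs):
    """Map each language code to the path of the matching part of doc_id."""
    return {lang: part_path(doc_id, lang) for lang in langs}


# Hypothetical document identifier and language list.
paths = all_part_paths("EP1234", ["DE", "EN"])
print(paths)
```

With a convention like this, rebuilding the original only needs the document identifier and the list of languages; no separate lookup index is required.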

Best regards,

Fabrice

From: basex-talk-bounces@mailman.uni-konstanz.de [mailto:basex-talk-bounces@mailman.uni-konstanz.de] On behalf of Goetz Heller
Sent: Wednesday, 22 April 2015 11:20
To: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] multi-language full-text indexing

Fabrice,

For the time being, this sounds quite nice. I'd split up the files into a common part and a set of "satellites", one satellite for each language present in the document.

Thanks!

Kind regards,

Goetz

From: Fabrice Etanchaud [mailto:fetanchaud@questel.com]
Sent: Wednesday, 22 April 2015 11:04
To: Goetz Heller; basex-talk@mailman.uni-konstanz.de
Subject: RE: [basex-talk] multi-language full-text indexing

Dear Goetz,

I have the same requirement (patent documents containing text in different languages).

I ended up splitting/filtering each original document into localized parts, which are inserted into different collections (each collection having its own full-text index configuration).

BaseX is as flexible as our data!
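As a rough illustration of the splitting step, here is a minimal Python sketch. It is not the code used in the setup described above; it assumes that localized elements carry a language attribute (named `LANG` here, as in the original question) and that elements without it are common to all languages. The function names and the sample document are invented for the example.

```python
import copy
import xml.etree.ElementTree as ET


def split_by_language(xml_text, lang_attr="LANG"):
    """Return {language code: XML string}, one localized copy per language."""
    root = ET.fromstring(xml_text)
    # Collect every language code that appears anywhere in the document.
    langs = sorted({el.get(lang_attr) for el in root.iter() if el.get(lang_attr)})
    parts = {}
    for lang in langs:
        part = copy.deepcopy(root)
        _prune(part, lang, lang_attr)
        parts[lang] = ET.tostring(part, encoding="unicode")
    return parts


def _prune(parent, lang, lang_attr):
    """Remove children tagged with another language; keep untagged ones."""
    for child in list(parent):
        code = child.get(lang_attr)
        if code is not None and code != lang:
            parent.remove(child)
        else:
            _prune(child, lang, lang_attr)


# Hypothetical sample document with two localized titles and one common element.
sample = (
    '<doc id="1"><title LANG="EN">Pump</title>'
    '<title LANG="DE">Pumpe</title><date>2015</date></doc>'
)
parts = split_by_language(sample)
for lang, xml in parts.items():
    print(lang, xml)
```

Each resulting string could then be added to the collection for its language. In this sketch the common elements are repeated in every part, which costs some space but keeps each part usable on its own.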

Best regards,

From: basex-talk-bounces@mailman.uni-konstanz.de [mailto:basex-talk-bounces@mailman.uni-konstanz.de] On behalf of Goetz Heller
Sent: Wednesday, 22 April 2015 10:50
To: basex-talk@mailman.uni-konstanz.de
Subject: [basex-talk] multi-language full-text indexing

I’m working with documents destined to be consumed anywhere in the European Community. Many of them contain the same tags multiple times, each with a different language attribute. It therefore does not make sense to create a single full-text index over these documents as a whole. It would be desirable to have documents indexed by locale-specific parts, e.g.

CREATE FULL-TEXT INDEX ON DATABASE XY STARTING WITH (
    (path_a)/LOCALIZED_PART_A[@LANG=$lang],
    (path_b)/LOCALIZED_PART_B[@LG=$lang],…
) FOR LANGUAGE $lang IN (
    BG,
    DN,
    DE WITH STOPWORDS filepath_de WITH STEM = YES,
    EN WITH STOPWORDS filepath_en,
    FR, …
) [USING language_code_map]

and then to write full-text retrieval queries with a clause such as ‘FOR LANGUAGE BG’, for example. The index parts would be much smaller, and full-text retrieval therefore much faster. The language_code_map file would map the language codes used in the documents to standard values recognized by BaseX.

Are there any efforts towards such a feature?