Hi Bram,
You are welcome. BaseX is such a versatile tool!
Keeping index and data content separate could help in various ways. The main one is that, since you still have to index your index database, you will index only the data you need, and you might even set autoupdate to forget about index management. As your data seems to be 'append only', you should build your index on 'pre' node values instead of 'id' node values, in order to gain direct access to the data.
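As a rough sketch of the 'pre' approach, assuming a content database named 'content' with 'entry' elements and an index database named 'content-index' (all three names are hypothetical, as is the computed key), the index could be built like this:

```xquery
(: Hypothetical names: database 'content', element 'entry', index database 'content-index'. :)
let $hits :=
  for $node in db:open('content')//entry
  (: Stand-in for whatever computed value you want to look up later. :)
  let $key := string-length(string($node))
  (: Store the pre value of the source node next to the computed key. :)
  return <hit key="{ $key }" pre="{ db:node-pre($node) }"/>
return db:create('content-index', <index>{ $hits }</index>, 'index.xml')
```

A lookup then goes through the index database first and jumps straight to the original nodes by their pre values:

```xquery
(: '42' is an arbitrary example key. :)
for $hit in db:open('content-index')//hit[@key = '42']
return db:open-pre('content', xs:integer($hit/@pre))
```

This only stays valid as long as the stored pre values still point at the same nodes, which is why it fits static or append-only data.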
When working on XML documents, I try to replace 'database' with 'collection', and that seems to help in making effective decisions.
Best regards,
Fabrice Etanchaud
[1] http://docs.basex.org/wiki/Node_Storage [2] http://docs.basex.org/wiki/Database_Module#db:node-pre [3] http://docs.basex.org/wiki/Database_Module#db:open-pre
From: Bram Vanroy [mailto:Bram.Vanroy@UGent.be] Sent: Thursday, 14 June 2018 13:48 To: Fabrice ETANCHAUD Subject: RE: Usage of doc's in BaseX
Hi Fabrice
Thanks for the reply.
The values that I am trying to index are indeed computed from the original XML but not present in it. Thanks for the limitations link. I had seen that page before but couldn't find it again when googling for 'basex limitations' or similar terms.
Considering that my databases will be static and don't need updating, would you still argue for a separation?
Thanks again
Bram
From: BaseX-Talk basex-talk-bounces@mailman.uni-konstanz.de On Behalf Of Fabrice ETANCHAUD Sent: Thursday, 14 June 2018 11:18 To: BaseX basex-talk@mailman.uni-konstanz.de Subject: Re: [basex-talk] Usage of doc's in BaseX
Hello Bram,
IMHO the main argument for data/index separation is the ease of recreating the index, and the ease of re-indexing your index database. Is there still a need for ad hoc indexing, now that BaseX lets us index only a selection of node names? I guess you need to index computed values?
For current BaseX limitations, you will find them in [1], but you might have already read that page. I once hit the database node number limit while working with the European Patent Office DOCDB collection, so I had to set up a database naming policy to distribute the documents across several databases.
Hoping it helps,
Best regards,
Fabrice Etanchaud Senior Data Specialist CERFrance PCH
[1] http://docs.basex.org/wiki/Statistics
From: BaseX-Talk [mailto:basex-talk-bounces@mailman.uni-konstanz.de] On Behalf Of Bram Vanroy Sent: Thursday, 14 June 2018 10:47 To: BaseX Subject: [basex-talk] Usage of doc's in BaseX
Dear BaseX team
I am planning an update to our previous custom indexing system [1]. But to do this I have a couple of questions. The major ones will be how to write an efficient custom indexing query in XQuery, but that'll be for another email. (In fact, we have a dual indexing system, so two index files per main file.) For now I am mainly interested in different documents in a single database, and the doc() functionality.
Intuitively, I'd say that documents that are related to each other should be put in the same database. E.g. one database with different documents for plants, and one database with different documents for animals. But when I was scrolling through the documentation of BaseX, I noticed that when creating custom indices you do not put those in the same db as the original content, so you have one database for the content and one for the index [2]. Is this the way it's typically done?
More generally, the questions that I have are the following:
* What is the actual difference in BaseX between using separate documents in a single database, or using different databases altogether?
* Is there a performance difference when I would put my index file in the same database as the content, vs. when using different databases altogether?
* What is the max allowed size for a document in a database and a database itself respectively? (I have files that are 100's of GB in size. It might not be plausible to have a file and its index file in the same database.)
Thank you in advance. Kind regards,
Bram Vanroy Doctoral Research at Ghent University, Belgium https://www.lt3.ugent.be/people/bram-vanroy/
[1] https://biblio.ugent.be/publication/8534144 [2] http://docs.basex.org/wiki/Indexes#Custom_Index_Structures