Christian,
That is helpful. Basically you've confirmed my initial analysis: because BaseX databases are lightweight, keeping things simple is the most appropriate choice.
If I were doing things at scale I would of course do performance testing to see where the bottlenecks are, but that is not a concern for what I'm doing now.
Cheers,
E.
—————
Eliot Kimber, Owner
Contrext, LLC
http://contrext.com
On 5/16/15, 5:10 AM, "Christian Grün" christian.gruen@gmail.com wrote:
Hi Eliot,
As usual, there is no simple answer to such a question. However, I can say that using one BaseX database per git repository sounds like a good choice. In contrast to many other DBMSs, databases in BaseX are pretty lightweight containers; in some of our own use cases we even create one database per document.
If you have hundreds or thousands of databases, it may be reasonable to merge them into single units, because it may take too much time to access the database directories in the file system. Some file systems are better than others at handling large numbers of files and directories at the same level. The same observation applies if you frequently write queries that access more than one database: it's always faster to open a single database (but usually you will only notice this when opening a larger number of databases).
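To illustrate the trade-off Christian describes, here is a small XQuery sketch (the database names are hypothetical). A query that spans several per-branch databases has to open each one, whereas a single combined database is opened once and then restricted by directory path, using the two-argument form of db:open:

```xquery
(: several per-branch databases: each db:open call opens another database :)
for $doc in (db:open('repo1-master'), db:open('repo1-develop'))//topic
return $doc/@id

(: one combined database, restricted to a branch directory by path :)
for $doc in db:open('repos', 'repo1/master')//topic
return $doc/@id
```

Both queries return the same kind of result; the difference only becomes noticeable when many databases have to be opened per query.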
Hope this helps, Christian
On Thu, May 14, 2015 at 3:57 PM, Eliot Kimber ekimber@contrext.com wrote:
In the discussion of adding metadata to a bunch of files, Christian points out that you can either limit queries to directories within a single database or apply a query across multiple databases.
My question: when or why would you prefer one approach over the other?
In my case I'm using BaseX to reflect the XML contents of git repositories. My current approach is to create a separate database for each repo/branch pair, my reasoning being that this makes it easiest to limit queries to just that branch. Because the BaseX data is intended to be a read-only reflection of the git-managed source, it also makes it easy to clear the data for a branch if it has gotten out of sync (or I suspect it has) by simply dropping the database.
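The resync step described above could be sketched with BaseX's database functions (the database name and checkout path here are hypothetical; db:drop and db:create are run as two separate updates):

```xquery
(: 1) drop the possibly stale database for the out-of-sync branch :)
db:drop('repo1-master')

(: 2) then, in a second query, rebuild it from the git checkout :)
db:create('repo1-master', '/path/to/checkouts/repo1/master')
```

Because the database is a disposable read-only mirror, dropping and rebuilding it is always safe; git remains the system of record.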
I have complete control over the queries (through a library of functions that understand the git nature of the databases), so I could just as easily use a single database with subdirectories that reflect the repos and branches.
In this scenario, as an example, is there any compelling reason to use one approach or the other?
I like having one database per branch because that seems like a natural mapping that generally keeps things simple and more or less obvious (e.g., doing "list" will show the list of databases, which reflect the repo and branch names in their names).
In this application the scale will usually be relatively small: thousands or tens of thousands of individual documents in any given branch. However, the querying and indexing, which support maintaining knowledge of the links within the XML content, could get intense.
Cheers,
Eliot
—————
Eliot Kimber, Owner
Contrext, LLC
http://contrext.com