Hi all,
BaseX continues to impress. I had been under the (false) impression that a single database should contain all the data/files relevant to one particular application.
I have discovered that this kind of query, which crosses/joins data across databases, works:
for $c in basex:db("chaingang")//person[@key="ccc86809"]
let $links := basex:db("fasdb")//linkGrp[link=$c/@target]
return ($c, $links, basex:db("fasdb")//person[@key=($links//link/text())])
which gives me (as desired):
<person isVdl="yes" target="ai14506" type="ccc" key="ccc86809">
  <sources>
    <ccc id="ccc86809" n="86809">
      <isVDL>yes</isVDL>
      <aiRef>14506</aiRef>
      <drupalNodeId>86809</drupalNodeId>
    </ccc>
  </sources>
</person>
<linkGrp id="L016612" size="3" c31a="1" dlm="1" ai="1">
  <link type="dlm">dlm18216038</link>
  <link type="ai">ai14506</link>
  <link type="c31a">c31a31070390</link>
</linkGrp>
<person key= ... the 3 person records specified in the linkGrp.
Wow! I had no idea this kind of joining across BaseX databases was doable. Performance is great, about 64 ms. It means I can think about refactoring my rather large single db with 30 documents into far smaller, more manageable chunks which become updatable (i.e. the overhead of optimizing becomes tolerable).
Does anyone have any comments about this, and the pros and cons? Perhaps I should be thinking of a database for each record type/major document instead of "one big database". I can see a downside in that my queries get locked into an implementation-specific syntax, but I am so pleased with what BaseX full-text querying is giving me, and the general performance and clean design, that with it being open source, I'm happy to wear this risk.
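To make that concrete (a sketch only; the database names here are invented), the same query might then become:

for $c in basex:db("persons")//person[@key="ccc86809"]
let $links := basex:db("linkgrps")//linkGrp[link=$c/@target]
return ($c, $links, basex:db("persons")//person[@key=($links//link/text())])

i.e. one database per record type, with the join crossing databases exactly as before.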
Thoughts anyone?
Hoping this helps someone else discover what this marvellous piece of software is capable of.
Thanks again to the developers -- bravo!!
Cheers,
Sandra
Hi Sandra,
for $c in basex:db("chaingang")//person[@key="ccc86809"]
let $links := basex:db("fasdb")//linkGrp[link=$c/@target]
return ($c, $links, basex:db("fasdb")//person[@key=($links//link/text())])
True – it's no problem to access several databases within one query. And it should be no problem to split up your db into several smaller ones. It might even become mandatory if you reach the database limits; see the documentation for the current limits.
The only drawback you might come across is that the filesystem will cause trouble if too many open files have to be managed at the same time. 30 databases should be completely OK, though.
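As a rough sketch of how splitting up need not complicate your queries (the database names below are only placeholders), a query can simply iterate over a sequence of database names:

for $db in ("persons", "places", "events")
return basex:db($db)//person[@key = "ccc86809"]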
Some questions, just out of curiosity:
– how much XML data (MB/GB) do you currently work with?
– how much time does BaseX need in your context for update operations and the optimize command?
– have you already encountered performance limits in everyday use, or are you rather trying to prevent potential bottlenecks in the future?
In another real-life scenario that might be similar to yours, BaseX is used as the backend for a library database with 2 million titles (~1 GB of XML data). The process of updating the data and recreating the indexes is applied once a day/night and takes approx. 2 minutes.
Thanks again to the developers -- bravo!!
…and thanks for always giving instructive feedback! Christian
Hi Sandra,
[snip]
Some questions, just out of curiosity:
– how much XML data (MB/GB) do you currently work with?
Currently my largest db looks like this:

Size: 4012 MB
Nodes: 61552395
Height: 8
Input Size: 913 MB
Encoding: UTF-8
Documents: 30
Whitespace Chopping: ON
Entity Parsing: OFF

plus all indexes.
It's kind of a "pigpen" experimental database -- no design at all. I am currently putting together a smaller, properly designed "production" db intended for online public queries:
Size: 1067 MB
Nodes: 36282594
Documents: 23
I would anticipate the volume of data for researchers growing to 2-4 times that size over the next few years, with less growth on the public database.
– how much time does BaseX need in your context for update operations and the optimize command?
I've barely touched XQuery Update. I load up as needed from files, and add/delete documents as I need to. Optimising indexes takes a few minutes. The big problem with this is that online queries stop while this happens. We are transitioning from dev to production, so I have to solve this problem. I am thinking of doing nightly updates/rebuilds/reindexing on a separate VM (for big updates), then just pushing the BaseXData/db dir via rsync and stopping/restarting the query BaseX server with the new database in place of the old. That should not cause any disruption to users. I will use XQuery Update for the small volume of daily updates and have these update a separate small database which won't need indexes. Hence my excitement at being able to integrate these little changes into the larger online query database results.
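For what it's worth, here is roughly the shape of it (a sketch only; "daily" and its <additions> root element are invented names for the small delta database). The daily change is applied with XQuery Update:

insert node
  <person key="ccc99999" type="ccc"/>
into basex:db("daily")/additions

and the online query then unions results from the big read-only db and the delta db:

(basex:db("fasdb")//person[@key = "ccc99999"],
 basex:db("daily")//person[@key = "ccc99999"])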
– have you already encountered performance limits in everyday use,
Performance is very, very good with well-written online queries. Within a few weeks our search interface will go public and you can see for yourself -- I will let you know.
However, I have encountered rather bad problems with long and complex queries with a lot of output. With BaseX these jobs would just kind of "hang". I think there was an issue, as I recall, with its serialiser -- no output at all until the query completed, so I was running out of RAM. Saxon serialisation proved much better for this kind of work because I could track the progress of these jobs by tailing the output as they ran. My apologies for not reporting this at the time (2-3 months back); I should have done so. Since then I have noticed some BaseX changes in serialiser options, but I have not tried again to see if this problem is solved. It was very unfortunate, because I could not use the fuzzy/ft querying in these large person-name matching jobs, but I do make this facility available to our researchers in online queries.
I am struggling with a very strange memory bug right now when loading, but I suspect it could be an issue with my Perl client -- let me report that to you separately from this reply.
or are you rather trying to prevent potential bottlenecks in the future?
I like to prevent!
In another real-life scenario that might be similar to yours, BaseX is used as the backend for a library database with 2 million titles (~1 GB of XML data). The process of updating the data and recreating the indexes is applied once a day/night and takes approx. 2 minutes.
This also aligns with my experience. A complete reload/index build is about 3 minutes for us on an 8 GB VM on a Dell server. I give the JVM ~4 GB via -Xms and -Xmx. Not terribly painful, but as I mention above, I will need to pull a few tricks when we are running a public online search -- a 3-minute outage for updating is unacceptable.
…and thanks for always giving instructive feedback!
Hope this helps.
Sandra
Sandra,
sorry for the late feedback, and thanks for all details on your setup.
I am thinking of doing nightly updates/rebuilds/reindexing on a separate VM (for big updates), then just pushing the BaseXData/db dir via rsync and stopping/restarting the query BaseX server with the new database in place of the old.
Yes, that's a realistic scenario – and there are enough other BaseX use cases with similar challenges that we've already developed (still vague) plans to do something similar within BaseX. Snapshots of the database could be used for read-only queries while the updates are performed on the main database – and the snapshot is refreshed after the updates have been finalized. Still work in progress, though.
Performance is very, very good with well-written online queries. Within a few weeks our search interface will go public and you can see for yourself -- I will let you know.
…nice.
However, I have encountered rather bad problems with long and complex queries with a lot of output. With BaseX these jobs would just kind of "hang".
If you come across these again, just tell us.
This also aligns with my experience. A complete reload/index build is about 3 minutes for us on an 8 GB VM on a Dell server. I give the JVM ~4 GB via -Xms and -Xmx. Not terribly painful, but as I mention above, I will need to pull a few tricks when we are running a public online search -- a 3-minute outage for updating is unacceptable.
No doubt. A mirror of the database could indeed be the best solution, and as soon as our transactions are based on databases – and not processes – one BaseX instance will suffice to update one db while serving read-only queries on the other. We'll keep you informed (but it might take some time; the todo list is still long enough to keep us busy).
Best, Christian