Christian,
Thanks again for this! I still haven't had a chance to do any rigorous testing, but in my initial use everything looks great (and all my unit tests still pass). I think more importantly, this resolves an undocumented inconsistency for users of the low-level API - with the guarantee that the database will behave the same way regardless of state, a lot of conditional code can now be removed. Proof that good things come to those who wait :).
Dave
-----Original message----- From: "Christian Grün" christian.gruen@gmail.com To: Dave Glick dglick@dracorp.com Cc: BaseX basex-talk@mailman.uni-konstanz.de Sent: Tue, Jul 3, 2012 00:13:58 GMT+00:00 Subject: Re: [basex-talk] Empty Initial Document Inconsistencies
Hi Dave,
here's finally some public feedback for your outstanding feature request: I've rewritten our internal database structures, such that an empty database will now really be empty (i.e., contain no dummy document node anymore). As a result, some convenience methods have disappeared from the Data class (Data.isEmpty(), Data.single()), as they were only offered to hide the former inconsistency.
The changes have now been merged into the BaseX master branch. While I believe that we have tested the changes quite well, I am always glad for any feedback.
Christian ___________________________
On Thu, Dec 29, 2011 at 4:36 PM, Dave Glick dglick@dracorp.com wrote:
Hello again,
I've noticed some inconsistencies with regard to the initial document in an empty database. I understand that it is often treated as an indication that the database is "empty". I.e., if the database has only one node and it's a document at the first pre value, then Data.empty() returns true. The problem arises because the database isn't actually empty - it contains one empty document - and some commands/methods view it as an empty database while others treat it as a database with one empty document. Several examples:
- The method Data.doc(name) will return -1 (indicating the document doesn't exist) while Data.docs() will return an array of size 1 with the first value a 0 (indicating there is one document at pre value 0). This is actually my own problem because the API I'm writing relies on these two methods being internally consistent - I can't have one telling me there is one document called "XYZ" but then have the other refuse to give me a pre value for the "XYZ" document I was just told exists.
- The GUI shows a single doc node in the tree view but issuing the command "list [database name]" shows the database as having no resources.
- I can evaluate the query "insert node 'text' into doc('[initial document name]')" and it executes. I can see the new text node under the initial document in the GUI tree view. Now when I issue the command "list [database name]" I see 1 resource. Having the list command report different numbers of resources before and after an XQuery insertion seems odd.
- If I issue the command "ADD TO newdoc <xyz/>", I can see the new doc("newdoc") node with the child <xyz> element in the GUI but the previously empty document is now gone. This also seems odd - I'm not sure how I feel about an ADD command removing data from the database (even if it was just intended to be a placeholder).
In the end, I think BaseX should support the notion of a database with one empty document as being valid - there are probably cases where one might want to start their session in that state. I'm not sure what the solution is or should be (or if it's even something that needs solving). My own preference would be for the internal indication of an empty database to use a different node kind dedicated to that purpose, so there is no confusion over whether the single node at pre 0 is an empty initial document or an indication of an empty database.
At the very least, I think the inconsistencies above should be corrected - if a single empty document node is intended to signify an empty database and is not actually intended to be part of the database, then you should not see it in the GUI, not be allowed to insert content to it, etc.
Dave
BaseX-Talk mailing list BaseX-Talk@mailman.uni-konstanz.de https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk
Hello,
this is a general question as to whether in a given scenario BaseX might be an appropriate instrument.
Approximately 1000 XML log messages per second must be stored and thus made available for querying. The messages are expected to be <= 1 MB each.
The log messages may be sent by any number of clients simultaneously. The clients are probably not able to specify unique document URIs, as they are working independently of each other. So the database would have to create unique URIs - perhaps concatenating a semantic part supplied by the client and a generated unique identifier - as part of the storage processing.
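For illustration, the URI scheme sketched above could look like this on the storage side (a hypothetical helper; the function name and path layout are my assumptions, not a BaseX API):

```python
import uuid

def unique_doc_uri(semantic_part: str) -> str:
    """Concatenate a client-supplied semantic part with a generated
    unique identifier to obtain a collision-free document URI."""
    return f"{semantic_part}/{uuid.uuid4().hex}"

# Two messages with the same semantic part still receive distinct URIs.
first = unique_doc_uri("orders/2012-07-03")
second = unique_doc_uri("orders/2012-07-03")
assert first != second and first.startswith("orders/2012-07-03/")
```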
Would this be possible already now? If not, perhaps in the near future?
Thank you, kind regards,
Hans-Juergen
Hello Hans-Juergen,
here are some details about my use case, which is similar to yours. I'm using BaseX to insert the live public Twitter Stream into databases (see Wiki Entry [1]).
One Twitter message is around 4 KB in size, and I'm able to insert about 2000 of them per second using single XQuery Update inserts, so that would probably work out for you, too. If you use bulk inserts - caching the items in an item list and running one XQuery Update for all of them - the insert rate would increase further.
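Such a bulk insert could be sketched on the client side as follows (a sketch only: the database name 'logs', the root element name, and the helper are my assumptions; the idea is to build one XQuery Update string for a whole batch instead of one update per item):

```python
def bulk_insert_query(items: list, db: str = "logs") -> str:
    """Build a single XQuery Update statement that inserts all cached
    items in one transaction rather than one update per item."""
    sequence = ", ".join(items)
    return f"insert nodes ({sequence}) into db:open('{db}')/root"

cached = ["<msg id='1'/>", "<msg id='2'/>", "<msg id='3'/>"]
query = bulk_insert_query(cached)
# One update statement now covers all three cached items.
```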
thus made available for querying
This could be a bigger problem, because as long as you are writing items into the database (which will never stop in your use case), the readers are blocked. And while one of your readers is running, the writers are blocked.
Hope this helps, Andreas
[1] http://docs.basex.org/wiki/Twitter
On 3 Jul 2012, at 15:30, Hans-Juergen Rennau wrote:
Hello Andreas,
thank you very much for this information! Indeed, the use cases are similar.
I am trying to understand how exactly you store the messages. The Wiki says: "the initial database just contained a root node <tweets/>". So my understanding is that the messages are inserted as child elements into this root element - and the end result is one document with one root element and millions of child elements representing the individual messages, yes? Therefore you do not have to come up with URIs, as there is only one single document. A monster document - but I conclude from your approach that this is no problem, and no worse (or even better) than having a million individual, small documents. Is that correct - would you recommend storing the messages in one single document?
If the loading process cannot run concurrently with queries - would there be any way to periodically "shift" packages of messages into a "read only" database? Or, perhaps better, the other way around: let the server periodically interrupt its loading activity, close the database, rename it, open and initialize a new database, and then continue to load? Or is there presently simply no solution available?
Kind regards, Hans-Juergen
________________________________ From: Andreas Weiler andreas.weiler@uni-konstanz.de To: Hans-Juergen Rennau hrennau@yahoo.de CC: "basex-talk@mailman.uni-konstanz.de" basex-talk@mailman.uni-konstanz.de Sent: 15:51 Tuesday, 3 July 2012 Subject: Re: [basex-talk] BaseX as a log msg store?
Hello Hans-Juergen,
So my understanding is that the messages are inserted as child elements into this root element - and the end result is one document with one root element and millions of child elements representing the individual messages, yes?
Yes, that is correct: I have one root element at the beginning and insert the incoming items as child nodes of the root.
Therefore you do not have to come up with URIs, as there is only one single document. A monster document, but I conclude from your approach that this is no problem, and not worse (or even better) than having a million individual, small documents. Is it correct - would you recommend to store the messages in one single document?
In my use case, tweets have unique id attributes, so I don't need any URIs to identify them. It would probably be a good idea if you described your further querying process, so it is easier to understand what you want to do.
If the loading process cannot concur with queries - would there be any way how one could periodically "shift" packages of messages into a "read only" database? Or perhaps better the other way around, let the server periodically interrupt its loading activity, close the database, rename it, open and initialize a new base and then continue to load? Or is there presently simply no solution available?
That's exactly what I do after each hour: I rename the current db with the current date_hour and create a new database for the next incoming items. Shifting is not really an alternative, because it would probably take too long to insert the items into a second database and delete them from the "main" database.
Kind regards, Andreas
On 3 Jul 2012, at 23:58, Hans-Juergen Rennau wrote:
Hello Andreas,
was this database renaming+creation triggered by the client or autonomously done by the server? The latter alternative would require some scheduling feature, so that the database server could be configured to perform such actions according to a schedule. I think this would be a very desirable feature, especially as long as high frequency storage and querying cannot be done well concurrently. Anyway, do you see a way to implement this periodic renaming+creation behaviour in a multi-client environment?
Concerning the querying tasks to be expected I know very little myself - except for the need to group messages by transaction IDs and then somehow evaluate the execution of transactions. But at the moment I suppose that your "monster document" approach might be appropriate in our case, too.
Kind regards, Hans-Juergen
________________________________ From: Andreas Weiler andreas.weiler@uni-konstanz.de To: Hans-Juergen Rennau hrennau@yahoo.de CC: Base X basex-talk@mailman.uni-konstanz.de Sent: 9:45 Wednesday, 4 July 2012 Subject: Re: [basex-talk] BaseX as a log msg store?
Hello Hans-Juergen,
was this database renaming+creation triggered by the client or autonomously done by the server?
I have a kind of "manager" client running, which performs the database renaming and creation every hour. Since we have single writer transactions, all other transactions are blocked during that time. So the next client transactions will automatically use the "new" database.
I have it running like this:
create db tmp <root/>
.... insert items ....
close
alter db tmp date_hour
create db tmp <root/>
.... insert items ....
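A "manager" client could generate that hourly command sequence along these lines (a sketch only; the database name tmp follows the example above, and the archive naming pattern is my assumption):

```python
from datetime import datetime

def rollover_commands(now=None):
    """Return the BaseX command sequence that closes the current
    database, renames it after the current date and hour, and
    creates a fresh one for the next incoming items."""
    now = now or datetime.now()
    archive = now.strftime("tmp_%Y-%m-%d_%H")
    return ["CLOSE", f"ALTER DB tmp {archive}", "CREATE DB tmp"]

cmds = rollover_commands(datetime(2012, 7, 5, 9))
# cmds[1] == "ALTER DB tmp tmp_2012-07-05_09"
```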
Concerning the querying tasks to be expected I know very little myself - except for the need to group messages by transaction IDs and then somehow evaluate the execution of transactions. But at the moment I suppose that your "monster document" approach might be appropriate in our case, too.
I guess a "monster document" will be fine. However, I would not create too large a database, because you probably need to create indexes afterwards, and that could take a long time for *very* large databases.
Let me know if you need more information, Andreas
On 4 Jul 2012, at 19:54, Hans-Juergen Rennau wrote:
Hello Andreas, cordial thanks for the additional information. To summarize my understanding: a "rolling database", periodically renamed and replaced by a newly created database, is feasible in a multi-client environment when one provides a "manager client" triggering these actions. I shall gladly come back for more information should the need arise.
Cheers, Hans-Juergen
________________________________ From: Andreas Weiler andreas.weiler@uni-konstanz.de To: Hans-Juergen Rennau hrennau@yahoo.de CC: Base X basex-talk@mailman.uni-konstanz.de Sent: 8:57 Thursday, 5 July 2012 Subject: Re: [basex-talk] BaseX as a log msg store?
Hello Hans-Juergen,
that's correct - have fun with your project.
-- Andreas
On 5 Jul 2012, at 09:47, Hans-Juergen Rennau wrote:
Hello, it seems to me that the functions db:open() and collection() are equivalent.
db:open - in spite of its name - does not change any state, but just returns the document nodes contained in a database, which is what collection() does (in BaseX). The only differences are differences of parameter style: (a) a database path (if any) is an extra parameter in db:open(), whereas with the collection() function it is appended to the database name (e.g. collection('db/data')); and (b) db:open() lets one specify the database by name or by node, whereas with collection() it is (of course) always by name only.
If something is wrong with this summary, I would appreciate a correction.
Kind regards, Hans-Juergen
Hi Hans-Jürgen,
it seems to me that function db:open() and collection() are equivalent. [...]
the main difference is that the collection() and doc() functions can also access local resources: you may as well use them to create a temporary main-memory collection from all files found at the given local file path [1]. In other words, db:open() will be a little bit more efficient, and cause no surprises, if you know that your files are stored in a database.
Hope this helps, Christian
Christian, thank you very much - this difference is what I missed, and it is really important.
So - doc() and collection() work in two steps:
(a) try to interpret the argument URI as a "database resource URI" (defined to be the concatenation of database name and database path), locating a database document or directory
(b) if no database documents were located (that is, the argument is not a valid database resource URI): interpret the URI as a file:// URI, locating a file or directory.
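The two-step lookup can be modeled with a small sketch (a simplified illustration of the resolution order only, not BaseX's actual implementation; the in-memory database map and file-system check are my assumptions):

```python
import os

def resolve_collection(uri, databases):
    """Model the resolution order: (a) interpret the URI as a database
    resource URI 'dbname/path'; (b) fall back to the file system."""
    db, _, path = uri.partition("/")
    if db in databases:                # step (a): database resource URI
        docs = databases[db]
        if not path:
            return list(docs)          # whole database
        return [d for d in docs if d == path or d.startswith(path + "/")]
    if os.path.exists(uri):            # step (b): local file or directory
        return [uri]
    raise ValueError(f"no database resource or file found for {uri!r}")

dbs = {"db": ["data/a.xml", "data/b.xml", "other.xml"]}
# resolve_collection("db/data", dbs) -> the two documents under data/
# resolve_collection("db", dbs)      -> all three documents
```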
Concerning the default collection, I only just now discovered the difference between the function db:open and the command <open name="...">: the command changes the dynamic context for a subsequent query, making the documents of the opened db the default collection - whereas db:open does not change the dynamic context. (Well - otherwise it would not be side-effect free.)
One more question: Does the open-command change the dynamic context of subsequent queries in any other way except for setting the default collection?
Cheers, Hans-Juergen
________________________________ From: Christian Grün christian.gruen@gmail.com To: Hans-Juergen Rennau hrennau@yahoo.de CC: Base X basex-talk@mailman.uni-konstanz.de Sent: 4:07 Saturday, 7 July 2012 Subject: Re: [basex-talk] db:open and collection()