Hello BaseX Team,
In reading the documentation and mail archives, it seems to me that BaseX does not support documents in terms of having a single URI referring to a single document node. For example, if I add a document using the db:add function:
db:add("DB", document { <a/> }, "doc.xml")
And subsequently attempt to retrieve it using the doc function:
fn:doc("doc.xml")
I get a FODC0002 error. This is discussed in the mail archives, and so far as I can tell this is because BaseX only works with collections. It does not really address the concept of a document (excuse me if I am oversimplifying the issue here). In other words, I am free to add more document nodes under the name "doc.xml", so it is technically a collection as opposed to a document. Is this correct?
A fairly early mailing list thread from 2010-07 addresses a similar question and suggests an alternative approach where one might retrieve a document in a collection as follows:
for $doc in collection('DB')
where matches(document-uri($doc), 'doc.xml')
return $doc
But doesn't this run afoul of the W3C recommendation, which states for the fn:document-uri function that:
"In the case of a document node $D returned by the fn:doc function, or a document node at the root of a tree containing a node returned by the fn:collection function, it will always be true that either fn:document-uri($D) returns the empty sequence, or that the following expression is true: fn:doc(fn:document-uri($D)) is $D. It is implementation-defined whether this guarantee also holds for document nodes obtained by other means, for example a document node passed as the initial context node of a query or transformation."
In other words, if document-uri is returning 'doc.xml' for a node in collection('DB') then I should be able to get that same document node using the doc function.
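As a minimal sketch, the guarantee quoted above can be written as a check over every document in the collection. The database name "DB" is the one from the example; given the FODC0002 error described earlier, the doc() call may raise an error here instead of returning true.

```xquery
(: Sketch of the invariant from the recommendation, assuming a database named "DB". :)
for $d in collection("DB")
let $uri := document-uri($d)
(: Either the document has no URI, or doc() must return the identical node. :)
return empty($uri) or (doc($uri) is $d)
```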
I ask these questions because we are interested in being able to maintain XML artifacts as singular documents, and in using the collection hierarchy to do things such as storing temporary and archived versions of these artifacts. The approach we are taking right now assumes that there is only one document-node under each 'document'. I want to understand the limitations of the collection and document implementation in BaseX so that I don't make any wrong assumptions (as I think I already have).
Finally, assuming the interpretations are correct, are there plans to make any changes in the implementation to support something more in line with the hierarchy of db->collection->document? I have seen in previous threads that you were seeking feedback on this, and I am curious to know if you still are. Is there a major drawback you see to this model that perhaps I have not considered? Any feedback is appreciated.
Thanks,
Jack Gager
Metadata Technology
Dear Jack,
it's true that the fn:doc() function cannot be used in BaseX to access documents in a database - except for the special case that the database is a disk-based representation of a single document. It is possible, however, to access single documents in a database by specifying a full path as fn:collection() argument or adding a second argument to the db:open() function [1]:
- collection("DB/doc.xml")
- db:open("DB", "doc.xml")
While the first query will potentially try to also locate the disk in the local file system, the second one restricts the access to documents stored in databases.
As XQuery was never focused on databases, it is sometimes tricky to do justice to all the specification details, in particular because some of the features are implementation defined while others are not. Regarding the details on the document-uri() function (..thanks btw for your diligent lookup..), it would probably be more consistent to return an empty sequence.
If you feel that the Wiki article referenced below is somewhat incomplete, feel free to give us more feedback or (..even better..) feel invited to extend it by yourself ;)
Thanks for your feedback, hope this helps,
Christian
PS: BaseX 7.1 is close...
[1] http://docs.basex.org/wiki/Databases#Access_Resources
Hi Christian,
Thank you for the response.
First, I did not realize that the fn:collection argument may try to locate the disk in the file system. That is good to know as we would probably want to use the db:open to eliminate this possibility.
I have not looked at the referenced wiki article below in quite some time. My confusion mainly arises from the documentation for the Database Module in the XQuery portal (http://docs.basex.org/wiki/Database_Module). Throughout this page, the examples provided for the functions seem to indicate that it is possible to provide a single name which maps to a single document-node. For example, the db:delete function states:
- db:delete("DB", "docs/dir/doc.xml") deletes the document docs/dir/doc.xml in the database DB.
...when actually it deletes all nodes under the collection named "docs/dir/doc.xml". I think it is a subtle, but fairly important detail. I think the same principle applies to the add method where the example states:
- db:add("DB", document { <a/> }, "doc.xml") adds the document node to the database DB under the name doc.xml.
I read this (based on my view of a 1 to 1 mapping between a document name and a document node) as assigning this name to the document-node which is added, as opposed to adding the new document-node to the collection under the name "doc.xml". The difference is subtle here, but again important.
I suppose it can all be summarized by stating that all paths in the functions are just that, paths. They refer to collections in terms of the XQuery recommendation and never to documents. When I find more time, I can provide more detailed recommendations for the above wiki page.
This brings me to a new point. Given your clarification, I think the new helper functions are inconsistent with the other functions. For example, the db:is-xml documentation states that it "Checks if the specified resource exists and if it is an XML document". That being the case, I would think it would return false if my path argument actually contained two document-nodes. However, this is not the case. This is further confusing in that, if I have a hierarchy in the path (e.g. parent_folder/child_doc.xml) and provide a higher level path in the hierarchy (e.g. parent_folder only) as the argument, the function returns false. I assume the reason for this is clarified by the db:exists function, which states: "Checks if the specified database or resource exists. false is returned if a database directory is specified". Since the higher level path is a directory, the db:is-xml function returns false because it does not exist (according to the db:exists function). However, it would seem that the above discussion established that all paths are just directories. In fact, it would seem from some quick tests that I am even able to store binary resources and XML under the same path (which I would expect with folders but not with documents). But this makes a lot of these helper functions a bit confusing.
I hope this is useful. I still think that having a true document-node to document mapping would be useful, as it would allow one to use the handy database module functions such as add, delete, rename, and replace confidently.
Jack
Dear Jack,
> My confusion mainly arises from the documentation for the Database Module in the XQuery portal (http://docs.basex.org/wiki/Database_Module). Throughout this page, the examples provided for the functions seem to indicate that it is possible to provide a single name which maps to a single document-node.
We have added one introductory paragraph "Commonalities" on that page that is supposed to explain the $db variable, but it may well be that it's not really noticed, or may be misleading.
> When I find more time, I can provide more detailed recommendations for the above wiki page.
That would be great; you'll probably be more efficient in rephrasing the relevant snippets than us (maybe it's just one, two sentences that may need to be replaced).
> "Checks if the specified resource exists and if it is an XML document". That being the case, I would think it would return false if my path argument actually contained two document-nodes.
Do your documents have the same name?
> In fact, it would seem from some quick tests that I am even able to store binary resource and XML under the same path (which I would expect with folders but not with documents).
True, that's currently possible (but may be prohibited in future versions).
> I hope this is useful. I still think that having a true document-node to document mapping would be useful, as it would allow one to use the handy database module functions such as add, delete, rename, and replace confidently.
What would have to be changed in your opinion to end up with a true document-node to document mapping?
Thanks,
Christian
Dear Christian,
>> "Checks if the specified resource exists and if it is an XML document". That being the case, I would think it would return false if my path argument actually contained two document-nodes.
> Do your documents have the same name?
In the case I was testing, yes. I used the db:add function twice (for example):
db:add('testing',document{<a/>},'parent/doc.xml')
db:add('testing',document{<b/>},'parent/doc.xml')
If I then call the following, the result is true:
db:is-xml('testing','parent/doc.xml')
However, this isn't an XML document. It is a collection of XML documents.
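As a minimal sketch, the path can be inspected once the two additions above have been applied. The 'testing' database and the path are those from the example; whether both document-nodes are returned may depend on the BaseX version in use.

```xquery
(: Sketch: list what is stored under one path in the 'testing' database. :)
let $docs := db:open('testing', 'parent/doc.xml')
return (
  (: number of document-nodes found under the path :)
  count($docs),
  (: root element name of each document-node :)
  for $d in $docs return name($d/*)
)
```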
It isn't that I disagree with treating paths as collections, as that is very useful as well. My issue is that it should be possible to give a unique name to a document and to be explicit as to when that is your intention.
In terms of what would have to be changed, I think the following would suffice:
First, document-nodes already have an identifier assigned to them (db:node-id can be used to see this), so it would seem to me that it becomes a matter of allowing a map of document names to these identifiers. Although the recommendation would seem to allow for multiple names per document-node, I don't see much value in that and it would probably require new functions to allow such a thing. To keep it simple, my recommendation will only focus on what could be done in the existing functions, rather than adding new ones. Since my issue centers around the ambiguity of a path referring to a collection or a single document, I will focus on the functions that deal with these (and only the XML related ones).
- db:open($db as item(), $path as xs:string) as document-node()*: I think this can remain unchanged. If the path is a document, the result would technically be a single document-node, but that is already true.
- db:add($db as item(), $input as item(), $path as xs:string) as empty-sequence(): This should remain unchanged, as it shouldn't be required to assign a name to a document in order to add a document-node to an existing collection. However, one must be able to distinguish whether the path identifies a collection or a document within a collection. Further, as laid out in the XQuery recommendation, if the path is a document there should be a relation of that name to the existing collection names. I would think the cleanest approach would be to add an overload:
  db:add($db as item(), $input as item(), $path as xs:string, $doc_name as xs:boolean) as empty-sequence()
  The $doc_name parameter is true if the path is intended to identify a document name, and false if it identifies a collection. The default value is false, so that nothing changes in terms of how the function currently works. As with the existing path, the delimiter character ('/') is significant in that it represents hierarchy. Therefore, the following:
  db:add('db', document{<a/>}, 'level_1/level_2/my_doc', true())
  adds the document-node to the database named 'db' with the document name 'level_1/level_2/my_doc'. This document-node is also available under the collections 'level_1' and 'level_1/level_2' (as it is currently implemented). If the $doc_name parameter is true and the supplied path already exists, an error should be raised.
- db:rename($db as item(), $path as xs:string, $newpath as xs:string) as empty-sequence(): This would raise an error if the rename results in a document name conflict. For example, if I have the documents 'A/doc_1.xml' and 'B/doc_1.xml' and invoke db:rename('db','A','B'), the change would not be allowed, since this would result in 2 document-nodes with the name 'B/doc_1.xml'. Renaming a document name to an existing collection name could simply remove the document name from the document-node (i.e. unmap it).
- db:replace($db as item(), $path as xs:string, $input as item()) as empty-sequence(): This should work as it already does, since it raises an error if the path refers to more than one document node. Basically, you can replace a collection assuming it contains only one document-node. If the path is a document, no further check would be required since you know it contains a single document-node.
I would think that should cover it, but I am sure this is not exhaustive. Basically, we simply need a way to provide a unique name within the scope of a database to a single document-node. I would think that this would also allow you to implement the fn:doc and fn:document-uri functions better.
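Until something like this exists, the "raise an error if the path already exists" rule could be approximated in user code. The following is only a sketch: local:add-unique and the error name local:exists are hypothetical and not part of BaseX, and the check simply relies on db:open returning nothing for an unused path.

```xquery
(: Hypothetical helper, not a BaseX function: add a document only if the path is unused. :)
declare updating function local:add-unique(
  $db    as xs:string,
  $input as item(),
  $path  as xs:string
) {
  if (exists(db:open($db, $path)))
  (: the error name local:exists is an assumption made for this sketch :)
  then error(xs:QName('local:exists'), concat('Path already in use: ', $path))
  else db:add($db, $input, $path)
};

local:add-unique('db', document { <a/> }, 'level_1/level_2/my_doc')
```

A guard like this only covers additions made through the helper; direct db:add calls could still place a second document-node under the same path.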
I hope this helps to clarify.
Jack
P.S. My assumption is also that the changes above would be applied to the other APIs (e.g. Java) as well.