Hi Rainer,
May I suggest XOM? [1].
Also, navigating a DOM tree is quite irksome in comparison to using XQuery/XPath. Perhaps you could consider slicing up your large document into "manageable chunks" (e.g. smaller documents) then inserting the smaller documents into BaseX with a view to then running XQuery to get specific parts of the logical large document when and as required. This approach would use less memory and may well be more efficient.
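The chunking idea can be tried out with the JDK's StAX API alone, before any database is involved. The following is a minimal sketch, not BaseX code: it streams through a document and copies every element with a given name into its own small string, so that only one chunk has to be held in memory while it is being written. The class name `ChunkSplitter` and the element name passed in by the caller are assumptions made for illustration; how the resulting chunks are then added to a database is left open here.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

/** Hypothetical helper: splits one XML document into per-element chunks. */
final class ChunkSplitter {

  /** Streams through the XML and returns each element named chunkName as its own string. */
  static List<String> split(final String xml, final String chunkName)
      throws XMLStreamException {
    final XMLEventReader reader =
        XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
    final XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();
    final List<String> chunks = new ArrayList<>();

    StringWriter buffer = null;   // target of the chunk currently being written
    XMLEventWriter writer = null; // non-null only while inside a chunk
    int depth = 0;                // element nesting depth inside the current chunk

    while (reader.hasNext()) {
      final XMLEvent event = reader.nextEvent();

      // A matching start tag outside of any chunk opens a new chunk.
      if (writer == null && event.isStartElement()
          && event.asStartElement().getName().getLocalPart().equals(chunkName)) {
        buffer = new StringWriter();
        writer = outputFactory.createXMLEventWriter(buffer);
      }

      if (writer != null) {
        writer.add(event);
        if (event.isStartElement()) {
          depth++;
        }
        // The end tag that brings the depth back to zero closes the chunk.
        if (event.isEndElement() && --depth == 0) {
          writer.flush();
          writer.close();
          chunks.add(buffer.toString());
          writer = null;
        }
      }
    }
    reader.close();
    return chunks;
  }
}
```

For a real multi-gigabyte file, the `StringReader` would be replaced by a file stream and each chunk would be handed off (for example written to disk or to a database) instead of being collected in a list.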
May I also suggest checking out the BaseX XQJ API [2], where retrieved XML can be obtained as a Java DOM Node (e.g. Element / Document), a StAX XMLStreamReader, or via a SAX ContentHandler.
Regards,
Charles
[1] http://www.xom.nu/
[2] http://xqj.net/basex
Hi,
I did some first steps with BaseX, but unfortunately I was not very successful.
I have a really large XML file which does not fit into memory, and I would like to navigate it as a DOM. My hope was that I could store it as a BaseX database, retrieve the root element as an org.w3c.dom.Node, and then start navigating down and up the DOM as needed without having to have the whole thing in memory.
I tried something like this:
  QueryProcessor processor = new QueryProcessor("doc('catalog')/*", context);
  Iter iter = processor.iter();
  Item item = iter.next();
  Object node = item.toJava();
According to the debugger, the item variable indeed denoted my root element. However, item.toJava() did not return just that node. Instead, BaseX apparently tried to retrieve the whole DOM, which was the very thing I wanted to avoid.
Am I doing something wrong here? Or is this an unforeseen use case?
By the way, I also had trouble getting any result at all. Only with "doc('catalog')/*" did I get an item that was not null. When I tried to retrieve all elements named "article" using "doc('catalog')//article", the item was null. I also tried item.iter() in order to find a node's children. However, it turned out that item.iter().next() == item.
And I had quite a tough time fiddling around with the documentation and with the JavaDoc. While the documentation puts a lot of effort into XQuery, it remains unclear to some extent how to do some basic stuff with BaseX programmatically. This is a hurdle for the BaseX beginner. Some more Java examples and explanations would be nice, showing how to connect to the database, submit a query and process the result. Currently the examples only show how to dump a query result to System.out. It would be interesting to learn about processing it further; see my troubles with Iter.
--
Best regards Rainer Klute
Hi Charles!
On 13.10.2012 13:25, Charles Foster wrote:
May I suggest XOM? [1].
Had a look at it but don't see how it could solve my challenges.
Also, navigating a DOM tree is quite irksome in comparison to using XQuery/XPath. Perhaps you could consider slicing up your large document into "manageable chunks" (e.g. smaller documents) then inserting the smaller documents into BaseX with a view to then running XQuery to get specific parts of the logical large document when and as required. This approach would use less memory and may well be more efficient.
Not really, because everything is somewhat deeply nested with very different numbers of nodes in the various subtrees. Partitioning would be at least cumbersome and would have to be done each time a new version of the data comes along.
May I also suggest checking out the BaseX XQJ API [2], where retrieved XML can be obtained as a Java DOM Node (e.g. Element / Document), StaX XMLStreamReader and SAX ContentHandler.
Yes, I tried XQJ, but I cannot deploy XQJ on Android because it is in the javax.* namespace. Sure, I could repackage interface and implementation, but I'd rather try to avoid it. And I guess using XQJ would still cause BaseX to build up the whole tree in memory.
May I suggest XOM? [1].
Had a look at it but don't see how it could solve my challenges.
Did you thoroughly investigate XOM?
Your requirement: "I have a really large XML file which does not fit into memory, and I would like to navigate it as a DOM."
XOM Website (front page): "XOM is very memory efficient. If you read an entire document into memory, XOM uses as little memory as possible. More importantly, XOM allows you to filter documents as they're built so you don't have to build the parts of the tree you aren't interested in. For instance, you can skip building text nodes that only represent boundary white space, if such white space is not significant in your application. You can even process a document piece by piece and throw away each piece when you're done with it. XOM has been used to process documents that are gigabytes in size."
Failing that, you could check out Saxon's TinyTree implementation.
Also, navigating a DOM tree is quite irksome in comparison to using XQuery/XPath. Perhaps you could consider slicing up your large document into "manageable chunks" (e.g. smaller documents) then inserting the smaller documents into BaseX with a view to then running XQuery to get specific parts of the logical large document when and as required. This approach would use less memory and may well be more efficient.
Not really, because everything is somewhat deeply nested with very different numbers of nodes in the various subtrees. Partitioning would be at least cumbersome and would have to be done each time a new version of the data comes along.
I find it difficult to understand how there cannot be "something" you can do to break the XML down into something more manageable, and perhaps put the sliced XML documents in their own collection to signify a complete logical document. Could you perhaps give an example?
If BaseX's model for storing XML documents cannot cope with such large XML documents, then consider Sedna. As far as I am aware, Sedna is actually ideal for storing huge single-file XML documents.
May I also suggest checking out the BaseX XQJ API [2], where retrieved XML can be obtained as a Java DOM Node (e.g. Element / Document), StaX XMLStreamReader and SAX ContentHandler.
Yes, I tried XQJ, but I cannot deploy XQJ on Android because it is in the javax.* namespace. Sure, I could repackage interface and implementation, but I'd rather try to avoid it. And I guess using XQJ would still cause BaseX to build up the whole tree in memory.
That's a shame.
Regards,
Charles
On 15.10.2012 15:36, Charles Foster wrote:
May I suggest XOM? [1].
Had a look at it but don't see how it could solve my challenges.
Did you thoroughly investigate XOM?
Your requirement: "I have a really large XML file which does not fit into memory, and I would like to navigate it as a DOM."
XOM Website (front page): "XOM is very memory efficient. If you read an entire document into memory, XOM uses as little memory as possible. More importantly, XOM allows you to filter documents as they're built so you don't have to build the parts of the tree you aren't interested in. For instance, you can skip building text nodes that only represent boundary white space, if such white space is not significant in your application. You can even process a document piece by piece and throw away each piece when you're done with it. XOM has been used to process documents that are gigabytes in size."
Failing that, you could check out Saxon's TinyTree implementation.
XML documents with gigabytes in size? Sounds good! I'll probably get back to it if BaseX can indeed not cope with my DOM navigation requirement and my second-best approach fails, which is to convert the XML document into an SQLite database.
Also, navigating a DOM tree is quite irksome in comparison to using XQuery/XPath. Perhaps you could consider slicing up your large document into "manageable chunks" (e.g. smaller documents) then inserting the smaller documents into BaseX with a view to then running XQuery to get specific parts of the logical large document when and as required. This approach would use less memory and may well be more efficient.
Not really, because everything is somewhat deeply nested with very different numbers of nodes in the various subtrees. Partitioning would be at least cumbersome and would have to be done each time a new version of the data comes along.
I find it difficult to understand how there there can not be "something" you can do to break the XML down to something more manageable, and perhaps put the sliced XML documents in their own collection to signify a complete logical document. Could you perhaps give an example?
If BaseX's model to storing XML documents can not cope with such large XML documents then consider Sedna. As far as I am aware, Sedna is actually ideal for storing huge single file XML documents.
Would be a nice try if Sedna could run on Android. But it is in C and not Java, so ...
May I also suggest checking out the BaseX XQJ API [2], where retrieved XML can be obtained as a Java DOM Node (e.g. Element / Document), StaX XMLStreamReader and SAX ContentHandler.
Yes, I tried XQJ, but I cannot deploy XQJ on Android because it is in the javax.* namespace. Sure, I could repackage interface and implementation, but I'd rather try to avoid it. And I guess using XQJ would still cause BaseX to build up the whole tree in memory.
That's a shame.
Yes! It can be circumvented, and I am prepared to do so, but this probably won't help me, because BaseX returns large trees and does not load objects lazily.
Hi there,
just for the sake of completeness, my 2 cents:
On 15.10.2012 at 16:53, Rainer Klute rainer.klute@itemis.de wrote:
As far as I am aware, Sedna is actually ideal for storing huge single file XML documents.
I am not sure whether Sedna is better suited or not, but this is off topic anyway :-).
Would be a nice try if Sedna could run on Android. But it is in C and not Java, so …
I don't think the actual problem is storage or processing, but your application demanding DOM-like navigation and lazy loading of elements. This is indeed out of BaseX's scope (at least at the moment) as far as I can see.
I do not know your specifics, but I'd suggest you should at least consider giving BaseX (or any XML database for that matter) a try.
It’s rather easy and quick to extract only portions of a document and load other parts on demand. Usually even with documents of several gigabytes in size, thanks to indices, performance mainly depends on how much data you want and need at a given time.
But anyway, my reply is just for the sake of completeness, feel free to pursue whichever approach feels best for you and solves your problem :-)
Kind regards Michael
P.S. (bold claims following ;-)) I doubt that a tree representation that is not persisted on disk (such as XOM or SAX) is able to beat an XML database when it comes to tasks other than serializing the whole document, and even there most databases might well be on par, as serializing the on-disk representation should be straightforward. XOM states in its documentation that you should not consider using XPath, as performance may degrade. So deciding for or against XOM depends heavily on what you actually want and need :-)
Thanks, Michael, for your 2 ¢!
I think I have an idea now what BaseX can do and where the limitations are. Well, since I won't get this lazy-loading-while-navigating-a-DOM-thingy anywhere, I'll have to reconsider my system architecture and find another solution.
(A lazy loading might be an idea for a future BaseX version. :-))
True, and I think this might be a very welcome addition that many use cases could benefit from ;-)
On 15.10.2012 at 19:12, Rainer Klute rainer.klute@itemis.de wrote:
(A lazy loading might be an idea for a future BaseX version. :-))
Hi Rainer, I am in fact not sure if such an API exists; at least to my knowledge it does not.
Thanks for the valuable input and fruitful discussion so far :-)
Feel free to keep in touch and this thread alive. Feedback and questions are always very welcome!
Best
Michael
On 15.10.2012 at 19:12, Rainer Klute rainer.klute@itemis.de wrote:
Well, since I won't get this lazy-loading-while-navigating-a-DOM-thingy anywhere, I'll have to reconsider my system architecture and find another solution.
On 10/15/2012 07:12 PM, Rainer Klute wrote:
Thanks, Michael, for your 2 ¢!
I think I have an idea now what BaseX can do and where the limitations are. Well, since I won't get this lazy-loading-while-navigating-a-DOM-thingy anywhere, I'll have to reconsider my system architecture and find another solution.
(A lazy loading might be an idea for a future BaseX version. :-))
Strangely enough (maybe I didn't get the problem), I'm rather sure BaseX can cope with this use case!? Just from a high-level view, it should come down to loading pages (with the nodes) into a buffer and using some buffer replacement strategy, plus exposing the XPath axes and so on, or at least some navigational methods.
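To make the buffering idea concrete, here is a minimal sketch of a page buffer with a least-recently-used replacement strategy, built on the JDK's `LinkedHashMap`. It does not describe how BaseX or Sirix actually manage their pages; the class name `PageBuffer`, the integer page numbers and the loader function are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntFunction;

/** Hypothetical buffer: keeps at most `capacity` pages and evicts the least recently used one. */
final class PageBuffer<P> {
  private final IntFunction<P> loader; // assumed to read one page from disk
  private final Map<Integer, P> pages;
  private int loads;                   // how often the loader had to be called

  PageBuffer(final int capacity, final IntFunction<P> loader) {
    this.loader = loader;
    // accessOrder = true keeps the entries ordered from least to most recently used.
    this.pages = new LinkedHashMap<Integer, P>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(final Map.Entry<Integer, P> eldest) {
        return size() > capacity; // evict once the buffer is over capacity
      }
    };
  }

  /** Returns the page, calling the loader only if the page is not buffered. */
  P get(final int pageNumber) {
    P page = pages.get(pageNumber);
    if (page == null) {
      page = loader.apply(pageNumber);
      loads++;
      pages.put(pageNumber, page);
    }
    return page;
  }

  int loads() {
    return loads;
  }

  boolean isBuffered(final int pageNumber) {
    return pages.containsKey(pageNumber); // does not count as an access
  }
}
```

A navigational API on top of such a buffer would resolve "first child" or "right sibling" to a page number plus an offset and fetch the page through `get`, so that memory use is bounded by the buffer capacity rather than by the document size.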
BTW: You could maybe also try Sirix [1] and have a look at [2] for some veeery basic documentation. I'm currently working on index structures and a Brackit(.org) binding (to provide XQuery and XQuery Update Facility mechanisms as well as temporal XPath extensions). Maybe (or hopefully) you could get some performance boost with SSDs in comparison to other DOM-like storage systems. However, the real benefit is that you are able to version and persist the data, together with (hopefully, in the future) fully ACID-safe transactions.
It would be nice if you could at least have a look at it. I'd happily give you a helping hand if you need some precise instructions or if you encounter bugs (however, keep in mind that I'm currently the only developer, and sadly I have so many ideas for features and improvements, but for the time being I'm not aware of any users). It's open source, though, and I hope that at some point another interested and motivated developer (or two ;-)) joins the "team". But well, first of all I need some users ;-)
At this time I'm especially interested in API-related matters. I put some effort into a hopefully "nice" API some weeks ago and am just thinking about renaming some methods, but other than that I'd like to "finalize" the API as it is (except for future additions, for sure). But for this I'd first like to get some input, which would be really great :-)
However, I'd say that if BaseX provides an API that can cope with this, go with it, as it's industrial-strength software and has a lot of users.
kind regards, Johannes
[1] https://github.com/JohannesLichtenberger/sirix
[2] https://github.com/JohannesLichtenberger/sirix/wiki/Simple-usage
Hi Johannes,
I had a quick glance at Sirix: it looks quite good overall and nicely covers the navigational requirements.
Okay, I don't like ugly interface names starting with a capital I, like IFooBar. I'd prefer working with interfaces named FooBar and having implementing classes like FooBarImpl or FooBarWhatever. The reason is that I see the interface names all the time, so they must be nice, easy, pronounceable words. The names of the implementing classes I have to see seldom or not at all, so they may be ugly.
Using a fluent API can be very nice like in
wtx.moveTo(15).get().moveToRightSibling().get().moveToFirstChild().get().insertCommentAsFirstChild("foo");
However, a fluent expression is not necessarily easy to understand. Example:
  for (final IAxis axis = new DescendantAxis.Builder(wtx)
      .includeSelf()
      .visitor(Optional.<IVisitor> of(
          new ModificationVisitor(wtx, wtx.getNode().getNodeKey())))
      .build(); axis.hasNext();) {
    axis.next();
  }
I cannot immediately grasp what is going on here by reading the expression from left to right. But, well, that may be due to my not being familiar with your API.
I wish you all the best with Sirix!
On 10/16/2012 10:24 AM, Rainer Klute wrote:
Hi Johannes,
I had a quick glance at Sirix: looks quite good overall and covers nicely navigational requirements.
Thanks.
Okay, I don't like ugly interface names starting with a capital I, like IFooBar. I'd prefer working with interfaces named FooBar and having implementing classes like FooBarImpl or FooBarWhatever. The reason is that I see the interface names all the time, so they must be nice, easy, pronouncable words. The names of the implementing classes I have to see seldom or not at all, so they may be ugly.
Thanks for having a quick look, I think I'll really change the interface names (and possibly or most probably also enums and method parameters). Does someone maybe know how to quickly rename _all_ parameters from pFooBar to fooBar in Eclipse (that is strip the "p" and change from upper to lowercase letters)?
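Independent of any IDE feature, the renaming itself can be expressed as a regular expression. The following is a rough sketch in plain Java, assuming that every identifier of the form `p` followed by an uppercase letter really is a parameter; it works purely on text, knows nothing about scopes, and may therefore produce name clashes with fields or local variables. The class name `ParamRenamer` is made up for the example.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: textual renaming of pFooBar-style identifiers to fooBar. */
final class ParamRenamer {
  // A lowercase "p" at a word boundary, followed by an uppercase letter, e.g. pFooBar.
  private static final Pattern PREFIXED = Pattern.compile("\\bp([A-Z])(\\w*)");

  /** Strips the "p" prefix and lowercases the letter that followed it. */
  static String stripPrefix(final String source) {
    final Matcher matcher = PREFIXED.matcher(source);
    final StringBuffer out = new StringBuffer();
    while (matcher.find()) {
      final String renamed =
          matcher.group(1).toLowerCase(Locale.ROOT) + matcher.group(2);
      matcher.appendReplacement(out, Matcher.quoteReplacement(renamed));
    }
    matcher.appendTail(out);
    return out.toString();
  }
}
```

Words such as `process` or `public` are left alone because their second letter is lowercase; the result should still be reviewed and compiled before it is committed.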
After all, most of the time the interfaces are either in the API package or in dedicated interface packages.
Using a fluent API can be very nice like in
wtx.moveTo(15).get().moveToRightSibling().get().moveToFirstChild().get().insertCommentAsFirstChild("foo");
That is one of the things I wasn't sure about. But now I'm quite happy, because you have the freedom to use either
moveTo(15).get() to get the transaction handle _or_ moveTo(15).hasMoved() in expressions which must return a boolean, to check whether the transaction cursor really moved (for instance, it doesn't move if the node with the given node key is not available). But you never get an NPE.
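The pattern described here, a move that returns a small result object instead of a possibly null cursor, can be sketched in a few lines. This is an illustration only and not Sirix's implementation: the classes `Move` and `ToyCursor` and the choice of `IllegalStateException` for a failed `get()` are assumptions.

```java
import java.util.Set;

/** Hypothetical result of a cursor movement: the cursor itself or a "not moved" marker. */
final class Move<T> {
  private final T cursor; // null only in the "not moved" case and never handed out

  private Move(final T cursor) {
    this.cursor = cursor;
  }

  static <T> Move<T> moved(final T cursor) {
    return new Move<>(cursor);
  }

  static <T> Move<T> notMoved() {
    return new Move<>(null);
  }

  /** For use in boolean expressions. */
  boolean hasMoved() {
    return cursor != null;
  }

  /** For fluent chaining; fails with a clear message instead of returning null. */
  T get() {
    if (cursor == null) {
      throw new IllegalStateException("cursor did not move");
    }
    return cursor;
  }
}

/** Toy cursor over a fixed set of node keys. */
final class ToyCursor {
  private final Set<Long> nodeKeys;
  private long current;

  ToyCursor(final Set<Long> nodeKeys, final long start) {
    this.nodeKeys = nodeKeys;
    this.current = start;
  }

  Move<ToyCursor> moveTo(final long nodeKey) {
    if (!nodeKeys.contains(nodeKey)) {
      return Move.notMoved(); // the cursor stays where it was
    }
    current = nodeKey;
    return Move.moved(this);
  }

  long currentKey() {
    return current;
  }
}
```

With this shape, `cursor.moveTo(15).get().moveTo(16).get()` chains fluently, while `if (cursor.moveTo(99).hasMoved())` tests a move without risking a NullPointerException.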
However, a fluent expression is not necessarily easy to understand. Example:
  for (final IAxis axis = new DescendantAxis.Builder(wtx)
      .includeSelf()
      .visitor(Optional.<IVisitor> of(
          new ModificationVisitor(wtx, wtx.getNode().getNodeKey())))
      .build(); axis.hasNext();) {
    axis.next();
  }
I cannot grasp immediately what is going on here by reading the expression from left to right. But, well, that may be due to not being content with your API.
OK, that's quite hard, I have to admit. But it may be one of the most powerful axes I've implemented.
For instance it's very closely modeled after the new Java7 file system walker API and works in conjunction with an IVisitResult interface or more precisely the enum:
  /**
   * The result type of an {@link IVisitor} implementation.
   *
   * @author Johannes Lichtenberger, University of Konstanz
   */
  public enum EVisitResult {
    /** Continue without visiting the siblings of this structural node. */
    SKIPSIBLINGS,

    /** Continue without visiting the descendants of this element. */
    SKIPSUBTREE,

    /** Continue traversal. */
    CONTINUE,

    /** Terminate traversal. */
    TERMINATE,
  }
described afterwards. Thus, you are able to guide the preorder traversal (for instance skip siblings of the currently selected node, skip a whole subtree, continue normally in preorder or terminate the traversal).
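For comparison, this is what the Java 7 file tree walker mentioned above looks like in use. The sketch relies only on JDK classes (`Files.walkFileTree`, `SimpleFileVisitor`, `FileVisitResult`) and shows how a visitor's return value steers a preorder traversal; the class name `GuidedWalk` and the idea of skipping directories by name are made up for the example and say nothing about how Sirix implements its axis.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;

final class GuidedWalk {

  /** Collects the names of all files below root, ignoring directories named skipName. */
  static List<String> fileNames(final Path root, final String skipName)
      throws IOException {
    final List<String> names = new ArrayList<>();
    Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
      @Override
      public FileVisitResult preVisitDirectory(final Path dir,
          final BasicFileAttributes attrs) {
        final Path name = dir.getFileName();
        if (name != null && name.toString().equals(skipName)) {
          return FileVisitResult.SKIP_SUBTREE; // do not descend into this directory
        }
        return FileVisitResult.CONTINUE;
      }

      @Override
      public FileVisitResult visitFile(final Path file,
          final BasicFileAttributes attrs) {
        names.add(file.getFileName().toString());
        return FileVisitResult.CONTINUE; // TERMINATE would stop the whole walk here
      }
    });
    return names;
  }
}
```

`FileVisitResult` also offers `SKIP_SIBLINGS` and `TERMINATE`, which correspond to the remaining two constants of the enum quoted above.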
It's a variant which you maybe should really use with a visitor implementation, that is an implementation which does different things for different node kinds:
  /**
   * My visitor doing something application specific with different node kinds.
   */
  public final class MyVisitor extends AbsVisitorSupport {
    @Override
    public IVisitResult visit(final @Nonnull ImmutableElement pNode) {
      return processElementNode(pNode);
    }

    @Override
    public IVisitResult visit(final @Nonnull ImmutableText pNode) {
      return processTextNode(pNode);
    }

    ...
  }
A simpler invocation might be to use the "normal" DescendantAxis:
  for (final IAxis axis = new DescendantAxis(trx); axis.hasNext();) {
    axis.next();

    if (trx.isElement()) {
      LOGGER.info(trx.getName());

      // Do something.
    }
  }
Or if you have to iterate over all structural and non-structural nodes use new NonStructuralWrapperAxis(new DescendantAxis(trx)) (I'm not sure about the name "NonStructuralWrapperAxis" I added a few days ago).
You are free to use hasNext(), next() and peek() for every axis, which might be very powerful sometimes.
Even implementing an axis (if you have to do so some day) is now very _very_ easy. For instance, just implement "nextKey()" and call "done()" once done.
I'd say it's easily one of the best use cases you have ;-)
Furthermore a few weeks ago I've changed all filtering classes (namely just one, AbsFilter), such that all filters are now (Google) Guava Predicates. In my opinion very nice.
I wish you all the best with Sirix!
Thanks, but at times it's just depressing to have spent endless hours without, as of now, anyone ever having used the project (apart from myself and other students who have worked with (and on) Treetank, a system which Sirix is based on, or rather of which it is a fork specializing in the tree structure) ;-) For instance, it's _very_ helpful to get comments like the one about renaming the interfaces and so on. That's the first thing I will do now...
BTW: If you are still interested you can also discuss or ask for help in https://groups.google.com/forum/#!forum/sirix-users, would be great.
kind regards, Johannes
+1, thx Johannes. BaseX has quite a powerful internal API, but JavaDoc is the only available documentation. I'll give some more feedback end of the week.
_______________________________________________
BaseX-Talk mailing list
BaseX-Talk@mailman.uni-konstanz.de
https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk
On 16.10.2012 10:42, Christian Grün wrote:
+1, thx Johannes. BaseX has quite a powerful internal API, but JavaDoc is the only available documentation. I'll give some more feedback end of the week.
That would be great! Can you comment on BaseX's lazy loading capability right now, i.e. whether it is possible or not?
Rainer,
I'll give some more feedback end of the week.
sorry for the delay. Right now my workload is quite impressive (the punishment for being offline for some days), which is why I'll have to postpone my reply for one or two more days. Rest assured you'll get an answer, though!
Christian
basex-talk@mailman.uni-konstanz.de