2014-09-25 5:54 GMT-05:00 Dirk Kirsten <dk@basex.org>:

Hello Oscar,

As Fabrice already suggested, maintaining a separate collection with
node-id mappings might be a viable solution. Another option could be to
split your documents up in a way that the relevant information is stored
in one collection (which is indexed) and all the other supplemental
information is stored in another collection. This way, the first
collection should be rather small and the text index should work fine.

>
> So, from all the information we receive, at this moment I estimate we only
> need around 25%. I thought about having different databases with full and
> partial information, but on one hand the requirements are not entirely
> defined, and on the other, there's information that we use in the queries
> and some other that we still need to display to its owner and that we're
> displaying using XSLT.

If you need to display additional information, it is no problem to
access multiple collections in a single XQuery. So splitting up the data
should not be a show-stopper.
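
To make the split concrete, here is a minimal sketch in plain Python (not
BaseX or XQuery). It models a small "core" collection holding only the
queried fields and a "supplemental" collection holding everything else,
joined by record id. The record layout, the field names and the helper
functions are all assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: each record has a few queried fields and bulkier,
# display-only fields.
RAW = """
<records>
  <record id="r1"><name>Ada</name><city>Konstanz</city><bio>Long text A</bio></record>
  <record id="r2"><name>Bob</name><city>Freiburg</city><bio>Long text B</bio></record>
</records>
"""

QUERIED_FIELDS = {"name", "city"}  # assumption: the fields used in queries


def split_records(xml_text):
    """Split each record into a small core part and a supplemental part."""
    core, extra = {}, {}
    for rec in ET.fromstring(xml_text).iter("record"):
        rid = rec.get("id")
        core[rid] = {c.tag: c.text for c in rec if c.tag in QUERIED_FIELDS}
        extra[rid] = {c.tag: c.text for c in rec if c.tag not in QUERIED_FIELDS}
    return core, extra


def find_by_city(core, extra, city):
    """Search only the small collection, then fetch the rest by id."""
    return [
        {"id": rid, **fields, **extra[rid]}
        for rid, fields in core.items()
        if fields.get("city") == city
    ]


core, extra = split_records(RAW)
```

Only `core` would need an index in this model; `extra` is touched only
for records that already matched.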

>
> == Question 1: Indexes are only required for some fields ==
> We usually need to locate the records by some id, or query over some of the
> elements available in the XML files, but those are pretty much always the
> same, so those are the elements that I'd like to have indexed. That's why I
> don't see a reason for having indexes over the contents of all the elements,
> since it is unlikely (at least right now) that we'll make use of those, and
> they consume a lot of hard drive space.

You currently can't restrict an index to certain elements only. It
would certainly be very nice to have super-flexible indexes, but as you
can guess this is a non-trivial task. Maintaining separate collections
is currently the way to go.

>
> == Question 2: to store files on the filesystem or as raw on BaseX? ==
> Right now, we're storing the information we receive as XML files on the
> file system on a RAID 10. What's your advice: to keep the files stored on
> the filesystem directly or to let BaseX handle those (I think this is the
> difference between the add/replace and store commands, right?). Is there
> any article you could point me to that I could use for reference? As I see
> it, BaseX is currently handling the queries and the index information but
> depends on the filesystem to retrieve the entire document, am I right?

If your collection of documents is not small, simply keeping them in
the file system will not perform well. Using XQuery, you can read from
the file system, but that means each document has to be parsed again
every time it is accessed.

As Fabrice pointed out (thanks!), the concept is different from what you
described here. Add/replace parses an XML file and adds it to the
database. During parsing, the XML file is converted to a binary storage
format, which makes it possible to optimize queries and to access
relevant data much faster. You cannot add/replace arbitrary binary
files, as they would not be parseable. Store, on the other hand, simply
copies the file and can therefore handle any binary file. This is useful
if you e.g. want to store media files within your DB, but you most
likely do not want to store XML files as raw files, as performance would
be similar to reading from the plain filesystem.

In short: You most likely want to add your documents to a collection.
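
The difference can be illustrated with a toy model in plain Python. This
is not the BaseX API: the class `ToyDatabase` and its methods are
hypothetical names that only mirror the idea that "add" parses once on
insert, while "store" keeps the bytes untouched, so that XML stored this
way has to be parsed on every access.

```python
import xml.etree.ElementTree as ET


class ToyDatabase:
    """Hypothetical in-memory model of add vs. store semantics."""

    def __init__(self):
        self.documents = {}  # path -> parsed tree (models add/replace)
        self.raw = {}        # path -> bytes, untouched (models store)

    def add(self, path, data):
        # Parsed once at insert time; raises ET.ParseError for non-XML input.
        self.documents[path] = ET.fromstring(data)

    def store(self, path, data):
        # Copied as-is, so any binary content is accepted.
        self.raw[path] = data

    def names_from_documents(self):
        # Queries run on the already parsed structure.
        return [n.text for doc in self.documents.values() for n in doc.iter("name")]

    def names_from_raw(self, path):
        # Raw content has to be parsed again on every call.
        return [n.text for n in ET.fromstring(self.raw[path]).iter("name")]


db = ToyDatabase()
db.add("a.xml", b"<r><name>Ada</name></r>")
db.store("a-raw.xml", b"<r><name>Ada</name></r>")
db.store("logo.png", b"\x89PNG\r\n\x1a\n")
```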

>
> == Question 3: dynamic optimize and index updates? ==
> As you can imagine, I'll need to have the indexes updated, since
> "data-mining" will be done with the information from the people
> registered on it. I've seen it is not possible to run the "optimize" command
> while the app is up, and I'm not sure about the indexes getting updated in
> real time either. This is troubling me since the idea is to have the
> app running 24x7, and if we get to have a lot of registered users, updating
> the indexes or optimizing the db will take a long time, won't it?
> So any strategies on this?

I don't quite get what you mean by "optimize cannot be run while the app
is up". Optimize cannot be run if the database is opened by another
context (as it is an updating operation and we maintain ACID), but your
app shouldn't keep the database open all the time.

One option you might want to look into is updating indexes (see
http://docs.basex.org/wiki/Options#UPDINDEX). It might be beneficial for
your use case. You still have to trigger the indexing by using optimize.

One common strategy for such scenarios is also to maintain separate
collections: one with the most current data, which is not indexed and
can be updated quite fast, and another with the bulk of the data, which
is indexed and can be accessed quickly. A cron job would then transfer
the current data to the other collection during times of low load on
the server. This way, your updates will be performed on a rather small
collection without the need to optimize the indexes all the time, while
read operations can be fast as the majority of the data is nicely
indexed. Again, accessing both collections is no hassle with XQuery.
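
As a rough illustration of this two-collection strategy, here is a toy
in-memory model in plain Python. It is not BaseX code: `TwoTierStore`,
its methods and the `city` field are hypothetical, and `merge()` stands
in for whatever the scheduled job would actually run against the
database.

```python
from collections import defaultdict


class TwoTierStore:
    """Small unindexed 'current' part plus an indexed 'bulk' part."""

    def __init__(self):
        self.current = []                    # recent records, no index
        self.bulk = []                       # bulk of the data
        self.bulk_index = defaultdict(list)  # city -> positions in self.bulk

    def add(self, record):
        # Updates only touch the small collection, so they stay cheap.
        self.current.append(record)

    def merge(self):
        # Job for times of low load: move current data over, rebuild index.
        self.bulk.extend(self.current)
        self.current.clear()
        self.bulk_index.clear()
        for pos, rec in enumerate(self.bulk):
            self.bulk_index[rec["city"]].append(pos)

    def find_by_city(self, city):
        # Index lookup on the bulk data, linear scan on the small remainder.
        hits = [self.bulk[pos] for pos in self.bulk_index.get(city, [])]
        hits += [rec for rec in self.current if rec["city"] == city]
        return hits


store = TwoTierStore()
store.add({"id": "r1", "city": "Konstanz"})
store.merge()
store.add({"id": "r2", "city": "Konstanz"})
```

A query sees both tiers, so a record is visible immediately after
`add()`, before the next merge has indexed it.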

>
> == Question 4: connection pooling ==
> I have only found XQJ-Pool to be used with BaseX, does anybody know about
> any other pooling mechanism available for BaseX?

I am not aware of any.

Cheers,
Dirk

--
Dirk Kirsten, BaseX GmbH, http://basex.org
|-- Firmensitz: Blarerstrasse 56, 78462 Konstanz
|-- Registergericht Freiburg, HRB: 708285, Geschäftsführer:
| Dr. Christian Grün, Dr. Alexander Holupirek, Michael Seiferle
`-- Phone: 0049 7531 28 28 676, Fax: 0049 7531 20 05 22