Interesting idea. I had thought of using DB partitioning but didn't pursue it further, mainly because of the reasoning below.

Currently I am ingesting ~3000 XML files, storing ~50 XML files per DB, and that number will grow quickly. So the approach below would produce ~3000 more output files (and growing), which increases I/O considerably for further pre-processing.

However, I don't really care if the process takes a few minutes or a few hours (as long as it's not days ;)). Given the situation and my options, I will certainly try this.

The database is currently indexed at the attribute level, as that is what I will be querying the most. Do you think I should do anything differently?
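For reference, a rough sketch of the kind of index-backed lookup I have in mind; 'your_collection_name' and 'some-value' are just placeholders, and whether the attribute index is actually applied should show up in the query info:

    (: 'your_collection_name' and 'some-value' are placeholders :)
    for $e in db:open('your_collection_name')//E[@name = 'some-value']
    return db:path($e)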

Thanks, 
- Mansi  

On Thu, Nov 6, 2014 at 10:48 AM, Fabrice Etanchaud <fetanchaud@questel.com> wrote:

Hi Mansi,

 

Here you have a natural partition of your data: the files you ingested.

So my first suggestion would be to query your data on a file basis:

 

(: writes one <names> file per ingested document, named after its path in the database :)
for $doc in db:open('your_collection_name')
let $file-name := db:path($doc)
return
  file:write(
    $file-name,
    <names>{
      for $name in $doc//E/@name/data()
      return <name>{ $name }</name>
    }</names>
  )
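(As a side note, a query like this could be saved to a file, for example extract-names.xq, and run with the BaseX command-line client via "basex extract-names.xq"; the file name is only an example.)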

 

Is it for indexing?

 

Hope it helps,

 

Best regards,

 

Fabrice Etanchaud

Questel/Orbit

 

From: basex-talk-bounces@mailman.uni-konstanz.de [mailto:basex-talk-bounces@mailman.uni-konstanz.de] On behalf of Mansi Sheth
Sent: Thursday, November 6, 2014 4:33 PM
To: Christian Grün
Cc: BaseX
Subject: Re: [basex-talk] Out Of Memory

 

This will need a lot of detail, so bear with me below:

 

Briefly, my XML files look like this:

 

<A name="">
    <B name="">
        <C name="">
            <D name="">
                <E name=""/>
            </D>
        </C>
    </B>
</A>

 

<A> can contain <B>, <C> or <D>, and B, C or D can contain <E>. We have thousands of such XML files (currently 3000 in my test data set), around 50MB each on average. It's tons of data! Currently, my database is ~18GB in size.

 

Query: /A/*//E/@name/string()

 

This query was going OOM within a few minutes.

 

I tried a few ways of whitelisting with a contains() clause to truncate the result set (roughly along the lines of the sketch below). That didn't help either. So now I am out of ideas. This is with 10GB of dedicated memory given to the JVM.
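A rough sketch of the kind of whitelist filter I mean; the values in $whitelist are hypothetical placeholders:

    (: $whitelist values are placeholders; keep only names matching the whitelist :)
    let $whitelist := ('ValueA', 'ValueB')
    return /A/*//E/@name[
      some $w in $whitelist satisfies contains(., $w)
    ]/string()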

 

Once the above query works and doesn't go Out Of Memory, I will also need the corresponding file names, e.g. (a rough sketch of one way to get this follows the examples below):

 

XYZ.xml //E/@name

PQR.xml //E/@name
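A minimal sketch of one way to produce that pairing, assuming one document per ingested file; 'your_collection_name' is a placeholder:

    (: one string per document: its path followed by its E/@name values :)
    for $doc in db:open('your_collection_name')
    return string-join(
      (db:path($doc), $doc//E/@name/string()),
      ' '
    )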

 

Let me know if you need more details to appreciate the issue.

- Mansi

 

On Thu, Nov 6, 2014 at 8:48 AM, Christian Grün <christian.gruen@gmail.com> wrote:

Hi Mansi,

I think we need more information on the queries that are causing the problems.

Best,
Christian




On Wed, Nov 5, 2014 at 8:48 PM, Mansi Sheth <mansi.sheth@gmail.com> wrote:
> Hello,
>
> I have a use case where I have to extract lots of information from each XML
> in each DB, something like the attribute values of most of the nodes in an XML.
> Such queries go Out Of Memory with the exception below. I am giving it ~12GB
> of RAM on an i7 processor. Well, I can't complain here since I am most
> definitely asking for loads of data, but is there any way I can get this kind
> of data successfully?
>
> mansi-veracode:BigData mansiadmin$ ~/Downloads/basex/bin/basexhttp
> BaseX 8.0 beta b45c1e2 [Server]
> Server was started (port: 1984)
> HTTP Server was started (port: 8984)
> Exception in thread "qtp2068921630-18" java.lang.OutOfMemoryError: Java heap
> space
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.addConditionWaiter(AbstractQueuedSynchronizer.java:1857)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2073)
> at
> org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:342)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:526)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool.access$600(QueuedThreadPool.java:44)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
> at java.lang.Thread.run(Thread.java:744)
>
>
> --
> - Mansi



 

--

- Mansi




--
- Mansi