Follow up--I tried giving BaseX the full 16GB of RAM and it still ultimately locked up with the memory meter showing 13GB.
I'm thinking this must be some kind of memory leak.
I tried importing the DITA Open Toolkit's documentation source and that worked fine with the max memory being about 2.5GB, but it's only about 250 topics.
Cheers,
E.
-- Eliot Kimber http://contrext.com
On 5/3/18, 4:59 PM, "Eliot Kimber" <ekimber@contrext.com> wrote:
In the context of trying to do fun things with DITA docs in BaseX, I downloaded the latest BaseX (9.0.1) and tried creating a new database and loading docs into it using the BaseX GUI. This is on macOS 10.13.4 with 16GB of hardware RAM available.
My corpus is about 4000 DITA topics totaling about 30MB on disk. They are all in a single directory (not my decision) if that matters.
Using the "parse DTDs" option and default indexing options (no token or full text indexes) I'm finding that even with 12GB of RAM allocated to the JVM the memory usage during load will eventually go to 12GB, at which point the processing appears to stop (that is, whatever I set the max memory to, when it's reached, things stop but I only got out of memory errors when I had much lower settings, like the default 2GB).
I'm currently running a test with 14GB allocated and it is continuing, but memory does go to 12GB occasionally (watching the memory display on the Add progress panel).
No individual file is that big--the biggest is 150K and a typical file is 30K or smaller.
I wouldn't expect BaseX to have this kind of memory problem, so I'm wondering whether there's an issue with memory on macOS, or with DITA documents in particular (the DITA DTDs are notoriously large).
Should I expect BaseX to be able to load this kind of corpus with 14GB of RAM?
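For reference, here is the load I'm doing, expressed with the BaseX Java API rather than the GUI (a minimal, untested sketch; the database name and path are placeholders):

    import org.basex.core.Context;
    import org.basex.core.cmd.CreateDB;
    import org.basex.core.cmd.Set;

    public class LoadDitaCorpus {
      public static void main(String[] args) throws Exception {
        // Run with e.g. java -Xmx14g to match the GUI memory settings.
        Context ctx = new Context();
        // Equivalent of the GUI's "parse DTDs" checkbox; "false" loads
        // the same corpus without any grammar processing.
        new Set("DTD", "true").execute(ctx);
        // Default index options, i.e. no token or full-text indexes.
        new CreateDB("dita-corpus", "/path/to/topics").execute(ctx);
        ctx.close();
      }
    }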
Cheers,
E.
-- Eliot Kimber http://contrext.com
More experimentation indicates that the issue is the DTDs--if I load the same content without DTD parsing, it loads fine and takes the expected, relatively small amount of memory.
I think the solution is to turn on Xerces' grammar caching. The only danger there is that different DTDs within the same content set can have different expansions for the same external parameter entity reference (e.g., the MathML DTDs), which can then lead to validation issues. For this reason the DITA OT makes its use of the grammar cache switchable, but on by default.
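If I remember the Xerces API correctly, the simplest form is its ready-made caching parser configuration (a rough, untested sketch):

    import org.apache.xerces.parsers.SAXParser;
    import org.apache.xerces.parsers.XMLGrammarCachingConfiguration;
    import org.xml.sax.InputSource;

    public class CachingParse {
      public static void main(String[] args) throws Exception {
        // Grammars parsed by this configuration are kept in a pool and
        // reused on subsequent parses instead of being re-read per file.
        SAXParser parser = new SAXParser(new XMLGrammarCachingConfiguration());
        for(String uri : args) parser.parse(new InputSource(uri));
      }
    }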
Another option for DITA content in particular is to use the OT's preprocessing to parse all the docs and then point BaseX at the parsed docs, in which all the attributes have been expanded into the source.
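If memory serves, the OT's Java API makes that scriptable, too. A sketch, assuming DITA-OT 2.x+ where the "dita" transtype emits normalized topics (all paths are placeholders):

    import java.io.File;
    import org.dita.dost.Processor;
    import org.dita.dost.ProcessorFactory;

    public class Normalize {
      public static void main(String[] args) throws Exception {
        ProcessorFactory pf =
          ProcessorFactory.newInstance(new File("/path/to/dita-ot"));
        // The "dita" transtype writes normalized topics with the
        // DTD-defaulted attributes made explicit in the output source.
        Processor p = pf.newProcessor("dita");
        p.setInput(new File("/path/to/root.ditamap"))
         .setOutputDir(new File("/path/to/normalized"))
         .run();
      }
    }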
Cheers,
E.
-- Eliot Kimber http://contrext.com
On 5/14/18, 12:17 PM, "Christian Grün" <christian.gruen@gmail.com> wrote:
Hi Eliot,
Thanks for your observations.
> I think the solution is to turn on Xerces' grammar caching.
I’m wondering what is happening here. Did you want to say that caching is enabled by default, and that it should be possible to turn it off?
Cheers, Christian
On Mon, May 14, 2018 at 7:40 PM, Eliot Kimber <ekimber@contrext.com> wrote:
Yes, I would want caching on by default with the option to turn it off. I'm assuming it's currently not turned on (but to be honest I haven't taken the time to check the source code).
Certainly for DITA content, grammar caching is the only practical way to parse a large number of topics in the same JVM without both using lots of memory and paying the avoidable cost of re-processing the grammar files for each document.
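To make that concrete, here's roughly the pattern I have in mind (a sketch, not BaseX's actual loading code; the grammar.cache system property is hypothetical, mirroring the OT's on-by-default switch):

    import java.io.File;
    import org.apache.xerces.parsers.SAXParser;
    import org.apache.xerces.util.XMLGrammarPoolImpl;
    import org.apache.xerces.xni.grammars.XMLGrammarPool;
    import org.xml.sax.InputSource;

    public class TopicParser {
      public static void main(String[] args) throws Exception {
        // Hypothetical switch: caching on by default, off on request.
        boolean cache = !"false".equals(System.getProperty("grammar.cache"));
        XMLGrammarPool pool = cache ? new XMLGrammarPoolImpl() : null;
        for(File topic : new File(args[0])
            .listFiles((dir, name) -> name.endsWith(".dita"))) {
          SAXParser parser = new SAXParser();
          if(pool != null) {
            // One shared pool: the DITA grammars are compiled once for
            // the whole run instead of once per topic.
            parser.setProperty(
              "http://apache.org/xml/properties/internal/grammar-pool", pool);
          }
          parser.parse(new InputSource(topic.toURI().toString()));
        }
      }
    }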
DITA is probably somewhat unique in this regard because it takes such a different approach to grammar organization and use from pretty much any other XML application.
Cheers,
E.
-- Eliot Kimber http://contrext.com
On 5/14/18, 12:45 PM, "Christian Grün" <christian.gruen@gmail.com> wrote:
I would have expected some MBs to be sufficient for parsing even complex DTDs if nothing is cached (but caching could definitely speed up processing), so maybe there’s still something that we could have a look at. If you are interested, feel free to provide me with your files via a private message.
On 5/14/18, 12:53 PM, "Eliot Kimber" <ekimber@contrext.com> wrote:
Yes, I wouldn't expect the grammars to chew up gigabytes. I'll provide a test data set for you.
Cheers,
E.
-- Eliot Kimber http://contrext.com
On Mon, May 14, 2018 at 8:40 PM, Eliot Kimber <ekimber@contrext.com> wrote:
Hmm.
While testing my test data set, I can't reproduce the earlier behavior.
In my current tests, using the same data and the same BaseX version, I see a maximum of maybe 1GB for the largest file, but just a few hundred MBs once everything is loaded.
For 3800 topics of roughly 50K each on average, loading takes just a couple of seconds without DTDs and a minute or so with DTDs, which is consistent with the time cost of reparsing the (large) DITA grammars for each topic.
So I'm not sure what was happening when I tried this before, but I have rebooted and installed macOS updates since then, so it could have been some Java issue or who knows what else.
The good news is that even without grammar caching the DITA topics do load in a reasonable (if not ideal) amount of time and with appropriate memory usage.
Cheers,
E.
-- Eliot Kimber http://contrext.com
Good to know; I’ll record this as positive news ;) Feel free to give me an update if you encounter similar behavior again.