By giving 6G of RAM to the JVM I succeeded in building the full-text index, but it doesn't seem to be making any difference in query time.

I have a slightly older copy of the data that is probably a hundred or so records smaller than the one that is indexed for full text, and my query takes about 40s on each one, so the FTINDEX seems to make no difference. I'm not an old XQuery hand, so it's altogether possible that my queries are not optimal. I'll append my query below.

Using the GUI, I can see that the value of FTINDEX for this database is true, though when I open the database with the 'basex' command and use INFO, it shows the value 'false'.
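
One way to cross-check the two reports from inside a query, as a rough sketch: db:info returns the database properties, and ft:search goes straight to the full-text index, so it should raise an error rather than return a count if the index is not actually there. The database name 'pure_20151019' is an assumption taken from the CREATE DB call quoted further down in this thread.

```xquery
xquery version "3.0";

(: Assumption: the database is named as in the CREATE DB call in this thread. :)
let $db := 'pure_20151019'
return (
  (: Database properties, including the index flags. :)
  db:info($db),
  (: Number of text nodes the full-text index returns for the term. :)
  count(ft:search($db, 'Meric'))
)
```

If the count comes back quickly, the index exists and the slow query is probably not being rewritten to use it.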

Query:

=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=
xquery version "3.0";

declare namespace publication-template = "http://atira.dk/schemas/pure4/wsdl/template/abstractpublication/current";
declare namespace core="http://atira.dk/schemas/pure4/model/core/current" ;
declare namespace xsi="http://www.w3.org/2001/XMLSchema-instance" ;
declare namespace cur="http://atira.dk/schemas/pure4/model/template/abstractpublication/current" ;
declare namespace extensions-core="http://atira.dk/schemas/pure4/model/core/extensions/current" ;
declare namespace person-template="http://atira.dk/schemas/pure4/model/template/abstractperson/current" ;
declare namespace externalperson-template="http://atira.dk/schemas/pure4/model/template/abstractexternalperson/current" ;
declare namespace externalorganisation-template="http://atira.dk/schemas/pure4/model/template/externalorganisation/current" ;
declare namespace organisation-template="http://atira.dk/schemas/pure4/model/template/abstractorganisation/current" ;
declare namespace journal-template="http://atira.dk/schemas/pure4/model/template/abstractjournal/current";
declare namespace cur1 = "http://atira.dk/schemas/pure4/model/template/abstractpublication/current";

for $pa in /publication-template:*/core:result/core:content/cur1:persons/person-template:personAssociation[person-template:externalperson]
    where $pa/person-template:externalperson/externalperson-template:name/core:lastName contains text {'Meric'}
    let $lname := $pa/person-template:name/core:lastName/text()
    let $fname := $pa/person-template:name/core:firstName/text()
    let $uuid := $pa/ancestor::core:content/@uuid/data()
    return ($lname, $fname, $uuid)
=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=
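
A possible variant of the query above, offered as a sketch rather than a known fix: the full-text test moves from the where clause into a predicate on the path and is applied to the text node of core:lastName. The paths and namespaces are copied from the original query, and the redundant declarations are left out. Whether BaseX rewrites this form for index access is an assumption that the query info output would have to confirm. Testing text() instead of the element could also behave differently if core:lastName ever held mixed content.

```xquery
xquery version "3.0";

declare namespace publication-template = "http://atira.dk/schemas/pure4/wsdl/template/abstractpublication/current";
declare namespace core = "http://atira.dk/schemas/pure4/model/core/current";
declare namespace cur1 = "http://atira.dk/schemas/pure4/model/template/abstractpublication/current";
declare namespace person-template = "http://atira.dk/schemas/pure4/model/template/abstractperson/current";
declare namespace externalperson-template = "http://atira.dk/schemas/pure4/model/template/abstractexternalperson/current";

(: The full-text predicate sits directly on the path and also covers the
   original [person-template:externalperson] existence test. :)
for $pa in /publication-template:*/core:result/core:content/cur1:persons
    /person-template:personAssociation[
      person-template:externalperson/externalperson-template:name
        /core:lastName/text() contains text 'Meric'
    ]
let $lname := $pa/person-template:name/core:lastName/text()
let $fname := $pa/person-template:name/core:firstName/text()
let $uuid := $pa/ancestor::core:content/@uuid/data()
return ($lname, $fname, $uuid)
```

As in the original, this has to run with the database opened as context.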

All suggestions welcome, and thanks to Christian & John Mitchell for helping me so far.

Chuck

On Tue, Oct 20, 2015 at 2:16 PM, Chuck Bearden <cfbearden@gmail.com> wrote:

Sorry for not using "Reply All" earlier.

Setting FTINDEXSPLITSIZE to 20000000 enabled the process to get a little further, if the meaning of each dot is the same. FTINDEXSPLITSIZE at default:

..............................|..................................................................|..........................................................................|...............................................................................|..

FTINDEXSPLITSIZE = 20000000

.......|.......|........|.......|......|........|.............|.............|.............|.............|.............|.............|.............|.............|..............|.............|.............|.............|.............|.............|.............|............

If it's a matter of making the indexing process take longer, that's not a problem.

Thanks,
Chuck

On Tue, Oct 20, 2015 at 1:27 PM, Chuck Bearden <cfbearden@gmail.com> wrote:
Thanks Christian, I'll try the FTINDEXSPLITSIZE option.

I'm also open to modifying the XML files if that would help. Because
of limitations of the service from which we harvest them RESTfully, I
have only 20 actual content elements in each file. If you think it
would make a difference, I could consolidate them to have, say, 200 or
500 of the actual content elements per file, but I have no idea if
that would change how the indexing falls out.

The files also have structures where some properties of each record
are each represented by a URL, an ID value, and a string. I could
XSLT the files to remove all but the string (human readable is better
for our purposes) to make them less verbose.

BaseX is really super for doing data quality assessments of the XML,
and if we could get full-text indexing working, it would speed things
up considerably. Thanks to you & your team for all the work you've put
in to the application!

Alles Gute
Chuck Bearden

On Tue, Oct 20, 2015 at 12:55 PM, Christian Grün
<christian.gruen@gmail.com> wrote:
> I see; it seems that the index creation is failing at the very final
> step, in which partial index structures, which are temporarily written
> to disk, are merged.
>
> You could either increase Xmx even more (to 6 or 7G?) or, if this
> doesn't work, try assigning different values to the
> FTINDEXSPLITSIZE option [1] (start e.g. with 20000000).
>
> Sorry for the trouble. Feel free to keep me updated, maybe we find a
> way to fix this,
> Christian
>
> [1] http://docs.basex.org/wiki/Options#FTINDEXSPLITSIZE
>
>
> On Tue, Oct 20, 2015 at 7:48 PM, Chuck Bearden <cfbearden@gmail.com> wrote:
>> Here's the stack trace:
>>
>> =.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=
>> > create db pure_20151019 pure_20151019
>> Creating Database...
>> ..;..;..;..;..;..;.;..;..;..;..;..;.;..;.;.;.;.....;.....;.....;......;.....;.....;.......;.;.;;.;.;;.;.;;.;.;;.;.;;.;.;;.;.................................................;..........................................................;..........................................................;..........................................................;..........................................................;..........................................................;...................................................
>> 677584.62 ms (1435 MB)
>> Indexing Text...
>> ...........................................................................................................................................................................................................................................................
>> 98215794 operations, 178526.99 ms (1611 MB)
>> Indexing Attribute Values...
>> ...........................................................................................................................................................................................................................................................
>> 178304119 operations, 135613.26 ms (2005 MB)
>> Indexing Full-Text...
>> ..............................|..................................................................|..........................................................................|...............................................................................|..java.lang.OutOfMemoryError:
>> Java heap space
>> at org.basex.index.ft.FTList.next(FTList.java:93)
>> at org.basex.index.ft.FTBuilder.merge(FTBuilder.java:236)
>> at org.basex.index.ft.FTBuilder.write(FTBuilder.java:140)
>> at org.basex.index.ft.FTBuilder.build(FTBuilder.java:85)
>> at org.basex.index.ft.FTBuilder.build(FTBuilder.java:23)
>> at org.basex.data.DiskData.createIndex(DiskData.java:187)
>> at org.basex.core.cmd.CreateIndex.create(CreateIndex.java:103)
>> at org.basex.core.cmd.CreateIndex.create(CreateIndex.java:91)
>> at org.basex.core.cmd.CreateDB.run(CreateDB.java:104)
>> at org.basex.core.Command.run(Command.java:398)
>> at org.basex.core.Command.execute(Command.java:100)
>> at org.basex.api.client.LocalSession.execute(LocalSession.java:132)
>> at org.basex.api.client.Session.execute(Session.java:36)
>> at org.basex.core.CLI.execute(CLI.java:103)
>> at org.basex.core.CLI.execute(CLI.java:87)
>> at org.basex.BaseX.console(BaseX.java:191)
>> at org.basex.BaseX.<init>(BaseX.java:166)
>> at org.basex.BaseX.main(BaseX.java:42)
>> org.basex.core.BaseXException: Out of Main Memory.
>> You can try to:
>> - increase Java's heap size with the flag -Xmx<size>
>> - deactivate the text and attribute indexes.
>> at org.basex.core.Command.execute(Command.java:101)
>> at org.basex.api.client.LocalSession.execute(LocalSession.java:132)
>> at org.basex.api.client.Session.execute(Session.java:36)
>> at org.basex.core.CLI.execute(CLI.java:103)
>> at org.basex.core.CLI.execute(CLI.java:87)
>> at org.basex.BaseX.console(BaseX.java:191)
>> at org.basex.BaseX.<init>(BaseX.java:166)
>> at org.basex.BaseX.main(BaseX.java:42)
>> Out of Main Memory.
>> You can try to:
>> - increase Java's heap size with the flag -Xmx<size>
>> - deactivate the text and attribute indexes.
>>> d
>> =.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=
>>
>> Here's how the process looked in the output of 'ps -ef', in case
>> that's relevant:
>>
>> =.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=.=
>> cfbeard+ 88769 88757 46 12:15 pts/7 00:00:24 java -cp
>> /home/cfbearden/opt/basex-8.3.0/BaseX.jar:/home/cfbearden/opt/basex-8.3.0/lib/basex-api-8.3.jar:/home/cfbearden/opt/basex-8.3.0/lib/basex-xqj-1.5.0.jar:/home/cfbearden/opt/basex-8.3.0/lib/commons-codec-1.4.jar:/home/cfbearden/opt/basex-8.3.0/lib/commons-fileupload-1.3.1.jar:/home/cfbearden/opt/basex-8.3.0/lib/commons-io-1.4.jar:/home/cfbearden/opt/basex-8.3.0/lib/igo-0.4.3.jar:/home/cfbearden/opt/basex-8.3.0/lib/jansi-1.11.jar:/home/cfbearden/opt/basex-8.3.0/lib/javax.servlet-3.0.0.v201112011016.jar:/home/cfbearden/opt/basex-8.3.0/lib/jdom-1.1.jar:/home/cfbearden/opt/basex-8.3.0/lib/jetty-continuation-8.1.17.v20150415.jar:/home/cfbearden/opt/basex-8.3.0/lib/jetty-http-8.1.17.v20150415.jar:/home/cfbearden/opt/basex-8.3.0/lib/jetty-io-8.1.17.v20150415.jar:/home/cfbearden/opt/basex-8.3.0/lib/jetty-security-8.1.17.v20150415.jar:/home/cfbearden/opt/basex-8.3.0/lib/jetty-server-8.1.17.v20150415.jar:/home/cfbearden/opt/basex-8.3.0/lib/jetty-servlet-8.1.17.v20150415.jar:/home/cfbearden/opt/basex-8.3.0/lib/jetty-util-8.1.17.v20150415.jar:/home/cfbearden/opt/basex-8.3.0/lib/jetty-webapp-8.1.17.v20150415.jar:/home/cfbearden/opt/basex-8.3.0/lib/jetty-xml-8.1.17.v20150415.jar:/home/cfbearden/opt/basex-8.3.0/lib/jing-20091111.jar:/home/cfbearden/opt/basex-8.3.0/lib/jline-2.13.jar:/home/cfbearden/opt/basex-8.3.0/lib/jts-1.13.jar:/home/cfbearden/opt/basex-8.3.0/lib/lucene-stemmers-3.4.0.jar:/home/cfbearden/opt/basex-8.3.0/lib/milton-api-1.8.1.4.jar:/home/cfbearden/opt/basex-8.3.0/lib/mime-util-2.1.3.jar:/home/cfbearden/opt/basex-8.3.0/lib/slf4j-api-1.7.12.jar:/home/cfbearden/opt/basex-8.3.0/lib/slf4j-simple-1.7.12.jar:/home/cfbearden/opt/basex-8.3.0/lib/tagsoup-1.2.1.jar:/home/cfbearden/opt/basex-8.3.0/lib/xmldb-api-1.0.jar:/home/cfbearden/opt/basex-8.3.0/lib/xml-resolver-1.2.jar:/home/cfbearden/opt/basex-8.3.0/lib/xqj2-0.2.0.jar:/home/cfbearden/opt/basex-8.3.0/lib/xqj-api-1.0.jar:
>> -Xmx4g org.basex.BaseX -d
>>
>>
>> On Tue, Oct 20, 2015 at 12:38 PM, Chuck Bearden <cfbearden@gmail.com> wrote:
>>> It hasn't failed yet; I've gotten the progress indicators, along with
>>> the phases that have been completed:
>>>
>>> Creating Database...
>>> Indexing Text...
>>> Indexing Attribute Values...
>>>
>>> It's still working on "Indexing Full-Text...". I'll post whatever I
>>> get when it fails. Maybe it won't this time :)
>>>
>>> Chuck
>>>
>>> On Tue, Oct 20, 2015 at 12:33 PM, Christian Grün
>>> <christian.gruen@gmail.com> wrote:
>>>>> Creating Database...
>>>>> ..;..;..;..;..;..;.;..;..
>>>>
>>>> Do you get any output after this line (I would expect to see a stack
>>>> trace, or at least an error message…)?
>>>>
>>>>
>>>>
>>>>> Where 'pure_20151019' is both the name of the database and the
>>>>> subdirectory where all my XML files are.
>>>>>
>>>>> It could well be that I'm missing a crucial option; I'm still
>>>>> relatively new to BaseX. It's great stuff, though.
>>>>>
>>>>> Because of my employer's IT environment, I have to run my Linux
>>>>> workstation in a VMWare VM, though I doubt that that makes a
>>>>> difference.
>>>>>
>>>>> Thanks,
>>>>> Chuck
>>>>>
>>>>> On Tue, Oct 20, 2015 at 11:15 AM, Christian Grün
>>>>> <christian.gruen@gmail.com> wrote:
>>>>>> Hi Chuck,
>>>>>>
>>>>>> Usually, 4G is more than enough to create a full-text index for 16G of
>>>>>> XML. Obviously, however, that's not the case for your input data. You
>>>>>> could try to distribute your documents across multiple databases. As an
>>>>>> alternative, we could have a look at your data and try to find out
>>>>>> what's going wrong. You can also use the -d flag and send us the stack
>>>>>> trace.
>>>>>>
>>>>>> Best,
>>>>>> Christian
>>>>>>
>>>>>>
>>>>>> On Tue, Oct 20, 2015 at 4:19 PM, Chuck Bearden <cfbearden@gmail.com> wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I have about 16G of XML data in about 52000 files, and I was hoping to
>>>>>>> build a full-text index over it. I've tried two approaches: enabling
>>>>>>> full-text indexing as I create the database and then loading the data,
>>>>>>> and creating the full-text index after loading the data. If I enable
>>>>>>> ADDCACHE and modify the basex shell script to use 4g of RAM instead of
>>>>>>> 512M, I have no problem loading the data. If I try to load with
>>>>>>> FTINDEX or create the index afterward, the process runs out of memory.
>>>>>>>
>>>>>>> I could believe that I'm overlooking some option that would make this
>>>>>>> possible, but I suspect I just have too much data. I welcome your
>>>>>>> thoughts & suggestions.
>>>>>>>
>>>>>>> All the best,
>>>>>>> Chuck Bearden