bug report in basex 6.2.3 fulltext

Sandra Maria Silcot

25 Aug 2010

8:27 a.m.

Hi all,

The following query works:

...

XQUERY //b4a[shipNameNorm/text() contains text 'mount.+' using wildcards]/shipNameNorm
<shipNameNorm>mountstewartelphinstone</shipNameNorm>
<shipNameNorm>mountstewartelphinstone</shipNameNorm>
<shipNameNorm>mountstewartelphinstone</shipNameNorm>
Query executed in 22.31 ms.

But with a space in the same query, the server just hangs, with Java using up all the CPU, and after a long time I get an out-of-memory error:

21:24:01.510 [127.0.0.1:35052] XQUERY //b4a[shipNameNorm/text() contains text 'mount .+' using wildcards]/shipNameNorm
Error: Out of Main Memory.
The following hints might help you:
- increase Java's heap size with the flag -Xmx<size>
- choose the internal XML parser in the GUI or via 'set intparse on'
- deactivate the text and attribute indexes
18607.03 ms

To double check, I have recreated the db and reindexed (as it had been created under an earlier version). This time I don't get an out of memory error, but this:

...

XQUERY //b4a[shipNameNorm/text() contains text 'mount .+' using wildcards]/shipNameNorm
[XQST0054] Circular variable definition?

So this still appears to me to be a bug. The db is about 1.3 GB, with 26 documents and 48959283 nodes. Indexes:
- Path Summary: ON
- Text Index: ON
- Attribute Index: ON
- Full-Text Index: ON (wildcards)

Many thanks for any guidance.

Cheers, Sandra

Leonard Wörteler

25 Aug 2010

10:04 a.m.

Hi Sandra,

On 25.08.2010 14:27, Sandra Maria Silcot wrote:

...

21:24:01.510 [127.0.0.1:35052] XQUERY //b4a[shipNameNorm/text() contains text 'mount .+' using wildcards]/shipNameNorm
Error: Out of Main Memory.
The following hints might help you:
- increase Java's heap size with the flag -Xmx<size>
- choose the internal XML parser in the GUI or via 'set intparse on'
- deactivate the text and attribute indexes
18607.03 ms

This sounds like a bug in the wildcards-supporting trie-based index...

...

To double check, I have recreated the db and reindexed (as it had been created under an earlier version). This time I don't get an out of memory error, but this:

...
XQUERY //b4a[shipNameNorm/text() contains text 'mount .+' using wildcards]/shipNameNorm
[XQST0054] Circular variable definition?

This only means that the stack is now full rather than the heap. Can you reproduce the error? If so, could you provide us with the stack trace? As you seem to be using the client/server architecture, you will probably have to use the local API to get it:

...

import org.basex.core.Context;
import org.basex.core.cmd.Close;
import org.basex.core.cmd.Open;
import org.basex.core.cmd.XQuery;

public class WildcardsBug {

  static final String DB_NAME = ...;

  public static void main(final String[] args) throws Exception {
    final Context ctx = new Context();
    new Open(DB_NAME).execute(ctx);
    System.out.println(new XQuery("//b4a[shipNameNorm/text() "
        + "contains text 'mount .+' using wildcards]/shipNameNorm"
    ).execute(ctx));
    new Close().execute(ctx);
  }

}

A small, executable example would be even better, but as the DB size suggests, it's probably not trivial to create one.
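The stack-versus-heap distinction mentioned above can be seen in plain Java, independent of BaseX. The following is only a minimal sketch and not BaseX code; the class name StackVsHeap and its methods are made up for illustration. Unbounded recursion exhausts the thread stack and raises a StackOverflowError, which is a different error from the OutOfMemoryError raised when the heap is full.

```java
public class StackVsHeap {

  // Recurses without a base case until the thread stack is exhausted.
  static int recurse(final int depth) {
    return recurse(depth + 1) + 1;
  }

  // Reports which resource ran out when the unbounded recursion was started.
  static String overflowKind() {
    try {
      recurse(0);
      return "none";
    } catch (final StackOverflowError e) {
      // Too many nested calls: the stack is full, the heap is untouched.
      return "stack";
    } catch (final OutOfMemoryError e) {
      // Too many or too large objects: the heap is full.
      return "heap";
    }
  }

  public static void main(final String[] args) {
    System.out.println("Unbounded recursion exhausted the " + overflowKind());
  }
}
```

How BaseX maps such an error to [XQST0054] internally is not shown here; the sketch only illustrates why the two failures are distinct.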

Thank you for reporting this bug, and thanks in advance for helping us fix it.

Cheers
Leo

Leonard Wörteler

10:56 a.m.

Hi again,

On 25.08.2010 16:04, Leonard Wörteler wrote:

...

As you seem to use the client/server architecture you probably have to use the local API to get it:

...
import org.basex.core.Context;
import org.basex.core.cmd.Close;
import org.basex.core.cmd.Open;
import org.basex.core.cmd.XQuery;

public class WildcardsBug {

  static final String DB_NAME = ...;

  public static void main(final String[] args) throws Exception {
    final Context ctx = new Context();
    new Open(DB_NAME).execute(ctx);
    System.out.println(new XQuery("//b4a[shipNameNorm/text() "
        + "contains text 'mount .+' using wildcards]/shipNameNorm"
    ).execute(ctx));
    new Close().execute(ctx);
  }

}

...as my code could also swallow the stack trace, please try this instead:

...

import org.basex.core.Context;
import org.basex.core.cmd.Close;
import org.basex.core.cmd.Open;
import org.basex.query.QueryProcessor;

public class WildcardsBug {

  static final String DB_NAME = ...;

  public static void main(final String[] args) throws Exception {
    final Context ctx = new Context();
    new Open(DB_NAME).execute(ctx);
    new QueryProcessor("//b4a[shipNameNorm/text() contains text "
        + "'mount .+' using wildcards]/shipNameNorm", ctx).execute();
    new Close().execute(ctx);
  }

}

Sorry for the inconvenience...

"All problems in computer science can be solved by another level of indirection" -- David Wheeler
"...except for the problem of too many layers of indirection." -- Kevlin Henney

Leo

Dave Glick

11:43 a.m.

New subject: Getting Information on Each XQuery Full Text Result

Hello,

I'm just starting to play with the full text support and am considering replacing some legacy text searching capabilities with queries. In order to do so however, I need a few features that I haven't been able to find by looking at the specs. Is it possible to return the absolute text offset in characters (either from the start of the document or the start of the result node) for each match? Along with that is it possible to return every match, even if it results in returning the same node more than once (if it has more than one occurrence for example)?

Even if XQuery Full Text doesn't work for this particular need, it's still very cool and I really like the implementation and look forward to finding other uses.

Thanks,

Dave

Leonard Wörteler

1:26 p.m.

New subject: Getting Information on Each XQuery Full Text Result

Hi Dave,

On 25.08.2010 17:43, Dave Glick wrote:

...

Is it possible to return the absolute text offset in characters (either from the start of the document or the start of the result node) for each match? Along with that is it possible to return every match, even if it results in returning the same node more than once (if it has more than one occurrence for example)?

Well, I don't think there's an *official* way to get the full-text positions out of BaseX so far. They are only used in the query process and in the GUI, for highlighting the matches.

But if I got your previous mails right, you know your way around the BaseX codebase pretty well, so here's the Cheater's Guide:

The GUI gets the positions by setting the hidden static property org.basex.core.Prop.gui to true. That lets the FTPosData object be propagated to the resulting Nodes. It's accessible via Nodes.ftpos. The interface isn't as nice as it could be, but you get everything you asked for.

Please note that serializing the result will yield control characters used for highlighting in the GUI. This can be avoided by discarding the full-text position as in

...

final Nodes res = qp.queryNodes();
final Nodes copy = new Nodes(res.nodes, res.data);

and serializing the copy after that.

As these are internals of BaseX, the solution described above may stop working at any time in the future. When we find the time we will implement a simpler interface for this, but we're not really short of ToDos...
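Independently of these internals, the character offset of every occurrence can also be computed in plain Java once the text of a result node is available as a string. The sketch below does only that. It uses java.util.regex rather than any BaseX API, it assumes case-insensitive whole-word matching (which will not agree with the full-text tokenizer in every case, for example with stemming or diacritics), and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchOffsets {

  // Returns the start offset (in chars, relative to the start of the text)
  // of every whole-word occurrence of the given word.
  static List<Integer> offsets(final String text, final String word) {
    final Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b",
        Pattern.CASE_INSENSITIVE);
    final Matcher m = p.matcher(text);
    final List<Integer> result = new ArrayList<>();
    // One entry per occurrence, so a node with two hits yields two offsets.
    while (m.find()) result.add(m.start());
    return result;
  }

  public static void main(final String[] args) {
    final String text = "Mount Stewart sailed; the mount was steep.";
    System.out.println(offsets(text, "mount"));
  }
}
```

Because every occurrence produces its own entry, the same node can be reported as often as it matches.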

I hope this helps you in any way...

...

Even if XQuery Full Text doesn't work for this particular need, it's still very cool and I really like the implementation and look forward to finding other uses.

That's nice to hear!

Cheers Leo

Christian Grün

1:27 p.m.

An addition to Leo's answer (thanks anyway):

The search string 'mount .+' will trigger two keyword searches for "mount" and ".+". As the second search is very expensive (it will return the complete index), it's most likely the reason for the exception.
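The effect can be illustrated without BaseX. The sketch below is not the BaseX implementation: it assumes that the search string is split into keywords at whitespace, and it uses java.util.regex over a small list in place of the trie-based index, relying on the fact that for the two patterns in this thread ('mount.+' and '.+') regular-expression semantics agree with the wildcard semantics. The class name, the term list and the helper names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class WildcardCost {

  // Hypothetical stand-in for the distinct terms stored in a full-text index.
  static final List<String> TERMS = Arrays.asList(
      "mount", "mountstewartelphinstone", "ship", "stewart", "elphinstone");

  // Assumption: the search string is split into keywords at whitespace.
  static List<String> keywords(final String search) {
    return Arrays.asList(search.trim().split("\\s+"));
  }

  // Returns all terms that a single wildcard keyword matches completely.
  static List<String> matches(final String keyword) {
    final Pattern p = Pattern.compile(keyword);
    final List<String> hits = new ArrayList<>();
    for (final String term : TERMS) {
      if (p.matcher(term).matches()) hits.add(term);
    }
    return hits;
  }

  public static void main(final String[] args) {
    for (final String search : new String[] { "mount.+", "mount .+" }) {
      System.out.println("Search string: '" + search + "'");
      for (final String kw : keywords(search)) {
        System.out.println("  keyword '" + kw + "' matches " + matches(kw));
      }
    }
  }
}
```

Without the space there is one keyword that matches a single term. With the space there are two keywords, and the second one matches every term in the list, which on a 1.3 GB database means the whole index.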

To find all texts that have "mount" as word, followed by another word, the query could be rewritten as

...[text() contains text 'exercise'][text() contains text ftnot 'exercise' at end]

If ".+" is used as only search term, index access will be skipped, and sequential execution will be chosen. I've now rewritten the code to skip index access whenever a single keyword starts with a dot.
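A rule of this kind might look roughly as follows. This is a guess at the shape of the check, not the actual BaseX code: the class and method names are invented, whitespace tokenization is assumed, and "a single keyword" is read here as "any one of the keywords".

```java
public class IndexChoice {

  // Returns true if an index lookup looks worthwhile for the search string.
  // Assumption: a keyword starting with '.' would match (nearly) every
  // index entry, so a sequential scan is preferred in that case.
  static boolean useIndex(final String search) {
    for (final String keyword : search.trim().split("\\s+")) {
      if (keyword.startsWith(".")) return false;
    }
    return true;
  }

  public static void main(final String[] args) {
    for (final String s : new String[] { "mount.+", "mount .+", ".+" }) {
      System.out.println("'" + s + "' -> "
          + (useIndex(s) ? "index access" : "sequential scan"));
    }
  }
}
```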

Hope this helps, Christian

Sandra Maria Silcot

29 Aug 2010

9:45 p.m.

Christian,

Thanks for that information; it makes sense now. I have rewritten my query accordingly.

Best wishes,

Sandra

ps: Thanks to Leo also.

basex-talk@mailman.uni-konstanz.de
