Hi Agnieszka,
This looks like a classic XLE problem. I face similar problems when parsing Wolof data (real sentences of a length similar to yours).
@Jani: By the way, I think this is a good topic to discuss in the XLE forum.
My suggestions would be the following:
1) Try to eliminate all kinds of ambiguity as much as possible, in particular spurious ambiguities.
XLE provides some tools that help to inspect spurious ambiguities; e.g. the XLE command print-ambiguity-sources may be particularly useful.
A related trick that helps a lot is to check for non-exclusive disjunctions and to define the disjunctions in the grammar (rules, templates, etc.) so that they are clearly mutually exclusive
(use CHECK features if necessary), for instance as sketched below.
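To make this concrete, here is a made-up schematic (not from any real grammar): a disjunction like

   { (^ CASE) = nom | (^ NUM) = sg }

is not exclusive, since a nominative singular noun satisfies both disjuncts and therefore gets two otherwise identical analyses. Restating the second disjunct so that it excludes the first removes this spurious ambiguity:

   { (^ CASE) = nom | (^ CASE) ~= nom (^ NUM) = sg }

To track down such sources, you can parse a sentence in the XLE shell and then ask where the ambiguity comes from, roughly (the sentence is just a placeholder):

   parse {Piotr lubi Marię .}
   print-ambiguity-sources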
2) A second possibility would be to put XLE into skimming mode (see the XLE documentation).
You could then "play" with the skimming variables start_skimming_when_scratch_storage_exceeds, start_skimming_when_total_events_exceed,
max_new_events_per_graph_when_skimming and skimming_nogoods. Also, check whether the value you are using for max_xle_scratch_storage might be a bit too high.
As stated in the documentation, in this mode, XLE does a bounded amount of work per subtree.
This guarantees that it will finish processing the sentence in a polynomial amount of time.
Although this does not guarantee that the sentences will be parsed correctly, it may help you to get at least some output and to reduce the number of timeouts.
In my experience, the main problem with skimming is finding the best values for the variables mentioned above.
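For concreteness, here is a minimal sketch of what this could look like in the xlerc; all the numbers are invented and only meant as a starting point for tuning:

   set max_xle_scratch_storage 1024                      ;# lower hard limit than 4096
   set start_skimming_when_scratch_storage_exceeds 512   ;# start skimming well before the limit
   set start_skimming_when_total_events_exceed 100000    ;# invented value, tune on your data
   set max_new_events_per_graph_when_skimming 500        ;# bounds the work done per subtree

The idea is to have XLE start skimming comfortably before it reaches the hard memory limit, so that it degrades gracefully instead of running out of memory or time.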
3) A third option is to use the optimality mark STOPPOINT.
With this mark, you can design your grammar so that it will consider expensive and rare constructions only when no other analysis is available.
You can combine this option with skimming: for instance, you can put the OT marks denoting expensive and rare constructions into the variable skimming_nogoods (see the sketch below).
This can help reduce the timeouts and speed up the parser.
Here, too, there is the drawback that you might not get the analysis you're looking for simply because STOPPOINT has blocked it.
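Schematically, and with an invented mark name (check the XLE documentation for the exact ranking conventions), the mark is ranked behind STOPPOINT in the grammar's config, with any existing marks preceding it:

   OPTIMALITYORDER STOPPOINT Rare.

the expensive and rare constructions are annotated with the mark, e.g. via the usual OT-MARK template (which simply does Rare $ o::*):

   @(OT-MARK Rare)

and, for the combination with skimming, the same mark goes into the skimming variable in the xlerc:

   set skimming_nogoods {Rare}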
I hope this helps.
Best
Bamba
On Oct 8, 2013, at 11:01 AM, Agnieszka Patejuk wrote:
Dear all,
We've been developing POLFIE, an LFG grammar of Polish, at ICS PAS for
over 3 years. We'd like to concentrate on extending the empirical
scope of the grammar, but we're having performance issues which affect
the results quite badly.
Currently we're testing the grammar on sentences extracted from
Składnica, a treebank of Polish. There are 8333 sentences, average
sentence length is 10 (all segments are counted, including
punctuation).
How are these sentences parsed?
– an individual dictionary is created for each sentence (so the words
are already disambiguated morphosyntactically)
– each sentence is parsed on its own, in one XLE run.
The following performance variables are used when parsing:
– 100 seconds (set timeout 100)
– 4096 MB memory (set max_xle_scratch_storage 4096).
Current results (out of 8333 sentences):
– parsed: 6926
– failed: 154
– out of memory: 11
– timeout: 1228
– unknown error: 14
Almost 15% of sentences are timed out, which is very worrying. The
average length of a parsed sentence is almost 9 (8.74), while the
average length of a timed out sentence is almost 19 (18.67).
Have you had similar problems? Are you parsing real sentences? How
long are your sentences?
Do you have any suggestions as to what we could do to reduce the
number of timed-out sentences?
Best,
Agnieszka
_______________________________________________
ParGram mailing list
ParGram@mailman.uni-konstanz.de
https://mailman.uni-konstanz.de/mailman/listinfo/pargram