Hi Agnieszka,
This looks like a classic XLE problem. I face similar problems when
parsing Wolof data (real sentences of a similar length to yours).
@Jani: By the way, I think this is a good topic to discuss in the XLE
forum.
My suggestions would be the following:
1) Try to eliminate all kinds of ambiguity as far as possible, in
particular spurious ambiguities. XLE provides some tools that help to
inspect spurious ambiguities; e.g. the xle command
print-ambiguity-sources may be particularly useful. A related trick
that helps a lot is to check for non-exclusive disjunctions and to
define the disjunctions in the grammar (rules, templates, etc.) so
that they are clearly mutually exclusive (use CHECK features if
necessary); a toy example follows below.
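To make the disjunction point concrete, here is a minimal invented
illustration (the feature names are not from any actual grammar). A
non-exclusive disjunction such as

   { (^ TENSE)              "some tense is specified"
   | (^ TENSE) = pres }     "the tense is present"

is satisfied twice by every present-tense verb, so XLE derives the
same f-structure two times; after parsing a sentence,
print-ambiguity-sources will point you at places like this. Restated
so that the disjuncts exclude each other, the spurious ambiguity
disappears:

   { (^ TENSE)
     ~[(^ TENSE) = pres]    "any tense except present"
   | (^ TENSE) = pres }

When the exclusivity has to be coordinated across rules or templates,
the usual device is a CHECK feature: (^ CHECK _RARE) = + where the
construction is licensed, and (^ CHECK _RARE) =c + where it is
required.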
2) A second possibility would be to put XLE into skimming mode (see
the XLE documentation). You could then "play" with the skimming
variables start_skimming_when_scratch_storage_exceeds,
start_skimming_when_total_events_exceed,
max_new_events_per_graph_when_skimming and skimming_nogoods. Also
check whether the value you are using for max_xle_scratch_storage is
not a bit too high.
As stated in the documentation, in this mode XLE does a bounded
amount of work per subtree, which guarantees that it will finish
processing the sentence in a polynomial amount of time. Although this
does not guarantee that the sentences are parsed correctly, it may
help you to get at least some output and to reduce the number of
timeouts. In my experience, the main problem with skimming is finding
the best values for the variables mentioned above; some example
settings follow below.
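For concreteness, these settings go into your xlerc file. The numbers
below are only placeholders that you will have to tune on your own
data (every grammar behaves differently here), not recommendations:

   # hard memory cap (in MB); if this is very high, XLE can grind
   # for a long time before giving up, so keep it moderate
   set max_xle_scratch_storage 1000
   # start skimming well before the hard cap is reached
   set start_skimming_when_scratch_storage_exceeds 500
   set start_skimming_when_total_events_exceed 100000
   # bound the work done per subtree once skimming has started
   set max_new_events_per_graph_when_skimming 500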
3) A third option is to use the optimality mark STOPPOINT.
With this mark, you can design your grammar so that expensive and
rare constructions are considered only when no other analysis is
available. You can combine this option with skimming: for instance,
you can put the OT marks denoting expensive and rare constructions
in the variable skimming_nogoods. This can help reduce the timeouts
and speed up the parser (see the sketch below).
Here, too, there is the drawback that you might not get the analysis
you're looking for, simply because the STOPPOINT has blocked it.
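A rough sketch of how this looks; RareConstr and the toy VP rule are
invented for illustration. You rank the mark after the STOPPOINT
keyword in the grammar's configuration, attach it to the expensive
disjunct via the usual OT-MARK template, and, if you combine it with
skimming, add it to skimming_nogoods in your xlerc:

   "in the grammar's CONFIG section:"
   OPTIMALITYORDER STOPPOINT RareConstr.

   "the usual template for introducing an OT mark:"
   OT-MARK(_mark) = _mark $ o::*.

   "in a rule, only the expensive option carries the mark:"
   VP --> V
          (NP: (^ OBJ) = !)
          (CP: (^ COMP) = !
               @(OT-MARK RareConstr)).   "rare, expensive complement"

   # and in xlerc, to treat the mark as nogood while skimming:
   set skimming_nogoods {RareConstr}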
I hope this helps.
Best
Bamba
On Oct 8, 2013, at 11:01 AM, Agnieszka Patejuk wrote:
Dear all,
We've been developing POLFIE, an LFG grammar of Polish, at ICS PAS
for over 3 years. We'd like to concentrate on extending the
empirical scope of the grammar, but we're having performance issues
which affect the results quite badly.
Currently we're testing the grammar on sentences extracted from
Składnica, a treebank of Polish. There are 8333 sentences, and the
average sentence length is 10 segments (all segments are counted,
including punctuation).
How are these sentences parsed?
– an individual dictionary is created for each sentence (so the
words are already disambiguated morphosyntactically)
– each sentence is parsed on its own, in one XLE run.
The following performance settings are used when parsing:
– timeout: 100 seconds (set timeout 100)
– memory limit: 4096 MB (set max_xle_scratch_storage 4096).
Current results (out of 8333 sentences):
– parsed: 6926
– failed: 154
– out of memory: 11
– timeout: 1228
– unknown error: 14
Almost 15% of the sentences time out, which is very worrying. The
average length of a parsed sentence is almost 9 segments (8.74),
while the average length of a timed-out sentence is almost 19 (18.67).
Have you had similar problems? Are you parsing real sentences, and
how long are your sentences?
Do you have any suggestions as to what we could do to reduce the
number of timed-out sentences?
Best,
Agnieszka
_______________________________________________
ParGram mailing list
ParGram@mailman.uni-konstanz.de
https://mailman.uni-konstanz.de/mailman/listinfo/pargram