Hi Agnieszka,
This looks like a classic XLE problem. I face similar problems when
parsing Wolof data (real sentences of a similar length to yours).
@Jani: By the way, I think this is a good topic to discuss in the XLE
forum.
My suggestions would be the following:
1) Try to eliminate all kinds of ambiguity as far as possible, in
particular spurious ambiguities. XLE provides some tools that help to
inspect spurious ambiguities; e.g. the xle command
print-ambiguity-sources may be particularly useful. A related trick
that helps a lot is to check for non-exclusive disjunctions and to
define the disjunctions in the grammar (rules, templates, etc.) so
that they are clearly mutually exclusive (use CHECK features if
necessary); a toy example follows below.
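To make the disjunction point concrete, here is a minimal invented
illustration (the feature names are not from any actual grammar). A
non-exclusive disjunction such as

   { (^ TENSE)              "some tense is specified"
   | (^ TENSE) = pres }     "the tense is present"

is satisfied twice by every present-tense verb, so XLE derives the
same f-structure two times; after parsing a sentence,
print-ambiguity-sources will point you at places like this. Restated
so that the disjuncts exclude each other, the spurious ambiguity
disappears:

   { (^ TENSE)
     ~[(^ TENSE) = pres]    "any tense except present"
   | (^ TENSE) = pres }

When the exclusivity has to be coordinated across rules or templates,
the usual device is a CHECK feature: (^ CHECK _RARE) = + where the
construction is licensed, and (^ CHECK _RARE) =c + where it is
required.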
2) A second possibility would be to put XLE into skimming mode (see
the XLE documentation). You could then "play" with the skimming
variables start_skimming_when_scratch_storage_exceeds,
start_skimming_when_total_events_exceed,
max_new_events_per_graph_when_skimming and skimming_nogoods. Also
check whether the value you are using for max_xle_scratch_storage is
not a bit too high.
As stated in the documentation, in this mode XLE does a bounded
amount of work per subtree, which guarantees that it will finish
processing the sentence in a polynomial amount of time. Although this
does not guarantee that the sentences are parsed correctly, it may
help you to get at least some output and to reduce the number of
timeouts. In my experience, the main problem with skimming is finding
the best values for the variables mentioned above; some example
settings follow below.
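For concreteness, these settings go into your xlerc file. The numbers
below are only placeholders that you will have to tune on your own
data (every grammar behaves differently here), not recommendations:

   # hard memory cap (in MB); if this is very high, XLE can grind
   # for a long time before giving up, so keep it moderate
   set max_xle_scratch_storage 1000
   # start skimming well before the hard cap is reached
   set start_skimming_when_scratch_storage_exceeds 500
   set start_skimming_when_total_events_exceed 100000
   # bound the work done per subtree once skimming has started
   set max_new_events_per_graph_when_skimming 500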
3) A third option is to use the optimality mark STOPPOINT.
With this mark, you can design your grammar so that expensive and
rare constructions are considered only when no other analysis is
available. You can combine this option with skimming: for instance,
you can put the OT marks denoting expensive and rare constructions
in the variable skimming_nogoods. This can help reduce the timeouts
and speed up the parser (see the sketch below).
Here, too, there is the drawback that you might not get the analysis
you're looking for, simply because the STOPPOINT has blocked it.
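A rough sketch of how this looks; RareConstr and the toy VP rule are
invented for illustration. You rank the mark after the STOPPOINT
keyword in the grammar's configuration, attach it to the expensive
disjunct via the usual OT-MARK template, and, if you combine it with
skimming, add it to skimming_nogoods in your xlerc:

   "in the grammar's CONFIG section:"
   OPTIMALITYORDER STOPPOINT RareConstr.

   "the usual template for introducing an OT mark:"
   OT-MARK(_mark) = _mark $ o::*.

   "in a rule, only the expensive option carries the mark:"
   VP --> V
          (NP: (^ OBJ) = !)
          (CP: (^ COMP) = !
               @(OT-MARK RareConstr)).   "rare, expensive complement"

   # and in xlerc, to treat the mark as nogood while skimming:
   set skimming_nogoods {RareConstr}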
I hope this helps.
Best
Bamba
On Oct 8, 2013, at 11:01 AM, Agnieszka Patejuk wrote:
Dear all,
We've been developing POLFIE, an LFG grammar of Polish, at ICS PAS
for over 3 years. We'd like to concentrate on extending the
empirical scope of the grammar, but we're having performance issues
which affect the results quite badly.
Currently we're testing the grammar on sentences extracted from
Składnica, a treebank of Polish. There are 8333 sentences, and the
average sentence length is 10 segments (all segments are counted,
including punctuation).
How are these sentences parsed?
– an individual dictionary is created for each sentence (so the
words are already disambiguated morphosyntactically)
– each sentence is parsed on its own, in one XLE run.
The following performance settings are used when parsing:
– timeout: 100 seconds (set timeout 100)
– memory limit: 4096 MB (set max_xle_scratch_storage 4096).
Current results (out of 8333 sentences):
– parsed: 6926
– failed: 154
– out of memory: 11
– timeout: 1228
– unknown error: 14
Almost 15% of the sentences time out, which is very worrying. The
average length of a parsed sentence is almost 9 segments (8.74),
while the average length of a timed-out sentence is almost 19 (18.67).
Have you had similar problems? Are you parsing real sentences, and
how long are your sentences?
Do you have any suggestions as to what we could do to reduce the
number of timed-out sentences?
Best,
Agnieszka
_______________________________________________
ParGram mailing list
ParGram@mailman.uni-konstanz.de
https://mailman.uni-konstanz.de/mailman/listinfo/pargram