Briefly (being on vacation with only iPhone access): as Miriam says, moving distinctions into the c-structure helps a lot. The Norwegian grammar doesn't use complex categories, but we achieve the same effect by distinguishing many subtypes of syntactic categories by means of subscripts (VPmain, VPfin, VPinf, PRONexpl, etc.). To keep f-structure constraints local, use the COMPLETE template. Avoid restricted unification, especially in combination with functional uncertainty.
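For illustration, the subscripting approach looks roughly like this (the rule and category names are invented for the sketch, not taken from the actual Norwegian grammar):

   "finiteness is encoded in the category name itself, so the parser
    never has to entertain a finite analysis of an infinitival VP"
   VPfin --> Vfin
             (NP: (^ OBJ)=!).
   VPinf --> PARTinf
             Vinf
             (NP: (^ OBJ)=!).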
Helge
Sent from my iPhone
On 8 Oct 2013, at 17:12, Miriam Butt <miriam.butt@uni-konstanz.de> wrote:
Hi,
to add to this discussion from my experience...
Bamba's tips sound very good. Tracy spent a great deal of time getting the English grammar to be efficient, apparently with good success, as Joachim's statistics show.
One thing to do is to handle as much as possible in the c-structure rather than via "expensive" f-structure equations/disjunctions. In the German and English grammars, the use of complex categories helped quite a lot.
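Schematically, a complex category lets one parameterized rule do the work of several near-duplicates while keeping the distinction in the c-structure; the rule below is an invented sketch, not taken from the German or English grammar:

   "VP[fin] and VP[inf] expand via the same rule; the parameter _form
    is passed down to the verb, so no f-structure disjunction is needed"
   VP[_form] --> V[_form]
                 (NP: (^ OBJ)=!).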
The English grammar also makes heavy use of OT marks; this took some careful adjusting and experimenting to cut out the analyses one didn't want without losing the ones one did want.
Also, as Bamba says, the elimination of (spurious) ambiguities is extremely important.
As far as I remember, John Maxwell also spent some time working on XLE's implementation so that, where possible, local f-structure constraints are resolved speedily. So one thing to aim for is to have f-structure constraints resolved as locally as possible, rather than passing them up and up. Of course, this depends on the phenomenon.
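To make the contrast concrete, here is an invented sketch of the two styles:

   "local (cheap): case is constrained on the NP daughter itself and
    can be checked as soon as that node is built"
   VP --> V
          (NP: (^ OBJ)=!
               (! CASE)=acc).

   "non-local (expensive): the same requirement stated via functional
    uncertainty, resolvable only after much of the tree is built"
   (^ {XCOMP|COMP}* OBJ CASE) = acc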
The Urdu grammar at the moment is horribly inefficient. We have very few c-structure rules and tons of disjunctions with much of the work being done via f-structure constraints. And we have many spurious ambiguities, whereby there is more than one way to arrive at the correct solution. We are currently reimplementing it to remove this inefficiency.
Best,
Miriam
On 10/8/13 1:02 PM, Joachim Wagner wrote:
Hi Agnieszka,
We parsed the BNC (http://www.natcorp.ox.ac.uk/) with XLE and the ParGram English LFG grammar and got only 0.55% timeouts on machines from around 2007.
We used the XLE command parse-testfile with parse-literally set to 1, max_xle_scratch_storage set to 1000 MB, a timeout of 60 seconds and no skimming.
I remember that I had strange problems when trying higher values for max_xle_scratch_storage. Maybe try 1000 MB instead of 4096 just to see what happens.
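For concreteness, in an XLE script those runs correspond roughly to the following commands (the testfile name here is made up):

   # performance settings used for the BNC runs described above
   set parse-literally 1
   set max_xle_scratch_storage 1000
   set timeout 60
   parse-testfile bnc-testfile.txt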
Also, XLE crashed on rare occasions (less than 0.002% of sentences).
More details are in my PhD thesis:
- Breakdown of events (timeout, out of memory, no parse, etc.): Table 5.1 (page 136)
- Sentence length distribution of the BNC: curve 1.000 in Figure 7.13 (page 225)
2007 paper: http://doras.dcu.ie/15214/
2012 thesis: http://doras.dcu.ie/16776/
Best regards, Joachim
On Tuesday 08 October 2013 12:01:22 Bamba Dione wrote:
Hi Agnieszka,
This looks like a classic XLE problem. I face similar problems when parsing Wolof data (real sentences of similar length to yours). @Jani: By the way, I think this is a good topic to discuss in the XLE forum.
My suggestions would be the following:
- Try to eliminate all kinds of ambiguities as much as possible, in particular spurious ambiguities. XLE provides some tools that help to inspect spurious ambiguities; e.g. the xle command print-ambiguity-sources may be particularly useful (see the sketch after this list). A related trick that helps a lot is to check for non-exclusive disjunctions and to define disjunctions in the grammar (rules, templates, etc.) so that they are clearly mutually exclusive (use check features if necessary).
- A second possibility would be to put XLE in skimming mode (see the XLE documentation). You could then "play" with the skimming variables start_skimming_when_scratch_storage_exceeds, start_skimming_when_total_events_exceed, max_new_events_per_graph_when_skimming and skimming_nogoods; example settings follow after this list. Also, check whether the value you are using for max_xle_scratch_storage is not a bit too high.
As stated in the documentation, in this mode XLE does a bounded amount of work per subtree, which guarantees that it will finish processing the sentence in a polynomial amount of time. Although this does not guarantee that the sentences are parsed correctly, it may help you to get at least some output and to reduce the number of timeouts. In my experience, the main problem with skimming is finding the best values for the variables mentioned above.
- A third option is to use the optimality mark stoppoint. With this mark, you can design your grammar so that it considers expensive and rare constructions only when no other analysis is available (a schematic example follows below). You can combine this option with skimming: for instance, put the OT marks denoting expensive and rare constructions in the variable skimming_nogoods. This can help reduce the timeouts and speed up the parser. Here also, there is the drawback that you might not get the analysis you're looking for just because stoppoint has blocked it.
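For the first point, an interactive session might look roughly like this (the sentence is just an example):

   # parse a sentence, then ask XLE to report where the ambiguity comes from
   parse {They saw the girl with the telescope.}
   print-ambiguity-sources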
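For skimming, the variables are set before parsing, roughly as below. The numbers are placeholders that must be tuned for your own grammar, not recommendations:

   # start skimming once the parse gets expensive (placeholder values)
   set start_skimming_when_scratch_storage_exceeds 700
   set start_skimming_when_total_events_exceed 100000
   set max_new_events_per_graph_when_skimming 500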
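For stoppoint, the general shape is roughly the following; the mark name and the rule are invented for illustration, and you should check the OT section of the XLE documentation for the exact ranking conventions:

   "configuration section: RareConstr is ranked below a STOPPOINT, so
    constructions carrying it are only tried if nothing else parses"
   OPTIMALITYORDER NOGOOD STOPPOINT RareConstr.

   "the usual OT-MARK template puts the mark into the o-projection"
   OT-MARK(_mark) = _mark $ o::*.

   "in a rule: mark the expensive construction"
   VP --> V
          (NP: (^ OBJ)=!)
          (CP: (^ COMP)=!
               @(OT-MARK RareConstr)).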
I hope this can help.
Best,
Bamba
On Oct 8, 2013, at 11:01 AM, Agnieszka Patejuk wrote:
Dear all,
We've been developing POLFIE, an LFG grammar of Polish, at ICS PAS for over 3 years. We'd like to concentrate on extending the empirical scope of the grammar, but we're having performance issues which affect the results quite badly.
Currently we're testing the grammar on sentences extracted from Składnica, a treebank of Polish. There are 8333 sentences; the average sentence length is 10 segments (all segments are counted, including punctuation).
How are these sentences parsed?
– an individual dictionary is created for each sentence (so the words are already disambiguated morphosyntactically)
– each sentence is parsed on its own, in one XLE run.
The following performance variables are used when parsing:
– 100 seconds (set timeout 100)
– 4096 MB memory (set max_xle_scratch_storage 4096).
Current results (out of 8333 sentences):
– parsed: 6926
– failed: 154
– out of memory: 11
– timeout: 1228
– unknown error: 14
Almost 15% of sentences (1228 of 8333) time out, which is very worrying. The average length of a parsed sentence is almost 9 segments (8.74), while the average length of a timed-out sentence is almost 19 (18.67).
Have you had similar problems? Are you parsing real sentences, and how long are your sentences?
Do you have any suggestions as to what we could do to reduce the number of timed-out sentences?
Best,
Agnieszka
--
Miriam Butt
FB Sprachwissenschaft
Universitaet Konstanz
Fach 184
78457 Konstanz
Germany
Tel: +49 7531 88 5109 / +49 7531 88 5115
Fax: +49 7531 88 4865
miriam.butt@uni-konstanz.de
http://ling.uni-konstanz.de/pages/home/butt
"Leo looked scandalized. 'Before lunch? No, my dear boy. Poker. Wouldn't play bridge before lunch.' Margery Allingham "The Case of the Late Pig" (p. 28)