Hi,
I am sending this message to all potentially relevant lists (ParGram, XLE, LFG, ParSem) so as to maximise the chance that it will reach people who are interested in this topic. If you know someone who might be interested but is not subscribed to these lists, please consider letting them know.
The question is: who would be interested in restarting ParGram meetings?
I am not sure what would be the best way to organise this discussion, so I am suggesting the following:
• if you think your answer might be of interest to many people (for instance, it might spark a discussion), please consider replying to the list(s);
• if not, please reply only to me – I will later go through the responses and post a summary to the relevant lists.
I hope that later we can discuss this topic in more detail with people who have expressed interest.
All best, Agnieszka
Yes! I'm definitely in. I missed the first ParGram because I couldn't fit it in with my day job, but I won't miss this one.
Lori
I would like to hear some discussion of what the goals would be.
—Joan (my spelling checker keeps rewriting my name as Jian, which is what we call the straight sword in my taiji group — but this is really unintentional)
Sent on the fly
Hi Joan,
I hope it is appropriate for me to chime in, since my day job as a funded project monkey prevented me from attending all but one ParGram meeting in the past.
I'm now trying to get back to doing more linguistics, but from the perspective of the needs of the field of language technologies.
Here is how I see the importance of ParGram:
1. Cross-lingual studies of grammatical relations, information structure, constructions in a very broad sense, argument realization, grammaticalization, etc., leading to theoretical insight into the nature of these things.
2. Treebanks and parsers that can be used for corpus-based studies in linguistics, and perhaps in some hybrid third-wave neuro-symbolic systems in language technologies, especially in low-resource languages.
3. A challenge to UD (Universal Dependencies) and UMR (Uniform Meaning Representation): I think we can learn from UD and UMR how to do things on a larger scale. But at the same time, we can save them from their fate as stone soup in the following sense: they thought they could do something easy (make soup using only "stones" and water, the "stones" being three pages of definitions of grammatical relations), but as they progressed, they needed to keep adding "carrots", "onions", "bones" (serious linguistic decisions). But unlike the story, where the soup turned out good, UD has turned out messy and too big to fail. Sometimes they talk about possibly not going on to Version 3 because Version 2 is too big to change. We can show how to do a UD-like project on a firm foundation.
--Lori
Hi everybody, and Joan in particular. I liked Lori's remarks on UD and UMR very much; it's no good reducing the annotation set to suit general projects... they can become too generic. Being retired, I should have spare time to help. Rodolfo
Hello everyone,
Just a brief remark from me (I am travelling): Lori has very good points. I would find it particularly interesting if we could develop the basis for promoting our approach to treebanking vis-à-vis UD. (I liked the stone soup comparison. In Norway it is ’nail soup’ (’spikersuppe’) – the sort of nails you hammer.) Of course, it remains to be seen what is practically possible. For one thing, there can hardly be a concomitant development of XLE, as in the original ParGram project.
Best, Helge
Hi all,
A UD-like project based on a firm foundation sounds like a goer. I hope it takes off and becomes a large part of the new phase of ParGram.
Mel
Of course, I’d describe UD — and its relation to LFG — slightly differently. 😉
At any rate, we do have — and have always had — more than 3 pages; and just now we’ve got an article out in Computational Linguistics:
https://direct.mit.edu/coli/article/47/2/255/98516/Universal-Dependencies
Chris
Hi Chris,
The CL article is wonderful! I recommend it to everyone on this list.
While having great respect for UD, I stand by my "stone soup" analogy (which I stole from Yorick Wilks's characterization of statistical MT).
For ParGram people, I know this is off the topic of ParGram, so feel free to ignore.
*First, my great respect for UD:*
1. By making UD a community-sourced project, they created a very large resource that has enabled research on joint, cross-lingual, and multilingual models of parsing. This research could not have been done without UD.
2. UD has enabled the development of parsers for low-resource languages, both with supervised training on a UD treebank and with transfer from higher-resourced languages (a small illustration follows this list).
3. Community-sourcing and stone-souping allowed all of this to happen quickly, years faster than it could have been done without community-sourcing.
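As a quick illustration of point 2, a minimal sketch assuming the Stanza library, which ships UD-trained Tamil models; the example sentence is arbitrary:

import stanza

# Fetch the UD-trained Tamil models (built on the Tamil UD treebank).
stanza.download("ta")
nlp = stanza.Pipeline("ta", processors="tokenize,pos,lemma,depparse")

doc = nlp("நான் புத்தகம் படித்தேன்.")  # "I read a book."
for sent in doc.sentences:
    for word in sent.words:
        head = sent.words[word.head - 1].text if word.head > 0 else "ROOT"
        print(word.text, word.deprel, "->", head)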
*Second, some points about community-sourcing:*
(I just now remembered that just before the pandemic, I told Joakim that I would help write better documentation to improve the quality of community-sourced UD treebanks as a way of remediating the points below.)
Starting from UD's stated goal of having a collection of treebanks in a uniform, comparable format, to facilitate cross-lingual machine learning:
1. Some UD treebanks are written by linguists and are excellent. Others show misunderstandings of the definitions of dependency labels and head-dependent relationships. Thus they do not fulfill the goal of having a uniform, comparable format. The validation scripts for UD are a great resource, but they cannot detect when a treebanker makes a decision that is not quite what a linguist would do given the same documentation. The documentation needs to include procedures for how to tell if recipients in your language are core arguments or obliques, etc. (see the first sketch after point 3 below).
2. I believe that the UD documentation is intentionally brief so that people will not be intimidated, say, in comparison to the 300-page manual of the Penn Treebank or the 1500-page manual of the Prague Dependency Treebank. UD documentation includes a number of constructions (as in your CL paper), but it doesn't have a decision process for deciding whether the thing that translates into a secondary predicate in English really is a secondary predicate in your language, which I think leads to copying English grammar instead of analyzing the grammar of your own language. Again, I think this could be fixed by tutorials on linguistic tests for various things along with typology lessons, e.g.: "There are n things that typologists know about coordinate structures; apply these diagnostics to see which type you are dealing with. Now that you know which of the n types you are dealing with, here are specific treebanking guidelines for that type".
I'm noticing more and more in my teaching that students from other countries equate grammar with English grammar, causing them to misanalyze their own language by assuming that it is like English. For example, students who speak languages with locative possession constructions (a book is at me) will say that possession in their language is just like English. The UD documentation does not help people with this. There needs to be some pre-treebanking awareness. ParGram people would be great at writing the needed documentation, tutorials, and instructions.
3. Further copying of English grammar results from over-interpretation of the goal of having a uniform, comparable format for all treebanks. (I don't mean to pick on Japan. I have discussed this politely with Graham Neubig.) Some groups in Japan (UniDic?) took this to an extreme, which to my mind totally defeated the purpose of UD. In your CL article, "ta" is not segmented separately from the word "katta", but some Japanese segmenters treat "ta" as a separate word, and in some Japanese UD treebanks "ta" is an AUX. Why did they think this was a good idea? Because tenses are carried on auxiliary verbs in European languages, so this makes Japanese more like European languages and should be consistent with UD's goal of uniform, comparable trees across languages. They also treated "koto" as an AUX because it seems like an aux when it nominalizes a sentence, but they also treated it as an aux in "noun no koto". One of my students was experimenting with cross-lingual parser training, and had to remove Japanese from the training pool or at least delete all instances of "koto" before training with Japanese (see the second sketch below).
The UD goals could easily be misunderstood by non-linguists to mean "make everything like English". However, multilingual neural net parsers work by learning what is languagey about language. If you are not a linguist and you do something naive, without experimenting first, because you think it is compatible with UD goals, you are likely to mess up what is languagey about your language. I think this is what happened with over-segmentation in Japanese. Again, ParGram people could help people understand what it means to make your f-structures compatible without copying the grammar of another language.
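For point 1, a sketch of the kind of per-language diagnostic procedure the documentation could spell out; the tests and the two-out-of-three threshold are simplified illustrations, not actual UD guidelines:

# Hypothetical diagnostics for deciding whether recipients in a given
# language are core arguments (iobj) or obliques (obl).
def recipient_relation(marks_like_object: bool,
                       triggers_verb_agreement: bool,
                       passivizable: bool) -> str:
    core_evidence = sum([marks_like_object, triggers_verb_agreement, passivizable])
    return "iobj" if core_evidence >= 2 else "obl"

# English double-object recipient ("gave the boy a book"):
print(recipient_relation(True, False, True))    # -> iobj
# a recipient that behaves like an ordinary adjunct:
print(recipient_relation(False, False, False))  # -> obl

And for point 3, a sketch of the workaround just described: dropping every Japanese sentence containing "koto" analysed as AUX before cross-lingual training. The file names are hypothetical; this assumes the conllu package:

from conllu import parse_incr

kept = dropped = 0
with open("ja-ud-train.conllu", encoding="utf-8") as src, \
     open("ja-ud-train.filtered.conllu", "w", encoding="utf-8") as dst:
    for sentence in parse_incr(src):
        # In the actual treebank the form is こと; "koto" is the romanization.
        if any(tok["form"] in ("koto", "こと") and tok["upos"] == "AUX"
               for tok in sentence):
            dropped += 1
            continue
        dst.write(sentence.serialize())
        kept += 1

print(f"kept {kept} sentences, dropped {dropped}")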
*Finally, content words as heads and other policies that were made without anticipating all of the consequences*
Here is where the stone soup comes in most: it would have been good to anticipate the consequences before becoming too big to change.
UD has decided to define itself as dependency relations among content words. I can't complain about this: around 2003-2004 I was in a group called Interlingual Annotation of Multilingual Text Corpora (IAMTC), which was short-lived, but included people like Ed Hovy, Owen Rambow, Bonnie Dorr, and Nizar Habash. IAMTC also advocated content words as heads. We were a little too early for community sourcing and very large-scale annotation, and not nearly as charismatic as the UD people, so we faded away.
Any policy has consequences. Here are some of the consequences of "content words as heads":
The head of "Women in the house and in the senate" is "house" because "is", "in", and "and" can't be heads. A better solution would be a constructional approach, identifying this as an instance of a type of predicative construction.
The UD discussion board has long discussions of sentences like "The answer is nobody knows". "Is" can't be the head, so "knows" is the head. But that leads to a violation of the "one subject" policy. "nobody" is the subject of "knows" but "answer" is also the subject of "knows". This is the kind of thing that linguists anticipate before they try to start the soup using only stones.
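A toy encoding of that configuration makes the clash mechanical; the triples below are a simplification of the analyses discussed on the board:

from collections import Counter

# (dependent, relation, head) for "The answer is nobody knows",
# with "is" barred from heading.
deps = [
    ("The", "det", "answer"),
    ("answer", "nsubj", "knows"),   # outer subject
    ("is", "cop", "knows"),
    ("nobody", "nsubj", "knows"),   # inner subject
]

subjects_per_head = Counter(head for _, rel, head in deps if rel == "nsubj")
for head, n in subjects_per_head.items():
    if n > 1:
        print(f"'{head}' has {n} nsubj dependents: the one-subject policy breaks")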
And now, I see in the UD discussion boards that it may not be possible to move on to version 3 because there has been too much effort and volume in version 2.
--Lori
Hi,
Interesting discussion.
I think we should definitely restart ParGram meetings — they were a good forum for discussion and exchange, and we could do some of that online these days. Part of why we stopped having them was a lack of travel funds. But the world is different now, and we can put appropriate infrastructure in place along with meeting in person as needed or wanted.
@Koenraad — we’ll look into moving the Bergen site to Konstanz.
I am working with a Tamil PhD student (Sarves) who has been trying to use the Tamil UD treebank. He has an LFG Tamil ParGram grammar and can map to UD, but the problem is that the existing Tamil UD treebank uses some wrong categories and lacks others (at some point it did not have dative subjects; an abstract I reviewed tried to introduce them to UD, but several UD reviewers pushed back with the argument that this category simply did not and should not exist in UD). So he is a bit stuck: he wants to be able to map to UD, and it’s technically possible, but it’s not possible in terms of content mapping. One would have to fix the existing Tamil UD, and all of the Tamil people involved agree that this needs to be done, but since most of them are more CS than linguistics and Tamil is under-researched, the way forward as to the fixes is not immediately clear. They need ParGram-type help, as Lori points out.
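To make the content-mapping gap concrete, a sketch: the structural side of an LFG-to-UD mapping is a small table, but it presupposes that the target treebank recognises the categories being mapped. The table is a simplified illustration, not Sarves’s actual mapping:

# Simplified mapping from f-structure grammatical functions to UD relations.
GF_TO_DEPREL = {
    "SUBJ": "nsubj",
    "OBJ": "obj",
    "OBJ-TH": "iobj",
    "COMP": "ccomp",
    "XCOMP": "xcomp",
}

def map_gf(gf, case=None):
    # A dative experiencer SUBJ still maps to nsubj; the case goes into
    # FEATS (Case=Dat), not into the relation. If the target treebank
    # rejects dative subjects, this mapping has nowhere to go.
    deprel = GF_TO_DEPREL[gf]
    feats = {"Case": case.capitalize()} if case else {}
    return deprel, feats

print(map_gf("SUBJ", case="dat"))  # -> ('nsubj', {'Case': 'Dat'})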
Having said that, I have realized over the years of training students that very few people can actually grasp all of the linguistics and engineering and formal underpinnings of everything we did and do in ParGram. Which is why it is difficult to make it a broad-based community effort — one just has to know too many things (and be good at them).
The genius of UD was to distill things down to a simple scheme that the “crowd” could understand and apply, so it could become a broad-based community effort. But then, it turns out that that scheme is too underspecified to deal with the variety of crosslinguistic patterns that are out there. And we ParGram and LFG people do indeed understand the crosslinguistic patterns fairly well and could advise, while not losing sight of the practicalities and compromises needed for CL work. But then there are not many of us…
And it also takes time to get to know a language like Tamil, which has a long but confusing literature and very few formal linguists who work on it. So I have been helping, but it took me a while to understand just their complementizer system, for example. In CL you can’t wait around for some linguist to take a year or even just a few months to figure out just the complementizers…
I can sort them out when they come and tell me they have 48 auxiliaries or some such nonsense because I know all about verbs, but the bottom line is that good linguistics is hard and takes time and needs many more experts on a language than what we can currently field. One solution would be for the big tech companies to fund 10 linguists for each of the world’s languages. That would already help a lot. And be a tiny fraction of their overall budget.
That’s my two cents!
Miriam
p.s. I think there is no doubt at all that Tamil can and should have dative experiencer subjects in UD — we even got an example of a dative experiencer subject into the new UD paper, in Section 4.4 — but, as noted, there are certainly issues with various UD treebanks, whether coming from traditional grammar notions, trying to wrongly generalize from English, a lack of linguistic expertise, or whatever. I think they can be worked through, though sometimes it takes some careful interpersonal work, and some treebanks have been actively improved and made more consistent with others and with the UD specification. It would be great to see Tamil improved using some of the discussions that have taken place. I wish Sarves luck (and could help in a limited way).
Chris.
On Jul 30, 2021, at 9:35 PM, Miriam Butt miriam.butt@uni-konstanz.de wrote:
Hi,
interesting discussion.
I think we should definitely restart ParGram meetings — they were a good forum for discussion and exchange and we could do some of that on-line these days — part of why we stopped having them was a lack of travel funds.. But the world is different now and we can put appropriate infrastructure in place along with meeting in person as needed or wanted.
@Koenraad — we’ll look into moving the Bergen site into to Konstanz.
I am working with a Tamil PhD student (Sarves) and he has been trying to use the Tamil UD bank. He’s got an LFG Tamil ParGram grammar and can map to UD, but the problem is that the Tamil UD bank that exists actually has chosen wrong categories or does not have some (at some point it did not have dative subjects and this abstract I reviewed was trying to introduce them to UD but several UD reviewers pushed back with the argument that this category simply did not and should not exist in UD). So he is a bit stuck, he wants to be able map to UD and it’s technically possible, but it’s not possible in terms of content mapping. One would have to fix the existing Tamil UD and all of the Tamil people involved in the Tamil UD agree that this needs to be done, but since most of them are more CS than linguistics and Tamil is under-researched, the way forward as to the fixes is not immediately clear. They need ParGram type help, as Lori points out.
Having said that, I have realized over the years of training students that very few people can actually grasp all of the linguistics and engineering and formal underpinnings of everything we did and do in ParGram. Which is why it is difficult to make it a broad based community effort — one just has to know too many things (and be good at them).
The genius of UD was to distill things down to a simple scheme that the “crowd” could understand and apply, so it could become a broad-based community effort. But then, it turns out that that scheme is too unspecified to deal with the variety of crosslinguistic patterns that are out there. And us ParGram and LFG people do indeed understand the crosslinguistic patterns fairly well and could advise, while not losing sight of the practicalities and compromises needed for CL work. But then there are not many of us….
And it also takes time to get to know a language like Tamil, which has a long but confusing literature and very few formal linguists who work on it. So I have been helping, but it took me a while to understand just their complementizer system, for example. In CL you can’t wait around for some linguist to take a year or even just a few months to figure out just the complementizers…
I can sort them out when they come and tell me they have 48 auxiliaries or some such nonsense because I know all about verbs, but the bottom line is that good linguistics is hard and takes time and needs many more experts on a language than what we can currently field. One solution would be for the big tech companies to fund 10 linguists for each of the world’s languages. That would already help a lot. And be a tiny fraction of their overall budget.
That’s my two cents!
Miriam
On 30. Jul 2021, at 19:32, Lori Levin levin@andrew.cmu.edu wrote:
Hi Chris,
The CL article is wonderful! I recommend it to everyone on this list.
While having great respect for UD, I stand by my "stone soup" analogy (which I stole from Yorick Wilks's characterization of statistical MT).
For ParGram people, I know this is off the topic of ParGram, so feel free to ignore.
First, my great respect for UD:
- By making UD a community-sourced project, a very large resource was created that has enabled research on joint, cross-lingual, and multilingual models of parsing. This research could not have been done without UD.
2.UD has enabled the development of parsers for low-resource languages, both with supervised training on a UD treebank and with transfer from higher-resourced languages.
- Community-sourcing and stone-souping allowed all of this to happen quickly, years faster than it could have been done without community-sourcing.
Second, some points about community-sourcing:
(I just now remembered that just before the pandemic, I told Joakim that I would help write better documentation to improve the quality of community-sourced UD treebanks as a way of remediating the points below.)
Starting from UD's stated goal of having a collection of treebanks in a uniform, comparable format, to facilitate cross-lingual machine learning:
Some UD treebanks are written by linguists and are excellent. Others show misunderstandings of the definitions of dependency labels and head-dependent relationships. Thus they do not fulfill the goal of having a uniform, comparable format. The validation scripts for UD are a great resource, but they cannot detect when a treebanker makes a decision that is not quite what a linguist would do given the same documentation. The documentation needs to include procedures for how to tell if recipients in your language are core-arguments or obliques, etc.
I believe that the UD documentation is intentionally brief so that people will not be intimidated, say, in comparison to the 300-page manual of the Penn Treebank or the 1500-page manual to the Prague Dependency Treebanks. UD documentation includes a number of constructions (as in your CL paper), but it doesn't have a decision process for deciding whether the thing that translates into a secondary predicate in English really is a secondary predicate in your language, which I think leads to copying English grammar instead of analyzing the grammar of your own language. Again, I think this could be fixed by tutorials on linguistic tests for various things along with typology lessons. e.g., "There are n things that typologists know about coordinate structures, apply these diagnostics to see which type you are dealing with. Now that you know which of the n types you are dealing with, here are specific treebanking guidelines for that type".
I'm noticing more and more in my teaching that students from other countries equate grammar with English grammar, causing them to miss-analyze their own language by assuming that it is like English. For example, students who speak languages with locative possession constructions (a book is at me) will say that possession in their language is just like English. The UD documentation does not help people with this. There needs to be some pre-treebanking awareness. ParGram people would be great at writing the needed documentation, tutorials, and instructions.
- Further copying of English grammar results from over-interpretation of the goal of having a uniform, comparable format for all treebanks. (I don't mean to pick on Japan. I have discussed this politely with Graham Neubig.) Some groups in Japan (Unidict?) took this to an extreme, which to my mind, totally defeated the purpose of UD. In your CL article, "ta" is not segmented separately from the word "katta", but some Japanese segmenters treat "ta" as a separate word. In some Japanese UD's "ta" is an AUX. Why did they think this was a good idea? Because tenses are carried on auxiliary verbs in European languages, so this makes Japanese more like European languages and should be consistent with UD's goal of uniform, comparable trees across languages. They also treated "koto" as an AUX because it seems like an aux when it nominalizes a sentence, but they also treated it as an aux in "noun no koto". One of my students was experimenting with cross-lingual parser training, and had to remove Japanese from the training pool or at least delete all instances of "koto" before training with Japanese.
The UD goals could be easily mis-understood by non-linguists to mean "make everything like English". However, multi-lingual neural net parsers work by learning what is languagey about language. If you are not a linguist and you do something naive, without experimenting first, because you think it is compatible with UD goals, you are likely to mess up what is languagey about your language. I think this is what happened with over-segmentation in Japanese. Again, ParGram people could help people understand what it means to make your f-structures compatible without copying the grammar of another language.
Finally, content words as heads and other policies that were made without anticipating all of the consequences
Here is where the stone soup comes in most: it would have been good to anticipate the consequences before becoming too big to change.
UD has decided to define itself as dependency relations among content words. I can't complain about this: around 2003-2004 I was in a group called Interlingual Annotation of Multilingual Text Corpora (IAMTC), which was short-lived, but included people like Ed Hovy, Owen Rambow, Bonnie Dorr, and Nizar Habash. IAMTC also advocated content words as heads. We were a little too early for community sourcing and very large-scale annotation, and not nearly as charismatic as the UD people, so we faded away.
Any policy has consquences. Here are some of the consequences of "content words as heads":
The head of "Women in the house and in the senate" is "house" because "is", "in", and "and" can't be heads. A better solution would be a constructional approach, identifying this as an instance of a type of predicative construction.
Miriam Butt
Department of Linguistics, University of Konstanz
Fach 184, 78457 Konstanz, Germany
Tel: +49 7531 88 5109 / +49 7531 88 5115, Fax: +49 7531 88 4865
miriam.butt@uni-konstanz.de https://www.ling.uni-konstanz.de/butt/
“One of the very first Indian words to enter the English language was the Hindustani slang for plunder: loot. According to the Oxford English Dictionary, this word was rarely heard outside the plains of north India until the late eighteenth century, when it suddenly became a common term across Britain.”
William Dalrymple, p. xxiii, "The Anarchy: The Relentless Rise of the East India Company"
Hi Chris,
that would be great. I met with Sarves this morning (cc'd here), and he was expressing frustration that people are already training parsers and other systems on the faulty Tamil UD. We resolved to try to get a committee together to work out how to fix things. Some issues are not as straightforward as dative subjects (gapping is apparently a problem) and will take some careful work, as you say. If you could help at some point, that would be great. I'll try to start looking into it with Sarves as of October or so.
Cheers,
Miriam
On 8. Aug 2021, at 02:51, Christopher D. Manning manning@stanford.edu wrote:
p.s. I think there is no doubt at all that Tamil can and should have dative experiencer subjects in UD. We even got an example of a dative experiencer subject into the new UD paper, in Section 4.4. But, as noted, there are certainly issues with various UD treebanks, whether stemming from traditional grammar notions, wrongly generalizing from English, a lack of linguistic expertise, or whatever. I think they can be worked through, though sometimes it takes some careful interpersonal work, and some treebanks have been actively improved and made more consistent with others and with the UD specification. It would be great to see Tamil improved using some of the discussions that have taken place. I wish Sarves luck (and could help in a limited way).
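For concreteness, here is roughly what such an analysis looks like in CoNLL-U. The Tamil sentence, lemmas, and features below are an illustrative reconstruction, not lines from the Tamil TTB or from the paper; the point is only the dative-marked nsubj, and the relation of the theme ("tamil") is itself debatable.

# Hypothetical CoNLL-U rendering of a Tamil dative experiencer subject:
# 'enakku tamil teriyum' ("I know Tamil", lit. "to-me Tamil is-known").
CONLLU = (
    "1\tenakku\tnaan\tPRON\t_\tCase=Dat\t3\tnsubj\t_\t_\n"   # experiencer as nsubj
    "2\ttamil\ttamil\tPROPN\t_\t_\t3\tobj\t_\t_\n"           # theme; obj vs. nsubj is debated
    "3\tteriyum\tteri\tVERB\t_\t_\t0\troot\t_\t_\n"
)
for row in CONLLU.strip().split("\n"):
    cols = row.split("\t")
    assert len(cols) == 10   # CoNLL-U rows have ten tab-separated fields
    print(cols[1], cols[7], "-> head", cols[6])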
Chris.
On Jul 30, 2021, at 9:35 PM, Miriam Butt miriam.butt@uni-konstanz.de wrote:
Hi,
interesting discussion.
I think we should definitely restart ParGram meetings. They were a good forum for discussion and exchange, and we could do some of that online these days. Part of why we stopped having them was a lack of travel funds, but the world is different now, and we can put appropriate infrastructure in place along with meeting in person as needed or wanted.
@Koenraad: we'll look into moving the Bergen site to Konstanz.
I am working with a Tamil PhD student (Sarves) who has been trying to use the Tamil UD bank. He has an LFG Tamil ParGram grammar and can map its output to UD, but the existing Tamil UD bank has chosen wrong categories or lacks some altogether. (At some point it did not have dative subjects; an abstract I reviewed tried to introduce them to UD, but several UD reviewers pushed back with the argument that this category simply did not and should not exist in UD.) So he is a bit stuck: he wants to be able to map to UD, and it is technically possible, but it is not possible in terms of content mapping. One would have to fix the existing Tamil UD, and all of the people involved in it agree that this needs to be done, but since most of them are more CS than linguistics and Tamil is under-researched, the way forward on the fixes is not immediately clear. They need ParGram-type help, as Lori points out.
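To make the "technically possible, but not in terms of content" point concrete, here is a schematic sketch; this is not Sarves's actual pipeline, and all labels and values are made up for illustration.

# Schematic LFG f-structure -> UD mapping; illustrative only.
FS_TO_UD = {
    ("SUBJ", "nom"): "nsubj",
    ("SUBJ", "dat"): "nsubj",   # the LFG analysis: dative experiencer subject
    ("OBJ",  "acc"): "obj",
    ("OBL",  "dat"): "obl",
}

def map_gf(gf, case):
    # Map one grammatical function (plus case) to a UD relation.
    return FS_TO_UD[(gf, case)]

# The structural mapping goes through without trouble ...
assert map_gf("SUBJ", "dat") == "nsubj"
# ... but if the released treebank annotated the same token as an oblique,
# grammar output and "gold" data disagree, and anything trained on the
# treebank learns the wrong category.
gold_label_in_treebank = "obl"   # hypothetical old annotation
print(map_gf("SUBJ", "dat") == gold_label_in_treebank)   # False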
Having said that, I have realized over the years of training students that very few people can actually grasp all of the linguistics, engineering, and formal underpinnings of everything we did and do in ParGram. That is why it is difficult to make it a broad-based community effort: one just has to know too many things (and be good at them).
The genius of UD was to distill things down to a simple scheme that the “crowd” could understand and apply, so it could become a broad-based community effort. But it turns out that the scheme is too underspecified to deal with the variety of crosslinguistic patterns that are out there. We ParGram and LFG people do understand the crosslinguistic patterns fairly well and could advise, while not losing sight of the practicalities and compromises needed for CL work. But there are not many of us…
And it also takes time to get to know a language like Tamil, which has a long but confusing literature and very few formal linguists who work on it. So I have been helping, but it took me a while to understand just its complementizer system, for example. In CL you can't wait around for a linguist to take a year, or even just a few months, to figure out just the complementizers…
I can sort them out when they come and tell me they have 48 auxiliaries or some such nonsense, because I know all about verbs, but the bottom line is that good linguistics is hard, takes time, and needs many more experts on a language than we can currently field. One solution would be for the big tech companies to fund ten linguists for each of the world's languages. That would already help a lot, and it would be a tiny fraction of their overall budget.
That’s my two cents!
Miriam
Hi,
The CL article about UD claims that “a UD tree represents a sentence’s observed surface predicate–argument structure”, but it does not seem to annotate predicate-argument structure in the way LFG understands this in f-structures, and that’s where the stone soup comes in.
A paper presented at LFG’20 states that “the Norwegian UD treebank does not annotate predicate–argument structures” and shows how convoluted it is to try and extract all ARG1s of a verb from the Norwegian UD treebank.
(http://web.stanford.edu/group/cslipublications/cslipublications/LFG/LFG-2020/lfg2020-rdmd.pdf, section 5: Comparison with dependency treebanks)
Best, Koenraad
Hello Koenraad,
With 20/20 hindsight, we probably should not have used the term “surface predicate–argument structure” in the paper, but rather something like “predicate and surface grammatical relations”. It is certainly the case that, in recent decades, the term “predicate-argument structure” has been used almost exclusively for a representation of semantic roles. Nevertheless, I think the article makes quite clear that UD represents neither semantic roles (“argument structure”) nor surface constituency, but rather roughly the level of LFG f-structure (for overt grammatical relations). As a historical note, in the 1982 LFG book, representations like "PRED ‘hand<(^SUBJ)(^OBJ2)(^OBJ)>’" are referred to as predicate-argument expressions. While I accept that usage has since changed, it is pretty close to this representation (but with overt arguments only) that UD encodes, and so I think UD does, up to certain deliberate changes, do things “in the way LFG understands this in f-structures”. It's just that there's no a-structure.
It is therefore true, as the paper shows, that to get out an a-structure notion like ARG1, you need to write an expression that is the disjunction of a number of subcases. The best I can advertise is that there are fewer subcases than for capturing the same notion using tgrep-style expressions over Penn Treebank-style trees, an activity that many have engaged in over the last three decades. And if you do want to work at the level of f-structure-style grammatical relations, then you're in a fairly nice place with UD.
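As a toy illustration of what "a disjunction of subcases" means in practice (this is not the query from the LFG'20 paper; tokens here are bare (id, form, head, deprel) tuples, and the subcase list is deliberately incomplete):

def arg1_candidates(tokens, verb_id):
    # Collect ARG1-like dependents of one verb from plain UD edges.
    out = []
    for tid, form, head, rel in tokens:
        if head != verb_id:
            continue
        if rel == "nsubj":            # subcase 1: active subject
            out.append(form)
        elif rel == "obl:agent":      # subcase 2: passive by-phrase
            out.append(form)
        # subcases 3, 4, ...: shared subjects under conj, controlled
        # subjects of xcomp, relativized subjects under acl:relcl, and
        # so on -- each needs its own clause, which is the point.
    return out

sent = [(1, "Kim", 2, "nsubj"), (2, "left", 0, "root")]
print(arg1_candidates(sent, 2))   # ['Kim']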
Chris.
Hi Lori and everyone, thanks for writing up the detailed thoughts! Sorry for my slow replies….
Top-level takeaway:
If any ParGram or other LFG people would be interested in contributing to UD, you would be very welcome and you could clearly contribute a lot!
The UD approach and the UD representation certainly have their flaws, and one of the things that UD has suffered from is not enough linguists involved for enough hours. There is certainly still a lot to discuss, document and fix in UD, and any time that linguists, such as Bill Croft, have gotten involved with UD, good things have come out of that. Even when they are people who have a lot of qualms and differences with the way UD does things (Kim Gerdes comes to mind…), they have nevertheless contributed very substantially to the quality of UD, and we appreciate it.
Longer discussion:
Thanks for noting some of UD’s strengths. One more that I would add is that by its scaling and (attempt at) uniformity, UD is also enabling an increasing amount of typological linguistics and psycholinguistic work that just was not possible before. If you haven’t seen any of that, it includes papers such as:
Richard Futrell et al. 2015. Large-scale evidence of dependency length minimization …. PNAS. https://doi.org/10.1073/pnas.1502134112
Matías Guzmán Naranjo et al. 2018. Quantitative word order typology with UD. TLT. https://mguzmann89.gitlab.io/pdf/word-order-oslo.pdf
Natalia Levshina. 2019. Token-based typology …. Linguistic Typology. https://www.degruyter.com/document/doi/10.1515/lingty-2019-0025/html
Michael Hahn et al. 2020. Universals of word order …. PNAS. https://www.pnas.org/cgi/doi/10.1073/pnas.1910923117
Himanshu Yadav et al. 2021. Do dependency lengths explain constraints …. Linguistics Vanguard. https://doi.org/10.1515/lingvan-2019-0070
Michael Hahn et al. 2021. Modeling word and morpheme order …. Psychological Review. https://psycnet.apa.org/record/2021-31510-001
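To give a taste of the kind of measurement behind several of these papers, here is a minimal sketch of total dependency length, computed from UD-style head indices; the actual studies control for far more (random and optimal baselines, corpus weighting, and so on).

def dependency_length(heads):
    # heads: {dependent position: head position}, with 0 for the root.
    # Total dependency length = sum of linear distances to heads.
    return sum(abs(dep - head) for dep, head in heads.items() if head != 0)

# "Kim gave Sandy books": everything attaches to "gave" at position 2.
print(dependency_length({1: 2, 2: 0, 3: 2, 4: 2}))   # 1 + 1 + 2 = 4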
Both community sourcing and a sustained collaboration like ParGram have their advantages and disadvantages. I openly admit that. In some ways the tradeoffs are similar to the “The Cathedral and the Bazaar” discussion around software development.
There is no doubt that some of the current UD treebanks are not very good. Our hope is that over time they will be improved, and some are being quite actively improved. The main constraints on that are human time, expertise, and commitment. To take one not-very-good example that has come up, the current Tamil TTB treebank: some of its problems are well known and documented, and there is some commitment to fix them, but things haven't been moving very fast, and I'm sure more help would be welcome. Now, a closer collaboration like ParGram can help with creating commitment and sharing expertise, but it doesn't automatically create more human time. Incidentally, the Tamil TTB was originally created by converting the TamilTB from Prague Dependency Treebank style. I don't know the conversion history very well, but it doesn't seem like the 1500-page PDT manual was very effective at yielding a high-quality starting point in this case! Nevertheless, in UD's case, more documentation and tutorials on linguistic tests, as you suggest, would definitely help. It all comes down to someone finding the time to write them. It's a large task, since it essentially becomes akin to writing language grammars and broad-coverage syntax textbooks.
I think I was the original source of the prediction that there would never be a version 3 of UD (I should be more careful what I say!), but the statement does reflect some of the complications of the scale and loose bazaar collaboration of UD, which contrasts with the small, tight, sustained collaboration of ParGram. The problem is that relatively few people have so far stuck around as long-term collaborators, helpers, and guides in the UD community. Most people work at some point in time to build or convert a treebank for some domain or language that they are interested in. This makes it very difficult to consider a major revision at this point, because there is a very large number of treebanks, and it is unclear that anyone is willing to spend large amounts of time revising them to fit a new standard. The best current approach seems to be to gradually evolve version 2 by better clarifying, documenting, and making consistent how various things are annotated. On the other hand, the bazaar model has allowed us to assemble a very large number of treebanks for a large number of languages with reasonable consistency and in a relatively short timespan.
As for treating content words consistently as heads: I do have my own misgivings about that one, though I have come to like it more over time. :) I think it is a little unfair to say that the policies were made without anticipating the consequences. As is usual in life, it's not that we had thought absolutely everything through and worked out how to analyze every construction in every language. Nevertheless, there was a fair bit of prior work testing out different possibilities: Stanford Dependencies for English supported several variants of partial support for content words as heads, where it could be done or not done for prepositions and the copula, while “Stanford Dependencies” for Finnish was the first fully content-words-as-heads treebank. And there was quite a bit of further working through of analyses, and of in-person and email discussion, on the road to both UD v1 and UD v2. I do agree that there is a real problem with the treatment of a copula linking to a clausal complement (see the sketch below), which doesn't seem to have a satisfying fix within the assumptions adopted. But, hey, if we only have one bad problem, we're not doing too badly.
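For anyone who has not followed the discussion boards, here is the clash in miniature, with edges written as (dependent, relation, head) triples; this is a schematic rendering of the problem, not an official UD guideline.

# "The answer is nobody knows." If "is" cannot head the clause, "knows"
# must, and then two tokens compete for its subject slot.
edges = [
    ("The",    "det",   "answer"),
    ("answer", "nsubj", "knows"),   # outer subject of the copular clause
    ("is",     "cop",   "knows"),
    ("nobody", "nsubj", "knows"),   # inner subject of "knows"
    ("knows",  "root",  None),
]
subjects_of_knows = [d for d, r, h in edges if r == "nsubj" and h == "knows"]
print(subjects_of_knows)   # ['answer', 'nobody'] -- two subjects for one verb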
I'd also like to mention that, coming from my roughly LFG perspective, adopting content words as heads was a key part of what made UD happen and become a big success. That is: things could either have continued like linguistics-as-usual, with everyone doing their own thing, or people could compromise enough that a whole bunch of people could come together and work with a common representation. Moving to content words as heads, even though it has some issues for languages with real prepositions and copulas, like English, was a big part of getting enough common ground that some LFG people, some Prague School people, and some other European traditional and dependency grammar people could get together and work with a shared representation. (It also actually makes things a little less English-centric, in that we're essentially saying “Analyze English as if it were Finnish!”.)
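Concretely, for a PP such as "in the house" (a toy rendering of the two styles, again with edges as (dependent, relation, head) triples):

# UD style: the content word heads the phrase; "in" is a case marker.
ud_style = [
    ("in",    "case", "house"),
    ("the",   "det",  "house"),
    ("house", "obl",  "lives"),   # the noun carries the grammatical relation
]
# Older Stanford Dependencies style: the preposition heads the PP.
sd_style = [
    ("in",    "prep", "lives"),
    ("the",   "det",  "house"),
    ("house", "pobj", "in"),
]
# Same words, different heads. The UD choice mirrors languages like
# Finnish, where the same relation surfaces as a case suffix on the noun.
print(ud_style != sd_style)   # True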
Finally, if you've read this far, I really recommend Section 5 of the new UD paper. I think it (very briefly!) provides key wisdom about why UD might be better off without argument structure, why it would likely be difficult to community-source a resource of similar scale with full LFG, and why it might well have less impact even if it could be done.
Chris.
Thanks, Chris, for your long and careful message. Thanks, Agnieszka, for starting this conversation.
I particularly liked Chris' "Things could either have continued like linguistics-as-usual, with everyone doing their own thing, or people could compromise enough that a whole bunch of people could come together and work with a common representation." I am happy that this joining of forces has happened for UD and I hope more joining of concerns and programs can happen from now on.
Most of the people who have written so far seem to be in favor of restarting ParGram. I also think this is a great idea and I hope we can do it. But I want more: I would like to (re?)start ParSem, which is a whole new kettle of fish. This can be done later on.
As Agnieszka and Lori have mentioned, doing it online should be easier, and maybe just a few hours online every six months will be enough to get momentum! The Bazaar model (as Chris described it) also allows more people to commit only a few resources when they're busy (or bored). I think we have nothing to lose and much to gain if we start an online ParGram. Shall we?
best wishes, Valeria
On Sat, Aug 7, 2021 at 7:22 PM Christopher D. Manning manning@stanford.edu wrote:
Hi Lori and everyone, thanks for writing up the detailed thoughts! Sorry for my slow replies….
*Top-level take away:*
If any ParGram or other LFG people would be interested in contributing to UD, you would be very welcome and you could clearly contribute a lot!
The UD approach and the UD representation certainly have their flaws, and one of the things that UD has suffered from is not enough linguists involved for enough hours. There is certainly still a lot to discuss, document and fix in UD, and any time that linguists, such as Bill Croft, have gotten involved with UD, good things have come out of that. Even when they are people who have a lot of qualms and differences with the way UD does things (Kim Gerdes comes to mind…), they have nevertheless contributed very substantially to the quality of UD, and we appreciate it.
*Longer discussion:*
Thanks for noting some of UD’s strengths. One more that I would add is that, by its scaling and (attempt at) uniformity, UD is also enabling an increasing amount of typological and psycholinguistic work that just was not possible before. If you haven’t seen any of that, it includes papers such as:
Richard Futrell et al. 2015. Large-scale evidence of dependency length minimization …. PNAS. https://doi.org/10.1073/pnas.1502134112
Matías Guzmán Naranjo et al. 2018. Quantitative word order typology with UD. TLT. https://mguzmann89.gitlab.io/pdf/word-order-oslo.pdf
Natalia Levshina. 2019. Token-based typology …. Linguistic Typology. https://www.degruyter.com/document/doi/10.1515/lingty-2019-0025/html
Michael Hahn et al. 2020. Universals of word order …. PNAS. https://www.pnas.org/cgi/doi/10.1073/pnas.1910923117
Himanshu Yadav et al. 2021. Do dependency lengths explain constraints …. Linguistics Vanguard. https://doi.org/10.1515/lingvan-2019-0070
Michael Hahn et al. 2021. Modeling word and morpheme order …. Psychological Review. https://psycnet.apa.org/record/2021-31510-001
Both community sourcing and a sustained collaboration like ParGram have their advantages and disadvantages. I openly admit that. In some ways the tradeoffs are similar to the “The Cathedral and the Bazaar” discussion around software development.
There is no doubt that some of the current UD treebanks are not very good. Our hope is that over time they will be improved. Some of them are being quite actively improved. The main constraints on that are human time, expertise, and commitment. To take one example that has come up, the current Tamil TTB treebank is not very good: some of its problems are well known and documented, and there is some commitment to fix them, but things haven’t been moving very fast. I’m sure more help would be welcome. Now, a closer collaboration like ParGram can help with creating commitment and sharing expertise, but it doesn’t automatically create more human time. Incidentally, the Tamil TTB was originally created by converting the TamilTB from Prague Dependency Treebank style. I don’t actually know the history of how it was converted, but it doesn’t seem like the 1500-page PDT manual was very effective at yielding a high-quality starting point in this case! Nevertheless, in UD’s case, more documentation and tutorials on linguistic tests, as you suggest, would definitely help. It all comes down to someone finding the time to write them. It’s a large task, since it essentially becomes akin to writing language grammars and broad-coverage syntax textbooks.
I think I was the original source of the prediction that there would never be a version 3 of UD — I should be more careful what I say! — but the statement does reflect some of the complications of the scale and loose bazaar collaboration of UD, which does contrast with the small, tight, sustained collaboration of ParGram. The problem is that relatively few people have so far stuck around as long-term collaborators, helpers, and guides in the UD community. Most people work at some point in time to build or convert a treebank for some domain or some language that they are interested in. This makes it very difficult to consider a major revision at this point, because there are a very large number of treebanks, and it is unclear that anyone is willing to spend large amounts of time revising them to fit a new standard. The best current approach seems to be to gradually evolve version 2 by better clarifying and documenting how various things are annotated, and by making the annotation more consistent. On the other hand, the bazaar model has allowed us to assemble a very large number of treebanks for a large number of languages, with reasonable consistency and in a relatively short timespan.
As for treating content words consistently as heads, I do have my own misgivings about that one, though I have come to like it more over time. :) I think it is a little bit unfair to say that the policies were made without anticipating the consequences. I mean, as is usual in life, it’s not that we’d thought absolutely everything through and worked out how to analyze every construction in every language. Nevertheless, there was a fair bit of prior work testing out different possibilities: Stanford Dependencies for English supported several variants with partial use of content words as heads, which could be turned on or off for prepositions and the copula, while “Stanford Dependencies” for Finnish was the first fully “content words as heads” treebank. And there was quite a bit of further working through analyses and having in-person and email discussions on the road to both UD v1 and UD v2. I do agree that there is just a problem with the treatment of a copula linking to a clausal complement, which doesn’t seem to have a satisfying fix within the assumptions adopted. But, hey, if we only have one bad problem, we’re not doing too badly.
I’d also like to mention that, coming from my roughly LFG perspective, adopting content words as heads was a key part of what made UD happen and be a big success. That is: things could either have continued like linguistics-as-usual, with everyone doing their own thing, or people could compromise enough that a whole bunch of people could come together and work with a common representation. Moving to content words as heads, even though it has some issues for languages with real prepositions and copulas, like English, was a big part of getting enough common ground that some LFG people, some Prague School people, and some other European traditional and dependency grammar people could get together and work with a common shared representation. (It also actually makes things a little less English-centric, in that we’re essentially saying “Analyze English as if it were Finnish!”.)
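To make the contrast concrete, here is a minimal sketch (my own illustration, not taken from any treebank) of the two head conventions for a simple copular sentence, written as (id, form, head, deprel) tuples. The UD side follows the v2 guidelines; the labels on the function-word-headed side are schematic rather than from any particular scheme.

```python
# Two dependency analyses of "Sue is in Paris", as (id, form, head, deprel)
# tuples, where head 0 marks the root.

# Function words as heads: the copula "is" is the root and the
# preposition "in" heads its phrase. Labels here are schematic.
function_headed = [
    (1, "Sue",   2, "nsubj"),
    (2, "is",    0, "root"),
    (3, "in",    2, "pred"),   # schematic label
    (4, "Paris", 3, "pobj"),   # schematic label
]

# UD v2, content words as heads: the predicate nominal "Paris" is the
# root; the copula and the preposition attach to it as function words.
content_headed = [
    (1, "Sue",   4, "nsubj"),
    (2, "is",    4, "cop"),
    (3, "in",    4, "case"),
    (4, "Paris", 0, "root"),
]

def arcs(tree):
    """Render each arc as 'dependent <-rel- head' for easy comparison."""
    forms = {i: form for i, form, _, _ in tree}
    return [f"{form} <-{rel}- {forms.get(head, 'ROOT')}"
            for _, form, head, rel in tree]

print(arcs(function_headed))
print(arcs(content_headed))
```

Either convention is internally coherent; the point is that settling on one of them was the price of a shared representation.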
Finally, if you’ve read this far, I really recommend Section 5 of the new UD paper. I think it (very briefly!) provides key wisdom about why UD might be better off without argument structure, why it would likely be difficult to community-source a resource of similar scale with full LFG, and why it might well have less impact even if it could be done.
Chris.
On Jul 30, 2021, at 10:32 AM, Lori Levin levin@andrew.cmu.edu wrote:
Hi Chris,
The CL article is wonderful! I recommend it to everyone on this list.
While having great respect for UD, I stand by my "stone soup" analogy (which I stole from Yorick Wilks's characterization of statistical MT).
For ParGram people, I know this is off the topic of ParGram, so feel free to ignore.
*First, my great respect for UD:*
1. By making UD a community-sourced project, a very large resource was created that has enabled research on joint, cross-lingual, and multilingual models of parsing. This research could not have been done without UD.
2. UD has enabled the development of parsers for low-resource languages, both with supervised training on a UD treebank and with transfer from higher-resourced languages.
3. Community-sourcing and stone-souping allowed all of this to happen quickly, years faster than it could have been done without community-sourcing.
*Second, some points about community-sourcing: *
(I just now remembered that just before the pandemic, I told Joakim that I would help write better documentation to improve the quality of community-sourced UD treebanks as a way of remediating the points below.)
Starting from UD's stated goal of having a collection of treebanks in a uniform, comparable format, to facilitate cross-lingual machine learning:
- Some UD treebanks are written by linguists and are excellent. Others show misunderstandings of the definitions of dependency labels and head-dependent relationships. Thus they do not fulfill the goal of having a uniform, comparable format. The validation scripts for UD are a great resource, but they cannot detect when a treebanker makes a decision that is not quite what a linguist would do given the same documentation (see the sketch after this list). The documentation needs to include procedures for how to tell if recipients in your language are core arguments or obliques, etc.
- I believe that the UD documentation is intentionally brief so that people will not be intimidated, say, in comparison to the 300-page manual of the Penn Treebank or the 1500-page manual of the Prague Dependency Treebank. UD documentation includes a number of constructions (as in your CL paper), but it doesn't have a decision process for deciding whether the thing that translates into a secondary predicate in English really is a secondary predicate in your language, which I think leads to copying English grammar instead of analyzing the grammar of your own language. Again, I think this could be fixed by tutorials on linguistic tests for various things, along with typology lessons, e.g., "There are n things that typologists know about coordinate structures; apply these diagnostics to see which type you are dealing with. Now that you know which of the n types you are dealing with, here are specific treebanking guidelines for that type."
I'm noticing more and more in my teaching that students from other countries equate grammar with English grammar, causing them to misanalyze their own language by assuming that it is like English. For example, students who speak languages with locative possession constructions ("a book is at me") will say that possession in their language is just like English. The UD documentation does not help people with this. There needs to be some pre-treebanking awareness. ParGram people would be great at writing the needed documentation, tutorials, and instructions.
- Further copying of English grammar results from over-interpretation of the goal of having a uniform, comparable format for all treebanks. (I don't mean to pick on Japan. I have discussed this politely with Graham Neubig.) Some groups in Japan (UniDic?) took this to an extreme which, to my mind, totally defeated the purpose of UD. In your CL article, "ta" is not segmented separately from the word "katta", but some Japanese segmenters treat "ta" as a separate word. In some Japanese UD treebanks, "ta" is an AUX. Why did they think this was a good idea? Because tenses are carried on auxiliary verbs in European languages, so this makes Japanese more like European languages and should be consistent with UD's goal of uniform, comparable trees across languages. They also treated "koto" as an AUX because it seems like an aux when it nominalizes a sentence, but they treated it as an aux in "noun no koto" as well. One of my students was experimenting with cross-lingual parser training and had to remove Japanese from the training pool, or at least delete all instances of "koto", before training with Japanese.
The UD goals could easily be misunderstood by non-linguists to mean "make everything like English". However, multilingual neural net parsers work by learning what is languagey about language. If you are not a linguist and you do something naive, without experimenting first, because you think it is compatible with UD goals, you are likely to mess up what is languagey about your language. I think this is what happened with over-segmentation in Japanese. Again, ParGram people could help people understand what it means to make your f-structures compatible without copying the grammar of another language.
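Here is the kind of consistency audit I have in mind (the sketch promised above). It is a minimal sketch, not an actual UD validation script: the treebank file name is made up, and the check can only surface suspicious lemmas, such as "koto", for a linguist to review; it cannot decide which analysis is right.

```python
# Sketch: find lemmas that are tagged AUX in some of their occurrences
# but not in others in a CoNLL-U treebank, so that cases like "koto"
# at least surface for review. The file name is hypothetical.

from collections import Counter, defaultdict

upos_counts = defaultdict(Counter)

with open("ja_example-ud-train.conllu", encoding="utf-8") as f:
    for line in f:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges and empty nodes
        lemma, upos = cols[2], cols[3]  # CoNLL-U LEMMA and UPOS columns
        upos_counts[lemma][upos] += 1

for lemma, counts in sorted(upos_counts.items()):
    total = sum(counts.values())
    aux = counts.get("AUX", 0)
    if 0 < aux < total:  # sometimes AUX, sometimes not: worth a look
        print(f"{lemma}: AUX {aux}/{total}, tags {dict(counts)}")
```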
*Finally, content words as heads and other policies that were made without anticipating all of the consequences*
Here is where the stone soup comes in most: it would have been good to anticipate the consequences before becoming too big to change.
UD has decided to define itself as dependency relations among content words. I can't complain about this: around 2003-2004 I was in a group called Interlingual Annotation of Multilingual Text Corpora (IAMTC), which was short-lived, but included people like Ed Hovy, Owen Rambow, Bonnie Dorr, and Nizar Habash. IAMTC also advocated content words as heads. We were a little too early for community sourcing and very large-scale annotation, and not nearly as charismatic as the UD people, so we faded away.
Any policy has consequences. Here are some of the consequences of "content words as heads":
The head of "Women in the house and in the senate" is "house" because "is", "in", and "and" can't be heads. A better solution would be a constructional approach, identifying this as an instance of a type of predicative construction.
The UD discussion board has long discussions of sentences like "The answer is nobody knows". "Is" can't be the head, so "knows" is the head. But that leads to a violation of the "one subject" policy. "nobody" is the subject of "knows" but "answer" is also the subject of "knows". This is the kind of thing that linguists anticipate before they try to start the soup using only stones.
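To spell out the violation, here is a minimal sketch of the analysis that the policy forces (deprels simplified for illustration), plus the kind of check that trips over it:

```python
# "The answer is nobody knows" with content words as heads: "knows" must
# be the root, so the outer subject "answer" and the inner subject
# "nobody" both end up attached to it. Deprels are simplified.
from collections import defaultdict

tree = [
    (1, "The",    2, "det"),
    (2, "answer", 5, "nsubj"),  # subject of the copular clause
    (3, "is",     5, "cop"),
    (4, "nobody", 5, "nsubj"),  # subject of "knows" itself
    (5, "knows",  0, "root"),
]

# The "one subject" policy: no head should have two nsubj dependents.
subjects = defaultdict(list)
for _, form, head, rel in tree:
    if rel == "nsubj":
        subjects[head].append(form)

for head, forms in subjects.items():
    if len(forms) > 1:
        head_form = next(f for i, f, _, _ in tree if i == head)
        print(f'"{head_form}" has {len(forms)} subjects: {forms}')
```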
And now, I see in the UD discussion boards that it may not be possible to move on to version 3 because so much effort has already been invested in version 2 and the volume of existing annotation is too large to change.
--Lori
Hi all,
Not directly related to the question about ParGram meetings, but I would like to mention that using https://pargram.w.uib.no, a blog-like news site about ParGram hosted at Bergen, has become more difficult, since the University of Bergen has restricted edit access to people either on the university network or on specific IP addresses. I therefore suggest discontinuing the use of this site.
On the one hand, the news functionality of this site could easily be taken over by the ParGram email list.
On the other hand, the site also has some useful links and information pages, such as a comprehensive feature table and a substantial bibliography. These could be copied to the ParGram Wiki at Konstanz and further maintained there (https://wiki.uni-konstanz.de/pargram/index.php?title=Main_Page&wteswitched=1&veaction=edit).
Best regards, Koenraad
Dear All,
It has been almost a month since my original post and over a week since the last message in this thread, so this might be a good time to post a summary.
There have been many, many responses – many thanks to everyone who responded, either to the list or directly to me.
At least 14 people are interested in taking part in ParGram meetings, which is great.
Here are some selected questions/issues (sorted by date) – I think it is up to the ParGram community to consider these: • Joan Bresnan: "I would like to hear some discussion of what the goals would be." (seconded by Damir) • Damir Cavar: "Traction into the semantic and pragmatic domain would also be of interest. Would you also envision to broaden the perspectives and include other grammar related computational frameworks (e.g., GF), or even other theoretical frameworks?" • Helge Dyvik: "I would find it particularly interesting if we could develop the basis for promoting our approach to treebanking vis à vis UD."
A considerable part of the discussion was related to UD – I am not going to summarize this thread, but the relevant posts were sent to all lists, so anyone who is interested may read these.
All best, Agnieszka