Strand 6: Language resources: data, lexicons, corpora, tools (manager Cédric Gendrot, LPP-University of Paris)





Note: this Strand is no longer operational as of 2020.

This Strand involves more than 60 participants from all LabEx teams and interacts with all other axes. Its objectives are threefold:

  • Define and steer a language resource policy within the LabEx (establish an inventory of existing and required resources, encourage resource developers to use standards and develop mappings to other resources, and define a distribution policy centered on free availability);
  • Design and implement advanced techniques for developing linguistic resources, in order to improve their accuracy, coverage and speed of development (techniques for resource-scarce languages, transfer of resources between closely related languages, collaborative techniques for resource development);
  • Apply the two previous objectives to the development of linguistic resources for French, which is obviously a language of choice for the LabEx, but also for other languages from different typological families that are of particular interest to the LabEx, for example for linguistic, social or application-related reasons.

It should be emphasized that the importance of language resources is not limited to academic research. The development of linguistic resources has a strong social impact, which often depends on the language concerned: (i) developing cooperation with, and the transfer of knowledge to, developing countries (Mauritian Creole, Afro-Asiatic languages …); (ii) preserving linguistic diversity and teaching the languages of France (for example, Western Armenian, the languages of French Guiana …); (iii) providing resources for NLP systems used in business and military intelligence (e.g., Iranian languages, Mandarin Chinese, Arabic …) and for other popular NLP applications, such as machine translation (French, Mandarin Chinese, etc.).


Common tools for language resources:

This sub-axis will provide a repository for inventorying and distributing existing and future resources in the LabEx, in relation to all other components and in particular with component 3 (an online inventory of existing and future resources will be created and maintained). A special effort will be made to address standardization, distribution and dissemination issues, including by promoting the free availability of all resources, at least for research purposes. Another major objective of this sub-axis, in collaboration with resource developers and users, is to develop adequate tools to validate, edit and annotate all kinds of linguistic resources, as well as tools to extract linguistic information. Finally, this sub-axis will play a key role in training researchers on how to develop and exploit language resources.


Design semi-automatic techniques for resource development:

For most languages, no usable language resources are available, even though such resources serve as the basis for experimental linguistic studies and for NLP tools. However, developing such resources is a very expensive task. As a result, semi-automatic resource development techniques are a critical area of research. For this reason, this sub-axis will be responsible for advancing the state of the art in algorithmic, formal and practical models so as to reduce the cost of developing language resources as much as possible. A particular effort will be made on basic linguistic resources, i.e. morphological lexicons and part-of-speech taggers, as well as speech corpora. Specific techniques will also be developed to take advantage of situations where the language concerned is closely related to another for which linguistic resources already exist. Moreover, although human intervention will be reduced as much as possible, an important objective of this sub-axis will be to understand how to optimize it, notably through collaborative techniques (wikis, online games, Mechanical Turk).


Develop new linguistic resources:

Semantic resources are expensive to develop, and very few exist for French, even though research in linguistics, psycholinguistics and NLP would greatly benefit from large-scale semantic resources, as shown by recent work on English. One of our goals is to annotate particular corpora, such as the French TreeBank, with a wide variety of semantic annotation layers in addition to the existing morphosyntactic and syntactic layers, namely anaphora, (co)reference, named entities, FrameNet and discourse information. Another goal is to produce important new spoken-language resources (for example, a MapTask for French in Axis 2, acquisition corpora in Axis 4, and learner corpora and a phonological database in Axis 1, in addition to the existing ones). Thirdly, LabEx members will continue their work on studying French, and on developing tools and resources for it, from a historical perspective (resources for medieval French, resources created from historical grammars, and others). Significant resources will also be developed for more than 20 languages of particular interest to the LabEx, in collaboration with other Axes, including Axis 3. As new needs emerge, new projects will be put in place.


Here is the list of Strand 6 research operations:

  • LR-1 A joint approach to language resources development (resp. C. Plancq) LLF, Alpage, LPP-P3, LLACAN, Lacito, SeDyL
  • LR-2 Designing semi-automatic techniques for resource development
      • LR-2.1 Techniques for resource-scarce languages (resp. B. Sagot) Alpage, LLF, LPP-P3
      • LR-2.2 Techniques for transferring lexical resources from one language to a closely related one (resp. B. Sagot) Alpage, LLF, MII, LPP-P3, SeDyL
      • LR-2.3 Techniques for speech corpora (resp. M. Adda-Decker) LPP-P3, Lacito, SeDyL
  • LR-3 Developing new resources for French
      • LR-3.1 Instantiating a French FrameNet and FrameNet-annotating the French TreeBank (resp. M. Candito) Alpage, LLF, Lattice
      • LR-3.2 Adding annotation layers in the French TreeBank for anaphora, (co)reference, named entities and discourse structures (resp. L. Danlos) Alpage, LLF, Lattice
      • LR-3.3 A multilayer meta-lexicon for French: developing mappings between existing resources (resp. B. Sagot) Alpage, LPNCog, LPP-P5
      • LR-3.5 Acoustic and physiological data for multi-sensor investigation of normal speech (resp. J. Vaissière) LPP-P3, LLF
      • LR-3.6 Development of non-standard speech corpora: learner speech in French and English (resp. E. Delais) LLF, LPP-P3
      • LR-3.7 Pathological speech: acoustic, perceptual and physiological data (resp. C. Fougeron) LPP-P3, LLF
      • LR-3.8 Longitudinal corpus of spoken French: acoustic, perceptual and physiological data (resp. C. Gendrot) LPP-P3, LLF
  • LR-4 Developing resources for various languages
      • LR-4.1 Developing morphological and syntactic resources for Western Iranian languages (resp. P. Samvelian)
      • LR-4.3 Linguistic resources for Mandarin Chinese (resp. C. Saillard) LLF, Alpage
      • LR-4.6 Towards a Treebank for Mauritian Creole (resp. F. Henri) LLF, Alpage
      • LR-4.8 Text corpora for Manding languages (Bambara, Maninka) (resp. V. Vydrin)
      • LR-4.9 A historical perspective on language resources and linguistic traditions (resp. S. Archaimbault)
      • LR-4.10 BdD PluriL – Base de Données Plurilingues (Multilingual Database) (resp. I. Léglise) SeDyL, Lacito, LLF, LLACAN
      • LR-4.11 Automatic paradigm generation and language description (resp. G. Jacques) CRLAO, Alpage, LLF, HTL
      • LR-4.12 Resource acquisition for Hausa: ResHau (resp. B. Crysmann) LLF, Alpage, LLACAN
      • LR-4.13 Animal vocalizations: database for a better comprehension of the origin and evolution of human language (resp. D. Demolin) LPP-P3
      • LR-4.14 Natural Language Processing tools for field linguistics: deployment and improvement of the automatic transcription software Persephone (resp. Alexis Michaud)