Contrary to popular understanding, numerous African languages (and Mande languages in particular) already possess grammatical and lexicographical descriptions of the “traditional” type. However, modern descriptive work requires large-scale empirical data, i.e. big text corpora.

The availability of large annotated text corpora for numerous world languages has had a major impact on all linguistic disciplines. Unfortunately, corpus linguistics for African languages lags far behind. Nonetheless, there has been progress; in particular, since 2010 several annotated text corpora for Mande languages have been put online. The current version of the Bambara Reference Corpus contains more than 11 million words, and the Maninka Reference Corpus about 3.5 million. These corpora and others have been developed as part of the Corpora Mandeica project. A number of scholars have published works based on data from these corpora (see the list of references).

The main goal of the PhD project is to outline a corpus-driven linguistic study of Manding languages, and to carry out one (or several) investigation(s) in this field. Possible areas of focus include but are by no means limited to:

  • grammatical semantics of certain affixes or auxiliary words;
  • analysis of the polysemy of content words, with an eye towards the compilation of a corpus-driven dictionary;
  • syntactic studies based on the syntactically annotated subcorpus;
  • particularities of different textual genres;
  • various linguistic phenomena based on parallel corpora.

This PhD project will require mastering the software tools of Corpora Mandeica: the Daba program, its Git repository, and the interfaces for integrating texts into the corpus (metadata attribution, disambiguation of automatically annotated texts, and syntactic annotation following the Universal Dependencies model). The PhD student will thus actively participate in the development of the Manding corpora.


Required skills and experience:

The candidate must have an MA degree in linguistics before the start of the contract. A good knowledge of linguistic typology is necessary, as well as at least the basics of corpus linguistics. Good proficiency in Bambara and/or Guinean Maninka is also required.

Desirable skills:

  • knowledge of the Linux environment;
  • knowledge of the Python programming language;
  • knowledge of the Toolbox software (for dictionary compiling);
  • knowledge of the Elan software;
  • corpus annotation experience.

Research context: Labex-EFL research project Corpora for Mande studies (coordinator: Valentin Vydrin).

Institutional context: INALCO & Laboratoire Langage, langues et cultures d’Afrique (LLACAN). Cooperation with the Equipe de Recherche Textes, Informatique, Multilinguisme (ERTIM, INALCO) is planned.

