Increase the DIgital VITALity and visibility of languages of France:
linguistic descriptions and annotated corpora

ANR-21-CE27-0004

Description

This project aims to increase the vitality and visibility of several languages of France: Alsatian, Corsican, Occitan and Poitevin-Saintongeais. It is positioned at the crossroads of descriptive linguistics and corpus linguistics. Its main goal is to create resources, in particular raw and annotated corpora, with several objectives:

  • Build (i) monolingual corpora in genres that are close to or transcribe oral language, for example plays or narrative ethnotexts, and (ii) parallel corpora (from translations);
  • Develop annotated corpora in the “Universal Dependencies” framework;
  • Produce complete, up-to-date linguistic descriptions and formalisations based on corpora;
  • Raise awareness in the NLP (Natural Language Processing) community of the problems of non-standardised languages and the need to take variation into account in NLP systems;
  • Share and transfer experiences and tools between languages in the project and explore methods of technology transfer.

Latest News

Participation in the scientific meeting of the GdR “Computational, formal & field linguistics”

Research carried out as part of the DIVITAL project was presented at the scientific meeting of the GdR Linguistique Informatique, Formelle et de Terrain, held on 20 and 21 November 2023 in Nancy:

  • Cristina Garcia Holgado. More than just data: Dialectal variation and NLP resources for Corsican and Poitevin-Saintongeais
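  • Delphine Bernhard. Transfert zero-shot pour l’étiquetage morphosyntaxique : analyse de l’impact de la transformation des données à étiqueter pour les dialectes alsaciens (Zero-shot transfer for part-of-speech tagging: analysis of the impact of transforming the data to be tagged for Alsatian dialects)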