This project aims to increase the vitality and visibility of several languages of France: Alsatian, Corsican, Occitan and Poitevin-Saintongeais. It is positioned at the crossroads of descriptive linguistics and corpus linguistics. Its main goal is the creation of resources, in particular raw and annotated corpora, with several objectives:
- Build (i) monolingual corpora in genres that are close to or transcribe oral language, for example plays or narrative ethnotexts, and (ii) parallel corpora (from translations);
- Develop annotated corpora in the “Universal Dependencies” framework;
- Produce complete and up-to-date corpus-based linguistic descriptions and formalisations;
- Raise awareness in the NLP (Natural Language Processing) community of the problems of non-standardised languages and the need to take variation into account in NLP systems;
- Share and transfer experiences and tools between languages in the project and explore methods of technology transfer.