A prototype system for obtaining and managing training data for multilingual learning

The project aims to empower less-resourced language communities to create parallel corpora for machine translation, enhancing language preservation and cultural heritage through an open-source prototype.

Grant
€ 150,000
2023

Project details

Introduction

It is difficult to build high-quality machine translation systems for less-resourced languages, such as the minority languages of Europe. State-of-the-art machine translation is trained on large parallel corpora, that is, texts paired with their translations. Such corpora are not available for less-resourced languages.

Project Overview

We will provide a system for the rapid and inexpensive creation of new parallel corpora. Our Proof of Concept (PoC) project will produce an open-source prototype that builds on findings from the principal investigator's ERC Starting Grant, and it will determine how to handle intellectual property rights (IPR) and future funding.

Key Innovations

The key innovation of the prototype is that it can be used by members of the less-resourced language community themselves. Current systems require an extensive background in natural language processing. Allowing the community to create and curate parallel data has clear social benefits.

Social Impact

High-quality machine translation systems for less-resourced languages will allow more content to be created in these languages, which will play a strong role in their preservation. Curated parallel data will also be useful in activities such as education and cultural heritage research.

Funding Landscape

Government funding is available for digital language preservation for many of the 7,000 languages spoken on Earth. Companies with online translation systems, such as Google and DeepL/Linguee, are not addressing this market because the return on investment is too low. It makes more sense to empower local communities to create such parallel data.

Evaluation and Future Development

We will carefully evaluate our prototype to ensure that it meets the needs of the less-resourced language communities. Along with the creation of the prototype, we will determine how best to structure the IPR to support future development.

Considerations for Implementation

Two possibilities we will consider are consulting, which we have already carried out for the Sorbian community, and a certification scheme for users of our system. We will also consider:

  1. Commercial machine translation
  2. Multilingual classification problems such as hate speech detection

Financial details & Timeline

Financial details

Grant amount: € 150,000
Total project budget: € 150,000

Timeline

Start date: 1 October 2023
End date: 30 September 2025
Grant year: 2023

Partners & Locations

Project partners

  • TECHNISCHE UNIVERSITAET MUENCHEN (lead partner)
  • LUDWIG-MAXIMILIANS-UNIVERSITAET MUENCHEN

Country

Germany
