Glossary for MT

Various machine translation systems offer a mechanism for enforcing your preferred terminology: when you provide a “term-translation” glossary, they check whether the “term” occurs in the source and use its “translation” in the output. The “translation” overrides whatever term the MT system would use without your glossary. The mechanism is predictable and handles inflected forms when the language pair requires them, so your glossary only needs to contain base forms.

For most popular MT systems, the glossary must be a CSV file structured like this, with no definitions, examples, or any other additional information:

source term,termin przetłumaczony

The simplicity of the mechanism is also its limitation: your glossary can have only one row for each source term (no variants). For good and predictable results, I recommend removing from the glossary any rows where the source or target term is empty (this happens in a working copy of a larger glossary) and any terms that contain a comma.

Since CAT plug-ins that allow glossary import for MT are not very helpful in identifying issues with a CSV file, I created a script that cleans a glossary into a form that most MT systems can use and leaves a list of excluded terms (you can edit some of these excluded terms and add them to the glossary later). Enjoy: https://github.com/martab0/nettoyeur
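
To make the cleaning rules concrete, here is a minimal Python sketch of the idea. It is not the nettoyeur script itself, only an illustration; the function name `clean_glossary`, the reason labels, and the choices to keep the first row for a repeated source term and to compare terms case-sensitively are my assumptions.

```python
import csv
import io


def clean_glossary(rows):
    """Split parsed CSV rows into usable glossary pairs and excluded rows.

    Assumptions of this sketch (not taken from any MT system's docs):
    the first row for a source term wins, and matching is case-sensitive.
    """
    kept, excluded, seen = [], [], set()
    for row in rows:
        if not row:
            continue  # csv.reader yields [] for blank lines; skip them
        cells = [cell.strip() for cell in row]
        if len(cells) != 2:
            excluded.append((cells, "not exactly two columns"))
        elif not cells[0] or not cells[1]:
            excluded.append((cells, "empty source or target"))
        elif "," in cells[0] or "," in cells[1]:
            excluded.append((cells, "term contains a comma"))
        elif cells[0] in seen:
            excluded.append((cells, "duplicate source term"))
        else:
            seen.add(cells[0])
            kept.append((cells[0], cells[1]))
    return kept, excluded


# A tiny in-memory glossary with one good row and three problem rows.
raw = io.StringIO(
    "source term,termin przetłumaczony\n"
    "source term,inny termin\n"
    "empty target,\n"
    '"term, with comma",termin\n'
)
kept, excluded = clean_glossary(csv.reader(raw))

for source, target in kept:
    print("kept:", source, "->", target)
for cells, reason in excluded:
    print("excluded:", cells, "because", reason)
```

With a real file you would pass `csv.reader(open(path, newline="", encoding="utf-8"))` instead of the in-memory sample, and write `kept` back out with `csv.writer`. Keeping the excluded rows together with a reason makes it easier to fix them by hand and add them back later.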