[2403.13638] Pretraining Language Models Using Translationese

Authors: Meet Doshi and 2 other authors

Abstract: In this paper, we explore the utility of translationese, i.e., synthetic data created using machine translation, for pre-training language models (LMs) for low-resource languages (LRLs). Our simple methodology consists of translating large amounts of web-crawled monolingual documents (clean) into the LRLs, followed by filtering the translated documents using tiny LMs trained on small but clean LRL data. Taking the case of Indian languages, we pre-train LMs from scratch with 28M and 85M parameters, and then fine-tune them on 5 downstream natural language understanding (NLU) and 4 natural language generation (NLG) tasks. We observe that pre-training on filtered synthetic data leads to relative performance drops of only 0.87% for NLU and 2.35% for NLG compared to pre-training on clean data, and this gap diminishes further upon the inclusion of a small amount of clean data. We also study the impact of synthetic data filtering and the choice of source language for synthetic data generation. Furthermore, evaluating continually pre-trained larger models such as Gemma-2B and Llama-3-8B in few-shot settings, we observe that using synthetic data is competitive with using clean data. Our findings suggest that synthetic data shows promise for bridging the pre-training gap between English and LRLs.
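The abstract does not spell out how the tiny LMs are used to filter translated documents; a common approach with a small language model is perplexity thresholding. The sketch below illustrates that idea only: the checkpoint path, the threshold value, and the choice of perplexity as the filtering criterion are assumptions for illustration, not details taken from the paper.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint for a tiny LM trained on small but clean LRL data.
MODEL_NAME = "path/to/tiny-lrl-lm"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of a document under the tiny LM (lower = more fluent)."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        # Causal LM loss is the mean token-level cross-entropy.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

def filter_documents(translated_docs, threshold=200.0):
    """Keep machine-translated documents scored as fluent by the tiny LM.
    The threshold is arbitrary here and would need tuning per language."""
    return [doc for doc in translated_docs if perplexity(doc) < threshold]
```

In this reading, the tiny LM acts only as a quality gate: documents the model finds implausible are dropped before the filtered synthetic corpus is used to pre-train the 28M and 85M parameter LMs.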

Submission history

From: Meet Doshi [view email]
[v1]
Wed, 20 Mar 2024 14:41:01 UTC (8,205 KB)
[v2]
Thu, 21 Mar 2024 04:03:59 UTC (8,205 KB)
[v3]
Sun, 6 Jul 2025 14:59:46 UTC (9,665 KB)
