Authors:
(1) Aviad Rom, The Data Science Institute, Reichman University, Herzliya, Israel;
(2) Kfir Bar, The Data Science Institute, Reichman University, Herzliya, Israel.
K et al. (2020) suggested that structural similarity between languages is essential for a language model's multilingual generalization capabilities. Their suggestion was further discussed by Dufter and Schütze (2020), who highlighted the components essential for a model to possess "multilinguality" and showed that word order is key to a model's cross-lingual generalization capabilities.

mBERT, introduced by Devlin et al. (2019), was a pioneering language model that covered multiple languages, including Arabic and Hebrew. However, both Arabic and Hebrew are significantly under-represented in its pre-training data, resulting in inferior performance compared to equivalent monolingual models on various downstream tasks (Antoun et al., 2020; Lan et al., 2020; Chriqui and Yahav, 2022; Seker et al., 2022). GigaBERT (Lan et al., 2020) is another multilingual model, trained on English and Arabic. Nevertheless, the best results on most well-known NLP tasks in both Arabic and Hebrew are typically achieved by one of the large monolingual models. CAMeLBERT (Inoue et al., 2021) is one such model; it is trained on texts written in Modern Standard Arabic (MSA), Classical Arabic, and dialectal variants. Among Hebrew language models, AlephBERT (Seker et al., 2022) stands out as one of the leading performers, alongside others such as HeBERT (Chriqui and Yahav, 2022).

Among other datasets, the monolingual models mentioned above use the relevant parts of the OSCAR dataset (Ortiz Suárez et al., 2020) for training. Our model relies solely on the OSCAR dataset for both Hebrew and Arabic, resulting in a considerably smaller total number of words per language than the existing monolingual language models.
The effect of transliteration on cross-lingual generalization was discussed previously by Dhamecha et al. (2021) and Chau and Smith (2021), and more recently by Moosa et al. (2023) and Purkayastha et al. (2023). None of these works study the languages of our focus, Arabic and Hebrew. Dhamecha et al. (2021) focused on languages from the Indo-Aryan family, which has been studied before for cross-lingual generalization and for which several multilingual models are publicly available. To the best of our knowledge, our work is the first to study generalization between Arabic and Hebrew, and apart from mBERT, no multilingual masked language model that includes both languages has been published.
Chau and Smith (2021) address generalization from high- to low-resourced languages. However, both Arabic and Hebrew are currently considered medium- to high-resourced languages. Furthermore, their evaluation focuses solely on token-level classification tasks, such as dependency parsing and part-of-speech tagging, whereas our evaluation targets machine translation, a sequence-to-sequence bilingual task.
Purkayastha et al. (2023) employ Romanization for transliteration, whereas we transliterate Arabic into the Hebrew script. Like Chau and Smith (2021), their evaluation centers on token-level classification tasks, which are not addressed in our work.
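To make the contrast with Romanization concrete, the sketch below shows what character-level transliteration of Arabic into the Hebrew script could look like. The letter mapping here is a partial, hypothetical one based on well-known Semitic cognate correspondences; it is not the mapping used in this paper, and a full system would also need to handle letters without a one-to-one counterpart and Hebrew final forms.

```python
# Illustrative character-level Arabic-to-Hebrew transliteration.
# NOTE: AR_TO_HE is a partial, assumed mapping for demonstration only,
# not the mapping used in the paper.
AR_TO_HE = {
    "ا": "א", "ب": "ב", "ج": "ג", "د": "ד", "ه": "ה",
    "و": "ו", "ز": "ז", "ح": "ח", "ط": "ט", "ي": "י",
    "ك": "כ", "ل": "ל", "م": "מ", "ن": "נ", "س": "ס",
    "ع": "ע", "ف": "פ", "ص": "צ", "ق": "ק", "ر": "ר",
    "ش": "ש", "ت": "ת",
}

def transliterate(text: str) -> str:
    """Map each Arabic character to a Hebrew counterpart, leaving
    unmapped characters (spaces, digits, punctuation) unchanged."""
    return "".join(AR_TO_HE.get(ch, ch) for ch in text)

print(transliterate("سلام"))  # 'salaam' -> 'סלאמ' under this toy mapping
```

The appeal of such a script-level mapping, as opposed to Romanization, is that the transliterated Arabic text shares its character vocabulary with Hebrew, letting a shared subword tokenizer expose surface-level overlap between the two languages.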
This paper is under CC BY 4.0 DEED license.