Cross outline

10/4/2023

Multilingual task-oriented dialogue (ToD) facilitates access to services and information for many (communities of) speakers. Nevertheless, its potential is not fully realized, as current multilingual ToD datasets, both for modular and end-to-end modeling, suffer from severe limitations. 1) When created from scratch, they are usually small in scale and fail to cover many possible dialogue flows. 2) Translation-based ToD datasets might lack naturalness and cultural specificity in the target language. In this work, to tackle these limitations, we propose a novel outline-based annotation process for multilingual ToD datasets, where domain-specific abstract schemata of dialogue are mapped into natural language outlines. These in turn guide the target language annotators in writing dialogues by providing instructions about each turn's intents and slots. Through this process we annotate a new large-scale dataset for evaluation of multilingual and cross-lingual ToD systems. Our Cross-lingual Outline-based Dialogue dataset (COD) enables natural language understanding, dialogue state tracking, and end-to-end dialogue evaluation in 4 diverse languages: Arabic, Indonesian, Russian, and Kiswahili. Qualitative and quantitative analyses of COD versus an equivalent translation-based dataset demonstrate improvements in data quality, unlocked by the outline-based approach. Finally, we benchmark a series of state-of-the-art systems for cross-lingual ToD, setting reference scores for future work and demonstrating that COD prevents the over-inflated performance typically observed with prior translation-based ToD datasets.

Therefore, the main driver of development in multilingual ToD is the creation of multilingual resources. However, even when available, they suffer from several pitfalls. Most are obtained by manual or semi-automatic translation of an English source (Castellucci et al., 2019; Bellomaria et al., 2019; Susanto and Lu, 2017; Upadhyay et al., 2018; Xu et al., 2020; Ding et al., 2022; Zuo et al., 2021, inter alia). While this process is cost-efficient and typically makes data and results comparable across languages, it yields dialogues that lack naturalness (Lembersky et al., 2012; Volansky et al., 2015) and are neither properly localized nor culture-specific (Clark et al., 2020). Further, they provide over-optimistic estimates of performance due to the artificial similarity between source and target texts (Artetxe et al., 2020).

As an alternative to translation, new ToD datasets can be created from scratch in a target language through the Wizard-of-Oz framework (WOZ; Kelley, 1984), where humans impersonate both the client and the assistant. However, this process is highly time- and money-consuming, thus failing to scale to large quantities of examples and languages, and often lacks coverage in terms of possible dialogue flows (Zhu et al., 2020; Quan et al., 2020).

To address all these gaps, in this work we devise a novel outline-based annotation pipeline for multilingual ToD datasets that combines the best of both processes. In particular, abstract dialogue schemata, specific to individual domains, are sampled from the English Schema-Guided Dialogue dataset (SGD; Shah et al., 2018; Rastogi et al., 2020). Then, the schemata are automatically mapped into outlines in English, which describe the intention that should underlie each dialogue turn and the slots of information it should contain, as shown in Table 1. Finally, outlines are paraphrased by human subjects into their native tongue and slot values are adapted to the target culture and geography. This ensures both the cost-effectiveness and cross-lingual comparability offered by manual translation, and the naturalness and culture-specificity of creating data from scratch. Through this process, we create the Cross-lingual Outline-based Dialogue dataset (termed COD), supporting natural language understanding (intent detection and slot labeling tasks), dialogue state tracking, and end-to-end dialogue modeling in 11 domains and 4 typologically and areally diverse languages: Arabic, Indonesian, Russian, and Kiswahili. The main motivation behind the outline-based approach is to avoid the known pitfalls of direct translation and produce evaluation data better representing the linguistic and cultural realities of each language in the sample.
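
To make the schema-to-outline step more concrete, here is a minimal Python sketch of how an abstract dialogue schema might be rendered as English outline instructions. It is an illustration only, not the authors' pipeline: the `Turn` class, the act names, the `TEMPLATES` wording, and the restaurant example are all assumptions made up for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    """One abstract dialogue turn (hypothetical structure)."""
    speaker: str                     # "USER" or "SYSTEM"
    act: str                         # e.g. "INFORM_INTENT", "REQUEST", "INFORM"
    intent: str = ""                 # filled only when the turn states an intent
    slots: dict = field(default_factory=dict)  # slot name -> value (or None)


# Hypothetical English templates, one per act; not the dataset's actual wording.
TEMPLATES = {
    "INFORM_INTENT": "Express the desire to {intent}.",
    "REQUEST": "Ask for the following: {names}.",
    "INFORM": "Provide the following: {pairs}.",
}


def turn_to_outline(turn: Turn) -> str:
    """Render one schema turn as a natural-language instruction."""
    names = ", ".join(turn.slots)
    pairs = ", ".join(f"{name} = {value}" for name, value in turn.slots.items())
    # str.format ignores keyword arguments the chosen template does not use.
    text = TEMPLATES[turn.act].format(intent=turn.intent, names=names, pairs=pairs)
    return f"{turn.speaker}: {text}"


def schema_to_outline(turns):
    """Map a whole schema (a list of turns) to a list of outline lines."""
    return [turn_to_outline(turn) for turn in turns]


# Made-up example schema for a restaurant domain.
schema = [
    Turn("USER", "INFORM_INTENT", intent="find a restaurant"),
    Turn("SYSTEM", "REQUEST", slots={"city": None, "cuisine": None}),
    Turn("USER", "INFORM", slots={"city": "San Jose", "cuisine": "Mexican"}),
]
outline = schema_to_outline(schema)
for line in outline:
    print(line)
```

In a setup like this, an annotator would read each outline line and write the corresponding utterance in their own language, replacing slot values such as the city with ones that fit the target culture and geography.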
