Works and Publications - National University of Mongolia

Research areas:

This information is displayed as registered in the National University of Mongolia (NUM) database by professors, lecturers, and staff. We accept no responsibility for incomplete or incorrect information.

Cognate Production Using Character-Based Neural Machine Translation Without Segmentation 2025

Author(s): З.Цолмон, Г.Амарсанаа, Б.Хуягбаатар, M.Tsendsuren
"Cognate Production Using Character-Based Neural Machine Translation Without Segmentation" IEEE Access, vol. 13, pp. 34824-34830, 2025-2-19

https://ieeexplore.ieee.org/document/10892102?source=authoralert

Abstract

Cognates are words that share a common origin or have been borrowed across languages, often exhibiting similarities in both sound and meaning. In this work, we introduce a fully character-level neural sequence-to-sequence model for cognate production that does not require any segmentation. Our model operates at the character level to transform a source word into its corresponding cognate in the target language, thereby obviating out-of-vocabulary issues and alleviating the need for subword segmentation. We evaluated our approach on a novel dataset and found that it outperforms both statistical machine translation baselines and prior neural methods on the same training dataset, as measured by standard coverage and mean reciprocal rank metrics. These results underscore the effectiveness of character-level sequence-to-sequence architectures for cognate generation in diverse language settings, including cross-alphabetic transformations.
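The abstract above names coverage and mean reciprocal rank (MRR) as its evaluation metrics without defining them. The following is a minimal sketch of how such metrics are commonly computed over ranked candidate lists. It assumes that coverage means the share of source words whose gold cognate appears among the top n candidates; the function names, the cutoff `n`, and the data layout are illustrative assumptions, not the paper's evaluation code.

```python
from typing import Dict, List, Tuple


def reciprocal_rank(candidates: List[str], gold: str) -> float:
    """Return 1/rank of the gold cognate in a ranked list, or 0.0 if it is absent."""
    for rank, candidate in enumerate(candidates, start=1):
        if candidate == gold:
            return 1.0 / rank
    return 0.0


def evaluate(
    predictions: Dict[str, List[str]], gold: Dict[str, str], n: int = 10
) -> Tuple[float, float]:
    """Return (coverage@n, MRR) over every source word in `gold`.

    Assumption: coverage@n is the fraction of source words whose gold
    cognate occurs anywhere in the top-n ranked candidates.
    """
    if not gold:
        return 0.0, 0.0
    hits = 0
    rr_sum = 0.0
    for source, target in gold.items():
        top_n = predictions.get(source, [])[:n]
        rr = reciprocal_rank(top_n, target)
        rr_sum += rr
        if rr > 0:
            hits += 1
    total = len(gold)
    return hits / total, rr_sum / total
```

Under these assumptions, a model that ranks the gold cognate first for every word scores 1.0 on both metrics, while one that only ever places it second still has full coverage but an MRR of 0.5.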

A taxonomy and review of generalization research in NLP 2023

Author(s): Б.Хуягбаатар, H.Dieuwke, G.Mario, D.Verna
"A taxonomy and review of generalization research in NLP" Nature Machine Intelligence, vol. 5, pp. 1161–1174, 2023-10-19

https://www.nature.com/articles/s42256-023-00729-y

Abstract

The ability to generalize well is one of the primary desiderata for models of natural language processing (NLP), but what ‘good generalization’ entails and how it should be evaluated is not well understood. In this Analysis we present a taxonomy for characterizing and understanding generalization research in NLP. The proposed taxonomy is based on an extensive literature review and contains five axes along which generalization studies can differ: their main motivation, the type of generalization they aim to solve, the type of data shift they consider, the source by which this data shift originated, and the locus of the shift within the NLP modelling pipeline. We use our taxonomy to classify over 700 experiments, and we use the results to present an in-depth analysis that maps out the current state of generalization research in NLP and make recommendations for which areas deserve attention in the future.

SIGMORPHON–UniMorph 2023 shared task 0: Typologically diverse morphological inflection 2023

Author(s): G.Omer, Б.Хуягбаатар, K.Salam, A.Aryaman, N.Garrett, V.Ekaterina
"SIGMORPHON–UniMorph 2023 shared task 0: Typologically diverse morphological inflection", Proceedings of the 20th SIGMORPHON workshop on Computational Research in Phonetics, Phonology, and Morphology, Canada, 2023-7-14, vol. 2023, pp. 117-125

Abstract

The 2023 SIGMORPHON–UniMorph shared task on typologically diverse morphological inflection included a wide range of languages: 26 languages from 9 primary language families. The data this year was all lemma-split, to allow testing models’ generalization ability, and structured along the new hierarchical schema presented in (Batsuren et al., 2022). The systems submitted this year, 9 in number, showed ingenuity and innovativeness, including hard attention for explainability and bidirectional decoding. Special treatment was also given by many participants to the newly-introduced data in Japanese, due to the high abundance of unseen Kanji characters in its test set.

Multilingual NLP: Algorithmic Transparency, Subword Tokenization, and Generalization Benchmark 2023

Author(s): Б.Хуягбаатар
"Multilingual NLP: Algorithmic Transparency, Subword Tokenization, and Generalization Benchmark", Монголын Мэдээллийн Технологи 2023, 2023-5-12, vol. 10, pp. 1

Abstract

The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema. This paper presents the expansions and improvements made on several fronts over the last couple of years (since McCarthy et al. (2020)). Collaborative efforts by numerous linguists have added 67 new languages, including 30 endangered languages. We have implemented several improvements to the extraction pipeline to tackle some issues, e.g. missing gender and macron information. We have also amended the schema to use a hierarchical structure that is needed for morphological phenomena like multiple-argument agreement and case stacking, while adding some missing morphological features to make the schema more inclusive. In light of the last UniMorph release, we also augmented the database with morpheme segmentation for 16 languages. This new release also makes a push towards inclusion of derivational morphology in UniMorph by enriching the data and annotation schema with instances representing derivational processes from MorphyNet. Finally, I will talk about the usages of morphological resources in NLP.

How Universal is Metonymy? Results from a Large-Scale Multilingual Analysis 2022

Author(s): K.Temuulen, B.Gábor, B.Thomas, M.Daariimaa, G.Fausto, Б.Хуягбаатар
"How Universal is Metonymy? Results from a Large-Scale Multilingual Analysis", the Workshop on Research in Computational Linguistic Typology and Multilingual NLP, United States of America, 2022-7-22, vol. 4, pp. 96-98

Abstract

Metonymy is regarded by most linguists as a universal cognitive phenomenon, especially since the emergence of the theory of conceptual mappings. However, the field data backing up claims of universality has not been large enough so far to provide conclusive evidence. We introduce a large-scale analysis of metonymy based on a lexical corpus of over 20 thousand metonymy instances from 189 languages and 69 genera. No prior study, to our knowledge, is based on linguistic coverage as broad as ours. Drawing on corpus analysis, evidence of universality is found at three levels: systematic metonymy in general, particular metonymy patterns, and specific metonymy concepts.

The SIGMORPHON 2022 Shared Task on Morpheme Segmentation 2022

Author(s): Б.Хуягбаатар, B.Gábor, A.Aryaman, M.Viktor, G.Kyle, Ž.Zdenek, Г.Амарсанаа, C.Ryan, V.Ekaterina
"The SIGMORPHON 2022 Shared Task on Morpheme Segmentation", SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, USA, 2022-7-14, vol. 2022, pp. 103-116

Abstract

The SIGMORPHON 2022 shared task on morpheme segmentation challenged systems to decompose a word into a sequence of morphemes and covered most types of morphology: compounds, derivations, and inflections. Subtask 1, word-level morpheme segmentation, covered 5 million words in 9 languages (Czech, English, Spanish, Hungarian, French, Italian, Russian, Latin, Mongolian) and received 13 system submissions from 7 teams; the best system averaged a 97.29% F1 score across all languages, ranging from English (93.84%) to Latin (99.38%). Subtask 2, sentence-level morpheme segmentation, covered 18,735 sentences in 3 languages (Czech, English, Mongolian) and received 10 system submissions from 3 teams; the best systems outperformed all three state-of-the-art subword tokenization methods (BPE, ULM, Morfessor2) by 30.71% absolute. To facilitate error analysis and support future studies of any type, we released all system predictions, the evaluation script, and all gold standard datasets.
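The abstract above reports F1 scores for morpheme segmentation. As an illustration only, the sketch below shows one common way to compute morpheme-level precision, recall, and F1 for a single word, treating the predicted and gold segmentations as multisets of morphemes. The function name and the multiset-overlap definition are assumptions made for this example; the shared task's released evaluation script may define the score differently.

```python
from collections import Counter
from typing import List, Tuple


def morpheme_prf(pred: List[str], gold: List[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for one word's predicted morphemes.

    Assumption: a predicted morpheme counts as correct if it also occurs in
    the gold segmentation, compared as multisets so repeats are matched once.
    """
    overlap = sum((Counter(pred) & Counter(gold)).values())
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: the system under-segments "unbreakable".
p, r, f1 = morpheme_prf(["un", "breakable"], ["un", "break", "able"])
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In this toy case only "un" matches, so precision is 1/2, recall is 1/3, and F1 is 0.4. A corpus-level score would then aggregate over all words, for instance by averaging.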

Using Linguistic Typology to Enrich Multilingual Lexicons: the Case of Lexical Gaps in Kinship 2022

Author(s): K.Temuulen, B.Gabor, Б.Хуягбаатар, F.Abed Alhakim, C.Nandu, Г.Амарсанаа, K.Hadi, C.Yamini, G.Fausto
"Using Linguistic Typology to Enrich Multilingual Lexicons: the Case of Lexical Gaps in Kinship", International Conference on Language Resources and Evaluation, France, 2022-6-21, vol. 13, pp. 1-10

Abstract

This paper describes a method to enrich lexical resources with content relating to linguistic diversity, based on knowledge from the field of lexical typology. We capture the phenomenon of diversity through the notions of lexical gap and language-specific word and use a systematic method to infer gaps semi-automatically on a large scale. As a first result obtained for the domain of kinship terminology, known to be very diverse throughout the world, we publish a lexico-semantic resource consisting of 198 domain concepts, 1,911 words, and 37,370 gaps covering 699 languages. We see potential in the use of resources such as ours for the improvement of a variety of cross-lingual NLP tasks, which we demonstrate through a downstream application for the evaluation of machine translation systems.

Language Diversity: Visible to Humans, Exploitable by Machines 2022

Author(s): B.Gabor, B.Erdenebileg, C.Yamini, Б.Хуягбаатар, A.Danish, G.Fausto
"Language Diversity: Visible to Humans, Exploitable by Machines", The Annual Meeting of the Association for Computational Linguistics, Ireland, 2022-5-22, vol. 60, pp. 1-10

Abstract

The Universal Knowledge Core (UKC) is a large multilingual lexical database with a focus on language diversity and covering over two thousand languages. The aim of the database, as well as its tools and data catalogue, is to make the abstract notion of linguistic diversity visually understandable for humans and formally exploitable by machines. The UKC website lets users explore millions of individual words and their meanings, but also phenomena of cross-lingual convergence and divergence, such as shared interlingual meanings, lexicon similarities, cognate clusters, or lexical gaps. The UKC LiveLanguage Catalogue, in turn, provides access to the underlying lexical data in a computer-processable form, ready to be reused in cross-lingual applications.

UniMorph 4.0 – Universal Morphology 2022

Author(s): Б.Хуягбаатар
"UniMorph 4.0 – Universal Morphology", International Conference on Language Resources and Evaluation, France, 2022-5-10, vol. 13, pp. 1-16

Abstract

The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation, and a type-level resource of annotated data in diverse languages realizing that schema. This paper presents the expansions and improvements on several fronts made in the last couple of years (since McCarthy et al. (2020)). Collaborative efforts by numerous linguists have added 66 new languages, including 24 endangered languages. We have implemented several improvements to the extraction pipeline to tackle some issues, e.g., missing gender and macron information. We have also amended the schema to use a hierarchical structure that is needed for morphological phenomena like multiple-argument agreement and case stacking, while adding some missing morphological features to make the schema more inclusive. In light of the last UniMorph release, we also augmented the database with morpheme segmentation for 16 languages. Lastly, this new release makes a push toward the inclusion of derivational morphology in UniMorph by enriching the data and annotation schema with instances representing derivational processes from MorphyNet.

Mitigating Bias in Algorithmic Systems - A Fish-Eye View 2022

Author(s): O.Kalia, O.Jahna, K.Styliani, Б.Хуягбаатар, G.Fausto, B.Veronika, S.Avital, H.Alan, K.Tsvi
"Mitigating Bias in Algorithmic Systems - A Fish-Eye View" ACM Computing Surveys, vol. 55, pp. 1-37, 2022-3-31

https://dl.acm.org/doi/abs/10.1145/3527152

Abstract

Mitigating bias in algorithmic systems is a critical issue drawing attention across communities within the information and computer sciences. Given the complexity of the problem and the involvement of multiple stakeholders – including developers, end users and third-parties – there is a need to understand the landscape of the sources of bias, and the solutions being proposed to address them, from a broad, cross-domain perspective. This survey provides a “fish-eye view,” examining approaches across four areas of research. The literature describes three steps toward a comprehensive treatment – bias detection, fairness management and explainability management – and underscores the need to work from within the system as well as from the perspective of stakeholders in the broader context.

MorphyNet: a Large Multilingual Database of Derivational and Inflectional Morphology 2021

Author(s): Б.Хуягбаатар, B.Gabor, G.Fausto
"MorphyNet: a Large Multilingual Database of Derivational and Inflectional Morphology", SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, Thailand, 2021-8-5, vol. 18, pp. 39–48

Abstract

Large-scale morphological databases provide essential input to a wide range of NLP applications. Inflectional data is of particular importance for morphologically rich (agglutinative and highly inflecting) languages, and derivations can be used, e.g. to infer the semantics of out-of-vocabulary words. Extending the scope of state-of-the-art multilingual morphological databases, we announce the release of MorphyNet, a high-quality resource with 15 languages, 519k derivational and 10.1M inflectional entries, and a rich set of morphological features. MorphyNet was extracted from Wiktionary using both hand-crafted and automated methods, and was manually evaluated to be of a precision higher than 98%. Both the resource generation logic and the resulting database are made freely available and are reusable as stand-alone tools or in combination with existing resources.

Гарал нэг үгийг үсгийн дараалалд тулгуурласан seq2seq загвараар үүсгэх нь [Generating Cognates with a Character-Sequence-Based seq2seq Model] 2021

Author(s): З.Цолмон, Г.Амарсанаа, Б.Хуягбаатар
"Гарал нэг үгийг үсгийн дараалалд тулгуурласан seq2seq загвараар үүсгэх нь" MONGOLIAN JOURNAL OF ENGINEERING AND APPLIED SCIENCES, vol. 2, pp. 1-6, 2021-5-14

http://seas.num.edu.mn/article/232

Abstract

Identifying cognates, words that exist across languages with similar spelling and pronunciation and the same meaning, makes it possible to create new language resources for computational linguistics tasks. This work aims to develop a method for automatically generating cognates based on the character sequence of a word. We built a seq2seq deep learning model that generates cognates for five language pairs, covering both similar and different scripts. When the trained model was evaluated by the character-level difference of the generated words, it generated cognates correctly with an average accuracy of 0.73.
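The abstract above states that generated words were scored by their character-level difference from the reference, but it does not give the formula. One plausible reading, shown here purely as an assumption, is a similarity of one minus the Levenshtein edit distance normalized by the longer word's length. The function names are hypothetical and the actual metric used in the paper may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ca
                    current[j - 1] + 1,      # insert cb
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def char_similarity(predicted: str, reference: str) -> float:
    """Assumed score: 1 - edit_distance / length of the longer string."""
    longest = max(len(predicted), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(predicted, reference) / longest


# Illustrative pair only: one inserted character out of six.
print(round(char_similarity("color", "colour"), 3))
```

Averaging such a score over a test set would yield a single number between 0 and 1, comparable in form to the 0.73 reported in the abstract, although that figure may have been computed differently.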

A Large and Evolving Cognate Database 2021

Author(s): Б.Хуягбаатар, B.Gabor, G.Fausto
"A Large and Evolving Cognate Database" Language Resources and Evaluation, vol. 56 or 57, pp. 1-23, 2021-3-1

https://www.springer.com/journal/10579/

Abstract

We present CogNet, a large-scale, automatically-built database of sense-tagged cognates (words of common origin and meaning across languages). CogNet is continuously evolving: its current version contains over 8 million cognate pairs over 338 languages and 35 writing systems, with new releases already in preparation. The paper presents the algorithm and input resources used for its computation, an evaluation of the result, as well as a quantitative analysis of cognate data leading to novel insights on language diversity. Furthermore, as an example of the use of large-scale cross-lingual knowledge bases for improving the quality of multilingual applications, we present a case study on the use of CogNet for bilingual lexicon induction in the framework of cross-lingual transfer learning.

CogNet: A Large-Scale Cognate Database 2019

Author(s): Б.Хуягбаатар, B.Gabor, G.Fausto
"CogNet: A Large-Scale Cognate Database", The Annual Meeting of the Association for Computational Linguistics, Italy, 2019-7-27, vol. 57, pp. 3136-3145

Abstract

This paper introduces CogNet, a new, large-scale lexical database that provides cognates (words of common origin and meaning) across languages. The database currently contains 3.1 million cognate pairs across 338 languages using 35 writing systems. The paper also describes the automated method by which cognates were computed from publicly available wordnets, with an accuracy evaluated at 94%. Finally, it presents statistics about the cognate data and some initial insights into it, hinting at a possible future exploitation of the resource by various fields of linguistics.

Building the Mongolian WordNet 2019

Author(s): Б.Хуягбаатар, Г.Амарсанаа, Ч.Алтангэрэл, G.Fausto
"Building the Mongolian WordNet", The 10th Global WordNet Conference, Poland, 2019-7-23, vol. 10, pp. 238-244

Abstract

This paper presents the Mongolian WordNet (MOW) and a general methodology for constructing it from various sources, e.g. lexical resources and expert translations. As of today, the MOW contains 23,665 synsets, 26,875 words, 2,979 glosses, and 213 examples. The manual evaluation of the resource estimated its quality at 96.4%.

End to End - Towards a Framework for Reducing Biases and Promoting Transparency of Algorithmic System 2019

Author(s): S.Avital, Б.Хуягбаатар, B.Veronika, G.Fausto, H.Alan, K.Styliani, K.Tsvi, O.Jahna
"End to End - Towards a Framework for Reducing Biases and Promoting Transparency of Algorithmic System", 14th International Workshop On Semantic And Social Media Adaptation And Personalization, Cyprus, 2019-6-4, vol. 14, pp. 1-6

https://ieeexplore.ieee.org/abstract/document/8864914

Abstract

Algorithms play an increasing role in our everyday lives. Recently, the harmful potential of biased algorithms has been recognized by researchers and practitioners. We have also witnessed a growing interest in ensuring the fairness and transparency of algorithmic systems. However, so far there is no agreed-upon solution and not even an agreed terminology. The proposed research defines the problem space, the solution space, and a prototype of a comprehensive framework for detecting and reducing biases in algorithmic systems.

Understanding and Exploiting Language Diversity

Author(s): Б.Хуягбаатар
