
Journal Update | SSCI Journal Computational Linguistics, Vol. 48, Issues 1–4 (2022)



COMPUTATIONAL LINGUISTICS

Volume 48, Issues 1–4, 2022

COMPUTATIONAL LINGUISTICS (SSCI Q1; 2021 impact factor: 7.778) published 37 items across Issues 1–4 of 2022, including 30 research articles, 2 book reviews, and 2 surveys. The research articles cover low-resource NLP, linguistic parameters for identifying pathology, novelty detection, N-best extraction, automatic emotion recognition, recurrent models, distributional semantics, pseudo-sample generation, multilingual sentence encoders, text classification, topic models, distributed representations, and more. Please feel free to share! (Coverage of 2022 is now complete.)

Table of Contents


ISSUE 1

ARTICLES

■ To Augment or Not to Augment? A Comparative Study on Text Augmentation Techniques for Low-Resource NLP, by Gözde Gül Şahin, 5–42.

■ Linguistic Parameters of Spontaneous Speech for Identifying Mild Cognitive Impairment and Alzheimer Disease, by Veronika Vincze, 43–76.

■ Novelty Detection: A Perspective from Natural Language Processing, by Tirthankar Ghosal, 77–117.

■ Improved N-Best Extraction with an Evaluation on Language Data, by Johanna Björklund, 119–153.


SURVEY

■ Deep Learning for Text Style Transfer: A Survey, by Di Jin, 155–205.


SQUIB

■ Probing Classifiers: Promises, Shortcomings, and Advances, by Yonatan Belinkov, 207–219.


LAST WORD

■ Revisiting the Boundary between ASR and NLU in the Age of Conversational Dialog Systems, by Manaal Faruqui, 221–232.


BOOK REVIEW

■ Natural Language Processing: A Machine Learning Perspective by Yue Zhang and Zhiyang Teng, by Julia Ive, 233–235.


ISSUE 2

ARTICLES

■ Ethics Sheet for Automatic Emotion Recognition and Sentiment Analysis, by Saif M. Mohammad, 239–278.

■ Domain Adaptation with Pre-trained Transformers for Query-Focused Abstractive Text Summarization, by Md Tahmid Rahman Laskar, 279–320.

■ Challenges of Neural Machine Translation for Short Texts, by Yu Wan, 321–342.

■ Annotation Curricula to Implicitly Train Non-Expert Annotators, by Ji-Ung Lee, 343–373.

■ Assessing Corpus Evidence for Formal and Psycholinguistic Constraints on Nonprojectivity, by Himanshu Yadav, 375–401.

■ Dual Attention Model for Citation Recommendation with Analyses on Explainability of Attention Mechanisms and Qualitative Experiments, by Yang Zhang, 403–470.


SQUIB

■ On Learning Interpreted Languages with Recurrent Models, by Denis Paperno, 471–482.


LAST WORD

■ Boring Problems Are Sometimes the Most Interesting, by Richard Sproat, 483–490.



ISSUE 3

ARTICLES

■ Linear-Time Calculation of the Expected Sum of Edge Lengths in Random Projective Linearizations of Trees, by Lluís Alemany-Puig, Ramon Ferrer-i-Cancho, Pages 491–516.

■ The Impact of Edge Displacement Vaserstein Distance on UD Parsing Performance, by Mark Anderson, Carlos Gómez-Rodríguez, Pages 517–554.

■ UDapter: Typology-based Language Adapters for Multilingual Dependency Parsing and Sequence Labeling, by Ahmet Üstün, Arianna Bisazza, Gosse Bouma, Gertjan van Noord, Pages 555–592.

■ Tractable Parsing for CCGs of Bounded Degree, by Lena Katharina Schiffer, Marco Kuhlmann, Giorgio Satta, Pages 593–633.

■ Investigating Language Relationships in Multilingual Sentence Encoders Through the Lens of Linguistic Typology, by Rochelle Choenni, Ekaterina Shutova, Pages 635–672.


SURVEY

■ Survey of Low-Resource Machine Translation, by Barry Haddow, Rachel Bawden, Antonio Valerio Miceli Barone, Pages 673–732.

■ Position Information in Transformers: An Overview, by Philipp Dufter, Martin Schmitt, Hinrich Schütze, Pages 733–763.


ISSUE 4

ARTICLES

■ Noun2Verb: Probabilistic Frame Semantics for Word Class Conversion, by Lei Yu, Yang Xu, Pages 783–818.

■ Enhancing Lifelong Language Learning by Improving Pseudo-Sample Generation, by Kasidis Kanwatchara, Thanapapas Horsuwan, Piyawat Lertvittayakumjorn, Pages 819–848.

■ Nucleus Composition in Transition-based Dependency Parsing, by Joakim Nivre, Ali Basirat, Luise Dürlich, Pages 849–886.

■ Effective Approaches to Neural Query Language Identification, by Xingzhang Ren, Baosong Yang, Dayiheng Liu, Pages 887–906.

■ Information Theory–based Compositional Distributional Semantics, by Enrique Amigó, Alejandro Ariza-Casabona, Victor Fresno, Pages 907–948.

■ Revise and Resubmit: An Intertextual Model of Text-based Collaboration in Peer Review, by Ilia Kuznetsov, Jan Buchmann, Max Eichler, Pages 949–986.

■ Hierarchical Interpretation of Neural Text Classification, by Hanqi Yan, Lin Gui, Yulan He, Pages 987–1020.

■ Neural Embedding Allocation: Distributed Representations of Topic Models, by Kamrun Naher Keya, Yannis Papanikolaou, James R. Foulds, Pages 1021–1052.

■ The Text Anonymization Benchmark (TAB): A Dedicated Corpus and Evaluation Framework for Text Anonymization, by Ildikó Pilán, Pierre Lison, Lilja Øvrelid, Pages 1053–1101.

■ How Much Does Lookahead Matter for Disambiguation? Partial Arabic Diacritization Case Study, by Saeed Esmail, Kfir Bar, Nachum Dershowitz, Pages 1103–1123.


REVIEWS

■ Explainable Natural Language Processing, by George Chrysostomou, Pages 1137–1139.


Abstracts

To Augment or Not to Augment? A Comparative Study on Text Augmentation Techniques for Low-Resource NLP 

Gözde Gül Şahin, Koç University, Computer Science and Engineering Department

Abstract Data-hungry deep neural networks have established themselves as the de facto standard for many NLP tasks, including the traditional sequence tagging ones. Despite their state-of-the-art performance on high-resource languages, they still fall behind their statistical counterparts in low-resource scenarios. One methodology to counterattack this problem is text augmentation, that is, generating new synthetic training data points from existing data. Although NLP has recently witnessed several new textual augmentation techniques, the field still lacks a systematic performance analysis on a diverse set of languages and sequence tagging tasks. To fill this gap, we investigate three categories of text augmentation methodologies that perform changes on the syntax (e.g., cropping sub-sentences), token (e.g., random word insertion), and character (e.g., character swapping) levels. We systematically compare the methods on part-of-speech tagging, dependency parsing, and semantic role labeling for a diverse set of language families using various models, including the architectures that rely on pretrained multilingual contextualized language models such as mBERT. Augmentation most significantly improves dependency parsing, followed by part-of-speech tagging and semantic role labeling. We find the experimented techniques to be effective on morphologically rich languages in general rather than analytic languages such as Vietnamese. Our results suggest that the augmentation techniques can further improve over strong baselines based on mBERT, especially for dependency parsing. We identify the character-level methods as the most consistent performers, while synonym replacement and syntactic augmenters provide inconsistent improvements. Finally, we discuss that the results most heavily depend on the task, language pair (e.g., syntactic-level techniques mostly benefit higher-level tasks and morphologically richer languages), and model type (e.g., token-level augmentation provides significant improvements for BPE, while character-level ones give generally higher scores for char and mBERT based models).
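
To make the three augmentation levels concrete, here is a minimal Python sketch with one operation per level: character swapping, random token insertion, and a much-simplified form of sentence cropping over a dependency tree. The probabilities, whitespace tokenization, and cropping heuristic are illustrative assumptions, not the paper's implementation.

```python
import random

random.seed(0)

def char_swap(word, p=0.1):
    # Character level: swap adjacent characters with probability p.
    chars = list(word)
    for i in range(len(chars) - 1):
        if random.random() < p:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def random_insertion(tokens, vocab, p=0.2):
    # Token level: insert a random vocabulary word after some tokens.
    out = []
    for tok in tokens:
        out.append(tok)
        if random.random() < p:
            out.append(random.choice(vocab))
    return out

def crop(tokens, heads):
    # Syntax level: keep the root plus one randomly chosen subtree,
    # a much-simplified stand-in for sentence cropping.
    root = heads.index(0)                  # heads are 1-indexed; 0 = root
    kids = [i for i, h in enumerate(heads) if h == root + 1]
    keep = {random.choice(kids)} if kids else set()
    while True:                            # add all descendants of the subtree
        new = {i for i, h in enumerate(heads) if h - 1 in keep} - keep
        if not new:
            break
        keep |= new
    keep.add(root)
    return [tokens[i] for i in sorted(keep)]

print(char_swap("augmentation", p=0.3))
print(random_insertion("a diverse set of languages".split(), ["very", "quite"]))
print(crop(["the", "dog", "barked", "loudly"], heads=[2, 3, 0, 3]))
```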



Linguistic Parameters of Spontaneous Speech for Identifying Mild Cognitive Impairment and Alzheimer Disease 

Veronika Vincze, MTA-SZTE Research Group on Artificial Intelligence

Martina Katalin Szabó, MTA TK, Computational Social Science – Research Center for Educational and Network Studies (CSS-RECENS), and University of Szeged, Institute of Informatics

Ildikó Hoffmann, Research Centre for Linguistics, Eötvös Loránd Research Network, and University of Szeged, Department of Hungarian Linguistics, Szeged

László Tóth, University of Szeged, Institute of Informatics

Magdolna Pákáski, University of Szeged, Department of Psychiatry

János Kálmán, University of Szeged, Department of Psychiatry

Gábor Gosztolya, MTA-SZTE Research Group on Artificial Intelligence


Abstract In this article, we seek to automatically identify Hungarian patients suffering from mild cognitive impairment (MCI) or mild Alzheimer disease (mAD) based on their speech transcripts, focusing only on linguistic features. In addition to the features examined in our earlier study, we introduce syntactic, semantic, and pragmatic features of spontaneous speech that might affect the detection of dementia. In order to ascertain the most useful features for distinguishing healthy controls, MCI patients, and mAD patients, we carry out a statistical analysis of the data and investigate the significance level of the extracted features among various speaker group pairs and for various speaking tasks. In the second part of the article, we use this rich feature set as a basis for an effective discrimination among the three speaker groups. In our machine learning experiments, we analyze the efficacy of each feature group separately. Our model that uses all the features achieves competitive scores, either with or without demographic information (3-class accuracy values: 68%–70%, 2-class accuracy values: 77.3%–80%). We also analyze how different data recording scenarios affect linguistic features and how they can be productively used when distinguishing MCI patients from healthy controls.



Novelty Detection: A Perspective from Natural Language Processing

Tirthankar Ghosal, Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic

Tanik Saikh, Department of Computer Science and Engineering, Indian Institute of Technology Patna, Patna, India

Tameesh Biswas, Department of Computer Science and Engineering, Indian Institute of Technology Patna, Patna, India

Asif Ekbal, Department of Computer Science and Engineering, Indian Institute of Technology Patna, Patna, India

Pushpak Bhattacharyya, Department of Computer Science and Engineering, Indian Institute of Technology Bombay, Powai, India

Abstract 

The quest for new information is an inborn human trait and has always been quintessential for human survival and progress. Novelty drives curiosity, which in turn drives innovation. In Natural Language Processing (NLP), Novelty Detection refers to finding text that has some new information to offer with respect to whatever is earlier seen or known. With the exponential growth of information all across the Web, there is an accompanying menace of redundancy. A considerable portion of the Web contents are duplicates, and we need efficient mechanisms to retain new information and filter out redundant information. However, detecting redundancy at the semantic level and identifying novel text is not straightforward because the text may have less lexical overlap yet convey the same information. On top of that, non-novel/redundant information in a document may have assimilated from multiple source documents, not just one. The problem surmounts when the subject of the discourse is documents, and numerous prior documents need to be processed to ascertain the novelty/non-novelty of the current one in concern. In this work, we build upon our earlier investigations for document-level novelty detection and present a comprehensive account of our efforts toward the problem. We explore the role of pre-trained Textual Entailment (TE) models to deal with multiple source contexts and present the outcome of our current investigations. We argue that a multipremise entailment task is one close approximation toward identifying semantic-level non-novelty. Our recent approach either performs comparably or achieves significant improvement over the latest reported results on several datasets and across several related tasks (paraphrasing, plagiarism, rewrite). We critically analyze our performance with respect to the existing state of the art and show the superiority and promise of our approach for future investigations. We also present our enhanced dataset TAP-DLND 2.0 and several baselines to the community for further research on document-level novelty detection.
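
As a rough illustration of the entailment-based idea (not the authors' system), a target document can be scored as novel when no source document entails it. The sketch below assumes an off-the-shelf NLI model (roberta-large-mnli, whose label index 2 is entailment) as the textual-entailment component, and max-aggregation over premises.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "roberta-large-mnli"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).eval()

sources = ["The company reported record profits in Q3.",
           "Profits hit an all-time high last quarter."]
target = "The firm achieved its highest-ever quarterly profit."

with torch.no_grad():
    scores = []
    for premise in sources:
        inputs = tok(premise, target, return_tensors="pt", truncation=True)
        probs = model(**inputs).logits.softmax(-1)[0]
        scores.append(probs[2].item())     # entailment probability

novelty = 1.0 - max(scores)                # entailed by any source => not novel
print(f"novelty score: {novelty:.2f}")
```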



Improved N-Best Extraction with an Evaluation on Language Data 

Johanna Björklund, Umeå University, Department of Computing Science

Frank Drewes, Umeå University, Department of Computing Science

Anna Jonsson, Umeå University, Department of Computing Science




Deep Learning for Text Style Transfer: A Survey 

Di Jin, Amazon, Alexa AI

Zhijing Jin, Max Planck Institute for Intelligent Systems, Empirical Inference Department and ETH Zürich, Department of Computer Science

Zhiting Hu, UC San Diego, Halıcıoğlu Data Science Institute (HDSI)

Olga Vechtomova, University of Waterloo, Faculty of Engineering

Rada Mihalcea, University of Michigan, EECS, College of Engineering

Abstract Text style transfer is an important task in natural language generation, which aims to control certain attributes in the generated text, such as politeness, emotion, humor, and many others. It has a long history in the field of natural language processing, and recently has re-gained significant attention thanks to the promising performance brought by deep neural models. In this article, we present a systematic survey of the research on neural text style transfer, spanning over 100 representative articles since the first neural text style transfer work in 2017. We discuss the task formulation, existing datasets and subtasks, evaluation, as well as the rich methodologies in the presence of parallel and non-parallel data. We also provide discussions on a variety of important topics regarding the future development of this task.



Probing Classifiers: Promises, Shortcomings, and Advances

Yonatan Belinkov, Technion – Israel Institute of Technology

Abstract 

Probing classifiers have emerged as one of the prominent methodologies for interpreting and analyzing deep neural network models of natural language processing. The basic idea is simple—a classifier is trained to predict some linguistic property from a model’s representations—and has been used to examine a wide variety of models and properties. However, recent studies have demonstrated various methodological limitations of this approach. This squib critically reviews the probing classifiers framework, highlighting their promises, shortcomings, and advances.
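
The framework the squib reviews is simple enough to sketch end to end: freeze some representations, then train a lightweight classifier to predict a linguistic property from them. The random features below merely stand in for real model states; everything here is an illustrative assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-ins for frozen representations (e.g., one vector per token from a
# pretrained encoder) and a linguistic property (e.g., POS tag ids).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 768))              # hypothetical hidden states
y = (X[:, :10].sum(axis=1) > 0).astype(int)   # property partly linearly encoded

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# High accuracy is taken as evidence the property is (linearly) decodable
# from the representations, an inference the squib scrutinizes.
print("probe accuracy:", probe.score(X_te, y_te))
```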



Revisiting the Boundary between ASR and NLU in the Age of Conversational Dialog Systems

Manaal Faruqui, Google Assistant

Dilek Hakkani-Tür, Amazon Alexa AI

Abstract As more users across the world are interacting with dialog agents in their daily life, there is a need for better speech understanding that calls for renewed attention to the dynamics between research in automatic speech recognition (ASR) and natural language understanding (NLU). We briefly review these research areas and lay out the current relationship between them. In light of the observations we make in this article, we argue that (1) NLU should be cognizant of the presence of ASR models being used upstream in a dialog system’s pipeline, (2) ASR should be able to learn from errors found in NLU, (3) there is a need for end-to-end data sets that provide semantic annotations on spoken input, (4) there should be stronger collaboration between ASR and NLU research communities.


Natural Language Processing: A Machine Learning Perspective by Yue Zhang and Zhiyang Teng

Julia Ive, Queen Mary University of London, UK

Abstract Natural Language Processing (NLP) is a discipline at the crossroads of Artificial Intelligence (Machine Learning [ML] as its part), Linguistics, Cognitive Science, and Computer Science that enables machines to analyze and generate natural language data. The multi-disciplinary nature of NLP attracts specialists of various backgrounds, mostly with the knowledge of Linguistics and ML. As the discipline is largely practice-oriented, traditionally NLP textbooks are focused on concrete tasks and tend to elaborate on the linguistic peculiarities of ML approaches to NLP. They also often introduce predominantly either traditional ML or deep learning methods. This textbook introduces NLP from the ML standpoint, elaborating on fundamental approaches and algorithms used in the field such as statistical and deep learning models, generative and discriminative models, supervised and unsupervised models, and so on. In spite of the density of the material, the book is very easy to follow. The complexity of the introduced topics is built up gradually with references to previously introduced concepts while relying on a carefully observed unified notation system. The textbook is oriented to prepare the final-year undergraduate, as well as graduate students of relevant disciplines, for the NLP course and stimulate related research activities. Considering the comprehensiveness of the topics covered in an accessible way, the textbook is also suitable for NLP engineers, non-ML specialists, and a broad range of readers interested in the topic.



Ethics Sheet for Automatic Emotion Recognition and Sentiment Analysis

Saif M. Mohammad, National Research Council Canada

Abstract The importance and pervasiveness of emotions in our lives makes affective computing a tremendously important and vibrant line of work. Systems for automatic emotion recognition (AER) and sentiment analysis can be facilitators of enormous progress (e.g., in improving public health and commerce) but also enablers of great harm (e.g., for suppressing dissidents and manipulating voters). Thus, it is imperative that the affective computing community actively engage with the ethical ramifications of their creations. In this article, I have synthesized and organized information from AI Ethics and Emotion Recognition literature to present fifty ethical considerations relevant to AER. Notably, this ethics sheet fleshes out assumptions hidden in how AER is commonly framed, and in the choices often made regarding the data, method, and evaluation. Special attention is paid to the implications of AER on privacy and social groups. Along the way, key recommendations are made for responsible AER. The objective of the ethics sheet is to facilitate and encourage more thoughtfulness on why to automate, how to automate, and how to judge success well before the building of AER systems. Additionally, the ethics sheet acts as a useful introductory document on emotion recognition (complementing survey articles).



Domain Adaptation with Pre-trained Transformers for Query-Focused Abstractive Text Summarization 

Md Tahmid Rahman Laskar, Dialpad Canada Inc., Information Retrieval and Knowledge Management Research Lab, York University

Enamul Hoque, School of Information Technology, York University

Jimmy Xiangji Huang, Information Retrieval and Knowledge Management Research Lab, York University

Abstract The Query-Focused Text Summarization (QFTS) task aims at building systems that generate the summary of the text document(s) based on the given query. A key challenge in addressing this task is the lack of large labeled data for training the summarization model. In this article, we address this challenge by exploring a series of domain adaptation techniques. Given the recent success of pre-trained transformer models in a wide range of natural language processing tasks, we utilize such models to generate abstractive summaries for the QFTS task for both single-document and multi-document scenarios. For domain adaptation, we apply a variety of techniques using pre-trained transformer-based summarization models including transfer learning, weakly supervised learning, and distant supervision. Extensive experiments on six datasets show that our proposed approach is very effective in generating abstractive summaries for the QFTS task while setting a new state-of-the-art result in several datasets across a set of automatic and human evaluation metrics.


Challenges of Neural Machine Translation for Short Texts 

Yu Wan, NLP2CT Lab, University of Macau

Baosong Yang, Alibaba Group

Derek Fai Wong, NLP2CT Lab, University of Macau

Lidia Sam Chao, NLP2CT Lab, University of Macau

Liang Yao, Alibaba Group

Haibo Zhang, Alibaba Group

Boxing Chen, Alibaba Group

Abstract Short texts (STs) present in a variety of scenarios, including query, dialog, and entity names. Most of the exciting studies in neural machine translation (NMT) are focused on tackling open problems concerning long sentences rather than short ones. The intuition behind is that, with respect to human learning and processing, short sequences are generally regarded as easy examples. In this article, we first dispel this speculation via conducting preliminary experiments, showing that the conventional state-of-the-art NMT approach, namely, Transformer (Vaswani et al. 2017), still suffers from over-translation and mistranslation errors over STs. After empirically investigating the rationale behind this, we summarize two challenges in NMT for STs associated with translation error types above, respectively: (1) the imbalanced length distribution in training set intensifies model inference calibration over STs, leading to more over-translation cases on STs; and (2) the lack of contextual information forces NMT to have higher data uncertainty on short sentences, and thus NMT model is troubled by considerable mistranslation errors. Some existing approaches, like balancing data distribution for training (e.g., data upsampling) and complementing contextual information (e.g., introducing translation memory) can alleviate the translation issues in NMT for STs. We encourage researchers to investigate other challenges in NMT for STs, thus reducing ST translation errors and enhancing translation quality.


Annotation Curricula to Implicitly Train Non-Expert Annotators 

Ji-Ung Lee, UKP Lab / TU Darmstadt

Jan-Christoph Klie, UKP Lab / TU Darmstadt

Iryna Gurevych, UKP Lab / TU Darmstadt

Abstract Annotation studies often require annotators to familiarize themselves with the task, its annotation scheme, and the data domain. This can be overwhelming in the beginning, mentally taxing, and induce errors into the resulting annotations; especially in citizen science or crowdsourcing scenarios where domain expertise is not required. To alleviate these issues, this work proposes annotation curricula, a novel approach to implicitly train annotators. The goal is to gradually introduce annotators into the task by ordering instances to be annotated according to a learning curriculum. To do so, this work formalizes annotation curricula for sentence- and paragraph-level annotation tasks, defines an ordering strategy, and identifies well-performing heuristics and interactively trained models on three existing English datasets. Finally, we provide a proof of concept for annotation curricula in a carefully designed user study with 40 voluntary participants who are asked to identify the most fitting misconception for English tweets about the Covid-19 pandemic. The results indicate that using a simple heuristic to order instances can already significantly reduce the total annotation time while preserving a high annotation quality. Annotation curricula thus can be a promising research direction to improve data collection. To facilitate future research—for instance, to adapt annotation curricula to specific tasks and expert annotation scenarios—all code and data from the user study consisting of 2,400 annotations is made available.
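
In the spirit of the simple ordering heuristics the paper evaluates (though not its actual heuristics or interactively trained models), a curriculum can be as small as sorting instances from easy to hard with a crude difficulty proxy:

```python
def curriculum_order(instances):
    # Easy-to-hard ordering with a crude proxy for difficulty:
    # shorter tweets first. Real curricula would use better heuristics
    # or an interactively trained model.
    return sorted(instances, key=lambda s: len(s.split()))

tweets = ["Masks are useless.",
          "5G spreads the virus.",
          "Vaccines permanently alter your DNA, hidden studies say."]
for t in curriculum_order(tweets):
    print(t)
```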


Assessing Corpus Evidence for Formal and Psycholinguistic Constraints on Nonprojectivity

Himanshu Yadav, Department of Linguistics, University of Potsdam

Samar Husain, Department of Humanities and Social Sciences, Indian Institute of Technology, Delhi

Richard Futrell, Department of Language Science, University of California, Irvine


Abstract Formal constraints on crossing dependencies have played a large role in research on the formal complexity of natural language grammars and parsing. Here we ask whether the apparent evidence for constraints on crossing dependencies in treebanks might arise because of independent constraints on trees, such as low arity and dependency length minimization. We address this question using two sets of experiments. In Experiment 1, we compare the distribution of formal properties of crossing dependencies, such as gap degree, between real trees and baseline trees matched for rate of crossing dependencies and various other properties. In Experiment 2, we model whether two dependencies cross, given certain psycholinguistic properties of the dependencies. We find surprisingly weak evidence for constraints originating from the mild context-sensitivity literature (gap degree and well-nestedness) beyond what can be explained by constraints on rate of crossing dependencies, topological properties of the trees, and dependency length. However, measures that have emerged from the parsing literature (e.g., edge degree, end-point crossings, and heads’ depth difference) differ strongly between real and random trees. Modeling results show that cognitive metrics relating to information locality and working-memory limitations affect whether two dependencies cross or not, but they do not fully explain the distribution of crossing dependencies in natural languages. Together these results suggest that crossing constraints are better characterized by processing pressures than by mildly context-sensitive constraints.
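
For readers who want the basic quantity operationalized: two dependencies cross iff exactly one endpoint of one edge falls strictly inside the span of the other. A small sketch for counting crossing pairs in a single tree (the head encoding below is an assumption):

```python
from itertools import combinations

def crossings(heads):
    """Count crossing dependency pairs. heads[i] is the 1-indexed head of
    word i+1, with 0 marking the root (the root's edge is ignored)."""
    edges = [tuple(sorted((i + 1, h))) for i, h in enumerate(heads) if h != 0]
    def cross(a, b):
        (i, j), (k, l) = a, b
        return (i < k < j < l) or (k < i < l < j)
    return sum(cross(a, b) for a, b in combinations(edges, 2))

# A toy nonprojective tree with exactly one crossing pair.
print(crossings([2, 0, 4, 2, 3]))  # -> 1
```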


Dual Attention Model for Citation Recommendation with Analyses on Explainability of Attention Mechanisms and Qualitative Experiments 

Yang Zhang, Graduate School of Informatics, Kyoto University

Qiang Ma, Graduate School of Informatics, Kyoto University

Abstract 

Based on an exponentially increasing number of academic articles, discovering and citing comprehensive and appropriate resources have become non-trivial tasks. Conventional citation recommendation methods suffer from severe information losses. For example, they do not consider the section header of the paper that the author is writing and for which they need to find a citation, the relatedness between the words in the local context (the text span that describes a citation), or the importance of each word from the local context. These shortcomings make such methods insufficient for recommending adequate citations to academic manuscripts. In this study, we propose a novel embedding-based neural network called dual attention model for citation recommendation (DACR) to recommend citations during manuscript preparation. Our method adapts the embedding of three semantic pieces of information: words in the local context, structural contexts, and the section on which the author is working. A neural network model is designed to maximize the similarity between the embedding of the three inputs (local context words, section headers, and structural contexts) and the target citation appearing in the context. The core of the neural network model comprises self-attention and additive attention; the former aims to capture the relatedness between the contextual words and structural context, and the latter aims to learn their importance. Recommendation experiments on real-world datasets demonstrate the effectiveness of the proposed approach. To seek explainability on DACR, particularly the two attention mechanisms, the learned weights from them are investigated to determine how the attention mechanisms interpret “relatedness” and “importance” through the learned weights. In addition, qualitative analyses were conducted to testify that DACR could find necessary citations that were not noticed by the authors in the past due to the limitations of the keyword-based searching.


On Learning Interpreted Languages with Recurrent Models

Denis Paperno, Utrecht University, Department of Languages, Literature and Communication

Abstract Can recurrent neural nets, inspired by human sequential data processing, learn to understand language? We construct simplified data sets reflecting core properties of natural language as modeled in formal syntax and semantics: recursive syntactic structure and compositionality. We find LSTM and GRU networks to generalize to compositional interpretation well, but only in the most favorable learning settings, with a well-paced curriculum, extensive training data, and left-to-right (but not right-to-left) composition.


Boring Problems Are Sometimes the Most Interesting

Richard Sproat, Google, Japan

Abstract In a recent position paper, Turing Award Winners Yoshua Bengio, Geoffrey Hinton, and Yann LeCun make the case that symbolic methods are not needed in AI and that, while there are still many issues to be resolved, AI will be solved using purely neural methods. In this piece I issue a challenge: Demonstrate that a purely neural approach to the problem of text normalization is possible. Various groups have tried, but so far nobody has eliminated the problem of unrecoverable errors, errors where, due to insufficient training data or faulty generalization, the system substitutes some other reading for the correct one. Solutions have been proposed that involve a marriage of traditional finite-state methods with neural models, but thus far nobody has shown that the problem can be solved using neural methods alone. Though text normalization is hardly an “exciting” problem, I argue that until one can solve “boring” problems like that using purely AI methods, one cannot claim that AI is a success.



Linear-Time Calculation of the Expected Sum of Edge Lengths in Random Projective Linearizations of Trees

Lluís Alemany-Puig, Universitat Politècnica de Catalunya, Barcelona, Catalonia, Spain

Ramon Ferrer-i-Cancho, Universitat Politècnica de Catalunya, Barcelona, Catalonia, Spain


Abstract The syntactic structure of a sentence is often represented using syntactic dependency trees. The sum of the distances between syntactically related words has been in the limelight for the past decades. Research on dependency distances led to the formulation of the principle of dependency distance minimization whereby words in sentences are ordered so as to minimize that sum. Numerous random baselines have been defined to carry out related quantitative studies on languages. The simplest random baseline is the expected value of the sum in unconstrained random permutations of the words in the sentence, namely, when all the shufflings of the words of a sentence are allowed and equally likely. Here we focus on a popular baseline: random projective permutations of the words of the sentence, that is, permutations where the syntactic dependency structure is projective, a formal constraint that sentences satisfy often in languages. Thus far, the expectation of the sum of dependency distances in random projective shufflings of a sentence has been estimated approximately with a Monte Carlo procedure whose cost is of the order of Rn, where n is the number of words of the sentence and R is the number of samples; it is well known that the larger R is, the lower the error of the estimation but the larger the time cost. Here we present formulae to compute that expectation without error in time of the order of n. Furthermore, we show that star trees maximize it, and provide an algorithm to retrieve the trees that minimize it.
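
The Monte Carlo baseline the abstract contrasts with is easy to sketch for the simplest case, unconstrained random permutations, where the expectation also has the known closed form (n-1)(n+1)/3 for a tree of n words; sampling projective permutations, the paper's actual focus, takes more machinery. The toy tree below is an assumption.

```python
import random

def expected_sum_mc(heads, R=10000):
    """Monte Carlo estimate (cost ~ R*n) of the expected sum of dependency
    distances under unconstrained random shufflings of the words.
    heads[i] is the 1-indexed head of word i+1; 0 marks the root."""
    n = len(heads)
    edges = [(i, h - 1) for i, h in enumerate(heads) if h != 0]
    total = 0
    for _ in range(R):
        pos = list(range(n))
        random.shuffle(pos)                # pos[w] = position of word w
        total += sum(abs(pos[u] - pos[v]) for u, v in edges)
    return total / R

heads = [2, 0, 2, 3, 3]                    # a 5-word toy tree
est = expected_sum_mc(heads)
# Closed form for the unconstrained baseline only: (n-1)(n+1)/3.
exact = (len(heads) - 1) * (len(heads) + 1) / 3
print(f"MC estimate {est:.2f} vs closed form {exact:.2f}")
```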


The Impact of Edge Displacement Vaserstein Distance on UD Parsing Performance

Mark Anderson, Universidade da Coruña, CITIC

Carlos Gómez-Rodríguez, Universidade da Coruña, CITIC

Abstract We contribute to the discussion on parsing performance in NLP by introducing a measurement that evaluates the differences between the distributions of edge displacement (the directed distance of edges) seen in training and test data. We hypothesize that this measurement will be related to differences observed in parsing performance across treebanks. We motivate this by building upon previous work and then attempt to falsify this hypothesis by using a number of statistical methods. We establish that there is a statistical correlation between this measurement and parsing performance even when controlling for potential covariants. We then use this to establish a sampling technique that gives us an adversarial and complementary split. This gives an idea of the lower and upper bounds of parsing systems for a given treebank in lieu of freshly sampled data. In a broader sense, the methodology presented here can act as a reference for future correlation-based exploratory work in NLP.
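
The core measurement is the 1-Wasserstein (Vaserstein) distance between two empirical distributions of edge displacement, which SciPy computes directly; the toy treebanks and head encoding below are assumptions for illustration.

```python
from scipy.stats import wasserstein_distance

def edge_displacements(heads):
    """Directed distances head_position - dependent_position.
    heads are 1-indexed; 0 marks the root, whose edge is skipped."""
    return [h - (i + 1) for i, h in enumerate(heads) if h != 0]

train = edge_displacements([2, 0, 2, 5, 3])   # toy "training treebank"
test = edge_displacements([0, 1, 2, 2, 4])    # toy "test treebank"
# 1-Wasserstein distance between the two displacement distributions.
print(wasserstein_distance(train, test))
```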



UDapter: Typology-based Language Adapters for Multilingual Dependency Parsing and Sequence Labeling

Ahmet Üstün, University of Groningen, Center for Language and Cognition

Arianna Bisazza, University of Groningen, Center for Language and Cognition

Gosse Bouma, University of Groningen, Center for Language and Cognition

Gertjan van Noord, University of Groningen, Center for Language and Cognition


Abstract Recent advances in multilingual language modeling have brought the idea of a truly universal parser closer to reality. However, such models are still not immune to the “curse of multilinguality”: Cross-language interference and restrained model capacity remain major obstacles. To address this, we propose a novel language adaptation approach by introducing contextual language adapters to a multilingual parser. Contextual language adapters make it possible to learn adapters via language embeddings while sharing model parameters across languages based on contextual parameter generation. Moreover, our method allows for an easy but effective integration of existing linguistic typology features into the parsing model. Because not all typological features are available for every language, we further combine typological feature prediction with parsing in a multi-task model that achieves very competitive parsing performance without the need for an external prediction system for missing features.
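
A schematic PyTorch sketch of contextual parameter generation as the abstract describes it: a shared generator maps a language embedding to the weights of a bottleneck adapter, so parameters are shared across languages while each language still receives its own adapter. All dimensions and the residual layout are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextualAdapter(nn.Module):
    def __init__(self, hidden=768, bottleneck=64, lang_dim=32):
        super().__init__()
        self.h, self.b = hidden, bottleneck
        # One shared generator produces all adapter parameters.
        n_params = 2 * hidden * bottleneck + bottleneck + hidden
        self.generator = nn.Linear(lang_dim, n_params)

    def forward(self, x, lang_emb):
        p = self.generator(lang_emb)       # adapter weights for this language
        h, b = self.h, self.b
        W_down, p = p[:h * b].view(b, h), p[h * b:]
        b_down, p = p[:b], p[b:]
        W_up, p = p[:h * b].view(h, b), p[h * b:]
        b_up = p
        z = F.relu(F.linear(x, W_down, b_down))
        return x + F.linear(z, W_up, b_up)  # residual bottleneck adapter

adapter = ContextualAdapter()
out = adapter(torch.randn(2, 5, 768), torch.randn(32))
print(out.shape)  # torch.Size([2, 5, 768])
```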



Tractable Parsing for CCGs of Bounded Degree

Lena Katharina Schiffer, Leipzig University, Faculty of Mathematics and Computer Science

Marco Kuhlmann, Linköping University, Department of Computer and Information Science

Giorgio Satta, University of Padua, Department of Information Engineering


Abstract Unlike other mildly context-sensitive formalisms, Combinatory Categorial Grammar (CCG) cannot be parsed in polynomial time when the size of the grammar is taken into account. Refining this result, we show that the parsing complexity of CCG is exponential only in the maximum degree of composition. When that degree is fixed, parsing can be carried out in polynomial time. Our finding is interesting from a linguistic perspective because a bounded degree of composition has been suggested as a universal constraint on natural language grammar. Moreover, ours is the first complexity result for a version of CCG that includes substitution rules, which are used in practical grammars but have been ignored in theoretical work.



Investigating Language Relationships in Multilingual Sentence Encoders Through the Lens of Linguistic Typology

Rochelle Choenni, University of Amsterdam, Institute for Logic, Language and Computation (ILLC)

Ekaterina Shutova, University of Amsterdam, Institute for Logic, Language and Computation (ILLC)


Abstract Multilingual sentence encoders have seen much success in cross-lingual model transfer for downstream NLP tasks. The success of this transfer is, however, dependent on the model’s ability to encode the patterns of cross-lingual similarity and variation. Yet, we know relatively little about the properties of individual languages or the general patterns of linguistic variation that the models encode. In this article, we investigate these questions by leveraging knowledge from the field of linguistic typology, which studies and documents structural and semantic variation across languages. We propose methods for separating language-specific subspaces within state-of-the-art multilingual sentence encoders (LASER, M-BERT, XLM, and XLM-R) with respect to a range of typological properties pertaining to lexical, morphological, and syntactic structure. Moreover, we investigate how typological information about languages is distributed across all layers of the models. Our results show interesting differences in encoding linguistic variation associated with different pretraining strategies. In addition, we propose a simple method to study how shared typological properties of languages are encoded in two state-of-the-art multilingual models—M-BERT and XLM-R. The results provide insight into their information-sharing mechanisms and suggest that these linguistic properties are encoded jointly across typologically similar languages in these models.



Survey of Low-Resource Machine Translation

Barry Haddow, University of Edinburgh, School of Informatics

Rachel Bawden, Inria, France

Antonio Valerio Miceli Barone, University of Edinburgh, School of Informatics

Jindřich Helcl, University of Edinburgh, School of Informatics

Alexandra Birch, University of Edinburgh, School of Informatics


Abstract We present a survey covering the state of the art in low-resource machine translation (MT) research. There are currently around 7,000 languages spoken in the world and almost all language pairs lack significant resources for training machine translation models. There has been increasing interest in research addressing the challenge of producing useful translation models when very little translated training data is available. We present a summary of this topical research field and provide a description of the techniques evaluated by researchers in several recent shared tasks in low-resource MT.



Position Information in Transformers: An Overview

Philipp Dufter, Center for Information and Language Processing, LMU Munich

Martin Schmitt, Center for Information and Language Processing, LMU Munich

Hinrich Schütze, Center for Information and Language Processing, LMU Munich


Abstract Transformers are arguably the main workhorse in recent natural language processing research. By definition, a Transformer is invariant with respect to reordering of the input. However, language is inherently sequential and word order is essential to the semantics and syntax of an utterance. In this article, we provide an overview and theoretical comparison of existing methods to incorporate position information into Transformer models. The objectives of this survey are to (1) showcase that position information in Transformer is a vibrant and extensive research area; (2) enable the reader to compare existing methods by providing a unified notation and systematization of different approaches along important model dimensions; (3) indicate what characteristics of an application should be taken into account when selecting a position encoding; and (4) provide stimuli for future research.
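
As one concrete instance of the schemes the survey systematizes, the original absolute sinusoidal encoding of Vaswani et al. (2017) can be written in a few lines:

```python
import numpy as np

def sinusoidal_positions(max_len, d_model):
    """Absolute sinusoidal position encodings (Vaswani et al. 2017):
    even dimensions use sine, odd dimensions use cosine."""
    pos = np.arange(max_len)[:, None]                # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]             # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe  # added to token embeddings before the first layer

print(sinusoidal_positions(max_len=4, d_model=8).round(2))
```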



Noun2Verb: Probabilistic Frame Semantics for Word Class Conversion

Lei Yu, University of Toronto, Department of Computer Science

Yang Xu, University of Toronto, Department of Computer Science; Cognitive Science Program; Vector Institute for Artificial Intelligence


Abstract Humans can flexibly extend word usages across different grammatical classes, a phenomenon known as word class conversion. Noun-to-verb conversion, or denominal verb (e.g., to Google a cheap flight), is one of the most prevalent forms of word class conversion. However, existing natural language processing systems are impoverished in interpreting and generating novel denominal verb usages. Previous work has suggested that novel denominal verb usages are comprehensible if the listener can compute the intended meaning based on shared knowledge with the speaker. Here we explore a computational formalism for this proposal couched in frame semantics. We present a formal framework, Noun2Verb, that simulates the production and comprehension of novel denominal verb usages by modeling shared knowledge of speaker and listener in semantic frames. We evaluate an incremental set of probabilistic models that learn to interpret and generate novel denominal verb usages via paraphrasing. We show that a model where the speaker and listener cooperatively learn the joint distribution over semantic frame elements better explains the empirical denominal verb usages than state-of-the-art language models, evaluated against data from (1) contemporary English in both adult and child speech, (2) contemporary Mandarin Chinese, and (3) the historical development of English. Our work grounds word class conversion in probabilistic frame semantics and bridges the gap between natural language processing systems and humans in lexical creativity.



Enhancing Lifelong Language Learning by Improving Pseudo-Sample Generation

Kasidis Kanwatchara, Chulalongkorn University, Department of Computer Engineering

Thanapapas Horsuwan, Chulalongkorn University, Department of Computer Engineering

Piyawat Lertvittayakumjorn, Imperial College London, Department of Computing

Boonserm Kijsirikul, Chulalongkorn University, Department of Computer Engineering

Peerapon Vateekul, Chulalongkorn University, Department of Computer Engineering


Abstract To achieve lifelong language learning, pseudo-rehearsal methods leverage samples generated from a language model to refresh the knowledge of previously learned tasks. Without proper controls, however, these methods could fail to retain the knowledge of complex tasks with longer texts since most of the generated samples are low in quality. To overcome the problem, we propose three specific contributions. First, we utilize double language models, each of which specializes in a specific part of the input, to produce high-quality pseudo samples. Second, we reduce the number of parameters used by applying adapter modules to enhance training efficiency. Third, we further improve the overall quality of pseudo samples using temporal ensembling and sample regeneration. The results show that our framework achieves significant improvement over baselines on multiple task sequences. Also, our pseudo sample analysis reveals helpful insights for designing even better pseudo-rehearsal methods in the future.



Nucleus Composition in Transition-based Dependency Parsing

Joakim Nivre, Uppsala University, Department of Linguistics and Philology; RISE Research Institutes of Sweden

Ali Basirat, Linköping University, Department of Computer and Information Science

Luise Dürlich, Uppsala University, Department of Linguistics and Philology; RISE Research Institutes of Sweden

Adam Moss, Uppsala University, Department of Linguistics and Philology


Abstract Dependency-based approaches to syntactic analysis assume that syntactic structure can be analyzed in terms of binary asymmetric dependency relations holding between elementary syntactic units. Computational models for dependency parsing almost universally assume that an elementary syntactic unit is a word, while the influential theory of Lucien Tesnière instead posits a more abstract notion of nucleus, which may be realized as one or more words. In this article, we investigate the effect of enriching computational parsing models with a concept of nucleus inspired by Tesnière. We begin by reviewing how the concept of nucleus can be defined in the framework of Universal Dependencies, which has become the de facto standard for training and evaluating supervised dependency parsers, and explaining how composition functions can be used to make neural transition-based dependency parsers aware of the nuclei thus defined. We then perform an extensive experimental study, using data from 20 languages to assess the impact of nucleus composition across languages with different typological characteristics, and utilizing a variety of analytical tools including ablation, linear mixed-effects models, diagnostic classifiers, and dimensionality reduction. The analysis reveals that nucleus composition gives small but consistent improvements in parsing accuracy for most languages, and that the improvement mainly concerns the analysis of main predicates, nominal dependents, clausal dependents, and coordination structures. Significant factors explaining the rate of improvement across languages include entropy in coordination structures and frequency of certain function words, in particular determiners. Analysis using dimensionality reduction and diagnostic classifiers suggests that nucleus composition increases the similarity of vectors representing nuclei of the same syntactic type.
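
The composition functions mentioned in the abstract can be pictured as a learned merge of a function word's vector into its content head's vector. The sketch below (the tanh merge, the dimensions, and which relations to compose) is illustrative rather than the paper's exact recipe.

```python
import torch
import torch.nn as nn

class NucleusComposition(nn.Module):
    """When a function word (e.g., a determiner) attaches to its content
    head, compose the two vectors into a single nucleus representation
    instead of keeping them as separate syntactic units."""
    def __init__(self, dim=128):
        super().__init__()
        self.compose = nn.Linear(2 * dim, dim)

    def forward(self, head_vec, dep_vec):
        return torch.tanh(self.compose(torch.cat([head_vec, dep_vec], dim=-1)))

comp = NucleusComposition()
nucleus = comp(torch.randn(128), torch.randn(128))  # e.g., "dog" + "the"
print(nucleus.shape)  # torch.Size([128])
```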



Effective Approaches to Neural Query Language Identification

Xingzhang Ren, Alibaba DAMO Academy




Information Theory–based Compositional Distributional Semantics


Enrique Amigó, Universidad Nacional de Educación a Distancia (UNED)

Alejandro Ariza-Casabona, CLiC - UBICS, Universitat de Barcelona

Victor Fresno, Universidad Nacional de Educación a Distancia (UNED)


Abstract In the context of text representation, Compositional Distributional Semantics models aim to fuse the Distributional Hypothesis and the Principle of Compositionality. Text embedding is based on co-occurrence distributions and the representations are in turn combined by compositional functions taking into account the text structure. However, the theoretical basis of compositional functions is still an open issue. In this article we define and study the notion of Information Theory–based Compositional Distributional Semantics (ICDS): (i) We first establish formal properties for embedding, composition, and similarity functions based on Shannon’s Information Theory; (ii) we analyze the existing approaches under this prism, checking whether or not they comply with the established desirable properties; (iii) we propose two parameterizable composition and similarity functions that generalize traditional approaches while fulfilling the formal properties; and finally (iv) we perform an empirical study on several textual similarity datasets that include sentences with a high and low lexical overlap, and on the similarity between words and their description. Our theoretical analysis and empirical results show that fulfilling formal properties affects positively the accuracy of text representation models in terms of correspondence (isometry) between the embedding and meaning spaces.



Revise and Resubmit: An Intertextual Model of Text-based Collaboration in Peer Review

Ilia Kuznetsov, UKP Lab, Technical University of Darmstadt, Department of Computer Science

Jan Buchmann, UKP Lab, Technical University of Darmstadt, Department of Computer Science

Max Eichler, UKP Lab, Technical University of Darmstadt, Department of Computer Science

Iryna Gurevych, UKP Lab, Technical University of Darmstadt, Department of Computer Science


Abstract Peer review is a key component of the publishing process in most fields of science. Increasing submission rates put a strain on reviewing quality and efficiency, motivating the development of applications to support the reviewing and editorial work. While existing NLP studies focus on the analysis of individual texts, editorial assistance often requires modeling interactions between pairs of texts—yet general frameworks and datasets to support this scenario are missing. Relationships between texts are the core object of the intertextuality theory—a family of approaches in literary studies not yet operationalized in NLP. Inspired by prior theoretical work, we propose the first intertextual model of text-based collaboration, which encompasses three major phenomena that make up a full iteration of the review–revise–and–resubmit cycle: pragmatic tagging, linking, and long-document version alignment. While peer review is used across the fields of science and publication formats, existing datasets solely focus on conference-style review in computer science. Addressing this, we instantiate our proposed model in the first annotated multidomain corpus in journal-style post-publication open peer review, and provide detailed insights into the practical aspects of intertextual annotation. Our resource is a major step toward multidomain, fine-grained applications of NLP in editorial support for peer review, and our intertextual framework paves the path for general-purpose modeling of text-based collaboration. We make our corpus, detailed annotation guidelines, and accompanying code publicly available.



Hierarchical Interpretation of Neural Text Classification 

Hanqi Yan, Department of Computer Science, University of Warwick, UK

Lin Gui, Department of Informatics, King’s College London, UK

Yulan He, Department of Informatics, King’s College London, UK; University of Warwick, UK; The Alan Turing Institute, UK


Abstract Recent years have witnessed increasing interest in developing interpretable models in Natural Language Processing (NLP). Most existing models aim at identifying input features such as words or phrases important for model predictions. Neural models developed in NLP, however, often compose word semantics in a hierarchical manner. As such, interpretation by words or phrases only cannot faithfully explain model decisions in text classification. This article proposes a novel Hierarchical Interpretable Neural Text classifier, called HINT, which can automatically generate explanations of model predictions in the form of label-associated topics in a hierarchical manner. Model interpretation is no longer at the word level, but built on topics as the basic semantic unit. Experimental results on both review datasets and news datasets show that our proposed approach achieves text classification results on par with existing state-of-the-art text classifiers, and generates interpretations more faithful to model predictions and better understood by humans than other interpretable neural text classifiers.



Neural Embedding Allocation: Distributed Representations of Topic Models 

Kamrun Naher Keya, University of Maryland, Baltimore County, Department of Information Systems

Yannis Papanikolaou, Healx, Department of Research and Development

James R. Foulds, University of Maryland, Baltimore County, Department of Information Systems


Abstract We propose a method that uses neural embeddings to improve the performance of any given LDA-style topic model. Our method, called neural embedding allocation (NEA), deconstructs topic models (LDA or otherwise) into interpretable vector-space embeddings of words, topics, documents, authors, and so on, by learning neural embeddings to mimic the topic model. We demonstrate that NEA improves the coherence scores of the original topic model by smoothing out noisy topics when the number of topics is large. Furthermore, we show NEA's effectiveness and generality in deconstructing and smoothing LDA, author-topic models, and the recent mixed membership skip-gram topic model, achieving better performance with the embeddings than with several state-of-the-art models.
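As a concrete illustration of the reconstruction step, assume word and topic vectors have already been trained to mimic a topic model; a smoothed topic-word distribution can then be read off as a softmax over dot products. The sketch below shows only this read-out, not NEA's mimicry training procedure:

```python
import numpy as np

def topic_word_distributions(word_vecs, topic_vecs):
    """Read smoothed topic-word distributions off learned embeddings.

    In the spirit of NEA: p(w | t) is proportional to exp(v_w . u_t),
    where v_w is a word vector and u_t a topic vector. The training
    that produces these vectors is omitted here.
    """
    logits = word_vecs @ topic_vecs.T            # shape (V, K)
    logits -= logits.max(axis=0, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=0, keepdims=True)  # column t is p(. | t)

# Toy example: vocabulary of 5 words, 2 topics, random embeddings.
rng = np.random.default_rng(0)
V, K, d = 5, 2, 8
p = topic_word_distributions(rng.normal(size=(V, d)), rng.normal(size=(K, d)))
assert np.allclose(p.sum(axis=0), 1.0)  # each topic is a proper distribution
```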



The Text Anonymization Benchmark (TAB): A Dedicated Corpus and Evaluation Framework for Text Anonymization 

Ildikó Pilán, Norwegian Computing Center, Oslo, Norway

Pierre Lison, Norwegian Computing Center, Oslo, Norway

Lilja Øvrelid, Language Technology Group, University of Oslo, Norway

Anthi Papadopoulou, Language Technology Group, University of Oslo, Norway

David Sánchez, Universitat Rovira i Virgili, CYBERCAT, UNESCO Chair in Data Privacy, Spain

Montserrat Batet, Universitat Rovira i Virgili, CYBERCAT, UNESCO Chair in Data Privacy, Spain

Abstract We present a novel benchmark and associated evaluation metrics for assessing the performance of text anonymization methods. Text anonymization, defined as the task of editing a text document to prevent the disclosure of personal information, currently suffers from a shortage of privacy-oriented annotated text resources, making it difficult to properly evaluate the level of privacy protection offered by various anonymization methods. This paper presents TAB (Text Anonymization Benchmark), a new, open-source annotated corpus developed to address this shortage. The corpus comprises 1,268 English-language court cases from the European Court of Human Rights (ECHR) enriched with comprehensive annotations about the personal information appearing in each document, including their semantic category, identifier type, confidential attributes, and co-reference relations. Compared with previous work, the TAB corpus is designed to go beyond traditional de-identification (which is limited to the detection of predefined semantic categories), and explicitly marks which text spans ought to be masked in order to conceal the identity of the person to be protected.


Along with presenting the corpus and its annotation layers, we propose a set of evaluation metrics specifically tailored toward measuring the performance of text anonymization, both in terms of privacy protection and utility preservation. We illustrate the use of the benchmark and the proposed metrics by assessing the empirical performance of several baseline text anonymization models. The full corpus, along with its privacy-oriented annotation guidelines, evaluation scripts, and baseline models, is available at https://github.com/NorskRegnesentral/text-anonymization-benchmark.
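As a toy illustration of the privacy side of such evaluation (deliberately much simpler than the metrics proposed in the paper, which weight identifiers by their contribution to re-identification), one can check what fraction of the gold spans that must be masked are actually covered, alongside how much text the system removed:

```python
def masking_recall(gold_spans, predicted_masks):
    """Toy privacy metric: share of gold spans fully covered by a mask.

    gold_spans:      (start, end) offsets that must be masked
    predicted_masks: (start, end) offsets the system masked
    Returns (recall, masked_chars); higher recall with fewer masked
    characters indicates better privacy protection at lower utility cost.
    """
    def covered(span):
        start, end = span
        return any(ms <= start and end <= me for ms, me in predicted_masks)

    recall = sum(covered(s) for s in gold_spans) / max(len(gold_spans), 1)
    masked_chars = sum(me - ms for ms, me in predicted_masks)
    return recall, masked_chars

gold = [(0, 10), (25, 31)]         # spans identifying the protected person
pred = [(0, 12), (40, 45)]         # the system's masked spans
print(masking_recall(gold, pred))  # -> (0.5, 17)
```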




How Much Does Lookahead Matter for Disambiguation? Partial Arabic Diacritization Case Study 

Saeed Esmail, School of Computer Science, Tel Aviv University, Tel Aviv, Israel

Kfir Bar, School of Computer Science, College of Management Academic Studies, Rishon LeZion, Israel; Basis Technology, MA, USA

Nachum Dershowitz, School of Computer Science, Tel Aviv University, Tel Aviv, Israel


Abstract We suggest a model for partial diacritization of deep orthographies. We focus on Arabic, where the optional indication of selected vowels by means of diacritics can resolve ambiguity and improve readability. Our partial diacritizer restores short vowels only when they contribute to ease of understanding while reading a given running text. The idea is to identify those ambiguities, caused by absent vowels, that require the reader to look ahead to resolve. To achieve this, two independent neural networks are used for predicting diacritics: one that takes the entire sentence as input and another that considers only the text that has been read thus far. Partial diacritization is then determined by retaining precisely those vowels on which the two networks disagree, preferring the reading based on the whole sentence over the more naïve reading-order diacritization.


For evaluation, we prepared a new dataset of Arabic texts with both full and partial vowelization. In addition to facilitating readability, we find that our partial diacritizer improves machine translation quality compared with omitting the vowels entirely or restoring them at random. Lastly, we study the benefit of knowing the text that follows the word in focus for the restoration of short vowels during reading, and we measure the degree to which lookahead contributes to resolving the ambiguities encountered while reading.


L’Herbelot had asserted, that the most ancient Korans, written in the Cufic character, had no vowel points; and that these were first invented by Jahia–ben Jamer, who died in the 127th year of the Hegira.

“Toderini’s History of Turkish Literature,” Analytical Review (1789)
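The selection rule at the heart of the method fits in a few lines. Assuming two per-position diacritic predictors (model internals omitted; the lists below are stand-ins for their outputs), a minimal sketch keeps a vowel exactly where the prefix-only reading disagrees with the whole-sentence reading:

```python
def partial_diacritics(full_context_preds, prefix_only_preds):
    """Retain a diacritic only where the two predictors disagree.

    full_context_preds: per-position diacritics from the whole-sentence model
    prefix_only_preds:  per-position diacritics from the left-to-right model
    Where they disagree, a reader would have needed lookahead, so the
    whole-sentence prediction is kept; elsewhere the letter stays bare.
    """
    return [full if full != prefix else None
            for full, prefix in zip(full_context_preds, prefix_only_preds)]

# Toy example with symbolic diacritics 'a', 'u', 'i':
full_model   = ["a", "u", "i", "a"]
prefix_model = ["a", "i", "i", "u"]
print(partial_diacritics(full_model, prefix_model))
# -> [None, 'u', None, 'a']  (positions of agreement stay undiacritized)
```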



Explainable Natural Language Processing 

George Chrysostomou, University of Sheffield

Abstract Explainable Natural Language Processing (NLP) is an emerging field that has received significant attention from the NLP community in the last few years. At its core is the need to explain the predictions of machine learning models, which are now more frequently deployed in sensitive areas such as healthcare and law. The rapid developments in explainable NLP have led to somewhat disconnected groups of studies working on related problems. This disconnect results in researchers adopting various definitions for similar problems and, in certain cases, unknowingly re-creating previous research, highlighting the need for a unified framework for explainable NLP.







About the Journal

Computational Linguistics is the longest-running publication devoted exclusively to the computational and mathematical properties of language and the design and analysis of natural language processing systems. This highly regarded quarterly offers university and industry linguists, computational linguists, artificial intelligence and machine learning investigators, cognitive scientists, speech specialists, and philosophers the latest information about the computational aspects of all the facets of research on language.




Official website:

https://direct.mit.edu/coli

Source: the official website of COMPUTATIONAL LINGUISTICS








Editor on duty: 有常

Reviewed by: 心得小蔓

For reprinting & collaboration, please contact “心得君” (WeChat: xindejun_yyxxd)

