See also the Google Scholar profiles of prof. Chris Develder or prof. Thomas Demeester. If you have trouble obtaining a copy of one of the papers below, please get in touch via chris.develder@ugent.be.
Optimizing language models for use in conversational agents requires large quantities of example dialogues. Increasingly, these dialogues are synthetically generated by using powerful large language models (LLMs), especially in domains where authentic human data is hard to obtain. One such domain is human resources (HR). In this context, we compare two LLM-based dialogue generation methods for the use case of generating HR job interviews, and assess whether one method generates higher-quality dialogues that are more challenging to distinguish from genuine human discourse. The first method uses a single prompt to generate the complete interview dialogue. The second method uses two agents that converse with each other. To evaluate dialogue quality under each method, we ask a judge LLM to determine whether AI was used for interview generation, using pairwise interview comparisons. We demonstrate that despite a sixfold increase in token cost, interviews generated with the dual-prompt method achieve a win rate up to ten times higher than those generated with the single-prompt method. This difference remains consistent regardless of whether GPT-4o or Llama 3.3 70B is used for either interview generation or judging quality.
@inproceedings{debaer2025,
author = {De Baer, Joachim and Doğruöz, A. Seza and Demeester, Thomas and Develder, Chris},
title = {Single- vs. dual-prompt dialogue generation with LLMs for job interviews in human resources},
booktitle = {Proc. 4th Generation Evaluation \& Metrics (GEM) workshop at ACL 2025},
month = {31 Jul.},
year = {2025},
pages = {947--957},
address = {Vienna, Austria},
url = {https://aclanthology.org/2025.gem-1.74/}
}
Negative Prompting (NP) is widely utilized in diffusion models, particularly in text-to-image applications, to prevent the generation of undesired features. In this paper, we show that conventional NP is limited by the assumption of a constant guidance scale, which may lead to highly suboptimal results, or even complete failure, due to the non-stationarity and state-dependence of the reverse process. Based on this analysis, we derive a principled technique called Dynamic Negative Guidance, which relies on a near-optimal time and state dependent modulation of the guidance without requiring additional training. Unlike NP, negative guidance requires estimating the posterior class probability during the denoising process, which is achieved with limited additional computational overhead by tracking the discrete Markov Chain during the generative process. We evaluate the performance of DNG class-removal on MNIST and CIFAR10, where we show that DNG leads to higher safety, preservation of class balance and image quality when compared with baseline methods. Furthermore, we show that it is possible to use DNG with Stable Diffusion to obtain more accurate and less invasive guidance than NP.
@inproceedings{koulischer2025iclr,
author = {Koulischer, Felix and Deleu, Johannes and Raya, Gabriel and Demeester, Thomas and Ambrogioni, Luca},
title = {Dynamic negative guidance of diffusion models},
booktitle = {Proc. 13th Int. Conf. Learning Representations (ICLR 2025)},
month = {24--28 Apr.},
year = {2025},
address = {Singapore},
url = {https://openreview.net/forum?id=6p74UyAdLa}
}
The brittleness of finetuned language model performance on out-of-distribution (OOD) test samples in unseen domains has been well-studied for English, yet is unexplored for multi-lingual models. Therefore, we study generalization to OOD test data specifically in zero-shot cross-lingual transfer settings, analyzing performance impacts of both language and domain shifts between train and test data. We further assess the effectiveness of counterfactually augmented data (CAD) in improving OOD generalization for the cross-lingual setting, since CAD has been shown to benefit in a monolingual English setting. Finally, we propose two new approaches for OOD generalization that avoid the costly annotation process associated with CAD, by exploiting the power of recent large language models (LLMs). We experiment with 3 multilingual models, LaBSE, mBERT, and XLM-R trained on English IMDb movie reviews, and evaluate on OOD test sets in 13 languages: Amazon product reviews, Tweets, and Restaurant reviews. Results echo the OOD performance decline observed in the monolingual English setting. Further, (i) counterfactuals from the original high-resource language do improve OOD generalization in the low-resource language, and (ii) our newly proposed cost-effective approaches reach similar or up to +3.1% better accuracy than CAD for Amazon and Restaurant reviews.
@inproceedings{deraedt2023mrl,
author = {De Raedt, Maarten and Bitew, Semere Kiros and Godin, Fréderic and Demeester, Thomas and Develder, Chris},
title = {Zero-shot cross-lingual sentiment classification under distribution shift: An exploratory study},
booktitle = {Proc. 3rd Multiling. Represent. Learn. Workshop (MRL 2023) at EMNLP 2023},
month = {7 Dec.},
year = {2023},
pages = {5--66},
address = {Singapore},
doi = {10.18653/v1/2023.mrl-1.5}
}
Model interpretability and model editing are crucial goals in the age of large language models. Interestingly, there exists a link between these two goals: if a method is able to systematically edit model behavior with regard to a human concept of interest, this editor method can help make internal representations more interpretable by pointing towards relevant representations and systematically manipulating them.
@inproceedings{doosterlinck2023blackboxnlp,
author = {D'Oosterlinck, Karel and Demeester, Thomas and Develder, Chris and Potts, Christopher},
title = {Flexible model interpretability through natural language model editing},
booktitle = {Proc. 6th BlackboxNLP Workshop: Anal. and Interpret. Neural Netw. for NLP (BlackboxNLP 2023) at EMNLP 2023},
year = {2023},
address = {Singapore}
}
State-of-the-art coreference resolution systems depend on multiple LLM calls per document and are thus prohibitively expensive for many use cases (e.g., information extraction with large corpora). The leading word-level coreference system (WL-coref) attains 96.6% of these SOTA systems' performance while being much more efficient. In this work, we identify a routine yet important failure case of WL-coref: dealing with conjoined mentions such as 'Tom and Mary'. We offer a simple yet effective solution that improves the performance on the OntoNotes test set by 0.9% F1, shrinking the gap between efficient word-level coreference resolution and expensive SOTA approaches by 34.6%. Our Conjunction-Aware Word-level coreference model (CAW-coref) and code are available at https://github.com/KarelDO/wl-coref.
@inproceedings{doosterlinck2023crac,
author = {Karel D'Oosterlinck and Semere Kiros Bitew and Brandon Papineau and Christopher Potts and Thomas Demeester and Chris Develder},
title = {CAW-coref: Conjunction-aware word-level coreference resolution},
booktitle = {Proc. 6th Workshop Comput. Models Ref. Anaphora and Coref. (CRAC 2023) at EMNLP 2023},
month = {6--7 Dec.},
year = {2023},
doi = {10.18653/v1/2023.crac-main.2}
}
Timely and accurate extraction of Adverse Drug Events (ADE) from biomedical literature is paramount for public safety, but involves slow and costly manual labor. We set out to improve drug safety monitoring (pharmacovigilance, PV) through the use of Natural Language Processing (NLP). We introduce BioDEX, a large-scale resource for Biomedical adverse Drug Event Extraction, rooted in the historical output of drug safety reporting in the U.S. BioDEX consists of 65k abstracts and 19k full-text biomedical papers with 256k associated document-level safety reports created by medical experts. The core features of these reports include the reported weight, age, and biological sex of a patient, a set of drugs taken by the patient, the drug dosages, the reactions experienced, and whether the reaction was life threatening. In this work, we consider the task of predicting the core information of the report given its originating paper. We estimate human performance to be 72.0% F1, whereas our best model achieves 62.3% F1, indicating significant headroom on this task. We also begin to explore ways in which these models could help professional PV reviewers. Our code and data are available: https://github.com/KarelDO/BioDEX
@inproceedings{doosterlinck2023emnlp,
author = {Karel D'Oosterlinck and François Remy and Johannes Deleu and Thomas Demeester and Chris Develder and Klim Zaporojets and Aneiss Ghodsi and Simon Ellershaw and Jack Collins and Christopher Potts},
title = {BioDEX: Large-scale biomedical adverse drug event extraction for real-world pharmacovigilance},
booktitle = {Findings of the ACL: EMNLP 2023},
month = {Dec.},
year = {2023},
pages = {13425--13454},
address = {Singapore},
doi = {10.18653/v1/2023.findings-emnlp.896}
}
Natural language is an appealing medium for explaining how large language models process and store information, but evaluating the faithfulness of such explanations is challenging. To help address this, we develop two modes of evaluation for natural language explanations that claim individual neurons represent a concept in a text input. In the observational mode, we evaluate claims that a neuron a activates on all and only input strings that refer to a concept picked out by the proposed explanation E. In the intervention mode, we construe E as a claim that neuron a is a causal mediator of the concept denoted by E. We apply our framework to the GPT-4-generated explanations of GPT-2 XL neurons of Bills et al. (2023) and show that even the most confident explanations have high error rates and little to no causal efficacy. We close the paper by critically assessing whether natural language is a good choice for explanations and whether neurons are the best level of analysis.
@inproceedings{huang2023blackboxnlp,
author = {Jing Huang and Atticus Geiger and Karel D'Oosterlinck and Zhengxuan Wu and Christopher Potts},
title = {Rigorously assessing natural language explanations of neurons},
booktitle = {Proc. 6th BlackboxNLP Workshop: Anal. and Interpret. Neural Netw. for NLP (BlackboxNLP 2023) at EMNLP 2023},
year = {2023},
pages = {317--331},
address = {Singapore},
doi = {10.18653/v1/2023.blackboxnlp-1.24}
}
The impact of person-job fit on job satisfaction and performance is widely acknowledged, which highlights the importance of providing workers with next steps at the right time in their career. This task of predicting the next step in a career is known as career path prediction, and has diverse applications such as turnover prevention and internal job mobility. Existing methods to career path prediction rely on large amounts of private career history data to model the interactions between job titles and companies. We propose leveraging the unexplored textual descriptions that are part of work experience sections in resumes. We introduce a structured dataset of 2,164 anonymized career histories, annotated with ESCO occupation labels. Based on this dataset, we present a novel representation learning approach, CareerBERT, specifically designed for work history data. We develop a skill-based model and a text-based model for career path prediction, which achieve 35.24% and 39.61% recall@10 respectively on our dataset. Finally, we show that both approaches are complementary as a hybrid approach achieves the strongest result with 43.01% recall@10.
@inproceedings{decorte2023recsys,
author = {Decorte, Jens-Joris and Van Hautte, Jeroen and Deleu, Johannes and Develder, Chris and Demeester, Thomas},
title = {Career path prediction using resume representation learning and skill-based matching},
booktitle = {3rd Workshop Recomm. Syst. Human Resour. (RecSys in HR 2023) at ACM RecSys 2023},
month = {19 Sep.},
year = {2023},
pages = {1--9},
address = {Singapore},
url = {https://ceur-ws.org/Vol-3490/RecSysHR2023-paper_1.pdf}
}
Online job ads serve as a valuable source of information for skill requirements, playing a crucial role in labor market analysis and e-recruitment processes. Since such ads are typically formatted in free text, natural language processing (NLP) technologies are required to automatically process them. We specifically focus on the task of detecting skills (mentioned literally, or implicitly described) and linking them to a large skill ontology, making it a challenging case of extreme multi-label classification (XMLC). Given that no sizable labeled (training) dataset is available for this specific XMLC task, we propose techniques to leverage general Large Language Models (LLMs). We describe a cost-effective approach to generate an accurate, fully synthetic labeled dataset for skill extraction, and present a contrastive learning strategy that proves effective in the task. Our results across three skill extraction benchmarks show a consistent increase of between 15 and 25 percentage points in R-Precision@5 compared to previously published results that relied solely on distant supervision through literal matches.
@inproceedings{decorte2023ai4hr,
author = {Decorte, Jens-Joris and Verlinden, Severine and Van Hautte, Jeroen and Deleu, Johannes and Develder, Chris and Demeester, Thomas},
title = {Extreme multi-label skill extraction training using large language models},
booktitle = {Proc. Int. Workshop AI For Human Resour. Public Employ. Serv. (AI4HR \& PES) at ECML-PKDD 2023},
month = {18 Sep.},
year = {2023},
pages = {1--12},
address = {Turin, Italy}
}
Large Language Models (LLMs) such as ChatGPT have demonstrated remarkable performance across various tasks and have garnered significant attention from both researchers and practitioners. However, in an educational context, we still observe a performance gap in generating distractors -- i.e., plausible yet incorrect answers -- with LLMs for multiple-choice questions (MCQs). In this study, we propose a strategy for guiding LLMs such as ChatGPT, in generating relevant distractors by prompting them with question items automatically retrieved from a question bank as well-chosen in-context examples. We evaluate our LLM-based solutions using a quantitative assessment on an existing test set, as well as through quality annotations by human experts, i.e., teachers. We found that on average 53% of the generated distractors presented to the teachers were rated as high-quality, i.e., suitable for immediate use as is, outperforming the state-of-the-art model. We also show the gains of our approach in generating high-quality distractors by comparing it with a zero-shot ChatGPT and a few-shot ChatGPT prompted with static examples.
@inproceedings{bitew2023rkde,
author = {Bitew, Semere Kiros and Deleu, Johannes and Develder, Chris and Demeester, Thomas},
title = {Distractor generation for multiple-choice questions with predictive prompting and large language models},
booktitle = {Proc. 1st Int. Tut. Workshop on Responsible Knowledge Discovery in Education (RKDE 2023) at ECML-PKDD 2023},
month = {18 Sep.},
year = {2023},
pages = {1--16},
address = {Turin, Italy}
}
Explainability methods for NLP systems encounter a version of the fundamental problem of causal inference: for a given ground-truth input text, we never truly observe the counterfactual texts necessary for isolating the causal effects of model representations on outputs. In response, many explainability methods make no use of counterfactual texts, assuming they will be unavailable. In this paper, we show that robust causal explainability methods can be created using approximate counterfactuals, which can be written by humans to approximate a specific counterfactual or simply sampled using metadata-guided heuristics. The core of our proposal is the Causal Proxy Model (CPM). A CPM explains a black-box model N because it is trained to have the same actual input/output behavior as N while creating neural representations that can be intervened upon to simulate the counterfactual input/output behavior of N. Furthermore, we show that the best CPM for N performs comparably to N in making factual predictions, which means that the CPM can simply replace N, leading to more explainable deployed models.
@inproceedings{doosterlinck2023icml,
author = {Zhengxuan Wu and Karel D'Oosterlinck and Atticus Geiger and Amir Zur and Christopher Potts},
title = {Causal proxy models for concept-based model explanations},
booktitle = {Proc. 40th Int. Conf. Machine Learn. (ICML 2023)},
month = {23--29 Jul.},
year = {2023},
pages = {1--22},
address = {Honolulu, HI, USA},
url = {https://openreview.net/forum?id=1Hh1cIPJ7V}
}
Intent discovery is the task of inferring latent intents from a set of unlabeled utterances, and is a useful step towards the efficient creation of new conversational agents. We show that recent competitive methods in intent discovery can be outperformed by clustering utterances based on abstractive summaries, i.e., `labels', that retain the core elements while removing non-essential information. We contribute the IDAS approach, which collects a set of descriptive utterance labels by prompting a Large Language Model, starting from a well-chosen seed set of prototypical utterances, to bootstrap an In-Context Learning procedure to generate labels for non-prototypical utterances. The utterances and their resulting noisy labels are then encoded by a frozen pre-trained encoder, and subsequently clustered to recover the latent intents. For the unsupervised task (without any intent labels) IDAS outperforms the state-of-the-art by up to +7.42% in standard cluster metrics for the Banking, StackOverflow, and Transport datasets. For the semi-supervised task (with labels for a subset of intents) IDAS surpasses 2 recent methods on the CLINC benchmark without even using labeled data.
@inproceedings{deraedt2023idas,
author = {De Raedt, Maarten and Godin, Fréderic and Demeester, Thomas and Develder, Chris},
title = {IDAS: Intent Discovery with Abstractive Summarization},
booktitle = {Proc. 5th Workshop on NLP for Conversational AI (NLP4ConvAI 2023) at ACL 2023},
month = {14 Jul.},
year = {2023},
pages = {71--88},
address = {Toronto, Canada},
url = {https://aclanthology.org/2023.nlp4convai-1.7},
doi = {10.18653/v1/2023.nlp4convai-1.7}
}
Since performing exercises (including, e.g., practice tests) forms a crucial component of learning, and creating such exercises requires non-trivial effort from the teacher, there is great value in automatic exercise generation in digital tools in education. In this paper, we particularly focus on automatic creation of gap-filling exercises for language learning, specifically grammar exercises. Since providing any annotation in this domain requires human expert effort, we aim to avoid it entirely and explore the task of converting existing texts into new gap-filling exercises, purely based on an example exercise, without explicit instruction or detailed annotation of the intended grammar topics. We contribute (i) a novel neural network architecture specifically designed for the aforementioned gap-filling exercise generation task, and (ii) a real-world benchmark dataset for French grammar. We show that our model for this French grammar gap-filling exercise generation outperforms a competitive baseline classifier by 8 percentage points in F1, achieving an average F1 score of 82%. Our model implementation and the dataset are made publicly available to foster future research, thus offering a standardized evaluation and baseline solution of the proposed partially annotated data prediction task in grammar exercise creation.
@inproceedings{bitew2023bea,
author = {Bitew, Semere Kiros and Deleu, Johannes and Doğruöz, A. Seza and Develder, Chris and Demeester, Thomas},
title = {Learning from partially annotated data: Example-aware creation of gap-filling exercises for language learning},
booktitle = {Proc. 18th Workshop Innovative Use of NLP for Building Educational Applications (BEA 2023) at ACL 2023},
month = {13 Jul.},
year = {2023},
pages = {598--609},
address = {Toronto, Canada},
url = {https://aclanthology.org/2023.bea-1.51},
doi = {10.18653/v1/2023.bea-1.51}
}
Question Generation (QG) systems have shown promising results in reducing the time and effort required to create questions for students. Typically, a first step in QG is to select the content to design a question for. In an educational setting, it is crucial that the resulting questions cover the most relevant/important pieces of knowledge the student should have acquired. Yet, current QG systems either consider just a single sentence or paragraph (thus do not include a selection step), or do not consider this educational viewpoint of content selection. Aiming to fill this research gap with a solution for educational document level QG, we thus propose to select contents for QG based on relevance and topic diversity. We demonstrate the effectiveness of our proposed content selection strategy for QG on 2 educational datasets. In our performance assessment, we also highlight limitations of existing QG evaluation metrics in light of the content selection problem.
@inproceedings{hadifar2023eacl,
author = {Hadifar, Amir and Bitew, Semere Kiros and Deleu, Johannes and Hoste, Veronique and Develder, Chris and Demeester, Thomas},
title = {Diverse content selection for educational question generation},
booktitle = {Proc. 17th Conf. Eur. Chapter Associat. Comput. Linguist.: Stud. Research Workshop (EACL SRW 2023)},
month = {2--6 May},
year = {2023},
pages = {123--133},
address = {Dubrovnik, Croatia},
url = {https://aclanthology.org/2023.eacl-srw.13},
doi = {10.18653/v1/2023.eacl-srw.13}
}
For text classification tasks, finetuned language models perform remarkably well. Yet, they tend to rely on spurious patterns in training data, thus limiting their performance on out-of-distribution (OOD) test data. Among recent models aiming to avoid this spurious pattern problem, adding extra counterfactual samples to the training data has proven to be very effective. Yet, counterfactual data generation is costly since it relies on human annotation. Thus, we propose a novel solution that only requires annotation of a small fraction (e.g., 1%) of the original training data, and uses automatic generation of extra counterfactuals in an encoding vector space. We demonstrate the effectiveness of our approach in sentiment classification, using IMDb data for training and other sets for OOD tests (i.e., Amazon, SemEval and Yelp). We achieve noticeable accuracy improvements by adding only 1% manual counterfactuals: +3% compared to adding +100% in-distribution training samples, +1.3% compared to alternate counterfactual approaches.
@inproceedings{deraedt2022emnlp,
author = {De Raedt, Maarten and Godin, Fréderic and Develder, Chris and Demeester, Thomas},
title = {Robustifying sentiment classification by maximally exploiting few counterfactuals},
booktitle = {Proc. Conf. Empirical Methods in Natural Lang. Processing (EMNLP 2022)},
month = {7--11 Dec.},
year = {2022},
pages = {11386--11400},
address = {Abu Dhabi, UAE},
url = {https://aclanthology.org/2022.emnlp-main.783}
}
Bayesian Networks may be appealing for clinical decision-making due to their inclusion of causal knowledge, but their practical adoption remains limited as a result of their inability to deal with unstructured data. While neural networks do not have this limitation, they are not interpretable and are inherently unable to deal with causal structure in the input space. Our goal is to build neural networks that combine the advantages of both approaches. Motivated by the perspective to inject causal knowledge while training such neural networks, this work presents initial steps in that direction. We demonstrate how a neural network can be trained to output conditional probabilities, providing approximately the same functionality as a Bayesian Network. Additionally, we propose two training strategies that allow encoding the independence relations inferred from a given causal structure into the neural network. We present initial results in a proof-of-concept setting, showing that the neural model acts as an understudy to its Bayesian Network counterpart, approximating its probabilistic and causal properties.
@inproceedings{rabaey2022neurips,
author = {Rabaey, Paloma and De Boom, Cedric and Demeester, Thomas},
title = {Neural Bayesian network understudy},
booktitle = {Proc. Workshop Causal Mach. Learn. Real-World Impact (CML4Impact 2022) at NeurIPS 2022},
month = {2 Dec.},
year = {2022},
address = {New Orleans, LA, USA}
}
In our continuously evolving world, entities change over time and new, previously non-existing or unknown, entities appear. We study how this evolutionary scenario impacts the performance on a well-established entity linking (EL) task. For that study, we introduce TempEL, an entity linking dataset that consists of time-stratified English Wikipedia snapshots from 2013 to 2022, from which we collect both anchor mentions of entities, and these target entities’ descriptions. By capturing such temporal aspects, our newly introduced TempEL resource contrasts with currently existing entity linking datasets, which are composed of fixed mentions linked to a single static version of a target Knowledge Base (e.g., Wikipedia 2010 for CoNLL-AIDA). Indeed, for each of our collected temporal snapshots, TempEL contains links to entities that are continual, i.e., occur in all of the years, as well as completely new entities that appear for the first time at some point. Thus, we make it possible to quantify the performance of current state-of-the-art EL models for: (i) entities that are subject to changes over time in their Knowledge Base descriptions as well as their mentions’ contexts, and (ii) newly created entities that were previously non-existing (e.g., at the time the EL model was trained). Our experimental results show that in terms of temporal performance degradation, (i) continual entities suffer a decrease of up to 3.1% EL accuracy, while (ii) for new entities this accuracy drop is up to 17.9%. This highlights the challenge of the introduced TempEL dataset and opens new research prospects in the area of time-evolving entity disambiguation.
@inproceedings{Zaporojets2022NeurIPS,
author = {Zaporojets, Klim and Kaffee, Lucie-Aimée and Demeester, Thomas and Develder, Chris and Augenstein, Isabelle},
title = {TempEL: Linking dynamically evolving and newly emerging entities},
booktitle = {Proc. 36th Conf. Neural Inf. Process. Sys. (NeurIPS 2022)},
month = {28 Nov.--9 Dec.},
year = {2022},
address = {New Orleans, LA, USA},
url = {https://openreview.net/forum?id=vrnqr3PG4yB}
}
The increasing size and complexity of modern ML systems has improved their predictive capabilities but made their behavior harder to explain. Many techniques for model explanation have been developed in response, but we lack clear criteria for assessing these techniques. In this paper, we cast model explanation as the causal inference problem of estimating causal effects of real-world concepts on the output behavior of ML models given actual input data. We introduce CEBaB, a new benchmark dataset for assessing concept-based explanation methods in Natural Language Processing (NLP). CEBaB consists of short restaurant reviews with human-generated counterfactual reviews in which an aspect (food, noise, ambiance, service) of the dining experience was modified. Original and counterfactual reviews are annotated with multiply-validated sentiment ratings at the aspect-level and review-level. The rich structure of CEBaB allows us to go beyond input features to study the effects of abstract, real-world concepts on model behavior. We use CEBaB to compare the quality of a range of concept-based explanation methods covering different assumptions and conceptions of the problem, and we seek to establish natural metrics for comparative assessments of these methods.
@inproceedings{abraham2022,
author = {Abraham, Eldar David and D'Oosterlinck, Karel and Feder, Amir and Gat, Yair and Geiger, Atticus and Potts, Christopher and Reichart, Roi and Wu, Zhengxuan},
title = {CEBaB: Estimating the causal effects of real-world concepts on NLP model behavior},
booktitle = {Proc. 36th Conf. Neural Inf. Process. Sys. (NeurIPS 2022)},
month = {28 Nov.--9 Dec.},
year = {2022},
address = {New Orleans, LA, USA},
url = {https://proceedings.neurips.cc/paper_files/paper/2022/hash/701ec28790b29a5bc33832b7bdc4c3b6-Abstract-Conference.html}
}
The ability to track fine-grained emotions in customer service dialogues has many real-world applications, but has not been studied extensively. This paper measures the potential of prediction models on that task, based on a real-world dataset of Dutch Twitter conversations in the domain of customer service. We find that modeling emotion trajectories has a small, but measurable benefit compared to predictions based on isolated turns. The models used in our study are shown to generalize well to different companies and economic sectors.
@inproceedings{labat2022,
author = {Labat, Sofie and Hadifar, Amir and Demeester, Thomas and Hoste, Véronique},
title = {An emotional journey: Detecting emotion trajectories in Dutch customer service dialogues},
booktitle = {Proc. 8th Workshop Noisy User-generated Text (W-NUT 2022) at COLING 2022},
month = {16 Oct.},
year = {2022},
pages = {106--112},
address = {Gyeongju, Republic of Korea},
url = {https://aclanthology.org/2022.wnut-1.12/}
}
Skills play a central role in the job market and many human resources (HR) processes. In the wake of other digital experiences, today's online job market has candidates expecting to see the right opportunities based on their skill set. Similarly, enterprises increasingly need to use data to guarantee that the skills within their workforce remain future-proof. However, structured information about skills is often missing, and processes building on self- or manager-assessment have been shown to struggle with issues around adoption, completeness, and freshness of the resulting data. Extracting skills is a highly challenging task, given the many thousands of possible skill labels mentioned either explicitly or merely described implicitly and the lack of finely annotated training corpora. Previous work on skill extraction overly simplifies the task to an explicit entity detection task or builds on manually annotated training data that would be infeasible if applied to a complete vocabulary of skills. We propose an end-to-end system for skill extraction, based on distant supervision through literal matching. We propose and evaluate several negative sampling strategies, tuned on a small validation dataset, to improve the generalization of skill extraction towards implicitly mentioned skills, despite the lack of such implicit skills in the distantly supervised data. We observe that using the ESCO taxonomy to select negative examples from related skills yields the biggest improvements, and combining three different strategies in one model further increases the performance, up to 8 percentage points in RP@5. We introduce a manually annotated evaluation benchmark for skill extraction based on the ESCO taxonomy, on which we validate our models. We release the benchmark dataset for research purposes to stimulate further research on the task.
@inproceedings{Decorte2022RecSysHR,
author = {Decorte, Jens-Joris and Van Hautte, Jeroen and Deleu, Johannes and Develder, Chris and Demeester, Thomas},
title = {Design of negative sampling strategies for distantly supervised skill extraction},
booktitle = {Proc. 2nd Workshop Recomm. Sys. Hum. Resour. at RecSys 2022 (RecSys in HR 2022)},
month = {22 Sep.},
year = {2022},
address = {Seattle, WA, USA}
}
This pilot study employs the Wizard of Oz technique to collect a corpus of written human-computer conversations in the domain of customer service. The resulting dataset contains 192 conversations and is used to test three hypotheses related to the expression and annotation of emotions. First, we hypothesize that there is a discrepancy between the emotion annotations of the participant (the experiencer) and the annotations of our external annotator (the observer). Furthermore, we hypothesize that the personality of the participants has an influence on the emotions they expressed, and on the way they evaluated (annotated) these emotions. We found that for an external, trained annotator, not all emotion labels were equally easy to work with. We also noticed that the trained annotator had a tendency to opt for emotion labels that were more centered in the valence-arousal space, while participants made more `extreme' annotations. For the second hypothesis, we discovered a positive correlation between the personality trait extraversion and the emotion dimensions valence and dominance in our sample. Finally, for the third premise, we observed a positive correlation between the internal-external agreement on emotion labels and the personality traits conscientiousness and extraversion. Our insights and findings will be used in future research to conduct a larger Wizard of Oz experiment.
@inproceedings{labat2022lrec,
author = {Labat, Sofie and Ackaert, Naomi and Demeester, Thomas and Hoste, Veronique},
title = {Variation in the expression and annotation of emotions: a Wizard of Oz pilot study},
booktitle = {Proc. 1st Workshop Perspectivist Approaches to NLP @LREC2022 (NLPerspectives 2022)},
month = {20 Jun.},
year = {2022},
pages = {66--72},
address = {Marseille, France},
url = {https://aclanthology.org/2022.nlperspectives-1.9/}
}
This work presents the contribution from the Text-to-Knowledge team of Ghent University (UGent-T2K) to the MultiDoc2Dial shared task on modeling dialogs grounded in multiple documents. We propose a pipeline system, comprising (1) document retrieval, (2) passage retrieval, and (3) response generation. We engineered these individual components mainly by, for (1)-(2), combining multiple ranking models and adding a final LambdaMART reranker, and, for (3), by adopting a Fusion-in-Decoder (FiD) model. We thus significantly boost the baseline system’s performance (over +10 points for both F1 and SacreBLEU). Further, error analysis reveals two major failure cases, to be addressed in future work: (i) in case of topic shift within the dialog, retrieval often fails to select the correct grounding document(s), and (ii) generation sometimes fails to use the correctly retrieved grounding passage.
@inproceedings{jiang2022acl,
author = {Jiang, Yiwei and Hadifar, Amir and Deleu, Johannes and Demeester, Thomas and Develder, Chris},
title = {UGent-T2K at the 2nd DialDoc shared task: A retrieval-focused dialog system grounded in multiple documents},
booktitle = {Proc. DialDoc Workshop at ACL 2022},
month = {26 May},
year = {2022},
pages = {1--8},
address = {Dublin, Ireland},
doi = {10.18653/v1/2022.dialdoc-1.12}
}
We consider the task of document-level entity linking (EL), where it is important to make consistent decisions for entity mentions over the full document jointly. We aim to leverage explicit “connections” among mentions within the document itself: we propose to join EL and coreference resolution (coref) in a single structured prediction task over directed trees and use a globally normalized model to solve it. This contrasts with related works where two separate models are trained for each of the tasks and additional logic is required to merge the outputs. Experimental results on two datasets show a boost of up to +5% F1-score on both coref and EL tasks, compared to their standalone counterparts. For a subset of hard cases, with individual mentions lacking the correct EL in their candidate entity list, we obtain a +50% increase in accuracy.
@inproceedings{zaporojets2022acl,
author = {Zaporojets, Klim and Deleu, Johannes and Jiang, Yiwei and Demeester, Thomas and Develder, Chris},
title = {Towards consistent document-level entity linking: Joint models for entity linking and coreference resolution},
booktitle = {Proc. 60th Annual Meet. Assoc. Comput. Linguist. (ACL 2022)},
month = {22--27 May},
year = {2022},
pages = {1--7},
address = {Dublin, Ireland},
doi = {10.18653/v1/2022.acl-short.88}
}
Large annotated corpora for coreference resolution are available for few languages. For machine translation, however, strong black-box systems exist for many languages. We empirically explore the appealing idea of leveraging such translation tools for bootstrapping coreference resolution in languages with limited resources. Two scenarios are analyzed, in which a large coreference corpus in a high-resource language is used for coreference predictions in a smaller language, i.e., by machine translating either the training corpus or the test data. In our empirical evaluation of coreference resolution using the two scenarios on several medium-resource languages, we find no improvement over monolingual baseline models. Our analysis of the various sources of error inherent to the studied scenarios reveals that in fact the quality of contemporary machine translation tools is the main limiting factor.
@inproceedings{bitew2021crac,
author = {Bitew, Semere Kiros and Deleu, Johannes and Develder, Chris and Demeester, Thomas},
title = {Lazy low-resource coreference resolution: A study on leveraging black-box translation tools},
booktitle = {Proc. 4th Workshop Comput. Models of Reference, Anaphora and Coreference (CRAC 2021) at EMNLP 2021},
month = {11 Nov.},
year = {2021},
pages = {1--6},
address = {Punta Cana, Dominican Republic},
url = {https://aclanthology.org/2021.crac-1.6/}
}
Powerful sentence encoders trained for multiple languages are on the rise. These systems are capable of embedding a wide range of linguistic properties into vector representations. While explicit probing tasks can be used to verify the presence of specific linguistic properties, it is unclear whether the vector representations can be manipulated to indirectly steer such properties. For efficient learning, we investigate the use of a geometric mapping in embedding space to transform linguistic properties, without any tuning of the pre-trained sentence encoder or decoder. We validate our approach on three linguistic properties using a pre-trained multilingual autoencoder and analyze the results in both monolingual and cross-lingual settings.
@inproceedings{deraedt2021emnlp,
author = {De Raedt, Maarten and Godin, Fréderic and Buteneers, Pieter and Develder, Chris and Demeester, Thomas},
title = {A simple geometric method for cross-lingual linguistic transformations with pre-trained autoencoders},
booktitle = {Proc. Conf. Empirical Methods in Natural Lang. Processing (EMNLP 2021)},
month = {7--11 Nov.},
year = {2021},
address = {Punta Cana, Dominican Republic},
url = {https://aclanthology.org/2021.emnlp-main.792/}
}
Job titles form a cornerstone of today’s human resources (HR) processes. Within online recruitment, they allow candidates to understand the contents of a vacancy at a glance, while internal HR departments use them to organize and structure many of their processes. As job titles are a compact, convenient, and readily available data source, modeling them with high accuracy can greatly benefit many HR tech applications. In this paper, we propose a neural representation model for job titles, by augmenting a pre-trained language model with co-occurrence information from skill labels extracted from vacancies. Our JobBERT method leads to considerable improvements compared to using generic sentence encoders, for the task of job title normalization, for which we release a new evaluation benchmark.
@inproceedings{decorte2021feast,
author = {Decorte, Jens-Joris and Van Hautte, Jeroen and Demeester, Thomas and Develder, Chris},
title = {JobBERT: Understanding job titles through skills},
booktitle = {Proc. Int. Workshop Fair, Effective and Sustainable Talent at ECML-PKDD (FEAST 2021)},
month = {13--17 Sep.},
year = {2021},
address = {Bilbao, Spain}
}
We consider a joint information extraction (IE) model, solving named entity recognition, coreference resolution and relation extraction jointly over the whole document. In particular, we study how to inject information from a knowledge base (KB) in such an IE model, based on unsupervised entity linking. The used KB entity representations are learned from either (i) hyperlinked text documents (Wikipedia), or (ii) a knowledge graph (Wikidata), and appear complementary in raising IE performance. Representations of corresponding entity linking (EL) candidates are added to text span representations of the input document, and we experiment with (i) taking a weighted average of the EL candidate representations based on their prior (in Wikipedia), and (ii) using an attention scheme over the EL candidate list. Results demonstrate an increase of up to 5% F1-score for the evaluated IE tasks on two datasets. Despite a strong performance of the prior-based model, our quantitative and qualitative analysis reveals the advantage of using the attention-based approach.
@inproceedings{verlinden2021,
author = {Verlinden, Severine and Zaporojets, Klim and Deleu, Johannes and Demeester, Thomas and Develder, Chris},
title = {Injecting knowledge base information into end-to-end joint entity and relation extraction and coreference resolution},
booktitle = {Findings of the ACL: ACL-IJCNLP 2021},
month = {1--6 Aug.},
year = {2021},
address = {Bangkok, Thailand},
doi = {10.18653/v1/2021.findings-acl.171}
}
In online domain-specific customer service applications, many companies struggle to deploy advanced NLP models successfully, due to the limited availability of and noise in their datasets. While prior research demonstrated the potential of migrating large open-domain pretrained models for domain-specific tasks, the appropriate (pre)training strategies have not yet been rigorously evaluated in such social media customer service settings, especially under multilingual conditions. We address this gap by (i) collecting a multilingual social media corpus containing customer service conversations (865k tweets), (ii) comparing various pipelines of pretraining and fine-tuning approaches, and (iii) applying them on 5 different end tasks. We show that pretraining a generic multilingual transformer model on our in-domain dataset, before fine-tuning on specific end tasks, consistently boosts performance, especially in non-English settings.
@inproceedings{hadifar2021naacl,
author = {Hadifar, Amir and Labat, Sofie and Hoste, Veronique and Develder, Chris and Demeester, Thomas},
title = {A million tweets are worth a few points: Tuning transformers for customer support tasks},
booktitle = {Proc. Ann. Conf. North American Chapter Assoc. Comp. Linguist. (NAACL 2021)},
month = {6--11 Jun.},
year = {2021},
address = {Online},
url = {https://www.aclweb.org/anthology/2021.naacl-main.21/}
}
We propose a newly annotated dataset for information extraction on recipes. Unlike previous approaches to machine comprehension of procedural texts, we avoid a priori pre-defining domain-specific predicates to recognize (e.g., the primitive instructions in MILK) and focus on basic understanding of the expressed semantics rather than directly reducing them to a simplified state representation (e.g., ProPara). We thus frame the semantic comprehension of procedural text, such as recipes, as fairly generic NLP subtasks, covering (i) entity recognition (ingredients, tools and actions), (ii) relation extraction (what ingredients and tools are involved in the actions), and (iii) zero anaphora resolution (link actions to implicit arguments, e.g., results from previous recipe steps). Further, our Recipe Instruction Semantic Corpus (RISeC) dataset includes textual descriptions for the zero anaphora, to facilitate language generation thereof. Besides the dataset itself, we contribute a pipeline neural architecture that addresses entity and relation extraction as well as identification of zero anaphora. These basic building blocks can facilitate more advanced downstream applications (e.g., question answering, conversational agents).
@inproceedings{jiang2020aacl,
author = {Jiang, Yiwei and Zaporojets, Klim and Deleu, Johannes and Demeester, Thomas and Develder, Chris},
title = {Recipe instruction semantics corpus (RISeC): Resolving semantic structure and zero anaphora in recipes},
booktitle = {Proc. 1st Conf. Asia-Pacific Chapter of the Assoc. Comput. Linguist. and 10th Int. Joint Conf. Natural Lang. Processing (AACL-IJCNLP 2020)},
month = {4--7 Dec.},
year = {2020},
pages = {821--826},
address = {Online},
url = {https://www.aclweb.org/anthology/2020.aacl-main.82}
}
The goal of the entity recognition and relation extraction task is to discover relational structures of entity mentions from unstructured texts. It is a central problem in information extraction since it is critical for tasks such as knowledge base population and question answering. In this work, we focus on extending the training procedure of our newly proposed general purpose joint model [4] for entity recognition and relation extraction with adversarial training (AT) [2]. Our model performs the two tasks of entity recognition and relation extraction simultaneously. It achieves state-of-the-art performance in a number of different contexts (i.e., news, biomedical, real estate) and languages (i.e., English, Dutch) without relying on any manually engineered features or additional NLP tools. In summary, our proposed model: (i) does not rely on external NLP tools or hand-crafted features, (ii) extracts entities and relations within the same text fragment (typically a sentence) simultaneously, and (iii) allows an entity to be involved in multiple relations at once. To evaluate the proposed AT method, we perform the same set of experiments while we apply AT on top of our joint model. Compared to the baseline model, applying AT during training leads to a consistent additional increase in joint extraction effectiveness.
@inproceedings{bekoulis2019benelearn,
author = {Bekoulis, Giannis and Deleu, Johannes and Demeester, Thomas and Develder, Chris},
title = {Adversarial perturbations for joint entity and relation extraction},
booktitle = {Proc. 28th Belgian Dutch Conf. Machine Learn. (BeneLearn 2019)},
month = {6--8 Nov.},
year = {2019},
address = {Brussels, Belgium},
url = {http://ceur-ws.org/Vol-2491/abstract5.pdf}
}
The overall goal of neuro-symbolic computation is to integrate high-level reasoning with low-level perception. We argue (1) that neuro-symbolic computation should integrate neural networks with the two most prominent methods for reasoning, that is, logic and probability, and (2) that neuro-symbolic integrated methods should have the pure neural, logical and probabilistic methods as special cases. We examine the state-of-the-art with regard to these claims and briefly position our own contribution DeepProbLog in this perspective.
@inproceedings{deraedt2019,
author = {De Raedt, Luc and Manhaeve, Robin and Dumančić, Sebastijan and Demeester, Thomas and Kimmig, Angelika},
title = {Neuro-Symbolic = Neural + Logical + Probabilistic},
booktitle = {Proc. 14th Int. Workshop Neural-Symbolic Learn. and Reasoning (NeSy 2019 @ IJCAI 2019)},
month = {12 Aug.},
year = {2019},
address = {Macao, China}
}
Short text clustering is a challenging problem when adopting traditional bag-of-words or TF-IDF representations, since these lead to sparse vector representations of the short texts. Low-dimensional continuous representations or embeddings can counter that sparseness problem: their high representational power is exploited in deep clustering algorithms. While deep clustering has been studied extensively in computer vision, relatively little work has focused on NLP. The method we propose, learns discriminative features from both an autoencoder and a sentence embedding, then uses assignments from a clustering algorithm as supervision to update weights of the encoder network. Experiments on three short text datasets empirically validate the effectiveness of our method.
@inproceedings{hadifar2019repl4nlp,
author = {Hadifar, Amir and Sterckx, Lucas and Demeester, Thomas and Develder, Chris},
title = {A self-training approach for short text clustering},
booktitle = {Proc. 4th Workshop Represent. Learn. for NLP (RepL4NLP) at ACL 2019},
month = {2 Aug.},
year = {2019},
pages = {194--199},
address = {Florence, Italy},
url = {https://www.aclweb.org/anthology/papers/W/W19/W19-4322/}
}
This paper describes IDLab’s text classification systems submitted to Task A as part of the CLPsych 2019 shared task. The aim of this shared task was to develop automated systems that predict the degree of suicide risk of people based on their posts on Reddit. Bag-of-words features, emotion features and post-level predictions are used to derive user-level predictions. Linear models and ensembles of these models are used to predict final scores. We find that predicting fine-grained risk levels is much more difficult than flagging potentially at-risk users. Furthermore, we do not find clear added value from building richer ensembles compared to simple baselines, given the available training data and the nature of the prediction task.
@inproceedings{bitew2019clpsych,
author = {Bitew, Semere Kiros and Bekoulis, Giannis and Deleu, Johannes and Sterckx, Lucas and Zaporojets, Klim and Demeester, Thomas and Develder, Chris},
title = {Predicting suicide risk from online postings in Reddit: The UGent-IDLab submission to the CLPsych 2019 Shared Task A},
booktitle = {Proc. 6th Ann. Workshop on Comput. Ling. Clin. Psychol. (CLPsych 2019) at NAACL-HLT 2019},
month = {6 Jun.},
year = {2019},
pages = {158--161},
address = {Minneapolis, MN, USA},
url = {https://www.aclweb.org/anthology/papers/W/W19/W19-3019/},
doi = {10.18653/v1/W19-3019}
}
This paper introduces improved methods for sub-event detection in social media streams, by applying neural sequence models not only on the level of individual posts, but also directly on the stream level. Current approaches to identify sub-events within a given event (e.g., a goal during a soccer match), essentially do not exploit the sequential nature of social media streams. We address this shortcoming by framing the sub-event detection problem in social media streams as a sequence labeling task and adopt a neural sequence architecture that explicitly accounts for the chronological order of posts. Specifically, we (i) establish a neural baseline that outperforms a graph-based state-of-the-art method for binary sub-event detection (2.7% F1 improvement), as well as (ii) demonstrate superiority of a recurrent neural network model on the posts sequence level for labeled sub-events (2.4% F1 improvement over non-sequential models).
@inproceedings{Bekoulis2019NAACL,
author = {Bekoulis, Giannis and Deleu, Johannes and Demeester, Thomas and Develder, Chris},
title = {Sub-event detection from Twitter streams as a sequence labeling problem},
booktitle = {Proc. Ann. Conf. North American Chapter Assoc. Comp. Linguist. (NAACL-HLT 2019)},
month = {3--5 Jun.},
year = {2019},
pages = {745--750},
address = {Minneapolis, MN, USA},
url = {https://www.aclweb.org/anthology/papers/N/N19/N19-1081/},
doi = {10.18653/v1/N19-1081}
}
Adversarial training (AT) is a regularization method that can be used to improve the robustness of neural network methods by adding small perturbations in the training data. We show how to use AT for the tasks of entity recognition and relation extraction. In particular, we demonstrate that applying AT to a general purpose baseline model for jointly extracting entities and relations, allows improving the state-of-the-art effectiveness on several datasets in different contexts (i.e., news, biomedical, and real estate data) and for different languages (English and Dutch).
@inproceedings{bekoulis2018emnlp,
author = {Bekoulis, Giannis and Deleu, Johannes and Demeester, Thomas and Develder, Chris},
title = {Adversarial training for multi-context joint entity and relation extraction},
booktitle = {Proc. Conf. Empirical Methods in Natural Lang. Processing (EMNLP 2018)},
month = {31 Oct. -- 4 Nov.},
year = {2018},
pages = {2830--2836},
address = {Brussels, Belgium},
url = {https://www.aclweb.org/anthology/papers/D/D18/D18-1307/},
doi = {10.18653/v1/D18-1307}
}
Character-level features are currently used in different neural network-based natural language processing algorithms. However, little is known about the character-level patterns those models learn. Moreover, models are often compared only quantitatively while a qualitative analysis is missing. In this paper, we investigate which character-level patterns neural networks learn and if those patterns coincide with manually-defined word segmentations and annotations. To that end, we extend the contextual decomposition technique (Murdoch et al. 2018) to convolutional neural networks which allows us to compare convolutional neural networks and bidirectional long short-term memory networks. We evaluate and compare these models for the task of morphological tagging on three morphologically different languages and show that these models implicitly discover understandable linguistic rules.
@inproceedings{godin2018emnlp,
author = {Godin, Frederic and Demuynck, Kris and Dambre, Joni and De Neve, Wesley and Demeester, Thomas},
title = {Explaining character-aware neural networks for word-level prediction: Do they discover linguistic rules?},
booktitle = {Proc. Conf. Empirical Methods in Natural Lang. Processing (EMNLP 2018)},
month = {31 Oct. -- 4 Nov.},
year = {2018},
pages = {3275--3284},
address = {Brussels, Belgium},
url = {https://www.aclweb.org/anthology/D18-1365},
doi = {10.18653/v1/D18-1365}
}
Inducing sparseness while training neural networks has been shown to yield models with a lower memory footprint but similar effectiveness to dense models. However, sparseness is typically induced starting from a dense model, and thus this advantage does not hold during training. We propose techniques to enforce sparseness upfront in recurrent sequence models for NLP applications, to also benefit training. First, in language modeling, we show how to increase hidden state sizes in recurrent layers without increasing the number of parameters, leading to more expressive models. Second, for sequence labeling, we show that word embeddings with predefined sparseness lead to similar performance as dense embeddings, at a fraction of the number of trainable parameters.
@inproceedings{demeester2018conll,
author = {Demeester, Thomas and Deleu, Johannes and Godin, Frederic and Develder, Chris},
title = {Predefined sparseness in recurrent sequence models},
booktitle = {Proc. SIGNLL Conf. Comput. Lang. Learn. (CoNLL 2018)},
month = {31 Oct. -- 1 Nov.},
year = {2018},
pages = {324--333},
address = {Brussels, Belgium},
url = {https://www.aclweb.org/anthology/papers/K/K18/K18-1032/},
doi = {10.18653/v1/K18-1032}
}
Many Machine Reading and Natural Language Understanding tasks require reading supporting text in order to answer questions. For example, in Question Answering, the supporting text can be newswire or Wikipedia articles; in Natural Language Inference, premises can be seen as the supporting text and hypotheses as questions. Providing a set of useful primitives operating in a single framework of related tasks would allow for expressive modelling, and easier model comparison and replication. To that end, we present Jack the Reader (Jack), a framework for Machine Reading that allows for quick model prototyping by component reuse, evaluation of new models on existing datasets as well as integrating new datasets and applying them on a growing set of implemented baseline models. Jack currently supports (but is not limited to) three tasks: Question Answering, Natural Language Inference, and Link Prediction. It is developed with the aim of increasing research efficiency and code reuse.
@inproceedings{weissenborn2018,
author = {Weissenborn, Dirk and Minervini, Pasquale and Dettmers, Tim and Augenstein, Isabelle and Welbl, Johannes and Rocktäschel, Tim and Bosnjak, Matko and Mitchell, Jeff and Demeester, Thomas and Stenetorp, Pontus and Riedel, Sebastian},
title = {Jack the Reader - A machine reading framework},
booktitle = {Proc. 56th Annual Meeting Assoc. Comput. Ling. - Demos Track (ACL 2018)},
month = {15--20 Jul.},
year = {2018},
address = {Melbourne, Australia}
}
This paper describes the IDLab system submitted to Task A of the CLPsych 2018 shared task. The goal of this task is predicting psychological health of children based on language used in hand-written essays and socio-demographic control variables. Our entry uses word- and character-based features as well as lexicon-based features and features derived from the essays such as the quality of the language. We apply linear models, gradient boosting as well as neural-network based regressors (feed-forward, CNNs and RNNs) to predict scores. We then make ensembles of our best performing models using a weighted average.
@inproceedings{zaporojets2018clpsych,
author = {Zaporojets, Klim and Sterckx, Lucas and Deleu, Johannes and Demeester, Thomas and Develder, Chris},
title = {Predicting psychological health from childhood essays: The UGent-IDLab CLPsych 2018 shared task system},
booktitle = {Proc. 5th Ann. Workshop on Comput. Ling. Clin. Psychol. (CLPsych 2018) at NAACL-HLT 2018},
month = {5 Jun.},
year = {2018},
pages = {119--125},
address = {New Orleans, LA, USA},
url = {https://www.aclweb.org/anthology/papers/W/W18/W18-0613/},
doi = {10.18653/v1/W18-0613}
}
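The weighted-average ensembling step mentioned in the abstract above can be sketched as follows. This is a minimal illustration only: the function name `weighted_average_ensemble`, the example scores, and the weights are hypothetical, and how the weights were actually chosen in the submitted system is not specified here.

```python
from typing import List, Sequence


def weighted_average_ensemble(
    predictions: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Combine per-model score predictions with a weighted average.

    predictions[m][i] is the score of model m for example i.
    weights[m] is the (non-negative) weight of model m.
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_examples = len(predictions[0])
    if any(len(p) != n_examples for p in predictions):
        raise ValueError("all models must score the same examples")
    # Weighted mean of the model scores, computed per example.
    return [
        sum(w * p[i] for w, p in zip(weights, predictions)) / total
        for i in range(n_examples)
    ]


# Hypothetical scores from two models for two essays; the second model
# is trusted three times as much as the first.
ensemble = weighted_average_ensemble([[1.0, 2.0], [3.0, 4.0]], [1.0, 3.0])
```

Dividing by the sum of the weights keeps the ensemble output on the same scale as the individual model scores, whether or not the weights are normalised.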
In adversarial training, a set of models learn together by pursuing competing goals, usually defined on single data instances. However, in relational learning and other non-i.i.d. domains, goals can also be defined over sets of instances. For example, a link predictor for the is-a relation needs to be consistent with the transitivity property: if is-a(x_1, x_2) and is-a(x_2, x_3) hold, is-a(x_1, x_3) needs to hold as well. Here we use such assumptions for deriving an inconsistency loss, measuring the degree to which the model violates the assumptions on an adversarially-generated set of examples. The training objective is defined as a minimax problem, where an adversary finds the most offending adversarial examples by maximising the inconsistency loss, and the model is trained by jointly minimising a supervised loss and the inconsistency loss on the adversarial examples. This yields the first method that can use function-free Horn clauses (as in Datalog) to regularise any neural link predictor, with complexity independent of the domain size. We show that for several link prediction models, the optimisation problem faced by the adversary has efficient closed-form solutions. Experiments on link prediction benchmarks indicate that given suitable prior knowledge, our method can significantly improve neural link predictors on all relevant metrics.
@inproceedings{Minervini2017,
author = {Minervini, Pasquale and Demeester, Thomas and Rocktäschel, Tim and Riedel, Sebastian},
title = {Adversarial sets for regularising neural link predictors},
booktitle = {Proc. 33rd Conf. Uncertainty in Artificial Intelligence (UAI 2017)},
month = {11--15 Aug.},
year = {2017},
address = {Sydney, Australia}
}
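To make the transitivity example from the abstract above concrete, the sketch below scores how strongly a triple of link-prediction scores violates is-a(x_1, x_2) and is-a(x_2, x_3) implying is-a(x_1, x_3). It assumes scores in [0, 1], uses `min` to score the clause body, and applies a hinge; these choices, the function names, and the brute-force search over candidate triples are assumptions for illustration. The paper describes an adversary working on the model itself, with closed-form solutions for several link predictors, which this toy search does not reproduce.

```python
from typing import Iterable, Tuple

# (score of is-a(x1,x2), score of is-a(x2,x3), score of is-a(x1,x3))
Triple = Tuple[float, float, float]


def transitivity_inconsistency(s12: float, s23: float, s13: float) -> float:
    """Hinge-style violation of the transitivity clause.

    The penalty is positive only when the body (both premises) is scored
    higher than the head (the conclusion).
    """
    body = min(s12, s23)
    return max(0.0, body - s13)


def most_offending(candidates: Iterable[Triple]) -> Triple:
    """Toy adversary: pick the candidate with the largest violation."""
    return max(candidates, key=lambda t: transitivity_inconsistency(*t))


# Hypothetical score triples for three candidate sets of entities.
candidates = [(0.75, 1.0, 0.25), (0.5, 0.5, 0.5), (1.0, 1.0, 0.0)]
worst = most_offending(candidates)
penalty = transitivity_inconsistency(*worst)
```

In a minimax training loop, a penalty of this kind would be added to the supervised loss and minimised by the model, while the adversary keeps proposing the inputs that maximise it.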
Comprehending lyrics, as found in songs and poems, can pose a challenge to human and machine readers alike. This motivates the need for systems that can understand the ambiguity and jargon found in such creative texts, and provide commentary to aid readers in reaching the correct interpretation.
We introduce the task of automated lyric annotation (ALA). Like text simplification, a goal of ALA is to rephrase the original text in a more easily understandable manner. However, in ALA the system must often include additional information to clarify niche terminology and abstract concepts. To stimulate research on this task, we release a large collection of crowdsourced annotations for song lyrics. We analyze the performance of translation and retrieval models on this task, measuring performance with both automated and human evaluation. We find that each model captures a unique type of information important to the task.
@inproceedings{Sterckx2017EMNLP,
author = {Sterckx, Lucas and Naradowsky, Jason and Byrne, Bill and Demeester, Thomas and Develder, Chris},
title = {Break it down for me: A study in automated lyric annotation},
booktitle = {Proc. Conf. Empirical Methods in Natural Lang. Processing (EMNLP 2017)},
month = {7--11 Sep.},
year = {2017},
pages = {2064--2070},
address = {Copenhagen, Denmark},
url = {https://www.aclweb.org/anthology/papers/D/D17/D17-1220/},
doi = {10.18653/v1/D17-1220}
}
In this paper, we address the (to the best of our knowledge) new problem of extracting a structured description of real estate properties from their natural language descriptions in classifieds. We survey and present several models to (a) identify important entities of a property (e.g., rooms) from classifieds and (b) structure them into a tree format, with the entities as nodes and edges representing a part-of relation. Experiments show that a graph-based system deriving the tree from an initially fully connected entity graph, outperforms a transition-based system starting from only the entity nodes, since it better reconstructs the tree.
@inproceedings{Bekoulis2017EACL,
author = {Bekoulis, Giannis and Deleu, Johannes and Demeester, Thomas and Develder, Chris},
title = {Reconstructing the house from the ad: Structured prediction on real estate classifieds},
booktitle = {Proc. 15th Conf. Eur. Chapter Assoc. Comput. Ling. (EACL 2017), Vol. 2},
month = {3--7 Apr.},
year = {2017},
pages = {274--279},
address = {Valencia, Spain},
url = {https://www.aclweb.org/anthology/papers/E/E17/E17-2044/}
}
The problem of noisy and unbalanced training data for supervised keyphrase extraction results from the subjectivity of keyphrase assignment, which we quantify by crowdsourcing keyphrases for news and fashion magazine articles with many annotators per document. We show that annotators exhibit substantial disagreement, meaning that single annotator data could lead to very different training sets for supervised keyphrase extractors. Thus, annotations from single authors or readers lead to noisy training data and poor extraction performance of the resulting supervised extractor. We provide a simple but effective solution to still work with such data by reweighting the importance of unlabeled candidate phrases in a two-stage Positive Unlabeled Learning setting. We show that performance of trained keyphrase extractors approximates a classifier trained on articles labeled by multiple annotators, leading to higher average F1 scores and better rankings of keyphrases. We apply this strategy to a variety of test collections from different backgrounds and show improvements over strong baseline models.
@inproceedings{sterckx2016emnlp,
author = {Sterckx, Lucas and Caragea, Cornelia and Demeester, Thomas and Develder, Chris},
title = {Supervised keyphrase extraction as positive unlabeled learning},
booktitle = {Proc. Conf. Empirical Methods in Natural Lang. Proc. (EMNLP 2016)},
month = {1--5 Nov.},
year = {2016},
pages = {1924--1929},
address = {Austin, TX, USA},
url = {https://www.aclweb.org/anthology/papers/D/D16/D16-1198/},
doi = {10.18653/v1/D16-1198}
}
Methods based on representation learning currently hold the state-of-the-art in many natural language processing and knowledge base inference tasks. Yet, a major challenge is how to efficiently incorporate commonsense knowledge into such models. A recent approach regularizes relation and entity representations by propositionalization of first-order logic rules. However, propositionalization does not scale beyond domains with only few entities and rules. In this paper we present a highly efficient method for incorporating implication rules into distributed representations for automated knowledge base construction. We map entity-tuple embeddings into an approximately Boolean space and encourage a partial ordering over relation embeddings based on implication rules mined from WordNet. Surprisingly, we find that the strong restriction of the entity-tuple embedding space does not hurt the expressiveness of the model and even acts as a regularizer that improves generalization. By incorporating few commonsense rules, we achieve an increase of 2 percentage points mean average precision over a matrix factorization baseline, while observing a negligible increase in runtime.
@inproceedings{demeester2016emnlp,
author = {Demeester, Thomas and Rocktäschel, Tim and Riedel, Sebastian},
title = {Lifted rule injection for relation embeddings},
booktitle = {Proc. Conf. Empirical Methods in Natural Lang. Proc. (EMNLP 2016)},
month = {1--5 Nov.},
year = {2016},
pages = {1389--1399},
address = {Austin, TX, USA},
url = {https://www.aclweb.org/anthology/D16-1146},
doi = {10.18653/v1/D16-1146}
}
We present four training and prediction schedules from the same character-level recurrent neural network. The efficiency of these schedules is tested in terms of model effectiveness as a function of training time and amount of training data seen. We show that the choice of training and prediction schedule potentially has a considerable impact on the prediction effectiveness for a given training budget.
@inproceedings{deboom2016deml,
author = {De Boom, Cedric and Leroux, Sam and Bohez, Steven and Simoens, Pieter and Demeester, Thomas and Dhoedt, Bart},
title = {Efficiency evaluation of character-level RNN training schedules},
booktitle = {Proc. ICML 2016 Workshop Data Efficient Machine Learn. (DEML 2016)},
month = {24 Jun.},
year = {2016}
}
Methods for automated knowledge base construction often rely on trained fixed-length vector representations of relations and entities to predict facts. Recent work showed that such representations can be regularized to inject first-order logic formulae. This makes it possible to incorporate domain knowledge for improved prediction of facts, especially for uncommon relations. However, current approaches rely on propositionalization of formulae and thus do not scale to large sets of formulae or knowledge bases with many facts. Here we propose a method that imposes first-order constraints directly on relation representations, avoiding costly grounding of formulae. We show that our approach works well for implications between pairs of relations on artificial datasets.
@inproceedings{demeester2016akbc,
author = {Demeester, Thomas and Rocktäschel, Tim and Riedel, Sebastian},
title = {Regularizing relation representations by first-order implications},
booktitle = {Proc. 5th Workshop Autom. Knowl. Base Constr. (AKBC 2016)},
month = {17 Jun.},
year = {2016},
pages = {75--80},
address = {San Diego, CA, USA},
url = {https://www.aclweb.org/anthology/W16-1314},
doi = {10.18653/v1/W16-1314}
}
The searchability of video content is often limited to the descriptions authors and/or annotators care to provide. The level of description can range from absolutely nothing to fine-grained annotations at the level of frames. Based on these annotations, certain parts of the video content are more searchable than others.
Within the context of the STEAMER project, we developed an innovative end-to-end system that attempts to tackle the problem of unsupervised retrieval of news video content, leveraging multiple information streams and deep neural networks. In particular, we extracted keyphrases and named entities from transcripts, subsequently refining these keyphrases and named entities based on their visual appearance in the news video content. Moreover, to allow for fine-grained frame-level annotations, we temporally located high-confidence keyphrases in the news video content. To that end, we had to tackle challenges such as the automatic construction of training sets and the automatic assessment of keyphrase imageability.
In this paper, we discuss the main components of our end-to-end system, capable of transforming textual and visual information into fine-grained video annotations.
@inproceedings{vandersmissen2016,
author = {Vandersmissen, Baptist and Sterckx, Lucas and Demeester, Thomas and Jalalvand, Azarakhsh and De Neve, Wesley and Van de Walle, Rik},
title = {An automated end-to-end pipeline for fine-grained video annotation using deep neural networks},
booktitle = {Proc. ACM Int. Conf. Multimedia Retr. (ICMR 2016)},
month = {6--9 Jun.},
year = {2016},
address = {New York, NY, USA},
doi = {10.1145/2911996.2912028}
}
This paper presents the system of the UGENT IBCN team for the TAC KBP 2015 cold start (slot filling variant) task. This was the team’s second participation. The slot filling system uses distant supervision to generate training data combined with feature labeling and semi-supervision, and two different types of classifiers. We show that the noise reduction step significantly improves precision, and propose an application of word embeddings for slot filling.
@inproceedings{sterckx2015tac,
author = {Lucas Sterckx and Thomas Demeester and Johannes Deleu and Chris Develder},
title = {Ghent University-IBCN participation in the TAC KBP 2015 cold start slot filling task},
booktitle = {Proc. 8th Text Analysis Conf. (TAC 2015)},
month = {16--17 Nov.},
year = {2015},
address = {Gaithersburg, MD, USA},
url = {https://tac.nist.gov/publications/2015/participant.papers/TAC2015.UGENT_IBCN.proceedings.pdf}
}
Leveraging data on social media, such as Twitter and Facebook, requires information retrieval algorithms to be able to relate very short text fragments to each other. Traditional text similarity methods such as tf-idf cosine-similarity, based on word overlap, mostly fail to produce good results in this case, since there is little or no word overlap. Recently, distributed word representations, or word embeddings, have been shown to successfully allow words to match on the semantic level. In order to pair short text fragments (as a concatenation of separate words), an adequate distributed sentence representation is needed, in existing literature often obtained by naively combining the individual word representations. We therefore investigated several text representations as a combination of word embeddings in the context of semantic pair matching. This paper investigates the effectiveness of several such naive techniques, as well as traditional tf-idf similarity, for fragments of different lengths. Our main contribution is a first step towards a hybrid method that combines the strength of dense distributed representations (as opposed to sparse term matching) with the strength of tf-idf based methods to automatically reduce the impact of less informative terms. Our new approach outperforms the existing techniques in a toy experimental set-up, leading to the conclusion that the combination of word embeddings and tf-idf information might lead to a better model for semantic content within very short text fragments.
@inproceedings{deboom2015icdmw,
author = {De Boom, Cedric and Van Canneyt, Steven and Bohez, Steven and Demeester, Thomas and Dhoedt, Bart},
title = {Learning semantic similarity for very short texts},
booktitle = {Proc. IEEE Int. Conf. Data Min. Workshop (ICDMW 2015)},
month = {15--17 Nov.},
year = {2015},
address = {Atlantic City, NJ, USA},
doi = {10.1109/ICDMW.2015.86}
}
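As a rough illustration of combining word embeddings with tf-idf information, the sketch below represents a short text as the idf-weighted mean of its word vectors and compares texts by cosine similarity. This is a simple fixed weighting chosen for illustration, not the method of the paper above; the toy corpus, the two-dimensional embeddings, and all function names are hypothetical.

```python
import math


def idf_weights(corpus):
    """Smoothed inverse document frequency per word over a tokenized corpus."""
    n_docs = len(corpus)
    doc_freq = {}
    for doc in corpus:
        for word in set(doc):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {w: math.log(n_docs / c) + 1.0 for w, c in doc_freq.items()}


def weighted_mean_vector(tokens, embeddings, idf):
    """idf-weighted average of the word vectors of the known tokens."""
    dim = len(next(iter(embeddings.values())))
    acc = [0.0] * dim
    total = 0.0
    for word in tokens:
        if word in embeddings:
            weight = idf.get(word, 1.0)
            for i, value in enumerate(embeddings[word]):
                acc[i] += weight * value
            total += weight
    return [a / total for a in acc] if total else acc


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy data (hypothetical): "car" and "automobile" share no surface form.
corpus = [["the", "car"], ["the", "automobile"], ["the", "banana"]]
embeddings = {
    "the": [0.5, 0.5],
    "car": [1.0, 0.0],
    "automobile": [0.9, 0.1],
    "banana": [0.0, 1.0],
}
idf = idf_weights(corpus)
vectors = [weighted_mean_vector(doc, embeddings, idf) for doc in corpus]

if __name__ == "__main__":
    print(cosine(vectors[0], vectors[1]))  # car vs. automobile
    print(cosine(vectors[0], vectors[2]))  # car vs. banana
```

The frequent word "the" gets the lowest weight, so the content words dominate each text vector even though the first two texts only share that uninformative term.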
Information extraction (IE) systems discover structured information from natural language text, to enable much richer querying and data mining than possible directly over the unstructured text. Unfortunately, IE is generally a computationally expensive process, and hence improving its efficiency, so that it scales over large volumes of text, is of critical importance. State-of-the-art approaches for scaling the IE process focus on one text collection at a time. These approaches prioritize the extraction effort by learning keyword queries to identify the “useful” documents for the IE task at hand, namely, those that lead to the extraction of structured “tuples.” These approaches, however, do not attempt to predict which text collections are useful for the IE task (and hence merit further processing) and which ones will not contribute any useful output (and hence should be ignored altogether, for efficiency). In this paper, we focus on an especially valuable family of text sources, the so-called deep web collections, whose (remote) contents are only accessible via querying. Specifically, we introduce and study techniques for ranking deep web collections for an IE task, to prioritize the extraction effort by focusing on collections with substantial numbers of useful documents for the task. We study both (adaptations of) state-of-the-art resource selection strategies for distributed information retrieval, as well as IE-specific approaches. Our large-scale experimental evaluation over realistic deep web collections, and for several different IE tasks, shows the merits and limitations of the alternative families of approaches, and provides a roadmap for addressing this critically important building block for efficient, scalable information extraction.
@inproceedings{Barrio2015CIKM,
author = {Barrio, Pablo and Gravano, Luis and Develder, Chris},
title = {Ranking deep web text collections for scalable information extraction},
booktitle = {Proc. 24th ACM Int. Conf. Inf. Knowl. Management (CIKM 2015)},
month = {19--23 Oct.},
year = {2015},
pages = {153--162},
address = {Melbourne, Australia},
doi = {10.1145/2806416.2806581}
}
We explore how the unsupervised extraction of topic-related keywords benefits from combining multiple topic models. We show that averaging multiple topic models, inferred from different corpora, leads to more accurate keyphrases than when using a single topic model and other state-of-the-art techniques. The experiments confirm the intuitive idea that a prerequisite for the significant benefit of combining multiple models is that the models should be sufficiently different, i.e., they should provide distinct contexts in terms of topical word importance.
@inproceedings{Sterckx2015WWWa,
author = {Lucas Sterckx and Thomas Demeester and Johannes Deleu and Chris Develder},
title = {When topic models disagree: Keyphrase extraction with multiple topic models},
booktitle = {Proc. 24th Int. World Wide Web Conf. (WWW 2015)},
month = {18--22 May},
year = {2015},
pages = {123--124},
address = {Florence, Italy},
doi = {10.1145/2740908.2742731}
}
We propose an improvement on a state-of-the-art keyphrase extraction algorithm, Topical PageRank (TPR), incorporating topical information from topic models. While the original algorithm requires a random walk for each topic in the topic model being used, ours is independent of the topic model, computing only a single PageRank for each text regardless of the number of topics in the model. This drastically increases the speed and enables its use on large collections of text with vast topic models, without altering the performance of the original algorithm.
@inproceedings{Sterckx2015WWWb,
author = {Lucas Sterckx and Thomas Demeester and Johannes Deleu and Chris Develder},
title = {Topical word importance for fast keyphrase extraction},
booktitle = {Proc. 24th Int. World Wide Web Conf. (WWW 2015)},
month = {18--22 May},
year = {2015},
pages = {121--122},
address = {Florence, Italy},
doi = {10.1145/2740908.2742730}
}
This paper presents ‘FedWeb Greatest Hits’, a large new test collection for research in web information retrieval. As a combination and extension of the datasets used in the TREC Federated Web Search Track, this collection opens up new research possibilities on federated web search challenges, as well as on various other problems.
@inproceedings{demeester2015fedweb,
author = {Demeester, Thomas and Trieschnigg, Dolf and Zhou, Ke and Nguyen, Dong and Hiemstra, Djoerd},
title = {FedWeb greatest hits: Presenting the new test collection for federated web search},
booktitle = {Proc. 24th Int. World Wide Web Conf. (WWW 2015)},
month = {18--22 May},
year = {2015},
address = {Florence, Italy},
doi = {10.1145/2740908.2742755}
}
The use of external databases to generate training data, also known as Distant Supervision, has become an effective way to train supervised relation extractors but this approach inherently suffers from noise. In this paper we propose a method for noise reduction in distantly supervised training data, using a discriminative classifier and semantic similarity between the contexts of the training examples. We describe an active learning strategy which exploits hierarchical clustering of the candidate training samples. To further improve the effectiveness of this approach, we study the use of several methods for dimensionality reduction of the training samples. We find that semantic clustering of training data combined with cluster-based active learning allows filtering the training data, hence facilitating the creation of a clean training set for relation extraction, at a reduced manual labeling cost.
@inproceedings{Sterckx2014AKBC,
author = {Sterckx, Lucas and Demeester, Thomas and Deleu, Johannes and Develder, Chris},
title = {Using semantic clustering and active learning for noise reduction in distant supervision},
booktitle = {Proc. 4th Workshop on Automated Knowledge Base Construction (AKBC 2014) at NIPS 2014},
month = {13 Dec.},
year = {2014},
address = {Montreal, Canada}
}
Research on evaluation of IR systems has led to the insight that a robust evaluation strategy requires tests on a large number of events/queries. However, especially for event detection, the number of manually labeled events may be limited. In this paper we investigate how to optimize the evaluation strategy in those cases to maximize robustness. We also introduce two new vector space models for event detection that aim to incorporate bursty information of terms and compare these with existing models. Our experiments show that by using graded relevance levels we can reduce the impact of subjectivity and ambiguity of event detection evaluation. We also show that although user disagreement is significant, it has no real impact on the ranking of the results.
@inproceedings{Feys2014FIRE,
author = {Feys, Matthias and Demeester, Thomas and Fortuna, Blaz and Deleu, Johannes and Develder, Chris},
title = {On the robustness of event detection evaluation: A case study},
booktitle = {Proc. Forum for Inf. Retr. Evaluation (FIRE 2014)},
month = {5--7 Dec.},
year = {2014},
address = {Bangalore, India}
}
This paper analyzes two important conditions that are usually taken for granted in the evaluation of information retrieval systems: the test queries should be representative for the intended application scenario, and a sufficient number of queries is needed to robustly assess system performance, as well as discern performance differences between systems. Both issues have important consequences, as studied in this paper for the specific case of Entity Linking systems. We investigate two methods for automatic query generation, and show them to have a vast impact on evaluated system performance. We further demonstrate the effect a query set’s size has on its ability to faithfully distinguish systems, and propose a method for assessing the possible impact on system performance that adding a specific number of queries to the set might have.
@inproceedings{Mertens2014FIRE,
author = {Mertens, Laurent and Demeester, Thomas and Deleu, Johannes and Feys, Matthias and Develder, Chris},
title = {Entity linking: Test collections revisited},
booktitle = {Proc. Forum for Inf. Retr. Evaluation (FIRE 2014)},
month = {5--7 Dec.},
year = {2014},
address = {Bangalore, India}
}
This paper presents the system of the UGENT IBCN team for the TAC KBP 2014 slot filling and cold start (slot filling variant) tasks. This was the team’s first participation in both tasks. The slot filling system uses distant supervision to generate training data combined with a noise reduction step, and two different types of classifiers. We show that the noise reduction step significantly improves precision, and propose an application of word embeddings for slot filling.
@inproceedings{Feys2014TAC,
author = {Feys, Matthias and Sterckx, Lucas and Mertens, Laurent and Deleu, Johannes and Demeester, Thomas and Develder, Chris},
title = {Ghent University-IBCN participation in TAC-KBP 2014 slot filling and cold start tasks},
booktitle = {Proc. 7th Text Analysis Conf. (TAC 2014)},
month = {17--18 Nov.},
year = {2014},
address = {Gaithersburg, MD, USA}
}
Selecting and aggregating different types of content from multiple vertical search engines is becoming popular in web search. The user vertical intent, the verticals the user expects to be relevant for a particular information need, might not correspond to the vertical collection relevance, the verticals containing the most relevant content. In this work we propose different approaches to define the set of relevant verticals based on document judgments. We correlate the collection-based relevant verticals obtained from these approaches to the real user vertical intent, and show that they can be aligned relatively well. The set of relevant verticals defined by those approaches could therefore serve as an approximate but reliable ground-truth for evaluating vertical selection, avoiding the need for collecting explicit user vertical intent, and vice versa.
@inproceedings{zhou2014,
author = {Zhou, Ke and Demeester, Thomas and Nguyen, Dong and Hiemstra, Djoerd and Trieschnigg, Dolf},
title = {Aligning vertical collection relevance with user intent},
booktitle = {Proc. 23rd ACM Int. Conf. Inf. Knowl. Management (CIKM 2014)},
month = {3--7 Nov.},
year = {2014},
pages = {1915--1918},
address = {Shanghai, China},
doi = {10.1145/2661829.2661941}
}
The TREC Federated Web Search track facilitates research on federated web search, by providing a large realistic data collection sampled from a multitude of online search engines. The FedWeb 2013 Resource Selection and Results Merging tasks are again included in FedWeb 2014, and we additionally introduced the task of vertical selection. Other new aspects are the required link between the Resource Selection and Results Merging tasks, and the importance of diversity in the merged results. After an overview of the new data collection and relevance judgments, the individual participants’ results for the tasks are introduced, analyzed, and compared.
@inproceedings{demeester2014trec,
author = {Demeester, Thomas and Trieschnigg, Dolf and Nguyen, Dong and Zhou, Ke and Hiemstra, Djoerd},
title = {Overview of the TREC 2014 federated web search track},
booktitle = {Proc. 23rd Text Retr. Conf. (TREC 2014)},
month = {19--21 Nov.},
year = {2014},
address = {Gaithersburg, MD, USA},
url = {https://trec.nist.gov/pubs/trec23/papers/overview-federated.pdf}
}
Understanding and reasoning about textual data is one of the important topics in artificial intelligence and is being addressed by various research communities, ranging from knowledge representation, over natural language processing to text mining. Each community provides a different set of often overlapping intuitions, tools and methodologies for working with text.
Event processing from news and social media can be seen as a subtopic of text understanding [1, 2, 3, 4]. It comprises different tasks, including New and Retrospective Event Discovery, Event Type Classification and Event Template Extraction.
Research on event processing requires access to annotated data covering different tasks in the event processing pipeline. Over the last decades, several datasets have been created covering event discovery and event extraction, e.g., [5, 6, 7]. These datasets are rather limited in scope. For example, they contain articles from only few selected sources, or they contain a limited number of annotated events with a high selection bias (e.g., towards larger or well defined events like natural disasters or terrorism). Using such limited datasets to evaluate solutions for the event processing tasks may lead to favoring approaches that do not work well on real-world datasets. The main reasons for these limitations are (1) limited access to data resources and (2) the required and expensive manual annotations.
The main contributions we are working towards are (1) a systematic methodology for efficiently creating a large gold standard of manually annotated events over a large corpus of news articles with a realistic distribution over the covered topics and events, and (2) a resulting annotated corpus of 10,000 English general news articles embedded in 31 million news articles.
@inproceedings{Fortuna2014,
author = {Fortuna, Blaz and Demeester, Thomas and Develder, Chris},
title = {Towards large-scale event detection and extraction from news},
booktitle = {Proc. Large-scale Online Learn. and Decision Making Workshop (LSOLDM 2014)},
month = {10--12 Sep.},
year = {2014},
address = {Windsor, UK}
}
How useful are topic models based on song lyrics for applications in music information retrieval? Unsupervised topic models on text corpora are often difficult to interpret. Based on a large collection of lyrics, we investigate how well automatically generated topics are related to manual topic annotations. We propose to use the kurtosis metric to align unsupervised topics with a reference model of supervised topics. This metric is well-suited for topic assessments, as it turns out to be more strongly correlated with manual topic quality scores than existing measures for semantic coherence. We also show how it can be used for a detailed graphical topic quality assessment.
@inproceedings{Sterckx2014ECIR,
author = {Sterckx, Lucas and Demeester, Thomas and Deleu, Johannes and Mertens, Laurent and Develder, Chris},
title = {Assessing quality of unsupervised topics in song lyrics},
booktitle = {Proc. 36th Eur. Conf. Inf. Retr. (ECIR 2014)},
month = {13--16 Apr.},
year = {2014},
address = {Amsterdam, The Netherlands},
doi = {10.1007/978-3-319-06028-6_55}
}
The task of the SNOW 2014 Data Challenge is to mine Twitter streams to provide journalists a set of headlines and complementary information that summarize the most newsworthy topics for a number of given time intervals. We propose a 4-step approach to solve this. First, a classifier is trained to determine whether a Twitter user is likely to post tweets about newsworthy stories. Second, tweets posted by these users during the time interval of interest are clustered into topics. For this clustering, the cosine similarity between a boosted tf-idf representation of the tweets is used. Third, we use a classifier to estimate the confidence that the obtained topics are newsworthy. Finally, for each obtained newsworthy topic, a descriptive headline is generated together with relevant keywords, tweets and pictures. Experimental results show the effectiveness of the proposed methodology.
@inproceedings{vancanneyt2014snow,
author = {Van Canneyt, Steven and Feys, Matthias and Schockaert, Steven and Demeester, Thomas and Develder, Chris and Dhoedt, Bart},
title = {Detecting newsworthy topics in Twitter},
booktitle = {Proc. 2nd Workshop on Social News on the Web at WWW 2014 (SNOW 2014)},
month = {8 Apr.},
year = {2014},
address = {Seoul, Korea}
}
In order to express a more nuanced notion of relevance as compared to binary judgments, graded relevance levels can be used for the evaluation of search results. Especially in Web search, users strongly prefer top results over less relevant results, and yet they often disagree on which are the top results for a given information need. This paper proposes a method to capture this user disagreement and integrate it into the evaluation procedure.
First, we present experiments that investigate the user disagreement. After that, a probabilistic model is proposed that results in a weighting of the relevance levels with a probabilistic interpretation. This is followed by a validity analysis, and an explanation of how to integrate the model with well-established evaluation metrics. Finally, we discuss a specific application of the model, in the estimation of suitable combined page and snippet relevance weights from Web search assessments.
@inproceedings{Demeester2014WSDM,
author = {Demeester, Thomas and Robin Aly and Djoerd Hiemstra and Dong Nguyen and Dolf Trieschnigg and Chris Develder},
title = {Exploiting user disagreement for web search evaluation: An experimental approach},
booktitle = {Proc. 7th ACM Int. Conf. Web Search and Data Min. (WSDM 2014)},
month = {24--28 Feb.},
year = {2014},
address = {New York, NY, USA},
note = {Acceptance rate: 17% (64/376)},
doi = {10.1145/2556195.2556268}
}
This article describes the system used by the UGent-IBCN team for participating in the Text Analysis Conference (TAC) 2013 English Entity-Linking task. We kept the overall rule-based workflow of last year’s submission, but significantly altered individual components. Most importantly, these changes include improved document pre-processing, new ways of candidate selection, and completely redesigned scoring and NIL-detection mechanisms. Finally, we provide detailed data of our system’s performance.
@inproceedings{Mertens13_TAC,
author = {Mertens, Laurent and Demeester, Thomas and Deleu, Johannes and Develder, Chris},
title = {UGent participation in the TAC 2013 entity-linking task},
booktitle = {Proc. Text Analysis Conference (TAC 2013)},
month = {18--19 Nov.},
year = {2013},
pages = {1--12},
address = {Gaithersburg, MD, USA},
url = {https://tac.nist.gov/publications/2013/participant.papers/UGENT_IBCN.TAC2013.proceedings.pdf}
}
The TREC Federated Web Search track is intended to promote research related to federated search in a realistic web setting, and hereto provides a large data collection gathered from a series of online search engines. This overview paper discusses the results of the first edition of the track, FedWeb 2013. The focus was on basic challenges in federated search: (1) resource selection, and (2) results merging. After an overview of the provided data collection and the relevance judgments for the test topics, the participants’ individual approaches and results on both tasks are discussed. Promising research directions and an outlook on the 2014 edition of the track are provided as well.
@inproceedings{demeester2013trec,
author = {Demeester, Thomas and Trieschnigg, Dolf and Nguyen, Dong and Zhou, Ke and Hiemstra, Djoerd},
title = {Overview of the TREC 2013 federated web search track},
booktitle = {Proc. 22nd Text Retr. Conf. (TREC 2013)},
month = {19--22 Nov.},
year = {2013},
address = {Gaithersburg, MD, USA},
url = {https://trec.nist.gov/pubs/trec22/papers/FEDERATED.OVERVIEW.pdf}
}
We describe the participation of the Lowlands at the Web Track and the FedWeb track of TREC 2013. For the Web Track we used the Mirex Map-Reduce library with out-of-the-box approaches and for the FedWeb Track we adapted our shard selection method Taily for resource selection. Here, our results were above median and close to the maximum performance achieved.
@inproceedings{aly2013,
author = {Aly, Robin and Hiemstra, Djoerd and Trieschnigg, Dolf and Demeester, Thomas},
title = {Mirex and Taily at TREC 2013},
booktitle = {Proc. 22nd Text Retr. Conf. (TREC 2013)},
month = {19--22 Nov.},
year = {2013},
address = {Gaithersburg, MD, USA},
url = {https://trec.nist.gov/pubs/trec22/papers/lowlands-web-federated.pdf}
}
Search engines can improve their efficiency by selecting only few promising shards for each query. State-of-the-art shard selection algorithms first query a central index of sampled documents, and their effectiveness is similar to searching all shards. However, the search in the central index also hurts efficiency. Additionally, we show that the effectiveness of these approaches varies substantially with the sampled documents. This paper proposes Taily, a novel shard selection algorithm that models a query's score distribution in each shard as a Gamma distribution and selects shards with highly scored documents in the tail of the distribution. Taily estimates the parameters of score distributions based on the mean and variance of the score function's features in the collections and shards. Because Taily operates on term statistics instead of document samples, it is efficient and has deterministic effectiveness. Experiments on large web collections (Gov2, CluewebA and CluewebB) show that Taily achieves similar effectiveness to sample-based approaches, and improves upon their efficiency by roughly 20% in terms of used resources and response time.
@inproceedings{aly2013sigir,
author = {Aly, Robin and Hiemstra, Djoerd and Demeester, Thomas},
title = {Taily: Shard selection using the tail of score distributions},
booktitle = {Proc. 36th Int. ACM SIGIR Conf. Research},
month = {28 Jul.--1 Aug.},
year = {2013},
pages = {673--682},
address = {Dublin, Ireland},
doi = {10.1145/2484028.2484033}
}
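The tail-based selection idea described in the abstract above can be sketched as follows. The sketch fits a Gamma distribution to per-shard score statistics by the method of moments and estimates how many documents in each shard score above a cutoff. How the mean and variance are derived from term statistics, and how the cutoff and the selection threshold are chosen, are not given in the abstract, so the inputs, the threshold `min_expected`, and all function names are illustrative assumptions.

```python
import math


def gamma_cdf(x, shape, scale):
    """Gamma CDF via the series expansion of the regularized lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    z = x / scale
    term = 1.0 / shape
    total = term
    n = 1
    while term > 1e-12 * total and n < 10000:
        term *= z / (shape + n)
        total += term
        n += 1
    log_prefix = shape * math.log(z) - z - math.lgamma(shape)
    return min(1.0, math.exp(log_prefix) * total)


def fit_gamma(mean, variance):
    """Method-of-moments estimates: shape = mean^2 / variance, scale = variance / mean."""
    return mean * mean / variance, variance / mean


def expected_tail_count(n_docs, mean, variance, cutoff):
    """Expected number of documents in a shard scoring above the cutoff."""
    shape, scale = fit_gamma(mean, variance)
    return n_docs * (1.0 - gamma_cdf(cutoff, shape, scale))


def select_shards(shard_stats, cutoff, min_expected):
    """Keep shards expected to hold at least min_expected documents in the tail."""
    return [
        name
        for name, (n_docs, mean, variance) in shard_stats.items()
        if expected_tail_count(n_docs, mean, variance, cutoff) >= min_expected
    ]


# Hypothetical per-shard statistics for one query: (matching documents, score mean, score variance)
shard_stats = {"A": (1000, 2.0, 4.0), "B": (1000, 1.0, 1.0)}

if __name__ == "__main__":
    for name, (n_docs, mean, variance) in shard_stats.items():
        print(name, expected_tail_count(n_docs, mean, variance, cutoff=4.0))
    print(select_shards(shard_stats, cutoff=4.0, min_expected=50.0))
```

Since only summary statistics are needed per shard, no central sample index has to be queried, which is the efficiency argument made in the abstract.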
How well can the relevance of a page be predicted, purely based on snippets? This would be highly useful in a Federated Web Search setting where caching large amounts of result snippets is more feasible than caching entire pages. The experiments reported in this paper make use of result snippets and pages from a diverse set of actual Web search engines. A linear classifier is trained to predict the snippet-based user estimate of page relevance, but also, to predict the actual page relevance, again based on snippets alone. The presented results confirm the validity of the proposed approach and provide promising insights into future result merging strategies for a Federated Web Search setting.
@inproceedings{Demeester2013ECIR,
author = {Demeester, Thomas and Nguyen, Dong and Trieschnigg, Dolf and Develder, Chris and Hiemstra, Djoerd},
title = {Snippet-based relevance predictions for federated web search},
booktitle = {Proc. 35th Eur. Conf. Inf. Retr. (ECIR 2013)},
month = {24--27 Mar.},
year = {2013},
pages = {697--700},
address = {Moscow, Russia},
doi = {10.1007/978-3-642-36973-5_63}
}
What is the likelihood that a Web page is considered relevant to a query, given the relevance assessment of the corresponding snippet? Using a new federated IR test collection that contains search results from over a hundred search engines on the internet, we are able to investigate such research questions from a global perspective. Our test collection covers the main Web search engines like Google, Yahoo!, and Bing, as well as a number of smaller search engines dedicated to multimedia, shopping, etc., and as such reflects a realistic Web environment.
Using a large set of relevance assessments, we are able to investigate the connection between snippet quality and page relevance. The dataset is strongly inhomogeneous, and although the assessors’ consistency is shown to be satisfying, care is required when comparing resources. To this end, a number of probabilistic quantities, based on snippet and page relevance, are introduced and evaluated.
@inproceedings{Demeester2012AIRS,
author = {Demeester, Thomas and Dong Nguyen and Dolf Trieschnigg and Chris Develder and Djoerd Hiemstra},
title = {What snippets say about pages in federated web search},
booktitle = {Proc. 8th Asia Inf. Retr. Soc. Conf. (AIRS 2012)},
month = {17--19 Dec.},
year = {2012},
address = {Tianjin, China},
doi = {10.1007/978-3-642-35341-3_21}
}
This article describes in detail the system used by the UGent-IBCN team for participating in the Text Analysis Conference (TAC) 2012 Mono-Lingual Entity-Linking task. The presented system is essentially rule-based, following a generic framework that is highly optimised for each label (i.e. with different rules for persons, organisations, and locations). The main contribution of this work is in identifying a number of label-specific issues and presenting simple heuristic solutions that yet allow building an efficient and effective system. These treated issues include resolving abbreviated organisation names, resolving popular nicknames, or taking into account American vs British spelling.
@inproceedings{Mertens2012TAC,
author = {Mertens, Laurent and Demeester, Thomas and Deleu, Johannes and Demeester, Piet and Develder, Chris},
title = {UGent participation in the TAC 2012 entity-linking task},
booktitle = {Proc. 5th Text Analysis Conf. (TAC 2012)},
month = {14--15 Nov.},
year = {2012},
address = {Gaithersburg, MD, USA}
}
In this paper, we describe the search system developed at Ghent University for the TREC 2012 Microblog Track, which ranks Twitter messages or ‘tweets’ from a fixed corpus in response to a number of search requests. Our system ranks the tweets based on a Logistic Regression classifier trained with data from the Microblog Track 2011. The features used for training the classifier include local tweet features as well as query expansion and tweet expansion features based on external Web data, which appear to significantly improve results.
@inproceedings{VanDuc2012TREC,
author = {Van Duc, Thong Hoang and Demeester, Thomas and Deleu, Johannes and Develder, Chris},
title = {UGent participation in the Microblog Track 2012},
booktitle = {Proc. Text Retr. Conf. (TREC 2012)},
month = {6--9 Nov.},
year = {2012},
address = {Gaithersburg, MD, USA}
}
Federated search has the potential of improving web search: the user becomes less dependent on a single search provider and parts of the deep web become available through a unified interface, leading to a wider variety in the retrieved search results. However, a publicly available dataset for federated search reflecting an actual web environment has been absent. As a result, it has been difficult to assess whether proposed systems are suitable for the web setting. We introduce a new test collection containing the results from more than a hundred actual search engines, ranging from large general web search engines such as Google and Bing to small domain-specific engines. We discuss the design and analyze the effect of several sampling methods. For a set of test queries, we collected relevance judgements for the top 10 results of each search engine. The dataset is publicly available and is useful for researchers interested in resource selection for web search collections, result merging and size estimation of uncooperative resources.
@inproceedings{nguyen2012cikm,
author = {Nguyen, Dong and Demeester, Thomas and Trieschnigg, Dolf and Hiemstra, Djoerd},
title = {Federated search in the wild: The combined power of over a hundred search engines},
booktitle = {Proc. 21st ACM Int. Conf. Inf. Knowl. Management (CIKM 2012)},
month = {29 Oct.--2 Nov.},
year = {2012},
pages = {1874--1878},
address = {Maui, HI, USA},
doi = {10.1145/2396761.2398535}
}
This paper describes a number of specific issues that we needed to deal with in order to build an accurate Named Entity Recognition tool for multimedia archives in Dutch. The considered data consists of archival metadata from video collections, and large newspaper collections. For the video collections, the main challenge is to cope with a lack of capitalization in the metadata. To this end, specific capitalization features are calculated from Wikipedia. For the newspaper collections, the main concern is to create a system that maintains its performance over the course of many years. For that goal, special clustering features allow dealing with words that have not been encountered in training data. Results for the different components of the tool are reported on the target data, as well as on publicly available test data.
@inproceedings{deleu2012dir,
author = {Deleu, Johannes and De Moor, An and Demeester, Thomas and Vermeulen, Brecht and Demeester, Piet},
title = {Named entity recognition on Flemish audio-visual and news-paper archives},
booktitle = {Proc. 12th Dutch-Belgian Inf. Retr. Workshop (DIR 2012)},
month = {23--24 Feb.},
year = {2012},
pages = {38--41},
address = {Ghent, Belgium}
}
In modern automated information extraction systems, Named Entity Disambiguation (NED) techniques are becoming increasingly important. The ambiguity of person names leads to a decrease in the output quality of search engines. This paper presents a two-stage rule-based NED model, based on a local and global context of the mentioned persons. A number of experiments with different scoring functions are reported, as well as a specific evaluation method to estimate the efficiency of the model on a real-life data collection in an unsupervised way.
@inproceedings{Mertens2012,
author = {Mertens, Laurent and Demeester, Thomas and Deleu, Johannes and Develder, Chris and Demeester, Piet},
title = {Context-based person identification for news collections},
booktitle = {Proc. 12th Dutch-Belgian Inf. Retr. Workshop (DIR 2012)},
month = {23--24 Feb.},
year = {2012},
pages = {26--29},
address = {Ghent, Belgium}
}
We designed a role ontology-enhanced multimedia search engine where the user can search and subsequently filter news items with queries and filter options describing the roles of the people who appear in the items, specifically politicians. The system makes use of a separate knowledge base with domain information on politics. We demonstrate that when a user fails to recollect the name of a politician, role-based queries combined with filter options tailored to the query and the result set quickly lead the user to both the name they failed to recollect and the intended results in the multimedia database.
@inproceedings{vandamme2012dir,
author = {Vandamme, Stijn and Wauters, Tim and Demeester, Thomas and De Turck, Filip},
title = {Implementation and evaluation of query filtering in a role ontology-enhanced search engine},
booktitle = {Proc. 12th Dutch-Belgian Inf. Retr. Workshop (DIR 2012)},
month = {23--24 Feb.},
year = {2012},
pages = {34--37},
address = {Ghent, Belgium}
}
In order to allow for flexible search and asset management on the textual metadata of multimedia archives, the extraction of information and especially named entities is an essential step. In practice, named entities are of great help for applications like faceted search, input assistance, search suggestions, linking assets, etc. This paper describes MediaHaven, a Media Asset Management (MAM) system, commercialized by Zeticon, a spin-off of Ghent University-IBBT. MediaHaven incorporates an advanced NER and categorisation system to improve the user experience.
@inproceedings{vandenbossche2012dir,
author = {Van Den Bossche, Bruno and Vermeulen, Brecht and Deleu, Johannes and Demeester, Thomas and Demeester, Piet},
title = {MediaHaven: Multimedia asset management with integrated NER and categorization},
booktitle = {Proc. 12th Dutch-Belgian Inf. Retr. Workshop (DIR 2012)},
month = {23--24 Feb.},
year = {2012},
pages = {85--86},
address = {Ghent, Belgium}
}
This paper focuses on an automatic and accurate approach for finding similar users in social networks. Many types of social networks could benefit from such techniques, but the focus in this paper is on online photo services. The similarity between users needs to be considered on two different levels, i.e., the semantic similarity (or correspondence in tagging behavior), and the similarity in terms of social relations. In recent work, heuristic formulas were introduced for the tag commonness (TC) and the link strength (LS), with an adaptive combination scheme to describe how relevant each of these similarity aspects is for particular users, in order to define the user similarity. This paper presents an experiment in which a learning-to-rank approach is used to find suitable combinations of TC and LS related parameter values, hence taking into account the proficiency of users in tagging their photos, and their noticeability in the online community, in order to obtain an overall user similarity. The user experiments show that the results with this learning-to-rank approach are significantly better than with the earlier heuristic approach.
@inproceedings{VanDuc2012DIR,
author = {Van Duc, Thong Hoang and Demeester, Thomas and Develder, Chris and Shin, Hyoseop},
title = {Effectiveness of learning to rank for finding user similarity in social media},
booktitle = {Proc. 12th Dutch-Belgian Inf. Retr. Workshop (DIR 2012)},
month = {23--24 Feb.},
year = {2012},
pages = {30--33},
address = {Ghent, Belgium}
}
Probability of relevance (PR) models are generally assumed to implement the Probability Ranking Principle (PRP) of IR, and recent publications claim that PR models and language models are similar. However, a careful analysis reveals two gaps in the chain of reasoning behind this statement. First, the PRP considers the relevance of particular documents, whereas PR models consider the relevance of any query-document pair. Second, unlike PR models, language models consider draws of terms and documents. We bridge the first gap by showing how the probability measure of PR models can be used to define the probabilistic model of the PRP. Furthermore, we argue that given the differences between PR models and language models, the second gap cannot be bridged at the probabilistic model level. We instead define a new PR model based on logistic regression, whose score function is similar to that of the query likelihood model. The performance of both models is strongly correlated, hence providing a bridge for the second gap at the functional and ranking level. Understanding language models in relation to logistic regression models opens ample new research directions, which we propose as future work.
@inproceedings{aly2011,
author = {Aly, Robin and Demeester, Thomas},
title = {Towards a better understanding of the relationship between probabilistic models in IR},
booktitle = {Proc. 3rd Int. Conf. Theory Inf. Retr. (ICTIR 2011)},
month = {12--14 Sep.},
year = {2011},
pages = {164--175},
address = {Bertinoro, Italy},
doi = {10.1007/978-3-642-23318-0_16}
}