MoDeST: A dataset for Multi Domain Scientific Title Generation

Research output: Contribution to journal, Article, peer-review

Abstract

Title generation is a crucial task in scientific writing: the title is the first point of contact for readers and affects an article's visibility and citation frequency. In this paper, we present a novel multi-domain and multilingual dataset, MoDeST (Multi-Domain Scientific Title Generation), designed to advance research in scientific title generation. Unlike existing datasets, which are often limited to a single language or domain, MoDeST includes both English and Turkish titles across various academic disciplines, such as social sciences, medical science, and science and engineering. We explore the challenges of generating concise and informative titles from different input sources (keywords, abstract, and full article) and evaluate the performance of large language models (LLMs) in zero-shot, few-shot, and supervised fine-tuning (SFT) settings. We conduct a human evaluation and calculate correlations between human judgement and automatic metrics, identifying the most effective evaluation metric for assessing title quality. Experiments show that fine-tuning significantly improves LLM performance for title generation, with LLaMA-3.1 8B, LLaMA-3.1 70B, Aya-expanse 8B, and Aya-expanse 32B achieving scores of 40.12, 45.21, 45.31, and 47.22 for Turkish, and 45.10, 49.10, 40.02, and 48.54 for English, respectively. Moreover, we find that the abstract is the most effective input source for generating titles. Additionally, we analyse domain-specific challenges and the impact of cross-lingual generation, highlighting the need for tailored models for different domains. With its broad representation, our dataset is applicable across various academic disciplines, which enhances its utility for multi-domain and multilingual title generation while also benefiting the broader NLP community and related tasks.

Original language: English
Article number: 113557
Journal: Knowledge-Based Systems
Volume: 321
Publication status: Published - 28 Jun 2025

Keywords

  • Dataset
  • LLM
  • Multi-domain
  • Multilingual
  • Scientific Title Generation
