Title

Evaluation of Text Cluster Naming with Generative Large Language Models

DOI

10.6339/24-JDS1149

Authors

Alexander J. Preiss;Caren A. Arbeit;Anthony Berghammer;John Bollenbacher;John V. McCarthy;Madeline G. Brom;Mike Enger;Nicholas Rios Villacorta;Shaquavia Straughn

Keywords

cluster profiling ; large language model ; natural language processing ; text clustering ; topic modeling ; unsupervised learning

Journal

Journal of Data Science

Volume and Issue / Publication Date

Vol. 22, No. 3 (2024/07/01)

Pages

376 - 392

Language

English

Abstract

Text clustering can streamline many labor-intensive tasks, but it creates a new challenge: efficiently labeling and interpreting the clusters. Generative large language models (LLMs) are a promising option to automate the process of naming text clusters, which could significantly streamline workflows, especially in domains with large datasets and esoteric language. In this study, we assessed the ability of GPT-3.5-turbo to generate names for clusters of texts and compared these to human-generated text cluster names. We clustered two benchmark datasets, each from a specialized domain: research abstracts and clinical patient notes. We generated names for each cluster using four prompting strategies (different ways of including information about the cluster in the prompt used to get LLM responses). For both datasets, the best prompting strategy beat the manual approach across all quality domains. However, name quality varied by prompting strategy and dataset. We conclude that practitioners should consider trying automated cluster naming to avoid bottlenecks or when the scale of the effort is enough to take advantage of the cost savings offered by automation, as detailed in our supplemental blueprint for using LLM cluster naming. However, to get the best performance, it is vital to test a variety of prompting strategies and perform a small test to identify which one performs best on each project's unique data.
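
The cluster-naming step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the four prompting strategies, so the two shown here (`samples` and `keywords`), the prompt wording, the stopword list, and all function names are hypothetical illustrations rather than the authors' method. The model call is passed in as a plain callable, so the sketch does not assume any particular LLM client.

```python
import re
from collections import Counter
from typing import Callable, List

# Small illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "with", "for", "on", "is", "was"}


def top_terms(texts: List[str], k: int = 5) -> List[str]:
    """Return the k most frequent non-stopword terms across the cluster's texts."""
    words = re.findall(r"[a-z]+", " ".join(texts).lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(k)]


def build_prompt(texts: List[str], strategy: str = "samples", k: int = 3) -> str:
    """Build a naming prompt; `strategy` controls how the cluster is described."""
    if strategy == "samples":
        # Hypothetical strategy 1: show the first k texts from the cluster.
        header = "Example texts from the cluster:"
        evidence = "\n".join(f"- {t}" for t in texts[:k])
    elif strategy == "keywords":
        # Hypothetical strategy 2: show only the most frequent terms.
        header = "Most frequent terms in the cluster:"
        evidence = ", ".join(top_terms(texts, k))
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return (
        "Suggest a short, descriptive name for this cluster of texts.\n"
        f"{header}\n{evidence}\nName:"
    )


def name_cluster(texts: List[str], llm: Callable[[str], str], strategy: str = "samples") -> str:
    """Send the prompt to any callable that maps a prompt string to a response string."""
    return llm(build_prompt(texts, strategy)).strip()
```

In practice `llm` would wrap a call to a hosted model. Comparing the names returned under each strategy on a small labeled sample mirrors the abstract's advice to test several prompting strategies on each project's data.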

Subject Classification

Basic and Applied Sciences > Information Science
Basic and Applied Sciences > Statistics