An English literature graduate turned data scientist has developed a method that enables large language models (LLMs) to analyze and interpret short pieces of text. The approach targets snippets such as social media bios, customer feedback, and online posts about disaster events—content where context is often limited.
In today’s digital age, short text dominates online communication. However, such fragments pose significant challenges for analysis due to their lack of shared vocabulary or context, making it hard for AI to detect patterns or group similar content effectively.
To address this, new research leverages LLMs to organize vast datasets of short text into coherent clusters. These clusters distill millions of tweets or comments into digestible groups generated by the model.
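The study's exact pipeline is not published as code here, but the general approach it describes can be sketched as: embed each short text as a numeric vector (via an LLM embedding model), group the vectors with a standard clustering algorithm, and then ask a generative model to name each group. The sketch below is an assumption of that pipeline, using toy 2-D vectors in place of real LLM embeddings and a minimal k-means implementation:

```python
# Hypothetical sketch of the pipeline described above, NOT the study's code:
#   1. embed each short text with an LLM embedding model (toy vectors here),
#   2. cluster the embedding vectors (minimal k-means below),
#   3. hand each cluster's texts to a generative model for a readable label.
import numpy as np

def kmeans(X, k, iters=50):
    """Minimal k-means over embedding vectors X (n_samples x dim)."""
    centers = X[:k].copy()  # naive deterministic initialization
    for _ in range(iters):
        # Assign each vector to its nearest center (squared Euclidean).
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # Move each center to the mean of its assigned vectors.
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Toy stand-ins for LLM embeddings: two well-separated groups of texts.
X = np.array([[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
              [5.0, 5.1], [5.1, 4.9], [4.95, 5.05]])
labels, centers = kmeans(X, k=2)
# The first three texts land in one cluster, the last three in the other.
```

In a real setting, `X` would hold high-dimensional embeddings of tweets or bios, `k` would match the number of desired groups (10 in the study's Twitter analysis), and a generative model would supply each cluster's human-readable name.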
PhD student Justin Miller spearheaded this approach, demonstrating its effectiveness by analyzing nearly 40,000 Twitter (X) biographies from accounts discussing US President Donald Trump over two days in September 2020. Miller's model grouped these biographies into 10 distinct categories and scored each account within those categories, surfacing likely occupations, political leanings, and even patterns of emoji usage.
The study, published in Royal Society Open Science, highlights the model’s ability to produce human-intuitive clusters. According to Miller, the human-centered design ensures that the clusters align with how people naturally interpret text. For example, content about family, work, or politics was grouped in ways that felt logical and accessible.
Miller noted that generative AI, including tools like ChatGPT, could even outperform human reviewers in naming clusters, offering clearer and more consistent interpretations of patterns within the data.
As a doctoral candidate in the School of Physics and a member of the Computational Social Sciences lab, Miller envisions his tool simplifying complex datasets, aiding decision-making, and enhancing search and organization processes.
Press release – University of Sydney