Automated classification of scientific publications using Large Language Models

Timeframe and Context
The project was carried out from November 2025 to January 2026 at the Institute for Applied AI and Robotics.
The goal was to support a systematic literature review on generative AI in robotics by automating the pre-screening of scientific publications. A total of 3,769 documents had to be screened, each of which would otherwise have had to be read and assessed manually.
To reduce this workload, a Large Language Model was assigned two classification tasks in order to distinguish relevant from irrelevant publications and enable an efficient pre-selection.
As the sole developer, I bore full responsibility for the project, from conceptual design to the development and evaluation of the classification solution.
Implementation and Tech Stack
The classification pipeline was developed in Python using LangChain and applied the two classification tasks to the scientific publications.
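A minimal sketch of how such a pipeline could be structured is shown here. It is not the project's actual code: the prompt template, the two task questions, and the rule that a publication is kept only if both tasks answer "yes" are assumptions made for illustration. The model call is passed in as a plain function (`ask_model`, a hypothetical name), which in the project would be backed by the LangChain and Ollama integration.

```python
from typing import Callable, List

# Hypothetical prompt template; the real prompts are not part of this report.
PROMPT_TEMPLATE = (
    "You are screening publications for a literature review.\n"
    "Question: {question}\n"
    "Title: {title}\n"
    "Abstract: {abstract}\n"
    "Answer with exactly one word: yes or no."
)

# Placeholder questions standing in for the two classification tasks (assumed).
TASK_QUESTIONS = [
    "Is the publication about robotics?",
    "Does the publication use generative AI?",
]


def parse_answer(raw: str) -> bool:
    """Map a free-text reply to a boolean; only replies starting with 'yes' count as True."""
    return raw.strip().lower().startswith("yes")


def classify(title: str, abstract: str, ask_model: Callable[[str], str]) -> List[bool]:
    """Ask the model each task question and return one boolean per task."""
    results = []
    for question in TASK_QUESTIONS:
        prompt = PROMPT_TEMPLATE.format(question=question, title=title, abstract=abstract)
        results.append(parse_answer(ask_model(prompt)))
    return results


def is_relevant(title: str, abstract: str, ask_model: Callable[[str], str]) -> bool:
    """Assumed combination rule: keep a publication only if every task answers yes."""
    return all(classify(title, abstract, ask_model))
```

Passing the model in as a function keeps the sketch independent of any specific client library and lets the parsing logic be tested with a stub. Treating every reply that does not start with "yes" as a rejection is one possible choice; a recall-oriented setup might instead forward unclear replies to manual review.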
A locally hosted Large Language Model, integrated via Ollama, served as the foundation, which kept data processing local and privacy-friendly. To ensure classification quality, the prompts were systematically evaluated using accuracy, precision, and recall, and iteratively improved.
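The three metrics can be computed from a manually labeled sample and the model's predictions for it. The following is a minimal sketch using the standard definitions; the function names and the labeled sample are assumptions, since the evaluation code itself is not part of this report.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[bool], y_pred: Sequence[bool]) -> Tuple[int, int, int, int]:
    """Count true positives, false positives, true negatives and false negatives."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred and truth:
            tp += 1
        elif pred and not truth:
            fp += 1
        elif not pred and not truth:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def evaluate(y_true: Sequence[bool], y_pred: Sequence[bool]) -> Dict[str, float]:
    """Return accuracy, precision and recall; a metric with a zero denominator is reported as 0.0."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }
```

Here `y_true` holds the manual labels (True for relevant) and `y_pred` the model's decisions for the same publications, so each prompt revision can be compared on the same sample.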
Challenges and Results
Ensuring high classification quality was the key challenge, particularly balancing precision and recall: a stricter prompt removes more irrelevant publications but risks discarding relevant ones. The prompts were therefore refined iteratively and re-evaluated against the metrics until the results were reliable.
The defined thresholds for both classification tasks were exceeded (precision > 0.9, recall > 0.95, accuracy > 0.95). As a result, the number of publications requiring manual review was reduced by more than half: of the original 3,769 publications, 2,113 (about 56%) were automatically filtered out, leaving 1,656 documents for manual screening.
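The threshold check and the reduction in manual workload can be expressed in a few lines. This is an illustrative sketch: the thresholds and document counts come from the figures above, while the function names are assumptions and no measured metric values are reproduced here.

```python
from typing import Dict, Tuple

# Thresholds as stated in the report; each must be strictly exceeded.
THRESHOLDS = {"precision": 0.9, "recall": 0.95, "accuracy": 0.95}


def exceeds_thresholds(metrics: Dict[str, float], thresholds: Dict[str, float] = THRESHOLDS) -> bool:
    """Return True only if every metric is strictly above its threshold."""
    return all(metrics[name] > limit for name, limit in thresholds.items())


def screening_reduction(total: int, remaining: int) -> Tuple[int, float]:
    """Return the number of filtered-out documents and their share of the total."""
    filtered = total - remaining
    return filtered, filtered / total


filtered, share = screening_reduction(total=3769, remaining=1656)
print(f"{filtered} publications filtered out ({share:.1%})")
# prints "2113 publications filtered out (56.1%)"
```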