Towards Enhancing Data Science Agents with Semantics

Sayed Hoseini (1), Maximilian Ibbels (1), Maximilian Knoll (1) and Christoph Quix (1, 2)

  1. Hochschule Niederrhein University of Applied Sciences, Krefeld, Germany
    sayed.hoseini@hsnr.de
  2. Fraunhofer Institute for Applied Information Technology FIT, St. Augustin, Germany
    christoph.quix@fit.fraunhofer.de

Abstract

Data lakes, initially designed for storing heterogeneous datasets, have recently been extended with ML capabilities to unify data science tasks within a single platform. However, they still lack essential ML-specific features, limiting their effectiveness for end-to-end automation. Automated Machine Learning (AutoML) and Large Language Models (LLMs) offer potential solutions by streamlining various stages of the ML pipeline, yet both have significant limitations. This paper presents an integration of AutoML frameworks and LLMs within a data lake system. We introduce a metadata model to capture data analytics processes, a Python package wrapping existing AutoML libraries, and a module utilizing LLMs to automate ML tasks. A comparative evaluation indicates that AutoML simplifies pipeline creation but limits user control and lacks robust data preprocessing support. LLMs can automate individual tasks, such as code generation, but struggle to orchestrate complete workflows effectively. Both approaches risk producing only basic prototypes that still require manual refinement. The primary challenge lies in managing task interdependencies within ML pipelines. Retrieval-augmented generation enables dynamic access to external information but may overlook structured data relationships, leading to incomplete or redundant results. Therefore, we propose an extended vision that integrates multi-agent frameworks for data science with knowledge graphs that capture historical experience from previous ML experiments. We present preliminary results on developing comprehensive, context-aware ML agents and integrating them into our data lake system SEDAR.

Key words

AutoML, LLMs, Semantic Data Lake, MLOps, Data Science Agents

Digital Object Identifier (DOI)

https://doi.org/10.2298/CSIS250320078H

Publication information

Volume 23, Issue 1 (January 2026)
Year of Publication: 2026
ISSN: 2406-1018 (Online)
Publisher: ComSIS Consortium

How to cite

Hoseini, S., Ibbels, M., Knoll, M., Quix, C.: Towards Enhancing Data Science Agents with Semantics. Computer Science and Information Systems, Vol. 23, No. 1, 419-441. (2026), https://doi.org/10.2298/CSIS250320078H