publications
publications by categories in reversed chronological order. generated by jekyll-scholar.
2025
- IOAgent: Democratizing Trustworthy HPC I/O Performance Diagnosis Capability via LLMsChris Egersdoerfer, Arnav Sareen, Jean Luca Bez, Suren Byna, Dongkuan DK Xu, and Dong DaiIn 2025 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Milan, Italy, 2025
As the complexity of the HPC storage stack rapidly grows, domain scientists face increasing challenges in effectively utilizing HPC storage systems to achieve their desired I/O performance. To identify and address I/O issues, scientists largely rely on I/O experts to analyze their I/O traces and provide insights into potential problems. However, with a limited number of I/O experts and the growing demand for data-intensive applications, inaccessibility has become a major bottleneck, hindering scientists from maximizing their productivity. The recent rapid progress in large language models (LLMs) opens the door to creating an automated tool that democratizes trustworthy I/O performance diagnosis capabilities to domain scientists. However, LLMs face significant challenges in this task, such as the inability to handle long context windows, a lack of accurate domain knowledge about HPC I/O, and the generation of hallucinations during complex interactions. In this work, we propose IOAgent as a systematic effort to address these challenges. IOAgent integrates various new designs, including a module-based pre-processor, a RAG-based domain knowledge integrator, and a tree-based merger to accurately diagnose I/O issues from a given Darshan trace file. Similar to an I/O expert, IOAgent provides detailed justifications and references for its diagnoses and offers an interactive interface for scientists to continue asking questions about the diagnosis. To evaluate IOAgent, we collected a diverse set of labeled job traces and released the first open diagnosis test suite, TraceBench. Based on this test suite, extensive evaluations were conducted, demonstrating that IOAgent matches or outperforms state-of-the-art I/O diagnosis tools with accurate and useful diagnosis results. We also show that IOAgent is not tied to specific LLMs, performing similarly well with both proprietary and open-source LLMs. We believe IOAgent has the potential to become a powerful tool for scientists navigating complex HPC I/O subsystems in the future.
2024
- ION: Navigating the HPC I/O Optimization Journey using Large Language ModelsChris Egersdoerfer, Arnav Sareen, Jean Luca Bez, Suren Byna, and Dong DaiIn Proceedings of the 16th ACM Workshop on Hot Topics in Storage and File Systems, Santa Clara, CA, USA, 2024
Effectively leveraging the complex software and hardware I/O stacks of HPC systems to deliver needed I/O performance has been a challenging task for domain scientists. To identify and address I/O issues in their applications, scientists largely rely on I/O experts to analyze the recorded I/O traces of their applications and provide insights into the potential issues. However, due to the limited number of I/O experts and the growing demand for data-intensive applications across the wide spectrum of sciences, inaccessibility has become a major bottleneck hindering scientists from maximizing their productivity. Inspired by the recent rapid progress of large language models (LLMs), in this work we propose IO Navigator (ION), an LLM-based framework that takes a recorded I/O trace of an application as input and leverages the in-context learning, chain-of-thought, and code generation capabilities of LLMs to comprehensively analyze the I/O trace and provide diagnosis of potential I/O issues. Similar to an I/O expert, ION provides detailed justifications for the diagnosis and an interactive interface for scientists to ask detailed questions about the diagnosis. We illustrate ION’s applicability by assessing it on a set of controlled I/O traces generated with different I/O issues. We also demonstrate that ION can match state-of-the-art I/O optimization tools and provide more insightful and adaptive diagnoses for real applications. We believe ION, with its full capabilities, has the potential to become a powerful tool for scientists to navigate through complex I/O subsystems in the future.