TY - JOUR
T1 - Evaluating Large Language Models for Enhanced Fuzzing: An Analysis Framework for LLM-Driven Seed Generation
AU - Black, Gavin
AU - Vaidyan, Varghese Mathew
AU - Comert, Gurcan
PY - 2024/1/1
Y1 - 2024/1/1
N2 - Fuzzing is a crucial technique for detecting software defects by dynamically generating and testing program inputs. This study introduces a framework designed to assess the application of Large Language Models (LLMs) to automate the generation of effective seed inputs for fuzzing, particularly in the Python programming environment where traditional approaches are less effective. Utilizing the Atheris fuzzing framework, we created over 38,000 seed inputs from LLMs targeted at 50 Python functions from widely-used libraries. Our findings underscore the critical role of LLM selection in seed effectiveness. In certain cases, seeds generated by LLMs rivaled or surpassed traditional fuzzing campaigns, with a corpus of fewer than 100 LLM-generated entries outperforming over 100,000 conventionally produced inputs. These seeds significantly improved code coverage and instruction count during fuzzing sessions, illustrating the efficacy of our framework in facilitating an automated, scalable approach to evaluating LLM effectiveness. The results, validated through linear regression analysis, demonstrate that selecting the appropriate LLM based on its training and capabilities is essential for optimizing fuzzing efficiency and facilitates the testing of future LLM versions.
KW - Fuzzing
KW - large language models
KW - machine learning
KW - Python
UR - https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85207447112&origin=inward
UR - https://www.scopus.com/inward/citedby.uri?partnerID=HzOxMe3b&scp=85207447112&origin=inward
U2 - 10.1109/ACCESS.2024.3484947
DO - 10.1109/ACCESS.2024.3484947
M3 - Article
SN - 2169-3536
VL - 12
SP - 156065
EP - 156081
JO - IEEE Access
JF - IEEE Access
ER -