Abstract
Security-focused program testing typically concentrates on crash detection and code coverage while overlooking additional system behaviors that can affect program confidentiality and availability. To address this gap, we propose a statistical framework that combines embedding-based anomaly detection, resource usage metrics, and resource-state distance measures to systematically profile software behaviors beyond traditional coverage-based methods. Leveraging over 5 million labeled samples from 50 Python programs, we evaluate how these independent scoring terms distinguish among different sources of input, including Large Language Model (LLM)-generated inputs, and demonstrate how standard statistical tests (e.g., Kolmogorov-Smirnov and Kendall’s τ) confirm their effectiveness. Our findings show that LLM-generated samples can trigger diverse behaviors but are often less effective than conventional fuzzing at exploring resource usage dynamics (CPU, memory). However, combining LLM outputs with existing techniques broadens behavior coverage and reveals commonalities among commercial LLM outputs. We provide open-source tools for this evaluation framework, demonstrating the potential to refine software testing by integrating behavior metrics into security-testing workflows.
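As a rough illustration of the two statistical tests named in the abstract, the sketch below computes a two-sample Kolmogorov-Smirnov statistic and Kendall's τ in plain Python. It is not the authors' released tooling: the function names, the tie-free τ-a variant, and the sample values (peak memory per input source, per-program scores) are all hypothetical and chosen only to show how such comparisons could be set up.

```python
from bisect import bisect_right
from itertools import combinations


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    # The gap between step functions can only change at observed values.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def kendall_tau(x, y):
    """Kendall's tau-a (no tie correction) between two equal-length sequences."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


if __name__ == "__main__":
    # Hypothetical peak-memory readings (MiB) for inputs from two sources.
    fuzzer_mem = [12.1, 15.4, 30.2, 44.8, 51.0, 63.7]
    llm_mem = [11.9, 12.3, 12.8, 13.5, 14.1, 16.0]
    print("KS statistic:", ks_statistic(fuzzer_mem, llm_mem))

    # Hypothetical per-program scores from two scoring terms.
    anomaly_score = [0.10, 0.35, 0.40, 0.72, 0.90]
    resource_score = [0.05, 0.30, 0.55, 0.60, 0.95]
    print("Kendall tau:", kendall_tau(anomaly_score, resource_score))
```

A KS statistic near 1 would suggest the two input sources drive clearly different resource-usage distributions, while a τ near 1 would suggest two scoring terms rank programs similarly. A real analysis would also need p-values and tie handling, which this sketch omits.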
| Original language | English |
|---|---|
| Pages (from-to) | 87928-87940 |
| Number of pages | 13 |
| Journal | IEEE Access |
| Volume | 13 |
| State | Published - Jan 1 2025 |
Keywords
- Software profiling
- anomaly detection
- fuzzing techniques
- large language models
- program behavior analysis
- resource usage metrics
Fingerprint
Dive into the research topics of 'From LLMs to Randomness: Analyzing Program Input Efficacy With Resource and Language Metrics'. Together they form a unique fingerprint.