Concolic, symbolic, formal - what do we need to test AI services?

By Andrey Shtanov - anshtanov@edu.hse.ru

Fig. 1. “Software designer tests prototype robots on a futuristic factory”, made with Bing Image Creator [10]

The integration of Artificial Intelligence (AI) services across diverse domains, including healthcare, finance, autonomous systems, and entertainment, has become a pivotal aspect of contemporary technology. However, as these services are increasingly used in critical applications, ensuring their reliability and robustness has become a top priority. Furthermore, AI services introduce new challenges to educational processes [1] and raise various ethical considerations [2][3]. Unlike traditional software, AI systems present distinctive testing challenges due to their inherent complexity, dynamic behavior, and vast, often non-deterministic, input spaces. These complexities demand specialized testing methodologies that can effectively identify latent vulnerabilities and ensure the reliability of AI systems. In this context, three prominent testing approaches — symbolic execution, concolic testing, and formal methods — offer distinct advantages for validating AI services. Symbolic execution systematically explores potential execution paths, concolic testing combines symbolic and concrete execution for efficient test-case generation, and formal methods provide mathematically rigorous means to specify and verify system behavior.

This essay examines the efficacy of these methods in the context of AI service testing. While each approach has its strengths and limitations, the argument is made that a hybrid testing strategy leveraging the complementary advantages of symbolic, concolic, and formal methods is essential for comprehensive validation. By integrating these methodologies, we can address the multifaceted challenges of testing AI systems and ensure their safe and reliable operation in real-world scenarios.

To ensure the reliability of AI services, robust testing methodologies capable of addressing the inherent complexity and dynamic behaviors of such systems are required. Three prominent approaches — Symbolic Execution, Concolic Testing, and Formal Methods — offer distinct strategies for achieving this goal.

Symbolic Execution involves analyzing programs by treating inputs as symbolic values rather than concrete data. This technique systematically explores program paths by evaluating expressions symbolically, enabling the detection of errors across various execution scenarios. However, its application to AI systems is challenged by the vast state spaces and complex decision logic inherent in such systems, which can lead to state space explosion and increased computational demands. For instance, KLEE demonstrates high coverage in testing traditional software but faces scalability issues when applied to large-scale AI models (Cadar et al., 2008). The computational intensity of symbolic execution highlights its limitations when confronted with the expansive input spaces typical of AI services.
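To make the idea concrete, the following self-contained Python sketch enumerates the paths of a toy three-path program. The constraint lists and the brute-force "solver" are illustrative stand-ins for what a tool like KLEE derives automatically with a real SMT solver; none of the names below come from an actual tool.

```python
# A minimal sketch of symbolic execution in plain Python: path constraints
# are collected as predicates over a symbolic input x, and a brute-force
# search over a small integer range stands in for an SMT solver.

def explore_paths():
    """Enumerate every path of a toy program:

        def program(x):
            if x > 10:
                if x % 2 == 0: return "even-large"
                return "odd-large"
            return "small"
    """
    # Each path is the conjunction of the branch conditions along it.
    paths = [
        [lambda x: x > 10, lambda x: x % 2 == 0],   # -> "even-large"
        [lambda x: x > 10, lambda x: x % 2 != 0],   # -> "odd-large"
        [lambda x: x <= 10],                        # -> "small"
    ]
    inputs = []
    for constraints in paths:
        # Stand-in for a constraint solver: find any concrete witness.
        witness = next(x for x in range(-50, 50)
                       if all(c(x) for c in constraints))
        inputs.append(witness)
    return inputs

print(explore_paths())  # one concrete test input per path -> [12, 11, -50]
```

The state-space-explosion problem is already visible in the sketch: the number of paths, and hence of constraint sets to solve, grows exponentially with the number of sequential branches.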

Concolic Testing, a hybrid of concrete and symbolic execution, mitigates the limitations of pure symbolic execution by alternating between concrete program execution and symbolic analysis. This approach effectively generates test cases that achieve high code coverage and uncover edge cases. In the context of deep neural networks (DNNs), the DeepConcolic tool has demonstrated significant advancements, achieving high neuron coverage and identifying adversarial examples (Sun et al., 2018). The tool’s iterative method of combining concrete evaluation and symbolic analysis ensures robust test input generation, even for high-dimensional input spaces such as those used in image classification tasks. This dual capability makes concolic testing a particularly versatile method, capable of addressing the challenges posed by AI systems’ complexity and dynamic nature [4][5].
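The alternation between concrete runs and constraint negation can be illustrated with a toy loop in plain Python. Everything here is a hand-rolled sketch, not DeepConcolic’s actual algorithm: the program under test records its own path condition, and a brute-force search over small integers stands in for a real constraint solver.

```python
def program(x):
    """Toy program under test; records its path condition as it runs."""
    trace = []                                 # (predicate, branch taken?) pairs
    c1 = x * 2 > 20
    trace.append((lambda v: v * 2 > 20, c1))
    if c1:
        c2 = x < 100
        trace.append((lambda v: v < 100, c2))
        return ("deep" if c2 else "huge"), trace
    return "shallow", trace

def concolic(seed):
    """Concolic loop: run concretely, negate one branch, solve, repeat."""
    seen, worklist = {}, [seed]
    while worklist:
        x = worklist.pop()
        result, trace = program(x)
        seen.setdefault(result, x)             # first input seen per outcome
        for i in range(len(trace)):
            # Keep the path prefix and flip the i-th branch decision...
            want = [taken for _, taken in trace[:i]] + [not trace[i][1]]
            preds = [p for p, _ in trace[:i + 1]]
            # ...then find a concrete witness (stand-in for an SMT solver).
            cand = next((v for v in range(200)
                         if all(p(v) == w for p, w in zip(preds, want))), None)
            if cand is not None and program(cand)[0] not in seen:
                worklist.append(cand)
    return seen

print(concolic(0))   # starting from x = 0, all three paths are discovered
```

Starting from the uninteresting seed 0, the loop steers itself into the nested "deep" branch that random testing might miss, which is exactly the behaviour that makes concolic testing effective for edge-case discovery.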

Formal Methods employ mathematical models to specify and verify system behaviors, offering a rigorous framework for ensuring correctness. They are particularly effective in scenarios requiring strong guarantees, such as safety-critical systems. Techniques like Reluplex have been developed to verify properties of DNNs by leveraging satisfiability modulo theories (SMT) solvers (Katz et al., 2017). However, formal methods often struggle with scalability due to the computational intensity of verifying complex AI models. Despite this, their precision in specifying and verifying behaviors makes them invaluable in domains where errors can have significant consequences [6].
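The flavour of such verification can be shown with interval bound propagation over a tiny ReLU network: by propagating sound lower and upper bounds layer by layer, one proves an output bound for every input in a region, which no finite amount of testing can do. The 2-2-1 network and its weights below are invented for illustration; Reluplex itself instead encodes the exact network semantics for an SMT solver.

```python
# A minimal flavour of formal verification for a ReLU network: interval
# bound propagation proves (soundly, by over-approximation) that the output
# stays below a threshold for ALL inputs in a box, not just sampled points.
# The network weights are made up purely for illustration.

def interval_affine(lo, hi, weights, bias):
    """Propagate a per-dimension input interval [lo, hi] through w.x + b."""
    out_lo = bias + sum(w * (lo[i] if w >= 0 else hi[i])
                        for i, w in enumerate(weights))
    out_hi = bias + sum(w * (hi[i] if w >= 0 else lo[i])
                        for i, w in enumerate(weights))
    return out_lo, out_hi

def verify_output_bound(x_lo, x_hi, layer1, layer2, threshold):
    """Prove y <= threshold for every input in the box [x_lo, x_hi]."""
    h_lo, h_hi = [], []
    for weights, bias in layer1:               # hidden layer: affine + ReLU
        lo, hi = interval_affine(x_lo, x_hi, weights, bias)
        h_lo.append(max(0.0, lo))              # ReLU is monotone, so
        h_hi.append(max(0.0, hi))              # clamping both bounds is sound
    w_out, b_out = layer2                      # output neuron: affine only
    _, y_hi = interval_affine(h_lo, h_hi, w_out, b_out)
    return y_hi <= threshold                   # True is a proof; False means "unknown"

# Tiny 2-2-1 network with hand-picked weights (purely illustrative).
layer1 = [([1.0, -1.0], 0.0), ([0.5, 0.5], -1.0)]
layer2 = ([1.0, 1.0], 0.0)
print(verify_output_bound([0.0, 0.0], [1.0, 1.0], layer1, layer2,
                          threshold=2.0))      # -> True
```

Note the asymmetry typical of formal methods: a `True` result is a genuine proof over the whole input region, while a `False` result only means the over-approximation was too coarse, not that the property fails.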

Each of these testing methodologies presents unique strengths and limitations when applied to AI services. Symbolic Execution excels in comprehensive program path exploration but struggles with scalability in complex AI systems. Concolic Testing balances concrete and symbolic analysis, making it effective for high-dimensional inputs and identifying edge cases, but may require significant computational resources for large-scale AI systems. Formal Methods provide unparalleled precision in verifying system behavior but are often impractical for large-scale AI applications due to their computational demands.

Given these considerations, a hybrid approach integrating Symbolic Execution, Concolic Testing, and Formal Methods appears most promising. By leveraging the complementary strengths of these methodologies, we can address their individual limitations and ensure the reliability and robustness of AI services. For example, combining DeepConcolic’s coverage capabilities with the precision of formal verification tools like Reluplex could enhance testing frameworks and promote trust in AI systems across various applications. Such an integrated approach allows for a nuanced understanding and validation of AI systems, ensuring robustness across diverse scenarios.

The findings from the analysis of testing methods reveal critical insights into the multifaceted requirements for validating AI services. Symbolic Execution, Concolic Testing, and Formal Methods each address specific aspects of the challenges inherent in AI testing. However, their individual limitations underscore the necessity of a comprehensive approach. Symbolic Execution’s ability to explore program paths systematically offers unparalleled depth in identifying potential vulnerabilities. However, its limited scalability makes it less practical for large-scale AI models. Concolic Testing bridges this gap by balancing symbolic and concrete testing, providing broader coverage while maintaining practical efficiency. Its ability to uncover edge cases by alternately executing symbolic and concrete test cases makes it invaluable for high-dimensional AI systems. Meanwhile, Formal Methods contribute rigorous verification capabilities, ensuring adherence to predefined specifications, particularly in safety-critical applications. Yet, their computational overhead restricts their applicability to smaller models.

Integrating these methods presents a viable path forward. By leveraging the exploratory power of Symbolic Execution, the efficiency and coverage of Concolic Testing, and the precision of Formal Methods, a hybrid testing framework can comprehensively address the diverse challenges posed by AI systems. For instance, such an approach could ensure that both high-level specifications and low-level operational paths are thoroughly validated, enhancing trust in AI deployments. Additionally, this integration fosters adaptability, enabling the framework to evolve alongside advancements in AI technologies.

The implications of these findings extend beyond individual methodologies. They highlight the evolving needs of AI testing frameworks, suggesting that future research should prioritize the development of scalable hybrid approaches. Tools like DeepConcolic demonstrate the feasibility of combining techniques to achieve high coverage and robustness. Expanding this paradigm could lead to more adaptive and reliable AI validation practices, ultimately fostering greater confidence in AI applications across industries. Furthermore, collaborative efforts to refine these methods and create standardized frameworks can enhance their accessibility and applicability, ensuring that AI systems remain dependable in various contexts.

This essay has explored the essential role of Symbolic Execution, Concolic Testing, and Formal Methods in testing AI services, emphasizing their individual strengths and limitations. While Symbolic Execution provides comprehensive path exploration, Concolic Testing offers balanced efficiency and coverage, and Formal Methods deliver rigorous precision, each faces unique challenges in scalability or applicability.

The analysis underscores the necessity of a hybrid testing approach that integrates these methodologies to address their respective limitations. Such an approach can provide a comprehensive solution for validating AI systems, ensuring their reliability and robustness in increasingly complex applications. By combining these methods, we can achieve a holistic perspective that caters to the multifaceted nature of AI systems, bridging gaps and optimizing performance.

Future research should focus on enhancing the scalability of these methods and developing integrated frameworks that combine their strengths. Additionally, efforts should be directed towards creating user-friendly tools that allow developers and testers to apply these methodologies seamlessly. By doing so, we can create more adaptive and effective testing solutions that meet the evolving demands of AI technologies, ultimately fostering greater trust and confidence in their deployment. As AI continues to advance, these testing paradigms must evolve in tandem, ensuring that innovation is matched by reliability.

  1. UNESCO, Use of AI in education: Deciding on the future we want, https://www.unesco.org/en/articles/use-ai-education-deciding-future-we-want
  2. Random Trees, Ethical Considerations in Generative AI, https://randomtrees.medium.com/ethical-considerations-in-generative-ai-112004eef101
  3. Objectoriented.ru, Essay on the topic of ethical considerations in AI development, https://objectoriented.ru/arch:2024:ethical-considerations-in-requirements-engineering
  4. Youcheng Sun, Min Wu, Wenjie Ruan, Xiaowei Huang, Marta Kwiatkowska, and Daniel Kroening. 2018. Concolic Testing for Deep Neural Networks. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE ’18), September 3–7, 2018, Montpellier, France. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3238147.3238172
  5. Wikipedia.org, Concolic Testing, https://en.wikipedia.org/wiki/Concolic_testing
  6. Patrickalphac.medium.com, Formal Verification & Symbolic Execution | The Security Silver Bullet?, https://patrickalphac.medium.com/formal-verification-symbolic-execution-the-security-silver-bullet-38e0ac9072eb
  7. ChatGPT-4o Canvas tool, the basic structure and ideas for various theses were proposed by ChatGPT-4o with this prompt: “I need to write an essay for my Advanced Software Design course in the academic style. The topic is “Concolic, symbolic, formal - what do we need to test AI services?”. Help me by creating a small starting point consisting of possible thesises and instructions for each structure point: Introduction, Analysis (consisting of Literature Review, Methodology, Results/Findings), Discussion and Conclusion”, https://chatgpt.com
  8. ChatGPT-4o, some parts were additionally reviewed by ChatGPT 4o with this prompt: “Proofread my academic essay for any logical errors and suggest overall improvements: <text>”. Some corrections were accepted, https://chatgpt.com
  9. DeepL’s Writing tool, some parts were additionally reviewed by DeepL’s Writing tool for grammar errors and style misuse. Some corrections were accepted, https://www.deepl.com/ru/write
  10. Bing Image Creator, fig. 1 was created with the following prompt: software designer tests prototype robots on a futuristic factory. 4K, Ultra-realistic, depth of field, https://www.bing.com/images/create/