
The open-source framework for rigorous large language model evaluation and safety testing.
Inspect is a state-of-the-art open-source evaluation framework developed by the UK AI Safety Institute to standardize the measurement of large language model (LLM) capabilities and safety profiles. Built on a modular Python architecture, Inspect allows researchers and AI architects to define 'Tasks' comprising three core components: Solvers (the logic driving the model), Scorers (the metrics for success), and Datasets (the evaluation samples). Its technical architecture is specifically designed to handle complex, multi-turn agentic workflows where models must use tools, interact with sandboxed environments, and solve multi-step problems.

By 2026, Inspect has transitioned from a government research tool to the industry standard for enterprise LLM validation, bridging the gap between raw model performance and production-ready safety requirements. It provides native support for virtually all major model providers, including OpenAI, Anthropic, Google, and local vLLM/Ollama deployments, ensuring a unified interface for cross-model benchmarking. The framework's high-fidelity 'Inspect Logs' enable deep forensic analysis of model reasoning paths, which is critical for compliance with emerging global AI regulations like the EU AI Act.
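As a concrete sketch of that Task anatomy, the snippet below wires a tiny in-memory dataset to a plain generate() solver and a simple includes() scorer. The sample question, task name, and the file name in the closing comment are illustrative rather than drawn from a real benchmark.

```python
# Minimal sketch of Inspect's Task anatomy: dataset + solver + scorer.
# Sample content and task name are illustrative, not a real benchmark.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.solver import generate
from inspect_ai.scorer import includes

@task
def capital_cities():
    return Task(
        dataset=[Sample(input="What is the capital of France?", target="Paris")],
        solver=generate(),   # the logic driving the model
        scorer=includes(),   # the metric for success
    )

# Typically run from the CLI, e.g. (file name is hypothetical):
#   inspect eval capitals.py --model openai/gpt-4o
```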
Executes model-generated code in isolated Docker containers to safely test agentic capabilities without risking host system integrity.
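A minimal sketch of how that sandboxing is typically declared, assuming the bash() tool, use_tools(), and the sandbox="docker" task option from the inspect_ai API; the sample prompt and target are made up for illustration.

```python
# Sketch: giving the model a shell tool whose commands execute inside a
# Docker sandbox rather than on the host. Dataset content is illustrative.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import includes
from inspect_ai.solver import generate, use_tools
from inspect_ai.tool import bash

@task
def sandboxed_shell():
    return Task(
        dataset=[Sample(
            input="Create a file named hello.txt and report its name.",
            target="hello.txt",
        )],
        solver=[use_tools(bash()), generate()],
        scorer=includes(),
        sandbox="docker",  # tool calls run in an isolated container
    )
```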
Allows for complex evaluation pipelines where one model critiques another, or where multiple scoring metrics are aggregated into a weighted safety score.
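A sketch of that composition, assuming a task can attach several scorers at once, including a model-graded one in which a separate grader model critiques the answer; the grader model name and sample content are placeholders.

```python
# Sketch: combining a model-graded scorer (one model critiquing another)
# with a simple keyword scorer on the same task.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import includes, model_graded_qa
from inspect_ai.solver import generate

@task
def graded_qa():
    return Task(
        dataset=[Sample(
            input="Explain why the sky appears blue.",
            target="Rayleigh scattering of shorter wavelengths.",
        )],
        solver=generate(),
        # Multiple scorers; aggregating them into a weighted safety score
        # is left to downstream analysis.
        scorer=[
            model_graded_qa(model="openai/gpt-4o"),  # grader model is an assumed choice
            includes(),
        ],
    )
```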
A built-in web-based GUI for deep-diving into individual evaluation trials, showing full prompt/response history and internal solver state.
A middleware-like architecture where developers can chain multiple solvers (e.g., Chain-of-Thought, Self-Correction) before reaching the scorer.
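That chaining is expressed as an ordered list of solvers; the sketch below assumes the built-in chain_of_thought(), generate(), and self_critique() solvers, with an illustrative arithmetic sample.

```python
# Sketch: chaining solvers so the model reasons step by step, answers,
# and then revises its own output before scoring.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import includes
from inspect_ai.solver import chain_of_thought, generate, self_critique, system_message

@task
def chained_reasoning():
    return Task(
        dataset=[Sample(input="Is 1013 a prime number?", target="Yes")],
        solver=[
            system_message("You are a careful mathematician."),
            chain_of_thought(),  # elicit step-by-step reasoning
            generate(),          # produce the initial answer
            self_critique(),     # critique and revise before scoring
        ],
        scorer=includes(),
    )
```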
Pre-configured tasks for evaluating common safety risks like cyber-offense, chemical/biological weapon knowledge, and persuasion.
Asynchronous execution of evaluation trials across multiple model instances to maximize throughput and minimize wall-clock time.
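A sketch of how such parallel runs are typically launched from Python, using eval() with a list of provider models and a max_connections throughput cap; the model names, task reference, and connection limit are illustrative choices.

```python
# Sketch: fanning one task out across several providers concurrently.
# Model names and the connection limit are illustrative values.
from inspect_ai import eval

logs = eval(
    "capital_cities",   # task registered via @task earlier (hypothetical name)
    model=["openai/gpt-4o", "anthropic/claude-3-5-sonnet-latest"],
    max_connections=20,  # cap on concurrent requests per model
)
```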
Logs are stored in a standardized JSON format that can be easily ingested by downstream observability platforms like Arize or LangSmith.
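A sketch of programmatic log access with the inspect_ai.log reader; the log path is hypothetical, and the export step assumes the log object is a Pydantic model that can be dumped directly to JSON for downstream ingestion.

```python
# Sketch: loading an evaluation log and exporting it for an observability
# platform. The log path is hypothetical.
from inspect_ai.log import read_eval_log

log = read_eval_log("./logs/2026-02-07_capital_cities.json")
print(log.status)           # e.g. "success"
print(log.results.scores)   # aggregate scorer results

# The log is a Pydantic model, so it can be serialized for ingestion elsewhere.
with open("eval_export.json", "w") as f:
    f.write(log.model_dump_json())
```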
Ensuring a model doesn't provide instructions for illegal activities before it is released to the public.
Refine safety filters and re-test until the failure rate is <0.1%.
Choosing the best embedding model and chunking strategy for a retrieval-augmented generation system.
Verifying that an AI agent can correctly use a database tool to answer user queries without errors.
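A sketch of that kind of check: a custom @tool standing in for a database lookup is handed to the model, and a string-match scorer verifies the returned answer. The tool name, its canned data, and the sample question are all hypothetical stand-ins.

```python
# Sketch: verifying that an agent calls a (mock) database tool correctly.
# The tool, its canned data, and the sample question are hypothetical.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import includes
from inspect_ai.solver import generate, use_tools
from inspect_ai.tool import tool

@tool
def query_orders():
    async def execute(customer_id: str) -> str:
        """Look up the number of open orders for a customer.

        Args:
            customer_id: Identifier of the customer to look up.
        """
        fake_db = {"C-1001": 3, "C-1002": 0}  # stand-in for a real database
        return f"{fake_db.get(customer_id, 0)} open orders"
    return execute

@task
def database_agent():
    return Task(
        dataset=[Sample(
            input="How many open orders does customer C-1001 have?",
            target="3",
        )],
        solver=[use_tools(query_orders()), generate()],
        scorer=includes(),
    )
```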