# Metrics Overview
Assay ships with 18 evaluation metrics across six categories. Every metric returns a `score` between 0 and 1 and a boolean `passed` flag derived from a configurable `threshold` (default: `0.5`).
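The score-to-verdict mapping can be sketched as a pure function. This is an illustration only: `MetricResult` and `toVerdict` are hypothetical names, not part of the Assay API, and the inclusive `>=` comparison at the threshold is an assumption.

```typescript
// Illustrative sketch (not Assay internals): a raw score in [0, 1] and a
// configurable threshold combine into a pass/fail verdict.
interface MetricResult {
  score: number;   // always in [0, 1]
  passed: boolean; // assumed: score >= threshold
}

function toVerdict(score: number, threshold: number = 0.5): MetricResult {
  if (score < 0 || score > 1) {
    throw new RangeError(`score must be in [0, 1], got ${score}`);
  }
  return { score, passed: score >= threshold };
}
```

For example, `toVerdict(0.72, 0.7).passed` is `true`, while `toVerdict(0.72, 0.8).passed` is `false`.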
## All metrics
| Metric | Category | Required Fields | LLM Required |
|---|---|---|---|
| AnswerRelevancyMetric | RAG | input, actualOutput | Yes |
| FaithfulnessMetric | RAG | input, actualOutput, context | Yes |
| HallucinationMetric | RAG | input, actualOutput, context | Yes |
| ContextualPrecisionMetric | RAG | input, actualOutput, expectedOutput, context | Yes |
| ContextualRecallMetric | RAG | input, actualOutput, expectedOutput, context | Yes |
| ContextualRelevancyMetric | RAG | input, actualOutput, context | Yes |
| ToolCorrectnessMetric | Agentic | input, toolsCalled, expectedTools | No |
| TaskCompletionMetric | Agentic | input, actualOutput | Yes |
| GoalAccuracyMetric | Agentic | input, actualOutput | Yes |
| KnowledgeRetentionMetric | Conversational | ConversationalTestCase | Yes |
| ConversationCompletenessMetric | Conversational | ConversationalTestCase | Yes |
| RoleAdherenceMetric | Conversational | ConversationalTestCase | Yes |
| BiasMetric | Safety | input, actualOutput | Yes |
| ToxicityMetric | Safety | input, actualOutput | Yes |
| GEval | Custom | input, actualOutput (+ custom) | Yes |
| SummarizationMetric | Custom | input, actualOutput | Yes |
| ExactMatchMetric | Non-LLM | actualOutput, expectedOutput | No |
| JsonCorrectnessMetric | Non-LLM | actualOutput, expectedOutput | No |
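To illustrate why the Non-LLM rows need no API key, a deterministic metric reduces to plain comparison — no model call involved. The snippet below is a sketch of the idea behind an exact-match check, not Assay's actual `ExactMatchMetric` implementation; the `TestCase` shape is narrowed to the two required fields from the table.

```typescript
// Hypothetical sketch of a deterministic exact-match score: a binary result
// computed by string comparison alone, so no LLM API key is required.
interface TestCase {
  actualOutput: string;
  expectedOutput: string;
}

function exactMatchScore(tc: TestCase): number {
  // 1 on an exact match, 0 otherwise.
  return tc.actualOutput === tc.expectedOutput ? 1 : 0;
}
```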
## Usage pattern
All metrics follow the same pattern:
```typescript
import { evaluate, FaithfulnessMetric } from "@assay-ai/core";

const results = await evaluate({
  testCases: [
    {
      input: "What is TypeScript?",
      actualOutput: "TypeScript is a typed superset of JavaScript.",
      context: ["TypeScript is a typed superset of JavaScript by Microsoft."],
    },
  ],
  metrics: [new FaithfulnessMetric({ threshold: 0.7 })],
});
```

## Categories
- RAG Metrics -- Evaluate retrieval-augmented generation pipelines for faithfulness, relevancy, and hallucination.
- Agentic Metrics -- Evaluate AI agent tool usage, task completion, and goal accuracy.
- Conversational Metrics -- Evaluate multi-turn conversations for knowledge retention, completeness, and role adherence.
- Safety Metrics -- Detect bias and toxicity in LLM outputs.
- Custom Metrics -- Define your own evaluation criteria with GEval or extend BaseMetric.
- Non-LLM Metrics -- Deterministic metrics that do not require an LLM API key.
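As a final sketch of the Non-LLM category, JSON correctness can be approximated by parsing both strings and comparing the resulting values structurally, ignoring key order. This is an illustration only; Assay's `JsonCorrectnessMetric` may differ in details such as whitespace, number, or schema handling.

```typescript
// Sketch (hypothetical, not Assay internals): two JSON strings count as equal
// if they parse to structurally identical values, regardless of key order.
function jsonEqual(actual: string, expected: string): boolean {
  let a: unknown, b: unknown;
  try {
    a = JSON.parse(actual);
    b = JSON.parse(expected);
  } catch {
    return false; // unparseable output cannot be correct
  }
  return deepEqual(a, b);
}

function deepEqual(a: unknown, b: unknown): boolean {
  if (a === b) return true; // covers primitives and identical references
  if (typeof a !== "object" || typeof b !== "object" || a === null || b === null) {
    return false;
  }
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const ka = Object.keys(a as object).sort();
  const kb = Object.keys(b as object).sort();
  if (ka.length !== kb.length || ka.some((k, i) => k !== kb[i])) return false;
  return ka.every((k) =>
    deepEqual((a as Record<string, unknown>)[k], (b as Record<string, unknown>)[k])
  );
}
```

Note that for arrays, comparing sorted index keys element-wise preserves order sensitivity, since both arrays expose the same numeric keys.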