
Metrics Overview

Assay ships with 18 evaluation metrics across six categories. Every metric returns a score between 0 and 1 and a boolean `passed` flag computed against a configurable threshold (default: 0.5).
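The score-to-pass relationship is a plain threshold check. As a minimal sketch (the `toResult` helper and `MetricResult` shape are illustrative, not part of the Assay API):

```typescript
// Illustrative sketch of how a score and threshold produce a pass/fail result.
// `MetricResult` and `toResult` are hypothetical names, not Assay exports.
interface MetricResult {
  score: number;   // always in [0, 1]
  passed: boolean; // true when score >= threshold
}

function toResult(score: number, threshold = 0.5): MetricResult {
  return { score, passed: score >= threshold };
}
```

So a score of 0.7 passes at the default threshold, but fails if the threshold is raised to 0.8.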

All metrics

| Metric | Category | Required Fields | LLM Required |
| --- | --- | --- | --- |
| AnswerRelevancyMetric | RAG | input, actualOutput | Yes |
| FaithfulnessMetric | RAG | input, actualOutput, context | Yes |
| HallucinationMetric | RAG | input, actualOutput, context | Yes |
| ContextualPrecisionMetric | RAG | input, actualOutput, expectedOutput, context | Yes |
| ContextualRecallMetric | RAG | input, actualOutput, expectedOutput, context | Yes |
| ContextualRelevancyMetric | RAG | input, actualOutput, context | Yes |
| ToolCorrectnessMetric | Agentic | input, toolsCalled, expectedTools | No |
| TaskCompletionMetric | Agentic | input, actualOutput | Yes |
| GoalAccuracyMetric | Agentic | input, actualOutput | Yes |
| KnowledgeRetentionMetric | Conversational | ConversationalTestCase | Yes |
| ConversationCompletenessMetric | Conversational | ConversationalTestCase | Yes |
| RoleAdherenceMetric | Conversational | ConversationalTestCase | Yes |
| BiasMetric | Safety | input, actualOutput | Yes |
| ToxicityMetric | Safety | input, actualOutput | Yes |
| GEval | Custom | input, actualOutput (+ custom) | Yes |
| SummarizationMetric | Custom | input, actualOutput | Yes |
| ExactMatchMetric | Non-LLM | actualOutput, expectedOutput | No |
| JsonCorrectnessMetric | Non-LLM | actualOutput, expectedOutput | No |

Usage pattern

All metrics follow the same pattern:

```typescript
import { evaluate, FaithfulnessMetric } from "@assay-ai/core";

const results = await evaluate({
  testCases: [
    {
      input: "What is TypeScript?",
      actualOutput: "TypeScript is a typed superset of JavaScript.",
      context: ["TypeScript is a typed superset of JavaScript by Microsoft."],
    },
  ],
  metrics: [new FaithfulnessMetric({ threshold: 0.7 })],
});
```

Categories

  • RAG Metrics -- Evaluate retrieval-augmented generation pipelines for faithfulness, relevancy, and hallucination.
  • Agentic Metrics -- Evaluate AI agent tool usage, task completion, and goal accuracy.
  • Conversational Metrics -- Evaluate multi-turn conversations for knowledge retention, completeness, and role adherence.
  • Safety Metrics -- Detect bias and toxicity in LLM outputs.
  • Custom Metrics -- Define your own evaluation criteria with GEval or extend BaseMetric.
  • Non-LLM Metrics -- Deterministic metrics that do not require an LLM API key.

Released under the MIT License.