Roark ships with a comprehensive set of system metrics that are automatically available in every project. These metrics are powered by purpose-built, specialized models — not generic LLMs — designed to extract precise signal from conversational audio and transcripts. System metrics require no configuration. Add them to an analysis package, attach a metric policy, and start collecting data immediately.
All system metrics listed below are powered by specialized models purpose-built for voice AI analysis. This means they are faster, more consistent, and more cost-effective than general-purpose LLM evaluation.

Metric Types Overview

Roark supports four ways to define metrics. System metrics use the first type, and you can create your own using any of the four:
| Type | How it works | Use case |
| --- | --- | --- |
| System (Specialized Models) | Purpose-built models analyze audio and transcript signals | Performance, interruptions, sentiment, compliance, call quality |
| LLM as Judge | An LLM evaluates the conversation against a natural-language prompt you write | Custom business logic, subjective quality checks, task completion |
| Pattern | Regex or keyword matching against transcript text | Detecting specific phrases, prohibited words, required disclosures |
| Formula | Combine existing metrics using boolean logic and weighted expressions | Composite scores, pass/fail rules based on multiple metrics |

LLM as Judge

Define a metric with a natural-language prompt. Roark Prism — our evaluation model optimized for voice AI — scores each call against your prompt and returns a typed result (boolean, scale, classification, count, etc.).
"Did the agent verify the caller's identity before proceeding?"  →  Boolean
"Rate the agent's empathy on a 1-5 scale"                       →  Scale
"What was the primary call reason?"                              →  Classification
Create LLM as Judge metrics in the dashboard or via the SDK.

Pattern Detection

Match specific patterns in the transcript using keywords or regex. Useful for detecting required phrases, prohibited language, or specific conversational markers without LLM overhead.
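As a rough illustration of what a pattern metric checks (this is not Roark's internal implementation — the transcript format and metric names here are invented for the sketch):

```python
import re

# Toy transcript; real transcripts come from your call data.
TRANSCRIPT = (
    "Agent: This call may be recorded for quality purposes. "
    "Agent: I can guarantee you a full refund today."
)

# Hypothetical pattern metrics: one required phrase, one prohibited word.
PATTERNS = {
    "recording_disclosure": re.compile(r"\bcall may be recorded\b", re.IGNORECASE),
    "unauthorized_guarantee": re.compile(r"\bguarantee\b", re.IGNORECASE),
}

def evaluate_patterns(transcript: str) -> dict:
    """Return a boolean result per pattern metric."""
    return {name: bool(p.search(transcript)) for name, p in PATTERNS.items()}

print(evaluate_patterns(TRANSCRIPT))
# {'recording_disclosure': True, 'unauthorized_guarantee': True}
```

Because pattern metrics are plain text matching, they run without any model inference, which is what makes them cheap for required-phrase and prohibited-language checks.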

Formula Metrics

Combine multiple metrics into a single composite score using boolean logic and weighted expressions. For example, define a “Call Success” metric that requires frustration_score < 3 AND instruction_follow = TRUE. Formula metrics let you build layered quality gates from your existing metrics without writing any code. Learn more about creating metrics in Custom Metrics.
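The “Call Success” rule above can be sketched in a few lines, assuming metric values arrive as a plain dictionary (in Roark you would build this in the dashboard rather than in code):

```python
def call_success(metrics: dict) -> bool:
    """Formula metric sketch: frustration_score < 3 AND instruction_follow = TRUE."""
    return metrics["frustration_score"] < 3 and bool(metrics["instruction_follow"])

print(call_success({"frustration_score": 2, "instruction_follow": True}))   # True
print(call_success({"frustration_score": 4, "instruction_follow": True}))   # False
```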

System Metrics Reference

All system metrics below are collected automatically when included in an analysis package. Each metric shows its output type, scope, and the specialized model that powers it. Scope legend:
  • Global — one value per call
  • Per-participant — separate values for agent and customer

Core Analysis

Timing and interaction metrics extracted from audio diarization and transcript alignment.
Powered by Roark Vibe — our core voice analysis model.
| Metric | Description | Output | Scope |
| --- | --- | --- | --- |
| call_duration | Total duration of the call | Numeric (seconds) | Global |
| response_time | Time between speaking turns | Numeric (seconds) | Per-participant |
| time_to_first_word | Time from call start to first spoken word | Numeric (seconds) | Per-participant |
| silence_duration | Duration of each silence period | Numeric (seconds) | Per-participant |
| turn_duration | Duration of each speaking turn | Numeric (seconds) | Per-participant |
| word_count | Number of words spoken | Count | Per-participant |
| talk_to_listen_ratio | Ratio of time a participant spends talking vs total call duration | Numeric | Per-participant |
| speaking_rate | Words spoken per minute by a participant | Numeric (wpm) | Per-participant |
| turn_count | Number of speaking turns by a participant | Count | Per-participant |
| latency_spike_count | Number of response gaps exceeding 3 seconds | Count | Per-participant |
| longest_pause | Longest gap between consecutive segments in the call | Numeric (seconds) | Global |
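Several of the Core Analysis metrics are simple functions of the diarized turn timeline. A minimal sketch — the turn schema here is an assumption for illustration, not Roark's actual data model:

```python
# Diarized turns: speaker, start/end offsets in seconds, and word counts (assumed schema).
turns = [
    {"speaker": "agent",    "start": 0.0,  "end": 6.0,  "words": 18},
    {"speaker": "customer", "start": 6.5,  "end": 10.5, "words": 10},
    {"speaker": "agent",    "start": 11.0, "end": 14.0, "words": 9},
]

call_duration = max(t["end"] for t in turns)  # 14.0 seconds

def talk_time(speaker: str) -> float:
    return sum(t["end"] - t["start"] for t in turns if t["speaker"] == speaker)

def talk_to_listen_ratio(speaker: str) -> float:
    # Time spent talking vs total call duration.
    return talk_time(speaker) / call_duration

def speaking_rate(speaker: str) -> float:
    # Words per minute while the speaker is actually talking.
    words = sum(t["words"] for t in turns if t["speaker"] == speaker)
    return words * 60 / talk_time(speaker)

print(round(talk_to_listen_ratio("agent"), 3))  # 0.643 (9s of a 14s call)
print(speaking_rate("agent"))                   # 180.0 wpm (27 words in 9s)
```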

Sentiment & Emotion

Emotion and sentiment analysis from vocal features and acoustic signals.
Powered by Hume Expression Measurement — a specialized vocal emotion model.
| Metric | Description | Output | Scope |
| --- | --- | --- | --- |
| sentiment_score | Sentiment rating on a 1–9 scale (1 = negative, 9 = positive) | Scale (1-9) | Per-participant |
| emotion_label | Detected emotion label from 64+ emotions | Classification | Per-participant |
| dominant_emotion | Most frequent emotion across the call | Classification | Per-participant |
| vocal_cue_label | Detected vocal cue or expression label | Classification | Per-participant |

Interruptions

Detailed interruption and overlap analysis from speaker diarization.
Powered by Roark Interruptions — a specialized overlap detection model.
| Metric | Description | Output | Scope |
| --- | --- | --- | --- |
| interruption | Whether overlapping speech occurred on a segment | Boolean | Per-participant |
| interruption_duration | Duration of overlapping speech | Numeric (seconds) | Per-participant |
| interruption_count | Total number of interruptions | Count | Global |
| first_interruption_time | Time into call of first interruption | Offset (seconds) | Global |
| overtalk_ratio | Ratio of overlapping speech duration to total call duration | Numeric | Global |
| agent_interruption_count | Number of times the agent interrupted the customer | Count | Global |
| incorrect_agent_interruption_count | Number of agent interruptions classified as inappropriate | Count | Global |
| incorrect_interruption_rate | Proportion of agent interruptions that were inappropriate | Scale (0-1) | Global |
| customer_barge_in_count | Number of times the customer attempted to interrupt the agent | Count | Global |
| failed_barge_in | Whether a customer interruption attempt failed because the agent did not yield | Boolean | Per-participant |
| failed_barge_in_count | Number of customer interruption attempts where the agent did not yield | Count | Global |
| failed_barge_in_rate | Proportion of customer interruption attempts that failed | Scale (0-1) | Global |
| interruption_appropriateness | Whether an agent interruption was appropriate based on conversational context | Boolean | Per-participant |
| pre_interruption_speaker_duration | How long the interrupted speaker had been talking before being interrupted | Numeric (seconds) | Per-participant |
| agent_cutoff | Whether the agent started speaking while the customer was mid-sentence | Boolean | Per-participant |
| agent_cutoff_count | Number of times the agent cut off the customer mid-sentence | Count | Global |
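The two rate metrics in this table are ratios over the corresponding counts. A minimal sketch of that relationship (function and argument names mirror the metric names above; the zero-denominator fallback is an assumption):

```python
def interruption_rates(agent_interruption_count: int,
                       incorrect_agent_interruption_count: int,
                       customer_barge_in_count: int,
                       failed_barge_in_count: int) -> dict:
    """Derive the 0-1 rate metrics from their underlying counts."""
    return {
        "incorrect_interruption_rate": (
            incorrect_agent_interruption_count / agent_interruption_count
            if agent_interruption_count else 0.0
        ),
        "failed_barge_in_rate": (
            failed_barge_in_count / customer_barge_in_count
            if customer_barge_in_count else 0.0
        ),
    }

print(interruption_rates(4, 1, 5, 2))
# {'incorrect_interruption_rate': 0.25, 'failed_barge_in_rate': 0.4}
```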

Quality

Experience quality scoring from conversational signals.
Powered by Roark Quality Analysis and Roark Prism — specialized models for quality assessment.
| Metric | Description | Output | Scope |
| --- | --- | --- | --- |
| frustration_score | Customer frustration level (1 = none, 5 = severe) | Scale (1-5) | Per-participant |
| user_effort_score | How much effort the customer exerted to accomplish their goal (1 = effortless, 5 = very difficult) | Scale (1-5) | Per-participant |
| call_outcome | Overall outcome: Resolved, Unresolved, Escalated, Dropped, or Follow-up Required | Classification | Global |
| instruction_follow | How well the agent followed its given instructions (1 = not followed, 5 = fully followed) | Scale (1-5) | Global |
| redundant_question_count | Questions where the agent asked for information already provided | Count | Global |
| missed_response_count | Moments where a participant should have responded but did not | Count | Per-participant |
| comprehension_failure | Whether the agent misunderstood what the customer said | Boolean | Per-participant |
| comprehension_failure_count | Number of times the agent misunderstood the customer | Count | Global |

Repetition Detection

Conversational loop and repetition analysis.
Powered by Roark Prism — our evaluation model optimized for voice AI.
| Metric | Description | Output | Scope |
| --- | --- | --- | --- |
| repetition_density | Ratio of repeated turns to total turns (0–1) | Numeric | Per-participant |
| loop_count | Number of distinct conversational loops detected | Count | Per-participant |
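repetition_density has a straightforward shape: repeated turns over total turns. A naive sketch using normalized exact matching — Roark's model also catches paraphrased loops, so treat this as illustration only:

```python
def repetition_density(turns: list) -> float:
    """Ratio of repeated turns to total turns (0-1), by normalized exact match."""
    normalized = [t.lower().strip() for t in turns]
    seen, repeats = set(), 0
    for t in normalized:
        if t in seen:
            repeats += 1  # count every occurrence after the first
        seen.add(t)
    return repeats / len(normalized) if normalized else 0.0

agent_turns = [
    "Could you confirm your date of birth?",
    "I can help with that.",
    "Could you confirm your date of birth?",  # repeat
    "could you confirm your date of birth?",  # repeat (case-insensitive)
]
print(repetition_density(agent_turns))  # 0.5
```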

Tool Invocations

Analysis of function/tool calling behavior during conversations.
Powered by Roark Vibe and Roark Prism.
| Metric | Description | Output | Scope |
| --- | --- | --- | --- |
| tool_invocation_count | Total number of tool/function calls made during the conversation | Count | Global |
| tool_invocation_correct | Whether the agent invoked the correct tools at the appropriate times | Boolean | Global |
| tool_invocation_order_correct | Whether tools were called in the correct logical sequence | Boolean | Global |
| tool_invocation_parameters_correct | Whether correct parameters were passed to each tool invocation | Boolean | Global |
| tool_invocation_result_correct | Whether the agent correctly interpreted and used tool results | Boolean | Global |
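One way to picture a sequence check like tool_invocation_order_correct is a subsequence match: the expected tools appear in order, with extra calls allowed. That matching rule and the tool names below are assumptions for illustration, not necessarily how Roark scores this metric:

```python
def order_correct(observed: list, expected: list) -> bool:
    """True if `expected` appears in order within `observed` (extra calls allowed)."""
    it = iter(observed)
    # `name in it` consumes the iterator up to the match, so order is enforced.
    return all(name in it for name in expected)

observed = ["lookup_account", "check_balance", "send_receipt"]  # hypothetical tool names
print(order_correct(observed, ["lookup_account", "send_receipt"]))  # True
print(order_correct(observed, ["send_receipt", "lookup_account"]))  # False
```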

Compliance

Regulatory and safety evaluation metrics for AI agent conversations.
Powered by Roark Prism — customizable with your own compliance requirements.
| Metric | Description | Output | Scope |
| --- | --- | --- | --- |
| compliance_disclosure_completeness | Whether all required disclosures were delivered (recording notice, AI identity, licensing) | Scale (1-5) | Global |
| compliance_prohibited_language | Whether the agent used prohibited language (unauthorized guarantees, medical/legal advice, discrimination) | Boolean | Global |
| compliance_pii_handling | How properly the agent handled personally identifiable information | Scale (1-5) | Global |
| compliance_consent_collection | Whether required consent was obtained before data collection or recording | Boolean | Global |
| compliance_escalation_adherence | Whether the agent properly escalated to a human when required | Boolean | Global |
| compliance_scope_adherence | Whether the agent stayed within its defined scope of topics | Scale (1-5) | Global |
| compliance_prompt_injection_resistance | Whether the agent resisted attempts to override its instructions or jailbreak | Boolean | Global |
| compliance_identity_consistency | Whether the agent maintained its assigned identity and disclosed its AI nature | Boolean | Global |
| compliance_hallucination_boundary | Whether the agent avoided fabricating information and deferred when unsure | Scale (1-5) | Global |

Voicemail Detection

Voicemail detection and handling quality assessment.
Powered by Roark Prism.
| Metric | Description | Output | Scope |
| --- | --- | --- | --- |
| voicemail_detected | Whether the call reached a voicemail system rather than a live person | Boolean | Global |
| voicemail_agent_left_message | Whether the agent left a voicemail message | Boolean | Global |
| voicemail_handling_score | Quality of voicemail handling (beep detection, message clarity, completeness) | Scale (1-5) | Global |

Accent Detection

English accent identification from audio signals.
Powered by Roark Accent ID — a specialized accent classification model supporting 16 English accent variants.
| Metric | Description | Output | Scope |
| --- | --- | --- | --- |
| accent | Detected English accent (US, British, Australian, Canadian, Indian, Irish, Scottish, Welsh, African, New Zealand, Hong Kong, Malaysian, Philippine, Singaporean, Bermudian, South Atlantic) | Classification | Per-participant |
| accent_stability | How stable the detected accent is across segments (1.0 = consistent, lower = varies) | Numeric (0-1) | Per-participant |
For a detailed walkthrough on using accent metrics, see the Accent Detection recipe.
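One plausible reading of accent_stability is the fraction of segments whose detected accent matches the dominant label. A sketch under that assumption — not necessarily the model's exact definition:

```python
from collections import Counter

def accent_stability(segment_accents: list) -> float:
    """Fraction of segments matching the dominant accent (1.0 = fully consistent)."""
    if not segment_accents:
        return 0.0
    _, dominant_count = Counter(segment_accents).most_common(1)[0]
    return dominant_count / len(segment_accents)

print(accent_stability(["US", "US", "US", "British"]))  # 0.75
```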

Call Quality (DNSMOS)

Speech quality assessment using the ITU-T P.808/P.835 Mean Opinion Score (MOS) scale.
Powered by Roark DNSMOS — a specialized speech quality model based on the ITU-T standard.
| Metric | Description | Output | Scope |
| --- | --- | --- | --- |
| speech_quality_overall | Overall perceived speech quality (P.835 OVRL). Combines signal and background noise quality. | Scale (1-5) | Global |
| speech_quality_signal | Quality of the speech signal itself (P.835 SIG). Measures distortion, codec artifacts, and clarity. | Scale (1-5) | Global |
| speech_quality_background | Background environment quality (P.835 BAK). Higher = cleaner background. | Scale (1-5) | Global |
| speech_quality_mos | ITU-T P.808 Mean Opinion Score. Single overall quality rating from the audio signal. | Scale (1-5) | Global |

What’s Next

Custom Metrics

Create custom LLM as Judge, Pattern, and Formula metrics

Playground

Test metrics interactively against real calls

Metric Policies

Automate metric collection with conditions-based rules

Thresholds

Define pass/fail criteria for your metrics