Playground

Overview

The Playground is an interactive environment in the Roark dashboard where you can test custom metrics against real calls before adding them to policies or running them at scale. Iterate on metric prompts quickly without affecting production data.

What You Can Do

Test custom metrics — Write an LLM prompt and see how it evaluates against a real call
Preview metric output — See the exact value, confidence score, and reasoning before deploying
Iterate quickly — Adjust prompts and re-run instantly without creating metric definitions
Validate before deploying — Ensure your metrics produce the expected results on representative calls

Getting Started

Open the Playground

Navigate to Playground in your Roark dashboard sidebar.

Select a Call

Choose an existing call from your project to test against. Pick a call that represents the type of conversation your metric will evaluate.

Configure Your Metric

Set the output type (boolean, scale, classification, etc.) and write your LLM evaluation prompt. The prompt should clearly describe what the metric should measure.
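It can help to draft the configuration as plain data before entering it in the Playground. The sketch below is a local note-taking aid, not part of the Roark SDK: the `MetricDraft` class, its field names, and the eight-word prompt check are all assumptions made for illustration. Only the three output types named above are listed.

```python
from __future__ import annotations

from dataclasses import dataclass

# Output types named in this guide; the dashboard offers others not listed here.
KNOWN_OUTPUT_TYPES = {"boolean", "scale", "classification"}


@dataclass(frozen=True)
class MetricDraft:
    """Hypothetical local record of a Playground configuration (not a Roark type)."""

    name: str
    output_type: str
    prompt: str

    def problems(self) -> list[str]:
        """Return human-readable issues worth fixing before running the test."""
        issues = []
        if self.output_type not in KNOWN_OUTPUT_TYPES:
            issues.append(f"unrecognized output type: {self.output_type!r}")
        # Arbitrary heuristic: very short prompts rarely say what to measure.
        if len(self.prompt.split()) < 8:
            issues.append("prompt is very short; describe what to measure")
        return issues


draft = MetricDraft(
    name="agent_confirmed_identity",
    output_type="boolean",
    prompt="Did the agent verify the caller's identity before discussing account details?",
)
print(draft.problems())  # → []
```

Keeping drafts in this form also gives you a record of each prompt version you tried.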

Run the Test

Click Run to evaluate the metric against the selected call. Review the output value, confidence score, and reasoning.

Iterate and Deploy

Adjust your prompt or configuration and re-run until the metric produces the results you expect. Once satisfied, you can:

Create the metric definition via the dashboard or SDK
Run it on more calls using a metric collection job
Automate it by adding it to a metric policy

Testing Thresholds

The Playground also lets you test thresholds on your metrics. After running a metric, configure a pass/fail condition (e.g., >= 7) and see immediately whether the call would pass or fail, without creating a metric definition first. This lets you confirm that your threshold logic behaves as expected before adding it to a policy or run plan.
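To make the pass/fail logic concrete, here is a minimal sketch of how a condition such as `>= 7` could be checked against a numeric metric value. The `passes` function and the set of operators it accepts are assumptions for illustration; the operators Roark actually supports are described in the Thresholds Guide.

```python
import operator
import re

# Assumed operator set for this sketch; see the Thresholds Guide for the real list.
OPERATORS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
}

# Two-character operators come first so ">=" is not read as ">" followed by "=".
CONDITION = re.compile(r"\s*(>=|<=|==|!=|>|<)\s*(-?\d+(?:\.\d+)?)\s*")


def passes(value: float, condition: str) -> bool:
    """Return True if `value` satisfies a condition written like '>= 7'."""
    match = CONDITION.fullmatch(condition)
    if match is None:
        raise ValueError(f"cannot parse condition: {condition!r}")
    symbol, threshold = match.groups()
    return OPERATORS[symbol](value, float(threshold))


print(passes(8, ">= 7"))    # → True
print(passes(6.5, ">= 7"))  # → False
```

Running a few borderline values through a check like this is a quick way to decide whether the boundary should be inclusive (`>=`) or strict (`>`).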

Thresholds Guide: learn about operators, aggregation modes, and participant role filtering.

Tips

Use Representative Calls

Test against calls that reflect the variety of conversations your metric will encounter in production.

Be Specific in Prompts

Clear, specific prompts produce more consistent results. Include examples of what constitutes a positive or negative result.
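One way to keep prompts specific is to assemble them from a stated criterion plus explicit positive and negative examples. The helper below is a hypothetical sketch: the `build_prompt` name, the section wording, and the sample criterion are illustrative assumptions, not a format Roark requires.

```python
from __future__ import annotations


def build_prompt(criterion: str, passing: list[str], failing: list[str]) -> str:
    """Assemble an evaluation prompt that states the criterion and shows examples."""
    lines = [criterion.strip(), "", "Counts as a positive result:"]
    lines += [f"- {example}" for example in passing]
    lines += ["", "Counts as a negative result:"]
    lines += [f"- {example}" for example in failing]
    return "\n".join(lines)


prompt = build_prompt(
    "Did the agent confirm the appointment date and time with the caller?",
    passing=["The agent repeats the date and time and the caller agrees."],
    failing=["The agent books the slot without reading it back."],
)
print(prompt)
```

The resulting text can be pasted into the Playground prompt field and adjusted between runs.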

Test Edge Cases

Try your metric against calls where the answer is ambiguous to see how it handles uncertainty.

What’s Next

Metrics

Collection Jobs
