
Human Evals

Send traces to human reviewers with a custom evaluation form, then aggregate the results.

Some quality questions can't be answered by a rule or a classifier — they need a person. Human Evals lets you route real traces to reviewers with a structured form, collect their judgments, and aggregate the results into quality metrics. It's how you turn "this output felt off" into something measurable.


How an evaluation works

You create an evaluation, and Neatlogs dispatches matching items to reviewers, who score them against a form you define. An item is usually a whole trace, but for existing traces you can also scope an evaluation down to individual spans. Setting one up takes five steps:

  1. Name and describe the evaluation.
  2. Choose a trace source:
    • Existing — score traces (or specific spans) already captured, optionally narrowed by a filter.
    • Future — score new traces as they arrive over a collection window, with sampling and daily limits so reviewers aren't flooded.
  3. Build the form — a set of custom questions spanning ratings, linear scales, multiple choice, checkboxes, dropdowns, grids, and free text. Start from a reusable evaluation template, describe the form in plain language and let Neatlogs generate it, or build one from scratch.
  4. Assign reviewers — pick the team members who will score the items.
  5. Launch. Items are dispatched on your chosen schedule; reviewers see them in their assigned queue.
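
To make the "sampling and daily limits" idea in step 2 concrete, here is a minimal sketch of how a sample rate and a per-day cap could combine to decide whether a newly arrived trace is sent for review. It is an illustration only, not the Neatlogs implementation or API: the class FutureTraceSampler and the names sample_rate, daily_limit, and should_dispatch are all hypothetical.

```python
import random
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class FutureTraceSampler:
    """Hypothetical sketch of sampling plus a daily cap (not a Neatlogs API)."""

    sample_rate: float  # fraction of incoming traces to review, 0.0 to 1.0
    daily_limit: int    # maximum number of items dispatched per calendar day
    rng: random.Random = field(default_factory=random.Random)
    _day: Optional[date] = field(default=None, init=False)
    _sent_today: int = field(default=0, init=False)

    def should_dispatch(self, today: date) -> bool:
        """Return True if a trace arriving on `today` should go to reviewers."""
        # Start a fresh count whenever the calendar day changes.
        if today != self._day:
            self._day = today
            self._sent_today = 0
        # The daily cap wins over sampling: once it is hit, nothing more is sent.
        if self._sent_today >= self.daily_limit:
            return False
        # Keep roughly `sample_rate` of the remaining traces.
        if self.rng.random() >= self.sample_rate:
            return False
        self._sent_today += 1
        return True
```

With a sample rate of 0.1 and a daily limit of 50, for example, about one trace in ten is selected until 50 have been dispatched that day, after which every further trace is skipped until the next day.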

Reviewing

Assigned reviewers work through their queue, where each item shows the trace alongside the evaluation form. Reviewers don't need to understand the underlying span data; they answer the questions you wrote. An evaluation can carry a review deadline so that a round of scoring completes on time: one deadline for the whole evaluation when it is one-time or runs on existing traces, or one deadline per batch when it recurs.


Results

As responses come in, Neatlogs aggregates them into overall summaries, per-question breakdowns, and, for recurring evaluations, per-batch views. Recurring batches can also generate a one-click AI summary report of the round, and you can export raw responses as CSV or JSON. This closes the loop between what your agent does in production and how well a human judges it to have performed.

Human Evals complements the lightweight voting you do while debugging in Traces. Use voting for quick, ad-hoc signal as you browse; use Human Evals for structured, assigned review against a defined rubric.
