Experiments
Prompt version control, a sandbox to test changes, and evaluation datasets.
Experiments is where you iterate on your agent's behavior without touching production. It has three parts:
- Prompts — version-controlled storage for every prompt your agent uses
- Playground — a sandbox to test prompt changes before they go live
- Datasets — a labelled evaluation layer built from your production traces (coming soon)
Prompts
Your prompts are the most important variable in how your agent behaves. Experiments treats them like code — every change is a version, every version is labelled, and rolling back is one click.
Each prompt has a name, a full version history, and labels like production or staging that your application uses to fetch the right version at runtime. When you're ready to ship a change, you move the label — no code change, no redeploy.
```python
import neatlogs

# Your application always fetches by label, not by version number
prompt = neatlogs.get_prompt("support-system-prompt", label="production")
```

The version history shows what changed, when, and why. If a new version causes regressions in your traces, you can see exactly what the prompt looked like before and promote an older version back immediately.
```python
# Promote version 2 back to production
neatlogs.update_prompt(
    name="support-system-prompt",
    version=2,
    new_labels=["production"],
)
```

See Prompt Templates for the full SDK reference.
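At runtime, the fetched version is the string your agent actually sends to the model. As a rough sketch of that hand-off, assuming the returned object exposes its text and that variables use {curly} placeholders (both are illustrative stand-ins; the real attribute names and template syntax are in Prompt Templates):

```python
import neatlogs

prompt = neatlogs.get_prompt("support-system-prompt", label="production")

# Assumption: .text and {curly} placeholders are illustrative stand-ins
# for the actual prompt attribute and template syntax
system_message = prompt.text.format(customer_name="Ada", product="Acme Pro")

# Pass system_message to whichever model client your application uses
```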
Playground
Once you have a prompt version you want to test, the Playground lets you run it before it touches production. Open any version, fill in the template variables, pick a model, and run it. The response comes back in the same view — no side effects on live traces, no changes to what your agent is actually serving.
The typical workflow: write a new version, run it against a range of inputs until the outputs look right, then save it and move the production label over. What you test in the Playground is exactly what your agent will get when you promote it.
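The promote step at the end of that workflow is the same label move shown above. A minimal sketch of the full loop, assuming a staging label marks the candidate version and that version 3 happens to be the one you just tested (both are placeholders for your own setup):

```python
import neatlogs

# Fetch the candidate version your test environment is running against.
# "staging" is just one possible label; use whatever marks your candidate.
candidate = neatlogs.get_prompt("support-system-prompt", label="staging")

# ...iterate in the Playground until the outputs look right...

# Then move the production label to the version you tested.
# version=3 is a placeholder for whichever version that was.
neatlogs.update_prompt(
    name="support-system-prompt",
    version=3,
    new_labels=["production"],
)
```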
Datasets
Coming soon.
Datasets will connect your production traces to your evaluation workflow. Spans you vote on while debugging feed directly into a labelled dataset — closing the loop between what your agent does in production and what you evaluate it against, without a separate labelling process.