AI agent observability tools compared: the audit trail for a misbehaving agent
AI agent observability tools record traces of every model call, prompt, tool use, and cost, giving you the audit trail needed to detect a compromised or misbehaving agent. Langfuse, Helicone, and Arize Phoenix are open source and self-hostable; LangSmith is hosted with a self-host tier; OpenLLMetry is pure OpenTelemetry instrumentation that ships to any backend.
Independent SEO consultant & AI practitioner who builds and tests these tools.
AI agent observability tools compared: the audit trail for a misbehaving agent
AI agent observability tools record traces of every model call, prompt, tool use, and cost, giving you the audit trail needed to detect a compromised or misbehaving agent. Langfuse, Helicone, and Arize Phoenix are open source and self-hostable; LangSmith is hosted with a self-host tier; OpenLLMetry is pure OpenTelemetry instrumentation that ships to any backend. This comparison is based on each tool’s public documentation, not benchmarks we ran.
TL;DR:
- Langfuse (repo): MIT licensed open-source LLM engineering platform; tracing, prompt management, evals, cost tracking; OpenTelemetry integration; self-host or Langfuse Cloud.
- Helicone (repo): Apache 2.0, proxy and AI gateway based; traces, sessions, cost, latency, prompt versioning; self-host via Docker or Helm.
- Arize Phoenix (repo): Elastic License 2.0, built on OpenTelemetry and OpenInference; tracing, evaluation, prompt management; runs locally or in containers.
- LangSmith (docs): hosted platform with cloud, hybrid, and self-hosted tiers; traces, prompts, costs, evals; works beyond LangChain.
- OpenLLMetry (repo): Apache 2.0 OpenTelemetry extensions that instrument LLMs and export to any OTel backend.
- The security point: a trace is an audit log. Pair these with least-privilege for AI agents.
Why does observability matter for agent security?
An autonomous agent reads untrusted input, makes model calls, and invokes tools that touch real systems. If it is hijacked by prompt injection or simply misbehaves, you need to know what it did and why. Observability gives you that audit trail. A trace records the chain: the prompt that came in, the model’s reasoning, the tool calls it made, and the cost and latency of each step.
That recorded history is what turns an opaque incident into a forensic one. When an agent suddenly calls a tool it should not, exfiltrates data, or burns through tokens, the trace shows the exact prompt that triggered it and the sequence that followed. Without traces you are guessing. With them you can detect the deviation, see the blast radius, and feed evidence into a response. This is the detection layer behind detect a compromised AI agent.
What does each observability tool do?
The table below summarises each tool from its public documentation and stated features, not from benchmarks we measured. Licences and capabilities change, so confirm on each project’s own pages before you commit.
| Tool | Open source / licence | OpenTelemetry | Self-host | Hosted | What it captures (per its docs) |
|---|---|---|---|---|---|
| Langfuse | Open source, MIT (enterprise folders excluded) | Listed as an integration | Yes: Docker Compose, VMs, Kubernetes/Helm | Yes: Langfuse Cloud, free tier | Traces of LLM calls, retrieval, embeddings, agent actions; prompt management with versioning; evaluations; cost tracking |
| Helicone | Open source, Apache 2.0 | Async logging via OpenLLMetry | Yes: Docker Compose or Helm | Yes: helicone.ai, free tier | Traces and sessions for agents and chatbots; cost, latency, quality; prompt versioning from production data; eval integrations |
| Arize Phoenix | Open source, Elastic License 2.0 | Built on OpenTelemetry, uses OpenInference | Yes: local, Jupyter, containers | Yes: cloud at app.phoenix.arize.com | OTel-based traces of app runtime; LLM-driven evaluation; prompt management; versioned datasets and experiments |
| LangSmith | Hosted platform (not stated as open source) | Referenced conceptually, not a stated export path | Yes: self-hosted tier | Yes: cloud and hybrid | Full traces from individual spans to production metrics; prompts and outputs; cost via trace pricing; evaluations |
| OpenLLMetry | Open source, Apache 2.0 | Yes: extensions on top of OpenTelemetry | N/A (instrumentation library) | N/A (exports to your backend) | Instruments LLM providers, vector DBs, and frameworks; exports traces to any OTel backend (Datadog, Grafana, and others) |
Which are open source and self-hostable?
Three are both. Per their repositories, Langfuse is MIT licensed, Helicone is Apache 2.0, and Phoenix is under the Elastic License 2.0, and all three can run inside your own infrastructure: Langfuse via Docker Compose, VMs, or Kubernetes; Helicone via Docker Compose or Helm; Phoenix locally, in a notebook, or in containers. For a security team, self-hosting matters because your traces, which contain prompts and possibly sensitive data, never leave your perimeter. Check the Elastic License 2.0 terms against your own use before you embed Phoenix in a product.
Which support OpenTelemetry?
OpenTelemetry support is the difference between a closed dashboard and an audit trail you control. Per its documentation, Arize Phoenix is built on OpenTelemetry, Langfuse lists OpenTelemetry as an integration, and OpenLLMetry is itself a set of OpenTelemetry extensions that exports to any compatible backend such as Datadog or Grafana. Helicone documents async logging via OpenLLMetry. LangSmith’s documentation references OpenTelemetry conceptually rather than as a stated export path, so confirm current support on its own pages.
How do proxy and instrumentation models differ?
The tools capture traces in different ways. Helicone is proxy and gateway based: per its docs you change the baseURL in your code so requests route through Helicone, which then logs them. OpenLLMetry instruments your code directly through OpenTelemetry, capturing spans in-process and shipping them onward. Langfuse and Phoenix support SDK-based instrumentation, and Phoenix layers OpenInference on top. The proxy model is quick to wire up; in-process instrumentation gives finer-grained spans without routing traffic through a third party.
How do you choose an observability tool?
Start from your constraint, not the brand.
- If you want open source, self-hosted, and full features in one place: Langfuse covers tracing, prompts, evals, and cost under MIT.
- If you want the fastest wiring via a proxy gateway: Helicone lets you change a base URL and start logging.
- If you are standardised on OpenTelemetry and want eval tooling: Arize Phoenix is built on OTel and runs anywhere.
- If you are deep in the LangChain ecosystem and want a managed platform: LangSmith offers cloud, hybrid, and self-hosted tiers, and works beyond LangChain.
- If you already have an observability backend and just need LLM spans: OpenLLMetry instruments and exports to it.
Whatever you pick, remember that observability is detection, not prevention. A trace tells you an agent misbehaved; it does not stop it. Pair it with scoped permissions so a hijacked agent cannot do much in the first place, and read audit your AI agent setup to find the gaps before an attacker does.
Where to go next
Treat your observability tool as the logging layer of a wider defence. Scope what an agent can touch with least-privilege for AI agents, learn the signals of a hijack in detect a compromised AI agent, and find the holes in your wiring with audit your AI agent setup. Browse more write-ups in the tools directory and the guides library. The comparison above reflects each tool’s public documentation and stated features, not benchmarks we measured; always confirm current capabilities, licences, and pricing on the vendor’s own pages before you commit.
Frequently asked questions
What is AI agent observability?
Observability means recording a detailed trace of what an agent did: every model call, prompt, tool invocation, latency, and cost. Per these tools' documentation, that trace is what lets you reconstruct an agent's behaviour after the fact and spot anomalies.
How does observability help security?
A trace is an audit log. If an agent is hijacked by prompt injection or starts calling tools it should not, the recorded traces let you detect the deviation, see which prompt triggered it, and scope the blast radius. Without traces you are blind.
Which AI observability tools are open source?
Per their repositories, Langfuse is MIT licensed, Helicone is Apache 2.0, and Arize Phoenix is under the Elastic License 2.0. OpenLLMetry is Apache 2.0 instrumentation. LangSmith is a hosted platform offering cloud, hybrid, and self-hosted deployment.
Do these tools support OpenTelemetry?
Several do. Per their docs, Langfuse lists OpenTelemetry as an integration, Arize Phoenix is built on OpenTelemetry, and OpenLLMetry is a set of extensions on top of OpenTelemetry that exports to any compatible backend.
Can I self-host an observability tool?
Yes for most. Per their documentation, Langfuse self-hosts via Docker Compose, VMs, or Kubernetes; Helicone self-hosts via Docker Compose or Helm; Phoenix runs locally or in containers. LangSmith offers a self-hosted deployment tier alongside cloud.