AI agent observability tools compared: the audit trail for a misbehaving agent

AI agent observability tools record traces of every model call, prompt, tool use, and cost, giving you the audit trail needed to detect a compromised or misbehaving agent. Langfuse, Helicone, and Arize Phoenix are open source and self-hostable; LangSmith is hosted with a self-host tier; OpenLLMetry is pure OpenTelemetry instrumentation that ships to any backend.

By Sunny Patel Updated 21 June 2026

Independent SEO consultant & AI practitioner who builds and tests these tools.

AI agent observability tools compared: the audit trail for a misbehaving agent

AI agent observability tools record traces of every model call, prompt, tool use, and cost, giving you the audit trail needed to detect a compromised or misbehaving agent. Langfuse, Helicone, and Arize Phoenix are open source and self-hostable; LangSmith is hosted with a self-host tier; OpenLLMetry is pure OpenTelemetry instrumentation that ships to any backend. This comparison is based on each tool’s public documentation, not benchmarks we ran.

TL;DR:

Langfuse (repo): MIT licensed open-source LLM engineering platform; tracing, prompt management, evals, cost tracking; OpenTelemetry integration; self-host or Langfuse Cloud.
Helicone (repo): Apache 2.0, proxy and AI gateway based; traces, sessions, cost, latency, prompt versioning; self-host via Docker or Helm.
Arize Phoenix (repo): Elastic License 2.0, built on OpenTelemetry and OpenInference; tracing, evaluation, prompt management; runs locally or in containers.
LangSmith (docs): hosted platform with cloud, hybrid, and self-hosted tiers; traces, prompts, costs, evals; works beyond LangChain.
OpenLLMetry (repo): Apache 2.0 OpenTelemetry extensions that instrument LLMs and export to any OTel backend.
The security point: a trace is an audit log. Pair these with least-privilege for AI agents.

Why does observability matter for agent security?

An autonomous agent reads untrusted input, makes model calls, and invokes tools that touch real systems. If it is hijacked by prompt injection or simply misbehaves, you need to know what it did and why. Observability gives you that audit trail. A trace records the chain: the prompt that came in, the model’s reasoning, the tool calls it made, and the cost and latency of each step.

That recorded history is what turns an opaque incident into a forensic one. When an agent suddenly calls a tool it should not, exfiltrates data, or burns through tokens, the trace shows the exact prompt that triggered it and the sequence that followed. Without traces you are guessing. With them you can detect the deviation, see the blast radius, and feed evidence into a response. This is the detection layer behind detect a compromised AI agent.

What does each observability tool do?

The table below summarises each tool from its public documentation and stated features, not from benchmarks we measured. Licences and capabilities change, so confirm on each project’s own pages before you commit.

Tool	Open source / licence	OpenTelemetry	Self-host	Hosted	What it captures (per its docs)
Langfuse	Open source, MIT (enterprise folders excluded)	Listed as an integration	Yes: Docker Compose, VMs, Kubernetes/Helm	Yes: Langfuse Cloud, free tier	Traces of LLM calls, retrieval, embeddings, agent actions; prompt management with versioning; evaluations; cost tracking
Helicone	Open source, Apache 2.0	Async logging via OpenLLMetry	Yes: Docker Compose or Helm	Yes: helicone.ai, free tier	Traces and sessions for agents and chatbots; cost, latency, quality; prompt versioning from production data; eval integrations
Arize Phoenix	Open source, Elastic License 2.0	Built on OpenTelemetry, uses OpenInference	Yes: local, Jupyter, containers	Yes: cloud at app.phoenix.arize.com	OTel-based traces of app runtime; LLM-driven evaluation; prompt management; versioned datasets and experiments
LangSmith	Hosted platform (not stated as open source)	Referenced conceptually, not a stated export path	Yes: self-hosted tier	Yes: cloud and hybrid	Full traces from individual spans to production metrics; prompts and outputs; cost via trace pricing; evaluations
OpenLLMetry	Open source, Apache 2.0	Yes: extensions on top of OpenTelemetry	N/A (instrumentation library)	N/A (exports to your backend)	Instruments LLM providers, vector DBs, and frameworks; exports traces to any OTel backend (Datadog, Grafana, and others)

Which are open source and self-hostable?

Three are both. Per their repositories, Langfuse is MIT licensed, Helicone is Apache 2.0, and Phoenix is under the Elastic License 2.0, and all three can run inside your own infrastructure: Langfuse via Docker Compose, VMs, or Kubernetes; Helicone via Docker Compose or Helm; Phoenix locally, in a notebook, or in containers. For a security team, self-hosting matters because your traces, which contain prompts and possibly sensitive data, never leave your perimeter. Check the Elastic License 2.0 terms against your own use before you embed Phoenix in a product.

Which support OpenTelemetry?

OpenTelemetry support is the difference between a closed dashboard and an audit trail you control. Per its documentation, Arize Phoenix is built on OpenTelemetry, Langfuse lists OpenTelemetry as an integration, and OpenLLMetry is itself a set of OpenTelemetry extensions that exports to any compatible backend such as Datadog or Grafana. Helicone documents async logging via OpenLLMetry. LangSmith’s documentation references OpenTelemetry conceptually rather than as a stated export path, so confirm current support on its own pages.

How do proxy and instrumentation models differ?

The tools capture traces in different ways. Helicone is proxy and gateway based: per its docs you change the baseURL in your code so requests route through Helicone, which then logs them. OpenLLMetry instruments your code directly through OpenTelemetry, capturing spans in-process and shipping them onward. Langfuse and Phoenix support SDK-based instrumentation, and Phoenix layers OpenInference on top. The proxy model is quick to wire up; in-process instrumentation gives finer-grained spans without routing traffic through a third party.

How do you choose an observability tool?

Start from your constraint, not the brand.

If you want open source, self-hosted, and full features in one place: Langfuse covers tracing, prompts, evals, and cost under MIT.
If you want the fastest wiring via a proxy gateway: Helicone lets you change a base URL and start logging.
If you are standardised on OpenTelemetry and want eval tooling: Arize Phoenix is built on OTel and runs anywhere.
If you are deep in the LangChain ecosystem and want a managed platform: LangSmith offers cloud, hybrid, and self-hosted tiers, and works beyond LangChain.
If you already have an observability backend and just need LLM spans: OpenLLMetry instruments and exports to it.

Whatever you pick, remember that observability is detection, not prevention. A trace tells you an agent misbehaved; it does not stop it. Pair it with scoped permissions so a hijacked agent cannot do much in the first place, and read audit your AI agent setup to find the gaps before an attacker does.

Where to go next

Treat your observability tool as the logging layer of a wider defence. Scope what an agent can touch with least-privilege for AI agents, learn the signals of a hijack in detect a compromised AI agent, and find the holes in your wiring with audit your AI agent setup. Browse more write-ups in the tools directory and the guides library. The comparison above reflects each tool’s public documentation and stated features, not benchmarks we measured; always confirm current capabilities, licences, and pricing on the vendor’s own pages before you commit.

Frequently asked questions

What is AI agent observability?

Observability means recording a detailed trace of what an agent did: every model call, prompt, tool invocation, latency, and cost. Per these tools' documentation, that trace is what lets you reconstruct an agent's behaviour after the fact and spot anomalies.

How does observability help security?

A trace is an audit log. If an agent is hijacked by prompt injection or starts calling tools it should not, the recorded traces let you detect the deviation, see which prompt triggered it, and scope the blast radius. Without traces you are blind.

Which AI observability tools are open source?

Per their repositories, Langfuse is MIT licensed, Helicone is Apache 2.0, and Arize Phoenix is under the Elastic License 2.0. OpenLLMetry is Apache 2.0 instrumentation. LangSmith is a hosted platform offering cloud, hybrid, and self-hosted deployment.

Do these tools support OpenTelemetry?

Several do. Per their docs, Langfuse lists OpenTelemetry as an integration, Arize Phoenix is built on OpenTelemetry, and OpenLLMetry is a set of extensions on top of OpenTelemetry that exports to any compatible backend.

Can I self-host an observability tool?

Yes for most. Per their documentation, Langfuse self-hosts via Docker Compose, VMs, or Kubernetes; Helicone self-hosts via Docker Compose or Helm; Phoenix runs locally or in containers. LangSmith offers a self-hosted deployment tier alongside cloud.

AI agent observability tools compared: the audit trail for a misbehaving agent

Why does observability matter for agent security?

What does each observability tool do?

Which are open source and self-hostable?

Which support OpenTelemetry?

How do proxy and instrumentation models differ?

How do you choose an observability tool?

Where to go next

Frequently asked questions

Related reading