Observability Needs an Architecture Reset

AI agents are rewriting how observability data gets consumed. The architecture that worked for human-scale dashboards can’t keep up. It’s time to start over.

Before: monolithic clusters where ingest, query, and storage compete for resources. After: elastic architecture with separated layers that scale independently.

OLAP Modernized. Observability Didn't.

Every other corner of data infrastructure moved on a generation ago. Observability is the last holdout.

Analytics / OLAP evolved to:
  • Columnar storage
  • Object storage (S3) as the data layer
  • Separated storage and compute
  • Elastic compute

Observability is still:
  • Disk-heavy, SSD-bound
  • Cluster-bound (ingest, query, and storage coupled)
  • Operationally complex
  • Fragmented across signals

Telemetry volumes are compounding, and AI systems are generating and consuming more observability data than ever. Meanwhile, the rest of the data world moved on: data warehouses separated compute from storage years ago, OLAP engines run directly on object storage, and lakehouses decoupled ingest from query.

Observability is still coupling everything into monolithic clusters backed by local SSDs, replicated three ways for durability, provisioned for peak load around the clock. The result: overprovisioned clusters, query timeouts under load, painful scaling, a constant ops burden, and an ever-growing bill that forces you to downsample or drop data just to stay within budget.

Agents Change Everything

AI agents are going to consume observability data at a fundamentally different scale than humans do. The architecture has to be ready.

Human Operators
  • Latency: seconds acceptable
  • Queries: dozens per incident
  • Data processed: sampled, downsampled
  • Data stored: compromised by cost
  • Resolution: 1-minute+ aggregations

AI Agents
  • Latency: milliseconds expected
  • Queries: thousands per incident
  • Data processed: full resolution, all of it
  • Data stored: everything, always
  • Resolution: raw, sub-second

A human opens a dashboard, scans a few charts, maybe runs a handful of ad-hoc queries. An agent investigating the same alert will fire hundreds or thousands of queries — exploring dimensions, correlating signals across metrics, logs, and traces, testing hypotheses in parallel.

Debugging is now a machine-scale problem. Disk-based architectures physically cannot serve this workload without massive over-provisioning. You'd need 10–100x the cluster capacity to handle agent-scale query volumes, and you'd need it sitting idle the rest of the time. Elastic compute is the only architecture where this kind of bursty workload is economically viable.
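
To make the workload concrete, here is a minimal sketch of the burst pattern an investigating agent produces, assuming a Prometheus-compatible query endpoint. The endpoint URL and the hypothesis queries are illustrative, not Oodle's API.

```python
# Illustrative only: the query burst an investigating agent produces,
# assuming a Prometheus-compatible HTTP API. Endpoint and queries are
# hypothetical, not Oodle-specific.
import asyncio
import aiohttp

BASE_URL = "https://observability.example.com"  # placeholder endpoint

async def run_query(session: aiohttp.ClientSession, promql: str) -> dict:
    # One instant query; an agent issues hundreds to thousands of these.
    async with session.get(f"{BASE_URL}/api/v1/query",
                           params={"query": promql}) as resp:
        return await resp.json()

async def investigate(hypotheses: list[str]) -> list[dict]:
    # Test every hypothesis in parallel: the backend sees one sudden burst.
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(run_query(session, q) for q in hypotheses))

# A human might run five of these by hand; an agent generates one per
# service / status-code / region combination.
hypotheses = [
    f'rate(http_requests_total{{service="{svc}",code="{code}"}}[5m])'
    for svc in ("checkout", "payments", "auth")
    for code in ("500", "503", "429")
]
results = asyncio.run(investigate(hypotheses))
```

The burst arrives, runs, and disappears. Sizing a fixed cluster for that peak means paying for it all day.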

What If You Started From Scratch?

If you were designing an observability backend today, with no legacy baggage, what would it look like?

Object Storage for Everything

Infinite capacity, 11 nines of durability, a fraction of the cost of SSDs. No replication needed — S3 handles it.

Elastic Compute

Compute scales out instantly for heavy dashboards or incident investigations, and scales to zero when nobody’s querying. No fixed clusters. No idle tax.

No Over-Provisioning

Pay for the work you actually do, not for capacity sitting idle waiting for a spike that may never come.

Zero Ops

No clusters to manage. No disks to monitor. No rebalancing, no capacity planning, no 3 a.m. pages about the observability system itself.

High Resolution by Default

No metrics downsampling. No trace sampling. Full log retention. Storage is cheap enough that there’s no reason to throw any signal away.

Open Interfaces

Standard query and ingest protocols. Your existing dashboards, alerts, and instrumentation should just work.

Elastic Architecture from First Principles

Oodle isn’t a lift-and-shift of an on-prem database into the cloud. Every layer was designed for object storage and elastic compute from day one.

[Diagram: Oodle architecture overview]

Custom Columnar Format — 600x Compression

A purpose-designed storage format for observability data. 600x compression means each S3 GET returns 600x more useful data per byte transferred — the key insight that makes sub-second queries on object storage possible.
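
Oodle's format itself is proprietary, but a toy sketch shows the general principle behind columnar compression of telemetry: regularly spaced timestamps delta-encode into near-constant values that a generic compressor collapses almost entirely.

```python
# Toy illustration of columnar compression on telemetry (the principle,
# not Oodle's actual format, which is proprietary).
import struct
import zlib

timestamps = [1_700_000_000 + 15 * i for i in range(10_000)]  # 15s scrapes
deltas = [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]

raw = struct.pack(f"{len(timestamps)}q", *timestamps)
delta_encoded = struct.pack(f"{len(deltas)}q", *deltas)

# Delta-encoding turns a regular series into near-constant values, which
# a generic compressor collapses to a handful of bytes.
print(len(zlib.compress(raw)), len(zlib.compress(delta_encoded)))
```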

Separated Compute & Storage — Necessary, Not Sufficient

Ingest, storage, and query are fully decoupled. But separation alone is not enough — if your query layer is still a fixed-size cluster, a sudden heavy query can still crash the system. To truly scale, you need elastic compute.

Elastic Query Engine

Query compute spins up on demand, fans out in parallel, and releases resources when done. A sudden heavy query gets its own compute automatically — no cluster to crash, no capacity ceiling. Customers report 2–10x faster queries.
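
As a sketch of the scatter-gather idea (an illustration of the general technique, not Oodle's internals): each time partition on object storage gets its own short-lived worker, and partials are merged at the end. A thread pool stands in for ephemeral compute here, and the object layout is hypothetical.

```python
# Scatter-gather sketch (general technique, not Oodle internals): one
# short-lived worker per time partition on object storage, partials
# merged at the end.
from concurrent.futures import ThreadPoolExecutor

def partitions_for(start: int, end: int, step: int = 3600) -> list[str]:
    # Hypothetical layout: one object per hour of data.
    return [f"s3://telemetry/cpu/{t}.col" for t in range(start, end, step)]

def scan_partition(key: str) -> float:
    # Stand-in for: GET the object, decode columns, filter, aggregate.
    return 0.0

def query(start: int, end: int) -> float:
    keys = partitions_for(start, end)
    # Fan out: compute exists only for this query's lifetime, so a heavy
    # query gets more workers instead of crowding a fixed cluster.
    with ThreadPoolExecutor(max_workers=max(1, len(keys))) as pool:
        partials = list(pool.map(scan_partition, keys))
    return max(partials, default=0.0)  # merge step (here: max of partials)
```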

Intelligent Caching

Hot data and frequently-accessed results are cached in memory, warming automatically from usage patterns. Dashboard refreshes and alert evaluations hit cache directly — single-digit millisecond latency, zero compute invocations.
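
A minimal sketch of the idea, assuming results are keyed by query text plus an aligned time range so repeated dashboard refreshes hit the same entry:

```python
# Minimal result-cache sketch: entries keyed by query text plus an
# aligned time range, with a short TTL (all assumptions for illustration).
import time
from typing import Any, Callable

_cache: dict[tuple, tuple[float, Any]] = {}
TTL_SECONDS = 10.0

def cached_query(promql: str, start: int, end: int,
                 execute: Callable[[str, int, int], Any]) -> Any:
    key = (promql, start, end)
    hit = _cache.get(key)
    if hit is not None and time.monotonic() - hit[0] < TTL_SECONDS:
        return hit[1]  # served from memory: zero compute invoked
    result = execute(promql, start, end)  # cold path: elastic compute
    _cache[key] = (time.monotonic(), result)
    return result
```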

No Global Index

No massive inverted indexes consuming RAM. Lightweight, purpose-built metadata optimized for observability query patterns. Any tag, any cardinality — no performance cliff.
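
One common way to get this behavior, sketched here as an assumption about the general technique rather than Oodle's metadata layout, is to keep tiny per-object summaries and prune objects that cannot match a query:

```python
# Index-free pruning sketch (an assumed general technique, not Oodle's
# metadata layout): tiny per-object summaries replace a global index.
from dataclasses import dataclass

@dataclass
class ObjectMeta:
    key: str                            # object-storage key
    min_ts: int                         # time range this object covers
    max_ts: int
    label_values: dict[str, set[str]]   # e.g. {"service": {"checkout"}}

def candidates(objects: list[ObjectMeta], start: int, end: int,
               matchers: dict[str, str]) -> list[str]:
    picked = []
    for m in objects:
        if m.max_ts < start or m.min_ts > end:
            continue  # outside the query's time range
        if all(v in m.label_values.get(k, set())
               for k, v in matchers.items()):
            picked.append(m.key)  # only these objects are fetched from S3
    return picked
```

Because the summaries are per-object rather than global, high-cardinality labels add a few bytes of metadata per object instead of blowing up a RAM-resident index.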


New Capabilities Unlocked

High Resolution Data

Full fidelity across every signal. With 600x compression on S3, there is no architectural reason to throw data away. High resolution isn’t a premium feature — it’s the baseline.

  • No metrics downsampling. A 30-second CPU spike that caused a cascade of timeouts won't disappear into a 5-minute average that looks perfectly normal (the quick arithmetic after this list shows how completely averaging hides it).
  • No trace sampling. Every trace is captured. An agent investigating an incident can follow the exact request that failed — not a statistical sample of requests that didn’t.
  • Full log retention. No dropping logs to control costs. The context you need for root cause analysis is always there when you need it.
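
The downsampling point is easy to verify with arithmetic: thirty seconds at full CPU saturation inside an otherwise quiet five-minute window averages out to unremarkable load.

```python
# 30 seconds at 100% CPU inside an otherwise quiet 5-minute window,
# sampled once per second:
samples = [100.0] * 30 + [5.0] * 270  # 300 one-second samples
print(sum(samples) / len(samples))    # 14.5 -- saturation reads as mild load
```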

Humans couldn’t consume all this data. Agents can.

AI Agents Can Query Freely

Dashboards load instantly. Range queries over weeks of high-cardinality data come back in milliseconds. An AI agent investigating an incident can fire hundreds of exploratory queries without crashing the system. Each query gets its own isolated compute, so a heavy query never impacts anything else running at the same time. Traditional systems force you to rate-limit agents or risk taking down dashboards for everyone. Elastic compute makes that trade-off disappear.

Ingest and Query Are Fully Isolated

A traffic spike that doubles your ingest volume has zero impact on query performance. A heavy dashboard refresh doesn’t slow down data ingestion. In traditional systems, ingest and query compete for the same CPU, memory, and disk I/O — so a surge in one degrades the other. With fully separated paths, each scales independently without interference.

Cost Efficient by Design

No idle compute running 24/7. No 3x SSD replication for durability — S3 handles that natively. Object storage pricing for your data. You pay for the queries you actually run, not for capacity sitting idle. No more choosing between sampling traces or blowing your budget — the architecture is cheap enough that you never have to trade visibility for cost.

Long Retention

Keep months or years of full-resolution data without breaking the bank. AI agents investigating an incident can compare current behavior against historical baselines from weeks or months ago — catching slow-burn regressions and seasonal patterns that short retention windows would miss entirely.

Zero Ops

No clusters to manage. No capacity planning. No rebalancing. No disks to monitor. No 3 a.m. pages about the observability system itself. Infrastructure that manages itself so your team works on the product, not the plumbing.

Open Standards. No Lock-In.

Proprietary query languages and closed formats create switching costs by design. Open standards remove them.

Standard Query Protocols

PromQL for metrics. LogQL for logs. No proprietary query language to learn, no vendor-specific syntax to migrate away from.
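
In practice that means any stock Prometheus API client works unchanged. A minimal example against the standard /api/v1/query_range endpoint (the URL is a placeholder):

```python
# Any Prometheus API client works; the URL below is a placeholder.
import requests

resp = requests.get(
    "https://your-endpoint.example.com/api/v1/query_range",
    params={
        "query": 'sum(rate(http_requests_total{code=~"5.."}[5m])) by (service)',
        "start": "2024-01-01T00:00:00Z",
        "end": "2024-01-01T06:00:00Z",
        "step": "30s",
    },
)
print(resp.json()["data"]["result"])
```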

Standard Ingest Protocols

OpenTelemetry (OTLP), Prometheus remote write, and common log formats. Your existing instrumentation just works.
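
For example, the unmodified OpenTelemetry Python SDK can export metrics over OTLP with nothing changed but the endpoint (shown here as a placeholder):

```python
# Unmodified OpenTelemetry Python SDK; only the endpoint (a placeholder
# here) points at the new backend.
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter

exporter = OTLPMetricExporter(endpoint="https://your-endpoint.example.com/v1/metrics")
metrics.set_meter_provider(
    MeterProvider(metric_readers=[PeriodicExportingMetricReader(exporter)])
)

orders = metrics.get_meter("checkout").create_counter("orders_total")
orders.add(1, {"region": "us-east-1"})  # exported over OTLP on the next flush
```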

Your Dashboards Work

Grafana dashboards, existing alerts, and recording rules carry over without changes. No rewrite required.

Easy Way In, Easy Way Out

No proprietary agent format. No data held hostage. If you ever want to leave, your data and queries are already in standard formats.

Own Your Data

Flexible deployment models to match your security, compliance, and data residency requirements. Your observability data is yours.

SaaS (Easiest)

Fully managed by Oodle. Zero infrastructure to run. Zero ops. Start ingesting in minutes. We handle everything — storage, compute, upgrades, scaling.

Best for teams that want to focus entirely on their product.

Bring Your Own Bucket (BYO-B)

Oodle runs as a managed service, but all your observability data is stored in your own S3 bucket. You always have full access to your raw data — even if you stop using Oodle.

Best for teams that need data ownership with zero ops.

Bring Your Own Cloud (BYO-C)

Oodle runs entirely within your AWS account. Your data never leaves your VPC. Full control over networking, encryption, and access policies. Meets the strictest compliance requirements.

Best for regulated industries and strict data residency needs.

From Dashboards to Conversations

The interface for observability is changing. The architecture has to change with it.

Cursor proved that a sidebar conversation can replace complex IDE workflows. The same shift is happening in observability. Instead of building dashboards and writing queries, you describe what you're looking for and the system investigates.

Oodle's AI assistant works as a Cursor-like sidebar, inside Slack, and as an embedded experience in your existing tools. Ask it about an alert. Ask follow-up questions. It navigates across your metrics, logs, and traces to surface what matters.

This only works if the backend can handle the query patterns that conversational debugging produces — bursty, exploratory, high-volume, often touching data that hasn't been queried recently. Dashboard-era architectures were never built for this. Elastic compute was.

Complete Observability at 1/5th the Cost

Go live in 15 minutes. No clusters to manage. No vendor lock-in.

  • 5x lower cost
  • < 3s p99 query latency
  • 15 minutes to go live
  • 0 ops overhead