How to use AI for Databricks optimization without exposing customer data

Why metadata and telemetry are enough for Databricks FinOps and reliability agents, and how to keep AI reasoning inside enterprise boundaries.

The Omnitrace team - 8 min read

The security question around AI agents is fair: what data does the agent see, where does it send that data, and what can it change?

For Databricks operations, the answer should not require access to customer table contents, files, business records, or query result sets. Most FinOps and reliability work can be done from operational metadata and telemetry: configuration, billing usage, query history where available, job behavior, warehouse settings, cluster changes, ownership, workflow state, and verification signals.

Optimization is usually a metadata problem

A cost agent does not need to read customer rows to know that a warehouse lacks auto-stop, a cluster runs idle, a job retries repeatedly, or a table has too many small files. A reliability agent does not need business records to detect executor OOM, spill-heavy stages, skew symptoms, runtime drift, or repeated query failures.

The important signals are operational:

  • Resource configuration
  • Usage and billing metadata
  • Query and job history
  • Compute lifecycle events
  • Performance symptoms
  • Tags, owners, service principals, and workflow context
  • Post-action verification state

That is enough to reason about waste, reliability drift, owner routing, and safe remediation boundaries.
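To make that concrete, here is a minimal sketch of two of the checks described above (a warehouse without auto-stop, a cluster running idle) written against configuration and lifecycle metadata only. The record shapes and field names (`auto_stop_mins`, `state`, `idle_minutes`) and the 60-minute threshold are assumptions chosen for illustration, not a documented schema; `idle_minutes` in particular is assumed to be derived upstream from compute lifecycle events.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    resource_id: str
    issue: str


def find_waste(warehouses: list[dict], clusters: list[dict],
               idle_threshold_minutes: int = 60) -> list[Finding]:
    """Flag waste from configuration and lifecycle metadata; no table rows are read."""
    findings = []
    for wh in warehouses:
        # Assumption: 0 or a missing value means auto-stop is disabled.
        if not wh.get("auto_stop_mins"):
            findings.append(Finding(wh["id"], "auto-stop disabled"))
    for cl in clusters:
        # Assumption: idle_minutes is computed upstream from lifecycle events.
        is_running = cl.get("state") == "RUNNING"
        if is_running and cl.get("idle_minutes", 0) >= idle_threshold_minutes:
            findings.append(Finding(cl["id"], "running idle"))
    return findings


# Hypothetical metadata records, for illustration only.
warehouses = [
    {"id": "wh-1", "auto_stop_mins": 0},
    {"id": "wh-2", "auto_stop_mins": 10},
]
clusters = [
    {"id": "cl-1", "state": "RUNNING", "idle_minutes": 240},
    {"id": "cl-2", "state": "RUNNING", "idle_minutes": 5},
]

for finding in find_waste(warehouses, clusters):
    print(f"{finding.resource_id}: {finding.issue}")
```

Nothing in the inputs contains customer rows, files, or query results; every decision is made from settings and usage state.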

The model boundary matters

Even when the agent only uses metadata, enterprises still need a clear model-hosting boundary. Some teams are comfortable with managed AI services. Others require model endpoints hosted by their cloud provider so prompt context and operational metadata stay inside a selected cloud boundary.

That distinction should be a first-class product decision, not an afterthought. Teams should know which model endpoint is used, what metadata is sent, what is excluded, how credentials are scoped, and how actions are approved.
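One way to make those answers explicit is to declare them as data and check them before any prompt leaves the environment. The sketch below is illustrative only: `BoundaryPolicy`, its fields, the host name, and the scope strings are all hypothetical and not part of any product API.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class BoundaryPolicy:
    model_endpoint: str                # which model endpoint is used
    allowed_hosts: frozenset[str]      # hosts inside the selected cloud boundary
    sent_fields: frozenset[str]        # what metadata is sent
    excluded_fields: frozenset[str]    # what is excluded
    credential_scopes: frozenset[str]  # how credentials are scoped
    approval_mode: str                 # how actions are approved

    def violations(self) -> list[str]:
        """Return human-readable problems; an empty list means the policy is consistent."""
        problems = []
        host = urlparse(self.model_endpoint).hostname
        if host not in self.allowed_hosts:
            problems.append(f"endpoint host {host!r} is outside the approved boundary")
        overlap = self.sent_fields & self.excluded_fields
        if overlap:
            problems.append(f"fields both sent and excluded: {sorted(overlap)}")
        return problems


# Hypothetical values, for illustration only.
policy = BoundaryPolicy(
    model_endpoint="https://models.internal.example/v1/chat",
    allowed_hosts=frozenset({"models.internal.example"}),
    sent_fields=frozenset({"job_id", "run_state", "retry_count"}),
    excluded_fields=frozenset({"sql_text", "notebook_path"}),
    credential_scopes=frozenset({"metadata:read"}),
    approval_mode="approval_required",
)

print(policy.violations())
```

Keeping the policy as a reviewable record means security teams can inspect the boundary without reading agent code.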

Databricks system table data deserves care

Databricks notes that system tables can expose sensitive operational information if mishandled. That is why even a metadata-only design should be treated as enterprise architecture. Operational metadata can reveal names, usage patterns, identities, and infrastructure behavior. It should be governed, scoped, and protected.

The right question is not only, "Does the agent avoid customer data?" It is also, "Does the agent minimize metadata exposure and keep the reasoning path inside the right boundary?"
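Minimizing metadata exposure can be as simple as an allowlist applied before any record reaches the model. The following sketch assumes a flat dictionary per record; the field names, the choice to pseudonymize `owner`, and the key handling are illustrative assumptions, not a prescribed design.

```python
import hashlib
import hmac

# Assumption: only these fields are needed for reasoning; everything else is dropped.
ALLOWED_FIELDS = {"job_id", "run_state", "retry_count", "duration_seconds", "owner"}
# Assumption: identities are replaced with stable pseudonyms before leaving the boundary.
PSEUDONYMIZED_FIELDS = {"owner"}


def pseudonymize(value: str, key: bytes) -> str:
    """Keyed hash so the same identity maps to the same token without revealing it."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:12]


def minimize(record: dict, key: bytes) -> dict:
    """Keep allowlisted fields only and pseudonymize identity fields."""
    out = {}
    for name, value in record.items():
        if name not in ALLOWED_FIELDS:
            continue  # default-deny: anything not explicitly allowed is excluded
        if name in PSEUDONYMIZED_FIELDS:
            value = pseudonymize(str(value), key)
        out[name] = value
    return out


# Hypothetical record and key, for illustration only.
raw = {
    "job_id": "job-42",
    "run_state": "FAILED",
    "retry_count": 3,
    "owner": "alice@example.com",
    "sql_text": "SELECT ...",
    "notebook_path": "/Users/alice/etl",
}
safe = minimize(raw, key=b"demo-key-not-for-production")
print(sorted(safe))
```

A default-deny allowlist fails safe: a new field added upstream stays excluded until someone decides it belongs in the prompt.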

Autonomy should be layered

AI optimization also needs action controls. Reading metadata is different from changing production configuration. A mature system separates autonomy levels:

  • Manual: the agent explains the finding and recommends the action.
  • Approval required: the agent prepares a governed action for review.
  • Low-risk automatic: the agent applies bounded fixes inside policy.

Every action should record the strategy, policy, tool response, and verification result.
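The three levels and the audit record can be sketched in a few lines. This is an illustration under stated assumptions, not a description of any product's internals: the names `Autonomy`, `ActionRecord`, and `next_step` are hypothetical, and two behaviors are choices made here rather than taken from the list above (applying an action once a reviewer approves it, and sending out-of-policy fixes back to review).

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Autonomy(Enum):
    MANUAL = "manual"
    APPROVAL_REQUIRED = "approval_required"
    LOW_RISK_AUTOMATIC = "low_risk_automatic"


@dataclass
class ActionRecord:
    """The four items every action should record, plus a UTC timestamp."""
    strategy: str
    policy: str
    tool_response: str
    verification_result: str
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def next_step(level: Autonomy, within_policy: bool, approved: bool = False) -> str:
    """Decide what the agent may do with a finding at a given autonomy level."""
    if level is Autonomy.MANUAL:
        return "recommend"
    if level is Autonomy.APPROVAL_REQUIRED:
        # Assumption: an approved action is applied; otherwise it waits for review.
        return "apply" if approved else "await_approval"
    # Low-risk automatic: bounded fixes inside policy are applied directly.
    # Assumption: anything outside policy falls back to human review.
    return "apply" if within_policy else "await_approval"


# Hypothetical usage, for illustration only.
step = next_step(Autonomy.LOW_RISK_AUTOMATIC, within_policy=True)
record = ActionRecord(
    strategy="enable auto-stop on idle warehouse",
    policy="low-risk-config-changes",
    tool_response="200 OK",
    verification_result="auto-stop confirmed after change",
)
print(step, record.recorded_at)
```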

How Omnitrace frames the boundary

Omnitrace is designed around metadata-only operation. It works from Databricks and cloud operational metadata, not customer table rows, business records, files, query result sets, or application payloads. For enterprise deployments, AI reasoning can use model endpoints hosted by the customer's cloud provider.

That architecture supports the real job: detect waste, explain impact, route ownership, apply governed fixes, and verify outcomes without expanding the customer data boundary.

The best AI agents for lakehouse operations will not be the ones that see the most data. They will be the ones that use the right metadata, inside the right boundary, with the right guardrails.