50+ Databricks waste and reliability detectors platform teams should track

A practical map of the detector categories Omnitrace uses to find Databricks cost waste, Spark reliability drift, and remediation opportunities.

The Omnitrace team - June 2026 - 8 min read

Databricks optimization is usually discussed as a handful of obvious fixes: terminate idle clusters, enable auto-stop, tune a warehouse, and maybe run table maintenance. Those are useful, but they are not enough for a production lakehouse.

The real opportunity is broader. Waste and reliability drift show up across clusters, SQL warehouses, Spark jobs, Delta tables, storage patterns, tagging, ownership, and cloud infrastructure. A detector library needs enough coverage to find the small leaks that add up.

Cluster and compute waste

This is the first category most teams understand because it is easy to see and easy to fix. A detector agent should track idle clusters, missing auto-termination, oversized workers, expensive instance families, policy drift, and clusters running outside approved modes.

These findings are strong candidates for low-risk automation because the target state is concrete and verification is straightforward.
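As a rough illustration of why the target state is concrete, here is a minimal sketch of two cluster detectors. It assumes cluster records are plain dictionaries; `autotermination_minutes` follows the Databricks Clusters API naming (0 means disabled), while `idle_minutes`, the 60-minute threshold, and the `ClusterFinding` type are hypothetical and not a description of how Omnitrace works.

```python
from dataclasses import dataclass


@dataclass
class ClusterFinding:
    cluster_id: str
    detector: str
    detail: str


def detect_cluster_waste(clusters, max_idle_minutes=60):
    """Flag clusters with no auto-termination or long idle time.

    `idle_minutes` is a hypothetical, pre-computed field; a real system
    would have to derive it from activity or event data.
    """
    findings = []
    for c in clusters:
        cid = c["cluster_id"]
        # A value of 0 (or a missing value) is treated as "disabled".
        if c.get("autotermination_minutes", 0) == 0:
            findings.append(ClusterFinding(
                cid, "missing_auto_termination",
                "auto-termination is disabled"))
        if c.get("state") == "RUNNING" and c.get("idle_minutes", 0) >= max_idle_minutes:
            findings.append(ClusterFinding(
                cid, "idle_cluster",
                f"idle for {c['idle_minutes']} min"))
    return findings


# Illustrative input only, not real workspace data.
clusters = [
    {"cluster_id": "c-1", "state": "RUNNING",
     "autotermination_minutes": 0, "idle_minutes": 240},
    {"cluster_id": "c-2", "state": "RUNNING",
     "autotermination_minutes": 30, "idle_minutes": 5},
]
cluster_findings = detect_cluster_waste(clusters)
```

Verification in this sketch is just re-running the same check after the change and confirming that the finding no longer appears.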

SQL warehouse spend

SQL warehouses deserve their own detector group. A warehouse can be correctly sized for one workload and wasteful for another. Useful signals include missing auto-stop, long-running idle windows, bursty concurrency, excessive query retries, slow query patterns, and Photon eligibility.

The best detector is not just "this warehouse is expensive." It is "this warehouse is expensive because this pattern repeated, here is the owner, and here is the specific setting to review."
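A sketch of what that richer finding might look like, assuming idle windows have already been measured in minutes. The `WarehouseFinding` type, the thresholds, and the sample warehouse are hypothetical; `auto_stop_mins` is the setting name used by the Databricks SQL Warehouses API.

```python
from dataclasses import dataclass


@dataclass
class WarehouseFinding:
    warehouse: str
    owner: str
    pattern: str
    setting_to_review: str


def detect_repeated_idle(warehouse, idle_windows_min, min_window=30, min_repeats=3):
    """Return a finding only when long idle windows repeat.

    A single idle window is treated as noise; the finding carries the
    repeated pattern, the owner, and the setting to review.
    """
    long_windows = [w for w in idle_windows_min if w >= min_window]
    if len(long_windows) < min_repeats:
        return None
    return WarehouseFinding(
        warehouse=warehouse["name"],
        owner=warehouse.get("owner", "unassigned"),
        pattern=f"{len(long_windows)} idle windows of {min_window}+ min",
        setting_to_review="auto_stop_mins",
    )


# Illustrative input only.
warehouse = {"name": "bi-reporting", "owner": "analytics-platform", "auto_stop_mins": 0}
warehouse_finding = detect_repeated_idle(warehouse, [45, 10, 90, 120])
```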

Spark reliability drift

Reliability detectors should catch the symptoms that usually become tickets later: executor OOM, GC thrash, spill-heavy stages, skew, shuffle partition drift, failed retries, and job runtime regression.

These are especially valuable when paired with cost context. A slow job is annoying. A slow job that wastes $4,000 per year and fails every Monday is a prioritized platform item.
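To make the pairing of reliability and cost concrete, here is a small sketch that flags a runtime regression and estimates its annual cost. The median comparison, the 1.25x threshold, the run count, and the per-minute rate are all assumptions chosen for illustration, not a documented Omnitrace formula.

```python
from statistics import median


def runtime_regression(baseline_min, recent_min, threshold=1.25):
    """Return the slowdown ratio if recent runs are meaningfully slower."""
    ratio = median(recent_min) / median(baseline_min)
    return ratio if ratio >= threshold else None


def annual_waste_usd(baseline_min, recent_min, runs_per_year, usd_per_minute):
    """Estimate the yearly cost of the extra runtime (never negative)."""
    extra_minutes = median(recent_min) - median(baseline_min)
    return max(extra_minutes, 0) * runs_per_year * usd_per_minute


# Illustrative numbers: a job that drifted from about 41 to about 61 minutes.
baseline = [40, 42, 41]
recent = [60, 62, 61]

slowdown = runtime_regression(baseline, recent)
waste = annual_waste_usd(baseline, recent, runs_per_year=400, usd_per_minute=0.5)
```

With these inputs the job is roughly 1.5x slower and the extra 20 minutes per run comes to $4,000 per year, which is the kind of figure that turns a slow job into a ranked item.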

Delta table hygiene

Table health quietly affects both cost and performance. Detector coverage should include small-file growth, stale optimization, missing or poor partitioning, high scan amplification, abandoned table candidates, and storage growth that no longer maps to active workloads.

Table remediation often needs more careful guardrails than cluster cleanup, but the agent can still create the evidence package: table history, file counts, workload impact, recommended maintenance, and verification criteria.
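A minimal sketch of a small-file check that produces part of such an evidence package. It assumes file sizes have already been collected; the 32 MB threshold, the 50% ratio, and the dictionary layout are hypothetical, and the recommendation is phrased as something to review rather than apply automatically.

```python
MB = 1024 * 1024


def small_file_evidence(table_name, file_sizes_bytes,
                        small_threshold=32 * MB, max_small_ratio=0.5):
    """Summarize small-file growth for one table as an evidence record."""
    total = len(file_sizes_bytes)
    small = sum(1 for size in file_sizes_bytes if size < small_threshold)
    ratio = small / total if total else 0.0
    flagged = ratio > max_small_ratio
    return {
        "table": table_name,
        "file_count": total,
        "small_file_count": small,
        "small_file_ratio": ratio,
        "flagged": flagged,
        # OPTIMIZE is the Delta compaction command; whether to run it is
        # left to a human or a guarded workflow.
        "recommended_maintenance": "review compaction (e.g. OPTIMIZE)" if flagged else None,
        "verification": "small_file_ratio at or below threshold after maintenance",
    }


# Illustrative input only.
evidence = small_file_evidence("sales.orders", [1 * MB, 2 * MB, 4 * MB, 256 * MB])
```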

Ownership and governance gaps

A finding without an owner becomes shelfware. Production detector systems should look for untagged compute, missing workspace ownership, orphaned jobs, teams without cost attribution, and policy exceptions that no longer have an active reason.

This is where Jira workflow matters. A detector can create operational value even when it does not auto-apply a technical fix, because it routes the evidence to the right team.
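The routing idea can be sketched without any ticketing integration. The example below assumes a hypothetical tag policy (`owner`, `cost_center`) and a hypothetical fallback queue named `platform-triage`; creating the actual Jira issue is deliberately left out.

```python
REQUIRED_TAGS = ("owner", "cost_center")  # hypothetical tag policy


def find_ownership_gaps(resources, fallback_queue="platform-triage"):
    """List resources missing required tags and decide where to route them."""
    gaps = []
    for resource in resources:
        tags = resource.get("tags", {})
        missing = [tag for tag in REQUIRED_TAGS if not tags.get(tag)]
        if missing:
            gaps.append({
                "resource": resource["name"],
                "missing_tags": missing,
                # Route to the known owner if there is one, else to triage.
                "route_to": tags.get("owner") or fallback_queue,
            })
    return gaps


# Illustrative input only.
resources = [
    {"name": "etl-cluster", "tags": {"owner": "data-eng", "cost_center": "cc-101"}},
    {"name": "adhoc-cluster", "tags": {"owner": "ml-team"}},
    {"name": "legacy-job", "tags": {}},
]
ownership_gaps = find_ownership_gaps(resources)
```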

Why 50+ detectors matter

The number matters less than the coverage model. A lakehouse is a connected system, and the best savings often come from correlations across layers: a table pattern that slows a job, a job pattern that inflates a cluster, a cluster pattern that creates avoidable cloud spend.
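One way to picture that correlation is as a chain of findings linked by the resource each one affects. The sketch below assumes every finding records a hypothetical `affects` field pointing at the next layer; how a real system would discover those links is outside its scope.

```python
def correlation_chain(findings, start):
    """Follow `affects` links from one finding across layers."""
    by_resource = {f["resource"]: f for f in findings}
    chain, current, seen = [], start, set()
    # Stop when the link leaves the known findings or loops back.
    while current in by_resource and current not in seen:
        seen.add(current)
        finding = by_resource[current]
        chain.append(f"{finding['layer']}:{finding['resource']}")
        current = finding.get("affects")
    return chain


# Illustrative input only.
layer_findings = [
    {"layer": "table", "resource": "sales.orders", "affects": "nightly_orders_etl"},
    {"layer": "job", "resource": "nightly_orders_etl", "affects": "etl-cluster"},
    {"layer": "cluster", "resource": "etl-cluster", "affects": None},
]
chain = correlation_chain(layer_findings, "sales.orders")
```

Here the chain runs from the table to the job to the cluster, so fixing the table pattern is the change most likely to remove the downstream findings as well.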

Omnitrace currently covers 50+ detector types and 19 auto-fix paths because the goal is not a prettier dashboard. The goal is a steady stream of ranked, dollar-aware, action-ready recommendations that can move through guardrails and verification.