Explain failed jobs faster
The agent correlates recent failures, Spark symptoms, table growth, and cluster settings so teams see the cause instead of another raw stack trace.
Spark reliability
Omnitrace watches Spark jobs, SQL workloads, clusters, and Delta table hygiene so reliability drift is caught before it becomes a week of firefighting.
Signals tracked
Each finding carries estimated savings or reliability impact, owner context, and recommended next action.
Approved changes are applied through tool calls, read back, and verified so the final state is visible in the audit trail.
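As a rough illustration of the apply, read back, and verify loop, here is a minimal Python sketch. It is not Omnitrace's implementation: `AuditTrail`, `apply_and_verify`, and the `write`/`read` callables are hypothetical names, and an in-memory dict stands in for real cluster settings.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class AuditTrail:
    """Hypothetical append-only log of what was attempted and observed."""
    entries: List[dict] = field(default_factory=list)

    def record(self, step: str, **details) -> None:
        self.entries.append({"step": step, **details})


def apply_and_verify(
    key: str,
    desired: str,
    write: Callable[[str, str], None],
    read: Callable[[str], Optional[str]],
    audit: AuditTrail,
) -> bool:
    """Apply one setting, read it back, and log whether the final state matches."""
    audit.record("apply", key=key, desired=desired)
    write(key, desired)

    observed = read(key)  # read back instead of trusting the write call
    audit.record("read_back", key=key, observed=observed)

    ok = observed == desired
    audit.record("verify", key=key, ok=ok)
    return ok


# Illustrative usage: a plain dict plays the role of the cluster configuration.
settings = {"spark.sql.adaptive.enabled": "false"}
audit = AuditTrail()
ok = apply_and_verify(
    "spark.sql.adaptive.enabled", "true",
    settings.__setitem__, settings.get, audit,
)
```

Reading the value back separately from the write means a silently ignored change shows up in the audit trail as a failed verification rather than as an assumed success.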
What changes
Spark issues rarely appear out of nowhere. Tables fragment, partitions skew, clusters drift, and failed jobs repeat. Omnitrace turns those weak signals into a ranked queue of fixes with owner-ready evidence.
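One way to picture a ranked queue of fixes is the sketch below. The `Finding` fields and the scoring rule (impact multiplied by how often the signal recurred) are assumptions made for illustration only, not Omnitrace's actual ranking model.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Finding:
    """Hypothetical shape of one finding; all field names are illustrative."""
    title: str
    owner: str
    impact: float      # assumed 0..1 estimate of savings or reliability impact
    recurrences: int   # how many times the weak signal repeated
    next_action: str


def priority(finding: Finding) -> float:
    # Assumed scoring rule: repeated, high-impact signals float to the top.
    return finding.impact * finding.recurrences


def rank_findings(findings: List[Finding]) -> List[Finding]:
    """Return findings ordered from most to least urgent."""
    return sorted(findings, key=priority, reverse=True)


# Illustrative data only; these are not real findings.
queue = rank_findings([
    Finding("Cluster config drift", "platform", 0.3, 4, "re-pin cluster policy"),
    Finding("Small-file buildup", "data-eng", 0.6, 5, "compact table"),
    Finding("Partition skew on join key", "billing", 0.9, 2, "enable skew handling"),
])
```

Keeping the score in one small function makes it easy to swap in a different weighting without touching the queue logic.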
Example investigation
> why did nightly_billing_etl slow down?
The job has a 42% longer median runtime this week. The largest stage is spill-heavy, skew increased on customer_id, and the source Delta table crossed the small-file threshold after the last ingest change.
Recommended: compact source table, enable AQE, tune shuffle partitions, and verify next run.
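The recommendation above could translate into something like the following sketch. The table name `billing.raw_events` and the partition count of 400 are placeholders, since the source table and tuned value are not given here; `build_remediation_plan` and `runtime_change_pct` are hypothetical helpers. `OPTIMIZE` is the Delta Lake compaction command, and the configuration keys are standard Spark SQL settings.

```python
from statistics import median
from typing import Dict, List, Sequence


def build_remediation_plan(table: str, shuffle_partitions: int) -> Dict[str, object]:
    """Assemble the SQL and Spark settings for the recommended fix (sketch only)."""
    if shuffle_partitions <= 0:
        raise ValueError("shuffle_partitions must be positive")
    sql: List[str] = [f"OPTIMIZE {table}"]  # compact small files in the Delta table
    spark_conf = {
        "spark.sql.adaptive.enabled": "true",           # enable AQE
        "spark.sql.adaptive.skewJoin.enabled": "true",  # let AQE split skewed partitions
        "spark.sql.shuffle.partitions": str(shuffle_partitions),
    }
    return {"sql": sql, "spark_conf": spark_conf}


def runtime_change_pct(baseline: Sequence[float], current: Sequence[float]) -> float:
    """Percent change in median runtime, used to check the next run."""
    base = median(baseline)
    return (median(current) - base) * 100 / base


# Placeholder table name and partition count; both are assumptions.
plan = build_remediation_plan("billing.raw_events", 400)

# Illustrative runtimes in minutes, chosen to reproduce a 42% regression.
regression = runtime_change_pct([50, 50, 50], [71, 71, 71])
```

With a live session, each statement in `plan["sql"]` could be passed to `spark.sql(...)` and each setting to `spark.conf.set(key, value)`. After the next run, calling `runtime_change_pct` again against the same baseline shows whether the regression has closed.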
We will walk through how Omnitrace explains the slowdown, ranks the fix, and verifies the result against Databricks telemetry.