Omnitrace

Spark reliability

Find the Spark problems hiding behind Databricks incidents.

Omnitrace watches Spark jobs, SQL workloads, clusters, and Delta table hygiene so reliability drift is caught before it becomes a week of firefighting.

Signals tracked

Executor OOM
GC thrash
Spill-heavy stages
Data skew
Shuffle partition drift
Small-file amplification
Slow SQL patterns
Cluster right-sizing gaps

Explain failed jobs faster

The agent correlates recent failures, Spark symptoms, table growth, and cluster settings so teams see the likely cause instead of another raw stack trace.

Prioritize by operational impact

Each finding carries estimated savings or reliability impact, owner context, and recommended next action.

Fix with evidence

Approved changes are applied through tool calls, read back, and verified so the final state is visible in the audit trail.

What changes

Reliability becomes continuous maintenance, not an incident ritual.

Spark issues rarely appear out of nowhere. Tables fragment, partitions skew, clusters drift, and failed jobs repeat. Omnitrace turns those weak signals into a ranked queue of fixes with owner-ready evidence.

Example investigation

> why did nightly_billing_etl slow down?

The job has a 42% longer median runtime this week. The largest stage is spill-heavy, skew increased on customer_id, and the source Delta table crossed the small-file threshold after the last ingest change.
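To make the two headline numbers in that answer concrete, here is a minimal sketch of how a week-over-week median runtime change and a partition skew ratio could be computed. It is an illustration only, not Omnitrace's implementation: the function names, the runtimes, and the partition sizes are all hypothetical.

```python
from statistics import median


def pct_change_in_median(baseline, current):
    """Percent change of median(current) relative to median(baseline)."""
    base = median(baseline)
    return (median(current) - base) * 100.0 / base


def skew_ratio(partition_sizes):
    """Largest partition divided by the median partition; near 1.0 means balanced."""
    return max(partition_sizes) / median(partition_sizes)


# Hypothetical job runtimes in minutes (not real telemetry).
last_week = [48, 50, 52, 49, 51, 50, 50]
this_week = [70, 71, 72, 69, 73, 71, 71]

# Hypothetical shuffle partition sizes in MB, with one hot customer_id.
partition_mb = [100, 110, 95, 105, 900]

runtime_change = pct_change_in_median(last_week, this_week)
skew = skew_ratio(partition_mb)
```

With these made-up inputs the median moves from 50 to 71 minutes, a 42% increase, and the largest partition is more than eight times the median one, which is the kind of gap that tends to produce a single spill-heavy task.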

Recommended: compact source table, enable AQE, tune shuffle partitions, and verify next run.
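As a rough sketch of what those recommendations could translate to on a Databricks cluster, the snippet below builds a Delta `OPTIMIZE` statement and a set of Spark SQL settings, then applies them to a session. The table name and partition count are hypothetical, the helper functions are invented for illustration, and this does not describe how Omnitrace itself applies changes. The configuration keys are standard Spark SQL settings.

```python
def remediation_plan(source_table, shuffle_partitions):
    """Return (sql_statements, spark_conf) for the recommended fixes.

    source_table and shuffle_partitions are caller-supplied; the values
    used in the example below are hypothetical.
    """
    if shuffle_partitions <= 0:
        raise ValueError("shuffle_partitions must be positive")
    # Delta Lake compaction to merge small files in the source table.
    statements = [f"OPTIMIZE {source_table}"]
    conf = {
        # Adaptive Query Execution, including its skew-join handling.
        "spark.sql.adaptive.enabled": "true",
        "spark.sql.adaptive.skewJoin.enabled": "true",
        # Starting shuffle partition count; AQE may coalesce at runtime.
        "spark.sql.shuffle.partitions": str(shuffle_partitions),
    }
    return statements, conf


def apply_plan(spark, statements, conf):
    """Apply the plan to an existing SparkSession (assumed to support Delta)."""
    for key, value in conf.items():
        spark.conf.set(key, value)
    for statement in statements:
        spark.sql(statement)


# Hypothetical table name and partition count, for illustration only.
statements, conf = remediation_plan("billing.raw_events", 400)
```

Keeping the plan as plain data before applying it means it can be reviewed and logged first, and the settings can be read back afterwards with `spark.conf.get` to confirm they took effect.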

Bring us one slow Spark job.

We will walk through how Omnitrace explains the slowdown, ranks the fix, and verifies the result against Databricks telemetry.