Omnitrace

Blog

Notes from the agent loop.

Engineering posts, FinOps deep dives, and lessons from running autonomous remediation in production.

7 min read

Databricks cost optimization is not a dashboard problem

Why Databricks FinOps teams need ownership, workflow, and verified action after cost visibility is in place.

8 min read

What Databricks system tables tell you about cost, and what they still do not

System tables are the foundation for Databricks observability, but cost optimization still needs context, ownership, and action.

7 min read

How to turn Databricks billing usage into verified savings

A practical workflow for moving from Databricks usage evidence to approved fixes and read-back verification.

7 min read

Databricks query history is the missing link between performance and cost

How query history helps connect slow SQL workloads, warehouse spend, ownership, and optimization opportunities.

8 min read

How to use AI for Databricks optimization without exposing customer data

Why metadata and telemetry are enough for Databricks FinOps and reliability agents, and how to keep AI reasoning inside enterprise boundaries.

8 min read

50+ Databricks waste and reliability detectors platform teams should track

A practical map of detector categories across Databricks cost waste, Spark reliability drift, Delta table hygiene, ownership, and remediation opportunities.

7 min read

From alerts to autonomous remediation for the lakehouse

A practical playbook for moving Databricks operations from dashboard alerts to governed, verified action.

7 min read

An agent loop in production: lessons from the first hundred actions

Designing for verification, blast-radius caps, and human-friendly audit trails. What we got wrong, what we'd ship differently, and what we're still figuring out.

6 min read

What system tables don't tell you about Databricks cost

DBU billing is one piece of the puzzle. Real cost lives in EC2, EBS, NAT gateways, and tag chaos. Here's how we reconcile across three tiers.

5 min read

Why we built an autonomous agent for the lakehouse

Observability told us about problems for a decade. The next decade is about closing the loop. Here's how we think about it.
