OpenLineage and Data Discovery
Scenario: The company has 4,000 tables across three warehouses and two lakes. A new analyst takes a week to find the “right” customers table because there are nine of them with similar names. A data engineer wastes an afternoon trying to figure out who owns a table before they refactor it. The CTO has heard “data catalog” and “data discovery” and wants to know what to actually buy or build.
In the interview, the question is:
What is a data catalog, what does OpenLineage have to do with it, and how would you stand one up for a 4,000-table organisation?
Your Task:
- Define the data catalog and the three things it must do (search, lineage, ownership).
- Explain OpenLineage as the open standard for emitting lineage events.
- Compare the main options (DataHub, OpenMetadata, Atlan, Collibra, homegrown).
- Walk through the rollout: pilot, expand, govern.
- Cover the failure mode where the catalog rots.
What a Good Answer Covers:
- Search-by-name and search-by-meaning (semantic search over columns and tags).
- Lineage at table and column level.
- Owner and freshness metadata as first-class.
- The open-source vs vendor decision in 2026.
- Why the catalog dies without a “register or it does not exist” policy.
Try the problem on your own first. Solutions are most valuable after you've struggled with it.
Solution 90: OpenLineage and Data Discovery
Short version you can say out loud
A data catalog is the searchable directory of every data asset in the organisation: tables, columns, dashboards, pipelines, and the relationships between them. It does three things: lets people find the right asset (search), lets them understand how it was built (lineage), and tells them who owns it (ownership). OpenLineage is the open standard for emitting lineage events from pipelines; the catalog consumes those events to draw the dependency graph. The main products in 2026 are DataHub (open-source, ex-LinkedIn), OpenMetadata (open-source, fast-moving), Atlan (vendor, polished UX), and Collibra (vendor, enterprise governance). The choice depends on team size, governance needs, and how much you want to operate yourself. Picking a product is the easy part; keeping the catalog alive is harder. It dies fast if registering an asset is optional or if metadata is hand-maintained.
The three jobs of a catalog
flowchart LR
USER([Analyst, engineer, PM]):::user --> CAT[("Data catalog")]:::cat
CAT --> S([Search:<br/>find the right asset]):::ok
CAT --> L([Lineage:<br/>see how it was built]):::ok
CAT --> O([Ownership:<br/>know who to ask]):::ok
classDef user fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
classDef cat fill:#fed7aa,stroke:#c2410c,color:#7c2d12
classDef ok fill:#dcfce7,stroke:#15803d,color:#14532d
Search. Not just by name. By tag (pii), by domain (finance), by column (is there a table with a column called gross_revenue), and ideally by meaning (semantic search over descriptions and column names). Modern catalogs use a vector index over descriptions so “tables about customer churn” finds the right one even when no column is called “churn.”
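The search modes above can be sketched over a tiny in-memory catalog. This is only an illustration: the `Asset` class, the sample tables, and the word-overlap scoring are assumptions made for the example, and a real catalog would likely replace `search_by_meaning` with an embedding index.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    name: str
    description: str
    columns: list[str]
    tags: set[str] = field(default_factory=set)


# Hypothetical sample assets, for illustration only.
CATALOG = [
    Asset("analytics.fct_orders", "One row per order with revenue in USD",
          ["order_id", "gross_revenue"], {"finance"}),
    Asset("analytics.dim_customers", "Customer profile with email and signup date",
          ["customer_id", "email"], {"pii"}),
    Asset("ml.retention_scores", "Predicted probability that a customer will churn",
          ["customer_id", "score"], {"product"}),
]


def search_by_tag(tag: str) -> list[str]:
    return [a.name for a in CATALOG if tag in a.tags]


def search_by_column(column: str) -> list[str]:
    return [a.name for a in CATALOG if column in a.columns]


def search_by_meaning(query: str) -> list[str]:
    """Crude stand-in for semantic search: rank assets by the number
    of words shared between the query and each description."""
    words = set(query.lower().split())
    scored = []
    for a in CATALOG:
        overlap = len(words & set(a.description.lower().split()))
        if overlap:
            scored.append((overlap, a.name))
    return [name for _, name in sorted(scored, reverse=True)]
```

Here `search_by_meaning("tables about customer churn")` ranks `ml.retention_scores` first even though neither its name nor its columns mention churn, because the description does. An embedding index goes one step further and matches when the description uses different words.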
Lineage. Both table-level (“who reads this”) and column-level (“who reads this column”). See problem 82. The catalog draws the graph; OpenLineage feeds it the edges.
Ownership. Each asset has an owner team, a freshness SLA, and a contact. Without this the catalog is only a lookup tool; with it, the catalog becomes the place to start any conversation about data.
OpenLineage, the protocol
OpenLineage is a JSON spec for “a job ran” events. A job (Spark, dbt, Airflow task, Trino query) emits a START event when it begins, a COMPLETE or FAIL event when it ends. Each event names the inputs and outputs of the job, optionally with column-level lineage and run-level metadata.
{
  "eventType": "COMPLETE",
  "eventTime": "2026-06-04T03:00:12Z",
  "job": {"namespace": "dbt.analytics", "name": "fct_orders"},
  "run": {"runId": "abc-123", "facets": {...}},
  "inputs": [
    {"namespace": "snowflake.prod", "name": "stg.orders"},
    {"namespace": "snowflake.prod", "name": "stg.fx"}
  ],
  "outputs": [
    {
      "namespace": "snowflake.prod",
      "name": "analytics.fct_orders",
      "facets": {"columnLineage": {
        "fields": {
          "gross_revenue": {
            "inputFields": [
              {"namespace": "snowflake.prod", "name": "stg.orders", "field": "amount"},
              {"namespace": "snowflake.prod", "name": "stg.fx", "field": "rate"}
            ]
          }
        }
      }}
    }
  ]
}
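As a rough sketch of the START and COMPLETE pairing, the snippet below builds two events as plain dictionaries using only the standard library. The helper `run_event` is hypothetical, and the events carry only the fields shown in the example above; the full spec requires further fields such as `producer` and `schemaURL`, which a real client library would fill in.

```python
import json
import uuid
from datetime import datetime, timezone


def run_event(event_type: str, run_id: str,
              inputs: list[str], outputs: list[str]) -> dict:
    """Build a minimal OpenLineage-shaped run event as a plain dict."""
    def dataset(name: str) -> dict:
        return {"namespace": "snowflake.prod", "name": name}

    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "job": {"namespace": "dbt.analytics", "name": "fct_orders"},
        "run": {"runId": run_id},
        "inputs": [dataset(n) for n in inputs],
        "outputs": [dataset(n) for n in outputs],
    }


# The same runId ties the START and COMPLETE events to one run.
run_id = str(uuid.uuid4())
sources = ["stg.orders", "stg.fx"]
targets = ["analytics.fct_orders"]

start = run_event("START", run_id, sources, targets)
complete = run_event("COMPLETE", run_id, sources, targets)

print(json.dumps(complete, indent=2))
```

The consumer correlates the two events by `runId`, which is why the sketch generates it once and reuses it.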
In 2026 the major producers all emit OpenLineage natively or with a thin wrapper:
- dbt via the dbt-ol wrapper.
- Airflow via the OpenLineage provider.
- Spark via the OpenLineage listener.
- Flink, Trino, Dagster all native.
- Snowflake, BigQuery via query history connectors.
The catalog subscribes to the event stream and stitches the graph.
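A minimal sketch of that stitching step, assuming events arrive as dictionaries shaped like the example above. The function names, the sample events, and the `analytics.daily_revenue` and `scratch.tmp` tables are invented for illustration; a real consumer also handles column-level facets, run state, and deduplication.

```python
from collections import defaultdict, deque


def stitch(events: list[dict]) -> dict:
    """Map each dataset to the datasets written from it,
    using only runs that completed."""
    downstream = defaultdict(set)
    for ev in events:
        if ev["eventType"] != "COMPLETE":
            continue
        for src in ev["inputs"]:
            for dst in ev["outputs"]:
                downstream[(src["namespace"], src["name"])].add(
                    (dst["namespace"], dst["name"]))
    return downstream


def all_downstream(graph: dict, start: tuple) -> set:
    """Breadth-first walk: everything affected if `start` changes."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def ds(name: str) -> dict:
    return {"namespace": "snowflake.prod", "name": name}


EVENTS = [
    {"eventType": "COMPLETE", "inputs": [ds("stg.orders"), ds("stg.fx")],
     "outputs": [ds("analytics.fct_orders")]},
    {"eventType": "COMPLETE", "inputs": [ds("analytics.fct_orders")],
     "outputs": [ds("analytics.daily_revenue")]},
    {"eventType": "FAIL", "inputs": [ds("stg.orders")],
     "outputs": [ds("scratch.tmp")]},
]

graph = stitch(EVENTS)
impacted = all_downstream(graph, ("snowflake.prod", "stg.orders"))
```

Keying nodes by `(namespace, name)` rather than by name alone keeps two warehouses with identically named tables apart, which matters in a multi-warehouse setup like the one in the scenario.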
The 2026 catalog options
| Option | Type | Strengths | Watch out for |
|---|---|---|---|
| DataHub | Open source | Strong lineage, big community, plug-in friendly | Operational complexity, several services to run |
| OpenMetadata | Open source | Fast-moving, good UX, easier to operate | Newer, fewer integrations than DataHub |
| Atlan | Vendor | Polished UX, low setup, governance features | Cost, vendor lock |
| Collibra | Vendor (enterprise) | Strongest governance and audit story | Heavy, slow to roll out, costly |
| Castor / Secoda / others | Vendor | Lightweight, opinionated | Less depth on lineage |
| Homegrown | DIY | Bespoke to your needs | Maintenance burden, never reaches feature parity |
The decision tree most teams follow:
- Under 50 people, no compliance pressure: OpenMetadata.
- 50 to a few hundred, modest budget: Atlan or DataHub depending on UX taste and operating capacity.
- Heavy regulatory environment (finance, healthcare): Collibra or Atlan + Collibra.
- Public sector, security-first: DataHub self-hosted.
- “We want to build it ourselves”: only if your team has 3+ engineers to dedicate. Otherwise no.
Rolling it out without it dying
flowchart LR
P1[Pilot: 1 team,<br/>1 warehouse,<br/>auto-imported assets]:::p1
P2[Expand: 5 teams,<br/>OpenLineage from<br/>Airflow and dbt]:::p2
P3[Govern: ownership,<br/>SLAs, deprecation flow]:::p3
P1 --> P2 --> P3
classDef p1 fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
classDef p2 fill:#fef3c7,stroke:#a16207,color:#713f12
classDef p3 fill:#dcfce7,stroke:#15803d,color:#14532d
Phase 1, pilot. One team, one warehouse. Auto-import every table from the warehouse with its INFORMATION_SCHEMA metadata. No human curation. Goal: get the team to search the catalog and find tables. Measure usage. If usage is zero, the problem is the catalog itself or your search; fix that before adding more sources.
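The auto-import step might look roughly like the following, assuming the rows have already been fetched from the warehouse. The sample rows and the shape of each catalog entry are assumptions; the point is that every field comes from the warehouse and nothing is typed in by hand.

```python
from collections import defaultdict

# Rows as they might come back from a query such as:
#   SELECT table_schema, table_name, column_name, data_type
#   FROM information_schema.columns
# The values below are made up for the example.
ROWS = [
    ("analytics", "fct_orders", "order_id", "NUMBER"),
    ("analytics", "fct_orders", "gross_revenue", "NUMBER"),
    ("stg", "orders", "amount", "NUMBER"),
]


def import_assets(rows) -> dict:
    """Group column rows into one catalog entry per table.
    Owner and description start empty; later phases fill them in."""
    assets = defaultdict(
        lambda: {"columns": [], "owner": None, "description": None})
    for schema, table, column, dtype in rows:
        assets[f"{schema}.{table}"]["columns"].append(
            {"name": column, "type": dtype})
    return dict(assets)


assets = import_assets(ROWS)
```

Re-running the import on a schedule keeps the schema part of the catalog current without anyone editing it.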
Phase 2, expand. Add more teams, more sources. Enable OpenLineage from dbt and Airflow so the lineage graph fills in. Add tags for PII, domain (finance, product, infra), and tier (1/2/3).
Phase 3, govern. Make registration mandatory for new tables. The dbt project requires an owner field on every model; the build fails without it. Deprecating a table requires a flag in the catalog, a 60-day notice, and lineage showing zero downstream readers before the drop.
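One way to enforce the owner rule is a small CI check over dbt's manifest. The sketch below assumes the team's convention is to store the owner under each model's `meta` as `owner`; that key is a convention chosen for this example, not something dbt mandates, and the sample manifest is invented.

```python
def models_missing_owner(manifest: dict) -> list[str]:
    """Return the unique_ids of dbt models with no owner in `meta`."""
    missing = []
    for unique_id, node in manifest.get("nodes", {}).items():
        if node.get("resource_type") != "model":
            continue  # tests, seeds, and snapshots are out of scope here
        if not node.get("meta", {}).get("owner"):
            missing.append(unique_id)
    return sorted(missing)


# Trimmed, hypothetical stand-in for the contents of target/manifest.json.
SAMPLE_MANIFEST = {
    "nodes": {
        "model.shop.fct_orders": {
            "resource_type": "model", "meta": {"owner": "finance-data"}},
        "model.shop.tmp_export": {
            "resource_type": "model", "meta": {}},
        "test.shop.not_null_orders": {
            "resource_type": "test", "meta": {}},
    }
}

unowned = models_missing_owner(SAMPLE_MANIFEST)
```

In CI the script would load `target/manifest.json` with `json.load` and exit non-zero when `unowned` is not empty, so the pull request cannot merge until an owner is declared.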
The failure mode: it rots
Most catalogs go through this life cycle:
- Set up. Excitement. Everyone fills in descriptions.
- Six months later. Half the descriptions are out of date. Nobody trusts the freshness SLA. Search returns dead tables.
- Twelve months. Team uses Slack and tribal knowledge again. Catalog has zero users.
The catalogs that do not rot share three habits:
- Auto-extract everything that can be auto-extracted. Schema, lineage, owner from git CODEOWNERS, freshness from query history. Hand-maintained metadata rots; extracted metadata stays current.
- Make registration mandatory. A dbt model without an owner tag fails CI. A new warehouse table without a registration entry triggers an alert. “Register or it does not exist.”
- Deprecate the dead. A monthly job marks tables that have not been queried in 90 days as candidates for deprecation. The owner has 30 days to defend or drop.
Without these, the catalog drifts from reality and becomes worse than no catalog at all.
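The monthly deprecation job from the third habit can be sketched as below. It assumes a mapping from table name to last-queried date has already been extracted from query history; the table names and dates are invented, and `idle_days` and `grace_days` simply encode the 90-day and 30-day figures from the list.

```python
from datetime import date, timedelta


def deprecation_candidates(last_queried: dict[str, date], today: date,
                           idle_days: int = 90,
                           grace_days: int = 30) -> list[tuple[str, date]]:
    """Return (table, defend-or-drop deadline) for each table that
    has not been queried within the last `idle_days` days."""
    cutoff = today - timedelta(days=idle_days)
    deadline = today + timedelta(days=grace_days)
    return sorted((table, deadline)
                  for table, last in last_queried.items()
                  if last < cutoff)


# Hypothetical last-queried dates, as extracted from query history.
LAST_QUERIED = {
    "analytics.fct_orders": date(2026, 6, 1),
    "analytics.tmp_export": date(2026, 1, 15),
    "legacy.customers_old": date(2026, 2, 1),
}

candidates = deprecation_candidates(LAST_QUERIED, today=date(2026, 6, 4))
```

Each candidate then goes to its owner with the deadline attached, which only works if the ownership metadata is already in place.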
Where OpenLineage fits vs Marquez and DataHub
A common confusion: OpenLineage is the spec, not a product. Marquez is the reference implementation of an OpenLineage consumer (open source, started at WeWork, now an LF AI & Data project). DataHub and OpenMetadata also consume OpenLineage.
In practice:
- You emit OpenLineage from your jobs.
- You point the events at a consumer (Marquez, DataHub, OpenMetadata, Atlan).
- The consumer shows you the graph.
You can swap consumers later because the producers do not know which consumer is listening. That decoupling is the whole point of the spec.
Common mistakes interviewers want you to name
- Building a catalog without OpenLineage producers in place. The graph stays empty.
- Hand-maintained descriptions. They rot in a quarter.
- No ownership enforcement. Tables exist without owners, nobody can deprecate anything.
- Catalog as documentation system. Different problem. Catalog is for finding and depending; docs are for explaining.
- No deprecation flow. Dead tables accumulate forever and pollute search.
Bonus follow-up the interviewer might throw
“How do you handle the BI tool layer in the catalog?”
Same lineage cliff as in problem 82. Three pragmatic approaches:
- Looker and dbt Semantic Layer. Define LookML or Semantic Layer models, lineage falls out automatically. The catalog reads it.
- Tableau and Power BI. Use their REST APIs to scrape workbooks and parse their custom SQL. Imperfect; covers about 70% of dashboards.
- Notebooks. Hex and Marimo emit OpenLineage natively in 2026. Jupyter does not; you scrape execution logs or accept the gap.
The end-to-end catalog requires all three to be on board. A catalog that stops at the warehouse boundary still helps engineering teams but misses the BI side of the impact analysis. Plan for the gap; do not pretend it does not exist.
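To make the custom-SQL parsing in the Tableau and Power BI case concrete, here is a deliberately naive sketch that pulls table names out of a workbook's SQL with a regular expression. The pattern and the sample query are assumptions for illustration; a production scraper would more likely use a proper SQL parser.

```python
import re

# Matches the identifier that follows FROM or JOIN.
TABLE_REF = re.compile(r"\b(?:from|join)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def referenced_tables(sql: str) -> set[str]:
    """Best-effort list of tables a piece of custom SQL reads from."""
    return {name.lower() for name in TABLE_REF.findall(sql)}


CUSTOM_SQL = """
SELECT o.order_id, c.email
FROM analytics.fct_orders o
JOIN analytics.dim_customers c ON o.customer_id = c.customer_id
"""

tables = referenced_tables(CUSTOM_SQL)
```

The regex skips subqueries that open with a parenthesis and would mistake a CTE name for a table, which is the kind of gap that keeps this approach from covering every dashboard.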