Jan 12, 2026

Real-time Metadata Ingestion: APIs vs Kafka Explained

Sriharsha Chintalapani

Metadata is the foundation of modern analytics and AI. When schemas change, new tables appear, or ownership shifts, those changes ripple immediately through dashboards, models, and pipelines. If metadata lags behind reality, teams lose trust and systems break in subtle, hard-to-debug ways.

Real-time metadata ingestion keeps the system aligned with how data actually exists at any moment. Analysts see the right structures, governance stays enforceable, and AI systems get current context and meaning instead of stale snapshots. As data platforms grow more dynamic and automated, the architecture behind metadata becomes a first-order design decision.

What Architecture Is Right for You?

Architectural choices at large tech companies like Uber and LinkedIn reflect Conway's Law: system design mirrors the structure of the organization that builds it. Dedicated teams manage Kafka, ETL pipelines, data warehouses, and infrastructure. Adding dependencies or splitting services into microservices doesn't raise concerns because the organization has deep expertise for each component.

Most organizations don't operate this way. They lack the Kafka teams, pipeline teams, and SRE groups needed to operationalize complex architectures. Back in 2021, OpenMetadata founder Suresh Srinivas wrote about why this matters for metadata systems: metadata can be delivered as a service that assumes supporting teams exist, or as a product where one team owns everything end-to-end. OpenMetadata takes the product approach. When pipelines break or infrastructure fails, there's no separate team to escalate to. Architectural simplicity isn't optional here. It's how you ship something teams can actually run.

The API and Standards Advantage

Open Metadata Standards exist so that metadata can be shared and understood across the entire data ecosystem. OpenMetadata models all schemas using JSON Schema with strong typing and clear vocabulary. Entities and relationships are documented out of the box. Teams don't spend months on modeling before they can start using the system. For organizations that need to extend the model, the schemas have clear extension points.
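
To make that concrete, here is a minimal sketch of schema-driven validation in Python using the jsonschema library. The schema below is a deliberately trimmed, hypothetical table definition, not one of the actual OpenMetadata schemas; it only illustrates how strong typing and required fields catch malformed metadata before it enters the system.

```python
# A minimal sketch, not the real OpenMetadata schema: validate a table-like
# entity against a simplified, strongly typed JSON Schema.
from jsonschema import validate, ValidationError

# Hypothetical, trimmed-down schema in the spirit of an entity definition.
table_schema = {
    "type": "object",
    "required": ["name", "columns"],
    "properties": {
        "name": {"type": "string"},
        "description": {"type": "string"},
        "columns": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["name", "dataType"],
                "properties": {
                    "name": {"type": "string"},
                    "dataType": {"type": "string"},
                },
            },
        },
    },
}

entity = {
    "name": "orders",
    "description": "Daily order fact table",
    "columns": [{"name": "order_id", "dataType": "BIGINT"}],
}

try:
    validate(instance=entity, schema=table_schema)
    print("entity conforms to the schema")
except ValidationError as err:
    print(f"schema violation: {err.message}")
```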

APIs make these schemas usable. OpenMetadata provides APIs for every common operation: CRUD for entities and relationships, listing with pagination, event subscriptions through polling or webhooks, and search with both keyword and advanced query support. We designed these APIs for developers, not as an afterthought.
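
As an illustration, a thin client for these operations can be a few lines of Python. The endpoint paths, pagination fields, and auth header below are assumptions made for the sketch rather than a definitive client; the OpenMetadata API documentation defines the exact contract.

```python
# A minimal sketch of API-driven metadata access with requests.
# Host, token, paths, and response fields are illustrative assumptions.
import requests

BASE_URL = "https://metadata.example.com/api/v1"   # hypothetical host
HEADERS = {"Authorization": "Bearer <token>"}       # hypothetical auth

def list_tables(limit=25):
    """Page through table entities using cursor-style pagination."""
    after = None
    while True:
        params = {"limit": limit}
        if after:
            params["after"] = after
        resp = requests.get(f"{BASE_URL}/tables", headers=HEADERS,
                            params=params, timeout=10)
        resp.raise_for_status()
        page = resp.json()
        yield from page.get("data", [])
        after = page.get("paging", {}).get("after")
        if not after:
            break

def update_description(table_fqn, description):
    """Push a description change as a JSON Patch-style request."""
    patch = [{"op": "add", "path": "/description", "value": description}]
    resp = requests.patch(
        f"{BASE_URL}/tables/name/{table_fqn}",
        headers={**HEADERS, "Content-Type": "application/json-patch+json"},
        json=patch,
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()
```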

The dependencies are minimal and proven: Jetty, MySQL or Postgres, and OpenSearch, components technology teams already know how to run and troubleshoot. There are no Kafka clusters to stand up and no coordination across multiple infrastructure teams just to get started.

What is OpenMetadata Standards?

OpenMetadata Standards is an open-source project that defines unified metadata schemas and semantic models for managing metadata across the data ecosystem. It includes:

  • 700+ JSON Schemas covering entities, APIs, configurations, and events

  • RDF ontologies for linked metadata and knowledge graph construction

  • SHACL shapes for validation and compliance

  • JSON-LD contexts for semantic interoperability

What "Real-Time" Actually Means for Metadata

OpenMetadata uses REST APIs for push-based, real-time ingestion. When Stripe processes payments or Twilio handles communications, they use APIs, not message queues. We do the same.

Real-time metadata updates through REST APIs

When you call an API, you get an immediate response. The update happens, it's persisted, and it's available for consumption. No waiting for messages to flow through queues. No waiting for consumers to catch up. At enterprise scale, OpenMetadata delivers sub-millisecond reads and sub-millisecond writes. That's what real-time looks like for metadata.

Metadata updates require immediate consistency. When a schema changes or a new table appears, downstream systems need to know now, not eventually. Updates must be atomic, and sequence matters because the last update must win. APIs give you this naturally.
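
A short sketch shows why ordering falls out of synchronous APIs. It reuses the hypothetical update_description helper from the client sketch above and assumes a matching get_description GET wrapper; the point is that each call returns only after the store has applied and persisted the change, so the most recent call is the most recent state.

```python
# Builds on the hypothetical update_description / get_description wrappers
# around the REST API sketched earlier.
table = "warehouse.sales.orders"

update_description(table, "Owned by the payments team")   # applied when the call returns
update_description(table, "Owned by the finance team")    # applied strictly after the first

# A read issued immediately afterwards sees the latest write; there is no
# queue in the middle where the older update could arrive late and win.
assert get_description(table) == "Owned by the finance team"
```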

The Source Constraint Everyone Faces

There's an important technical reality that often gets overlooked in discussions about metadata architecture. Data warehouses like Snowflake, Databricks, and BigQuery don't publish metadata change events. There are no notifications when schemas change, no triggers when tables are created, and no event streams to subscribe to. This constraint exists at the source level, and no downstream architecture can change it.

This means every metadata platform, regardless of how it's built, must poll upstream systems to discover changes. Polling typically happens on a schedule ranging from every few minutes to every hour depending on the source and the organization's needs. The bottleneck isn't the transport layer or the processing pipeline. The bottleneck is that sources simply don't tell you when something changed.

So the real question is what happens after you poll. One approach pulls metadata from sources and pushes it directly to the metadata store through APIs. Two hops. Another approach adds Kafka in the middle: pull from sources, push to Kafka, pull from Kafka, push to the metadata store. Four hops. Both start with the same poll. The difference is what happens next.
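
In code, the two-hop path is just a polling loop that writes straight to the metadata API. The function names and interval below are placeholders standing in for a connector query and a REST call, not OpenMetadata internals.

```python
# A minimal sketch of the two-hop path: poll the source, push to the store.
import time

POLL_INTERVAL_SECONDS = 300   # hypothetical schedule: every five minutes

def fetch_source_tables():
    """Hop 1: poll the warehouse, e.g. a query against its information schema."""
    return [{"schema": "sales", "name": "orders", "columns": []}]   # placeholder result

def upsert_table(table):
    """Hop 2: push the entity to the metadata store over its REST API."""
    ...   # PUT/PATCH request, as in the client sketch above

while True:
    for table in fetch_source_tables():   # source -> connector
        upsert_table(table)               # connector -> metadata store
    time.sleep(POLL_INTERVAL_SECONDS)
```

The Kafka variant would insert a produce step after hop 1 and a separate consumer process before hop 2, which is where the extra two hops come from.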

Why Fewer Hops Means Faster Metadata

Kafka is excellent for high-volume event streaming: millions of events per second, durable distributed storage, and tolerance for late arrivals, because you're building an append-only log for eventual processing. Uber and LinkedIn use Kafka extensively because they operate at massive scale with dedicated platform teams. That makes sense for them.

But adding Kafka to a metadata pipeline doesn't fix the source constraint. If Snowflake doesn't publish change events, putting Kafka between your connector and your metadata store won't make metadata arrive faster. The data still enters on the same polling schedule. What changes is the path afterward: four hops with serialization overhead, queue management, and consumer coordination instead of two hops from source to store.

There's also a fit problem. Metadata requires transaction integrity. Updates must be atomic and ordered. Kafka handles ordering within a partition but not across partitions. To ensure a schema change at 11:29:30 overwrites the change from 11:29:00, you need strict partitioning strategies and deduplication logic. You end up rebuilding database transaction controls in the application layer. For metadata workloads measured in thousands of complex updates per day, not millions of simple events per second, this complexity buys you nothing. However, for teams that are already using Kafka, Collate can certainly support that as well!
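
To see what that application-layer work looks like, here is a minimal sketch of the last-update-wins guard a consumer would need once ordering across partitions is no longer guaranteed. The event shape and field names are illustrative, not a specific connector format.

```python
# A minimal sketch of consumer-side last-update-wins deduplication: key every
# event by entity, carry a source timestamp, and drop anything older than what
# has already been applied.
from typing import Dict

last_applied: Dict[str, float] = {}   # entity key -> timestamp already applied

def apply_if_newer(event: dict) -> bool:
    """Apply a metadata change only if it is newer than the last one seen."""
    key = event["entity_fqn"]          # e.g. "warehouse.sales.orders"
    ts = event["source_timestamp"]     # when the change was observed at the source
    if ts <= last_applied.get(key, float("-inf")):
        # Stale or duplicate: the 11:29:00 change must not overwrite 11:29:30.
        return False
    last_applied[key] = ts
    # ... write the change to the metadata store here ...
    return True

# Out-of-order delivery across partitions: the older event is discarded.
assert apply_if_newer({"entity_fqn": "sales.orders", "source_timestamp": 1700000370.0})
assert not apply_if_newer({"entity_fqn": "sales.orders", "source_timestamp": 1700000340.0})
```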

When Kafka Makes Sense

Kafka is the right choice in specific scenarios.

If you operate at extreme scale with thousands of engineers and millions of events per second, you probably already have platform teams running Kafka for operational data. The expertise exists. The tooling exists. Adding metadata is an incremental cost, not new infrastructure. Your teams know Kafka's failure modes and have monitoring in place. Use what you have.

If regulatory requirements mandate immutable event logs for audit trails, Kafka's append-only architecture provides that guarantee, with proven durability and replication. If your compliance team needs a persistent record of every metadata change, Kafka works.

If you already run Kafka at scale with SRE teams and operational expertise, the marginal cost of adding metadata topics may be lower than standing up new API infrastructure. That's a reasonable calculation.

If none of that describes you, APIs get you there faster with less overhead.

Performance Validation

We benchmarked identical Redshift workloads to isolate architectural performance. Same database, same metadata volume. The API-based approach completed ingestion in 10 minutes. The Kafka-based approach took 3 to 6 hours.

The gap comes from ingestion logic, not transport speed. The API architecture uses smart polling with incremental ingestion. Track the last poll timestamp, retrieve only changes since then, reduce load on source systems, move less data through the pipeline. The Kafka-based approach accumulated overhead from connector serialization, queue management, and processing lag at each hop.
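
Here is a rough sketch of that incremental polling logic, assuming a generic DB-API connection and a source that exposes a last-altered timestamp in its information schema; the exact system views and column names differ per warehouse.

```python
# A minimal sketch of smart polling with incremental ingestion: remember the
# last poll time and ask the source only for what changed since then.
from datetime import datetime, timezone

def poll_changed_tables(conn, last_poll: datetime):
    """Fetch only tables altered since the previous poll (the delta)."""
    cursor = conn.cursor()
    # View and column names are illustrative; each warehouse exposes its own
    # system views, and the driver's paramstyle may differ.
    cursor.execute(
        "SELECT table_schema, table_name, last_altered "
        "FROM information_schema.tables WHERE last_altered > %s",
        (last_poll,),
    )
    return cursor.fetchall()

def run_poll_cycle(conn, state: dict):
    """One cycle: read the delta, push it, then advance the watermark."""
    watermark = state.get("last_poll", datetime.min.replace(tzinfo=timezone.utc))
    cycle_start = datetime.now(timezone.utc)
    for schema, name, altered in poll_changed_tables(conn, watermark):
        ...   # two hops: push each change straight to the metadata API
    state["last_poll"] = cycle_start
```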

This holds up in production. Carrefour Brazil manages 2 million data assets, 133 petabytes of data, and 600 dashboards on the API architecture. Mango runs global retail operations the same way. No Kafka. These aren't unsophisticated organizations. They picked APIs because the requirements matched.

The Right Tool for the Job

Kafka is good technology, and we're not arguing otherwise. The question is whether it fits your situation.

Metadata has specific characteristics that differ from typical streaming workloads. Sources don't publish events, so every system polls. Transaction integrity matters because updates must be atomic and ordered. And the workloads run in thousands of complex updates per day, not millions of simple events per second. An API architecture handles these requirements directly with fewer hops, less infrastructure, and a faster path from poll to storage.

We built OpenMetadata this way from the beginning because we've seen what happens when you don't. Complex architectures that teams can't operate. Infrastructure dependencies that add latency without adding capability. Months spent on plumbing instead of solving actual metadata problems.

If you want real-time metadata, the answer isn't adding more infrastructure between you and your data. It's removing it.

If you’re interested in learning more, watch my “Ask the Experts” discussion with Senior Developer Advocate Shawn Gordon, where I go into more depth, covering first principles, best practices, and data architecture diagrams.
