Redpanda Schema Registry

In Redpanda, the messages exchanged between producers and consumers contain raw bytes. Schemas enable producers and consumers to share the information needed to serialize and deserialize those messages. They register and retrieve the schemas they use in the Schema Registry to ensure data verification.

Schemas are versioned, and the registry supports configurable compatibility modes between schema versions. When a producer or a consumer requests to register a schema change, the registry checks for schema compatibility and returns an error for an incompatible change. Compatibility modes can ensure that data flowing through a system is well-structured and easily evolves.

Schema size best practice: Schema Registry works best with schemas of 128KB in size or less. Large schemas can consume significant memory resources and may cause system instability or crashes, particularly in memory-constrained environments. For Protobuf and Avro schemas, Redpanda recommends using schema references to break up large schemas into smaller constituent parts.

The Schema Registry is built directly into the Redpanda binary. It runs out of the box with Redpanda’s default configuration, and it requires no new binaries to install and no new services to deploy or maintain. You can use it with the Schema Registry API or Redpanda Console.

Schema terminology

Schema: A schema is an external mechanism to describe the structure of data and its encoding. Producer clients and consumer clients use a schema as an agreed-upon format for sending and receiving messages. Schemas enable a loosely coupled, data-centric architecture that minimizes dependencies in code, between teams, and between producers and consumers.

Subject: A subject is a logical grouping for schemas. When data formats are updated, a new version of the schema can be registered under the same subject, allowing for backward and forward compatibility. A subject may have more than one schema version assigned to it, with each schema having a different numeric ID.

Serialization format: A serialization format defines how data is converted into bytes that are transmitted and stored. Serialization, by producers, converts an event into bytes. Redpanda then stores these bytes in topics. Deserialization, by consumers, converts the bytes of arrays back into the desired data format. Redpanda’s Schema Registry supports Avro, Protobuf, and JSON serialization formats.

Normalization: Normalization is the process of converting a schema into a canonical form. When a schema is normalized, it can be compared and considered equivalent to another schema that may contain minor syntactic differences. Schema normalization allows you to more easily manage schema versions and compatibility by prioritizing meaningful logical changes. Normalization is supported for Avro, JSON, and Protobuf formats during both schema registration and lookup for a subject.

Redpanda design overview

Every broker allows mutating REST calls, so there’s no need to configure leadership or failover strategies. Schemas are stored in a compacted topic, and the registry uses optimistic concurrency control at the topic level to detect and avoid collisions.

The Schema Registry publishes an internal topic, _schemas, as its backend store. This internal topic is reserved strictly for schema metadata and support purposes. Do not directly edit or manipulate the _schemas topic unless directed to do so by Redpanda Support. See the kafka_nodelete_topics cluster property.

Redpanda Schema Registry uses the default port 8081.

Wire format

With Schema Registry, producers and consumers can use a specific message format, called the wire format. The wire format facilitates a seamless transfer of data by ensuring that clients easily access the correct schema in the Schema Registry for a message.

The wire format is a sequence of bytes consisting of the following:

The "magic byte," a single byte that always contains the value of 0.
A four-byte integer containing the schema ID.
The rest of the serialized message.

In the serialization process, the producer hands over the message to a key/value serializer that is part of the respective language-specific SDK. The serializer first checks whether the schema ID for the given subject exists in the local schema cache. The serializer derives the subject name based on several strategies, such as the topic name. You can also explicitly set the subject name.

If the schema ID isn’t in the cache, the serializer registers the schema in the Schema Registry and collects the resulting schema ID in the response.

In either case, when the serializer has the schema ID, it pads the beginning of the message with the magic byte and the encoded schema ID, and returns the byte sequence to the producer to write to the topic.

In the deserialization process, the consumer fetches messages from the broker and hands them over to a deserializer. The deserializer first checks the presence of the magic byte and rejects the message if it doesn’t follow the wire format.

The deserializer then reads the schema ID and checks whether that schema exists in its local cache. If it finds the schema, it deserializes the message according to that schema. Otherwise, the deserializer retrieves the schema from the Schema Registry using the schema ID, then the deserializer proceeds with deserialization.

You can also configure brokers to validate that producers use the wire format and the schema exists (but brokers do not validate the full payload). See Server-Side Schema ID Validation for more information.

Schema examples

To experiment with schemas from applications, see the clients in redpanda-labs.

For a basic end-to-end example, the following Protobuf schema contains information about products: a unique ID, name, price, and category. It has a schema ID of 1, and the Topic name strategy, with a topic of Orders. (The Topic strategy is suitable when you want to group schemas by the topics to which they are associated.)

syntax = "proto3";

message Product {
  int32 ProductID = 1;
  string ProductName = 2;
  double Price = 3;
  string Category = 4;
}

The producer then does something like this:

from kafka import KafkaProducer
from productpy import Product  # This imports the prototyped schema

# Create a Kafka producer
producer = KafkaProducer(bootstrap_servers='your_kafka_brokers')

# Create a Product message
product_message = Product(
    ProductID=123,
    ProductName="Example Product",
    Price=45.99,
    Category="Electronics"
)

# Produce the Product message to the "Orders" topic
producer.send('Orders', key='product_key', value=product_message.SerializeToString())

To add an additional field for product variants, like size or color, the new schema (version 2, ID 2) would look like this:

syntax = "proto3";

message Product {
  int32 ProductID = 1;
  string ProductName = 2;
  double Price = 3;
  string Category = 4;
  repeated string Variants = 5;
}

You would want the compatibility setting to accommodate adding new fields without breakage. Adding an optional new field to a schema is inherently backward-compatible. New consumers can process events written with the new schema, and older consumers can ignore it.

JSON Schema

All CRUD operations are supported for the JSON Schema (json-schema), and Redpanda supports all published JSON Schema specifications, which include:

draft-04
draft-06
draft-07
2019-09
2020-12

Limitations

Schemas are held in subjects. Subjects have a compatibility configuration associated with them, either directly specified by a user, or inherited by the default. See PUT /config and PUT/config/{subject} in the Schema Registry API.

If you have inserted a second schema into a subject where the compatibility level is anything but NONE, then any JSON Schema containing the following items are rejected:

$ref
$defs (definitions prior to draft 2019-09)
dependentSchemas / dependentRequired (dependencies prior to draft 2019-09)
prefixItems

Consequently, you cannot structure a complex schema using these features.

Additionally, you cannot have schema ID validation with JSON schemas if the subject name strategy is not TopicNameStrategy.

Next steps

Use the Schema Registry API

What do you think of this page?

Let us know more:

Let us contact you about your feedback: