Monitor Kubernetes Shadow Links
This feature requires an enterprise license. To get a trial license key or extend your trial period, generate a new trial license key. To purchase a license, contact Redpanda Sales. If Redpanda has enterprise features enabled and it cannot find a valid license, restrictions apply.
Monitor your shadow links to ensure proper replication performance and understand your disaster recovery readiness. For Kubernetes deployments, you can monitor shadow links using the Redpanda Operator’s ShadowLink resource status or by using rpk commands directly.
See Kubernetes Failover Runbook for immediate step-by-step disaster procedures.
Status commands
- Operator
- Helm
To list existing shadow links using the Operator's ShadowLink resource:
kubectl get shadowlink --namespace <shadow-namespace>
NAME   SYNCED
link   True
The synced status is True for a healthy shadow link. If the synced status is False, use kubectl describe to investigate the issue.
To view detailed shadow link status and configuration:
kubectl describe shadowlink --namespace <shadow-namespace> <shadowlink-name>
Name:         link
Namespace:    redpanda-system
API Version:  cluster.redpanda.com/v1alpha2
Kind:         ShadowLink
Status:
  Conditions:
    Status:   True
    Type:     Synced
    Message:  Shadow link is synced
  Shadow Topics:
    Name:   orders
    State:  active
    Name:   inventory
    State:  active
  Tasks:
    Name:   Source Topic Sync
    State:  active
    Name:   Consumer Group Shadowing
    State:  active
    Name:   Security Migrator
    State:  active
The kubectl describe output shows:
- Shadow link state: Overall operational state in the Status section
- Individual topic states: Current state of each replicated topic under Shadow Topics
- Task status: Health of replication tasks under Tasks
- Sync status: Whether the resource is properly synced (Synced: True in conditions)
- Configuration: Complete shadow link configuration including connection settings and filters
Look for Synced: True in Conditions and active state for topics and tasks.
For more detailed monitoring or troubleshooting, you can also use rpk commands as shown in the Helm tab.
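For scripted checks, the Synced condition can be read directly rather than scanned from the describe output. The following is a minimal sketch, assuming the condition is exposed under .status.conditions with type Synced, as the describe output above suggests. The function name shadowlink_synced is hypothetical and not part of any Redpanda tooling.

```shell
# Sketch only: succeeds when the ShadowLink's Synced condition is "True".
# Assumes the condition lives at .status.conditions[] with type "Synced".
# Returns 2 when kubectl itself fails, 1 when the condition is not "True".
shadowlink_synced() {
  local namespace="$1" name="$2" status
  status="$(kubectl get shadowlink "$name" --namespace "$namespace" \
    -o jsonpath='{.status.conditions[?(@.type=="Synced")].status}')" || return 2
  [ "$status" = "True" ]
}
```

Call it as shadowlink_synced <shadow-namespace> <shadowlink-name> and branch on the exit status, for example to gate a deployment step on a healthy link.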
To list existing shadow links using rpk:
kubectl exec --namespace <shadow-namespace> <shadow-pod-name> --container redpanda -- \
rpk shadow list
NAME                     UID                                    STATE
disaster-recovery-link   70f25b41-9bad-4e31-9f81-d302c8676397   ACTIVE
To view shadow link configuration details:
kubectl exec --namespace <shadow-namespace> <shadow-pod-name> --container redpanda -- \
rpk shadow describe <shadow-link-name>
The rpk shadow describe command shows the complete configuration of the shadow link, including connection settings, filters, and synchronization options. For detailed command options, see rpk shadow list and rpk shadow describe.
To check your shadow link status and ensure proper operation:
kubectl exec --namespace <shadow-namespace> <shadow-pod-name> --container redpanda -- \
rpk shadow status <shadow-link-name>
OVERVIEW
===
NAME disaster-recovery-link
UID 70f25b41-9bad-4e31-9f81-d302c8676397
STATE ACTIVE
TASKS
===
NAME BROKER_ID SHARD STATE REASON
Source Topic Sync 0 0 ACTIVE Source Topic Sync has started
Consumer Group Shadowing 0 0 ACTIVE Group mirroring task finished successfully
Security Migrator Task 0 0 ACTIVE Security Migrator Task has started
TOPICS
===
Name: orders, State: ACTIVE
PARTITION SRC_LSO SRC_HWM DST_HWM LAG
0 1000 1234 1230 4
1 2000 2456 2450 6
Name: inventory, State: ACTIVE
PARTITION SRC_LSO SRC_HWM DST_HWM LAG
0 500 789 789 0
Key indicators:
- State: active: Shadow link is replicating
- Tasks: active: All replication tasks are running
- Lag: Message count difference between source and shadow (lower is better)
For troubleshooting specific issues, you can use command options to show individual status sections. See rpk shadow status for available status options.
The status output includes the following:
- Shadow link state: Overall operational state (ACTIVE, PAUSED).
- Individual topic states: Current state of each replicated topic (ACTIVE, FAULTED, FAILING_OVER, FAILED_OVER, PAUSED).
- Task status: Health of replication tasks across brokers (ACTIVE, FAULTED, NOT_RUNNING, LINK_UNAVAILABLE). For details about shadow link tasks, see Shadow link tasks.
- Lag information: Replication lag per partition showing source vs shadow high watermarks (HWM).
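The per-partition lag rows can be filtered in a script. The following is a rough sketch, assuming the TOPICS section is laid out as in the sample output above (a "Name: <topic>, State: <state>" line followed by a PARTITION header and five-column rows). The function name lag_over_threshold is hypothetical, and the parsing may need adjusting if your rpk version prints a different layout.

```shell
# Sketch only: reads `rpk shadow status` output on stdin and prints
# "<topic> <partition> <lag>" for every partition whose LAG exceeds the
# threshold given as the first argument (default 0).
lag_over_threshold() {
  local threshold="${1:-0}"
  awk -v max="$threshold" '
    # Topic header, for example: Name: orders, State: ACTIVE
    $1 == "Name:" { topic = $2; sub(/,$/, "", topic); next }
    # Column header that precedes the partition rows
    $1 == "PARTITION" { in_rows = 1; next }
    # Partition row: PARTITION SRC_LSO SRC_HWM DST_HWM LAG
    in_rows && NF == 5 && $1 ~ /^[0-9]+$/ {
      if ($5 + 0 > max + 0) print topic, $1, $5
      next
    }
    { in_rows = 0 }
  '
}
```

For example, piping the rpk shadow status command shown earlier into lag_over_threshold 5 would list only the partitions that are more than five messages behind.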
Troubleshoot
Topics in FAULTED state
When monitoring shadow links, you may see topics showing FAULTED state in status output.
Check shadow cluster logs for specific error messages:
kubectl logs --namespace <shadow-namespace> <shadow-pod-name> --container redpanda | grep -iE "shadow|error"
Common causes include:
- Source topic deleted: topic no longer exists on source cluster
- Permission denied: shadow link service account lacks required permissions
- Network interruption: temporary connectivity issues
If the source topic still exists and should be replicated, delete and recreate the shadow link to reset the faulted state.
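To find which topics are faulted without reading the full status output, the topic header lines can be filtered. This is a small sketch, assuming topic headers are printed as "Name: <topic>, State: <state>" as in the sample rpk shadow status output. The function name faulted_topics is hypothetical.

```shell
# Sketch only: reads `rpk shadow status` output on stdin and prints the
# name of every topic whose header line reports State: FAULTED.
faulted_topics() {
  awk '$1 == "Name:" && $3 == "State:" && $4 == "FAULTED" {
    name = $2
    sub(/,$/, "", name)   # drop the comma that follows the topic name
    print name
  }'
}
```

Pipe the output of rpk shadow status <shadow-link-name> into faulted_topics. An empty result means no topic header reported the FAULTED state.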
High replication lag
When monitoring shadow links, you may see LAG values continuously increasing in rpk shadow status.
Check the following:
- Source cluster load: a high produce rate may exceed replication capacity
- Shadow cluster resources: CPU, memory, or disk constraints
- Network bandwidth: verify sufficient bandwidth between clusters
To resolve:
- Scale shadow cluster resources if constrained
- Verify network connectivity and bandwidth
- Review topic configuration for optimization opportunities
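A single status snapshot cannot show whether lag is continuously increasing, so it helps to sample the total lag several times and compare. The following is a sketch under the assumption that partition rows follow the five-column layout of the sample output. Both function names (total_lag, strictly_increasing) are hypothetical helpers.

```shell
# Sketch only: sums the LAG column over all partition rows of
# `rpk shadow status` output read from stdin and prints the total.
total_lag() {
  awk '
    $1 == "PARTITION" && $NF == "LAG" { in_rows = 1; next }
    in_rows && NF == 5 && $1 ~ /^[0-9]+$/ { sum += $5; next }
    { in_rows = 0 }
    END { print sum + 0 }
  '
}

# Sketch only: succeeds when stdin holds at least two numbers, one per
# line, and each is larger than the one before it.
strictly_increasing() {
  awk '
    NR > 1 && $1 + 0 <= prev { bad = 1 }
    { prev = $1 + 0 }
    END { exit ((NR < 2 || bad) ? 1 : 0) }
  '
}
```

One way to use these is to append the result of total_lag to a file at a fixed interval and pass the last few lines of that file to strictly_increasing. A sustained upward trend points to a capacity problem rather than a short burst.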
Task shows LINK_UNAVAILABLE
When monitoring shadow links, you may see tasks in the LINK_UNAVAILABLE state with a "No brokers available" message.
Common causes include:
- The source cluster requires SASL authentication, but the shadow link is not configured for it
- The source cluster is unreachable from the shadow cluster
- A network policy is blocking traffic between the clusters
To resolve:
- Verify the SASL configuration if the source cluster requires authentication
- Test network connectivity: kubectl exec into a shadow pod and try connecting to the source cluster
- Check Kubernetes NetworkPolicies and firewall rules
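The connectivity test can be sketched as a plain TCP check run from inside the shadow pod. This assumes that bash and the coreutils timeout command are available in the redpanda container, which may not hold for every image. The function name check_source_reachable and the host and port arguments are placeholders. A successful TCP connection does not prove that SASL or TLS settings are correct.

```shell
# Sketch only: opens a TCP connection from the shadow pod to a source
# broker and returns the exit status of that attempt.
# Assumes bash and timeout exist in the redpanda container.
check_source_reachable() {
  local namespace="$1" pod="$2" host="$3" port="$4"
  # Inside the container, $0 is the host and $1 is the port.
  kubectl exec --namespace "$namespace" "$pod" --container redpanda -- \
    timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port"
}
```

Call it as check_source_reachable <shadow-namespace> <shadow-pod-name> <source-broker-host> <source-broker-port>. A non-zero exit status suggests a routing, DNS, firewall, or NetworkPolicy problem between the clusters.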
Metrics
Shadowing exposes metrics on the public_metrics endpoint for tracking replication performance and health.
| Metric | Type | Description |
|---|---|---|
| redpanda_shadow_link_shadow_lag | Gauge | The lag of the shadow partition against the source partition, calculated as source partition LSO (Last Stable Offset) minus shadow partition HWM (High Watermark). Monitor by |
| | Count | The total number of bytes fetched by a sharded replicator (bytes received by the client). Labeled by |
| | Count | The total number of bytes written by a sharded replicator (bytes written to the write_at_offset_stm). Uses |
| redpanda_shadow_link_client_errors | Count | The number of errors seen by the client. Track by |
| | Gauge | Number of shadow topics in the respective states. Labeled by |
| | Count | The total number of records fetched by the sharded replicator (records received by the client). Monitor by |
| | Count | The total number of records written by a sharded replicator (records written to the write_at_offset_stm). Uses |
See also: Public Metrics
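For a quick look at lag without a full Prometheus setup, the scraped metrics text can be filtered directly. The following is a sketch, assuming the endpoint returns the standard Prometheus text format with the sample value as the last field on each line and no trailing timestamp. The function name max_shadow_lag is hypothetical.

```shell
# Sketch only: reads Prometheus text-format metrics on stdin and prints
# the largest redpanda_shadow_link_shadow_lag sample value.
# Prints nothing when no such series is present.
max_shadow_lag() {
  awk '
    index($1, "redpanda_shadow_link_shadow_lag") == 1 {
      value = $NF + 0
      if (!seen || value > max) { max = value; seen = 1 }
    }
    END { if (seen) print max }
  '
}
```

A typical invocation pipes a scrape into the function, for example curl -s http://<broker-host>:9644/public_metrics | max_shadow_lag, where 9644 is the default Admin API port and may differ in your deployment.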
Monitoring best practices
Health check procedures
Establish regular monitoring workflows to ensure shadow link health:
- Operator
- Helm
# Check all shadow links are synced and healthy
kubectl get shadowlink --namespace <shadow-namespace>
# View detailed status for a specific shadow link
kubectl describe shadowlink --namespace <shadow-namespace> <shadowlink-name>
# Check for any shadow links with issues (not synced)
kubectl get shadowlink --namespace <shadow-namespace> -o json | \
jq '.items[] | select(.status.conditions[] | select(.type=="Synced" and .status!="True")) | .metadata.name'
# Check all shadow links are active (skip the header row so that grep only sees link rows)
kubectl exec --namespace <shadow-namespace> <shadow-pod-name> --container redpanda -- \
rpk shadow list | tail -n +2 | grep -v "ACTIVE" || echo "All shadow links healthy"
# Monitor lag for critical topics
kubectl exec --namespace <shadow-namespace> <shadow-pod-name> --container redpanda -- \
rpk shadow status <shadow-link-name> | grep -E "LAG|Lag"
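If jq is not available, the default table output of kubectl get shadowlink can be filtered instead. This is a minimal sketch, assuming the output has the two columns shown earlier (NAME and SYNCED) in that order. The function name unsynced_shadowlinks is hypothetical.

```shell
# Sketch only: reads `kubectl get shadowlink` table output on stdin and
# prints the name of every link whose SYNCED column is not "True".
# Assumes the columns are NAME then SYNCED, with one header row.
unsynced_shadowlinks() {
  awk 'NR > 1 && $2 != "True" { print $1 }'
}
```

Pipe kubectl get shadowlink --namespace <shadow-namespace> into unsynced_shadowlinks. Any name it prints is a candidate for kubectl describe.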
Alert conditions
Configure monitoring alerts for the following conditions, which indicate problems with Shadowing:
- High replication lag: When redpanda_shadow_link_shadow_lag exceeds your RPO requirements
- Connection errors: When redpanda_shadow_link_client_errors increases rapidly
- Topic state changes: When topics move to FAULTED state
- Task failures: When replication tasks enter FAULTED or NOT_RUNNING states
- Throughput drops: When bytes/records fetched drops significantly
- Link unavailability: When tasks show LINK_UNAVAILABLE indicating source cluster connectivity issues
For more information about shadow link tasks and their states, see Shadow link tasks.
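Until metric-based alerts are in place, the task-related conditions above can be approximated from the command line. The following is a sketch, assuming the TASKS section of rpk shadow status sits between a line reading TASKS and a line reading TOPICS, as in the sample output. The function name unhealthy_tasks is hypothetical.

```shell
# Sketch only: reads `rpk shadow status` output on stdin and prints every
# row of the TASKS section whose state is FAULTED, NOT_RUNNING, or
# LINK_UNAVAILABLE. Task names contain spaces, so the state is matched
# as a whole word anywhere in the row.
unhealthy_tasks() {
  awk '
    NF == 1 && $1 == "TASKS"  { in_tasks = 1; next }
    NF == 1 && $1 == "TOPICS" { in_tasks = 0 }
    in_tasks && /(^|[ \t])(FAULTED|NOT_RUNNING|LINK_UNAVAILABLE)([ \t]|$)/ { print }
  '
}
```

An empty result means no task row reported one of the three states. Any printed row includes the broker ID, shard, and reason, which helps narrow down where to look next.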