Solana RPC observability: what to measure and how

Written by: Maksym Bogdan
9 min read
Date: March 18, 2026
Updated on: March 17, 2026

Solana produces a new block every 400 milliseconds. At that tempo, a single second of RPC lag puts your application two to three slots behind the current state of the network. You are no longer reading the chain—you are reading its recent memory. And the troubling part: it usually looks fine. Responses come back with 200 OK. Latency dashboards are green. The data is just wrong.

Most RPC failures on Solana produce no error signals. You get stale data that looks correct, dropped transactions that return success codes, and slot lag that accumulates silently—visible only when something downstream breaks.

This failure mode is unique to Solana's architecture. On slower chains, a 1–2 second delay is noise. On Solana, it is three to five missed slots—enough to invalidate a blockhash, miss a trading window, or return account state from a fork that never finalized.

Observability on Solana is not about catching crashes. It is about catching drift—the gradual, silent separation between what your RPC node believes and what the network actually is.

In 2025, Solana processed over 600 million daily RPC requests across 2,000+ live dApps. During peak congestion events, the gap between the fastest and slowest RPC providers exceeded 400ms—equivalent to a full slot—while both reported normal HTTP response times.

Four failure modes you need to instrument

Before building dashboards, it helps to understand exactly how Solana RPC degrades. There are four distinct failure modes, and each requires its own instrumentation.

1. Slot lag

Slot lag is the difference between your RPC node's current slot and the true tip of the network. A lag of 0–1 slots is acceptable under normal conditions. Consistent lag of 2+ slots means the node is overloaded, poorly peered, or falling behind on ledger replay.

| Slot lag | Meaning | Impact |
| --- | --- | --- |
| 0–1 slots | Normal—within network jitter | No impact |
| 2–3 slots | Node under load or peer delay | Stale reads, potential blockhash issues |
| 4–5 slots | Serious—likely overloaded or forked | Transaction failures, simulation errors |
| >5 slots | Node effectively unusable | All time-sensitive operations fail |

The critical detail: a lagging RPC node can respond to getSlot() with a stale slot number while returning a 200 HTTP response. HTTP latency and data freshness are entirely separate metrics. A heavily cached, overloaded node can reply in 30ms—with data that is 800ms stale.

2. Tail latency (p99)

Averages lie. If your average RPC latency is 50ms but your p99 is 2 seconds, you will fail precisely at the moments when it matters most—during high-volatility periods when the market is moving and every millisecond counts.

A p99 latency of 2s on a bot that fires during volatility events means the slowest 1% of requests—the ones that arrive exactly when price is moving—miss their window entirely.

Production RPC monitoring must track the full latency distribution: p50 (median), p90, and p99. Spikes in the tail typically indicate one of three things: garbage collection pauses on shared infrastructure, disk I/O bottlenecks during heavy getProgramAccounts queries, or rate limiting starting to apply against your connection.

| Percentile | What it reveals | Alert threshold |
| --- | --- | --- |
| p50 | Median request latency—baseline health | > 150ms for latency-sensitive apps |
| p90 | 90th percentile—common load behavior | > 300ms sustained over 5 minutes |
| p99 | Worst 1%—behavior under peak load | > 1000ms—indicates infrastructure problem |
| p99.9 | Extreme tail—critical path failures | > 3000ms—immediate investigation |
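Tail percentiles are simple to compute over a sliding window of samples. A minimal sketch using the nearest-rank method (the sample values below are illustrative, not measurements):

```typescript
// Nearest-rank percentile over a window of latency samples (ms).
// Copying before sorting avoids mutating the caller's buffer.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

// 98 fast requests and 2 slow ones: the median hides the tail entirely.
const latencies = [...Array(98).fill(50), 2000, 2000];
console.log(`p50=${percentile(latencies, 50)}ms p99=${percentile(latencies, 99)}ms`);
// p50=50ms p99=2000ms
```

This is exactly why averages lie: the mean of this window is ~89ms, yet 2% of requests took 2 seconds.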

3. Transaction drop rate

sendTransaction() returns a transaction signature immediately—before the transaction has been forwarded to the leader, let alone confirmed. A successful HTTP response tells you the RPC node received the transaction. It says nothing about whether it reached the validator.

Transactions are dropped silently in several scenarios:

  • The RPC node's rebroadcast queue is full (drops new submissions when queue exceeds 10,000 transactions)
  • The node is lagging and forwards the transaction to the wrong leader slot
  • The blockhash was fetched from one node in an RPC pool and submitted to a lagging node in the same pool—the blockhash appears unrecognized
  • A temporary network fork causes the transaction to reference a blockhash on a minority fork that is later abandoned

Measuring actual transaction landing rate requires end-to-end tracking: send a memo transaction, poll getSignatureStatuses() every 400ms, and record whether it lands within 3 slots (~1.2 seconds). Run this test against high-congestion windows (typically 14:00–18:00 UTC) to stress the propagation path.

4. Geyser stream drift

For bots and applications consuming Yellowstone gRPC, the relevant metric is not HTTP latency—it is the delay between an account state change on the validator and when your subscriber receives the update. This is Geyser stream drift.

Geyser drift compounds with slot lag: if your node is 2 slots behind and your Geyser subscription has 40ms stream latency, you are effectively 800ms + 40ms behind the tip. For arbitrage bots, that margin is the difference between landing and missing.

| Data path | Typical latency | Notes |
| --- | --- | --- |
| Geyser gRPC (local, tuned) | < 10ms | Sub-slot freshness, requires dedicated node |
| Yellowstone via provider | 10–50ms | Depends on provider peering and node load |
| WebSocket subscription | 100–300ms | Filtered but slower; degraded under congestion |
| HTTP polling (getAccountInfo) | 100–500ms+ | Worst option; never use for time-sensitive data |
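The compounding arithmetic above is worth encoding directly, since it is the number an arbitrage bot actually cares about. A minimal sketch, assuming Solana's nominal 400ms slot time:

```typescript
// Effective staleness behind the tip = slot lag converted to milliseconds
// plus stream delivery latency. 400ms is the nominal slot time.
const SLOT_MS = 400;

function effectiveStalenessMs(slotLag: number, streamDelayMs: number): number {
  return slotLag * SLOT_MS + streamDelayMs;
}

// The example from the text: 2 slots behind with 40ms stream latency.
console.log(effectiveStalenessMs(2, 40)); // 840
```

Tracking this combined figure as a single gauge makes it obvious when a "fast" Geyser stream is riding on a lagging node.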

How to instrument Solana RPC observability

Slot freshness tracking

The foundation of any Solana RPC observability stack is continuous slot freshness monitoring. The approach: poll getSlot() from your RPC endpoint every 200ms, and compare against a reference source—a direct validator connection, or at minimum two separate paid providers.

// Continuous slot lag monitor (TypeScript)
import { Connection } from '@solana/web3.js';

const myRpc  = new Connection('https://your-rpc-endpoint', 'processed');
const refRpc = new Connection('https://reference-rpc-endpoint', 'processed');

async function checkSlotLag() {
  try {
    const [mySlot, refSlot] = await Promise.all([
      myRpc.getSlot(),
      refRpc.getSlot(),
    ]);
    const lag = refSlot - mySlot;
    console.log(`Slot lag: ${lag} | my: ${mySlot} | ref: ${refSlot}`);
    if (lag > 2) {
      // Route to your pager/alerting system here (alert() is browser-only)
      console.error(`WARN: slot lag ${lag}—check node health`);
    }
  } catch (err) {
    // A failed check must not kill the interval loop
    console.error('Slot check failed:', err);
  }
}

setInterval(checkSlotLag, 200); // every ~half-slot

Prometheus metrics for RPC nodes

For self-hosted or managed RPC nodes, Prometheus is the standard collection layer. Solana's internal metrics are exposed at the node level; a custom exporter bridges them to Prometheus format.

Key metrics to collect:

| Metric | Type | Alert condition |
| --- | --- | --- |
| solana_rpc_slot_lag | Gauge | > 2 for more than 30s |
| solana_rpc_request_latency_p99 | Histogram | > 1000ms sustained |
| solana_rpc_requests_total{status="error"} | Counter | Error rate > 1% of total |
| solana_rpc_tx_dropped_total | Counter | Any non-zero value during normal operation |
| solana_rpc_tx_landed_rate | Gauge | < 95% landing rate over 5 min window |
| solana_ledger_replay_lag_ms | Gauge | > 400ms (one slot worth) |
| solana_geyser_stream_delay_ms | Histogram | > 50ms p99 for latency-sensitive subscribers |
| solana_rpc_rate_limit_hits_total | Counter | Any spike > 10/min |
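A custom exporter for these metrics does not need much machinery. A hedged, stdlib-only sketch that hand-rolls the Prometheus text exposition format for the slot-lag gauge (port 9102 is an arbitrary choice; in production you would more likely use the prom-client package):

```typescript
import { createServer, Server } from 'node:http';

let slotLag = 0; // your slot monitor updates this every 200ms
function setSlotLag(lag: number) { slotLag = lag; }

// Prometheus text exposition format: HELP/TYPE comments, then samples,
// terminated by a trailing newline.
function renderMetrics(): string {
  return [
    '# HELP solana_rpc_slot_lag Slots behind the reference source',
    '# TYPE solana_rpc_slot_lag gauge',
    `solana_rpc_slot_lag ${slotLag}`,
    '',
  ].join('\n');
}

function startExporter(port = 9102): Server {
  return createServer((_req, res) => {
    res.setHeader('Content-Type', 'text/plain; version=0.0.4');
    res.end(renderMetrics());
  }).listen(port);
}
```

Point a Prometheus scrape job at the exporter's port and the gauge shows up under `solana_rpc_slot_lag`, ready for the alert conditions in the table above.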

For Geyser/Yellowstone observability specifically, the Prometheus exporter approach using Dragon's Mouth gRPC streams provides sub-second slot metrics. The exporter subscribes to live slot updates over gRPC and exposes them at /metrics for Prometheus scraping:

# Example Prometheus output from Yellowstone exporter
# HELP solana_latest_slot Most recent slot from Geyser stream
# TYPE solana_latest_slot gauge
solana_latest_slot 347291042

# HELP solana_geyser_stream_delay_ms Delay between slot production and receipt
# TYPE solana_geyser_stream_delay_ms histogram
solana_geyser_stream_delay_ms_bucket{le="10"} 8821
solana_geyser_stream_delay_ms_bucket{le="50"} 9944
solana_geyser_stream_delay_ms_bucket{le="100"} 9997
solana_geyser_stream_delay_ms_bucket{le="+Inf"} 10000

Grafana dashboard structure

A production Grafana dashboard for Solana RPC observability should be organized into four panels:

| Panel | Metrics displayed | Time range |
| --- | --- | --- |
| Slot health | Slot lag (gauge), slot height vs. reference, fork alignment | Last 15 min, 1s resolution |
| Latency distribution | p50/p90/p99 heatmap per RPC method, tail spike frequency | Last 1 hour, rolling |
| Transaction pipeline | sendTransaction rate, landing rate %, drop count, 429 error rate | Last 30 min |
| Geyser stream | Stream delay histogram, subscription count, reconnect events | Last 15 min, 200ms resolution |

Alertmanager rules that matter most in practice:

  • Slot lag > 2 for more than 30 consecutive seconds → page on-call
  • p99 latency > 1000ms sustained for 5 minutes → investigate node load or peering
  • Transaction landing rate < 95% over any 5-minute window → check leader forwarding logic
  • Geyser reconnect event → stream dropped; verify subscription is healthy
  • Disk IOPS > 90% capacity → ledger replay will fall behind; add capacity or reduce indexing scope
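The first three rules translate directly into Prometheus alerting rules. A sketch under the assumption that the metric names from the table above are in place, and that solana_rpc_tx_landed_rate is exported as a 0–1 ratio:

```yaml
# Hedged sketch: Prometheus alerting rules for the thresholds above.
groups:
  - name: solana-rpc
    rules:
      - alert: SolanaSlotLagHigh
        expr: solana_rpc_slot_lag > 2
        for: 30s
        labels: { severity: page }
        annotations:
          summary: "RPC node is {{ $value }} slots behind reference"
      - alert: SolanaP99LatencyHigh
        expr: solana_rpc_request_latency_p99 > 1000
        for: 5m
        labels: { severity: warn }
        annotations:
          summary: "p99 latency above 1s — check node load or peering"
      - alert: SolanaTxLandingRateLow
        expr: solana_rpc_tx_landed_rate < 0.95
        for: 5m
        labels: { severity: page }
        annotations:
          summary: "Transaction landing rate below 95% over 5m"
```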

Example of a production Solana RPC observability dashboard, showing HTTP request volume, RPC call rate, WebSocket connections, and per-method breakdown.

What most teams get wrong

Three patterns appear repeatedly when teams try to build Solana RPC observability and fail:

Measuring HTTP latency instead of data freshness

A getSlot() call that returns in 30ms is useless if the slot number is stale. HTTP response time and data freshness are orthogonal. A heavily cached RPC node will respond instantly with data from three slots ago. The only way to detect this is to compare the returned slot against a reference source—not against your own response time baseline.

Trusting sendTransaction() success responses

The sendTransaction() method returns a transaction signature immediately upon receipt by the RPC node—not upon forwarding to the leader, not upon inclusion in a block. Most teams treat a 200 response as confirmation that the transaction is in flight. It is not. The only valid measure of transaction health is landing rate: how many transactions submitted actually appear on-chain within 3 slots.

Testing only during off-peak hours

Any RPC provider looks good at 03:00 UTC. The metrics that determine whether your infrastructure is viable are the ones collected during peak congestion—typically during high-volume trading sessions, NFT launches, or major protocol events. Build your baseline during normal conditions, then specifically run stress tests and comparisons during known high-activity windows.

Observability as infrastructure

Most teams add monitoring after something breaks. On Solana, that is already too late. Slot lag accumulates gradually. Transaction drop rates rise slowly before they become critical. Geyser streams disconnect with no visible error until your bot misses ten consecutive opportunities.

Production-grade observability for Solana RPC requires tracking four things continuously: slot lag against a reference source, the full latency distribution (not just averages), end-to-end transaction landing rates, and Geyser stream delay. Everything else is secondary.

"We see the same pattern repeatedly: teams build fast bots, route them through underpowered RPC, and wonder why they miss opportunities. The monitoring shows the problem immediately—3-slot lag, p99 above 800ms, landing rate under 90%. Once we move them to bare-metal with Yellowstone feeds and wire up proper Prometheus metrics, the picture changes completely. The bots don't get faster—they stop losing."

— RPC Fast 

RPC Fast provides production Solana RPC infrastructure with built-in observability—Yellowstone gRPC, Prometheus-compatible metrics, Grafana dashboards, and 24/7 alerting on slot lag and transaction health.
