Add metrics for network message sizes and aggregated signature sizes#314

Open
pablodeymo wants to merge 5 commits into main from metrics-network-signature-sizes

Conversation

@pablodeymo
Collaborator

@pablodeymo pablodeymo commented Apr 21, 2026

Motivation

We want visibility into how large the messages we send on the wire are,
and how large individual aggregated signatures get. The existing gossip
size histograms only covered the receive path and measured a single
(uncompressed) number, so we couldn't answer questions like "how much
bandwidth does snappy save?" or "how do publish sizes compare to what we
receive from the mesh?" — and we had no signal at all on req/resp sizes
or on aggregated proof sizes.

Description

New and changed metrics

| Metric | Type | Labels | Buckets (bytes) |
| --- | --- | --- | --- |
| `lean_gossip_block_size_bytes` | Histogram | `compression=raw,snappy` | 10k, 50k, 100k, 250k, 500k, 1M, 2M, 5M |
| `lean_gossip_attestation_size_bytes` | Histogram | `compression=raw,snappy` | 512, 1024, 2048, 4096, 8192, 16384 |
| `lean_gossip_aggregation_size_bytes` | Histogram | `compression=raw,snappy` | 1024, 4k, 16k, 64k, 128k, 256k, 512k, 1M |
| `lean_reqresp_request_size_bytes` | Histogram | `protocol=status,blocks_by_root`; `compression=raw,snappy` | 64, 128, 256, 512, 1k, 4k, 16k, 64k |
| `lean_reqresp_response_chunk_size_bytes` | Histogram | `protocol=status,blocks_by_root`; `compression=raw,snappy` | 128, 1k, 10k, 100k, 500k, 1M, 5M, 10M |
| `lean_aggregated_proof_size_bytes` | Histogram | (none) | 1024, 4k, 16k, 64k, 128k, 256k, 512k, 1M |

The three existing lean_gossip_*_size_bytes histograms gained a
compression label (raw = SSZ-encoded, snappy = on-wire), and are
now observed on both publish and receive paths instead of only receive.

Implementation

  • crates/net/p2p/src/metrics.rs: converts the three gossip size
    histograms to HistogramVec and adds the two new req/resp histograms.
    Each observe_*_size(raw, snappy) function records both points in one
    call.
  • crates/net/p2p/src/gossipsub/handler.rs: on receive, capture
    message.data.len() as the compressed size before decompressing; on
    publish (publish_block, publish_attestation,
    publish_aggregated_attestation), observe both ssz_bytes.len() and
    compressed.len() right after compression.
  • crates/net/p2p/src/req_resp/encoding.rs: write_payload now
    returns the compressed byte count; decode_payload returns a small
    DecodedPayload { uncompressed, compressed_size } struct. This lets
    the codec record both sizes without re-encoding.
  • crates/net/p2p/src/req_resp/codec.rs: observes request and
    per-chunk response sizes for Status and BlocksByRoot at every
    read/write site, using a short protocol_label() helper to map the
    full libp2p protocol id to the metric label.
  • crates/blockchain/src/metrics.rs: registers
    lean_aggregated_proof_size_bytes and its
    observe_aggregated_proof_size() helper; both included in the
    init() eager-registration list.
  • crates/blockchain/src/aggregation.rs: observes the proof size
    right after AggregatedSignatureProof::new() in aggregate_job,
    using proof.proof_data.len() — the variable-length portion of the
    post-quantum aggregated signature.
  • crates/common/metrics/src/lib.rs: re-exports HistogramVec and
    register_histogram_vec! so labeled histograms can be used
    workspace-wide.

Devnet-compat fix (bundled, separate commit)

While validating these metrics in a local devnet I found that ethlambda
cannot launch against the current lean-quickstart generator output,
blocking make run-devnet on main. The generator still produces the
single-key layout (one XMSS key per validator, flat-string
GENESIS_VALIDATORS, single validator_N_sk.ssz), which other clients
(zeam, lantern) already accept by using the key for both roles.

Commit Accept single-key genesis + annotated-validator layouts
extends two loaders to accept either layout:

  • GenesisValidatorEntry deserializes from either a map
    (attestation_pubkey + proposal_pubkey) or a flat hex string (used
    for both roles).
  • classify_role / read_validator_keys gain a Shared role for
    files without attester/proposer in the name, populating both role
    slots from the same key file. Mixing Shared with role-specific
    entries for the same validator is still an error.

Round-trip test added for the flat-string genesis variant.

Documentation

  • docs/metrics.md: adds the new lean_aggregated_proof_size_bytes
    to the PQ Signature Metrics table and documents the five network size
    histograms under Network Metrics.

How to Test

  1. make fmt — passes

  2. make lint — passes (clippy clean, -D warnings)

  3. cargo test -p ethlambda-types --lib genesis — 4 passed (includes the
    flat-pubkey round-trip)

  4. cargo test -p ethlambda-metrics -p ethlambda-p2p -p ethlambda-blockchain --lib --bins — passes

  5. cargo test -p ethlambda-state-transition --release — passes

  6. Live devnet scrape (4-node local devnet, 4 min uptime, 2 zeam +
    2 ethlambda with ethlambda_1 as aggregator). Metrics pulled from
    ethlambda_1 on the aggregator:

    lean_gossip_attestation_size_bytes{compression=raw}    n=124  mean=2.7 kB  p50/p99 ∈ (2.0 kB, 4.1 kB]
    lean_gossip_attestation_size_bytes{compression=snappy} n=124  mean=2.6 kB  p50/p99 ∈ (2.0 kB, 4.1 kB]
    lean_reqresp_request_size_bytes{protocol=status,raw}    n=7  mean=80 B    p50 ∈ (64 B, 128 B]
    lean_reqresp_request_size_bytes{protocol=status,snappy} n=7  mean=58 B    p50 ∈ (0 B, 64 B]
    lean_reqresp_response_chunk_size_bytes{protocol=status,raw}    n=7  mean=80 B
    lean_reqresp_response_chunk_size_bytes{protocol=status,snappy} n=7  mean=58 B
    

    Attestation payloads are already tight, so snappy only shaves ~3%.
    Status frames (tiny structs) compress ~28% (80 B → 58 B).
    The block, aggregation, and BlocksByRoot histograms had zero
    observations — ethlambda could not produce blocks in this run
    (pre-existing cross-client signature scheme mismatch with the pinned
    zeam image; not related to this PR).

Notes

The forkchoice_spectests and signature_spectests fixture tests fail
with the same errors on main (XmssSignature length != 2536 and
ProofError(InvalidProof) — pre-existing devnet-3/devnet-4 fixture
mismatches). This PR does not touch signature or SSZ types.

Extend the three `lean_gossip_{block,attestation,aggregation}_size_bytes`
histograms with a `compression` label (raw SSZ vs snappy on-wire) and
observe them on both publish and receive paths instead of only receive.

Add new `lean_reqresp_request_size_bytes` and
`lean_reqresp_response_chunk_size_bytes` histograms for the Status and
BlocksByRoot protocols, labeled by `{protocol, compression}`. The
req/resp encoding helpers (`write_payload`, `decode_payload`) now return
the compressed byte count so the codec can record both sizes without
re-encoding.

Add `lean_aggregated_proof_size_bytes` histogram observed in
`aggregate_job` after each `AggregatedSignatureProof` is built,
recording the size of `proof_data` — the variable-length portion of
post-quantum aggregated signatures.

Re-export `HistogramVec` and `register_histogram_vec!` from
`ethlambda-metrics` so labeled histograms can be used workspace-wide.
@github-actions

🤖 Kimi Code Review

This is a solid observability PR that adds comprehensive metrics for P2P message sizes. The implementation is clean, secure, and follows Rust best practices.

Correctness & Security

  • No security issues introduced. The PR only adds instrumentation around existing encoding/decoding logic.
  • Proper handling of size limits remains intact in encoding.rs (MAX_PAYLOAD_SIZE checks).
  • No panic paths introduced in metric hot paths.

Code Quality

  • Good use of HistogramVec with low-cardinality labels (protocol, compression) to avoid cardinality explosion.
  • The DecodedPayload struct in encoding.rs (lines 11-14) elegantly returns both uncompressed data and compressed size without requiring the caller to re-measure.
  • Consistent error handling: metrics are only recorded on successful encode/decode operations.

Minor Suggestion
In crates/net/p2p/src/req_resp/codec.rs, the error response path in write_response (lines 165-169) does not record metrics, while success paths do. For consistency, consider observing the error payload size:

Response::Error { message } => {
    io.write_all(&[ResponseCode::ERROR.into()]).await?;
    let encoded = message.to_ssz();
    let compressed_size = write_payload(io, &encoded).await?;
    metrics::observe_reqresp_response_chunk_size(label, encoded.len(), compressed_size);
    Ok(())
}

Documentation

  • The docs/metrics.md update correctly documents all new histograms and their label values.

Performance

  • Metric observations are lock-free histogram updates (atomic increments) and won't block the async runtime.
  • No unnecessary allocations; the protocol_label function returns &'static str.

Nitpick
In crates/blockchain/src/aggregation.rs (line 293), the variable proof is created separately before being moved into the struct. This is fine for readability, though you could inline it if preferred.

Overall: LGTM. The PR significantly improves network observability without touching consensus-critical logic.


Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt

@github-actions

🤖 Codex Code Review

  1. Response::Error chunks are not counted in the new req/resp response-size metric, even though the receive side does count them and the docs describe the metric as covering req/resp response chunks generally. The gap is in codec.rs: after encoding the error payload, write_payload is called and the result is discarded, with no observe_reqresp_response_chunk_size(...). This will systematically under-report outbound error traffic and make send/receive comparisons misleading. Add the same observation here that you add for success chunks.

No consensus-layer correctness or security regressions stood out in this diff. The changes are confined to metrics/instrumentation; nothing here touches fork choice, attestation validation, STF, XMSS verification, or SSZ semantics beyond measuring encoded sizes.

I could not complete a cargo check in this environment: the toolchain/cargo cache paths under /home/runner are read-only, and network access is restricted, so dependency resolution failed before compilation.


Automated review by OpenAI Codex · gpt-5.4 · custom prompt

@github-actions

🤖 Claude Code Review



PR Review: Add metrics for network message sizes and aggregated signature sizes

Overall: A clean, well-motivated observability PR. The design (dual raw/snappy labels, DecodedPayload struct, protocol_label helper) is solid and follows existing project conventions. A few issues worth addressing before merge.


Bug: compressed_size over-reports for multi-chunk BlocksByRoot responses

encoding.rs:36:

let (size, rest) = decode_varint(&buf)?;
let compressed_size = rest.len();

buf is filled by read_to_end, which consumes all remaining bytes from io (bounded only by MAX_COMPRESSED_PAYLOAD_SIZE). For a BlocksByRoot response stream carrying N blocks:

[varint1][snappy1][result_byte2][varint2][snappy2]…

rest after stripping varint1 includes the compressed bytes of all subsequent chunks. So on the first call to decode_payload inside decode_blocks_by_root_response, decoded.compressed_size is the entire remainder of the stream, not just the current chunk. The raw-vs-snappy ratio this metric is supposed to answer becomes meaningless for multi-block responses.

A correct approach would be to have write_payload return the compressed size (already done) and compare against size as usize on the decode side, or to at least cap the reported compressed_size at the actual snappy frame bytes consumed. As a short-term fix, consider adding a note in the DecodedPayload doc-comment that this field is accurate only for single-chunk streams, until the decode path is refactored.


Minor: Error responses on the write path are not measured

codec.rs:163–172 — the Response::Error arm calls write_payload but discards the returned compressed size and records no metric, while the read path (decode_status_response, decode_blocks_by_root_response) does record metrics for error chunks. If the intent is to measure all on-wire traffic, the write side should observe here too. If error sizes are intentionally excluded, the comments should say so.


Minor: Request size observed on both send and receive sides under the same metric name

codec.rs:47 (read_request) and codec.rs:104–106 (write_request) both call observe_reqresp_request_size. Each peer records its own view (requester records what it sent, responder records what it received), so there is no double-counting within one process. This is correct — but worth a brief comment so a future reader doesn't assume it's a bug.


Nit: HistogramVec label cardinality protection

metrics.rs:20–26: protocol_label maps unknown protocols to "unknown". Good defensive practice that prevents metric cardinality explosion if new protocols are added without updating this function. No change needed, just worth calling out as a good pattern.


Nit: Gossip metrics observed before publish returns

handler.rs:144,179,207 — metrics are observed before the swarm_handle.publish call. If publish is fire-and-forget (it is here), this is harmless, but it means a failed publish still increments the histogram. Acceptable for this codebase since publish doesn't return a Result, but worth knowing.


Documentation

docs/metrics.md additions are accurate and complete. Bucket choices are reasonable for XMSS-heavy payloads (blocks can be large; individual attestations are small). The comment block on compression label values ("raw" / "snappy") in metrics.rs:73–75 is helpful.


What's good

  • DecodedPayload struct cleanly keeps encoding details co-located rather than returning multiple values or adding extra parameters.
  • Re-exporting HistogramVec + register_histogram_vec! from ethlambda-metrics follows the workspace re-export convention.
  • All new metrics use the lean_ prefix and are added to init() in blockchain/src/metrics.rs (consistent with existing eager registration pattern).
  • The p2p-side histograms (no init()) are consistent with existing p2p metrics that use lazy initialization.
  • observe_aggregated_proof_size correctly uses proof.proof_data.len() — measuring just the variable-length post-quantum proof bytes, not the full AggregatedSignatureProof struct size.

Automated review by Claude (Anthropic) · sonnet · custom prompt

@greptile-apps
Contributor

greptile-apps Bot commented Apr 21, 2026

Greptile Summary

This PR adds raw/snappy size histograms for gossip messages (block, attestation, aggregation) on both send and receive paths, two new req/resp size histograms (request and per-chunk response), and a histogram for aggregated proof sizes. The implementation is well-structured: DecodedPayload cleanly carries both the uncompressed bytes and compressed size from decode_payload, write_payload now returns the compressed byte count, and protocol_label() keeps codec callsites tidy. The only minor gap is that the Network Metrics table in docs/metrics.md is missing a Buckets column present in other sections.

Confidence Score: 5/5

Safe to merge — changes are purely additive observability instrumentation with no behavior changes on the critical path.

All findings are P2 (documentation style). No logic errors, data loss, or correctness issues were found. The metric measurements are accurate for the cases that occur in practice, and the implementation follows existing patterns consistently.

docs/metrics.md — missing Buckets column in the Network Metrics table.

Important Files Changed

Filename Overview
crates/net/p2p/src/req_resp/encoding.rs Adds DecodedPayload struct returning (uncompressed, compressed_size) from decode_payload, and makes write_payload return the compressed byte count. Logic is correct: compressed_size is derived from rest.len() (bytes after the varint), accurately capturing the on-wire snappy bytes for single-chunk responses.
crates/net/p2p/src/req_resp/codec.rs Adds protocol_label() helper and observes request/response sizes at every read/write site. Error-response payloads are also counted in the response-chunk histogram — reasonable since they're still on-wire traffic.
crates/net/p2p/src/gossipsub/handler.rs Records raw+snappy sizes on both receive and publish paths for blocks, attestations, and aggregated attestations. Compressed size is captured before decompression on receive; SSZ and compressed lengths measured after compression on publish.
crates/net/p2p/src/metrics.rs Converts three gossip histograms from Histogram to HistogramVec with a compression label, and adds two new req/resp HistogramVec metrics. Helper functions observe_*_size(raw, snappy) cleanly record both label values per call.
crates/blockchain/src/metrics.rs Adds LEAN_AGGREGATED_PROOF_SIZE_BYTES histogram with appropriate byte-range buckets (1 KiB–1 MiB) and a clean observe_aggregated_proof_size() helper. Correctly included in init() for eager Prometheus registration.
crates/blockchain/src/aggregation.rs Observes proof.proof_data.len() right after AggregatedSignatureProof::new(). Observation is placed correctly — after the proof is built successfully, before Some(output) is returned.
crates/common/metrics/src/lib.rs Re-exports HistogramVec and register_histogram_vec! so they are available workspace-wide. Minimal, correct change.
docs/metrics.md Documents the new network size histograms and the aggregated proof size histogram. The Network Metrics table is missing a Buckets column present in other sections.

Comments Outside Diff (1)

  1. docs/metrics.md, line 76-88 (link)

    P2 Network Metrics table missing Buckets column

    The Network Metrics table header is missing a "Buckets" column, while the PQ Signature Metrics and Fork-Choice Metrics tables both include one. The five new size histograms all have non-trivial bucket choices that are worth documenting here for quick reference.

    Suggested header:

    ```suggestion
    | Name   | Type  | Usage | Sample collection event | Labels | Buckets | Supported |
    ```


MegaRedHand and others added 4 commits April 21, 2026 17:45
The `lean-quickstart` genesis generator still emits a single XMSS key per
validator: `config.yaml`'s `GENESIS_VALIDATORS` is a flat hex-string list,
and `annotated_validators.yaml` references one `validator_N_sk.ssz` file
per validator with no `attester`/`proposer` split. zeam and lantern
already treat that shared key as both the attestation and proposal key
for the same validator; ethlambda rejected both files outright, which
prevented launching the client against the generator's output.

Extend the two loaders to accept the single-key layout in addition to
the canonical dual-key one:

- `GenesisValidatorEntry` (`crates/common/types/src/genesis.rs`) now
  deserializes either a map with `attestation_pubkey` + `proposal_pubkey`
  or a plain hex string; the flat string is used for both roles.
- `classify_role` / `read_validator_keys` in `bin/ethlambda/src/main.rs`
  add a `Shared` role that populates both attester and proposer slots
  from the same private-key file. Files with the existing
  `attester` / `proposer` substrings keep their prior routing.
  A `Shared` entry cannot be mixed with role-specific entries for the
  same validator.

Added a round-trip test for the flat-string genesis variant. Unblocks
running ethlambda against `make run-devnet` / lean-quickstart's current
generator output.