
Remote execution implementation #939

Open
tmckayus wants to merge 2 commits into NVIDIA:main from tmckayus:grpc-server-v4

Conversation


@tmckayus tmckayus commented Mar 6, 2026

This change replaces the solve_lp_remote and solve_mip_remote stubs with real routines that use an embedded gRPC client to communicate with a remote cuOpt server.

Two documents, GRPC_ARCHITECTURE.md and SERVER_ARCHITECTURE.md, are included and act as developer guides.

The server is built by build.sh and is called cuopt_grpc_server.

Remote execution is still enabled via the env vars CUOPT_REMOTE_HOST and CUOPT_REMOTE_PORT.

To try this feature, check out the branch/PR and do the following:

$ build.sh
$ cuopt_grpc_server    # defaults are fine; run with -h for help
$ CUOPT_REMOTE_HOST=localhost CUOPT_REMOTE_PORT=8765 cuopt_cli myproblem.mps

All cuOpt APIs pick up remote execution when the env vars are set: cuopt_cli, the C API, and the Python API will all solve problems on the server as long as the env vars are set and the server is running. Just use cuOpt tools and APIs as you normally would for a local solve.
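A minimal sketch of the Python side of this (the host/port values are the ones from the example above; nothing beyond the two env vars is required):

```python
import os

# Point all cuOpt APIs at the remote server; localhost:8765 matches the
# CLI example above (a demo setup, not a fixed default).
os.environ["CUOPT_REMOTE_HOST"] = "localhost"
os.environ["CUOPT_REMOTE_PORT"] = "8765"

# From here an ordinary local-style solve is routed to the server, e.g.:
#   from cuopt import linear_programming
#   solution = linear_programming.Solve(data_model, settings)
```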

@tmckayus tmckayus requested review from a team as code owners March 6, 2026 14:56
@tmckayus tmckayus requested review from Iroy30, aliceb-nv, bdice and rg20 March 6, 2026 14:56
@tmckayus tmckayus added feature request New feature or request non-breaking Introduces a non-breaking change labels Mar 6, 2026

coderabbitai bot commented Mar 6, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds a complete gRPC remote execution subsystem: protobuf schemas and codegen, a full C++ gRPC client API and CLI test client, a production gRPC server with worker subprocess IPC and chunked upload/download, bidirectional mappers and settings converters, build/test integrations, extensive unit/integration tests, and architecture docs.
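As a rough illustration of the chunked-transfer bookkeeping the summary mentions (the function names here are illustrative, not the PR's actual API):

```cpp
#include <cstdint>

// Ceiling division: how many fixed-size chunks cover the payload.
// Illustrative only; the PR's real helpers may differ in name and shape.
inline int64_t chunk_count(int64_t total_bytes, int64_t chunk_bytes)
{
  if (total_bytes <= 0 || chunk_bytes <= 0) { return 0; }
  return (total_bytes + chunk_bytes - 1) / chunk_bytes;
}

// Size of chunk `index`: every chunk is full-size except possibly the last.
inline int64_t chunk_size_at(int64_t index, int64_t total_bytes, int64_t chunk_bytes)
{
  const int64_t remaining = total_bytes - index * chunk_bytes;
  return remaining < chunk_bytes ? remaining : chunk_bytes;
}
```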

Changes

  • Documentation (GRPC_ARCHITECTURE.md, SERVER_ARCHITECTURE.md): New architecture documents describing the gRPC protocol, chunked transfer protocol, server multi-process model, IPC (shared memory & pipes), worker lifecycle, TLS/mTLS, env vars, and build/test notes.
  • Top-level build script (build.sh): Adds a cuopt_grpc_server option to the build script and help text; enables a gRPC-only build path and conditional cmake invocation for the server target.
  • CMake & codegen (cpp/CMakeLists.txt): Enables CUOPT_ENABLE_GRPC, locates protoc and the gRPC plugin, generates protobuf/gRPC sources, adds generated sources and new gRPC mapper/client files to the build, and adds cuopt_grpc_server and test_grpc_client targets.
  • Protobuf schemas & services (cpp/src/grpc/cuopt_remote.proto, cpp/src/grpc/cuopt_remote_service.proto): New protobuf messages/enums for problems, solver settings, results, job lifecycle, and chunked array/result descriptors; new CuOptRemoteService RPCs (SubmitJob, chunked upload/download, status/result/cancel/delete, WaitForCompletion, StreamLogs, GetIncumbents).
  • gRPC client API & CLI (cpp/src/grpc/client/grpc_client.hpp, grpc_client.cu, solve_remote.cu, test_grpc_client.cpp): Adds a public grpc_client_t with config, connect/streaming/status/submit/get/cancel/delete/incumbents, unary and chunked transfer logic, templated solve/submit wrappers, explicit instantiations, test helpers, and a CLI test client.
  • Client-side mappers & builders (cpp/src/grpc/grpc_problem_mapper.*, grpc_service_mapper.*, grpc_solution_mapper.*, grpc_settings_mapper.*): Templated bidirectional mappers between CPU structures and protobufs, size estimation, chunked header population, array chunk builders/serializers, and settings converters; explicit float/double instantiations.
  • Server core & types (cpp/src/grpc/server/grpc_server_main.cpp, grpc_server_types.hpp): Server entry point and type definitions: CLI/TLS parsing, shared-memory queues/control block, job/result queue structs, job tracker, global state, and forward declarations for worker and session management.
  • Server service implementation (cpp/src/grpc/server/grpc_service_impl.cpp): Full CuOptRemoteService implementation handling SubmitJob, the chunked upload/download state machine, CheckStatus/GetResult/WaitForCompletion/Cancel/Delete, StreamLogs, GetIncumbents, validation, and precise gRPC Status handling.
  • Server worker infra & job management (cpp/src/grpc/server/grpc_worker_infra.cpp, grpc_job_management.cpp, grpc_worker.cpp, grpc_server_threads.cpp): Per-worker pipe creation/management, spawn/wait/monitor of workers, the worker_process GPU solve loop, pipe-based IPC serialize/deserialize, job submission queueing, result/incumbent handling, cancellation, and background threads for the monitor/result/incumbent/session reaper.
  • Pipe serialization & small utilities (cpp/src/grpc/server/grpc_pipe_serialization.hpp, grpc_field_element_size.hpp, grpc_incumbent_proto.hpp): Utilities for varint-delimited pipe blobs with 64 MiB chunking, array element-size mapping, and incumbent proto encode/parse helpers.
  • Tests, unit & integration (cpp/tests/linear_programming/grpc/*, cpp/tests/linear_programming/CMakeLists.txt): Extensive gRPC unit tests with a mock stub, test helpers (mock injection, log capture), a gRPC client test executable, and integration tests that launch real server processes (TLS/mTLS, chunked validation); new CMake test targets and install rules.
  • Test helpers & log capture (cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp, grpc_client_test_helper.hpp): Test utilities for capturing client/server logs, client log callbacks, mock-stub injection helpers, and a compute_chunk_size test wrapper.
  • Removed/relocated stubs (cpp/src/pdlp/solve_remote.cu, cpp/src/pdlp/CMakeLists.txt): Removes the prior placeholder remote stub implementations and adjusts CMake so the remote client/server implementations provide runtime behavior.
  • Packaging (conda/recipes/libcuopt/recipe.yaml): Conda recipe updated to add grpc-cpp, libprotobuf, and libuuid build/host/run requirements and to include bin/cuopt_grpc_server in libcuopt outputs.
  • Python test tweak (python/cuopt/.../test_cpu_only_execution.py): Sets the CUOPT_PRESOLVE parameter in a CPU-only warmstart test (imports and uses CUOPT_PRESOLVE).

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 13

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py (1)

152-163: ⚠️ Potential issue | 🟠 Major

Strengthen this warmstart test to assert numerical correctness, not just non-None outputs.

After adding CUOPT_PRESOLVE at Line 152, the test still only checks that a primal solution exists (Line 162). Please also validate warmstart numerical correctness (e.g., objective consistency/improvement within tolerance).

Proposed test-strengthening diff
     sol1 = linear_programming.Solve(dm, settings)
     ws = sol1.get_pdlp_warm_start_data()
+    obj1 = sol1.get_primal_objective()

     if ws is not None:
         settings.set_pdlp_warm_start_data(ws)
         settings.set_parameter(CUOPT_ITERATION_LIMIT, 200)
         sol2 = linear_programming.Solve(dm, settings)
         assert sol2.get_primal_solution() is not None
+        obj2 = sol2.get_primal_objective()
+        assert obj2 is not None
+        assert abs(obj2 - obj1) / max(1.0, abs(obj1)) < 1e-6

As per coding guidelines, **/*test*.{cpp,cu,py} must “Write tests validating numerical correctness of optimization results (not just 'runs without error').”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py`
around lines 152 - 163, The test currently only asserts
sol2.get_primal_solution() is not None; strengthen it by checking numerical
correctness: retrieve both primal solutions (via sol1.get_primal_solution() and
sol2.get_primal_solution()), compute their objective values (either via
solX.get_primal_objective() if available or by forming the dot product of the
primal vector with the problem cost from dm), then assert the objective values
are consistent within a small tolerance (e.g., abs(obj1 - obj2) <= 1e-6) or that
sol2's objective is no worse (<=) than sol1's given the increased
CUOPT_ITERATION_LIMIT; keep tolerance and exact assertion choice explicit in the
test.
🟠 Major comments (18)
cpp/src/grpc/server/grpc_worker.cpp-56-74 (1)

56-74: ⚠️ Potential issue | 🟠 Major

Potential race condition in result queue access.

store_simple_result iterates through result_queue looking for an empty slot without synchronization. If multiple workers call this concurrently, they could both find the same slot with !result_queue[i].ready and overwrite each other's results.

Consider using an atomic compare-exchange on ready similar to the job claiming pattern (lines 86-87), or protect access with a mutex.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_worker.cpp` around lines 56 - 74,
store_simple_result currently scans result_queue and sets a slot when
result_queue[i].ready is false without synchronization, causing a race where
multiple callers can claim the same slot; update this function to claim a slot
atomically (e.g., change result_queue[i].ready to an std::atomic<bool> and
perform an atomic compare-exchange to flip false->true before writing the slot)
or surround the scan-and-write with a mutex; ensure you only write the other
fields (job_id, status, data_size, worker_index, error_message, retrieved) after
successfully claiming the slot to avoid concurrent overwrites.
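The atomic claim this comment suggests can be sketched as follows (simplified slot type with illustrative names; the real result_queue entry has more fields and may also want a separate publish flag or release ordering so readers never observe a half-written slot):

```cpp
#include <atomic>
#include <cstddef>

// Simplified stand-in for a result_queue entry (not the PR's actual type).
struct result_slot_t {
  std::atomic<bool> claimed{false};  // CAS target: flips false -> true exactly once
  int job_id = -1;
};

// Only the thread whose compare-exchange flips `claimed` from false to true
// may write the slot's payload fields, so two workers can never pick the
// same slot. Returns the claimed index, or -1 if the queue is full.
inline int claim_result_slot(result_slot_t* queue, std::size_t n, int job_id)
{
  for (std::size_t i = 0; i < n; ++i) {
    bool expected = false;
    if (queue[i].claimed.compare_exchange_strong(expected, true)) {
      queue[i].job_id = job_id;  // safe: this thread now owns the slot
      return static_cast<int>(i);
    }
  }
  return -1;  // no free slot
}
```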
cpp/src/grpc/server/grpc_server_types.hpp-55-64 (1)

55-64: 🛠️ Refactor suggestion | 🟠 Major

Avoid using directives in headers.

The using declarations (lines 55-61) and using namespace directive (line 63) pollute the global namespace for any file that includes this header. This can cause unexpected name collisions.

Suggested fix: Use qualified names instead

Remove these lines and use fully qualified names (e.g., grpc::Server, grpc::Status) in the implementation files that include this header. Alternatively, move these declarations inside the #ifdef CUOPT_ENABLE_GRPC guard's associated .cpp files only.

-using grpc::Server;
-using grpc::ServerBuilder;
-using grpc::ServerContext;
-using grpc::ServerReaderWriter;
-using grpc::ServerWriter;
-using grpc::Status;
-using grpc::StatusCode;
-
-using namespace cuopt::linear_programming;
-// Note: NOT using "using namespace cuopt::remote" to avoid JobStatus enum conflict
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_types.hpp` around lines 55 - 64, Remove the
header-level using declarations for grpc types (grpc::Server,
grpc::ServerBuilder, grpc::ServerContext, grpc::ServerReaderWriter,
grpc::ServerWriter, grpc::Status, grpc::StatusCode) and the using namespace
cuopt::linear_programming; to avoid leaking names (and the JobStatus enum
conflict). Replace usages in this header with fully-qualified names (e.g.,
grpc::Server, cuopt::linear_programming::Whatever) or move the using
declarations into the corresponding .cpp files (or inside the CUOPT_ENABLE_GRPC
guard in .cpp) so only implementation files import those namespaces; ensure any
references in grpc_server_types.hpp are updated to the qualified symbols.
cpp/src/grpc/server/grpc_worker_infra.cpp-194-205 (1)

194-205: ⚠️ Potential issue | 🟠 Major

Handle full result_queue when marking worker failures.

If all result slots are occupied, this path drops the terminal failure record. A waiting client can then stall even though the job already failed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_worker_infra.cpp` around lines 194 - 205, The
current loop that writes a terminal failure into result_queue stops if every
slot is already marked ready, which can drop the failure; modify the failure
handling logic so that if no slot with !ready is found it will (1) search for a
slot with ready && retrieved (i.e., already consumed) and reuse it, using the
same writes (copy_cstr to job_id/error_message, set status, data_size,
worker_index, retrieved=false, ready=true), and (2) if none exist, as a last
resort overwrite the oldest/first slot (e.g., index 0) to guarantee the client
gets a terminal response; update the block around result_queue/MAX_RESULTS and
reuse the existing copy_cstr calls so the behavior is deterministic.
cpp/src/grpc/server/grpc_server_main.cpp-50-60 (1)

50-60: ⚠️ Potential issue | 🟠 Major

Harden numeric CLI parsing and range validation.

Line 53/55/57/59 use stoi/stoll without exception handling or range checks. Invalid input or negative values (e.g., workers/port) can crash startup or create invalid runtime config.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 50 - 60, The CLI
numeric parsing loop that assigns config.port, config.num_workers,
config.max_message_mb and config.max_message_b uses std::stoi/std::stoll without
error handling or range checks; wrap each conversion in a try/catch for
std::invalid_argument and std::out_of_range, validate values (e.g., port > 0 &&
port <= 65535, num_workers > 0, max_message_mb >= 0, max_message_b >= 4096) and
on invalid input log an error via the existing logger (or std::cerr) and
return/exit with non‑zero status or fall back to safe defaults; update the
parsing branches that set config.port, config.num_workers, config.max_message_mb
and the --max-message-bytes branch that sets config.max_message_b to use these
guarded conversions and checks.
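A guarded conversion along those lines might look like this (helper name and bounds are illustrative, not the PR's code):

```cpp
#include <stdexcept>
#include <string>

// Parse a numeric CLI argument with explicit error and range handling,
// instead of a bare std::stoi/std::stoll. Illustrative helper only.
inline bool parse_long_arg(const std::string& text, long lo, long hi, long& out)
{
  try {
    std::size_t pos = 0;
    const long v = std::stol(text, &pos);
    if (pos != text.size()) { return false; }  // trailing garbage, e.g. "80x"
    if (v < lo || v > hi) { return false; }    // outside the accepted range
    out = v;
    return true;
  } catch (const std::invalid_argument&) {
    return false;  // not a number at all
  } catch (const std::out_of_range&) {
    return false;  // does not fit in long
  }
}
```

Call sites would then reject bad input up front, e.g. `parse_long_arg(argv[i], 1, 65535, port)` for the port.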
cpp/src/grpc/server/grpc_worker_infra.cpp-120-124 (1)

120-124: ⚠️ Potential issue | 🟠 Major

Close child-end FDs on fork failure to prevent descriptor leaks.

On fork failure, only server-side pipe ends are closed. Parent-owned child ends (worker_read_fd, worker_write_fd, worker_incumbent_write_fd) remain open.

🔧 Minimal fix
   if (pid < 0) {
     std::cerr << "[Server] Failed to fork " << (is_replacement ? "replacement worker " : "worker ")
               << worker_id << "\n";
     close_worker_pipes_server(worker_id);
+    close_worker_pipes_child_ends(worker_id);
     return -1;
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_worker_infra.cpp` around lines 120 - 124, On fork
failure in grpc_worker_infra.cpp (the pid < 0 branch), close the parent-owned
child-end file descriptors to avoid leaks: ensure worker_read_fd,
worker_write_fd, and worker_incumbent_write_fd are closed (if valid) in addition
to calling close_worker_pipes_server(worker_id); update the pid < 0 error path
in the code that creates workers so it explicitly closes those descriptors
before returning -1.
cpp/src/grpc/client/solve_remote.cu-55-57 (1)

55-57: ⚠️ Potential issue | 🟠 Major

Validate env overrides before applying chunk/message sizes.

CUOPT_CHUNK_SIZE and CUOPT_MAX_MESSAGE_BYTES are accepted as-is; zero/negative/too-small values can break chunk sizing and message-limit behavior.

✅ Minimal validation example
 static void apply_env_overrides(grpc_client_config_t& config)
 {
-  config.chunk_size_bytes  = parse_env_int64("CUOPT_CHUNK_SIZE", config.chunk_size_bytes);
-  config.max_message_bytes = parse_env_int64("CUOPT_MAX_MESSAGE_BYTES", config.max_message_bytes);
+  auto chunk_size = parse_env_int64("CUOPT_CHUNK_SIZE", config.chunk_size_bytes);
+  auto max_bytes  = parse_env_int64("CUOPT_MAX_MESSAGE_BYTES", config.max_message_bytes);
+
+  config.chunk_size_bytes = std::max<int64_t>(4096, chunk_size);
+  config.max_message_bytes =
+    (max_bytes <= 0) ? config.max_message_bytes : std::max<int64_t>(4096, max_bytes);
   config.enable_debug_log  = (parse_env_int64("CUOPT_GRPC_DEBUG", 0) != 0);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/solve_remote.cu` around lines 55 - 57, The code currently
applies CUOPT_CHUNK_SIZE and CUOPT_MAX_MESSAGE_BYTES directly which can allow
zero/negative/too-small values to break chunking; after calling parse_env_int64
for these into config.chunk_size_bytes and config.max_message_bytes, validate
each parsed value against sensible thresholds (e.g., >0 and >= a defined
MIN_CHUNK_SIZE / MIN_MESSAGE_BYTES constant) and only assign the override if it
passes validation; if invalid, keep the existing default and emit a warning/log
mentioning the env var name and rejected value (use the same symbols
config.chunk_size_bytes, config.max_message_bytes, parse_env_int64 and add
MIN_CHUNK_SIZE / MIN_MESSAGE_BYTES constants or inline thresholds).
cpp/src/grpc/client/solve_remote.cu-153-166 (1)

153-166: ⚠️ Potential issue | 🟠 Major

Don’t forward incumbent objective as the callback bound.

Line 164 sets bound_copy equal to the incumbent objective. That can make gap-sensitive callback logic treat the incumbent as already tightly bounded, which is semantically wrong when a true best bound is unavailable.

Based on learnings: objective direction is encoded by objective-sense metadata, and callback bounds should use direction-appropriate “no bound” values when the true bound is not available.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/solve_remote.cu` around lines 153 - 166, The incumbent
callback is incorrectly using the incumbent objective as the callback bound
(bound_copy), which can mislead gap-sensitive logic; instead, set bound_copy to
a direction-appropriate "no bound" sentinel (e.g., +infinity for minimization,
-infinity for maximization) rather than objective. In the
config.incumbent_callback lambda (referencing mip_callbacks and
get_solution_callback_t::get_solution), query the objective sense from the
callback or problem metadata (e.g., a get_objective_sense() or equivalent) and
assign bound_copy = std::numeric_limits<double>::infinity() for minimization or
-std::numeric_limits<double>::infinity() for maximization before calling
get_solution; do not forward the incumbent objective as the bound.
cpp/src/grpc/server/grpc_server_threads.cpp-253-263 (1)

253-263: ⚠️ Potential issue | 🟠 Major

Synchronize worker_pipes access across polling and worker-restart paths.

This thread reads worker_pipes (lines 254–255) while the monitor thread replaces dead workers via spawn_single_worker() and close_worker_pipes_server(), which modify the same entries without a shared lock. This is a data race: the vector can reallocate during concurrent modifications, and file descriptors can be closed while the incumbent thread polls them, causing undefined behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_threads.cpp` around lines 253 - 263, The
polling code reads shared worker_pipes without synchronization, causing a data
race with spawn_single_worker() / close_worker_pipes_server() which mutate that
vector; fix by introducing and using a shared mutex (e.g., worker_pipes_mutex)
that both the polling thread and the worker-restart code lock when
accessing/modifying worker_pipes, and change the polling path (the loop that
builds pfds) to take the mutex, copy the needed file descriptors into a local
vector<int> (or local pfds) while holding the lock, then release the lock before
calling poll() so the poll is not done while holding the mutex; also ensure
close_worker_pipes_server() and spawn_single_worker() acquire the same mutex
when closing/removing entries to avoid closing fds being polled.
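The copy-under-lock pattern described above can be sketched like this (names are illustrative; the PR's worker_pipes holds more than one fd per worker, and the restart path must take the same mutex, and should defer close() or tolerate POLLNVAL for fds in a stale snapshot):

```cpp
#include <mutex>
#include <vector>
#include <poll.h>

// Shared state between the polling thread and the worker-restart path.
std::mutex worker_pipes_mutex;
std::vector<int> worker_incumbent_fds;  // stand-in for worker_pipes entries

// Snapshot the fds while holding the lock; the caller then runs
// poll(pfds.data(), pfds.size(), timeout_ms) WITHOUT the lock held,
// so a slow poll never blocks worker restarts.
std::vector<pollfd> snapshot_pollfds()
{
  std::lock_guard<std::mutex> lk(worker_pipes_mutex);
  std::vector<pollfd> pfds;
  pfds.reserve(worker_incumbent_fds.size());
  for (int fd : worker_incumbent_fds) {
    pfds.push_back(pollfd{fd, POLLIN, 0});
  }
  return pfds;  // safe to poll after the lock is released
}
```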
cpp/src/grpc/grpc_settings_mapper.cu-144-146 (1)

144-146: ⚠️ Potential issue | 🟠 Major

Guard iteration_limit narrowing conversion.

Line 145 casts protobuf int64 directly into i_t. Oversized values can overflow/truncate and corrupt solver limits.

Proposed fix
-  if (pb_settings.iteration_limit() >= 0) {
-    settings.iteration_limit = static_cast<i_t>(pb_settings.iteration_limit());
-  }
+  if (pb_settings.iteration_limit() >= 0) {
+    const auto iter_limit = pb_settings.iteration_limit();
+    if (iter_limit > static_cast<int64_t>(std::numeric_limits<i_t>::max())) {
+      settings.iteration_limit = std::numeric_limits<i_t>::max();
+    } else {
+      settings.iteration_limit = static_cast<i_t>(iter_limit);
+    }
+  }

As per coding guidelines "Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero, and use epsilon comparisons for floating-point equality checks."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_settings_mapper.cu` around lines 144 - 146, The code
currently casts pb_settings.iteration_limit() directly to i_t (setting
settings.iteration_limit) which can overflow; update the assignment in the block
that checks pb_settings.iteration_limit() to first compare the int64 value
against the representable range of i_t (use std::numeric_limits<i_t>::max() and
min()/0 as appropriate), and then either clamp to that range or return/log an
error before assigning to settings.iteration_limit; reference
pb_settings.iteration_limit(), settings.iteration_limit, and the i_t type when
making this change so the narrowing conversion is guarded.
cpp/src/grpc/server/grpc_job_management.cpp-109-113 (1)

109-113: ⚠️ Potential issue | 🟠 Major

Cap incumbent payload size before data.resize(size).

Line 111 trusts wire-provided size and can allocate arbitrarily large memory.

Proposed fix
 bool recv_incumbent_pipe(int fd, std::vector<uint8_t>& data)
 {
   uint64_t size;
   if (!read_from_pipe(fd, &size, sizeof(size))) return false;
+  constexpr uint64_t kMaxIncumbentBytes = 64ULL * 1024 * 1024;
+  if (size > kMaxIncumbentBytes) {
+    std::cerr << "[Server] Incumbent payload too large: " << size << "\n";
+    return false;
+  }
   data.resize(size);
   if (size > 0 && !read_from_pipe(fd, data.data(), size)) return false;
   return true;
 }

As per coding guidelines "Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_job_management.cpp` around lines 109 - 113, The code
reads a wire-provided uint64_t size and calls data.resize(size) without
validation; before calling data.resize (in the block that uses read_from_pipe,
size and data), validate and cap size against a defined maximum payload (e.g.
kMaxPayloadSize) and reject sizes that are zero or exceed the cap by returning
false; use the same symbols from the snippet (uint64_t size, read_from_pipe(fd,
&size,...), data.resize(size)) so the check occurs immediately after reading
size and before any allocation or further read_from_pipe calls.
cpp/src/grpc/server/grpc_service_impl.cpp-699-716 (1)

699-716: ⚠️ Potential issue | 🟠 Major

StreamLogs can block indefinitely for terminal jobs without a log file.

The wait loop only exits on file existence or NOT_FOUND; a completed/failed/cancelled job with no log file never terminates the RPC.

Proposed fix
       if (waited_ms >= 2000) {
         std::string msg;
         JobStatus s = check_job_status(job_id, msg);
         if (s == JobStatus::NOT_FOUND) {
           if (config.verbose) {
             std::cout << "[gRPC] StreamLogs job not found: " << job_id << std::endl;
           }
           return Status(grpc::StatusCode::NOT_FOUND, "Job not found: " + job_id);
         }
+        if (s == JobStatus::COMPLETED || s == JobStatus::FAILED || s == JobStatus::CANCELLED) {
+          cuopt::remote::LogMessage done;
+          done.set_line("");
+          done.set_byte_offset(from_byte);
+          done.set_job_complete(true);
+          writer->Write(done);
+          return Status::OK;
+        }
         waited_ms = 0;
       }

As per coding guidelines "Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_service_impl.cpp` around lines 699 - 716, The
StreamLogs loop can hang for terminal jobs that never produce a log file; modify
the loop (in grpc_service_impl::StreamLogs) to call check_job_status(job_id,
msg) when the file is still missing and, if the returned JobStatus is a terminal
state (e.g., JobStatus::COMPLETED, JobStatus::FAILED, JobStatus::CANCELLED or
other terminal enums your code defines), break out and return a gRPC error (for
example Status(grpc::StatusCode::NOT_FOUND, "Log file not available for job: " +
job_id)) or an appropriate Status; keep the existing verbose logging path
(config.verbose) and context->IsCancelled() checks, and retain the waited_ms
retry/backoff behavior but ensure terminal-state handling resets/short-circuits
the wait to avoid indefinite blocking.
cpp/src/grpc/server/grpc_service_impl.cpp-189-193 (1)

189-193: ⚠️ Potential issue | 🟠 Major

SendArrayChunk allows unbounded duplicate/overlapping payload growth.

meta.received_bytes and state.total_bytes are incremented unconditionally; repeated overlapping chunks can inflate memory usage until OOM.

Proposed guard
-    meta.received_bytes += static_cast<int64_t>(raw.size());
+    if (meta.received_bytes + static_cast<int64_t>(raw.size()) > array_bytes) {
+      return Status(StatusCode::INVALID_ARGUMENT, "Received bytes exceed declared array size");
+    }
+    meta.received_bytes += static_cast<int64_t>(raw.size());
     state.total_bytes += static_cast<int64_t>(raw.size());

As per coding guidelines "Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_service_impl.cpp` around lines 189 - 193,
SendArrayChunk currently unconditionally increments meta.received_bytes and
state.total_bytes and pushes state.chunks, allowing duplicate/overlapping chunks
to inflate memory; modify SendArrayChunk to first validate the incoming chunk
against already-received ranges (e.g., track per-array received byte ranges in
state or use an interval set) and skip or trim duplicate/overlapping payloads
before modifying meta.received_bytes, state.total_bytes, state.chunks, and
state.total_chunks; also enforce a global per-array max size check and reject or
cap chunks that would push state.total_bytes beyond that limit to prevent
unbounded growth.
cpp/src/grpc/grpc_settings_mapper.cu-154-154 (1)

154-154: ⚠️ Potential issue | 🟠 Major

Validate protobuf presolver value before casting to presolver_t.

Line 154 and Line 221 trust unvalidated wire values; invalid integers become unsupported internal enum states.

As per coding guidelines "Validate algorithm correctness in optimization logic: simplex pivots, branch-and-bound decisions, routing heuristics, and constraint/objective handling must produce correct results."

Also applies to: 221-221

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_settings_mapper.cu` at line 154, The code casts the
protobuf integer pb_settings.presolver() directly into presolver_t (see
assignment to settings.presolver and the similar use at the other spot), which
can produce invalid enum states for out-of-range wire values; before
static_casting to presolver_t, validate the integer is within the defined
presolver_t range (e.g., compare against the enum's min/max or use a switch/case
for known values), handle unknown values by logging an error/warning and
selecting a safe default (or returning an error), and then assign the validated
value to settings.presolver to avoid unsupported internal enum states.
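One way to validate before the cast (the enum here is a hypothetical stand-in for presolver_t; the real enumerators live in the solver headers):

```cpp
// Hypothetical stand-in for presolver_t; real values come from cuOpt headers.
enum class presolver_t : int { NONE = 0, FAST = 1, FULL = 2 };

// Map a wire integer to the enum only if it names a known value;
// otherwise keep a caller-supplied safe default instead of producing
// an out-of-range enum state.
inline presolver_t presolver_from_wire(int wire_value, presolver_t fallback)
{
  switch (wire_value) {
    case 0: return presolver_t::NONE;
    case 1: return presolver_t::FAST;
    case 2: return presolver_t::FULL;
    default: return fallback;  // unknown wire value: reject, keep default
  }
}
```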
cpp/tests/linear_programming/grpc/grpc_integration_test.cpp-673-699 (1)

673-699: ⚠️ Potential issue | 🟠 Major

SolveInfeasibleLP can pass without actually validating infeasibility behavior.

The assertion is gated by if (result.success && result.solution), so transport or solve regressions can still produce a passing test.

Proposed fix
   auto result = client->solve_lp(problem, settings);
-  if (result.success && result.solution) {
-    EXPECT_NE(result.solution->get_termination_status(), pdlp_termination_status_t::Optimal);
-  }
+  ASSERT_TRUE(result.success) << result.error_message;
+  ASSERT_NE(result.solution, nullptr);
+  EXPECT_NE(result.solution->get_termination_status(), pdlp_termination_status_t::Optimal);

As per coding guidelines "**/*test*.{cpp,cu,py}: Write tests validating numerical correctness of optimization results (not just 'runs without error'); test degenerate cases (infeasible, unbounded, empty, singleton problems)."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/grpc_integration_test.cpp` around lines 673
- 699, The test SolveInfeasibleLP currently skips assertions when the solver
failed or returned no solution; replace the guarded if-check with hard
assertions so the test fails on transport/solve regressions: assert that
result.success is true (ASSERT_TRUE(result.success)), assert that
result.solution is non-null (ASSERT_NE(result.solution, nullptr) or
ASSERT_TRUE(result.solution)), and then assert the termination status equals
pdlp_termination_status_t::Infeasible
(ASSERT_EQ(result.solution->get_termination_status(),
pdlp_termination_status_t::Infeasible)) so the test verifies infeasibility
behavior unconditionally.
cpp/src/grpc/grpc_problem_mapper.cu-175-182 (1)

175-182: ⚠️ Potential issue | 🟠 Major

Validate lower/upper bound array sizes before applying both.

Line 175 only checks constraint_lower_bounds_size() > 0; a malformed payload with a mismatched upper-bound array size still gets applied.

Proposed guard
-  if (pb_problem.constraint_lower_bounds_size() > 0) {
+  if (pb_problem.constraint_lower_bounds_size() > 0 ||
+      pb_problem.constraint_upper_bounds_size() > 0) {
+    if (pb_problem.constraint_lower_bounds_size() != pb_problem.constraint_upper_bounds_size()) {
+      return;  // or propagate an explicit mapping error
+    }
     std::vector<f_t> con_lb(pb_problem.constraint_lower_bounds().begin(),
                             pb_problem.constraint_lower_bounds().end());

As per coding guidelines "Ensure variables and constraints are accessed from the correct problem context (original vs presolve vs folded vs postsolve); verify index mapping consistency across problem transformations."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_problem_mapper.cu` around lines 175 - 182, The code applies
constraint lower and upper bounds when only constraint_lower_bounds_size() > 0
is checked, which allows mismatched arrays; update the guard to validate both
pb_problem.constraint_lower_bounds_size() and
pb_problem.constraint_upper_bounds_size() are > 0 and equal before calling
cpu_problem.set_constraint_lower_bounds(...) and
cpu_problem.set_constraint_upper_bounds(...); if sizes differ or one is zero,
skip applying them (or log/return an error) to avoid out-of-bounds/misaligned
mappings and ensure index consistency across transformations.
cpp/src/grpc/grpc_problem_mapper.cu-454-461 (1)

454-461: ⚠️ Potential issue | 🟠 Major

Require complete CSR/Q triplets before setting matrix data.

Line 454 and Line 526 gate only on values array presence; applying matrices without matching indices/offsets can create malformed models.

Proposed guard pattern
-  if (!a_values.empty()) {
+  if (!a_values.empty() || !a_indices.empty() || !a_offsets.empty()) {
+    if (a_values.empty() || a_indices.empty() || a_offsets.empty()) { return; }
     cpu_problem.set_csr_constraint_matrix(...);
   }
-  if (!q_values.empty()) {
+  if (!q_values.empty() || !q_indices.empty() || !q_offsets.empty()) {
+    if (q_values.empty() || q_indices.empty() || q_offsets.empty()) { return; }
     cpu_problem.set_quadratic_objective_matrix(...);
   }

As per coding guidelines "Validate algorithm correctness in optimization logic: simplex pivots, branch-and-bound decisions, routing heuristics, and constraint/objective handling must produce correct results."

Also applies to: 526-533

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_problem_mapper.cu` around lines 454 - 461, The current
guard only checks a_values before calling cpu_problem.set_csr_constraint_matrix,
which can yield malformed CSR data if a_indices or a_offsets are missing or
their sizes don't match; update the condition to require a_values, a_indices,
and a_offsets to be non-empty and verify their sizes are consistent (e.g.,
a_indices.size() matches a_values.size() and a_offsets.size() matches expected
row count + 1) before calling cpu_problem.set_csr_constraint_matrix, and apply
the same strengthened guard/validation to the other matrix-setting block
referenced around the second call (the similar call later in the file).
cpp/src/grpc/server/grpc_service_impl.cpp-823-824 (1)

823-824: ⚠️ Potential issue | 🟠 Major

Fix incumbent pagination cursor calculation.

next_index is always set to available, which skips data when max_count is smaller than remaining incumbents.

Proposed fix
-    response->set_next_index(available);
+    response->set_next_index(from_index + count);

As per coding guidelines "Validate algorithm correctness in optimization logic: simplex pivots, branch-and-bound decisions, routing heuristics, and constraint/objective handling must produce correct results."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_service_impl.cpp` around lines 823 - 824, The
pagination logic incorrectly sets response->set_next_index(available) which
skips items when max_count < remaining; change it to compute the next cursor as
the start index plus the number of items actually returned (e.g., start +
returned_count) or equivalently start + std::min(max_count, available - start),
then call response->set_next_index(...) with that value and set done when that
next index >= available; update the code around
response->set_next_index(available) and the subsequent bool done = ...
expression accordingly.
cpp/src/grpc/client/grpc_client.cu-525-531 (1)

525-531: ⚠️ Potential issue | 🟠 Major

Use compute_chunk_size() to clamp chunk payload to negotiated message-size limits.

chunk_data_budget is derived only from config_.chunk_size_bytes without enforcing server/client limits. If user config exceeds server capabilities, chunk RPC calls will fail with message-size errors. Apply the existing compute_chunk_size() helper at both occurrences (lines 525-531 and 694-698):

🔧 Proposed fix
-  int64_t chunk_data_budget = config_.chunk_size_bytes;
-  if (chunk_data_budget <= 0) { chunk_data_budget = 1LL * 1024 * 1024; }
+  int64_t chunk_data_budget =
+    compute_chunk_size(server_max_message_bytes_, config_.max_message_bytes, config_.chunk_size_bytes);

   const int64_t proto_overhead = 64;
   if (chunk_data_budget > proto_overhead) { chunk_data_budget -= proto_overhead; }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 525 - 531, The code sets
chunk_data_budget directly from config_.chunk_size_bytes then subtracts
proto_overhead before calling build_array_chunk_requests, which can exceed
negotiated gRPC limits; replace the direct assignment with a call to
compute_chunk_size(config_.chunk_size_bytes) (or the appropriate
compute_chunk_size overload) to clamp to negotiated message-size limits, then
subtract proto_overhead and pass that clamped chunk_data_budget into
build_array_chunk_requests; apply the exact same change to the other place in
this file where chunk_data_budget is computed later (the second occurrence) so
both chunk payload computations use compute_chunk_size().
🟡 Minor comments (10)
python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py-152-152 (1)

152-152: ⚠️ Potential issue | 🟡 Minor

Clarify the presolve mode and add numerical validation to the warmstart test.

At line 152, the literal 0 for CUOPT_PRESOLVE obscures intent—there is no PresolverMode enum exported to Python, so add an inline comment explaining that 0 disables presolve (required for warmstart). More importantly, the test at line 162 only checks that the solution is non-None; it should validate numerical correctness of the warmstart result by comparing it against a baseline solve or checking solution bounds.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py` at
line 152, Replace the magic literal by documenting intent: add an inline comment
next to settings.set_parameter(CUOPT_PRESOLVE, 0) stating that 0 disables
presolve (required for warmstart) since no PresolverMode enum is exposed to
Python. Then extend the warmstart test (the test that currently only asserts
solution is not None) to perform numerical validation: run a baseline solve
without warmstart (or use known optimal objective/variable bounds) and assert
the warmstarted solution’s objective and key variable values are within a tight
tolerance (e.g., via pytest.approx) of the baseline; reference the same settings
and solution objects used in the test to obtain values for comparison.
cpp/src/grpc/server/grpc_incumbent_proto.hpp-30-32 (1)

30-32: ⚠️ Potential issue | 🟡 Minor

Unchecked serialization return value.

SerializeToArray can fail and returns a boolean, but the return value is ignored. This could lead to returning an invalid/incomplete buffer silently.

🛡️ Proposed fix
   std::vector<uint8_t> buffer(msg.ByteSizeLong());
-  msg.SerializeToArray(buffer.data(), static_cast<int>(buffer.size()));
+  if (!msg.SerializeToArray(buffer.data(), static_cast<int>(buffer.size()))) {
+    return {};  // Return empty vector on failure
+  }
   return buffer;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_incumbent_proto.hpp` around lines 30 - 32, The code
ignores the boolean result of msg.SerializeToArray which can fail; update the
serialization block to check the return value of
msg.SerializeToArray(buffer.data(), static_cast<int>(buffer.size())) and handle
failure (e.g., throw a std::runtime_error or return an empty/optional buffer)
rather than returning a potentially incomplete vector; keep the existing use of
msg.ByteSizeLong() and the buffer variable but ensure you validate
SerializeToArray's result and produce a clear error/alternative return before
returning buffer.
cpp/src/grpc/server/grpc_worker.cpp-248-251 (1)

248-251: ⚠️ Potential issue | 🟡 Minor

Missing CUDA error checking on cudaMemcpy.

Per coding guidelines, CUDA errors should be caught and mapped to meaningful error codes. These cudaMemcpy calls could fail (e.g., device out of memory) but errors are not checked.

🛡️ Proposed fix pattern
-        cudaMemcpy(host_solution.data(),
-                   device_solution.data(),
-                   device_solution.size() * sizeof(double),
-                   cudaMemcpyDeviceToHost);
+        cudaError_t err = cudaMemcpy(host_solution.data(),
+                                     device_solution.data(),
+                                     device_solution.size() * sizeof(double),
+                                     cudaMemcpyDeviceToHost);
+        if (err != cudaSuccess) {
+          throw std::runtime_error(std::string("cudaMemcpy failed: ") + cudaGetErrorString(err));
+        }

Apply the same pattern to the other cudaMemcpy calls on lines 296-307.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_worker.cpp` around lines 248 - 251, The cudaMemcpy
calls (e.g., the call copying device_solution -> host_solution) lack CUDA error
checks; wrap each cudaMemcpy (including the other calls around lines 296-307) to
capture the returned cudaError_t, check if it is not cudaSuccess, and convert it
to a meaningful error/Status (use/create a helper like mapCudaErrorToStatus or
mapCudaErrorToGrpcStatus) and return/log that error instead of proceeding;
update the cudaMemcpy calls that reference device_solution, host_solution, and
any other device/host buffers to use this pattern so failures (OOM, invalid
value, etc.) are detected and mapped to the existing error-handling flow.
cpp/src/grpc/client/test_grpc_client.cpp-401-403 (1)

401-403: ⚠️ Potential issue | 🟡 Minor

Unhandled exception from std::stod.

If the user provides an invalid --time-limit value (e.g., non-numeric), std::stod throws std::invalid_argument or std::out_of_range, resulting in an unhandled exception with a cryptic error message.

🛡️ Proposed fix
     } else if (arg == "--time-limit" && arg_idx + 1 < argc) {
-      time_limit = std::stod(argv[++arg_idx]);
+      try {
+        time_limit = std::stod(argv[++arg_idx]);
+      } catch (const std::exception& e) {
+        std::cerr << "ERROR: Invalid time limit value\n";
+        return 1;
+      }
       arg_idx++;

The same issue exists for std::stoll on line 531.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/test_grpc_client.cpp` around lines 401 - 403, Wrap the
std::stod call that parses --time-limit (and the std::stoll call at the other
location) in a try/catch that catches std::invalid_argument and
std::out_of_range, then log a clear error via the existing logging/exit path
and/or set a safe default; specifically, around the parsing of argv in the
branch handling "--time-limit" (variable time_limit and std::stod) catch parsing
exceptions, call the same error-handling used elsewhere (print usage or error
and exit with non-zero) and do the same for the std::stoll parse to avoid
uncaught exceptions and cryptic crashes.
cpp/src/grpc/cuopt_remote_service.proto-314-318 (1)

314-318: ⚠️ Potential issue | 🟡 Minor

Inconsistent job_id type: bytes here vs string elsewhere.

IncumbentRequest.job_id is defined as bytes (line 315), but all other request messages (e.g., StatusRequest, GetResultRequest, CancelRequest, DeleteRequest) use string for job_id. This inconsistency could cause confusion and require different handling on client/server.

Suggested fix for consistency
 message IncumbentRequest {
-  bytes job_id = 1;
+  string job_id = 1;
   int64 from_index = 2;        // Return incumbents starting from this index
   int32 max_count = 3;         // Optional limit (0 or negative => no limit)
 }

Similarly, Incumbent.job_id at line 324 should also be string for consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/cuopt_remote_service.proto` around lines 314 - 318,
IncumbentRequest.job_id (and matching Incumbent.job_id) are declared as bytes
while all other request messages use string; change the field type from bytes to
string in the proto for IncumbentRequest and Incumbent so they match
StatusRequest/GetResultRequest/CancelRequest/DeleteRequest, then
recompile/regenerate the gRPC/protobuf stubs (look for the IncumbentRequest and
Incumbent message definitions) to ensure client/server code uses the consistent
string type.
cpp/src/grpc/server/grpc_server_types.hpp-263-265 (1)

263-265: ⚠️ Potential issue | 🟡 Minor

Hardcoded shared memory names may cause conflicts with concurrent server instances.

The shared memory segment names (/cuopt_job_queue, /cuopt_result_queue, /cuopt_control) are hardcoded constants. If multiple server instances run on the same host, they will conflict.

Consider incorporating the port number or a unique identifier into the segment names.

Suggested approach
// Instead of inline constants, generate names dynamically:
inline std::string get_shm_job_queue_name(int port) {
  return "/cuopt_job_queue_" + std::to_string(port);
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_types.hpp` around lines 263 - 265, The three
hardcoded shared-memory name constants (SHM_JOB_QUEUE, SHM_RESULT_QUEUE,
SHM_CONTROL) will collide across concurrent server instances; change them from
inline const char* values to functions that generate unique names at runtime
(e.g., append the server port, PID, or UUID) and update call sites to use the
generator functions (e.g., get_shm_job_queue_name(port),
get_shm_result_queue_name(port), get_shm_control_name(port)) so each server
instance uses distinct segment names.
GRPC_ARCHITECTURE.md-298-304 (1)

298-304: ⚠️ Potential issue | 🟡 Minor

Document required remote endpoint env vars in the Environment Variables section.

This table currently lists only transfer tuning variables. Please also include CUOPT_REMOTE_HOST and CUOPT_REMOTE_PORT, since remote execution depends on them.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@GRPC_ARCHITECTURE.md` around lines 298 - 304, The Environment Variables table
is missing the remote endpoint variables: add two rows for CUOPT_REMOTE_HOST and
CUOPT_REMOTE_PORT in the "Environment Variables" section, specifying sensible
defaults (e.g., blank or "localhost" for CUOPT_REMOTE_HOST and a default port
like 50051 for CUOPT_REMOTE_PORT) and clear descriptions such as "Remote
execution host" and "Remote execution port" so remote execution is documented
alongside CUOPT_CHUNK_SIZE and CUOPT_MAX_MESSAGE_BYTES.
GRPC_ARCHITECTURE.md-16-39 (1)

16-39: ⚠️ Potential issue | 🟡 Minor

Specify languages on fenced code blocks to satisfy markdown lint.

These blocks are missing a language identifier (MD040), which is currently failing lint hygiene.

📝 Example fix pattern
-```
+```text
 cpp/src/grpc/
 ...
-```
+```

Also applies to: 105-120, 136-151, 359-371

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@GRPC_ARCHITECTURE.md` around lines 16 - 39, The markdown fenced code blocks
in GRPC_ARCHITECTURE.md are missing language identifiers (MD040); update each
triple-backtick block (including the shown cpp/src/grpc/ tree block and the
other blocks at ranges 105-120, 136-151, 359-371) to include a language token
such as "text" (i.e., change ``` to ```text) so linting passes; ensure all
similar block openings (```...) in the file are consistently annotated.
cpp/src/grpc/grpc_solution_mapper.cu-559-560 (1)

559-560: ⚠️ Potential issue | 🟡 Minor

Warm-start presence check is too narrow.

Warm-start reconstruction only triggers when RESULT_WS_CURRENT_PRIMAL is non-empty; other warm-start fields can be present and still be dropped.

Suggested adjustment
-  auto ws_primal = bytes_to_typed<f_t>(arrays, cuopt::remote::RESULT_WS_CURRENT_PRIMAL);
-  if (!ws_primal.empty()) {
+  auto ws_primal = bytes_to_typed<f_t>(arrays, cuopt::remote::RESULT_WS_CURRENT_PRIMAL);
+  auto ws_dual   = bytes_to_typed<f_t>(arrays, cuopt::remote::RESULT_WS_CURRENT_DUAL);
+  auto ws_init_p = bytes_to_typed<f_t>(arrays, cuopt::remote::RESULT_WS_INITIAL_PRIMAL_AVG);
+  auto ws_init_d = bytes_to_typed<f_t>(arrays, cuopt::remote::RESULT_WS_INITIAL_DUAL_AVG);
+  if (!ws_primal.empty() || !ws_dual.empty() || !ws_init_p.empty() || !ws_init_d.empty()) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_solution_mapper.cu` around lines 559 - 560, The code only
triggers warm-start reconstruction when ws_primal (from
bytes_to_typed<f_t>(arrays, cuopt::remote::RESULT_WS_CURRENT_PRIMAL)) is
non-empty, which misses cases where other warm-start fields exist; change the
presence check to detect any warm-start data key in arrays (e.g.,
cuopt::remote::RESULT_WS_CURRENT_PRIMAL, RESULT_WS_CURRENT_DUAL,
RESULT_WS_PREV_PRIMAL, RESULT_WS_PREV_DUAL or other RESULT_WS_* entries) before
skipping reconstruction. Implement a small helper or boolean that queries arrays
for any of those RESULT_WS_* keys (or calls bytes_to_typed for each and checks
non-empty) and use that combined condition instead of only ws_primal.empty() so
reconstruction runs when any warm-start field is present.
cpp/tests/linear_programming/grpc/grpc_integration_test.cpp-53-57 (1)

53-57: ⚠️ Potential issue | 🟡 Minor

Add <fcntl.h> include for open() and flag constants.

Line 120 uses open() with O_WRONLY|O_CREAT|O_TRUNC, but <fcntl.h> is not explicitly included. While some systems may include it transitively through unistd.h, this is not guaranteed by the POSIX standard and creates a portability risk.

File handle cleanup is correct; fd is properly closed at line 125.

Proposed fix
 #include <signal.h>
+#include <fcntl.h>
 #include <sys/types.h>
 #include <sys/wait.h>
 #include <unistd.h>
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/grpc_integration_test.cpp` around lines 53
- 57, The test uses open() with flags O_WRONLY|O_CREAT|O_TRUNC (call site:
open(..., O_WRONLY|O_CREAT|O_TRUNC)) but does not include <fcntl.h>, which can
break portability; add `#include <fcntl.h>` to the top includes so the O_* flag
constants and open declaration are defined and the code compiles reliably across
platforms.
🧹 Nitpick comments (10)
cpp/src/grpc/server/grpc_worker.cpp (1)

149-161: Duplicate error handling blocks.

Lines 149-151 and 153-161 both handle !read_success. The first block logs to stderr, while the second handles the failure logic. These could be consolidated.

♻️ Proposed consolidation
-    if (!read_success) {
-      std::cerr << "[Worker " << worker_id << "] Failed to read job data from pipe\n";
-    }
-
     if (!read_success) {
+      std::cerr << "[Worker " << worker_id << "] Failed to read job data from pipe\n";
       store_simple_result(job_id, worker_id, 1, "Failed to read job data");
       job.worker_pid   = 0;
       // ... rest of cleanup
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_worker.cpp` around lines 149 - 161, The duplicate
handling for !read_success should be consolidated into a single block: remove
the first standalone stderr log and merge its message into the existing failure
branch so that when read_success is false you both log the error (including
worker_id) and execute the failure logic (call store_simple_result(job_id,
worker_id, 1, "..."), reset job fields job.worker_pid, job.worker_index,
job.data_sent, job.ready, job.claimed, and continue). Update the block around
the read_success check in grpc_worker.cpp to perform logging and the cleanup
exactly once to avoid duplicate branches.
cpp/src/grpc/server/grpc_incumbent_proto.hpp (1)

44-49: Consider clearing the output vector before populating.

The assignment vector is passed by reference but not cleared before push_back calls. If a caller reuses the same vector across multiple calls, stale data will accumulate.

♻️ Proposed fix
   job_id    = incumbent_msg.job_id();
   objective = incumbent_msg.objective();
+  assignment.clear();
   assignment.reserve(incumbent_msg.assignment_size());
   for (int i = 0; i < incumbent_msg.assignment_size(); ++i) {
     assignment.push_back(incumbent_msg.assignment(i));
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_incumbent_proto.hpp` around lines 44 - 49, The
assignment vector is populated via push_back without being cleared, so reuse
will accumulate stale entries; before calling assignment.reserve(...) and the
for-loop that pushes from incumbent_msg.assignment(i), call assignment.clear()
(or use assignment.assign(...) from the repeated field) to ensure the vector is
emptied first—update the code around the symbols assignment and incumbent_msg in
grpc_incumbent_proto.hpp to clear assignment before reserving/pushing.
cpp/src/grpc/server/grpc_pipe_serialization.hpp (1)

93-100: Verify semantic difference between result and request deserialization.

For result blobs (line 94), total_elements is used directly as byte count for resize. For request blobs (lines 151-153), total_elements * elem_size is used. This asymmetry suggests total_elements has different semantics in result vs request contexts - consider adding a clarifying comment.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` around lines 93 - 100, The
resize logic for result deserialization uses ac.total_elements() directly as
bytes while request deserialization multiplies total_elements by element size,
so confirm whether ac.total_elements() represents bytes or element count and
make semantics consistent: either change the result-path allocation (the
dest.resize call) to use total_bytes = ac.total_elements() * ac.elem_size() (or
equivalent accessor) before resizing and memcpy, or add a clear comment by the
result-path block documenting that total_elements already includes element size
in bytes; update any related uses (element_offset(), chunk_data.size(), memcpy)
to match the chosen unit to avoid off-by-factor bugs.
cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp (1)

16-20: Consider reordering includes for consistency.

Standard convention places system/library headers before local headers. The <memory> include should come before the local "grpc_client.hpp" include.

Suggested reorder
 #include <cuopt_remote_service.grpc.pb.h>
 #include <grpcpp/grpcpp.h>
-#include "grpc_client.hpp"
 
 #include <memory>
+
+#include "grpc_client.hpp"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp` around lines
16 - 20, Reorder the includes so system/library headers come before local
headers: move the <memory> include above the "grpc_client.hpp" include in
grpc_client_test_helper.hpp; ensure the block now lists
<cuopt_remote_service.grpc.pb.h>, <grpcpp/grpcpp.h>, <memory> and then
"grpc_client.hpp" to follow the standard include ordering convention.
cpp/tests/linear_programming/grpc/CMakeLists.txt (1)

24-33: Minor: Redundant GTest library linkage.

GTest::gmock_main already includes GTest::gmock, and GTest::gtest_main includes GTest::gtest. You can simplify to just the _main variants.

Simplified linkage
 target_link_libraries(GRPC_CLIENT_TEST
     PRIVATE
     cuopt
-    GTest::gmock
     GTest::gmock_main
-    GTest::gtest
     GTest::gtest_main
     gRPC::grpc++
     protobuf::libprotobuf
 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/CMakeLists.txt` around lines 24 - 33,
Remove redundant GTest link entries in the target_link_libraries call for
GRPC_CLIENT_TEST: keep only the main variants (GTest::gmock_main and
GTest::gtest_main) and remove GTest::gmock and GTest::gtest to avoid duplicate
linkage; update the target_link_libraries block that references
GRPC_CLIENT_TEST, cuopt, gRPC::grpc++, and protobuf::libprotobuf accordingly so
only the necessary _main symbols remain.
cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp (2)

359-360: Remove unused server_logs_ member.

The server_logs_ vector is declared but never populated or used. Server logs are read directly from the file system via get_server_logs(). This dead code should be removed to avoid confusion.

Remove unused member
  private:
   mutable std::mutex mutex_;
   std::vector<LogEntry> client_logs_;
-  std::vector<LogEntry> server_logs_;
   std::string server_log_path_;
   std::streampos server_log_start_pos_ = 0;  // Position in server log file when test started
   bool test_start_marked_              = false;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp` around lines 359
- 360, Remove the dead member server_logs_ from the class: delete the
declaration of std::vector<LogEntry> server_logs_ (it is never populated or
used; server logs are accessed via get_server_logs()). After removing the
member, run a quick search for server_logs_ to ensure there are no remaining
references and update any tests or constructors that may have initialized it.

123-126: Callback captures this pointer — document lifetime requirement.

The lambda returned by client_callback() captures this by pointer. If the callback is invoked after the GrpcTestLogCapture instance is destroyed, this will cause undefined behavior. Consider adding a brief note in the documentation that the GrpcTestLogCapture instance must outlive any client using this callback.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp` around lines 123
- 126, The lambda returned by GrpcTestLogCapture::client_callback() captures
this by pointer and can dangle if the GrpcTestLogCapture instance is destroyed,
so add a clear lifetime note: update the documentation/comment for the
GrpcTestLogCapture class and/or the client_callback() method to state that the
GrpcTestLogCapture instance must outlive any client that retains or invokes the
returned callback; reference the client_callback() function name in the comment
so callers know the requirement and consider mentioning alternative designs
(e.g., providing a weak_ptr-based factory) if you want to avoid the strict
lifetime requirement.
SERVER_ARCHITECTURE.md (1)

15-46: Add language specifier to fenced code block.

Per markdownlint, fenced code blocks should have a language specified. For ASCII diagrams, use text or plaintext.

Fix markdown lint warning
-```
+```text
 ┌────────────────────────────────────────────────────────────────────┐
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@SERVER_ARCHITECTURE.md` around lines 15 - 46, The fenced ASCII diagram block
in SERVER_ARCHITECTURE.md is missing a language specifier; change the opening
fence from ``` to ```text (or ```plaintext) so the block becomes ```text and
satisfies markdownlint for the diagram that starts with the
"┌────────────────────────────────────────────────────────────────────┐" line.
cpp/src/grpc/client/solve_remote.cu (1)

69-74: Mark currently-unused LP flags explicitly (or forward them).

problem_checking and use_pdlp_solver_mode are currently unused. Marking them explicitly avoids silent interface drift and warning churn.

Small cleanup option
 std::unique_ptr<lp_solution_interface_t<i_t, f_t>> solve_lp_remote(
   cpu_optimization_problem_t<i_t, f_t> const& cpu_problem,
   pdlp_solver_settings_t<i_t, f_t> const& settings,
-  bool problem_checking,
-  bool use_pdlp_solver_mode)
+  [[maybe_unused]] bool problem_checking,
+  [[maybe_unused]] bool use_pdlp_solver_mode)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/solve_remote.cu` around lines 69 - 74, The function
solve_lp_remote currently accepts parameters problem_checking and
use_pdlp_solver_mode but never uses them, which can cause warnings and interface
drift; update the function (solve_lp_remote) to explicitly mark these parameters
as intentionally unused (e.g., cast to void or annotate with [[maybe_unused]])
or forward them to the underlying call that needs them, ensuring the symbols
problem_checking and use_pdlp_solver_mode appear in the implementation so the
intent is clear and compiler warnings are silenced.
cpp/tests/linear_programming/grpc/grpc_client_test.cpp (1)

1086-1340: Add at least one degenerate solve-case assertion in this suite.

The end-to-end mock solve tests are strong, but adding explicit infeasible/unbounded/empty (or singleton edge) solve assertions would better protect status/result mapping regressions.

As per coding guidelines: "Write tests validating numerical correctness of optimization results (not just 'runs without error'); test degenerate cases (infeasible, unbounded, empty, singleton problems)".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/grpc_client_test.cpp` around lines 1086 -
1340, Add a new degenerate-case unit test (e.g., TEST_F(GrpcClientTest,
SolveLP_Infeasible)) that uses the existing GrpcClientTest fixture and
MockCuOptStub to exercise grpc_client_t::solve_lp for an infeasible LP: have
SubmitJob(...) return a job_id, have CheckStatus/WaitForCompletion indicate
COMPLETED (or skip if use_wait), and have GetResult(...) return a
cuopt::remote::LPSolution with termination_status set to PDLP_INFEASIBLE (and
resp->set_status(cuopt::remote::SUCCESS) if appropriate); then assert the
returned result indicates failure (result.success == false) and that
result.error_message contains a clear indicator of infeasibility (e.g.,
"INFEASIBLE" or "infeasible"); reuse symbols SubmitJob,
CheckStatus/WaitForCompletion, GetResult, grpc_client_t::solve_lp, and
GrpcClientTest to locate where to add the test.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6984e3cb-4874-432a-8307-c8f3a64fd9ba

📥 Commits

Reviewing files that changed from the base of the PR and between 1c6f86b and 6cafe37.

📒 Files selected for processing (39)
  • GRPC_ARCHITECTURE.md
  • SERVER_ARCHITECTURE.md
  • build.sh
  • cpp/CMakeLists.txt
  • cpp/cuopt_grpc_server.cpp
  • cpp/src/grpc/client/grpc_client.cu
  • cpp/src/grpc/client/grpc_client.hpp
  • cpp/src/grpc/client/solve_remote.cu
  • cpp/src/grpc/client/test_grpc_client.cpp
  • cpp/src/grpc/cuopt_remote.proto
  • cpp/src/grpc/cuopt_remote_service.proto
  • cpp/src/grpc/grpc_problem_mapper.cu
  • cpp/src/grpc/grpc_problem_mapper.hpp
  • cpp/src/grpc/grpc_service_mapper.cu
  • cpp/src/grpc/grpc_service_mapper.hpp
  • cpp/src/grpc/grpc_settings_mapper.cu
  • cpp/src/grpc/grpc_settings_mapper.hpp
  • cpp/src/grpc/grpc_solution_mapper.cu
  • cpp/src/grpc/grpc_solution_mapper.hpp
  • cpp/src/grpc/server/grpc_field_element_size.hpp
  • cpp/src/grpc/server/grpc_incumbent_proto.hpp
  • cpp/src/grpc/server/grpc_job_management.cpp
  • cpp/src/grpc/server/grpc_pipe_serialization.hpp
  • cpp/src/grpc/server/grpc_server_main.cpp
  • cpp/src/grpc/server/grpc_server_threads.cpp
  • cpp/src/grpc/server/grpc_server_types.hpp
  • cpp/src/grpc/server/grpc_service_impl.cpp
  • cpp/src/grpc/server/grpc_worker.cpp
  • cpp/src/grpc/server/grpc_worker_infra.cpp
  • cpp/src/pdlp/CMakeLists.txt
  • cpp/src/pdlp/solve_remote.cu
  • cpp/tests/linear_programming/CMakeLists.txt
  • cpp/tests/linear_programming/grpc/CMakeLists.txt
  • cpp/tests/linear_programming/grpc/grpc_client_test.cpp
  • cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp
  • cpp/tests/linear_programming/grpc/grpc_integration_test.cpp
  • cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp
  • python/cuopt/cuopt/routing/utils.py
  • python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
💤 Files with no reviewable changes (2)
  • cpp/src/pdlp/CMakeLists.txt
  • cpp/src/pdlp/solve_remote.cu

Comment on lines +258 to +266
job_status_result_t grpc_client_t::check_status(const std::string& job_id)
{
job_status_result_t result;

grpc::ClientContext context;
auto request = build_status_request(job_id);
cuopt::remote::StatusResponse response;
auto status = impl_->stub->CheckStatus(&context, request, &response);


⚠️ Potential issue | 🔴 Critical

Guard manual RPC APIs against null stub dereference.

These public methods call impl_->stub->... directly (e.g., Line 265). If called before a successful connect(), this can crash.

🔧 Proposed guard pattern
 job_status_result_t grpc_client_t::check_status(const std::string& job_id)
 {
   job_status_result_t result;
+  if (!impl_->stub) {
+    result.error_message = "Not connected to server";
+    return result;
+  }

   grpc::ClientContext context;
   auto request = build_status_request(job_id);
   cuopt::remote::StatusResponse response;

Also applies to: 293-303, 325-333, 354-361, 381-395, 419-430

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 258 - 266, The public RPC
methods (e.g., grpc_client_t::check_status) currently dereference impl_->stub
directly and can crash if called before connect(); add a guard at the start of
each affected method to check that impl_ and impl_->stub are non-null (and
optionally that impl_->connected flag is true) and return an appropriate failure
job_status_result_t (or set an error state) instead of proceeding; update all
other methods listed (those around lines 293-303, 325-333, 354-361, 381-395,
419-430) to use the same pattern so no method ever calls impl_->stub->...
without the null check.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# First, check the file exists and get basic info
if [ -f cpp/src/grpc/client/grpc_client.cu ]; then
  echo "=== File size and structure ==="
  wc -l cpp/src/grpc/client/grpc_client.cu
  echo ""
  echo "=== Code around line 258-266 (check_status function) ==="
  sed -n '250,275p' cpp/src/grpc/client/grpc_client.cu
else
  echo "File not found at expected path. Searching for grpc_client files..."
  find . -name "*grpc_client*" -type f 2>/dev/null
fi

Repository: NVIDIA/cuopt

Length of output: 1047


🏁 Script executed:

#!/bin/bash
# Search for all RPC call patterns and deadline usage in the codebase
echo "=== Searching for set_deadline usage ==="
rg "set_deadline" --type cpp --type cuda -B 2 -A 2

echo ""
echo "=== Searching for ClientContext usage ==="
rg "ClientContext" cpp/src/grpc/client/ --type cpp --type cuda -B 1 -A 3 | head -100

Repository: NVIDIA/cuopt

Length of output: 214


🏁 Script executed:

#!/bin/bash
# Check for config_ member and timeout configuration
echo "=== Searching for config_.timeout_seconds ==="
rg "timeout_seconds|config_\.timeout" cpp/src/grpc/client/ --type cpp --type cuda -B 2 -A 2

echo ""
echo "=== Check class definition and constructor ==="
rg "class grpc_client|grpc_client_t\s*\(" cpp/src/grpc/client/ --type cpp --type cuda -B 2 -A 5

Repository: NVIDIA/cuopt

Length of output: 224


🏁 Script executed:

#!/bin/bash
echo "=== Searching for set_deadline in grpc_client files ==="
rg "set_deadline" cpp/src/grpc/client/ -B 2 -A 2

echo ""
echo "=== Searching for timeout_seconds ==="
rg "timeout_seconds" cpp/src/grpc/client/ -B 2 -A 2

echo ""
echo "=== Searching for config_ initialization and class definition ==="
rg "class grpc_client_t|config_\s*=" cpp/src/grpc/client/ -B 1 -A 3

Repository: NVIDIA/cuopt

Length of output: 3404


🏁 Script executed:

#!/bin/bash
echo "=== Examine lines 293-303 (second location) ==="
sed -n '290,305p' cpp/src/grpc/client/grpc_client.cu

echo ""
echo "=== Examine lines 329-333 ==="
sed -n '325,340p' cpp/src/grpc/client/grpc_client.cu

echo ""
echo "=== Examine lines 356-360 ==="
sed -n '352,365p' cpp/src/grpc/client/grpc_client.cu

Repository: NVIDIA/cuopt

Length of output: 1533


🏁 Script executed:

#!/bin/bash
echo "=== Search for all ClientContext creations ==="
rg "grpc::ClientContext" cpp/src/grpc/client/

echo ""
echo "=== Check for any probe/connection code that might have deadlines ==="
rg -i "probe|connection|deadline" cpp/src/grpc/client/ -B 1 -A 1

Repository: NVIDIA/cuopt

Length of output: 3593


🏁 Script executed:

#!/bin/bash
echo "=== Get grpc_client class definition and members ==="
sed -n '1,100p' cpp/src/grpc/client/grpc_client.cu | head -80

echo ""
echo "=== Search for config_ member definition in header files ==="
find cpp/src/grpc/client -name "*.h*" -o -name "*.hpp" | xargs grep -l "grpc_client\|config_" 2>/dev/null

Repository: NVIDIA/cuopt

Length of output: 2659


🏁 Script executed:

#!/bin/bash
echo "=== Check all remaining affected line ranges ==="
for range in "387-395" "429-430" "468-470" "509-510" "540-541" "625-626" "674-675" "724-725"; do
  start=$(echo $range | cut -d'-' -f1)
  end=$(echo $range | cut -d'-' -f2)
  echo "Lines $range:"
  sed -n "${start},$((end+3))p" cpp/src/grpc/client/grpc_client.cu | grep -A 10 "grpc::ClientContext"
  echo ""
done

Repository: NVIDIA/cuopt

Length of output: 862


🏁 Script executed:

#!/bin/bash
echo "=== Verify that timeout_seconds is accessible in all RPC methods ==="
echo "Checking class context and member variables..."
sed -n '1,200p' cpp/src/grpc/client/grpc_client.hpp | grep -A 20 "^class grpc_client_t"

Repository: NVIDIA/cuopt

Length of output: 171


🏁 Script executed:

#!/bin/bash
echo "=== Verify config_ field and method signatures ==="
rg "grpc_client_config_t|config_\s*grpc_client_config_t|private:|protected:" cpp/src/grpc/client/grpc_client.hpp -A 5 -B 2 | head -80

Repository: NVIDIA/cuopt

Length of output: 1246


Add deadlines to all operational RPC contexts to prevent indefinite blocking on network stalls.

Most RPC calls lack deadlines and can block indefinitely, bypassing configured timeout behavior. The connection probe correctly demonstrates using set_deadline(), but this pattern must be applied to all operational RPCs (CheckStatus, WaitForCompletion, CancelJob, DeleteJob, GetIncumbents, SubmitJob, etc.). Use the existing config_.timeout_seconds field (default 1 hour, configurable) to set deadlines consistently across all RPC methods.

🔧 Proposed fix (apply deadline per RPC)
 grpc::ClientContext context;
+if (config_.timeout_seconds > 0) {
+  context.set_deadline(std::chrono::system_clock::now() +
+                       std::chrono::seconds(config_.timeout_seconds));
+}
 auto status = impl_->stub->CheckStatus(&context, request, &response);

Also applies to: 293-303 (WaitForCompletion), 329-333 (CancelJob), 356-360 (DeleteJob), 387-395 (GetIncumbents), 429-430, 468-470 (SubmitJob), 509-510, 540-541, 625-626, 674-675, 724-725

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 258 - 266, RPC calls like
grpc_client_t::check_status (and other operational RPCs: WaitForCompletion,
CancelJob, DeleteJob, GetIncumbents, SubmitJob, etc.) currently create a
grpc::ClientContext but do not set a deadline, allowing indefinite blocking; fix
each RPC by calling context.set_deadline(std::chrono::system_clock::now() +
std::chrono::seconds(config_.timeout_seconds)) on the grpc::ClientContext before
invoking stub->... so every call uses the configured timeout
(config_.timeout_seconds) consistently, mirroring the connection probe pattern.

Comment on lines +703 to +712
auto field_id = arr_desc.field_id();
int64_t total_elems = arr_desc.total_elements();
int64_t elem_size = arr_desc.element_size_bytes();
if (total_elems <= 0) continue;

int64_t elems_per_chunk = chunk_data_budget / elem_size;
if (elems_per_chunk <= 0) elems_per_chunk = 1;

std::vector<uint8_t> array_bytes(static_cast<size_t>(total_elems * elem_size));


⚠️ Potential issue | 🔴 Critical

Harden chunked-result metadata validation before math/alloc/copy.

element_size_bytes and total_elements are used unchecked. elem_size <= 0 causes division by zero (Line 708), and total_elems * elem_size can overflow before allocation, making the memcpy at Line 739 unsafe.

🔧 Proposed bounds checks
     int64_t total_elems = arr_desc.total_elements();
     int64_t elem_size   = arr_desc.element_size_bytes();
     if (total_elems <= 0) continue;
+    if (elem_size <= 0) {
+      last_error_ = "Invalid chunk metadata: non-positive element_size_bytes";
+      return false;
+    }
+    if (total_elems > std::numeric_limits<int64_t>::max() / elem_size) {
+      last_error_ = "Invalid chunk metadata: total byte size overflow";
+      return false;
+    }
+    const int64_t total_bytes = total_elems * elem_size;
+    if (total_bytes < 0 ||
+        static_cast<uint64_t>(total_bytes) > static_cast<uint64_t>(std::numeric_limits<size_t>::max())) {
+      last_error_ = "Invalid chunk metadata: byte size exceeds addressable memory";
+      return false;
+    }

     int64_t elems_per_chunk = chunk_data_budget / elem_size;
     if (elems_per_chunk <= 0) elems_per_chunk = 1;

-    std::vector<uint8_t> array_bytes(static_cast<size_t>(total_elems * elem_size));
+    std::vector<uint8_t> array_bytes(static_cast<size_t>(total_bytes));

Also applies to: 734-740

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 703 - 712, Validate and
sanity-check arr_desc.element_size_bytes() and arr_desc.total_elements() before
any math, allocation, or memcpy: ensure elem_size > 0 (avoid division-by-zero
when computing elems_per_chunk), ensure total_elems >= 0 and that total_elems *
elem_size does not overflow size_t (use a safe check like total_elems <=
SIZE_MAX / elem_size or similar), and recompute elems_per_chunk using the
validated elem_size and chunk_data_budget; if any check fails, skip the array or
return an error. Specifically update the logic around element_size_bytes(),
total_elements, elems_per_chunk, the allocation of array_bytes, and the
subsequent memcpy to rely only on validated, non-overflowing sizes.

Comment on lines +994 to +995
int max_polls = (config_.timeout_seconds * 1000) / config_.poll_interval_ms;


⚠️ Potential issue | 🔴 Critical

Validate poll_interval_ms before dividing.

Line 994 and Line 1132 divide by config_.poll_interval_ms without checking <= 0. This can crash (division by zero) or produce invalid polling behavior.

🔧 Proposed validation
+if (config_.poll_interval_ms <= 0) {
+  stop_log_streaming();
+  result.error_message = "poll_interval_ms must be > 0";
+  return result;
+}
 int poll_count = 0;
 int max_polls  = (config_.timeout_seconds * 1000) / config_.poll_interval_ms;

Also applies to: 1132-1132

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 994 - 995, Validate
config_.poll_interval_ms before using it in divisions that compute max_polls:
check if config_.poll_interval_ms <= 0 and handle it (log an error and
return/throw, or clamp to a safe default like 1 ms) prior to the line that
computes int max_polls = (config_.timeout_seconds * 1000) /
config_.poll_interval_ms; do the same validation wherever you compute max_polls
(the other division using config_.poll_interval_ms), and make sure to reference
config_.timeout_seconds, config_.poll_interval_ms, and the max_polls calculation
so the check is applied in both places.

Comment on lines +399 to +411
auto get_doubles = [&](int32_t field_id) -> std::vector<f_t> {
auto it = arrays.find(field_id);
if (it == arrays.end() || it->second.empty()) return {};
size_t n = it->second.size() / sizeof(double);
if constexpr (std::is_same_v<f_t, double>) {
std::vector<double> v(n);
std::memcpy(v.data(), it->second.data(), it->second.size());
return v;
} else {
std::vector<double> tmp(n);
std::memcpy(tmp.data(), it->second.data(), it->second.size());
return std::vector<f_t>(tmp.begin(), tmp.end());
}

⚠️ Potential issue | 🔴 Critical

Add alignment checks before memcpy in chunked array decoding.

n is computed with integer division, then memcpy copies full raw.size(). Misaligned payload sizes can write past vector bounds.

Proposed fix
   auto get_doubles = [&](int32_t field_id) -> std::vector<f_t> {
     auto it = arrays.find(field_id);
     if (it == arrays.end() || it->second.empty()) return {};
+    if (it->second.size() % sizeof(double) != 0) return {};
     size_t n = it->second.size() / sizeof(double);
   auto get_ints = [&](int32_t field_id) -> std::vector<i_t> {
     auto it = arrays.find(field_id);
     if (it == arrays.end() || it->second.empty()) return {};
+    if (it->second.size() % sizeof(int32_t) != 0) return {};
     size_t n = it->second.size() / sizeof(int32_t);

As per coding guidelines "Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero, and use epsilon comparisons for floating-point equality checks."

Also applies to: 414-426

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_problem_mapper.cu` around lines 399 - 411, The get_doubles
lambda copies bytes from arrays (lookup via arrays.find(field_id) / it->second)
into a vector using memcpy but computes n = it->second.size() / sizeof(double)
without validating that it->second.size() is a multiple of sizeof(double), which
can overflow the destination; before any memcpy in get_doubles (and the
analogous block handling f_t != double), check that it->second.size() is nonzero
and that it->second.size() % sizeof(double) == 0, compute n = it->second.size()
/ sizeof(double) from the validated byte size, and use that n to allocate the
target vector; if the size is misaligned, handle gracefully (return empty vector
or surface an error) instead of performing memcpy to avoid out-of-bounds writes.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 14

♻️ Duplicate comments (9)
cpp/src/grpc/server/grpc_server_main.cpp (2)

186-212: ⚠️ Potential issue | 🔴 Critical

Add teardown on TLS config failures after workers/threads are started.

These early returns occur after runtime startup and bypass shutdown/join/cleanup paths.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 186 - 212, The TLS
validation branches that currently do "return 1" (around checks using config,
ssl_opts and read_file_to_string for cert/key/root and require_client) can occur
after workers/threads have been started; replace those early returns with calls
to the centralized shutdown/teardown routine (e.g., call your existing server
stop/cleanup functions such as StopServer(), shutdownWorkers()/joinThreads(), or
a new cleanup_resources() that performs shutdown and joins) and then return the
error code; ensure every failure path that currently returns from inside the TLS
setup invokes that teardown routine before exiting so workers/threads and other
resources are properly cleaned up.

234-251: ⚠️ Potential issue | 🔴 Critical

Guard BuildAndStart() failure before waiting on the server pointer.

server->Wait() is called unconditionally; if startup fails and returns null, this dereferences a null pointer.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 234 - 251,
BuildAndStart() may return nullptr so avoid unguarded dereference of server;
after std::unique_ptr<Server> server(builder.BuildAndStart()), check if (server)
before calling server->Wait() and starting the shutdown logic that assumes a
live server. If BuildAndStart() fails, log an error (or print to cerr) and
exit/return non-zero. Also ensure the shutdown_thread's capture/Shutdown call
remains safe by only invoking server->Shutdown() when server is non-null; move
creation of shutdown_thread and the server->Wait() call inside the if (server)
block and handle the failure path explicitly.
cpp/src/grpc/server/grpc_job_management.cpp (1)

147-158: ⚠️ Potential issue | 🔴 Critical

Synchronize all job_queue access; current slot claim/update path is racy.

submit_job_async still does unsynchronized check-then-write on shared queue state, and check_job_status/cancel_job concurrently read/write the same fields. This can corrupt queue ownership and status transitions under concurrent requests.

As per coding guidelines "Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state."

Also applies to: 195-196, 267-270

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_job_management.cpp` around lines 147 - 158, The
submit_job_async path performs an unsynchronized check-then-write on shared
job_queue entries (e.g., checking job_queue[i].ready/claimed then writing fields
like job_id, ready, claimed), which races with check_job_status and cancel_job;
protect all accesses to job_queue (both scans and per-entry reads/writes) by
introducing a synchronization primitive (e.g., a std::mutex or per-entry mutex
array) and acquire the lock(s) around the loop that scans MAX_JOBS and before
modifying fields (job_queue[i].ready, claimed, cancelled, etc.), and also ensure
check_job_status and cancel_job acquire the same lock(s) before reading or
updating those fields so ownership and state transitions are atomic and
race-free.
cpp/src/grpc/client/grpc_client.cu (4)

258-266: ⚠️ Potential issue | 🔴 Critical

Guard public RPC methods before dereferencing impl_->stub.

These methods can be called before connect(). Dereferencing impl_->stub without a null check can crash.

🔧 Suggested guard pattern
 job_status_result_t grpc_client_t::check_status(const std::string& job_id)
 {
   job_status_result_t result;
+  if (!impl_ || !impl_->stub) {
+    result.error_message = "Not connected to server";
+    return result;
+  }

   grpc::ClientContext context;
   auto request = build_status_request(job_id);
   cuopt::remote::StatusResponse response;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 258 - 266, The public RPC
method grpc_client_t::check_status dereferences impl_->stub without ensuring
impl_ is initialized or stub is non-null (can be called before connect()), which
can crash; add a null-check guard at the start of check_status (and other public
RPC methods) that verifies impl_ and impl_->stub are valid, and return an
appropriate job_status_result_t error/failed state (or set result.error and
result.code) when the client is not connected instead of dereferencing; refer to
grpc_client_t::connect and impl_->stub when adding the guard so the check
mirrors connect()'s initialization contract.

994-995: ⚠️ Potential issue | 🔴 Critical

Validate poll_interval_ms before using it as a divisor.

max_polls divides by config_.poll_interval_ms without checking <= 0, which can crash.

🔧 Suggested validation
+  if (config_.poll_interval_ms <= 0) {
+    stop_log_streaming();
+    result.error_message = "poll_interval_ms must be > 0";
+    return result;
+  }
   int poll_count = 0;
   int max_polls  = (config_.timeout_seconds * 1000) / config_.poll_interval_ms;

Also applies to: 1132-1132

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 994 - 995, The code computes
max_polls by dividing by config_.poll_interval_ms without validating it; update
the logic that sets max_polls (the line initializing int max_polls) to first
validate config_.poll_interval_ms is > 0 (and also guard against extremely
large/small values), returning an error or using a safe default/policy (e.g.,
set poll_interval_ms = 1 or return a failure status) when it's <= 0 to avoid
division-by-zero and undefined behavior; apply the same validation where
max_polls is computed elsewhere (the other occurrence around the code that uses
config_.poll_interval_ms).

703-712: ⚠️ Potential issue | 🔴 Critical

Validate chunk metadata and chunk bounds before allocation/copy.

elem_size, total_elems, and elements_in_chunk are trusted too early. This can lead to division by zero, overflowed allocation sizes, or out-of-bounds memcpy.

🔧 Suggested hardening
   for (const auto& arr_desc : header->arrays()) {
     auto field_id       = arr_desc.field_id();
     int64_t total_elems = arr_desc.total_elements();
     int64_t elem_size   = arr_desc.element_size_bytes();
     if (total_elems <= 0) continue;
+    if (elem_size <= 0) {
+      last_error_ = "Invalid chunk metadata: non-positive element size";
+      return false;
+    }
+    if (total_elems > std::numeric_limits<int64_t>::max() / elem_size) {
+      last_error_ = "Invalid chunk metadata: byte-size overflow";
+      return false;
+    }
+    int64_t total_bytes = total_elems * elem_size;
+    if (total_bytes < 0 ||
+        static_cast<uint64_t>(total_bytes) > static_cast<uint64_t>(std::numeric_limits<size_t>::max())) {
+      last_error_ = "Invalid chunk metadata: byte size exceeds addressable memory";
+      return false;
+    }

     int64_t elems_per_chunk = chunk_data_budget / elem_size;
     if (elems_per_chunk <= 0) elems_per_chunk = 1;

-    std::vector<uint8_t> array_bytes(static_cast<size_t>(total_elems * elem_size));
+    std::vector<uint8_t> array_bytes(static_cast<size_t>(total_bytes));

     for (int64_t elem_offset = 0; elem_offset < total_elems; elem_offset += elems_per_chunk) {
       int64_t elems_wanted = std::min(elems_per_chunk, total_elems - elem_offset);
@@
       int64_t elems_received = chunk_resp.elements_in_chunk();
       const auto& data       = chunk_resp.data();

       if (static_cast<int64_t>(data.size()) != elems_received * elem_size) {
         last_error_ = "GetResultChunk: data size mismatch";
         return false;
       }
+      if (elems_received < 0 || elem_offset + elems_received > total_elems) {
+        last_error_ = "GetResultChunk: elements_in_chunk out of bounds";
+        return false;
+      }

       std::memcpy(array_bytes.data() + elem_offset * elem_size, data.data(), data.size());

Also applies to: 731-740

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 703 - 712, Validate and
guard all chunk metadata before allocating or copying: ensure
arr_desc.element_size_bytes() (elem_size) > 0 and arr_desc.total_elements()
(total_elems) >= 0 before using them; ensure chunk_data_budget is >0 before
computing elems_per_chunk and handle division safely; check for
size_t/multiplication overflow when computing total_elems * elem_size before
constructing array_bytes and use a safe cap; when iterating chunks validate each
elements_in_chunk is within [0, total_elems] and compute copy_size =
elements_in_chunk * elem_size with overflow checks before any memcpy; add clear
error handling/logging and skip/abort the chunk on invalid metadata rather than
performing the allocation or memcpy (refer to symbols: arr_desc, field_id,
elem_size, total_elems, chunk_data_budget, elems_per_chunk, elements_in_chunk,
array_bytes, memcpy).

247-251: ⚠️ Potential issue | 🔴 Critical

stop_log_streaming() can hang waiting on a blocked reader thread.

Setting stop_logs_ does not unblock a thread already stuck in stream read, so join() can block indefinitely.

🔧 Suggested fix
 void grpc_client_t::start_log_streaming(const std::string& job_id)
 {
   if (!config_.stream_logs || !config_.log_callback) return;

   stop_logs_.store(false);
   log_thread_ = std::make_unique<std::thread>([this, job_id]() {
     grpc::ClientContext context;
+    {
+      std::lock_guard<std::mutex> lk(log_context_mutex_);
+      active_log_context_ = &context;
+    }
     auto request = build_stream_logs_request(job_id, 0);
     auto reader  = impl_->stub->StreamLogs(&context, request);

     cuopt::remote::LogMessage log_msg;
     while (reader->Read(&log_msg)) {
       if (stop_logs_.load()) {
-        context.TryCancel();
         break;
       }
       if (config_.log_callback) { config_.log_callback(log_msg.line()); }
       if (log_msg.job_complete()) { break; }
     }
     reader->Finish();
+    {
+      std::lock_guard<std::mutex> lk(log_context_mutex_);
+      active_log_context_ = nullptr;
+    }
   });
 }

 void grpc_client_t::stop_log_streaming()
 {
   stop_logs_.store(true);
+  {
+    std::lock_guard<std::mutex> lk(log_context_mutex_);
+    if (active_log_context_) { active_log_context_->TryCancel(); }
+  }
   if (log_thread_ && log_thread_->joinable()) { log_thread_->join(); }
   log_thread_.reset();
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 247 - 251,
stop_log_streaming currently only sets stop_logs_ and then joins log_thread_,
which can hang if the thread is blocked in a gRPC read; change the shutdown
sequence to (1) set stop_logs_, (2) explicitly cancel/close the gRPC stream used
by the reader (e.g. call the grpc::ClientContext::TryCancel() or cancel/Finish
the ClientReader/ReaderInterface instance used for log streaming), (3) ensure
any CompletionQueue or async reader is shutdown/drained so the blocked read
returns, and only then join log_thread_ (use a timed join as a fallback and log
an error if join times out); update stop_log_streaming to reference stop_logs_,
log_thread_ and the stream/reader/client-context member (the object that
performs the stream read) when implementing the cancel/cleanup steps.
cpp/src/grpc/grpc_problem_mapper.cu (1)

399-411: ⚠️ Potential issue | 🔴 Critical

Validate byte alignment before memcpy in chunked array decoders.

n is computed with integer division, but memcpy copies full byte count. Misaligned payload sizes can write past allocated vectors.

🔧 Suggested fix
   auto get_doubles = [&](int32_t field_id) -> std::vector<f_t> {
     auto it = arrays.find(field_id);
     if (it == arrays.end() || it->second.empty()) return {};
+    if (it->second.size() % sizeof(double) != 0) return {};
     size_t n = it->second.size() / sizeof(double);
@@
   auto get_ints = [&](int32_t field_id) -> std::vector<i_t> {
     auto it = arrays.find(field_id);
     if (it == arrays.end() || it->second.empty()) return {};
+    if (it->second.size() % sizeof(int32_t) != 0) return {};
     size_t n = it->second.size() / sizeof(int32_t);

Also applies to: 414-426

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_problem_mapper.cu` around lines 399 - 411, The lambda
get_doubles uses n = it->second.size() / sizeof(double) but then memcpy's the
full byte size, which can overflow if the payload isn't aligned; validate that
it->second.size() is an exact multiple of sizeof(double) before copying, compute
expected_bytes = n * sizeof(double) and only memcpy expected_bytes (or
return/throw on misaligned size), and when converting to f_t ensure you only
read expected_bytes into the temporary vector; apply the same check/fix to the
other chunked-array decoder blocks that perform memcpy (the other double-to-f_t
conversion in this file).
cpp/src/grpc/server/grpc_service_impl.cpp (1)

71-73: ⚠️ Potential issue | 🔴 Critical

Publish ready before releasing claimed to avoid queue slot races.

Current store order can allow another submitter to reserve the same slot before readiness is published.

🔧 Suggested ordering fix
-    job_queue[job_idx].claimed.store(false);
-    job_queue[job_idx].ready.store(true);
+    job_queue[job_idx].ready.store(true, std::memory_order_release);
+    job_queue[job_idx].claimed.store(false, std::memory_order_release);
As per coding guidelines "Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state."

Also applies to: 265-267

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_service_impl.cpp` around lines 71 - 73, The code
currently does job_queue[job_idx].claimed.store(false) before
job_queue[job_idx].ready.store(true), which can let another submitter claim the
slot before readiness is published; swap the two stores so ready.store(true) is
executed before claimed.store(false) for the job_queue entry, and apply the same
fix to the other identical location where claimed and ready are updated (the
later occurrence around the second claim/release sequence) to prevent the race.
🧹 Nitpick comments (5)
cpp/src/grpc/client/test_grpc_client.cpp (1)

401-403: Consider graceful handling for invalid numeric CLI input instead of process abort.

std::stod/std::stoll at lines 402 and 531 throw on malformed input (e.g., --time-limit abc or invalid incumbent index). While this is test code, a try-catch wrapper would provide a friendlier CLI experience instead of silently crashing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/test_grpc_client.cpp` around lines 401 - 403, Replace
direct calls to std::stod and std::stoll in the CLI parsing logic with
exception-safe parsing: wrap the std::stod(std::string) used to set time_limit
and the std::stoll used to set incumbent_index in try-catch blocks that catch
std::invalid_argument and std::out_of_range, print a clear error message
indicating the offending flag and value (e.g., "--time-limit" or
"--incumbent-index"), and exit with a non-zero status instead of letting the
process throw; update the parsing branches that reference time_limit and
incumbent_index so they validate the parsed numeric ranges if applicable and
fall back or exit cleanly on parse failure.
cpp/src/grpc/cuopt_remote_service.proto (1)

9-9: Standardize job_id field type across RPC schema.

Most request messages use string job_id (StatusRequest, GetResultRequest, DeleteRequest, CancelRequest, WaitRequest, StreamLogsRequest, etc.), but IncumbentRequest.job_id and Incumbent.job_id are defined as bytes. Similarly, cuopt_remote.proto uses bytes job_id in SubmitResponse. This type inconsistency can cause friction in client libraries expecting uniform scalar types across the API contract. Either standardize on string for consistency with the majority of the schema, or use bytes uniformly if avoiding UTF-8 validation is a requirement.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/cuopt_remote_service.proto` at line 9, IncumbentRequest.job_id,
Incumbent.job_id and SubmitResponse.job_id are defined as bytes while most RPC
messages (StatusRequest, GetResultRequest, DeleteRequest, CancelRequest,
WaitRequest, StreamLogsRequest, etc.) use string; change those fields to type
string to standardize the schema, update any proto message definitions where
job_id is bytes to string (IncumbentRequest, Incumbent, SubmitResponse), run
proto generation to refresh client/server stubs, and adjust any
serialization/handler code that expects bytes to handle string instead.
cpp/src/grpc/grpc_solution_mapper.cu (2)

93-101: Consider bulk operations for repeated fields.

The element-by-element add_*() calls work correctly but may be slower for large solutions. Protobuf's mutable_primal_solution()->Reserve(primal.size()) followed by Add() or using mutable_primal_solution()->Assign(...) could improve performance for large-scale problems.

💡 Example using Reserve for primal_solution
   // Solution vectors - CPU solution already has data in host memory
   const auto& primal       = solution.get_primal_solution_host();
   const auto& dual         = solution.get_dual_solution_host();
   const auto& reduced_cost = solution.get_reduced_cost_host();

+  pb_solution->mutable_primal_solution()->Reserve(primal.size());
   for (const auto& v : primal) {
     pb_solution->add_primal_solution(static_cast<double>(v));
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_solution_mapper.cu` around lines 93 - 101, The per-element
protobuf adds (pb_solution->add_primal_solution, add_dual_solution,
add_reduced_cost) should be replaced with bulk operations to improve performance
for large vectors: call
pb_solution->mutable_primal_solution()->Reserve(primal.size()) and then use
Add() or assign via mutable_primal_solution()->Assign(...) to copy the primal
data (cast to double as needed), and do the same for dual and reduced_cost using
mutable_dual_solution()->Reserve(dual.size())/Assign and
mutable_reduced_cost()->Reserve(reduced_cost.size())/Assign; update the casts so
the bulk copy converts elements to double before Assign/adding.

559-598: Warm-start presence detection relies on empty array check.

The warm-start data presence is inferred from ws_primal.empty(). This works correctly as long as any valid warm-start data always includes the primal solution. If there's ever a case where warm-start exists but current_primal is empty, this would miss it. Consider checking the header for a dedicated warm-start flag if available.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_solution_mapper.cu` around lines 559 - 598, The code
currently detects warm-start presence by testing ws_primal.empty() in the block
that constructs cpu_pdlp_warm_start_data_t (see bytes_to_typed,
cpu_pdlp_warm_start_data_t, and the if (!ws_primal.empty()) guard); change that
condition to prefer a dedicated header flag (e.g., use h.has_warm_start() or
h.ws_present() if such a field exists) and fall back to the primal-array
emptiness check if the flag is unavailable, e.g., replace if
(!ws_primal.empty()) with if (h.has_warm_start() || !ws_primal.empty()) so
warm-starts with an empty current_primal are still recognized.
cpp/src/grpc/client/grpc_client.hpp (1)

236-240: Non-movable semantics are intentional but could be revisited.

The class is non-copyable and non-movable due to std::atomic<bool> and std::thread members. std::atomic is neither copyable nor movable, so the defaulted move operations would be deleted anyway; combined with the std::thread member and PIMPL pattern, keeping the type non-movable is reasonable. If movability becomes needed, consider wrapping the thread in a nullable type or resetting stop_logs_ after move.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.hpp` around lines 236 - 240, The class
grpc_client_t is explicitly non-copyable and non-movable due to an
std::atomic<bool> (stop_logs_) and an internal std::thread member; if you need
movability, add a noexcept move constructor and move assignment that transfer
ownership of the PIMPL, move the std::thread (std::move on the thread member)
and ensure the source thread is left in a benign state (e.g., set to not
joinable or detached), and transfer/reset the atomic flag (use
stop_logs_.store(false) or std::exchange to set a sensible value in the
moved-from object). Update grpc_client_t::grpc_client_t(grpc_client_t&&) and
operator=(grpc_client_t&&) to move the pimpl pointer, move the thread, and
handle stop_logs_ correctly; ensure proper synchronization when transferring
thread ownership to avoid races.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/src/grpc/client/solve_remote.cu`:
- Around line 55-57: The env overrides for CUOPT_CHUNK_SIZE and
CUOPT_MAX_MESSAGE_BYTES must be validated before assignment: call
parse_env_int64 for each (as currently done), then only assign to
config.chunk_size_bytes and config.max_message_bytes when the parsed value is a
sane positive integer (e.g., > 0) and within expected bounds; otherwise ignore
the override (leave existing config value) and emit a warning via
process/logging (use config.enable_debug_log or existing logger). In short,
replace the direct assignments with guarded checks around parse_env_int64
results and clamp or reject values that are zero/negative or exceed your defined
upper limits before setting config.chunk_size_bytes and
config.max_message_bytes.
- Around line 164-166: The incumbent callback incorrectly uses the incumbent
objective as the "bound" (bound_copy = objective) when calling
get_callback->get_solution; change the flow so the actual best bound from the
remote solver (use get_solution_bound()) is forwarded through the incumbent
callback protocol and used as bound_copy when invoking
get_callback->get_solution; update the gRPC incumbent callback message/handler
(grpc_client.hpp protocol and the server-side code that calls the callback) to
include a bound field, populate it from get_solution_bound(), and adjust the
client-side call site in solve_remote.cu to read that bound and pass it instead
of objective to ensure callbacks receive the solver's best bound.

In `@cpp/src/grpc/cuopt_remote_service.proto`:
- Line 315: The proto uses two different types for job_id (string in
SubmitJobResponse, StatusRequest, GetResultRequest, StartChunkedDownloadRequest,
DeleteRequest, CancelRequest, WaitRequest, StreamLogsRequest vs bytes in
IncumbentRequest and Incumbent), so unify them: choose one canonical type
(prefer string for UTF-8 identifiers or bytes if binary IDs are required) and
update the job_id field in the outlier messages (IncumbentRequest and Incumbent
if switching to string, or update the eight string-typed messages if switching
to bytes) to match; after changing the proto make sure to regenerate language
bindings and add any necessary UTF-8 validation or encoding/decoding notes in
the RPC handling code.

In `@cpp/src/grpc/cuopt_remote.proto`:
- Line 6: The proto package declaration "package cuopt.remote" in
cuopt_remote.proto doesn't match Buf's package-directory rules; either change
the package line to match the file's directory-based package (so the package
mirrors the proto file location) or move cuopt_remote.proto into a directory
structure that matches "cuopt/remote"; update the package declaration (package
cuopt.remote) accordingly so the package and directory structure are consistent
and the PACKAGE_DIRECTORY_MATCH lint passes.

In `@cpp/src/grpc/grpc_settings_mapper.cu`:
- Around line 144-146: The protobuf iteration_limit can exceed the range of the
target type i_t and overflow when statically cast; before assigning
pb_settings.iteration_limit() to settings.iteration_limit (the
static_cast<i_t>), check that the value lies within numeric_limits<i_t>::min()
and numeric_limits<i_t>::max(); if it is out of range, either clamp to the
appropriate min/max or return/log an error and refuse the assignment, ensuring
you reference pb_settings.iteration_limit() and settings.iteration_limit when
implementing the guard to prevent narrowing overflow.
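The range guard described above can be sketched generically; `i_t` here stands in for the solver's index type from the review, and the optional-returning shape (rather than clamping) is an assumption about the preferred error policy.

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// Sketch of the guard: range-check a wide protobuf value before the
// narrowing cast, so out-of-range inputs are rejected rather than wrapped.
template <typename i_t>
std::optional<i_t> checked_narrow(int64_t value)
{
  if (value < static_cast<int64_t>(std::numeric_limits<i_t>::min()) ||
      value > static_cast<int64_t>(std::numeric_limits<i_t>::max())) {
    return std::nullopt;  // caller can clamp, log, or refuse the assignment
  }
  return static_cast<i_t>(value);
}
```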

In `@cpp/src/grpc/server/grpc_job_management.cpp`:
- Around line 109-113: The code currently reads a uint64_t size from the pipe
and blindly calls data.resize(size) which permits unbounded allocation; before
resizing, validate and cap the incoming size (e.g. define a
MAX_INCUMBENT_PAYLOAD or MAX_INCUMBENT_SIZE constant) and return false if size
is zero or exceeds that limit, and ensure you convert/cast size to size_t
safely; update the recv_incumbent_pipe logic around the read_from_pipe(fd,
&size, ...) and data.resize(...) calls (and any callers using fd/data) to
enforce this limit and prevent resource exhaustion.

In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp`:
- Line 90: The loop currently treats any ParseDelimitedFromCodedStream(&ac,
&coded, &clean_eof) == false as a benign break and later returns true, which
accepts partial/corrupt payloads; change the handling so that when
ParseDelimitedFromCodedStream returns false you check clean_eof and only break
on a true clean_eof (clean EOF), otherwise return false to indicate a parse
error. Update both occurrences (the line with ParseDelimitedFromCodedStream(&ac,
&coded, &clean_eof) and the other similar occurrence) so parse failures produce
an immediate false return instead of falling through and returning true.
- Around line 151-154: The resize uses ac.total_elements() * elem_size without
validation which can overflow or trigger huge allocations; before calling
dest.resize validate that ac.total_elements() and elem_size are non-negative and
compute the product using a larger unsigned type (e.g., uint64_t) then check
product <= std::numeric_limits<size_t>::max() and <= a configured maximum
allocation threshold, and only then cast to size_t and call dest.resize; if the
checks fail, return/throw an error instead of resizing. Also apply the same
validation logic for the other occurrence around lines 158-163 (same variables:
ac, elem_size, array_field_element_size(ac.field_id()), dest).
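The overflow-checked product described above can be sketched as a helper; the signature and the `max_alloc_bytes` policy parameter are illustrative, not from the codebase.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>

// Sketch of the guard: compute total_elements * elem_size in a wide unsigned
// type, refuse on overflow, and bound the result by an assumed allocation cap
// before any resize happens.
bool checked_byte_count(uint64_t total_elements,
                        uint64_t elem_size,
                        uint64_t max_alloc_bytes,
                        size_t& out_bytes)
{
  if (elem_size == 0) { return false; }
  if (total_elements > std::numeric_limits<uint64_t>::max() / elem_size) {
    return false;  // multiplication would overflow
  }
  const uint64_t bytes = total_elements * elem_size;
  if (bytes > max_alloc_bytes ||
      bytes > static_cast<uint64_t>(std::numeric_limits<size_t>::max())) {
    return false;
  }
  out_bytes = static_cast<size_t>(bytes);
  return true;
}
```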

In `@cpp/src/grpc/server/grpc_server_main.cpp`:
- Around line 50-80: The CLI parsing loop in main (grpc_server_main.cpp)
currently silently ignores unknown flags and missing values; update the for-loop
that inspects argv/argc to validate every flag: for each option that requires a
value (e.g., "-p"/"--port", "-w"/"--workers", "--max-message-mb",
"--max-message-bytes", "--tls-cert", "--tls-key", "--tls-root") check i+1 < argc
and if not, print an error to std::cerr (include the flag name), call
print_usage(argv[0]) and return a non-zero exit code; for unrecognized args (the
final else case) do the same (error + usage + non-zero exit); also wrap
stoi/stoll conversions in try/catch to handle invalid numbers and fail similarly
with an explanatory error. Ensure you reference and update the same config
fields (config.port, config.num_workers, config.max_message_mb,
config.max_message_b, config.tls_cert_path, config.tls_key_path,
config.tls_root_path) and keep behavior for boolean flags unchanged.
- Around line 157-167: The code currently calls memset on JobQueueEntry and
ResultQueueEntry which contain std::atomic members (job_queue and result_queue
initialization loop); replace the raw memset with explicit field initialization
or construction: for each JobQueueEntry/ResultQueueEntry set every member
explicitly (e.g., initialize non-atomic fields directly and initialize atomics
via .store(...) or use placement-new / a helper function that constructs the
object in shared memory) to avoid undefined behavior; update the loops that call
memset(&job_queue[i], 0, ...) and memset(&result_queue[i], 0, ...) to perform
per-field initialization instead, ensuring worker_index, ready, claimed,
cancelled, retrieved, etc. are correctly constructed and stored.
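The per-field initialization the memset finding asks for can be sketched with placement new. The field names follow the review text; the struct layout is an assumption.

```cpp
#include <atomic>
#include <new>

// Entries that contain std::atomic members must be constructed, not memset:
// memset on a non-trivially-copyable object is undefined behavior.
struct JobQueueEntry {
  int worker_index{-1};
  std::atomic<bool> ready{false};
  std::atomic<bool> claimed{false};
  std::atomic<bool> cancelled{false};
};

// Construct an entry in raw (e.g., shared) memory with placement new so the
// atomics and the other fields get their declared initial values.
JobQueueEntry* init_entry(void* raw) { return new (raw) JobQueueEntry{}; }
```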

In `@cpp/src/grpc/server/grpc_server_types.hpp`:
- Around line 300-307: signal_handler currently calls non-async-signal-safe
functions (std::cout and result_cv.notify_all()), which must be removed; change
signal_handler to only perform async-signal-safe writes: set keep_running to
false using a sig_atomic_t/atomic<bool> and, if needed, set
shm_ctrl->shutdown_requested using an atomic or sig_atomic_t field, and return
immediately (no I/O or condition_variable calls). Add a separate watcher
thread/function (e.g., monitor_shutdown or shutdown_watcher) that polls/blocks
on keep_running becoming false and then performs the logging (std::cout) and
calls result_cv.notify_all() and any other cleanup; update signal_handler,
keep_running's type, and shm_ctrl to use signal-safe atomics so the watcher can
safely observe the change.
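The handler/watcher split described above can be sketched as follows; the 10 ms polling interval and the function names are illustrative, and a real implementation would block on a condition rather than poll.

```cpp
#include <atomic>
#include <chrono>
#include <csignal>
#include <thread>

// The handler only flips a lock-free atomic flag -- no iostream, no
// condition-variable calls, nothing that is not async-signal-safe.
std::atomic<bool> keep_running{true};

extern "C" void handle_signal(int)
{
  keep_running.store(false, std::memory_order_relaxed);
}

// A normal thread observes the flag and performs the unsafe work: logging,
// result_cv.notify_all(), and any other cleanup happen here instead.
void shutdown_watcher()
{
  while (keep_running.load(std::memory_order_relaxed)) {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
  }
  // Safe in thread context: log the shutdown and notify waiters.
}
```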

In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp`:
- Around line 213-218: The setters and other methods access server_log_path_,
server_log_start_pos_, and test_start_marked_ without synchronization, causing
races; add a member mutex (e.g., server_log_mutex_) and use it to guard all
reads and writes of these fields (wrap accesses in std::lock_guard<std::mutex>),
update set_server_log_path to lock the mutex when updating server_log_path_,
server_log_start_pos_, and test_start_marked_, and ensure every other method
that reads or writes these members (the other server-log readers/pollers
mentioned) also locks the same mutex before accessing them so the state is
consistently protected.
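The locking pattern can be sketched as below. The member names follow the review; the class shape and accessor set are assumptions.

```cpp
#include <mutex>
#include <string>

// One mutex guards all three fields the review names; every reader and
// writer takes the same lock so the state is observed consistently.
class log_capture_t {
 public:
  void set_server_log_path(std::string path)
  {
    std::lock_guard<std::mutex> lk(server_log_mutex_);
    server_log_path_      = std::move(path);
    server_log_start_pos_ = 0;
    test_start_marked_    = false;
  }

  std::string server_log_path() const
  {
    std::lock_guard<std::mutex> lk(server_log_mutex_);
    return server_log_path_;
  }

 private:
  mutable std::mutex server_log_mutex_;
  std::string server_log_path_;
  long server_log_start_pos_{0};
  bool test_start_marked_{false};
};
```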

In `@GRPC_ARCHITECTURE.md`:
- Around line 9-10: The documentation sentence claiming “no custom serialization
logic is implemented” is too absolute; update the statement to acknowledge the
custom pipe blob framing helpers introduced in grpc_pipe_serialization.hpp (the
pipe blob framing helpers) and clarify scope by limiting the claim to gRPC wire
payloads (e.g., “All gRPC wire payload serialization uses protocol buffers
generated by protoc and grpc_cpp_plugin; separate pipe blob framing helpers in
grpc_pipe_serialization.hpp are used only for local pipe framing, not for gRPC
wire serialization.”).

In `@SERVER_ARCHITECTURE.md`:
- Around line 15-46: The markdown has untyped fenced code blocks (the ASCII
diagram block and other fenced regions) which trigger MD040; update each
triple-backtick fence (including the large ASCII diagram block shown and the
other blocks noted) to include an explicit language token such as text (e.g.,
change ``` to ```text) or another appropriate language (bash/cpp) so
markdownlint MD040 is satisfied across the file.

---

Duplicate comments:
In `@cpp/src/grpc/client/grpc_client.cu`:
- Around line 258-266: The public RPC method grpc_client_t::check_status
dereferences impl_->stub without ensuring impl_ is initialized or stub is
non-null (can be called before connect()), which can crash; add a null-check
guard at the start of check_status (and other public RPC methods) that verifies
impl_ and impl_->stub are valid, and return an appropriate job_status_result_t
error/failed state (or set result.error and result.code) when the client is not
connected instead of dereferencing; refer to grpc_client_t::connect and
impl_->stub when adding the guard so the check mirrors connect()'s
initialization contract.
- Around line 994-995: The code computes max_polls by dividing by
config_.poll_interval_ms without validating it; update the logic that sets
max_polls (the line initializing int max_polls) to first validate
config_.poll_interval_ms is > 0 (and also guard against extremely large/small
values), returning an error or using a safe default/policy (e.g., set
poll_interval_ms = 1 or return a failure status) when it's <= 0 to avoid
division-by-zero and undefined behavior; apply the same validation where
max_polls is computed elsewhere (the other occurrence around the code that uses
config_.poll_interval_ms).
- Around line 703-712: Validate and guard all chunk metadata before allocating
or copying: ensure arr_desc.element_size_bytes() (elem_size) > 0 and
arr_desc.total_elements() (total_elems) >= 0 before using them; ensure
chunk_data_budget is >0 before computing elems_per_chunk and handle division
safely; check for size_t/multiplication overflow when computing total_elems *
elem_size before constructing array_bytes and use a safe cap; when iterating
chunks validate each elements_in_chunk is within [0, total_elems] and compute
copy_size = elements_in_chunk * elem_size with overflow checks before any
memcpy; add clear error handling/logging and skip/abort the chunk on invalid
metadata rather than performing the allocation or memcpy (refer to symbols:
arr_desc, field_id, elem_size, total_elems, chunk_data_budget, elems_per_chunk,
elements_in_chunk, array_bytes, memcpy).
- Around line 247-251: stop_log_streaming currently only sets stop_logs_ and
then joins log_thread_, which can hang if the thread is blocked in a gRPC read;
change the shutdown sequence to (1) set stop_logs_, (2) explicitly cancel/close
the gRPC stream used by the reader (e.g. call the
grpc::ClientContext::TryCancel() or cancel/Finish the
ClientReader/ReaderInterface instance used for log streaming), (3) ensure any
CompletionQueue or async reader is shutdown/drained so the blocked read returns,
and only then join log_thread_ (use a timed join as a fallback and log an error
if join times out); update stop_log_streaming to reference stop_logs_,
log_thread_ and the stream/reader/client-context member (the object that
performs the stream read) when implementing the cancel/cleanup steps.
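The division guard from the max_polls finding above can be sketched in a few lines; the 1 ms fallback is an assumed policy (returning an error status would be equally valid).

```cpp
// Clamp a non-positive poll interval to a safe floor before dividing, so a
// zero or negative config value cannot cause division by zero.
int compute_max_polls(int timeout_ms, int poll_interval_ms)
{
  if (poll_interval_ms <= 0) { poll_interval_ms = 1; }
  return timeout_ms / poll_interval_ms;
}
```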

In `@cpp/src/grpc/grpc_problem_mapper.cu`:
- Around line 399-411: The lambda get_doubles uses n = it->second.size() /
sizeof(double) but then memcpy's the full byte size, which can overflow if the
payload isn't aligned; validate that it->second.size() is an exact multiple of
sizeof(double) before copying, compute expected_bytes = n * sizeof(double) and
only memcpy expected_bytes (or return/throw on misaligned size), and when
converting to f_t ensure you only read expected_bytes into the temporary vector;
apply the same check/fix to the other chunked-array decoder blocks that perform
memcpy (the other double-to-f_t conversion in this file).
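The alignment check the decoder finding asks for can be sketched as a standalone helper; the function name and vector-based signature are illustrative.

```cpp
#include <cstring>
#include <vector>

// Refuse payloads whose byte length is not an exact multiple of
// sizeof(double) instead of copying a truncated trailing element.
bool bytes_to_doubles(const std::vector<char>& bytes, std::vector<double>& out)
{
  if (bytes.size() % sizeof(double) != 0) { return false; }
  out.resize(bytes.size() / sizeof(double));
  if (!bytes.empty()) { std::memcpy(out.data(), bytes.data(), bytes.size()); }
  return true;
}
```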

In `@cpp/src/grpc/server/grpc_job_management.cpp`:
- Around line 147-158: The submit_job_async path performs an unsynchronized
check-then-write on shared job_queue entries (e.g., checking
job_queue[i].ready/claimed then writing fields like job_id, ready, claimed),
which races with check_job_status and cancel_job; protect all accesses to
job_queue (both scans and per-entry reads/writes) by introducing a
synchronization primitive (e.g., a std::mutex or per-entry mutex array) and
acquire the lock(s) around the loop that scans MAX_JOBS and before modifying
fields (job_queue[i].ready, claimed, cancelled, etc.), and also ensure
check_job_status and cancel_job acquire the same lock(s) before reading or
updating those fields so ownership and state transitions are atomic and
race-free.

In `@cpp/src/grpc/server/grpc_server_main.cpp`:
- Around line 186-212: The TLS validation branches that currently do "return 1"
(around checks using config, ssl_opts and read_file_to_string for cert/key/root
and require_client) can occur after workers/threads have been started; replace
those early returns with calls to the centralized shutdown/teardown routine
(e.g., call your existing server stop/cleanup functions such as StopServer(),
shutdownWorkers()/joinThreads(), or a new cleanup_resources() that performs
shutdown and joins) and then return the error code; ensure every failure path
that currently returns from inside the TLS setup invokes that teardown routine
before exiting so workers/threads and other resources are properly cleaned up.
- Around line 234-251: BuildAndStart() may return nullptr so avoid unguarded
dereference of server; after std::unique_ptr<Server>
server(builder.BuildAndStart()), check if (server) before calling server->Wait()
and starting the shutdown logic that assumes a live server. If BuildAndStart()
fails, log an error (or print to cerr) and exit/return non-zero. Also ensure the
shutdown_thread's capture/Shutdown call remains safe by only invoking
server->Shutdown() when server is non-null; move creation of shutdown_thread and
the server->Wait() call inside the if (server) block and handle the failure path
explicitly.

In `@cpp/src/grpc/server/grpc_service_impl.cpp`:
- Around line 71-73: The code currently does
job_queue[job_idx].claimed.store(false) before
job_queue[job_idx].ready.store(true), which can let another submitter claim the
slot before readiness is published; swap the two stores so ready.store(true) is
executed before claimed.store(false) for the job_queue entry, and apply the same
fix to the other identical location where claimed and ready are updated (the
later occurrence around the second claim/release sequence) to prevent the race.
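The corrected publish order can be sketched in isolation: mark the slot ready before releasing the claim, so a scanner that observes claimed == false also sees ready == true. The memory_order choices below are illustrative.

```cpp
#include <atomic>

struct Slot {
  std::atomic<bool> ready{false};
  std::atomic<bool> claimed{true};  // producer holds the claim while filling
};

// Publish readiness first, then release the claim; the reverse order opens a
// window where another submitter can grab a slot that is not yet ready.
void publish(Slot& s)
{
  s.ready.store(true, std::memory_order_release);
  s.claimed.store(false, std::memory_order_release);
}
```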

---

Nitpick comments:
In `@cpp/src/grpc/client/grpc_client.hpp`:
- Around line 236-240: The class grpc_client_t is explicitly non-copyable and
non-movable due to an std::atomic<bool> (stop_logs_) and an internal std::thread
member; if you need movability, add a noexcept move constructor and move
assignment that transfer ownership of the PIMPL, move the std::thread (std::move
on the thread member) and ensure the source thread is left in a benign state
(e.g., set to not joinable or detached), and transfer/reset the atomic flag (use
stop_logs_.store(false) or std::exchange to set a sensible value in the
moved-from object). Update grpc_client_t::grpc_client_t(grpc_client_t&&) and
operator=(grpc_client_t&&) to move the pimpl pointer, move the thread, and
handle stop_logs_ correctly; ensure proper synchronization when transferring
thread ownership to avoid races.

In `@cpp/src/grpc/client/test_grpc_client.cpp`:
- Around line 401-403: Replace direct calls to std::stod and std::stoll in the
CLI parsing logic with exception-safe parsing: wrap the std::stod(std::string)
used to set time_limit and the std::stoll used to set incumbent_index in
try-catch blocks that catch std::invalid_argument and std::out_of_range, print a
clear error message indicating the offending flag and value (e.g.,
"--time-limit" or "--incumbent-index"), and exit with a non-zero status instead
of letting the process throw; update the parsing branches that reference
time_limit and incumbent_index so they validate the parsed numeric ranges if
applicable and fall back or exit cleanly on parse failure.

In `@cpp/src/grpc/cuopt_remote_service.proto`:
- Line 9: IncumbentRequest.job_id, Incumbent.job_id and SubmitResponse.job_id
are defined as bytes while most RPC messages (StatusRequest, GetResultRequest,
DeleteRequest, CancelRequest, WaitRequest, StreamLogsRequest, etc.) use string;
change those fields to type string to standardize the schema, update any proto
message definitions where job_id is bytes to string (IncumbentRequest,
Incumbent, SubmitResponse), run proto generation to refresh client/server stubs,
and adjust any serialization/handler code that expects bytes to handle string
instead.

In `@cpp/src/grpc/grpc_solution_mapper.cu`:
- Around line 93-101: The per-element protobuf adds
(pb_solution->add_primal_solution, add_dual_solution, add_reduced_cost) should
be replaced with bulk operations to improve performance for large vectors: call
pb_solution->mutable_primal_solution()->Reserve(primal.size()) and then use
Add() or assign via mutable_primal_solution()->Assign(...) to copy the primal
data (cast to double as needed), and do the same for dual and reduced_cost using
mutable_dual_solution()->Reserve(dual.size())/Assign and
mutable_reduced_cost()->Reserve(reduced_cost.size())/Assign; update the casts so
the bulk copy converts elements to double before Assign/adding.
- Around line 559-598: The code currently detects warm-start presence by testing
ws_primal.empty() in the block that constructs cpu_pdlp_warm_start_data_t (see
bytes_to_typed, cpu_pdlp_warm_start_data_t, and the if (!ws_primal.empty())
guard); change that condition to prefer a dedicated header flag (e.g., use
h.has_warm_start() or h.ws_present() if such a field exists) and fall back to
the primal-array emptiness check if the flag is unavailable, e.g., replace if
(!ws_primal.empty()) with if (h.has_warm_start() || !ws_primal.empty()) so
warm-starts with an empty current_primal are still recognized.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2a43112f-b25e-4054-b00f-1fee3a598df9

📥 Commits

Reviewing files that changed from the base of the PR and between 6cafe37 and 612de9a.

📒 Files selected for processing (38)
  • GRPC_ARCHITECTURE.md
  • SERVER_ARCHITECTURE.md
  • build.sh
  • cpp/CMakeLists.txt
  • cpp/cuopt_grpc_server.cpp
  • cpp/src/grpc/client/grpc_client.cu
  • cpp/src/grpc/client/grpc_client.hpp
  • cpp/src/grpc/client/solve_remote.cu
  • cpp/src/grpc/client/test_grpc_client.cpp
  • cpp/src/grpc/cuopt_remote.proto
  • cpp/src/grpc/cuopt_remote_service.proto
  • cpp/src/grpc/grpc_problem_mapper.cu
  • cpp/src/grpc/grpc_problem_mapper.hpp
  • cpp/src/grpc/grpc_service_mapper.cu
  • cpp/src/grpc/grpc_service_mapper.hpp
  • cpp/src/grpc/grpc_settings_mapper.cu
  • cpp/src/grpc/grpc_settings_mapper.hpp
  • cpp/src/grpc/grpc_solution_mapper.cu
  • cpp/src/grpc/grpc_solution_mapper.hpp
  • cpp/src/grpc/server/grpc_field_element_size.hpp
  • cpp/src/grpc/server/grpc_incumbent_proto.hpp
  • cpp/src/grpc/server/grpc_job_management.cpp
  • cpp/src/grpc/server/grpc_pipe_serialization.hpp
  • cpp/src/grpc/server/grpc_server_main.cpp
  • cpp/src/grpc/server/grpc_server_threads.cpp
  • cpp/src/grpc/server/grpc_server_types.hpp
  • cpp/src/grpc/server/grpc_service_impl.cpp
  • cpp/src/grpc/server/grpc_worker.cpp
  • cpp/src/grpc/server/grpc_worker_infra.cpp
  • cpp/src/pdlp/CMakeLists.txt
  • cpp/src/pdlp/solve_remote.cu
  • cpp/tests/linear_programming/CMakeLists.txt
  • cpp/tests/linear_programming/grpc/CMakeLists.txt
  • cpp/tests/linear_programming/grpc/grpc_client_test.cpp
  • cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp
  • cpp/tests/linear_programming/grpc/grpc_integration_test.cpp
  • cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp
  • python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
💤 Files with no reviewable changes (2)
  • cpp/src/pdlp/CMakeLists.txt
  • cpp/src/pdlp/solve_remote.cu
🚧 Files skipped from review as they are similar to previous changes (8)
  • cpp/tests/linear_programming/CMakeLists.txt
  • cpp/tests/linear_programming/grpc/CMakeLists.txt
  • cpp/src/grpc/server/grpc_incumbent_proto.hpp
  • cpp/src/grpc/server/grpc_worker_infra.cpp
  • cpp/src/grpc/server/grpc_server_threads.cpp
  • cpp/src/grpc/server/grpc_field_element_size.hpp
  • cpp/src/grpc/server/grpc_worker.cpp
  • python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py

Comment on lines +55 to +57
config.chunk_size_bytes = parse_env_int64("CUOPT_CHUNK_SIZE", config.chunk_size_bytes);
config.max_message_bytes = parse_env_int64("CUOPT_MAX_MESSAGE_BYTES", config.max_message_bytes);
config.enable_debug_log = (parse_env_int64("CUOPT_GRPC_DEBUG", 0) != 0);
⚠️ Potential issue | 🟠 Major

Validate env override ranges before applying gRPC sizing config.

CUOPT_CHUNK_SIZE and CUOPT_MAX_MESSAGE_BYTES are accepted as-is. Zero/negative values can break chunking/message-size logic at runtime.

Proposed guardrails
 static void apply_env_overrides(grpc_client_config_t& config)
 {
-  config.chunk_size_bytes  = parse_env_int64("CUOPT_CHUNK_SIZE", config.chunk_size_bytes);
-  config.max_message_bytes = parse_env_int64("CUOPT_MAX_MESSAGE_BYTES", config.max_message_bytes);
+  auto chunk_size = parse_env_int64("CUOPT_CHUNK_SIZE", config.chunk_size_bytes);
+  auto max_msg    = parse_env_int64("CUOPT_MAX_MESSAGE_BYTES", config.max_message_bytes);
+  if (chunk_size > 0) { config.chunk_size_bytes = chunk_size; }
+  if (max_msg > 0) { config.max_message_bytes = max_msg; }
   config.enable_debug_log  = (parse_env_int64("CUOPT_GRPC_DEBUG", 0) != 0);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
config.chunk_size_bytes = parse_env_int64("CUOPT_CHUNK_SIZE", config.chunk_size_bytes);
config.max_message_bytes = parse_env_int64("CUOPT_MAX_MESSAGE_BYTES", config.max_message_bytes);
config.enable_debug_log = (parse_env_int64("CUOPT_GRPC_DEBUG", 0) != 0);
auto chunk_size = parse_env_int64("CUOPT_CHUNK_SIZE", config.chunk_size_bytes);
auto max_msg = parse_env_int64("CUOPT_MAX_MESSAGE_BYTES", config.max_message_bytes);
if (chunk_size > 0) { config.chunk_size_bytes = chunk_size; }
if (max_msg > 0) { config.max_message_bytes = max_msg; }
config.enable_debug_log = (parse_env_int64("CUOPT_GRPC_DEBUG", 0) != 0);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/solve_remote.cu` around lines 55 - 57, The env overrides
for CUOPT_CHUNK_SIZE and CUOPT_MAX_MESSAGE_BYTES must be validated before
assignment: call parse_env_int64 for each (as currently done), then only assign
to config.chunk_size_bytes and config.max_message_bytes when the parsed value is
a sane positive integer (e.g., > 0) and within expected bounds; otherwise ignore
the override (leave existing config value) and emit a warning via
process/logging (use config.enable_debug_log or existing logger). In short,
replace the direct assignments with guarded checks around parse_env_int64
results and clamp or reject values that are zero/negative or exceed your defined
upper limits before setting config.chunk_size_bytes and
config.max_message_bytes.

Comment on lines +164 to +166
double bound_copy = objective; // Use objective as bound for incumbent
get_callback->get_solution(
solution_copy.data(), &obj_copy, &bound_copy, callback->get_user_data());
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Remote callback context (solve_remote.cu lines 160-170) =="
sed -n '160,170p' cpp/src/grpc/client/solve_remote.cu | cat -n

echo "== All get_solution calls in solve_remote.cu =="
rg -n 'get_solution' cpp/src/grpc/client/solve_remote.cu -B 5 -A 5

echo "== Other get_solution callback implementations =="
rg -n 'get_solution\(' cpp/src -g '*.{cu,cpp,hpp,h}' -B 3 -A 3 | head -100

Repository: NVIDIA/cuopt

Length of output: 9221


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Search for best_bound or incumbent_bound in solve_remote.cu =="
rg -n 'best_bound|incumbent_bound|lower_bound|upper_bound|dual_bound' cpp/src/grpc/client/solve_remote.cu -C 3

echo "== Check callback interface definition =="
rg -n 'get_solution|GetSolution' cpp/src/callbacks -g '*.{h,hpp}' -B 5 -A 10 | head -150

Repository: NVIDIA/cuopt

Length of output: 121


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Search for callback setup/documentation in solve_remote.cu =="
cat -n cpp/src/grpc/client/solve_remote.cu | head -200

Repository: NVIDIA/cuopt

Length of output: 9190


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Check incumbent_callback definition/signature =="
rg -n 'incumbent_callback' cpp/src/grpc -g '*.{hpp,h,cpp}' -B 2 -A 2

echo "== Check if best_bound is available in remote solve =="
rg -n 'best_bound|bestBound|best_lower_bound|lower_bound' cpp/src/grpc -g '*.{hpp,h,cpp}' -C 2

Repository: NVIDIA/cuopt

Length of output: 2037


🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail

echo "== Check config/client callback interface =="
fd 'grpc_client' cpp/src/grpc --type f -x head -100 {} \;

echo "== Search for MipSolveResult or similar that returns solver progress =="
rg -n 'MipSolveResult|incumbent|bound' cpp/src/grpc/server -g '*.{hpp,h,cpp}' | head -50
```

Repository: NVIDIA/cuopt

Length of output: 17291


Remote solver should pass the best bound to incumbent callbacks, not the incumbent objective.

The incumbent callback from the remote server provides only the objective value and solution, not the best lower bound from the MIP solver. Setting bound_copy = objective at line 164 creates an artificial zero gap in user callbacks, misreporting MIP progress.

The underlying issue is that the incumbent_callback protocol (grpc_client.hpp:66) does not include the solver's best bound, while the local MIP solver (solve.cu:117) passes a separate bound vector to callbacks. The server has access to get_solution_bound() but does not forward it through the incumbent callback protocol.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/solve_remote.cu` around lines 164 - 166, The incumbent
callback incorrectly uses the incumbent objective as the "bound" (bound_copy =
objective) when calling get_callback->get_solution; change the flow so the
actual best bound from the remote solver (use get_solution_bound()) is forwarded
through the incumbent callback protocol and used as bound_copy when invoking
get_callback->get_solution; update the gRPC incumbent callback message/handler
(grpc_client.hpp protocol and the server-side code that calls the callback) to
include a bound field, populate it from get_solution_bound(), and adjust the
client-side call site in solve_remote.cu to read that bound and pass it instead
of objective to ensure callbacks receive the solver's best bound.

```proto
// =============================================================================

message IncumbentRequest {
  bytes job_id = 1;
```

🛠️ Refactor suggestion | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

```bash
cat -n cpp/src/grpc/cuopt_remote_service.proto | sed -n '300,335p'
```

Repository: NVIDIA/cuopt

Length of output: 1562


🏁 Script executed:

```bash
#!/bin/bash
echo "== All job_id field declarations in proto files =="
rg -nP '\bjob_id\b\s*=\s*\d+\s*;' cpp/src/grpc/*.proto

echo ""
echo "== Check context around job_id declarations =="
rg -B2 -A2 '\bjob_id\b\s*=\s*\d+\s*;' cpp/src/grpc/*.proto
```

Repository: NVIDIA/cuopt

Length of output: 4428


🏁 Script executed:

```bash
#!/bin/bash
echo "== GetIncumbents RPC definition =="
rg -B5 -A5 'GetIncumbents' cpp/src/grpc/cuopt_remote_service.proto | head -30

echo ""
echo "== Usage of job_id in test files =="
rg -n 'job_id' cpp/tests/linear_programming/grpc/ | head -20

echo ""
echo "== Check if bytes job_id is used elsewhere intentionally =="
rg -B3 -A3 'bytes job_id' cpp/src/grpc/
```

Repository: NVIDIA/cuopt

Length of output: 4404


Unify job_id type across RPC messages.

job_id is string in 8 request/response messages (SubmitJobResponse, StatusRequest, GetResultRequest, StartChunkedDownloadRequest, DeleteRequest, CancelRequest, WaitRequest, StreamLogsRequest) but bytes in IncumbentRequest and Incumbent. This inconsistency complicates cross-language clients and can introduce avoidable encoding bugs. If UTF-8 validation concerns require bytes encoding, the same should apply consistently across all job_id fields.

Suggested schema alignment
```diff
 message IncumbentRequest {
-  bytes job_id = 1;
+  string job_id = 1;
   int64 from_index = 2;        // Return incumbents starting from this index
   int32 max_count = 3;         // Optional limit (0 or negative => no limit)
 }

 message Incumbent {
   int64 index = 1;
   double objective = 2;
   repeated double assignment = 3;
-  bytes job_id = 4;
+  string job_id = 4;
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/cuopt_remote_service.proto` at line 315, The proto uses two
different types for job_id (string in SubmitJobResponse, StatusRequest,
GetResultRequest, StartChunkedDownloadRequest, DeleteRequest, CancelRequest,
WaitRequest, StreamLogsRequest vs bytes in IncumbentRequest and Incumbent), so
unify them: choose one canonical type (prefer string for UTF-8 identifiers or
bytes if binary IDs are required) and update the job_id field in the outlier
messages (IncumbentRequest and Incumbent if switching to string, or update the
eight string-typed messages if switching to bytes) to match; after changing the
proto make sure to regenerate language bindings and add any necessary UTF-8
validation or encoding/decoding notes in the RPC handling code.


```proto
syntax = "proto3";

package cuopt.remote;
```

⚠️ Potential issue | 🟠 Major

Align proto package path with Buf package-directory rules.

package cuopt.remote from cpp/src/grpc/cuopt_remote.proto triggers PACKAGE_DIRECTORY_MATCH. This will fail Buf lint unless the file path (or package naming) is adjusted consistently.

🧰 Tools
🪛 Buf (1.66.0)

[error] 6-6: Files with package "cuopt.remote" must be within a directory "cuopt/remote" relative to root but were in directory "cpp/src/grpc".

(PACKAGE_DIRECTORY_MATCH)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/cuopt_remote.proto` at line 6, The proto package declaration
"package cuopt.remote" in cuopt_remote.proto doesn't match Buf's
package-directory rules; either change the package line to match the file's
directory-based package (so the package mirrors the proto file location) or move
cuopt_remote.proto into a directory structure that matches "cuopt/remote";
update the package declaration (package cuopt.remote) accordingly so the package
and directory structure are consistent and the PACKAGE_DIRECTORY_MATCH lint
passes.

Comment on lines +144 to +146
```cpp
if (pb_settings.iteration_limit() >= 0) {
  settings.iteration_limit = static_cast<i_t>(pb_settings.iteration_limit());
}
```

⚠️ Potential issue | 🟠 Major

Range-check iteration_limit before narrowing to i_t.

A large protobuf iteration_limit can overflow the static_cast<i_t> and produce an invalid internal limit.

🔧 Suggested guard
```diff
   if (pb_settings.iteration_limit() >= 0) {
-    settings.iteration_limit = static_cast<i_t>(pb_settings.iteration_limit());
+    const auto limit = pb_settings.iteration_limit();
+    if (limit <= static_cast<int64_t>(std::numeric_limits<i_t>::max())) {
+      settings.iteration_limit = static_cast<i_t>(limit);
+    } else {
+      settings.iteration_limit = std::numeric_limits<i_t>::max();
+    }
   }
```
As per coding guidelines "Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero, and use epsilon comparisons for floating-point equality checks."
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
```diff
-if (pb_settings.iteration_limit() >= 0) {
-  settings.iteration_limit = static_cast<i_t>(pb_settings.iteration_limit());
-}
+if (pb_settings.iteration_limit() >= 0) {
+  const auto limit = pb_settings.iteration_limit();
+  if (limit <= static_cast<int64_t>(std::numeric_limits<i_t>::max())) {
+    settings.iteration_limit = static_cast<i_t>(limit);
+  } else {
+    settings.iteration_limit = std::numeric_limits<i_t>::max();
+  }
+}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_settings_mapper.cu` around lines 144 - 146, The protobuf
iteration_limit can exceed the range of the target type i_t and overflow when
statically cast; before assigning pb_settings.iteration_limit() to
settings.iteration_limit (the static_cast<i_t>), check that the value lies
within numeric_limits<i_t>::min() and numeric_limits<i_t>::max(); if it is out
of range, either clamp to the appropriate min/max or return/log an error and
refuse the assignment, ensuring you reference pb_settings.iteration_limit() and
settings.iteration_limit when implementing the guard to prevent narrowing
overflow.

Comment on lines +213 to +218
```cpp
void set_server_log_path(const std::string& path)
{
  server_log_path_      = path;
  server_log_start_pos_ = 0;
  test_start_marked_    = false;
}
```

⚠️ Potential issue | 🟠 Major

Synchronize server-log state reads/writes consistently.

server_log_path_, server_log_start_pos_, and test_start_marked_ are written and read without a common lock in several methods. This can race in concurrent test callbacks/pollers.

As per coding guidelines: "Prevent thread-unsafe use of global and static variables; use proper mutex/synchronization in server code accessing shared solver state."

Also applies to: 229-244, 332-343

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp` around lines 213
- 218, The setters and other methods access server_log_path_,
server_log_start_pos_, and test_start_marked_ without synchronization, causing
races; add a member mutex (e.g., server_log_mutex_) and use it to guard all
reads and writes of these fields (wrap accesses in std::lock_guard<std::mutex>),
update set_server_log_path to lock the mutex when updating server_log_path_,
server_log_start_pos_, and test_start_marked_, and ensure every other method
that reads or writes these members (the other server-log readers/pollers
mentioned) also locks the same mutex before accessing them so the state is
consistently protected.

Comment on lines +15 to +46
```
┌────────────────────────────────────────────────────────────────────┐
│ Main Server Process │
│ │
│ ┌─────────────┐ ┌──────────────┐ ┌─────────────────────────────┐ │
│ │ gRPC │ │ Job │ │ Background Threads │ │
│ │ Service │ │ Tracker │ │ - Result retrieval │ │
│ │ Handler │ │ (job status,│ │ - Incumbent retrieval │ │
│ │ │ │ results) │ │ - Worker monitor │ │
│ └─────────────┘ └──────────────┘ └─────────────────────────────┘ │
│ │ ▲ │
│ │ shared memory │ pipes │
│ ▼ │ │
│ ┌─────────────────────────────────────────────────────────────────┐│
│ │ Shared Memory Queues ││
│ │ ┌─────────────────┐ ┌─────────────────────┐ ││
│ │ │ Job Queue │ │ Result Queue │ ││
│ │ │ (MAX_JOBS=100) │ │ (MAX_RESULTS=100) │ ││
│ │ └─────────────────┘ └─────────────────────┘ ││
│ └─────────────────────────────────────────────────────────────────┘│
└────────────────────────────────────────────────────────────────────┘
│ ▲
│ fork() │
▼ │
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Worker 0 │ │ Worker 1 │ │ Worker N │
│ ┌───────────┐ │ │ ┌───────────┐ │ │ ┌───────────┐ │
│ │ GPU Solve │ │ │ │ GPU Solve │ │ │ │ GPU Solve │ │
│ └───────────┘ │ │ └───────────┘ │ │ └───────────┘ │
│ (separate proc)│ │ (separate proc)│ │ (separate proc)│
└─────────────────┘ └─────────────────┘ └─────────────────┘
```

⚠️ Potential issue | 🟡 Minor

Add fence languages to code blocks to satisfy markdownlint.

Several fenced blocks are untyped; add explicit languages (e.g., text, cpp, bash) to resolve MD040.

Also applies to: 105-120, 136-151

🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 16-16: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@SERVER_ARCHITECTURE.md` around lines 15 - 46, The markdown has untyped fenced
code blocks (the ASCII diagram block and other fenced regions) which trigger
MD040; update each triple-backtick fence (including the large ASCII diagram
block shown and the other blocks noted) to include an explicit language token
such as text (e.g., change ``` to ```text) or another appropriate language
(bash/cpp) so markdownlint MD040 is satisfied across the file.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

♻️ Duplicate comments (24)
cpp/src/grpc/cuopt_remote_service.proto (1)

314-325: ⚠️ Potential issue | 🟠 Major

Use one job_id scalar type across service messages.

Line 315 and Line 324 use bytes, while most service request/response messages use string for job_id. This inconsistency creates avoidable cross-language conversion edge cases.

Schema alignment sketch
```diff
 message IncumbentRequest {
-  bytes job_id = 1;
+  string job_id = 1;
   int64 from_index = 2;
   int32 max_count = 3;
 }

 message Incumbent {
   int64 index = 1;
   double objective = 2;
   repeated double assignment = 3;
-  bytes job_id = 4;
+  string job_id = 4;
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/cuopt_remote_service.proto` around lines 314 - 325, The
IncumbentRequest.job_id and Incumbent.job_id fields use bytes while the rest of
the API uses string for job_id; change both fields in messages IncumbentRequest
and Incumbent from bytes to string to ensure a single scalar type for job_id
across the service, and update any serialization/usage code that expects bytes
to accept a string (or perform explicit encoding/decoding at the callsite) so
cross-language conversions remain consistent.
GRPC_ARCHITECTURE.md (1)

9-10: ⚠️ Potential issue | 🟡 Minor

Scope the serialization claim to gRPC wire payloads.

Line 9-Line 10 is too absolute; local pipe framing helpers exist. Please clarify this is specifically about gRPC wire serialization.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@GRPC_ARCHITECTURE.md` around lines 9 - 10, The statement "All serialization
uses protocol buffers generated by `protoc` and `grpc_cpp_plugin` — no custom
serialization logic is implemented." is too absolute; update it to explicitly
scope serialization to gRPC wire payloads (e.g., "All gRPC wire payload
serialization uses protocol buffers generated by `protoc` and
`grpc_cpp_plugin`") and add a brief note acknowledging that local pipe framing
helpers (custom framing/unframing code) are used outside the gRPC wire format;
keep the original wording's intent but clarify that custom framing helpers exist
for local pipes.
SERVER_ARCHITECTURE.md (1)

15-46: ⚠️ Potential issue | 🟡 Minor

Add explicit fence languages to these code blocks.

markdownlint MD040 is still triggered by untyped fences in these sections. Use ```text for ASCII diagrams and table-like flows.

Also applies to: 110-118, 122-135, 139-148, 154-168, 196-198, 207-217

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@SERVER_ARCHITECTURE.md` around lines 15 - 46, The ASCII-art code fences
(e.g., the block starting with the "Main Server Process" diagram that includes
"Shared Memory Queues", "Job Queue", "Result Queue", and worker boxes) are
untyped and trigger markdownlint MD040; update each such fence to use an
explicit text language (replace ``` with ```text) for that diagram and the other
similar untyped blocks (the blocks containing table/flow-style ASCII diagrams
and blocks starting with the same diagram-like contents referenced in the
comment). Ensure every ASCII/table-like fenced block (including the ones around
the ranges called out in the review) is changed to ```text so markdownlint stops
flagging MD040.
cpp/src/grpc/server/grpc_worker_infra.cpp (1)

165-168: ⚠️ Potential issue | 🔴 Critical

Validate PID and handle EINTR in wait_for_workers().

The reap loop at Line 165-Line 168 should skip non-positive PIDs and retry interrupted waits; a stored PID of 0 or -1 changes waitpid semantics (it then waits on arbitrary children in the process group), and an unhandled EINTR return leaves the child unreaped.

Suggested fix
```diff
 void wait_for_workers()
 {
   for (pid_t pid : worker_pids) {
+    if (pid <= 0) continue;
     int status;
-    waitpid(pid, &status, 0);
+    while (waitpid(pid, &status, 0) < 0 && errno == EINTR) {}
   }
   worker_pids.clear();
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_worker_infra.cpp` around lines 165 - 168, In
wait_for_workers(), the reap loop currently calls waitpid(pid, &status, 0) for
every pid in worker_pids without validating pid or handling EINTR; update the
loop to skip non-positive PIDs (pid <= 0) and wrap waitpid(pid, &status, 0) in a
retry loop that re-invokes waitpid when it returns -1 with errno == EINTR, and
only break on a successful return or an error other than EINTR; reference the
worker_pids iteration, the pid variable, the waitpid call and the status/errno
checks when making the change.
cpp/src/grpc/cuopt_remote.proto (1)

6-6: ⚠️ Potential issue | 🟠 Major

Align proto package and directory to satisfy Buf lint.

package cuopt.remote at Line 6 does not match the current file directory (cpp/src/grpc), which triggers PACKAGE_DIRECTORY_MATCH.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/cuopt_remote.proto` at line 6, The proto package declaration in
cuopt_remote.proto ("package cuopt.remote;") does not match the file's directory
and triggers PACKAGE_DIRECTORY_MATCH; update the package line in
cuopt_remote.proto to reflect the directory structure (e.g., "package
cpp.src.grpc;" or another package path that mirrors the directory segments) so
the package-to-directory mapping aligns, then re-run buf lint to verify the
PACKAGE_DIRECTORY_MATCH error is resolved.
cpp/src/grpc/server/grpc_server_main.cpp (5)

157-167: ⚠️ Potential issue | 🔴 Critical

Replace raw memset on queue entries that include atomics.

JobQueueEntry / ResultQueueEntry contain std::atomic members; byte-wise zeroing those objects is undefined behavior.

As per coding guidelines "Prevent thread-unsafe use of global and static variables; use proper mutex/synchronization in server code accessing shared solver state."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 157 - 167, The code
currently zeroes JobQueueEntry and ResultQueueEntry with memset (affecting
job_queue and result_queue), which is undefined because those structs contain
std::atomic members; replace the memset calls by value-initializing or
explicitly initializing each member (e.g., construct entries using
JobQueueEntry() or assign default-constructed instances) and then set the atomic
flags via their store(...) calls (retain the existing store(false)/store(-1)
lines for ready/claimed/cancelled/worker_index and retrieved), ensuring no
byte-wise writes touch atomics; update the initialization loop to use assignment
or placement-new of JobQueueEntry/ResultQueueEntry rather than memset to avoid
UB.

50-80: ⚠️ Potential issue | 🟠 Major

Reject unknown or incomplete CLI flags explicitly.

The parser currently falls through for unrecognized flags and for value-requiring flags missing a value, so the server can continue with unintended defaults.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 50 - 80, The CLI
parser in the main argument loop accepts unknown flags and silently ignores
missing values for flags that require them; update the argument parsing logic
(the for-loop handling argv in grpc_server_main.cpp and the handling of flags
like
"--port"/"--workers"/"--max-message-mb"/"--max-message-bytes"/"--tls-cert"/"--tls-key"/"--tls-root")
so that when a flag that requires a value is encountered and i+1 >= argc you
call print_usage(argv[0]) (or log an explicit error) and return a non-zero exit
code, and when an unrecognized argument starting with '-' is seen you also call
print_usage(argv[0]) (or emit an error) and return non-zero—this ensures unknown
or incomplete CLI flags are rejected instead of silently falling back to
defaults.

89-90: ⚠️ Potential issue | 🔴 Critical

Do not wire POSIX signals to the current async-unsafe handler.

signal_handler (in cpp/src/grpc/server/grpc_server_types.hpp) uses operations like condition-variable notification from signal context, which is not async-signal-safe.

As per coding guidelines "Ensure race conditions are absent in multi-threaded server implementations; verify proper synchronization of shared state."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 89 - 90, The current
wiring of POSIX signals via signal(SIGINT, signal_handler) / signal(SIGTERM,
signal_handler) is unsafe because signal_handler performs async-unsafe
operations (e.g., condition-variable notification); instead, change to a safe
signal handling pattern: install a minimal async-signal-safe handler that only
sets a std::sig_atomic_t or std::atomic_bool shutdown flag (or writes a byte to
a dedicated self-pipe) and return, and move any condition-variable notifications
or complex shutdown logic out of signal_handler into the main thread or a
dedicated watcher that polls the flag or reads the pipe; update references in
grpc_server_main.cpp that call signal(...) and the implementation of
signal_handler in grpc_server_types.hpp to follow this pattern so all
non-async-safe actions occur off the signal context.

234-251: ⚠️ Potential issue | 🔴 Critical

Guard BuildAndStart() failure before calling server->Wait().

BuildAndStart() can return null, but Line 250 unconditionally dereferences server.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 234 - 251,
BuildAndStart() may return nullptr so avoid unguarded dereference of server;
after calling std::unique_ptr<Server> server(builder.BuildAndStart()), check if
(!server) and handle the failure (log an error via std::cerr or similar and
return/exit with non-zero) before starting the shutdown_thread or calling
server->Wait(); also ensure the shutdown_thread lambda still checks server
before calling server->Shutdown() and join the thread if started (use a flag to
indicate thread started or create the thread only when server is valid).

185-212: ⚠️ Potential issue | 🔴 Critical

TLS error branches leak workers/threads/shared memory after startup.

These early returns execute after spawn_workers() and thread creation, so teardown is skipped on TLS configuration/read failures.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 185 - 212, The TLS
error branches return early after spawn_workers()/thread/shared-memory setup and
thus leak resources; either move the TLS validation block to run before
spawn_workers(), or ensure every early-return path calls the teardown routines
(e.g., shutdown_workers(), join_worker_threads(), release_shared_memory() or
whatever the existing cleanup functions are) before returning. Locate the TLS
block (uses config, ssl_opts, read_file_to_string) and change it so that
validation happens prior to calling spawn_workers(), or add calls to the
project's worker/shutdown/cleanup functions immediately before each return in
this block to properly stop threads and free shared memory.
cpp/src/grpc/server/grpc_pipe_serialization.hpp (2)

90-90: ⚠️ Potential issue | 🔴 Critical

Return failure on malformed chunk parse instead of breaking and succeeding.

A parse error currently breaks the loop and still returns true, which accepts partial/corrupt payloads as valid.

🛠️ Minimal fix
```diff
-    if (!google::protobuf::util::ParseDelimitedFromCodedStream(&ac, &coded, &clean_eof)) break;
+    if (!google::protobuf::util::ParseDelimitedFromCodedStream(&ac, &coded, &clean_eof)) {
+      if (clean_eof) { break; }
+      return false;
+    }
 ...
-    if (!google::protobuf::util::ParseDelimitedFromCodedStream(&ac, &coded, &clean_eof)) break;
+    if (!google::protobuf::util::ParseDelimitedFromCodedStream(&ac, &coded, &clean_eof)) {
+      if (clean_eof) { break; }
+      return false;
+    }
```
As per coding guidelines "Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files."

Also applies to: 148-148

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` at line 90, The loop
currently treats any false from
google::protobuf::util::ParseDelimitedFromCodedStream(&ac, &coded, &clean_eof)
as a harmless break and ultimately returns true, which accepts corrupt partial
payloads; change the logic so that when ParseDelimitedFromCodedStream returns
false you check clean_eof and only break if clean_eof is true (clean EOF),
otherwise immediately return false to signal a malformed chunk error; update
this behavior for the ParseDelimitedFromCodedStream call at the shown line and
the duplicate occurrence (around the other occurrence near line 148) so
malformed parses fail instead of succeeding.

93-95: ⚠️ Potential issue | 🟠 Major

Validate total_elements size math before resize().

The current resize paths trust unvalidated total_elements (and multiplication with elem_size), which can overflow or trigger excessive allocation on malformed input.

As per coding guidelines "Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems."

Also applies to: 151-154

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` around lines 93 - 95, The
resize call trusts ac.total_elements() (and the later elem_size * total_elements
math) and can overflow or cause huge allocations; before calling dest.resize()
(both at the dest.empty() check and the duplicate block at the later
occurrence), validate ac.total_elements() and the computed byte_count: cast
total_elements to a fixed-width unsigned type (e.g. uint64_t), check
total_elements > 0 is within a sane upper bound and that elem_size *
total_elements does not overflow size_t (e.g. byte_count = uint64_t(elem_size) *
total_elements; if (byte_count > std::numeric_limits<size_t>::max() ||
byte_count > MAX_ALLOWED_BYTES) return/error), then perform resize using the
checked size_t value; if the validation fails, return an error/handle gracefully
instead of resizing.
cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp (1)

213-218: ⚠️ Potential issue | 🟠 Major

Synchronize server-log state reads/writes consistently.

server_log_path_, server_log_start_pos_, and test_start_marked_ are accessed without a common lock across setters/readers. This can race when log polling and test setup run concurrently.

Suggested fix pattern
```diff
   void set_server_log_path(const std::string& path)
   {
+    std::lock_guard<std::mutex> lock(mutex_);
     server_log_path_      = path;
     server_log_start_pos_ = 0;
     test_start_marked_    = false;
   }

   std::string get_server_logs(bool since_test_start = true) const
   {
-    if (server_log_path_.empty()) { return ""; }
+    std::string path;
+    std::streampos start_pos = 0;
+    bool marked = false;
+    {
+      std::lock_guard<std::mutex> lock(mutex_);
+      path = server_log_path_;
+      start_pos = server_log_start_pos_;
+      marked = test_start_marked_;
+    }
+    if (path.empty()) { return ""; }

-    std::ifstream file(server_log_path_);
+    std::ifstream file(path);
     if (!file.is_open()) { return ""; }

-    if (since_test_start && test_start_marked_ && server_log_start_pos_ > 0) {
-      file.seekg(server_log_start_pos_);
+    if (since_test_start && marked && start_pos > 0) {
+      file.seekg(start_pos);
     }
```
As per coding guidelines: "Prevent thread-unsafe use of global and static variables; use proper mutex/synchronization in server code accessing shared solver state."

Also applies to: 229-244, 332-355

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp` around lines 213
- 218, The three server-log fields (server_log_path_, server_log_start_pos_,
test_start_marked_) are accessed without synchronization; modify
set_server_log_path to acquire a dedicated mutex (e.g., server_log_mutex_) and
set all three fields while holding the lock, and update all other
readers/writers that touch these members (the other setter/getter/polling
functions referenced in the diff ranges) to also lock the same mutex when
reading or writing those fields to prevent races.
cpp/src/grpc/server/grpc_job_management.cpp (2)

109-113: ⚠️ Potential issue | 🟠 Major

Cap incumbent payload size before resizing the buffer.

This path trusts the size prefix and directly calls data.resize(size). A malformed payload can force large allocations and exhaust memory.

Suggested guardrail
```diff
 bool recv_incumbent_pipe(int fd, std::vector<uint8_t>& data)
 {
   uint64_t size;
   if (!read_from_pipe(fd, &size, sizeof(size))) return false;
+  if (size > static_cast<uint64_t>(config.max_message_bytes)) {
+    std::cerr << "[Worker] Incumbent payload too large: " << size << "\n";
+    return false;
+  }
   data.resize(size);
   if (size > 0 && !read_from_pipe(fd, data.data(), size)) return false;
   return true;
 }
```

As per coding guidelines: "Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_job_management.cpp` around lines 109 - 113, The code
currently trusts the size prefix and calls data.resize(size) after
read_from_pipe; introduce a validated upper bound (e.g., a constexpr size_t
kMaxJobPayloadSize) and check that size is non-negative and <=
kMaxJobPayloadSize before calling data.resize(size) and before the subsequent
read_from_pipe; if the size is out of range, return false (or handle as a
protocol error) to prevent unbounded allocation and potential memory exhaustion.
Use the existing symbols size, data.resize, and read_from_pipe to locate and
modify the logic.

147-158: ⚠️ Potential issue | 🔴 Critical

Fix racy queue-slot reservation in async submit.

Slot selection currently performs a check-then-write on shared queue state, so concurrent submissions can reserve the same slot.

Suggested approach (atomic claim)
```diff
   for (size_t i = 0; i < MAX_JOBS; ++i) {
-    if (!job_queue[i].ready && !job_queue[i].claimed) {
+    if (job_queue[i].ready.load()) { continue; }
+    bool expected = false;
+    if (!job_queue[i].claimed.compare_exchange_strong(expected, true)) { continue; }
       copy_cstr(job_queue[i].job_id, job_id);
       job_queue[i].problem_type = is_mip ? 1 : 0;
       ...
-      job_queue[i].claimed      = false;
-      job_queue[i].cancelled    = false;
-      job_queue[i].ready        = true;
+      job_queue[i].cancelled.store(false);
+      job_queue[i].ready.store(true, std::memory_order_release);
+      job_queue[i].claimed.store(false, std::memory_order_release);
```

As per coding guidelines: "Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_job_management.cpp` around lines 147-158: the loop
in submit that does check-then-write on job_queue slots is racy: change
reservation to an atomic claim (e.g., make job_queue[].claimed an
std::atomic<bool> or std::atomic_flag) and perform an atomic test-and-set
(compare_exchange_strong or test_and_set) to reserve the slot before writing any
fields; once the atomic claim succeeds, initialize job_queue[i].job_id,
problem_type, data_size, worker_* etc., then set job_queue[i].ready = true
(non-atomic write) only after fully initialized; ensure headers (<atomic>) and
any needed memory-ordering are used and that no other code resets claimed
without clearing ready in the reverse order.
cpp/src/grpc/client/solve_remote.cu (2)

53-57: ⚠️ Potential issue | 🟠 Major

Guard invalid env overrides before applying gRPC sizes.

CUOPT_CHUNK_SIZE and CUOPT_MAX_MESSAGE_BYTES are applied without sanity checks. Zero/negative values can break transfer behavior at runtime.

Suggested fix
 static void apply_env_overrides(grpc_client_config_t& config)
 {
-  config.chunk_size_bytes  = parse_env_int64("CUOPT_CHUNK_SIZE", config.chunk_size_bytes);
-  config.max_message_bytes = parse_env_int64("CUOPT_MAX_MESSAGE_BYTES", config.max_message_bytes);
+  const auto chunk_size = parse_env_int64("CUOPT_CHUNK_SIZE", config.chunk_size_bytes);
+  const auto max_msg    = parse_env_int64("CUOPT_MAX_MESSAGE_BYTES", config.max_message_bytes);
+  if (chunk_size > 0) { config.chunk_size_bytes = chunk_size; }
+  if (max_msg > 0) { config.max_message_bytes = max_msg; }
   config.enable_debug_log  = (parse_env_int64("CUOPT_GRPC_DEBUG", 0) != 0);

As per coding guidelines: "Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/solve_remote.cu` around lines 53-57: the env overrides
in apply_env_overrides currently accept any parsed value from parse_env_int64
for chunk_size_bytes and max_message_bytes; update apply_env_overrides to
validate the returned values before assignment: call parse_env_int64 for
"CUOPT_CHUNK_SIZE" and "CUOPT_MAX_MESSAGE_BYTES", then only assign to
config.chunk_size_bytes and config.max_message_bytes if the parsed value is > 0
and within sane bounds (e.g., enforce a minimum of 1 and a sensible upper cap to
avoid resource exhaustion); leave enable_debug_log assignment using
parse_env_int64 as-is. Ensure you reference apply_env_overrides,
config.chunk_size_bytes, config.max_message_bytes, and parse_env_int64 when
making the change.

163-166: ⚠️ Potential issue | 🟠 Major

Forward the real best bound to incumbent callbacks.

The current forwarding path sets bound_copy from the incumbent objective, which can report an artificial zero gap to user callbacks. This should carry the solver’s actual best bound from the remote pipeline.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/solve_remote.cu` around lines 163-166: the code
currently sets bound_copy from objective, so get_callback->get_solution(...)
forwards the incumbent objective as the bound and can report a zero gap; change
the initialization of bound_copy to use the solver’s actual best-bound value
from the remote pipeline (replace "bound_copy = objective" with the variable
that holds the solver's best bound), then call
get_callback->get_solution(solution_copy.data(), &obj_copy, &bound_copy,
callback->get_user_data()); this ensures callbacks receive the real best bound
instead of the incumbent objective.
cpp/src/grpc/grpc_settings_mapper.cu (1)

144-146: ⚠️ Potential issue | 🟠 Major

Range-check iteration_limit before narrowing cast.

A large protobuf iteration_limit (int64_t) can overflow when cast to i_t (which may be int32_t), producing an invalid internal limit.

🔧 Suggested guard
   // proto3 defaults numeric fields to 0; treat negative iteration_limit as "unset"
   // so the server keeps the library default (typically max()).
   if (pb_settings.iteration_limit() >= 0) {
-    settings.iteration_limit = static_cast<i_t>(pb_settings.iteration_limit());
+    const auto limit = pb_settings.iteration_limit();
+    if (limit <= static_cast<int64_t>(std::numeric_limits<i_t>::max())) {
+      settings.iteration_limit = static_cast<i_t>(limit);
+    } else {
+      settings.iteration_limit = std::numeric_limits<i_t>::max();
+    }
   }

As per coding guidelines "Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_settings_mapper.cu` around lines 144-146: the narrowing
cast from pb_settings.iteration_limit() to settings.iteration_limit (type i_t)
can overflow; before the static_cast in the block that sets
settings.iteration_limit, range-check the int64 value against the numeric limits
of i_t and either clamp to std::numeric_limits<i_t>::max()/min() or
return/report an error if out-of-range—update the code around the check of
pb_settings.iteration_limit() to perform this validation and only then assign
via static_cast<i_t>.
cpp/src/grpc/client/grpc_client.cu (4)

258-266: ⚠️ Potential issue | 🔴 Critical

Add null stub guard and deadline to RPC methods.

Public RPC methods like check_status dereference impl_->stub without checking if a connection was established. Additionally, the ClientContext lacks a deadline, allowing indefinite blocking on network issues. The connection probe at line 191 correctly demonstrates both patterns.

🔧 Proposed fix pattern (apply to all RPC methods)
 job_status_result_t grpc_client_t::check_status(const std::string& job_id)
 {
   job_status_result_t result;
+  if (!impl_->stub) {
+    result.error_message = "Not connected to server";
+    return result;
+  }

   grpc::ClientContext context;
+  if (config_.timeout_seconds > 0) {
+    context.set_deadline(std::chrono::system_clock::now() +
+                         std::chrono::seconds(config_.timeout_seconds));
+  }
   auto request = build_status_request(job_id);

Apply the same pattern to: wait_for_completion, cancel_job, delete_job, get_incumbents, stream_logs, and all chunked upload/download methods.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 258-266: the method
grpc_client_t::check_status currently dereferences impl_->stub and creates a
grpc::ClientContext without a deadline; add a null-stub guard at the start
(check impl_ and impl_->stub and return a job_status_result_t error/result
indicating "not connected") before calling build_status_request/CheckStatus, and
set a deadline on the grpc::ClientContext (use std::chrono::... now()+configured
RPC timeout) so the RPC won't block indefinitely; follow the same pattern for
wait_for_completion, cancel_job, delete_job, get_incumbents, stream_logs and all
chunked upload/download methods, using the same error/result handling style used
by the connection probe at line 191.

991-995: ⚠️ Potential issue | 🔴 Critical

Validate poll_interval_ms before division.

Line 994 computes max_polls by dividing by config_.poll_interval_ms. If this value is <= 0, the division will either crash or produce invalid polling behavior. The same issue exists at line 1132.

🔧 Proposed validation
   } else {
     CUOPT_LOG_INFO("[grpc_client] Using polling (CheckStatus) for job %s", job_id.c_str());
+    if (config_.poll_interval_ms <= 0) {
+      stop_log_streaming();
+      result.error_message = "Invalid configuration: poll_interval_ms must be > 0";
+      return result;
+    }
     int poll_count = 0;
     int max_polls  = (config_.timeout_seconds * 1000) / config_.poll_interval_ms;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 991-995: the code computes
max_polls by dividing by config_.poll_interval_ms without validating it; add a
guard in the polling branch (where CUOPT_LOG_INFO logs "Using polling
(CheckStatus) for job" and any other similar polling block around the second
occurrence) to check config_.poll_interval_ms > 0 and handle invalid values: log
an error via CUOPT_LOG_ERROR (include job_id/context), set a safe default (e.g.,
poll_interval_ms = 1) or return/fail early, then compute max_polls using the
validated/clamped value; update both locations that reference
config_.poll_interval_ms to use the validated variable.

702-712: ⚠️ Potential issue | 🔴 Critical

Validate chunked result metadata before arithmetic operations.

element_size_bytes is used at line 708 without checking for <= 0, causing division by zero. Additionally, total_elems * elem_size at line 711 can overflow before allocation, making subsequent memcpy operations unsafe.

🔧 Proposed bounds validation
   for (const auto& arr_desc : header->arrays()) {
     auto field_id       = arr_desc.field_id();
     int64_t total_elems = arr_desc.total_elements();
     int64_t elem_size   = arr_desc.element_size_bytes();
     if (total_elems <= 0) continue;
+    if (elem_size <= 0) {
+      last_error_ = "Invalid chunk metadata: non-positive element_size_bytes";
+      return false;
+    }
+    // Check for overflow before allocation
+    if (total_elems > std::numeric_limits<int64_t>::max() / elem_size) {
+      last_error_ = "Invalid chunk metadata: total byte size overflow";
+      return false;
+    }
+    int64_t total_bytes = total_elems * elem_size;
+    if (static_cast<uint64_t>(total_bytes) > std::numeric_limits<size_t>::max()) {
+      last_error_ = "Invalid chunk metadata: exceeds addressable memory";
+      return false;
+    }

     int64_t elems_per_chunk = chunk_data_budget / elem_size;

As per coding guidelines "Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 702-712: the loop over
header->arrays() uses arr_desc.element_size_bytes() and multiplies it by
arr_desc.total_elements() without validation; add checks that
element_size_bytes() > 0 (return/continue or log and skip if not) before
computing elems_per_chunk to avoid division by zero, compute elems_per_chunk =
max(1, chunk_data_budget / elem_size) only after that check, and validate that
total_elements() * element_size_bytes() cannot overflow size_t before allocating
array_bytes (e.g., if total_elems > SIZE_MAX / elem_size then handle as
error/skip or cap/stream allocation); reference arr_desc.element_size_bytes(),
arr_desc.total_elements(), chunk_data_budget, elems_per_chunk, and the
allocation of array_bytes to implement these guards.

221-252: ⚠️ Potential issue | 🔴 Critical

Log streaming thread can hang indefinitely on shutdown.

The stop_log_streaming() method sets the stop_logs_ flag and joins the thread, but reader->Read() at line 232 blocks indefinitely if the server stops sending messages. The TryCancel() call at line 234 only executes after Read() returns, not before. You need to call context.TryCancel() from the main thread before joining.

🔧 Proposed fix using context cancellation from stop_log_streaming

Add member variables to the class (in grpc_client.hpp):

std::mutex log_context_mutex_;
grpc::ClientContext* active_log_context_ = nullptr;

Then update the methods:

 void grpc_client_t::start_log_streaming(const std::string& job_id)
 {
   if (!config_.stream_logs || !config_.log_callback) return;

   stop_logs_.store(false);
   log_thread_ = std::make_unique<std::thread>([this, job_id]() {
     grpc::ClientContext context;
+    {
+      std::lock_guard<std::mutex> lk(log_context_mutex_);
+      active_log_context_ = &context;
+    }
     auto request = build_stream_logs_request(job_id, 0);
     auto reader  = impl_->stub->StreamLogs(&context, request);

     cuopt::remote::LogMessage log_msg;
     while (reader->Read(&log_msg)) {
       if (stop_logs_.load()) {
-        context.TryCancel();
         break;
       }
       if (config_.log_callback) { config_.log_callback(log_msg.line()); }
       if (log_msg.job_complete()) { break; }
     }
     reader->Finish();
+    {
+      std::lock_guard<std::mutex> lk(log_context_mutex_);
+      active_log_context_ = nullptr;
+    }
   });
 }

 void grpc_client_t::stop_log_streaming()
 {
   stop_logs_.store(true);
+  {
+    std::lock_guard<std::mutex> lk(log_context_mutex_);
+    if (active_log_context_) { active_log_context_->TryCancel(); }
+  }
   if (log_thread_ && log_thread_->joinable()) { log_thread_->join(); }
   log_thread_.reset();
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 221-252: the log streaming
thread can block in reader->Read() so add a cancellable context pointer
protected by a mutex: introduce members std::mutex log_context_mutex_ and
grpc::ClientContext* active_log_context_ = nullptr; in grpc_client_t, then in
start_log_streaming() create the local grpc::ClientContext context as before but
immediately register it by locking log_context_mutex_ and setting
active_log_context_ = &context before calling impl_->stub->StreamLogs(...);
after the read/Finish() loop clear the pointer (lock and set to nullptr). In
stop_log_streaming() lock log_context_mutex_, if active_log_context_ != nullptr
call active_log_context_->TryCancel() (to wake up reader->Read()) before joining
the thread, then proceed to join/reset log_thread_ and set stop_logs_. Ensure
all accesses to active_log_context_ are mutex-protected to avoid race conditions
and that the pointer only points to the stack context while the thread is
running.
cpp/src/grpc/server/grpc_server_types.hpp (1)

300-308: ⚠️ Potential issue | 🔴 Critical

Remove non-async-signal-safe operations from signal handler.

std::cout and condition_variable::notify_all() are not async-signal-safe and can deadlock or corrupt state when called from a POSIX signal handler. The signal handler should only perform atomic writes.

🔧 Suggested approach
 inline void signal_handler(int signal)
 {
   if (signal == SIGINT || signal == SIGTERM) {
-    std::cout << "\n[gRPC Server] Received shutdown signal\n";
     keep_running = false;
     if (shm_ctrl) { shm_ctrl->shutdown_requested = true; }
-    result_cv.notify_all();
   }
 }

Move logging and result_cv.notify_all() to a dedicated shutdown-watcher thread that polls keep_running == false and then performs the cleanup.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_types.hpp` around lines 300-308: the
signal_handler currently performs non-async-signal-safe operations (std::cout
and result_cv.notify_all()), which must be removed; change signal_handler to
only perform atomic-safe actions: set keep_running = false and, if shm_ctrl
exists, set shm_ctrl->shutdown_requested = true (both already present), and
return immediately; then implement a separate shutdown-watcher thread (e.g.,
spawned at server startup) that polls or waits on an atomic/condition to detect
keep_running == false and from that thread perform logging (std::cout or logging
framework), call result_cv.notify_all(), and any other cleanup work — locate
signal_handler, keep_running, shm_ctrl, and result_cv to update behavior and add
the watcher thread to the server lifecycle.
cpp/src/grpc/grpc_problem_mapper.cu (1)

399-427: ⚠️ Potential issue | 🔴 Critical

Add alignment validation before memcpy in chunked array decoding.

The get_doubles and get_ints lambdas compute the element count via integer division but then memcpy the full byte array. If the payload size is not a multiple of the element size, the copy writes past the end of the destination vector, which holds only ⌊bytes/element⌋ elements, corrupting adjacent heap memory.

🔧 Proposed alignment checks
   auto get_doubles = [&](int32_t field_id) -> std::vector<f_t> {
     auto it = arrays.find(field_id);
     if (it == arrays.end() || it->second.empty()) return {};
+    if (it->second.size() % sizeof(double) != 0) return {};  // Misaligned data
     size_t n = it->second.size() / sizeof(double);
     if constexpr (std::is_same_v<f_t, double>) {
       std::vector<double> v(n);
-      std::memcpy(v.data(), it->second.data(), it->second.size());
+      std::memcpy(v.data(), it->second.data(), n * sizeof(double));
       return v;
     } else {
       // ...
     }
   };

   auto get_ints = [&](int32_t field_id) -> std::vector<i_t> {
     auto it = arrays.find(field_id);
     if (it == arrays.end() || it->second.empty()) return {};
+    if (it->second.size() % sizeof(int32_t) != 0) return {};  // Misaligned data
     size_t n = it->second.size() / sizeof(int32_t);
     // ...
   };

As per coding guidelines "Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_problem_mapper.cu` around lines 399-427: the lambdas
get_doubles and get_ints must validate byte alignment before memcpy: compute
size_t bytes = it->second.size(); if bytes == 0 return {}; if (bytes %
sizeof(double) != 0) (for get_doubles) or bytes % sizeof(int32_t) != 0 (for
get_ints) then handle the misaligned payload (log/return error/throw) instead of
blindly memcpy; use size_t n = bytes / sizeof(...) and pass bytes (or n *
sizeof(...)) to memcpy rather than it->second.size(); update references in
get_doubles/get_ints and ensure the branch where types differ still copies only
the aligned byte count and converts only n elements.
🧹 Nitpick comments (3)
cpp/src/grpc/server/grpc_field_element_size.hpp (1)

17-29: Avoid silent fallback for unknown ArrayFieldId.

At Line 28, default: return 8; can mask unsupported/new enum values and mis-size chunk parsing. Prefer an explicit invalid path so callers can reject the request.

Suggested change
 inline int64_t array_field_element_size(cuopt::remote::ArrayFieldId field_id)
 {
   switch (field_id) {
@@
-    default: return 8;
+    default: return -1;  // unknown field id; caller should fail validation
   }
 }
As per coding guidelines: "Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_field_element_size.hpp` around lines 17-29: the
function array_field_element_size(cuopt::remote::ArrayFieldId) silently falls
back to return 8 for unknown enum values; change it to explicitly handle
unsupported/new enums by logging/propagating an error or failing fast (e.g.,
throw std::invalid_argument or assert) instead of returning 8 so callers can
reject invalid requests and avoid mis-sized parsing; update
array_field_element_size to remove the silent default and return/throw a clear
failure for unexpected cuopt::remote::ArrayFieldId values and ensure callers
handle that error path.
build.sh (1)

7-7: Enable unbound-variable checks in the script preamble.

At Line 7, prefer set -eu to catch unset variable usage early.

Suggested change
-set -e
+set -eu
Based on learnings: "In this repository, prefer using 'set -u' in Bash scripts to detect unbound variables and rely on the default unbound-variable error messages rather than implementing explicit guards with custom error messages."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@build.sh` at line 7: the script preamble currently uses "set -e"; change this
to enable unbound-variable checks by replacing that occurrence with "set -eu"
(i.e., add the -u flag) so the shell will error on unset variables; audit any
places that intentionally rely on unset variables and initialize or guard them
accordingly (locate the "set -e" token in build.sh and update it).
cpp/src/grpc/server/grpc_server_threads.cpp (1)

166-182: Deserialize outside tracker_mutex to reduce contention.

deserialize_result_pipe_blob(...) can be expensive; parsing before taking the lock and only committing parsed state under lock will improve concurrency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_threads.cpp` around lines 166-182:
deserialize_result_pipe_blob(...) is done while holding tracker_mutex causing
unnecessary contention; move the expensive deserialize step out of the critical
section by first copying result_data and success locally, call
deserialize_result_pipe_blob(result_data.data(), result_data.size(), hdr,
arrays) before acquiring tracker_mutex, then acquire lock(tracker_mutex), find
job_tracker entry by job_id, verify it still exists and update it: set status to
COMPLETED or FAILED (JobStatus), set error_message on failure, or assign
result_header, result_arrays and result_size_bytes on success; ensure hdr and
arrays are moved into it->second to avoid copying and that result_size_bytes is
set from result_data.size().

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d12476d8-baa0-42cb-9006-1d214ffa2716

📥 Commits

Reviewing files that changed from the base of the PR and between 612de9a and d228718.

📒 Files selected for processing (38)
  • GRPC_ARCHITECTURE.md
  • SERVER_ARCHITECTURE.md
  • build.sh
  • cpp/CMakeLists.txt
  • cpp/cuopt_grpc_server.cpp
  • cpp/src/grpc/client/grpc_client.cu
  • cpp/src/grpc/client/grpc_client.hpp
  • cpp/src/grpc/client/solve_remote.cu
  • cpp/src/grpc/client/test_grpc_client.cpp
  • cpp/src/grpc/cuopt_remote.proto
  • cpp/src/grpc/cuopt_remote_service.proto
  • cpp/src/grpc/grpc_problem_mapper.cu
  • cpp/src/grpc/grpc_problem_mapper.hpp
  • cpp/src/grpc/grpc_service_mapper.cu
  • cpp/src/grpc/grpc_service_mapper.hpp
  • cpp/src/grpc/grpc_settings_mapper.cu
  • cpp/src/grpc/grpc_settings_mapper.hpp
  • cpp/src/grpc/grpc_solution_mapper.cu
  • cpp/src/grpc/grpc_solution_mapper.hpp
  • cpp/src/grpc/server/grpc_field_element_size.hpp
  • cpp/src/grpc/server/grpc_incumbent_proto.hpp
  • cpp/src/grpc/server/grpc_job_management.cpp
  • cpp/src/grpc/server/grpc_pipe_serialization.hpp
  • cpp/src/grpc/server/grpc_server_main.cpp
  • cpp/src/grpc/server/grpc_server_threads.cpp
  • cpp/src/grpc/server/grpc_server_types.hpp
  • cpp/src/grpc/server/grpc_service_impl.cpp
  • cpp/src/grpc/server/grpc_worker.cpp
  • cpp/src/grpc/server/grpc_worker_infra.cpp
  • cpp/src/pdlp/CMakeLists.txt
  • cpp/src/pdlp/solve_remote.cu
  • cpp/tests/linear_programming/CMakeLists.txt
  • cpp/tests/linear_programming/grpc/CMakeLists.txt
  • cpp/tests/linear_programming/grpc/grpc_client_test.cpp
  • cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp
  • cpp/tests/linear_programming/grpc/grpc_integration_test.cpp
  • cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp
  • python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
💤 Files with no reviewable changes (2)
  • cpp/src/pdlp/CMakeLists.txt
  • cpp/src/pdlp/solve_remote.cu
🚧 Files skipped from review as they are similar to previous changes (5)
  • cpp/src/grpc/server/grpc_incumbent_proto.hpp
  • python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
  • cpp/src/grpc/server/grpc_service_impl.cpp
  • cpp/src/grpc/server/grpc_worker.cpp
  • cpp/src/grpc/grpc_solution_mapper.cu

Comment on lines +279 to +294
if(TARGET protobuf::protoc)
get_target_property(_PROTOBUF_PROTOC protobuf::protoc IMPORTED_LOCATION_RELEASE)
if(NOT _PROTOBUF_PROTOC)
get_target_property(_PROTOBUF_PROTOC protobuf::protoc IMPORTED_LOCATION)
endif()
else()
find_package(protobuf CONFIG REQUIRED)
get_target_property(_PROTOBUF_PROTOC protobuf::protoc IMPORTED_LOCATION_RELEASE)
if(NOT _PROTOBUF_PROTOC)
get_target_property(_PROTOBUF_PROTOC protobuf::protoc IMPORTED_LOCATION)
endif()
endif()

if(NOT _PROTOBUF_PROTOC)
message(FATAL_ERROR "protoc not found (Protobuf_PROTOC_EXECUTABLE is empty)")
endif()

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

head -300 cpp/CMakeLists.txt | tail -30

Repository: NVIDIA/cuopt

Length of output: 1081


🏁 Script executed:

sed -n '270,300p' cpp/CMakeLists.txt

Repository: NVIDIA/cuopt

Length of output: 1118


🏁 Script executed:

cat > /tmp/verify_cmake.py << 'EOF'
# Verify CMake semantics documentation for IMPORTED_LOCATION behavior
print("""
CMake Semantics Verification:

1. IMPORTED_LOCATION and IMPORTED_LOCATION_<CONFIG>:
   - These properties are specific to IMPORTED targets
   - They store the filesystem path to the imported library/executable
   - For non-IMPORTED (regular) targets, these properties are not set by default
   - Reading them from non-IMPORTED targets returns empty/NOTFOUND

2. $<TARGET_FILE:target>:
   - Generator expression that works with both IMPORTED and non-IMPORTED targets
   - For IMPORTED targets: resolves to the path specified by IMPORTED_LOCATION variants
   - For non-IMPORTED targets: resolves to the build directory path
   - However: Must NOT be used in string contexts before generator evaluation
   - This is a generator expression, evaluated at generation time, not configuration time

3. Key issue with the suggested fix:
   - set(_PROTOBUF_PROTOC $<TARGET_FILE:protobuf::protoc>) 
   - This stores a generator expression STRING, not an actual path
   - If _PROTOBUF_PROTOC is used in an add_custom_command() COMMAND clause, it works
   - If used in string comparisons or other non-generator contexts, it will fail
""")
EOF
python /tmp/verify_cmake.py

Repository: NVIDIA/cuopt

Length of output: 1178


🏁 Script executed:

rg -n "_PROTOBUF_PROTOC" cpp/CMakeLists.txt

Repository: NVIDIA/cuopt

Length of output: 550


🏁 Script executed:

sed -n '294,400p' cpp/CMakeLists.txt | head -50

Repository: NVIDIA/cuopt

Length of output: 1985


🏁 Script executed:

grep -n "_PROTOBUF_PROTOC" cpp/CMakeLists.txt -A 2 -B 2

Repository: NVIDIA/cuopt

Length of output: 1254


🏁 Script executed:

cat > /tmp/cmake_check.md << 'EOF'
# CMake Generator Expression Constraints

## When $<TARGET_FILE:...> works:
1. In add_custom_command(COMMAND ...) - YES
2. In add_custom_target(COMMAND ...) - YES
3. In file(WRITE ...) - NO (configuration time)
4. In string operations before usage - NO
5. In find_program() - NO
6. As a plain string variable - NO

## For the code in question:
The variable _PROTOBUF_PROTOC appears to be used to locate the protoc executable.
If it's passed to add_custom_command(), generator expressions work fine.
If it's used in conditional logic or string tests, generator expressions will fail.

## Alternative approaches:
1. Use $<TARGET_FILE:> only if usage context allows
2. Use file(REAL_PATH target_file) after getting the location
3. Keep the current IMPORTED_LOCATION approach but handle non-imported targets
4. Check if target is IMPORTED first, then use appropriate method
EOF
cat /tmp/cmake_check.md

Repository: NVIDIA/cuopt

Length of output: 925


🏁 Script executed:

sed -n '294,500p' cpp/CMakeLists.txt | head -100

Repository: NVIDIA/cuopt

Length of output: 3254


🏁 Script executed:

grep -n "PROTOBUF_PROTOC" cpp/CMakeLists.txt

Repository: NVIDIA/cuopt

Length of output: 550


🏁 Script executed:

rg -B5 -A10 "PROTOBUF_PROTOC" cpp/CMakeLists.txt | head -100

Repository: NVIDIA/cuopt

Length of output: 2368


🏁 Script executed:

cat > /tmp/test_cmake_behavior.py << 'EOF'
import subprocess
import tempfile
import os

# Create a minimal CMakeLists.txt to test behavior
cmake_test = '''
cmake_minimum_required(VERSION 3.15)
project(test)

# Create an executable target (non-imported)
add_executable(my_exe main.cpp)

# Try to get IMPORTED_LOCATION from non-imported target
get_target_property(LOC my_exe IMPORTED_LOCATION)
message("IMPORTED_LOCATION for non-imported: ${LOC}")

# Try TARGET_FILE
message("TARGET_FILE: $<TARGET_FILE:my_exe>")
'''

main_cpp = '''
int main() { return 0; }
'''

result = """
CMake Behavior Test Results:
============================

For non-IMPORTED executable targets:
1. get_target_property(... IMPORTED_LOCATION) returns empty/NOTFOUND
2. $<TARGET_FILE:target> returns a generator expression string that will resolve at build time

Key findings:
- IMPORTED_LOCATION only works for IMPORTED targets
- Non-IMPORTED targets don't have IMPORTED_LOCATION set
- $<TARGET_FILE:> works for both but is a generator expression, not evaluated at config time
- If the variable is used in add_custom_command(), generator expressions are safe
- If used in conditionals or string comparisons, generator expressions won't work as expected
"""

print(result)
EOF
python /tmp/test_cmake_behavior.py

Repository: NVIDIA/cuopt

Length of output: 709


🏁 Script executed:

wc -l cpp/CMakeLists.txt

Repository: NVIDIA/cuopt

Length of output: 78


🏁 Script executed:

sed -n '279,350p' cpp/CMakeLists.txt

Repository: NVIDIA/cuopt

Length of output: 2701


🏁 Script executed:

grep -n "protoc" cpp/CMakeLists.txt | head -20

Repository: NVIDIA/cuopt

Length of output: 640


🏁 Script executed:

sed -n '285,400p' cpp/CMakeLists.txt

Repository: NVIDIA/cuopt

Length of output: 3783


🏁 Script executed:

rg -i "fetchcontent|add_subdirectory.*protobuf" cpp/CMakeLists.txt

Repository: NVIDIA/cuopt

Length of output: 264


🏁 Script executed:

grep -n "find_package\|add_subdirectory" cpp/CMakeLists.txt | head -20

Repository: NVIDIA/cuopt

Length of output: 568


🏁 Script executed:

sed -n '250,280p' cpp/CMakeLists.txt

Repository: NVIDIA/cuopt

Length of output: 1303


🏁 Script executed:

sed -n '230,320p' cpp/CMakeLists.txt

Repository: NVIDIA/cuopt

Length of output: 3181


🏁 Script executed:

sed -n '295,350p' cpp/CMakeLists.txt

Repository: NVIDIA/cuopt

Length of output: 2109


🏁 Script executed:

cat > /tmp/analyze_protoc.md << 'EOF'
# Analysis of protoc Resolution Review

## Current Code (lines 279-294):
1. If TARGET protobuf::protoc exists:
   - Try IMPORTED_LOCATION_RELEASE property
   - If not set, try IMPORTED_LOCATION property
2. Else:
   - find_package(protobuf CONFIG REQUIRED)
   - Try IMPORTED_LOCATION_RELEASE property
   - If not set, try IMPORTED_LOCATION property
3. Check if _PROTOBUF_PROTOC was set, error if not

## Review's Claim:
- For non-imported targets, IMPORTED_LOCATION* is empty
- This causes a false fatal error
- Solution: use $<TARGET_FILE:protobuf::protoc>

## Critical Issue with the Claim:
- The code says "if(TARGET protobuf::protoc)" which checks if target exists
- Then it gets IMPORTED_LOCATION properties
- For non-imported targets created via add_executable, these properties would be empty
- However: find_package(protobuf CONFIG) should create IMPORTED targets
- add_subdirectory() or FetchContent would create non-IMPORTED targets

## Key Questions:
1. Can protobuf::protoc be non-IMPORTED in practice?
   - YES: if using add_subdirectory(protobuf) or FetchContent
   - This is possible if protobuf is built in-tree

2. Does $<TARGET_FILE:> work in the usage context?
   - Need to see where _PROTOBUF_PROTOC is used
   - If in add_custom_command(COMMAND), YES
   - Otherwise, probably NOT

## The Real Issue:
The code doesn't handle the case where:
- protobuf::protoc exists (non-IMPORTED)
- IMPORTED_LOCATION* are all empty
- This would incorrectly trigger FATAL_ERROR
EOF
cat /tmp/analyze_protoc.md

Repository: NVIDIA/cuopt

Length of output: 1536


🏁 Script executed:

rg "_PROTOBUF_PROTOC" cpp/ -A 3 -B 1

Repository: NVIDIA/cuopt

Length of output: 1642


🏁 Script executed:

sed -n '300,320p' cpp/CMakeLists.txt

Repository: NVIDIA/cuopt

Length of output: 807


🏁 Script executed:

cat > /tmp/final_verification.py << 'EOF'
import sys

analysis = """
VERIFICATION RESULTS:
=====================

1. CURRENT CODE BEHAVIOR (lines 279-294):
   - Assumes protobuf::protoc target has IMPORTED_LOCATION properties
   - For IMPORTED targets from find_package(): WORKS
   - For non-IMPORTED targets (add_subdirectory/FetchContent): FAILS
   - Results in false FATAL_ERROR "protoc not found"

2. REVIEWER'S CLAIM: ACCURATE
   - Non-imported targets don't have IMPORTED_LOCATION properties
   - This is a real issue if protobuf is built in-tree
   
3. PROPOSED SOLUTION VALIDITY:
   - Using $<TARGET_FILE:protobuf::protoc> is VALID
   - Works for both IMPORTED and non-IMPORTED targets
   - Context check: used in add_custom_command(COMMAND ...) - VALID
   - Generator expressions ARE evaluated in COMMAND context

4. CORRECTNESS OF SUGGESTED DIFF:
   - Replaces multiple IMPORTED_LOCATION lookups with single $<TARGET_FILE:...>
   - Simplifies logic: removes redundant fallback checks
   - Works for both configuration scenarios

5. POTENTIAL CONCERNS:
   - The diff in the review appears incomplete or slightly off
   - The else() branch has redundant logic addition
   - But the core approach ($<TARGET_FILE:>) is sound

VERDICT: Review comment is SUBSTANTIVELY CORRECT but diff may need refinement
"""

print(analysis)
EOF
python /tmp/final_verification.py

Repository: NVIDIA/cuopt

Length of output: 1298


🏁 Script executed:

cat > /tmp/verify_logic.py << 'EOF'
# Analyze the proposed logic

current = """
CURRENT:
if(TARGET protobuf::protoc)
  # try IMPORTED_LOCATION_RELEASE
  # else try IMPORTED_LOCATION
else()
  find_package(protobuf CONFIG REQUIRED)
  # try IMPORTED_LOCATION_RELEASE
  # else try IMPORTED_LOCATION
endif()
"""

proposed = """
PROPOSED:
if(TARGET protobuf::protoc)
  set(_PROTOBUF_PROTOC $<TARGET_FILE:protobuf::protoc>)
else()
  find_package(protobuf CONFIG REQUIRED)
  if(TARGET protobuf::protoc)
    set(_PROTOBUF_PROTOC $<TARGET_FILE:protobuf::protoc>)
  endif()
endif()
"""

analysis = """
LOGIC ANALYSIS:

Current Code Flow:
1. Check if target exists
2. If yes: try to get IMPORTED_LOCATION
3. If no: find_package and try to get IMPORTED_LOCATION
4. If all empty: FATAL_ERROR

Proposed Code Flow:
1. Check if target exists
2. If yes: use TARGET_FILE generator expression
3. If no: find_package, then check target again, use TARGET_FILE
4. If not created by find_package: FATAL_ERROR

OBSERVATION: The proposed code is logically equivalent and simpler
- Removes config-specific property lookups
- Uses universal TARGET_FILE instead
- Handles both IMPORTED and non-IMPORTED targets
- Still maintains the final error check (implicit: if not set, error)

ISSUE: The final error check at line 293-294 is still needed!
The diff doesn't show it, but the original has:
  if(NOT _PROTOBUF_PROTOC)
    message(FATAL_ERROR "protoc not found...")
  endif()

This check MUST remain or the logic is incomplete.
"""

print(analysis)
EOF
python /tmp/verify_logic.py

Repository: NVIDIA/cuopt

Length of output: 965


🏁 Script executed:

cat > /tmp/protobuf_config.txt << 'EOF'
CMake protobuf CONFIG packages:
- protobuf_FOUND should be set
- Creates IMPORTED targets: protobuf::protoc, protobuf::libprotobuf
- These are IMPORTED targets with IMPORTED_LOCATION set

FetchContent/add_subdirectory protobuf:
- May create non-IMPORTED targets
- Would not have IMPORTED_LOCATION properties

The review's scenario is valid and real.
EOF
cat /tmp/protobuf_config.txt

Repository: NVIDIA/cuopt

Length of output: 405


Use $<TARGET_FILE:protobuf::protoc> to resolve protoc across both imported and non-imported target types.

At lines 279–294, the code assumes protobuf::protoc has IMPORTED_LOCATION* properties. This fails when protobuf is built in-tree (e.g., via FetchContent or add_subdirectory), where the target exists but lacks these properties, triggering a false fatal error. The $<TARGET_FILE:…> generator expression resolves the path for both imported and non-imported targets and simplifies the logic.

Suggested CMake adjustment
 if(TARGET protobuf::protoc)
-  get_target_property(_PROTOBUF_PROTOC protobuf::protoc IMPORTED_LOCATION_RELEASE)
-  if(NOT _PROTOBUF_PROTOC)
-    get_target_property(_PROTOBUF_PROTOC protobuf::protoc IMPORTED_LOCATION)
-  endif()
+  set(_PROTOBUF_PROTOC $<TARGET_FILE:protobuf::protoc>)
 else()
   find_package(protobuf CONFIG REQUIRED)
-  get_target_property(_PROTOBUF_PROTOC protobuf::protoc IMPORTED_LOCATION_RELEASE)
-  if(NOT _PROTOBUF_PROTOC)
-    get_target_property(_PROTOBUF_PROTOC protobuf::protoc IMPORTED_LOCATION)
-  endif()
+  if(TARGET protobuf::protoc)
+    set(_PROTOBUF_PROTOC $<TARGET_FILE:protobuf::protoc>)
+  endif()
 endif()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:
```cmake
if(TARGET protobuf::protoc)
  get_target_property(_PROTOBUF_PROTOC protobuf::protoc IMPORTED_LOCATION_RELEASE)
  if(NOT _PROTOBUF_PROTOC)
    get_target_property(_PROTOBUF_PROTOC protobuf::protoc IMPORTED_LOCATION)
  endif()
else()
  find_package(protobuf CONFIG REQUIRED)
  get_target_property(_PROTOBUF_PROTOC protobuf::protoc IMPORTED_LOCATION_RELEASE)
  if(NOT _PROTOBUF_PROTOC)
    get_target_property(_PROTOBUF_PROTOC protobuf::protoc IMPORTED_LOCATION)
  endif()
endif()
if(NOT _PROTOBUF_PROTOC)
  message(FATAL_ERROR "protoc not found (Protobuf_PROTOC_EXECUTABLE is empty)")
endif()
```

After:
```cmake
if(TARGET protobuf::protoc)
  set(_PROTOBUF_PROTOC $<TARGET_FILE:protobuf::protoc>)
else()
  find_package(protobuf CONFIG REQUIRED)
  if(TARGET protobuf::protoc)
    set(_PROTOBUF_PROTOC $<TARGET_FILE:protobuf::protoc>)
  endif()
endif()
if(NOT _PROTOBUF_PROTOC)
  message(FATAL_ERROR "protoc not found (Protobuf_PROTOC_EXECUTABLE is empty)")
endif()
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/CMakeLists.txt` around lines 279 - 294, The current logic tries to read
IMPORTED_LOCATION/IMPORTED_LOCATION_RELEASE from the protobuf::protoc target and
falsely fails when protobuf is an in-tree target; replace the multi-branch
property lookups by resolving the executable via the generator expression
$<TARGET_FILE:protobuf::protoc> (assign to _PROTOBUF_PROTOC or set
Protobuf_PROTOC_EXECUTABLE) and keep the fallback find_package path only if that
result is empty, finally erroring if the resolved value (e.g. _PROTOBUF_PROTOC
or Protobuf_PROTOC_EXECUTABLE) is still empty; update references to
protobuf::protoc and _PROTOBUF_PROTOC accordingly.

Comment on lines +401 to +403
} else if (arg == "--time-limit" && arg_idx + 1 < argc) {
time_limit = std::stod(argv[++arg_idx]);
arg_idx++;

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Find the test_grpc_client.cpp file
find . -name "test_grpc_client.cpp" -type f 2>/dev/null | head -20

Repository: NVIDIA/cuopt

Length of output: 98


🏁 Script executed:

# Check if the file exists and get its line count
if [ -f "cpp/src/grpc/client/test_grpc_client.cpp" ]; then
  wc -l cpp/src/grpc/client/test_grpc_client.cpp
else
  echo "File not found at expected path"
fi

Repository: NVIDIA/cuopt

Length of output: 100


🏁 Script executed:

# Examine lines around 401-403
if [ -f "cpp/src/grpc/client/test_grpc_client.cpp" ]; then
  sed -n '395,410p' cpp/src/grpc/client/test_grpc_client.cpp
fi

Repository: NVIDIA/cuopt

Length of output: 615


🏁 Script executed:

# Examine lines around 531
if [ -f "cpp/src/grpc/client/test_grpc_client.cpp" ]; then
  sed -n '525,540p' cpp/src/grpc/client/test_grpc_client.cpp
fi

Repository: NVIDIA/cuopt

Length of output: 492


🏁 Script executed:

# Search for all std::stod and std::stoll calls in the file
if [ -f "cpp/src/grpc/client/test_grpc_client.cpp" ]; then
  echo "=== std::stod calls ===" && \
  rg "std::stod|std::stoll" cpp/src/grpc/client/test_grpc_client.cpp -n -B2 -A2
fi

Repository: NVIDIA/cuopt

Length of output: 503


Add exception handling for numeric CLI argument parsing.

std::stod and std::stoll throw on invalid input; currently they can terminate this tool abruptly instead of returning a normal CLI error.

Add try-catch blocks around numeric parsing at line 402 (--time-limit) and line 531 (incumbent mode index) to handle std::invalid_argument and std::out_of_range, then return an error code with a descriptive message. Additionally, validate that time_limit is positive before accepting it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/test_grpc_client.cpp` around lines 401 - 403, Wrap the
numeric CLI parses in main where std::stod is used for "--time-limit" and where
std::stoll is used for the incumbent mode index in try-catch blocks that catch
std::invalid_argument and std::out_of_range; on exception print a descriptive
error to stderr (including the offending argument text and which flag failed)
and return a non-zero exit code instead of allowing an exception to propagate.
After parsing time_limit (the variable assigned from std::stod), validate it is
> 0 and if not print an error and return non-zero. Apply the same guarded
pattern around the incumbent-mode parsing (the std::stoll call) so invalid or
out-of-range inputs are handled gracefully. Ensure messages clearly reference
the flag name ("--time-limit" or the incumbent-mode flag) and the invalid value.

Comment on lines +120 to +124
if (pid < 0) {
std::cerr << "[Server] Failed to fork " << (is_replacement ? "replacement worker " : "worker ")
<< worker_id << "\n";
close_worker_pipes_server(worker_id);
return -1;

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cd /repo && find . -name "grpc_worker_infra.cpp" -type f

Repository: NVIDIA/cuopt

Length of output: 111


🏁 Script executed:

head -150 cpp/src/grpc/server/grpc_worker_infra.cpp | tail -40

Repository: NVIDIA/cuopt

Length of output: 1699


🏁 Script executed:

wc -l cpp/src/grpc/server/grpc_worker_infra.cpp

Repository: NVIDIA/cuopt

Length of output: 101


🏁 Script executed:

cat -n cpp/src/grpc/server/grpc_worker_infra.cpp | sed -n '100,130p'

Repository: NVIDIA/cuopt

Length of output: 1414


🏁 Script executed:

rg "close_worker_pipes" cpp/src/grpc/server/grpc_worker_infra.cpp -A 10 -B 2

Repository: NVIDIA/cuopt

Length of output: 2137


🏁 Script executed:

rg "worker_read_fd|worker_write_fd|worker_incumbent_write_fd" cpp/src/grpc/server/grpc_worker_infra.cpp -B 3 -A 3

Repository: NVIDIA/cuopt

Length of output: 2216


🏁 Script executed:

cat -n cpp/src/grpc/server/grpc_worker_infra.cpp | sed -n '70,110p'

Repository: NVIDIA/cuopt

Length of output: 1388


🏁 Script executed:

rg "void close_worker_pipes" cpp/src/grpc/server/grpc_worker_infra.cpp -A 15

Repository: NVIDIA/cuopt

Length of output: 1032


Close child-end FDs on fork() failure path.

When fork() fails at line 120, the pipes created at line 113 via create_worker_pipes() remain partially open in the parent. The call to close_worker_pipes_server(worker_id) at line 123 closes only the parent-end descriptors (to_worker_fd, from_worker_fd, incumbent_from_worker_fd). The child-end descriptors (worker_read_fd, worker_write_fd, worker_incumbent_write_fd) remain open and are leaked, since the child process was never created to inherit them.

Add close_worker_pipes_child_ends(worker_id) to properly clean up all file descriptors on fork failure, matching the cleanup pattern already present in the successful path (line 145).

Suggested fix
  if (pid < 0) {
    std::cerr << "[Server] Failed to fork " << (is_replacement ? "replacement worker " : "worker ")
              << worker_id << "\n";
    close_worker_pipes_server(worker_id);
+   close_worker_pipes_child_ends(worker_id);
    return -1;
  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_worker_infra.cpp` around lines 120 - 124, On fork()
failure in the block handling pid < 0, call
close_worker_pipes_child_ends(worker_id) in addition to the existing
close_worker_pipes_server(worker_id) so the child-end FDs (worker_read_fd,
worker_write_fd, worker_incumbent_write_fd) are closed and not leaked; mirror
the cleanup done in the successful path (the same pattern used around the
fork-success handling) and then return -1 as before.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 13

♻️ Duplicate comments (25)
cpp/src/grpc/server/grpc_job_management.cpp (2)

147-158: ⚠️ Potential issue | 🔴 Critical

submit_job_async still uses racy check-then-write slot acquisition.

Lines [147]-[158] can let concurrent submitters claim/overwrite the same queue slot. Use atomic reservation (compare_exchange) and publish ordering.

Proposed fix
-  for (size_t i = 0; i < MAX_JOBS; ++i) {
-    if (!job_queue[i].ready && !job_queue[i].claimed) {
+  for (size_t i = 0; i < MAX_JOBS; ++i) {
+    if (job_queue[i].ready.load()) { continue; }
+    bool expected_claimed = false;
+    if (job_queue[i].claimed.compare_exchange_strong(expected_claimed, true)) {
       copy_cstr(job_queue[i].job_id, job_id);
       job_queue[i].problem_type = is_mip ? 1 : 0;
       job_queue[i].data_size    = request_data.size();
       job_queue[i].worker_pid   = 0;
       job_queue[i].worker_index = -1;
-      job_queue[i].data_sent    = false;
-      job_queue[i].claimed      = false;
-      job_queue[i].cancelled    = false;
-      job_queue[i].ready        = true;
+      job_queue[i].data_sent.store(false);
+      job_queue[i].cancelled.store(false);
+      job_queue[i].ready.store(true, std::memory_order_release);
+      job_queue[i].claimed.store(false, std::memory_order_release);

As per coding guidelines "Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_job_management.cpp` around lines 147 - 158,
submit_job_async currently uses a racy check-then-write loop over job_queue
(iterating i from 0..MAX_JOBS) checking job_queue[i].ready and
job_queue[i].claimed then writing the slot; change this to atomically reserve a
slot using an atomic CAS on a dedicated slot state (e.g., atomically
compare_exchange the claimed/empty flag for job_queue[i]) before writing any
fields, and use release semantics when publishing ready (store with
memory_order_release) and acquire semantics when consumers read
(memory_order_acquire); update submit_job_async to attempt compare_exchange on
job_queue[i].claimed (or a new atomic state) to transition from empty->reserved,
only write the job fields after successful CAS, and then set ready=true with
appropriate publish ordering so concurrent submitters cannot overwrite the same
slot.

109-113: ⚠️ Potential issue | 🟠 Major

Bound incumbent payload size before resize to prevent memory exhaustion.

Line [111] trusts a pipe-provided size and allocates directly. A malformed size prefix can force unbounded allocation.

Proposed fix
 bool recv_incumbent_pipe(int fd, std::vector<uint8_t>& data)
 {
   uint64_t size;
   if (!read_from_pipe(fd, &size, sizeof(size))) return false;
+  if (size > static_cast<uint64_t>(config.max_message_bytes)) {
+    std::cerr << "[Worker] Incumbent payload too large: " << size << "\n";
+    return false;
+  }
   data.resize(size);
   if (size > 0 && !read_from_pipe(fd, data.data(), size)) return false;
   return true;
 }

As per coding guidelines "Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_job_management.cpp` around lines 109 - 113, The code
reads an untrusted uint64_t size from a pipe and calls data.resize(size)
directly; bound the incoming size before resizing to prevent memory exhaustion
by introducing a sensible constant (e.g. JOB_PAYLOAD_MAX or MAX_PAYLOAD_SIZE)
and rejecting sizes > that limit (return false) or otherwise handling them; also
validate zero/negative semantics if needed and use that check right after
read_from_pipe(fd, &size, ...) and before data.resize(size) (referencing the
local variable size, function read_from_pipe, and the vector/data.resize call).
cpp/src/grpc/cuopt_remote.proto (1)

6-6: ⚠️ Potential issue | 🟠 Major

Buf package-directory mismatch is still unresolved.

Line [6] (package cuopt.remote;) remains in a path that violates Buf PACKAGE_DIRECTORY_MATCH for this module layout, so lint/CI will continue to fail until package/path (or Buf roots) are aligned.
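For reference, PACKAGE_DIRECTORY_MATCH requires the package to mirror the path below the Buf module root; one compliant layout (the module-root placement here is illustrative, not taken from the repo) would be:

```text
cpp/src/grpc/                    <- buf.yaml module root
└── cuopt/
    └── remote/
        └── cuopt_remote.proto   <- declares: package cuopt.remote;
```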

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/cuopt_remote.proto` at line 6, The proto package declaration in
cuopt_remote.proto ("package cuopt.remote;") doesn't match the file path,
triggering Buf's PACKAGE_DIRECTORY_MATCH rule; fix by either updating the
package to reflect the file's directory structure (e.g., change package to match
the actual path) or move cuopt_remote.proto into a directory layout matching
"cuopt/remote" (or adjust buf.yaml roots/module configuration to include the
current path). Ensure the package token "cuopt.remote" and the filesystem layout
are aligned so Buf lint passes.
cpp/src/grpc/server/grpc_service_impl.cpp (1)

71-73: ⚠️ Potential issue | 🔴 Critical

Publish ready before releasing claimed to prevent queue-slot double allocation.

At Line [71] and Line [265], releasing claimed before publishing ready re-opens a claim window for another submitter on the same slot.

Proposed fix (both locations)
-    job_queue[job_idx].claimed.store(false);
-    job_queue[job_idx].ready.store(true);
+    job_queue[job_idx].ready.store(true, std::memory_order_release);
+    job_queue[job_idx].claimed.store(false, std::memory_order_release);

As per coding guidelines "Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state".

Also applies to: 265-267

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_service_impl.cpp` around lines 71 - 73, The code
currently clears job_queue[job_idx].claimed before publishing
job_queue[job_idx].ready, which allows another submitter to claim the same slot
between those two stores; swap the operations so you set
job_queue[job_idx].ready.store(true) before
job_queue[job_idx].claimed.store(false) (and apply the same swap at the other
occurrence around lines 265-267) to ensure the slot is marked available only
after the ready flag is visible; if using explicit memory orders, use release
semantics for the store to ready to avoid the race.
SERVER_ARCHITECTURE.md (1)

15-46: ⚠️ Potential issue | 🟡 Minor

Add fence languages to satisfy markdownlint MD040

Several fenced blocks are untyped; markdownlint MD040 will keep warning/failing until those fences specify a language (e.g., text).

Proposed patch
-```
+```text
 ┌────────────────────────────────────────────────────────────────────┐
 │                        Main Server Process                          │
 ...
 └─────────────────┘  └─────────────────┘  └─────────────────┘
-```
+```

@@
-```
+```text
 Client                     Server                      Worker
 ...
-```
+```

@@
-```
+```text
 Client                     Server                      Worker
 ...
-```
+```

@@
-```
+```text
 Client                     Server                      Worker
 ...
-```
+```

@@
-```
+```text
 Client                     Worker
 ...
-```
+```

@@
-```
+```text
 ┌─────────┐  submit   ┌───────────┐  claim   ┌────────────┐
 ...
 └───────────┘          └─────────┘
-```
+```

Also applies to: 110-118, 122-136, 139-148, 154-168, 207-217

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@SERVER_ARCHITECTURE.md` around lines 15 - 46, The markdown has multiple
untyped fenced code blocks (the large ASCII "Main Server Process" diagram and
the smaller "Client / Server / Worker" and "submit / claim" pipeline diagrams)
causing markdownlint MD040 failures; update each triple-backtick fence around
these ASCII diagrams in SERVER_ARCHITECTURE.md to specify a language (use
"text") so the fences read ```text ... ```; look for the big box diagram
containing "Main Server Process", the repeated "Client                    
Server                      Worker" blocks, and the "submit   claim" pipeline
block and add the language token to each opening fence.
cpp/src/grpc/server/grpc_server_main.cpp (3)

156-168: ⚠️ Potential issue | 🔴 Critical

Don’t memset shared-memory structs that contain std::atomic

memset() on a type with std::atomic members is undefined behavior in C++. Even if you immediately .store() afterwards, the atomic object’s lifetime/rules are already violated.

Safer initialization pattern
   for (size_t i = 0; i < MAX_JOBS; ++i) {
-    memset(&job_queue[i], 0, sizeof(JobQueueEntry));
+    new (&job_queue[i]) JobQueueEntry{};
     job_queue[i].ready.store(false);
     job_queue[i].claimed.store(false);
     job_queue[i].cancelled.store(false);
     job_queue[i].worker_index.store(-1);
   }

   for (size_t i = 0; i < MAX_RESULTS; ++i) {
-    memset(&result_queue[i], 0, sizeof(ResultQueueEntry));
+    new (&result_queue[i]) ResultQueueEntry{};
     result_queue[i].ready.store(false);
     result_queue[i].retrieved.store(false);
   }

As per coding guidelines, “Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 156 - 168, The code
currently calls memset on job_queue[i] and result_queue[i] which is undefined
for types containing std::atomic; remove the memset calls and instead properly
construct/value-initialize each entry (e.g., use placement new: new
(&job_queue[i]) JobQueueEntry(); new (&result_queue[i]) ResultQueueEntry();) and
then set the atomic fields via their .store(...) calls (worker_index.store(-1),
ready.store(false), claimed.store(false), cancelled.store(false),
retrieved.store(false)); update the loops that touch job_queue and result_queue
(and references to MAX_JOBS / MAX_RESULTS) to use this safe initialization
pattern.

173-212: ⚠️ Potential issue | 🔴 Critical

Avoid std::terminate on TLS error paths; guard BuildAndStart() failure

TLS validation errors currently return after starting workers/threads (joinable thread dtors will call std::terminate). Separately, BuildAndStart() can return null and server->Wait() will crash. Both need a single cleanup/teardown path.

Minimal direction: introduce a shutdown helper and use it on all early-exit paths
   spawn_workers();

   std::thread result_thread(result_retrieval_thread);
   std::thread incumbent_thread(incumbent_retrieval_thread);
   std::thread monitor_thread(worker_monitor_thread);
   std::thread reaper_thread(session_reaper_thread);

+  auto shutdown_runtime = [&]() {
+    keep_running                 = false;
+    shm_ctrl->shutdown_requested = true;
+    result_cv.notify_all();
+    if (result_thread.joinable()) result_thread.join();
+    if (incumbent_thread.joinable()) incumbent_thread.join();
+    if (monitor_thread.joinable()) monitor_thread.join();
+    if (reaper_thread.joinable()) reaper_thread.join();
+    wait_for_workers();
+    cleanup_shared_memory();
+  };
@@
   if (config.enable_tls) {
     if (config.tls_cert_path.empty() || config.tls_key_path.empty()) {
       std::cerr << "[Server] TLS enabled but --tls-cert/--tls-key not provided\n";
-      return 1;
+      shutdown_runtime();
+      return 1;
     }
@@
       if (ssl_opts.pem_root_certs.empty()) {
         std::cerr << "[Server] --require-client-cert requires --tls-root\n";
-        return 1;
+        shutdown_runtime();
+        return 1;
       }
@@
   std::unique_ptr<Server> server(builder.BuildAndStart());
+  if (!server) {
+    std::cerr << "[gRPC Server] Failed to start server on " << server_address << "\n";
+    shutdown_runtime();
+    return 1;
+  }
@@
   server->Wait();

As per coding guidelines, “Prevent thread-unsafe use of global and static variables; use proper mutex/synchronization in server code accessing shared solver state.”

Also applies to: 234-251

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 173 - 212, Spawned
worker threads and early TLS/BuildAndStart error paths can return while threads
are still joinable (causing std::terminate) and BuildAndStart() can return null
causing a crash; introduce a single shutdown helper (e.g.,
shutdown_server_and_workers()) that stops workers, signals threads to exit,
joins result_retrieval_thread, incumbent_retrieval_thread,
worker_monitor_thread, session_reaper_thread and cleans up any resources and the
grpc::Server pointer, call this helper on all early-exit/TLS error paths and
when builder.BuildAndStart() returns nullptr before returning non-zero, and
replace ad-hoc returns with calls to this helper; additionally ensure shared
solver/global state access in server handlers uses proper mutex/synchronization
(e.g., std::mutex/lock_guard around the solver state) to avoid thread-unsafe
access.

50-80: ⚠️ Potential issue | 🟠 Major

CLI parsing should reject unknown/incomplete flags and handle std::stoi failures

Right now unknown flags and missing values are silently ignored, and std::stoi/std::stoll can throw and abort the server. Also, --chunk-timeout is documented but not parsed at all.

Example patch sketch (apply pattern to all flags)
   for (int i = 1; i < argc; i++) {
     std::string arg = argv[i];
     if (arg == "-p" || arg == "--port") {
-      if (i + 1 < argc) { config.port = std::stoi(argv[++i]); }
+      if (i + 1 >= argc) {
+        std::cerr << "ERROR: " << arg << " requires a value\n";
+        print_usage(argv[0]);
+        return 1;
+      }
+      try {
+        config.port = std::stoi(argv[++i]);
+      } catch (const std::exception&) {
+        std::cerr << "ERROR: Invalid port value for " << arg << ": '" << argv[i] << "'\n";
+        return 1;
+      }
     } else if (arg == "--max-message-bytes") {
-      if (i + 1 < argc) { config.max_message_b = std::max(4096LL, std::stoll(argv[++i])); }
+      if (i + 1 >= argc) {
+        std::cerr << "ERROR: " << arg << " requires a value\n";
+        return 1;
+      }
+      try {
+        config.max_message_b = std::max<int64_t>(4096LL, std::stoll(argv[++i]));
+      } catch (const std::exception&) {
+        std::cerr << "ERROR: Invalid value for " << arg << ": '" << argv[i] << "'\n";
+        return 1;
+      }
+    } else if (arg == "--chunk-timeout") {
+      if (i + 1 >= argc) {
+        std::cerr << "ERROR: " << arg << " requires a value\n";
+        return 1;
+      }
+      try {
+        config.chunk_timeout_seconds = std::stoi(argv[++i]);
+      } catch (const std::exception&) {
+        std::cerr << "ERROR: Invalid value for " << arg << ": '" << argv[i] << "'\n";
+        return 1;
+      }
     } else if (arg == "-h" || arg == "--help") {
       print_usage(argv[0]);
       return 0;
+    } else {
+      std::cerr << "ERROR: Unknown option '" << arg << "'\n";
+      print_usage(argv[0]);
+      return 1;
     }
   }

As per coding guidelines, “Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 50 - 80, The CLI
parsing loop in main currently silently ignores unknown flags and missing
values, and calls std::stoi/std::stoll without catching exceptions; also the
documented flag "--chunk-timeout" is not parsed. Update the argv parsing code
(the for loop handling argv[], touching config.port, config.num_workers,
config.max_message_mb, config.max_message_b, config.enable_transfer_hash,
config.enable_tls, config.tls_cert_path, config.tls_key_path,
config.tls_root_path, config.require_client, config.log_to_console,
config.verbose) to: 1) validate that flags requiring a value check (i + 1 <
argc) and on failure print_usage(argv[0]) and return a non-zero error code; 2)
wrap std::stoi/std::stoll usages in try/catch(std::exception&) to log a clear
error and exit non-zero on parse failure; 3) handle unknown flags by printing an
error + usage and exiting non-zero; and 4) add parsing for "--chunk-timeout" to
populate the appropriate config field (e.g., config.chunk_timeout) following the
same validation and exception-handling pattern. Ensure error messages include
the offending flag for easier debugging.
build.sh (1)

390-398: ⚠️ Potential issue | 🟠 Major

cuopt_grpc_server fast path still captures mixed-target builds

As written, build.sh cuopt_grpc_server cuopt will still take the server-only branch and skip the requested install target(s). That’s a surprising foot-gun for devs/CI.

Proposed patch
-    if hasArg cuopt_grpc_server && ! hasArg libcuopt && ! buildAll; then
+    if hasArg cuopt_grpc_server && ! hasArg libcuopt && ! buildAll \
+       && ! hasArg cuopt && ! hasArg cuopt_server && ! hasArg cuopt_mps_parser \
+       && ! hasArg cuopt_sh_client && ! hasArg docs && ! hasArg deb \
+       && ! hasArg libmps_parser; then
         # Build only the gRPC server (ninja resolves libcuopt as a dependency)
         cmake --build "${LIBCUOPT_BUILD_DIR}" --target cuopt_grpc_server ${VERBOSE_FLAG} ${JFLAG}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@build.sh` around lines 390 - 398, The fast-path condition that builds only
cuopt_grpc_server (the if branch using hasArg cuopt_grpc_server && ! hasArg
libcuopt && ! buildAll) incorrectly triggers when other targets (e.g., cuopt)
are also requested; change the guard so it only takes the server-only branch
when cuopt_grpc_server is the sole target requested (for example replace the
condition with a new predicate like hasOnlyArg cuopt_grpc_server or add explicit
negations for all other possible target args / check INSTALL_TARGET equals
cuopt_grpc_server), updating the if in build.sh that surrounds the cmake --build
"${LIBCUOPT_BUILD_DIR}" --target cuopt_grpc_server block so mixed-target
invocations fall through to the normal install/build branch.
cpp/src/grpc/client/test_grpc_client.cpp (1)

381-412: ⚠️ Potential issue | 🟠 Major

Harden numeric argument parsing and validate --time-limit / incumbent index

std::stod/std::stoll throw on bad input, and static_cast<int>(time_limit) can be undefined if the value is out of range. This tool should fail with a clean CLI error instead of terminating.

Proposed patch
   while (arg_idx < argc && argv[arg_idx][0] == '-') {
     std::string arg = argv[arg_idx];
@@
     } else if (arg == "--time-limit" && arg_idx + 1 < argc) {
-      time_limit = std::stod(argv[++arg_idx]);
+      const char* value = argv[++arg_idx];
+      try {
+        time_limit = std::stod(value);
+      } catch (const std::exception&) {
+        std::cerr << "ERROR: Invalid value for --time-limit: '" << value << "'\n";
+        return 1;
+      }
+      if (!std::isfinite(time_limit) || time_limit <= 0.0) {
+        std::cerr << "ERROR: --time-limit must be a finite positive number (got '" << value
+                  << "')\n";
+        return 1;
+      }
       arg_idx++;
@@
   grpc_client_config_t config;
   config.server_address  = server_address;
-  config.timeout_seconds = static_cast<int>(time_limit) + 120;
+  if (time_limit > static_cast<double>(std::numeric_limits<int>::max() - 120)) {
+    std::cerr << "ERROR: --time-limit too large (max "
+              << (std::numeric_limits<int>::max() - 120) << ")\n";
+    return 1;
+  }
+  config.timeout_seconds = static_cast<int>(time_limit) + 120;
@@
   } else if (mode == "incumbent") {
@@
     std::string job_id = argv[arg_idx++];
-    int64_t from_index = (arg_idx < argc) ? std::stoll(argv[arg_idx]) : 0;
+    int64_t from_index = 0;
+    if (arg_idx < argc) {
+      const char* value = argv[arg_idx];
+      try {
+        from_index = std::stoll(value);
+      } catch (const std::exception&) {
+        std::cerr << "ERROR: Invalid incumbent index: '" << value << "'\n";
+        return 1;
+      }
+      if (from_index < 0) {
+        std::cerr << "ERROR: incumbent index must be >= 0 (got '" << value << "')\n";
+        return 1;
+      }
+    }
     return mode_incumbent(client, job_id, from_index);
   }

Also applies to: 425-426, 531-532

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/test_grpc_client.cpp` around lines 381 - 412, The CLI
currently calls std::stod (for time_limit) and std::stoll (for the incumbent
index elsewhere) which throw on invalid input and can lead to undefined behavior
when casting large values; update the argument parsing in the main option loop
(handling --time-limit / variable time_limit) to validate input robustly: parse
numeric arguments using non-throwing parsing (e.g., std::from_chars for integers
and a safe check for floating parsing or wrap std::stod in try/catch), detect
conversion failure or out-of-range values, emit a clear error message via
std::cerr and return a non-zero exit code, and before doing
static_cast<int>(time_limit) or casting the stoll result ensure the parsed value
fits the target range to avoid undefined behavior; apply the same
validation/failure handling to the other occurrences parsing integers (the
std::stoll sites noted in the review).
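Outside the diff above, the non-throwing parsing suggested in the agent prompt can be factored into two small helpers. This is a sketch only; the helper names `parse_positive_double` and `parse_nonneg_int64` are illustrative, not part of the PR:

```cpp
#include <cassert>
#include <cerrno>
#include <charconv>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <optional>
#include <string>

// Parse a finite, strictly positive double; nullopt on any failure.
std::optional<double> parse_positive_double(const std::string& s)
{
  if (s.empty()) return std::nullopt;
  errno     = 0;
  char* end = nullptr;
  double v  = std::strtod(s.c_str(), &end);
  if (end != s.c_str() + s.size() || errno == ERANGE) return std::nullopt;
  if (!std::isfinite(v) || v <= 0.0) return std::nullopt;
  return v;
}

// Parse a non-negative int64 with std::from_chars (never throws).
std::optional<int64_t> parse_nonneg_int64(const std::string& s)
{
  int64_t v    = 0;
  auto [p, ec] = std::from_chars(s.data(), s.data() + s.size(), v);
  if (ec != std::errc{} || p != s.data() + s.size() || v < 0) return std::nullopt;
  return v;
}
```

With these, the `--time-limit` branch collapses to `if (auto t = parse_positive_double(value)) { time_limit = *t; } else { /* print error, return 1 */ }`, and the incumbent-index site uses `parse_nonneg_int64` the same way.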
cpp/src/grpc/client/solve_remote.cu (2)

164-166: ⚠️ Potential issue | 🟠 Major

Do not pass incumbent objective as best-bound in callbacks.

Lines 164–166 set bound_copy equal to objective, which reports a fake zero-gap state to callback consumers.

A safe interim behavior is to pass NaN (unknown bound) until the incumbent protocol includes a real bound field end-to-end.

🔧 Interim mitigation in this file
+#include <limits>
...
-          double bound_copy                 = objective;  // Use objective as bound for incumbent
+          double bound_copy = std::numeric_limits<double>::quiet_NaN();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/solve_remote.cu` around lines 164 - 166, The callback
currently passes the incumbent objective as the best-bound (bound_copy =
objective) to get_callback->get_solution, which falsely reports a zero-gap;
change the interim behavior to pass an unknown bound (use NaN) instead of
objective when calling get_callback->get_solution (affecting bound_copy,
get_callback->get_solution(..., &bound_copy, callback->get_user_data()) and
related solution_copy/obj_copy variables) so callbacks receive NaN until the
protocol supplies a real bound end-to-end.

55-57: ⚠️ Potential issue | 🟠 Major

Validate env sizing overrides before applying them.

Lines 55–57 still accept zero or negative values for CUOPT_CHUNK_SIZE and CUOPT_MAX_MESSAGE_BYTES, which can destabilize transfer behavior.

🔧 Proposed guardrails
-  config.chunk_size_bytes  = parse_env_int64("CUOPT_CHUNK_SIZE", config.chunk_size_bytes);
-  config.max_message_bytes = parse_env_int64("CUOPT_MAX_MESSAGE_BYTES", config.max_message_bytes);
+  auto chunk_size = parse_env_int64("CUOPT_CHUNK_SIZE", config.chunk_size_bytes);
+  auto max_msg    = parse_env_int64("CUOPT_MAX_MESSAGE_BYTES", config.max_message_bytes);
+  if (chunk_size > 0) { config.chunk_size_bytes = chunk_size; }
+  if (max_msg > 0) { config.max_message_bytes = max_msg; }

As per coding guidelines: "Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/solve_remote.cu` around lines 55 - 57, The env overrides
for CUOPT_CHUNK_SIZE and CUOPT_MAX_MESSAGE_BYTES may be zero/negative; update
the code that calls parse_env_int64 to validate the returned value before
assigning to config.chunk_size_bytes and config.max_message_bytes (e.g., call
parse_env_int64 for each into a temp variable, check temp > 0 and optionally cap
to a safe maximum, then assign to config.*; if invalid, retain the existing
default and emit a debug/warn via config.enable_debug_log or logger). Ensure you
reference parse_env_int64, config.chunk_size_bytes, config.max_message_bytes,
and the CUOPT_CHUNK_SIZE / CUOPT_MAX_MESSAGE_BYTES env names when implementing
the guard.
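A minimal sketch of the guarded override pattern described above. `parse_env_int64` is the name used in the finding; its body here, the cap parameter, and `apply_sizing_override` are assumptions for illustration:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Read an int64 env var; fall back when unset or unparsable.
int64_t parse_env_int64(const char* name, int64_t fallback)
{
  const char* raw = std::getenv(name);
  if (raw == nullptr || *raw == '\0') return fallback;
  char* end   = nullptr;
  long long v = std::strtoll(raw, &end, 10);
  if (end == raw || *end != '\0') return fallback;
  return static_cast<int64_t>(v);
}

// Accept an override only when it is positive and under a sane cap;
// otherwise keep the compiled-in default.
int64_t apply_sizing_override(int64_t current, int64_t requested, int64_t max_allowed)
{
  if (requested <= 0 || requested > max_allowed) return current;
  return requested;
}
```

The key point is that a bad value silently keeps the default (optionally with a debug log) rather than poisoning `config.chunk_size_bytes` or `config.max_message_bytes`.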
cpp/CMakeLists.txt (1)

279-294: ⚠️ Potential issue | 🟠 Major

Resolve protoc via $<TARGET_FILE:protobuf::protoc> to support in-tree protobuf targets.

Lines 279–294 still assume protobuf::protoc has an IMPORTED_LOCATION* property. That fails for non-imported targets (e.g., from FetchContent/add_subdirectory), causing false fatal errors.

🔧 Proposed CMake adjustment
 if(TARGET protobuf::protoc)
-  get_target_property(_PROTOBUF_PROTOC protobuf::protoc IMPORTED_LOCATION_RELEASE)
-  if(NOT _PROTOBUF_PROTOC)
-    get_target_property(_PROTOBUF_PROTOC protobuf::protoc IMPORTED_LOCATION)
-  endif()
+  set(_PROTOBUF_PROTOC $<TARGET_FILE:protobuf::protoc>)
 else()
   find_package(protobuf CONFIG REQUIRED)
-  get_target_property(_PROTOBUF_PROTOC protobuf::protoc IMPORTED_LOCATION_RELEASE)
-  if(NOT _PROTOBUF_PROTOC)
-    get_target_property(_PROTOBUF_PROTOC protobuf::protoc IMPORTED_LOCATION)
-  endif()
+  if(TARGET protobuf::protoc)
+    set(_PROTOBUF_PROTOC $<TARGET_FILE:protobuf::protoc>)
+  endif()
 endif()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/CMakeLists.txt` around lines 279 - 294, The current logic assumes
protobuf::protoc has IMPORTED_LOCATION properties and can fail for in-tree
targets; change the resolution to first check if the target exists and set
_PROTOBUF_PROTOC to the generator expression $<TARGET_FILE:protobuf::protoc>
when TARGET protobuf::protoc is true, otherwise fall back to
get_target_property(... IMPORTED_LOCATION_RELEASE/IMPORTED_LOCATION) or the
PROTOBUF_PROTOC_EXECUTABLE cache variable; update any add_custom_command(...)
references to use ${_PROTOBUF_PROTOC} and keep the existing fatal error check if
_PROTOBUF_PROTOC is still empty.
cpp/src/grpc/server/grpc_pipe_serialization.hpp (3)

96-100: ⚠️ Potential issue | 🔴 Critical

Validate element_offset and overflow before pointer arithmetic in result deserialization.

Lines 96–100 still allow negative offsets and unchecked offset + chunk_size arithmetic, which can underflow/overflow the bounds check and lead to an unsafe memcpy.

🔧 Proposed fix
-    int64_t offset         = ac.element_offset();
-    const auto& chunk_data = ac.data();
-    if (offset + static_cast<int64_t>(chunk_data.size()) <= static_cast<int64_t>(dest.size())) {
-      std::memcpy(dest.data() + offset, chunk_data.data(), chunk_data.size());
-    }
+    int64_t offset         = ac.element_offset();
+    const auto& chunk_data = ac.data();
+    if (offset < 0) { return false; }
+    size_t uoffset = static_cast<size_t>(offset);
+    if (uoffset > dest.size()) { return false; }
+    if (chunk_data.size() > dest.size() - uoffset) { return false; }
+    std::memcpy(dest.data() + uoffset, chunk_data.data(), chunk_data.size());

As per coding guidelines: "Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` around lines 96 - 100, The
current deserialization uses offset = ac.element_offset() and then does pointer
arithmetic without validating offset or preventing overflow/underflow before
memcpy; update the logic in grpc_pipe_serialization.hpp around the block using
offset, chunk_data and dest so you first validate element_offset() is
non-negative and within dest.size(), compute size_t chunk_size =
chunk_data.size(), and perform a safe bounds check using size_t conversions
(e.g., offset <= dest.size() && offset + chunk_size <= dest.size()) while
guarding against overflow on offset + chunk_size; only call
std::memcpy(dest.data() + offset, ...) when those checks pass and otherwise
handle the error/return early.

90-90: ⚠️ Potential issue | 🔴 Critical

Return failure on non-EOF parse errors in both deserializers.

Lines 90 and 148 still break on any parse failure and eventually return true, which can accept malformed/truncated payloads as valid.

🔧 Proposed fix
-    if (!google::protobuf::util::ParseDelimitedFromCodedStream(&ac, &coded, &clean_eof)) break;
+    if (!google::protobuf::util::ParseDelimitedFromCodedStream(&ac, &coded, &clean_eof)) {
+      if (clean_eof) { break; }  // normal end-of-stream
+      return false;              // malformed/truncated message
+    }

As per coding guidelines: "Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files."

Also applies to: 148-148

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` at line 90, The loops
calling google::protobuf::util::ParseDelimitedFromCodedStream (using variables
ac, coded, clean_eof) currently use "break" on parse failure which lets the
function return true for truncated/malformed input; change those breaks to
return false on a parse failure unless the failure was a clean EOF. Concretely,
at both call sites (around the ParseDelimitedFromCodedStream lines referenced)
check the clean_eof flag after a failed parse and return false for non-EOF
failures (only allow breaking/ending the loop when clean_eof is true).

151-163: ⚠️ Potential issue | 🟠 Major

Guard total_elements * elem_size and byte-offset math before resize/copy.

Lines 151–163 still perform unchecked multiplication and a resize driven by untrusted chunk metadata. A malformed payload can overflow the size math or trigger an excessive allocation.

🔧 Proposed fix
+    if (ac.total_elements() < 0) { return false; }
     if (dest.empty() && ac.total_elements() > 0) {
       int64_t elem_size = array_field_element_size(ac.field_id());
-      dest.resize(static_cast<size_t>(ac.total_elements() * elem_size), 0);
+      if (elem_size <= 0) { return false; }
+      uint64_t total_bytes = static_cast<uint64_t>(ac.total_elements()) *
+                             static_cast<uint64_t>(elem_size);
+      if (total_bytes > static_cast<uint64_t>(std::numeric_limits<size_t>::max())) {
+        return false;
+      }
+      dest.resize(static_cast<size_t>(total_bytes), 0);
     }

As per coding guidelines: "Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` around lines 151 - 163, The
code in the block that resizes dest and copies chunk bytes must validate sizes
to avoid integer overflow and huge allocations: before calling
dest.resize(ac.total_elements() * elem_size) and before computing byte_offset,
check that ac.total_elements() and computed elem_size are > 0 and that
static_cast<size_t>(ac.total_elements()) <= SIZE_MAX /
static_cast<size_t>(elem_size); bound the requested allocation against a safe
maximum (or return an error) to prevent resource exhaustion; verify
chunk_data.size() is <= dest.size() and that byte_offset >= 0 and byte_offset <=
dest.size() and that byte_offset + chunk_data.size() does not overflow (use
size_t or checked arithmetic) before calling std::memcpy; refer to symbols dest,
ac, array_field_element_size(), total_elements(), element_offset(), data(),
elem_size, byte_offset, resize, and std::memcpy when making these checks.
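The offset and size checks from the two findings above can be pulled into small reusable helpers. The names and shapes here are illustrative sketches, not the file's actual API:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>
#include <vector>

// Copy a chunk into dest at a byte offset; false on any invalid metadata.
bool copy_chunk_checked(std::vector<uint8_t>& dest,
                        int64_t offset,
                        const std::vector<uint8_t>& chunk)
{
  if (offset < 0) return false;
  const size_t uoffset = static_cast<size_t>(offset);
  if (uoffset > dest.size()) return false;
  if (chunk.size() > dest.size() - uoffset) return false;  // overflow-safe form
  if (chunk.empty()) return true;                          // nothing to copy
  std::memcpy(dest.data() + uoffset, chunk.data(), chunk.size());
  return true;
}

// Overflow-checked total_elements * elem_size; false on bad metadata.
bool checked_total_bytes(int64_t total_elements, int64_t elem_size, size_t& out_bytes)
{
  if (total_elements < 0 || elem_size <= 0) return false;
  if (total_elements > std::numeric_limits<int64_t>::max() / elem_size) return false;
  out_bytes = static_cast<size_t>(total_elements) * static_cast<size_t>(elem_size);
  return true;
}
```

Note the subtraction form `chunk.size() > dest.size() - uoffset`: it avoids the `offset + size` addition entirely, so there is nothing left to overflow.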
cpp/src/grpc/server/grpc_server_types.hpp (1)

300-308: ⚠️ Potential issue | 🔴 Critical

Remove async-signal-unsafe operations from signal_handler().

std::cout and condition_variable::notify_all() are not async-signal-safe and can deadlock/corrupt state when invoked from a POSIX signal handler.

Suggested minimal signal-safe handler
 inline void signal_handler(int signal)
 {
   if (signal == SIGINT || signal == SIGTERM) {
-    std::cout << "\n[gRPC Server] Received shutdown signal\n";
     keep_running = false;
     if (shm_ctrl) { shm_ctrl->shutdown_requested = true; }
-    result_cv.notify_all();
   }
 }

Then have a normal thread (or the main loop) observe keep_running == false and do logging + result_cv.notify_all() outside the signal context.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_types.hpp` around lines 300 - 308,
signal_handler currently performs async-unsafe operations (std::cout and
result_cv.notify_all()); modify it to only perform signal-safe actions: set a
sig_atomic_t/atomic<bool> flag (keep_running) and, if needed, set
shm_ctrl->shutdown_requested (ensure that shutdown_requested is signal-safe or
updated via atomic/sig_atomic_t), then return immediately—remove std::cout and
result_cv.notify_all() from signal_handler. Update the main loop or a dedicated
watcher thread to observe keep_running (or the atomic) and then perform logging
and call result_cv.notify_all() outside the signal context (use the symbols
signal_handler, keep_running, shm_ctrl, result_cv to locate changes).
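Putting the suggestion together, a self-contained sketch of the flag-plus-watcher split. The watcher loop, sleep interval, and handler name are illustrative assumptions, not the header's actual code:

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <csignal>
#include <mutex>
#include <thread>

std::atomic<bool> keep_running{true};
std::mutex result_mutex;
std::condition_variable result_cv;

// Async-signal-safe: only flips a lock-free atomic; no I/O, no CV.
extern "C" void shutdown_signal_handler(int sig)
{
  if (sig == SIGINT || sig == SIGTERM) {
    keep_running.store(false, std::memory_order_relaxed);
  }
}

// Ordinary thread: the safe place for logging and result_cv.notify_all().
void shutdown_watcher()
{
  while (keep_running.load(std::memory_order_relaxed)) {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
  }
  // std::cout logging would be safe here (normal thread context).
  result_cv.notify_all();
}
```

The handler does the absolute minimum; everything async-signal-unsafe moves into `shutdown_watcher` (or the existing main loop), which observes the flag.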
cpp/src/grpc/client/grpc_client.cu (5)

702-745: ⚠️ Potential issue | 🔴 Critical

Harden chunked-result metadata validation before division/alloc/memcpy.

element_size_bytes and total_elements are trusted without validation. elem_size <= 0 triggers a division by zero (Line 708), and total_elems * elem_size can overflow size_t before allocation, making the memcpy (Line 739) unsafe.

Proposed bounds checks
   for (const auto& arr_desc : header->arrays()) {
     auto field_id       = arr_desc.field_id();
     int64_t total_elems = arr_desc.total_elements();
     int64_t elem_size   = arr_desc.element_size_bytes();
     if (total_elems <= 0) continue;
+    if (elem_size <= 0) {
+      last_error_ = "Invalid chunk metadata: non-positive element_size_bytes";
+      return false;
+    }
+    if (total_elems > std::numeric_limits<int64_t>::max() / elem_size) {
+      last_error_ = "Invalid chunk metadata: total byte size overflow";
+      return false;
+    }
+    const int64_t total_bytes = total_elems * elem_size;
+    if (total_bytes < 0 ||
+        static_cast<uint64_t>(total_bytes) >
+          static_cast<uint64_t>(std::numeric_limits<size_t>::max())) {
+      last_error_ = "Invalid chunk metadata: byte size exceeds addressable memory";
+      return false;
+    }

     int64_t elems_per_chunk = chunk_data_budget / elem_size;
     if (elems_per_chunk <= 0) elems_per_chunk = 1;

-    std::vector<uint8_t> array_bytes(static_cast<size_t>(total_elems * elem_size));
+    std::vector<uint8_t> array_bytes(static_cast<size_t>(total_bytes));

As per coding guidelines "Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero, and use epsilon comparisons for floating-point equality checks".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 702 - 745, The code trusts
element_size_bytes() and total_elements() and must validate them before any
division, allocation, or memcpy: check that elem_size =
arr_desc.element_size_bytes() > 0 and total_elems = arr_desc.total_elements() >=
0, ensure elems_per_chunk = chunk_data_budget / elem_size is computed only after
the elem_size > 0 check, and verify that total_elems * elem_size does not
overflow size_t before constructing array_bytes; also validate that elems_wanted
and elems_received are non-negative and that data.size() equals elems_received *
elem_size and fits within array_bytes bounds before calling std::memcpy
(references: arr_desc.field_id(), arr_desc.total_elements(),
arr_desc.element_size_bytes(), elems_per_chunk, array_bytes,
chunk_resp.elements_in_chunk(), chunk_resp.data(), and the std::memcpy call).

258-266: ⚠️ Potential issue | 🔴 Critical

Guard all public RPC methods against impl_->stub being null.

check_status(), wait_for_completion(), cancel_job(), delete_job(), get_incumbents(), and stream_logs() all dereference impl_->stub directly; calling these before connect() will crash.

Proposed guard pattern
 job_status_result_t grpc_client_t::check_status(const std::string& job_id)
 {
   job_status_result_t result;
+  if (!impl_ || !impl_->stub) {
+    result.error_message = "Not connected to server";
+    return result;
+  }

   grpc::ClientContext context;
   auto request = build_status_request(job_id);
   cuopt::remote::StatusResponse response;
   auto status = impl_->stub->CheckStatus(&context, request, &response);

Also applies to: 293-303, 325-333, 354-361, 381-395, 419-430

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 258 - 266, All public RPC
methods must guard against impl_ or impl_->stub being null to avoid crashes when
called before connect(); update grpc_client_t::check_status,
wait_for_completion, cancel_job, delete_job, get_incumbents, and stream_logs to
early-return a sensible error result if impl_ == nullptr or impl_->stub ==
nullptr (e.g., set job_status_result_t to an error/failed state or populate the
method's error/status fields and avoid calling stub->*), and add a clear logged
message indicating the client is not connected; ensure you use the same guard
pattern in the other listed methods (the ones around the ranges noted:
check_status, wait_for_completion, cancel_job, delete_job, get_incumbents,
stream_logs) so no method dereferences impl_->stub without this null check.

221-252: ⚠️ Potential issue | 🔴 Critical

stop_log_streaming() can hang indefinitely (blocked reader->Read()).

Setting stop_logs_ doesn’t unblock a synchronous server-streaming Read(), so join() can deadlock if the server stops emitting logs (or the stream is otherwise idle). This is still the same failure mode previously called out.

Proposed fix (make stop cancel the active ClientContext)
 void grpc_client_t::start_log_streaming(const std::string& job_id)
 {
   if (!config_.stream_logs || !config_.log_callback) return;

   stop_logs_.store(false);
   log_thread_ = std::make_unique<std::thread>([this, job_id]() {
-    grpc::ClientContext context;
+    grpc::ClientContext context;
+    {
+      std::lock_guard<std::mutex> lk(log_context_mutex_);
+      active_log_context_ = &context;
+    }
     auto request = build_stream_logs_request(job_id, 0);
     auto reader  = impl_->stub->StreamLogs(&context, request);

     cuopt::remote::LogMessage log_msg;
     while (reader->Read(&log_msg)) {
       if (stop_logs_.load()) {
-        context.TryCancel();
         break;
       }
       if (config_.log_callback) { config_.log_callback(log_msg.line()); }
       if (log_msg.job_complete()) { break; }
     }
     reader->Finish();
+    {
+      std::lock_guard<std::mutex> lk(log_context_mutex_);
+      active_log_context_ = nullptr;
+    }
   });
 }

 void grpc_client_t::stop_log_streaming()
 {
   stop_logs_.store(true);
+  {
+    std::lock_guard<std::mutex> lk(log_context_mutex_);
+    if (active_log_context_) { active_log_context_->TryCancel(); }
+  }
   if (log_thread_ && log_thread_->joinable()) { log_thread_->join(); }
   log_thread_.reset();
 }

You’ll also need matching members on grpc_client_t (likely in cpp/src/grpc/client/grpc_client.hpp):

mutable std::mutex log_context_mutex_;
grpc::ClientContext* active_log_context_ = nullptr;

As per coding guidelines "Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 221 - 252, The
stop_log_streaming() can block because reader->Read() is synchronous; modify
start_log_streaming() and stop_log_streaming() to track and cancel the active
grpc::ClientContext: add a mutable std::mutex log_context_mutex_ and
grpc::ClientContext* active_log_context_ member, then in start_log_streaming()
create the ClientContext, lock log_context_mutex_, set active_log_context_ =
&context before calling impl_->stub->StreamLogs, unlock, and clear
active_log_context_ (under the mutex) after reader->Finish(); in
stop_log_streaming() lock log_context_mutex_, if active_log_context_ is non-null
call active_log_context_->TryCancel(), then set stop_logs_ and join/reset the
thread; keep the existing context.TryCancel() check inside the thread but ensure
all access to active_log_context_ is synchronized to avoid races.

258-266: ⚠️ Potential issue | 🟠 Major

Add deadlines to operational RPCs (current code can block forever on network stalls).

Only the connection probe sets a deadline; most other RPC calls create a grpc::ClientContext without set_deadline(). That bypasses config_.timeout_seconds and can hang indefinitely (including inside polling loops).

Proposed helper + per-RPC use
+static inline void set_context_deadline(grpc::ClientContext& ctx, int timeout_seconds)
+{
+  if (timeout_seconds <= 0) return;
+  ctx.set_deadline(std::chrono::system_clock::now() + std::chrono::seconds(timeout_seconds));
+}
+
 job_status_result_t grpc_client_t::check_status(const std::string& job_id)
 {
   job_status_result_t result;

   grpc::ClientContext context;
+  set_context_deadline(context, config_.timeout_seconds);
   auto request = build_status_request(job_id);
   cuopt::remote::StatusResponse response;
   auto status = impl_->stub->CheckStatus(&context, request, &response);

Apply similarly to WaitForCompletion, CancelJob, DeleteResult, GetIncumbents, GetResult, and all chunked upload/download RPCs.

Also applies to: 293-303, 325-333, 354-361, 387-395, 419-443, 468-470, 509-510, 540-541, 561-562, 588-589, 625-626, 674-675, 724-725, 1000-1004, 1172-1176

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 258 - 266, The RPCs in
grpc_client_t (e.g., grpc_client_t::check_status) create grpc::ClientContext
without deadlines and can block indefinitely; add a helper (e.g.,
make_client_context_with_deadline or set_deadline_on_context) that constructs a
grpc::ClientContext and calls set_deadline(std::chrono::system_clock::now() +
std::chrono::seconds(config_.timeout_seconds)) (or uses config_.timeout_seconds
as a std::chrono duration) and use that helper everywhere you create a context
for RPCs (check_status, WaitForCompletion, CancelJob, DeleteResult,
GetIncumbents, GetResult, all chunked upload/download calls, and the other
locations noted in the review) so each RPC honors config_.timeout_seconds and
cannot hang forever.

992-995: ⚠️ Potential issue | 🔴 Critical

Validate poll_interval_ms before dividing (division-by-zero).

Both LP and MIP polling paths compute max_polls by dividing by config_.poll_interval_ms without guarding against values <= 0.

Proposed validation
   } else {
     CUOPT_LOG_INFO("[grpc_client] Using polling (CheckStatus) for job %s", job_id.c_str());
+    if (config_.poll_interval_ms <= 0) {
+      stop_log_streaming();
+      result.error_message = "poll_interval_ms must be > 0";
+      return result;
+    }
     int poll_count = 0;
     int max_polls  = (config_.timeout_seconds * 1000) / config_.poll_interval_ms;

As per coding guidelines "Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero, and use epsilon comparisons for floating-point equality checks".

Also applies to: 1130-1133

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 992 - 995, Before computing
max_polls in the LP and MIP polling paths, validate config_.poll_interval_ms > 0
to avoid division-by-zero: if config_.poll_interval_ms <= 0, emit a clear error
log (e.g., CUOPT_LOG_ERROR) that includes the invalid value and either return an
error status or substitute a safe fallback (e.g., set poll_interval_ms = 1)
before computing max_polls = (config_.timeout_seconds * 1000) /
config_.poll_interval_ms; apply this check around the existing uses of
poll_count and max_polls so both the LP and MIP paths use the validated value.
cpp/src/grpc/grpc_problem_mapper.cu (1)

399-427: ⚠️ Potential issue | 🔴 Critical

Fix potential out-of-bounds writes in get_doubles/get_ints (misaligned payload sizes).

n is computed with floor division, but memcpy copies it->second.size() bytes. If the byte payload isn’t an exact multiple of the element size, this can overflow the destination vector.

Proposed fix (validate alignment and copy only validated byte count)
   auto get_doubles = [&](int32_t field_id) -> std::vector<f_t> {
     auto it = arrays.find(field_id);
     if (it == arrays.end() || it->second.empty()) return {};
+    if (it->second.size() % sizeof(double) != 0) return {};
     size_t n = it->second.size() / sizeof(double);
     if constexpr (std::is_same_v<f_t, double>) {
       std::vector<double> v(n);
-      std::memcpy(v.data(), it->second.data(), it->second.size());
+      std::memcpy(v.data(), it->second.data(), n * sizeof(double));
       return v;
     } else {
       std::vector<double> tmp(n);
-      std::memcpy(tmp.data(), it->second.data(), it->second.size());
+      std::memcpy(tmp.data(), it->second.data(), n * sizeof(double));
       return std::vector<f_t>(tmp.begin(), tmp.end());
     }
   };

   auto get_ints = [&](int32_t field_id) -> std::vector<i_t> {
     auto it = arrays.find(field_id);
     if (it == arrays.end() || it->second.empty()) return {};
+    if (it->second.size() % sizeof(int32_t) != 0) return {};
     size_t n = it->second.size() / sizeof(int32_t);
     if constexpr (std::is_same_v<i_t, int32_t>) {
       std::vector<int32_t> v(n);
-      std::memcpy(v.data(), it->second.data(), it->second.size());
+      std::memcpy(v.data(), it->second.data(), n * sizeof(int32_t));
       return v;
     } else {
       std::vector<int32_t> tmp(n);
-      std::memcpy(tmp.data(), it->second.data(), it->second.size());
+      std::memcpy(tmp.data(), it->second.data(), n * sizeof(int32_t));
       return std::vector<i_t>(tmp.begin(), tmp.end());
     }
   };

As per coding guidelines "Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero, and use epsilon comparisons for floating-point equality checks".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_problem_mapper.cu` around lines 399 - 427, The lambdas
get_doubles and get_ints compute element count n via floor division but then
memcpy the full it->second.size() bytes, risking out-of-bounds when payload size
isn't a multiple of element size; fix by validating alignment (check
it->second.size() % sizeof(double) / sizeof(int32_t) == 0) and compute a safe
byte_count = n * sizeof(element_type), resize the destination vector to n, then
memcpy only byte_count (or return an error/empty vector if misaligned); update
both get_doubles (use sizeof(double) and f_t conversions) and get_ints (use
sizeof(int32_t) and i_t conversions) accordingly so copying never exceeds the
destination buffer.
cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp (1)

213-218: ⚠️ Potential issue | 🟠 Major

Synchronize server-log state reads/writes consistently (data race).

server_log_path_, server_log_start_pos_, and test_start_marked_ are accessed without a common lock in set_server_log_path(), get_server_logs(), dump_logs(), and the accessors. This can race if tests poll logs while another thread resets paths/markers.

Proposed approach (copy shared state under lock, then do I/O unlocked)
 void set_server_log_path(const std::string& path)
 {
-  server_log_path_      = path;
-  server_log_start_pos_ = 0;
-  test_start_marked_    = false;
+  std::lock_guard<std::mutex> lock(mutex_);
+  server_log_path_      = path;
+  server_log_start_pos_ = 0;
+  test_start_marked_    = false;
 }

 std::string get_server_logs(bool since_test_start = true) const
 {
-  if (server_log_path_.empty()) { return ""; }
+  std::string path;
+  std::streampos start_pos = 0;
+  bool marked              = false;
+  {
+    std::lock_guard<std::mutex> lock(mutex_);
+    path      = server_log_path_;
+    start_pos = server_log_start_pos_;
+    marked    = test_start_marked_;
+  }
+
+  if (path.empty()) { return ""; }

-  std::ifstream file(server_log_path_);
+  std::ifstream file(path);
   if (!file.is_open()) { return ""; }

-  if (since_test_start && test_start_marked_ && server_log_start_pos_ > 0) {
-    file.seekg(server_log_start_pos_);
+  if (since_test_start && marked && start_pos > 0) {
+    file.seekg(start_pos);
   }

Consider also making server_log_path() return a copy (std::string) instead of a reference if you need it to be thread-safe.

As per coding guidelines "Prevent thread-unsafe use of global and static variables; use proper mutex/synchronization in server code accessing shared solver state".

Also applies to: 229-244, 332-345, 350-356

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp` around lines 213
- 218, There is a data race on server_log_path_, server_log_start_pos_, and
test_start_marked_: protect all reads/writes with a single mutex (e.g., add
std::mutex server_log_mutex_) and update set_server_log_path(),
get_server_logs(), dump_logs(), and the accessors to acquire the lock when
accessing those members; for I/O work, copy the guarded state (make
server_log_path() return a std::string copy rather than a const ref) while
holding the lock then release and perform file reading/writing unlocked to avoid
long holds; ensure all places that currently read the fields directly use the
mutex-protected copy to eliminate races.
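The snapshot-then-I/O shape recommended above can be sketched as a tiny class. The class name, members, and return values here are illustrative, not the test fixture's real API:

```cpp
#include <cassert>
#include <mutex>
#include <string>

class log_state_t {
 public:
  void set_server_log_path(const std::string& path)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    server_log_path_      = path;
    server_log_start_pos_ = 0;
  }

  // Returns a copy, never a reference into guarded state.
  std::string server_log_path() const
  {
    std::lock_guard<std::mutex> lock(mutex_);
    return server_log_path_;
  }

  std::string get_server_logs() const
  {
    std::string path;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      path = server_log_path_;  // snapshot shared state under the lock
    }
    if (path.empty()) return "";
    // File I/O happens here, after the lock is released, so slow reads
    // never block writers.
    return "<logs from " + path + ">";
  }

 private:
  mutable std::mutex mutex_;
  std::string server_log_path_;
  long server_log_start_pos_ = 0;
};
```

Copying the guarded fields first keeps the critical section tiny and makes the subsequent file read race-free by construction.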
cpp/src/grpc/grpc_solution_mapper.cu (1)

512-535: ⚠️ Potential issue | 🔴 Critical

Harden bytes_to_typed() against misaligned byte payloads (potential OOB write).

All branches compute n with floor division and then copy raw.size() bytes. If raw.size() is not an exact multiple of the expected element size, this overflows tmp/v.

Proposed fix
   const auto& raw = it->second;
   if constexpr (std::is_same_v<T, float>) {
+    if (raw.size() % sizeof(double) != 0) return {};
     size_t n = raw.size() / sizeof(double);
     std::vector<double> tmp(n);
-    std::memcpy(tmp.data(), raw.data(), raw.size());
+    std::memcpy(tmp.data(), raw.data(), n * sizeof(double));
     return std::vector<T>(tmp.begin(), tmp.end());
   } else if constexpr (std::is_same_v<T, double>) {
+    if (raw.size() % sizeof(double) != 0) return {};
     size_t n = raw.size() / sizeof(double);
     std::vector<double> v(n);
-    std::memcpy(v.data(), raw.data(), raw.size());
+    std::memcpy(v.data(), raw.data(), n * sizeof(double));
     return v;
   } else {
+    if (raw.size() % sizeof(T) != 0) return {};
     size_t n = raw.size() / sizeof(T);
     std::vector<T> v(n);
-    std::memcpy(v.data(), raw.data(), raw.size());
+    std::memcpy(v.data(), raw.data(), n * sizeof(T));
     return v;
   }

As per coding guidelines "Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero, and use epsilon comparisons for floating-point equality checks".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_solution_mapper.cu` around lines 512 - 535, The
bytes_to_typed() implementation can write out-of-bounds because it computes n =
raw.size()/sizeof(...) but then memcpy's raw.size() bytes; fix by always using
sizeof(T) (not sizeof(double) in the float branch), compute size_t n =
raw.size() / sizeof(T), and memcpy only n * sizeof(T) bytes into the destination
buffers (and return the n-length vector); if raw.size() % sizeof(T) != 0,
silently ignore trailing bytes or log a warning—ensure this same change is
applied in the float branch (tmp) and the double/other branches (v) so all
copies use n * sizeof(T) and use sizeof(T) to compute n.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 53823cec-ccad-4fbb-b90f-ec216b29d2e2

📥 Commits

Reviewing files that changed from the base of the PR and between d228718 and e46028c.

📒 Files selected for processing (39)
  • GRPC_ARCHITECTURE.md
  • SERVER_ARCHITECTURE.md
  • build.sh
  • conda/recipes/libcuopt/recipe.yaml
  • cpp/CMakeLists.txt
  • cpp/cuopt_grpc_server.cpp
  • cpp/src/grpc/client/grpc_client.cu
  • cpp/src/grpc/client/grpc_client.hpp
  • cpp/src/grpc/client/solve_remote.cu
  • cpp/src/grpc/client/test_grpc_client.cpp
  • cpp/src/grpc/cuopt_remote.proto
  • cpp/src/grpc/cuopt_remote_service.proto
  • cpp/src/grpc/grpc_problem_mapper.cu
  • cpp/src/grpc/grpc_problem_mapper.hpp
  • cpp/src/grpc/grpc_service_mapper.cu
  • cpp/src/grpc/grpc_service_mapper.hpp
  • cpp/src/grpc/grpc_settings_mapper.cu
  • cpp/src/grpc/grpc_settings_mapper.hpp
  • cpp/src/grpc/grpc_solution_mapper.cu
  • cpp/src/grpc/grpc_solution_mapper.hpp
  • cpp/src/grpc/server/grpc_field_element_size.hpp
  • cpp/src/grpc/server/grpc_incumbent_proto.hpp
  • cpp/src/grpc/server/grpc_job_management.cpp
  • cpp/src/grpc/server/grpc_pipe_serialization.hpp
  • cpp/src/grpc/server/grpc_server_main.cpp
  • cpp/src/grpc/server/grpc_server_threads.cpp
  • cpp/src/grpc/server/grpc_server_types.hpp
  • cpp/src/grpc/server/grpc_service_impl.cpp
  • cpp/src/grpc/server/grpc_worker.cpp
  • cpp/src/grpc/server/grpc_worker_infra.cpp
  • cpp/src/pdlp/CMakeLists.txt
  • cpp/src/pdlp/solve_remote.cu
  • cpp/tests/linear_programming/CMakeLists.txt
  • cpp/tests/linear_programming/grpc/CMakeLists.txt
  • cpp/tests/linear_programming/grpc/grpc_client_test.cpp
  • cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp
  • cpp/tests/linear_programming/grpc/grpc_integration_test.cpp
  • cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp
  • python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
💤 Files with no reviewable changes (2)
  • cpp/src/pdlp/solve_remote.cu
  • cpp/src/pdlp/CMakeLists.txt
✅ Files skipped from review due to trivial changes (1)
  • cpp/tests/linear_programming/CMakeLists.txt
🚧 Files skipped from review as they are similar to previous changes (4)
  • GRPC_ARCHITECTURE.md
  • cpp/src/grpc/server/grpc_worker_infra.cpp
  • python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
  • cpp/src/grpc/grpc_settings_mapper.cu

Comment on lines +53 to +99
struct grpc_client_config_t {
std::string server_address = "localhost:8765";
int poll_interval_ms = 1000; // How often to poll for status (when use_wait=false)
int timeout_seconds = 3600; // Max time to wait for job completion (1 hour)
bool use_wait = false; // Use WaitForCompletion RPC instead of polling CheckStatus
bool stream_logs = false; // Whether to stream logs from server
std::function<void(const std::string&)> log_callback = nullptr; // Called for each log line

// Incumbent callback for MIP solves — invoked each time the server finds a
// new best-feasible (incumbent) solution. Parameters: index, objective value,
// solution vector. Return true to continue solving, or false to request
// early termination (e.g. the objective is good enough for the caller's
// purposes). Only supported in polling mode (use_wait=false).
std::function<bool(int64_t index, double objective, const std::vector<double>& solution)>
incumbent_callback = nullptr;
int incumbent_poll_interval_ms = 1000; // How often to poll for new incumbents

// TLS configuration
bool enable_tls = false;
std::string tls_root_certs; // PEM-encoded root CA certificates (for verifying server)
std::string tls_client_cert; // PEM-encoded client certificate (for mTLS)
std::string tls_client_key; // PEM-encoded client private key (for mTLS)

// gRPC max message size (used for unary SubmitJob and result download decisions).
int64_t max_message_bytes = 256LL * 1024 * 1024; // 256 MiB

// Chunk size for chunked array upload and chunked result download.
int64_t chunk_size_bytes = 16LL * 1024 * 1024; // 16 MiB

// --- Test / debug options (not intended for production use) -----------------

// Receives internal client debug messages (for test verification).
std::function<void(const std::string&)> debug_log_callback = nullptr;

// Enable debug / throughput logging to stderr.
// Controlled by CUOPT_GRPC_DEBUG env var (0|1). Default: off.
bool enable_debug_log = false;

// Log FNV-1a hashes of uploaded/downloaded data on both client and server.
// Comparing the two hashes confirms data was not corrupted in transit.
bool enable_transfer_hash = false;

// Override for the chunked upload threshold (bytes). Normally computed
// automatically as 75% of max_message_bytes. Set to 0 to force chunked
// upload for all problems, or a positive value to override. -1 = auto.
int64_t chunked_array_threshold_bytes = -1;
};

⚠️ Potential issue | 🟡 Minor

Document/validate config invariants whose violation can crash the client (e.g., poll_interval_ms must be > 0).

The implementation currently divides by poll_interval_ms and uses timeout_seconds for deadlines; it would help to either:

  • validate/clamp these values in grpc_client_t constructors, or
  • explicitly document “must be > 0” constraints here in the config struct.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.hpp` around lines 53 - 99, The config struct
grpc_client_config_t contains fields that must satisfy invariants (e.g.,
poll_interval_ms, incumbent_poll_interval_ms > 0; timeout_seconds > 0;
chunk_size_bytes > 0; max_message_bytes > 0; chunked_array_threshold_bytes >=
-1) but are not enforced; update grpc_client_t constructors to validate and
either clamp or reject invalid values: check each field (poll_interval_ms,
incumbent_poll_interval_ms, timeout_seconds, chunk_size_bytes,
max_message_bytes, and chunked_array_threshold_bytes) and if out of range either
set a safe default/clamp and emit a message via config.log_callback or
debug_log_callback or throw an std::invalid_argument with a clear message; also
document the enforced invariants in grpc_client_config_t comments so callers
know the constraints.
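The clamping option can be sketched in isolation. In the block below, client_config and validated() are illustrative stand-ins for grpc_client_config_t and a constructor-side check, and the chosen minimums are assumptions, not values taken from the library:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical config mirror; the clamp bounds below are illustrative.
struct client_config {
  int poll_interval_ms           = 1000;
  int incumbent_poll_interval_ms = 1000;
  int timeout_seconds            = 3600;
  std::int64_t max_message_bytes = 256LL * 1024 * 1024;
  std::int64_t chunk_size_bytes  = 16LL * 1024 * 1024;
  std::int64_t chunked_array_threshold_bytes = -1;  // -1 = auto
};

// Clamp every field the client later divides by or uses as a deadline,
// so a zero or negative value can never reach the polling/deadline math.
client_config validated(client_config c)
{
  c.poll_interval_ms           = std::max(c.poll_interval_ms, 1);
  c.incumbent_poll_interval_ms = std::max(c.incumbent_poll_interval_ms, 1);
  c.timeout_seconds            = std::max(c.timeout_seconds, 1);
  c.max_message_bytes          = std::max<std::int64_t>(c.max_message_bytes, 1);
  c.chunk_size_bytes           = std::max<std::int64_t>(c.chunk_size_bytes, 1);
  // Anything below -1 is treated as "auto" rather than rejected.
  c.chunked_array_threshold_bytes =
    std::max<std::int64_t>(c.chunked_array_threshold_bytes, -1);
  return c;
}
```

The alternative (throwing std::invalid_argument instead of clamping) is stricter but changes the constructor's failure mode for existing callers.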

Comment on lines +174 to +191
// Constraint bounds (prefer lower/upper bounds if available)
if (pb_problem.constraint_lower_bounds_size() > 0) {
std::vector<f_t> con_lb(pb_problem.constraint_lower_bounds().begin(),
pb_problem.constraint_lower_bounds().end());
std::vector<f_t> con_ub(pb_problem.constraint_upper_bounds().begin(),
pb_problem.constraint_upper_bounds().end());
cpu_problem.set_constraint_lower_bounds(con_lb.data(), static_cast<i_t>(con_lb.size()));
cpu_problem.set_constraint_upper_bounds(con_ub.data(), static_cast<i_t>(con_ub.size()));
} else if (pb_problem.b_size() > 0) {
// Use b (RHS) + row_types format
std::vector<f_t> b(pb_problem.b().begin(), pb_problem.b().end());
cpu_problem.set_constraint_bounds(b.data(), static_cast<i_t>(b.size()));

if (!pb_problem.row_types().empty()) {
const std::string& row_types_str = pb_problem.row_types();
cpu_problem.set_row_types(row_types_str.data(), static_cast<i_t>(row_types_str.size()));
}
}

⚠️ Potential issue | 🟡 Minor

Constraint bounds decoding should validate that both LB and UB are present (and that their sizes match).

The check constraint_lower_bounds_size() > 0 implicitly assumes that constraint_upper_bounds() exists and matches in length; otherwise you can end up setting mismatched arrays with no explicit error.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_problem_mapper.cu` around lines 174 - 191, The code assumes
constraint_lower_bounds_size() > 0 implies matching upper bounds; update the
condition in the constraint decoding block to require both
pb_problem.constraint_lower_bounds_size() > 0 and
pb_problem.constraint_upper_bounds_size() > 0 and verify their sizes match
before calling cpu_problem.set_constraint_lower_bounds(...) and
cpu_problem.set_constraint_upper_bounds(...); if sizes differ or upper bounds
are missing, either fall back to the pb_problem.b()/row_types() branch or
return/log an explicit error (so you don’t call the setters with mismatched
arrays).
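One way to make the fallback explicit is to decode both arrays through a single helper that rejects mismatches. decode_constraint_bounds() below is a hypothetical sketch of that shape, not the mapper's actual API:

```cpp
#include <optional>
#include <utility>
#include <vector>

// Accept the lower/upper-bounds form only when both arrays are present and
// the same length; std::nullopt signals "fall back to b() + row_types()".
template <typename f_t>
std::optional<std::pair<std::vector<f_t>, std::vector<f_t>>>
decode_constraint_bounds(const std::vector<f_t>& lb, const std::vector<f_t>& ub)
{
  if (lb.empty() || ub.empty()) return std::nullopt;  // incomplete pair
  if (lb.size() != ub.size()) return std::nullopt;    // mismatched payload
  return std::make_pair(lb, ub);
}
```

A caller can then invoke the two setters only when the helper returns a value, and otherwise take the b()/row_types() branch or report an error.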

Comment on lines +435 to +448
auto get_string_list = [&](int32_t field_id) -> std::vector<std::string> {
auto it = arrays.find(field_id);
if (it == arrays.end() || it->second.empty()) return {};
std::vector<std::string> names;
const char* s = reinterpret_cast<const char*>(it->second.data());
const char* s_end = s + it->second.size();
while (s < s_end) {
const char* nul = static_cast<const char*>(std::memchr(s, '\0', s_end - s));
if (!nul) nul = s_end;
names.emplace_back(s, nul);
s = nul + 1;
}
return names;
};

⚠️ Potential issue | 🟠 Major

Avoid UB in get_string_list when the blob is missing a trailing NUL.

If memchr doesn’t find '\0', the code sets nul = s_end and then advances s = nul + 1, which forms a pointer beyond one-past-end. Even if producers always NUL-terminate, this is server-side parsing of untrusted payload bytes.

Proposed fix
     while (s < s_end) {
       const char* nul = static_cast<const char*>(std::memchr(s, '\0', s_end - s));
       if (!nul) nul = s_end;
       names.emplace_back(s, nul);
-      s = nul + 1;
+      if (nul == s_end) break;
+      s = nul + 1;
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_problem_mapper.cu` around lines 435 - 448, get_string_list
currently advances s = nul + 1 when memchr returns nullptr, producing a pointer
one-past-end; instead, when memchr returns nullptr (nul == s_end) append the
final string using the length (s_end - s) and then break the loop rather than
doing s = nul + 1. Update the lambda (get_string_list) to compute the substring
length from s to nul (or s_end) and stop the loop when nul == s_end to avoid
forming pointers past the buffer while still handling a missing trailing NUL.

Comment on lines +23 to +55
// Append chunk requests for a single typed array.
template <typename T>
void chunk_typed_array(std::vector<cuopt::remote::SendArrayChunkRequest>& out,
cuopt::remote::ArrayFieldId field_id,
const std::vector<T>& data,
const std::string& upload_id,
int64_t chunk_data_budget)
{
if (data.empty()) return;

const int64_t elem_size = static_cast<int64_t>(sizeof(T));
const int64_t total_elements = static_cast<int64_t>(data.size());

int64_t elems_per_chunk = chunk_data_budget / elem_size;
if (elems_per_chunk <= 0) elems_per_chunk = 1;

const auto* raw = reinterpret_cast<const uint8_t*>(data.data());

for (int64_t offset = 0; offset < total_elements; offset += elems_per_chunk) {
int64_t count = std::min(elems_per_chunk, total_elements - offset);
int64_t byte_offset = offset * elem_size;
int64_t byte_count = count * elem_size;

cuopt::remote::SendArrayChunkRequest req;
req.set_upload_id(upload_id);
auto* ac = req.mutable_chunk();
ac->set_field_id(field_id);
ac->set_element_offset(offset);
ac->set_total_elements(total_elements);
ac->set_data(raw + byte_offset, byte_count);
out.push_back(std::move(req));
}
}

⚠️ Potential issue | 🔴 Critical

Potential float-path corruption: chunked upload sends f_t bytes, but server reconstruction looks double-based.

chunk_typed_array() writes raw sizeof(T) bytes. In build_array_chunk_requests(), all floating arrays (A_VALUES, C, bounds, etc.) use T = f_t. If f_t == float, this sends float bytes; however, map_chunked_arrays_to_problem() currently decodes floating arrays as double bytes (it divides by sizeof(double) and memcpys into double buffers). That combination would truncate element counts and could overflow copies (and/or silently corrupt values).

I’d strongly suggest verifying the intended chunked wire format:

  • If wire format is always double for floats, convert std::vector<float> → chunked double bytes before calling chunk_typed_array (ideally in a streaming way to avoid doubling peak memory).
  • If wire format is native element size, then server-side chunked decode should use element-size metadata (e.g., element_size_bytes) instead of hard-coding sizeof(double).
#!/bin/bash
set -euo pipefail

# 1) Inspect the proto schema: does ArrayChunk carry element_size_bytes?
rg -n "message\s+ArrayChunk|element_size_bytes|total_elements|field_id|element_offset" cpp/src/grpc -S

# 2) Find where SendArrayChunk is decoded/validated on server
rg -n "SendArrayChunk|StartChunkedUpload|FinishChunkedUpload|FIELD_A_VALUES|FIELD_C" cpp/src/grpc -S

# 3) Confirm how chunked arrays are decoded into cpu_optimization_problem_t
rg -n "map_chunked_arrays_to_problem|sizeof\(double\)|FIELD_A_VALUES" cpp/src/grpc -S

As per coding guidelines "Validate algorithm correctness in optimization logic: simplex pivots, branch-and-bound decisions, routing heuristics, and constraint/objective handling must produce correct results".

Also applies to: 121-173

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_service_mapper.cu` around lines 23 - 55, chunk_typed_array
currently writes raw sizeof(T) bytes (used by build_array_chunk_requests with T
= f_t), but map_chunked_arrays_to_problem decodes floats as sizeof(double) which
can corrupt/overflow; fix by making the wire-format consistent: either convert
float arrays to double bytes before calling chunk_typed_array (perform
conversion in build_array_chunk_requests or a streaming converter so you don't
double memory) or change the server-side decoder in
map_chunked_arrays_to_problem to use the element size of the incoming
SendArrayChunkRequest/ArrayChunk (use
element_count/element_offset/total_elements and element_size_bytes or add one)
and memcpy/interpret using that element size; update tests to cover f_t=float
and f_t=double roundtrip and ensure SendArrayChunkRequest uses/accepts the
chosen element-size convention.
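The first option (an always-double wire format with streaming conversion) can be sketched as follows; chunk_floats_as_double_bytes() and its emit callback are hypothetical stand-ins for the SendArrayChunkRequest builder, and only the widening/chunking pattern is the point:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <vector>

// Widen float elements to the double wire format one chunk at a time, so the
// peak extra memory is one chunk's scratch buffer, not a full duplicate of
// the array. emit(offset, bytes, byte_count) stands in for building and
// appending one chunk request.
template <typename Emit>
void chunk_floats_as_double_bytes(const std::vector<float>& data,
                                  std::int64_t chunk_data_budget,
                                  Emit emit)
{
  std::int64_t elems_per_chunk =
    chunk_data_budget / static_cast<std::int64_t>(sizeof(double));
  if (elems_per_chunk <= 0) elems_per_chunk = 1;

  std::vector<double> scratch;
  const std::int64_t total = static_cast<std::int64_t>(data.size());
  for (std::int64_t offset = 0; offset < total; offset += elems_per_chunk) {
    const std::int64_t count = std::min(elems_per_chunk, total - offset);
    // Range-assign performs the float -> double widening element by element.
    scratch.assign(data.begin() + offset, data.begin() + offset + count);
    emit(offset,
         reinterpret_cast<const std::uint8_t*>(scratch.data()),
         static_cast<std::size_t>(count) * sizeof(double));
  }
}
```

The second option (native element size plus element_size_bytes metadata in ArrayChunk) avoids the conversion cost entirely but requires a proto change on both ends.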

Comment on lines +30 to +32
std::vector<uint8_t> buffer(msg.ByteSizeLong());
msg.SerializeToArray(buffer.data(), static_cast<int>(buffer.size()));
return buffer;

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -name "grpc_incumbent_proto.hpp" -type f

Repository: NVIDIA/cuopt

Length of output: 102


🏁 Script executed:

cat -n ./cpp/src/grpc/server/grpc_incumbent_proto.hpp

Repository: NVIDIA/cuopt

Length of output: 2203


🏁 Script executed:

rg "SerializeToArray" --type hpp --type cpp -B 2 -A 2

Repository: NVIDIA/cuopt

Length of output: 83


🏁 Script executed:

rg "SerializeToArray" -B 2 -A 2

Repository: NVIDIA/cuopt

Length of output: 1552


🏁 Script executed:

rg "SerializeToArray" cpp/cuopt_grpc_server.cpp -B 5 -A 5

Repository: NVIDIA/cuopt

Length of output: 842


Check SerializeToArray return value and validate size narrowing.

Line 31 silently ignores the SerializeToArray return value and narrows ByteSizeLong() (which returns size_t) to int without validation. A failed serialization yields an empty or partially written buffer that is indistinguishable from success, potentially causing silent data loss, and the narrowing could truncate sizes for payloads larger than INT_MAX. Other parts of the codebase (e.g., cuopt_grpc_server.cpp) properly check this return value; apply the same error handling pattern here.

Proposed fix
+#include <limits>
@@
-  std::vector<uint8_t> buffer(msg.ByteSizeLong());
-  msg.SerializeToArray(buffer.data(), static_cast<int>(buffer.size()));
+  const auto size = msg.ByteSizeLong();
+  if (size > static_cast<size_t>(std::numeric_limits<int>::max())) { return {}; }
+  std::vector<uint8_t> buffer(size);
+  if (!msg.SerializeToArray(buffer.data(), static_cast<int>(buffer.size()))) { return {}; }
   return buffer;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_incumbent_proto.hpp` around lines 30 - 32, The
current serialize path allocates buffer using msg.ByteSizeLong(), then calls
msg.SerializeToArray(buffer.data(), static_cast<int>(buffer.size())) without
checking the int narrowing or the return value; update the code around
msg.SerializeToArray to first verify that buffer.size() fits into int (e.g.,
compare against std::numeric_limits<int>::max()) before narrowing, then capture
and check the boolean return of msg.SerializeToArray and handle failures
(log/error/throw) similar to cuopt_grpc_server.cpp; ensure the function that
returns the std::vector<uint8_t> (refer to variables buffer and the call
msg.SerializeToArray) does not return an empty/invalid buffer on serialize
failure.

Comment on lines +12 to +54
class IncumbentPipeCallback : public cuopt::internals::get_solution_callback_t {
public:
IncumbentPipeCallback(std::string job_id, int fd, size_t num_vars, bool is_float)
: job_id_(std::move(job_id)), fd_(fd)
{
n_variables = num_vars;
isFloat = is_float;
}

void get_solution(void* data,
void* objective_value,
void* /*solution_bound*/,
void* /*user_data*/) override
{
if (fd_ < 0 || n_variables == 0) { return; }

double objective = 0.0;
std::vector<double> assignment;
assignment.resize(n_variables);

if (isFloat) {
const float* float_data = static_cast<const float*>(data);
for (size_t i = 0; i < n_variables; ++i) {
assignment[i] = static_cast<double>(float_data[i]);
}
objective = static_cast<double>(*static_cast<const float*>(objective_value));
} else {
const double* double_data = static_cast<const double*>(data);
std::copy(double_data, double_data + n_variables, assignment.begin());
objective = *static_cast<const double*>(objective_value);
}

auto buffer = build_incumbent_proto(job_id_, objective, assignment);
std::cout << "[Worker] Incumbent callback job_id=" << job_id_ << " obj=" << objective
<< " vars=" << assignment.size() << "\n";
std::cout.flush();
send_incumbent_pipe(fd_, buffer);
}

private:
std::string job_id_;
int fd_;
};

⚠️ Potential issue | 🟡 Minor

Incumbent callback: validate the pipe-send result and gate logging behind configured verbosity levels

The send_incumbent_pipe() return value must be checked—a broken pipe (e.g., client disconnection, EPIPE) will silently fail to deliver the incumbent solution, leaving the client without results. Consider: storing the pipe status state in the callback and disabling further sends on failure, or propagating the error to fail the job.

Additionally, the unconditional std::cout logging of every incumbent produces unbounded output when optimizers report many solutions (typical in combinatorial problems). Gate this logging behind a configuration flag (e.g., config.verbose or job.log_level) to prevent production logs from becoming unusable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_worker.cpp` around lines 12 - 54, The incumbent
callback currently ignores send_incumbent_pipe() errors and always logs; update
IncumbentPipeCallback (constructor and get_solution) to accept a
verbosity/logging flag (e.g., bool verbose or int log_level) and to check the
return value of send_incumbent_pipe(fd_, buffer) on each call; on error, record
the pipe as closed (set fd_ = -1 or a boolean failed_ member) to stop future
sends and optionally propagate/fail the job, and wrap the std::cout logging
behind the new verbosity flag so it only prints when verbose is enabled;
reference get_solution, send_incumbent_pipe, job_id_, fd_, build_incumbent_proto
and add a member like verbose_ or failed_ to gate behavior.
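The failure-gating half of this suggestion can be sketched with a minimal stand-in. incumbent_sink and send_fn below are hypothetical, with send_fn playing the role of send_incumbent_pipe(); the verbosity flag shows where the gated log line would go:

```cpp
#include <cstdint>
#include <vector>

// After the first failed send, mark the pipe dead and stop trying, so a
// broken pipe (EPIPE after client disconnect) does not spam failed writes.
class incumbent_sink {
 public:
  incumbent_sink(bool verbose, bool (*send_fn)(const std::vector<std::uint8_t>&))
    : verbose_(verbose), send_(send_fn)
  {
  }

  void on_incumbent(const std::vector<std::uint8_t>& buffer)
  {
    if (failed_) return;  // pipe already broken: drop without retrying
    if (verbose_) {
      // Gated log line would go here instead of an unconditional std::cout.
    }
    if (!send_(buffer)) failed_ = true;  // remember the failure for next time
    ++attempts_;
  }

  int attempts() const { return attempts_; }
  bool failed() const { return failed_; }

 private:
  bool verbose_;
  bool failed_ = false;
  int attempts_ = 0;
  bool (*send_)(const std::vector<std::uint8_t>&);
};
```

Whether a dead pipe should additionally fail the whole job (rather than just silence further incumbents) is a policy choice left to the server.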

Comment on lines +56 to +74
static void store_simple_result(const std::string& job_id,
int worker_id,
int status,
const char* error_message)
{
for (size_t i = 0; i < MAX_RESULTS; ++i) {
if (!result_queue[i].ready) {
copy_cstr(result_queue[i].job_id, job_id);
result_queue[i].status = status;
result_queue[i].data_size = 0;
result_queue[i].worker_index = worker_id;
copy_cstr(result_queue[i].error_message, error_message);
result_queue[i].error_message[sizeof(result_queue[i].error_message) - 1] = '\0';
result_queue[i].retrieved = false;
result_queue[i].ready = true;
break;
}
}
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, locate the grpc_worker.cpp file
find . -name "grpc_worker.cpp" -type f

Repository: NVIDIA/cuopt

Length of output: 93


🏁 Script executed:

# Check file size and read the relevant sections
wc -l cpp/src/grpc/server/grpc_worker.cpp

Repository: NVIDIA/cuopt

Length of output: 95


🏁 Script executed:

# Read lines around 56-74 to verify the snippet
sed -n '45,85p' cpp/src/grpc/server/grpc_worker.cpp | cat -n

Repository: NVIDIA/cuopt

Length of output: 1743


🏁 Script executed:

# Read lines around 341-363 mentioned in "Also applies to"
sed -n '330,375p' cpp/src/grpc/server/grpc_worker.cpp | cat -n

Repository: NVIDIA/cuopt

Length of output: 2506


🏁 Script executed:

# Search for where store_simple_result is called
rg -n "store_simple_result" cpp/src/grpc/server/grpc_worker.cpp

Repository: NVIDIA/cuopt

Length of output: 271


🏁 Script executed:

# Search for result_queue definition and synchronization
rg -n "result_queue|result_queue\[" cpp/src/grpc/server/grpc_worker.cpp -B 2 -A 2

Repository: NVIDIA/cuopt

Length of output: 1454


🏁 Script executed:

# Look for mutex/lock declarations and usage related to result_queue
rg -n "mutex|lock|atomic|critical|synchronized" cpp/src/grpc/server/grpc_worker.cpp -i

Repository: NVIDIA/cuopt

Length of output: 343


🏁 Script executed:

# Look for result_queue declaration and type
rg -n "result_queue" cpp/src/grpc/server/grpc_worker.cpp -B 5 | head -40

Repository: NVIDIA/cuopt

Length of output: 1453


🏁 Script executed:

# Search for global mutex, lock, or synchronization declarations
rg -n "std::mutex|pthread_mutex|lock_guard|unique_lock" cpp/src/grpc/server/ -r

Repository: NVIDIA/cuopt

Length of output: 469


🏁 Script executed:

# Check the shm_ctrl structure and its synchronization
rg -n "shm_ctrl" cpp/src/grpc/server/grpc_worker.cpp -B 2 -A 2

Repository: NVIDIA/cuopt

Length of output: 427


🏁 Script executed:

# Look for any `#include` headers that might provide synchronization
head -50 cpp/src/grpc/server/grpc_worker.cpp | grep -E "#include|^#"

Repository: NVIDIA/cuopt

Length of output: 188


🏁 Script executed:

# Check if there are any CAS (compare-and-swap) or atomic operations
rg -n "CAS|compare_exchange|cas_flag" cpp/src/grpc/server/grpc_worker.cpp -i

Repository: NVIDIA/cuopt

Length of output: 923


🏁 Script executed:

# Look for the result_queue and ResultQueueEntry declaration with more context
rg -n "result_queue|ResultQueueEntry" cpp/src/grpc/server/ -B 5 -A 5 | head -80

Repository: NVIDIA/cuopt

Length of output: 6078


🏁 Script executed:

# Check header files for result_queue definition
fd -e hpp -e h cpp/src/grpc/server/ -exec grep -l "ResultQueueEntry\|result_queue" {} \;

Repository: NVIDIA/cuopt

Length of output: 321


🏁 Script executed:

# Look at job_queue to see how it uses atomics for comparison
rg -n "job_queue\[.*\].claimed" cpp/src/grpc/server/grpc_worker.cpp -B 3 -A 3

Repository: NVIDIA/cuopt

Length of output: 493


🏁 Script executed:

# Check the full context around where result_queue is declared/initialized
rg -n "^(static\s+|extern\s+)?.*result_queue" cpp/src/grpc/server/grpc_worker.cpp

Repository: NVIDIA/cuopt

Length of output: 875


🏁 Script executed:

# Check the ResultQueueEntry definition for field types
rg -n "struct ResultQueueEntry|class ResultQueueEntry" cpp/src/grpc/server/ -A 10

Repository: NVIDIA/cuopt

Length of output: 999


🏁 Script executed:

# Check if ResultQueueEntry.ready is atomic
fd -name "*.hpp" -o -name "*.h" cpp/src/grpc/server/ | xargs grep -l "ResultQueueEntry"

Repository: NVIDIA/cuopt

Length of output: 288


🏁 Script executed:

# Check how job_queue uses atomics for comparison to understand the pattern difference
cat cpp/src/grpc/server/grpc_server_types.hpp | grep -A 20 "struct JobQueueEntry"

Repository: NVIDIA/cuopt

Length of output: 1108


🏁 Script executed:

# Check if result_queue is in shared memory (multi-process)
rg -n "shm_open|mmap.*result_queue" cpp/src/grpc/server/

Repository: NVIDIA/cuopt

Length of output: 394


🏁 Script executed:

# Look at how result_queue is initialized
rg -n "result_queue\s*=" cpp/src/grpc/server/grpc_worker_infra.cpp -B 2 -A 2

Repository: NVIDIA/cuopt

Length of output: 38


🏁 Script executed:

# Look for result_queue initialization and declarations
rg -n "result_queue" cpp/src/grpc/server/grpc_worker_infra.cpp | head -20

Repository: NVIDIA/cuopt

Length of output: 565


🏁 Script executed:

# Check the actual initialization in grpc_server_main.cpp
rg -n "result_queue" cpp/src/grpc/server/grpc_server_main.cpp -B 3 -A 3

Repository: NVIDIA/cuopt

Length of output: 1024


Multi-worker race condition on result queue slot allocation

Both store_simple_result() and the main enqueue path pick the first !ready slot without an atomic claim. With multiple worker processes writing to shared memory (via mmap), two workers can read the same slot as available and clobber each other's data on non-atomic fields (job_id, status, data_size, error_message). The atomic ready flag alone does not protect concurrent writes to these fields. Additionally, when the queue is full, result_slot remains -1 and the job result is silently lost without error.

Use atomic compare-and-swap on a dedicated claim flag (following the JobQueueEntry.claimed pattern) to atomically reserve a slot before writing, or protect the entire slot allocation and write sequence with a mutex. Fail explicitly if no slots are available.

Applies to lines 56-74, 341-363, and similar pattern in grpc_worker_infra.cpp:194-206.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_worker.cpp` around lines 56 - 74,
store_simple_result currently scans result_queue for the first entry with
ready==false and writes non-atomic fields, which races with other workers;
replace that pattern with an atomic claim step (mirror JobQueueEntry.claimed
semantics) or protect allocation+write with a mutex: perform an atomic
compare-and-swap on a per-slot claim flag to reserve the slot, only then write
job_id/status/data_size/error_message and finally set ready; ensure the claim
flag is cleared on failure and set/cleared correctly on success; also make the
function return or log an explicit error when no slots can be claimed (do not
silently drop results). Apply the same fix where result_slot is selected in the
main enqueue path and in grpc_worker_infra.cpp where the same pattern appears.
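The claim-flag approach can be sketched with std::atomic. result_slot, kMaxResults, and claim_result_slot() below are illustrative, not the server's actual types, and a real shared-memory queue would also need to ensure the atomics are lock-free so they work across processes:

```cpp
#include <atomic>
#include <cstddef>

constexpr std::size_t kMaxResults = 4;

struct result_slot {
  std::atomic<bool> claimed{false};  // reserved by exactly one worker via CAS
  std::atomic<bool> ready{false};    // set last, after all fields are written
  int status = 0;                    // non-atomic payload, safe once claimed
};

// Returns the index of a freshly claimed slot, or -1 when the queue is full,
// so the caller can surface an explicit error instead of dropping the result.
int claim_result_slot(result_slot (&slots)[kMaxResults])
{
  for (std::size_t i = 0; i < kMaxResults; ++i) {
    bool expected = false;
    if (slots[i].claimed.compare_exchange_strong(expected, true)) {
      return static_cast<int>(i);  // winner writes fields, then sets ready
    }
  }
  return -1;
}
```

The compare_exchange_strong succeeds for exactly one worker per slot, which is the same reservation semantics the review attributes to JobQueueEntry.claimed.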

Comment on lines +248 to +252
cudaMemcpy(host_solution.data(),
device_solution.data(),
device_solution.size() * sizeof(double),
cudaMemcpyDeviceToHost);


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, check if the file exists and get basic information
cd /tmp/repo 2>/dev/null || echo "Repo not available"
find . -name "grpc_worker.cpp" -type f 2>/dev/null | head -5

Repository: NVIDIA/cuopt

Length of output: 112


🏁 Script executed:

# Search for grpc_worker.cpp in the repository
fd "grpc_worker\.cpp" 2>/dev/null | head -10

Repository: NVIDIA/cuopt

Length of output: 91


🏁 Script executed:

# Get the content of the file at the specified lines
fd "grpc_worker\.cpp" -x wc -l {} \; 2>/dev/null

Repository: NVIDIA/cuopt

Length of output: 97


🏁 Script executed:

# Read the relevant sections from grpc_worker.cpp
fd "grpc_worker\.cpp" -x cat -n {} \; 2>/dev/null | sed -n '240,260p; 290,315p'

Repository: NVIDIA/cuopt

Length of output: 2534


🏁 Script executed:

if [ -f "cpp/src/grpc/server/grpc_worker.cpp" ]; then
  echo "File found at expected path"
  wc -l "cpp/src/grpc/server/grpc_worker.cpp"
  echo "=== Lines 240-260 ==="
  sed -n '240,260p' "cpp/src/grpc/server/grpc_worker.cpp" | cat -n
  echo ""
  echo "=== Lines 290-315 ==="
  sed -n '290,315p' "cpp/src/grpc/server/grpc_worker.cpp" | cat -n
else
  echo "File not found at expected path"
  find . -name "grpc_worker.cpp" -type f 2>/dev/null
fi

Repository: NVIDIA/cuopt

Length of output: 2647


🏁 Script executed:

# Look for all cudaMemcpy calls in the file to understand the pattern
fd "grpc_worker\.cpp" -x grep -n "cudaMemcpy" {} \;

Repository: NVIDIA/cuopt

Length of output: 425


🏁 Script executed:

# Check for CUDA error checking patterns in the file
fd "grpc_worker\.cpp" -x grep -n -E "(CUDA_CHECK|cudaGetLastError|cudaDeviceSynchronize)" {} \;

Repository: NVIDIA/cuopt

Length of output: 38


🏁 Script executed:

# Look at the broader context around error handling in the file
fd "grpc_worker\.cpp" -x grep -n -E "(try|catch|throw|exception)" {} \; | head -20

Repository: NVIDIA/cuopt

Length of output: 319


🏁 Script executed:

# Get broader context around the cudaMemcpy calls
sed -n '165,320p' cpp/src/grpc/server/grpc_worker.cpp | cat -n

Repository: NVIDIA/cuopt

Length of output: 8704


🏁 Script executed:

# Check for CUDA_CHECK or other error checking macros defined in the file or headers
head -50 cpp/src/grpc/server/grpc_worker.cpp | cat -n

Repository: NVIDIA/cuopt

Length of output: 2099


🏁 Script executed:

# Search for all cudaMemcpy in the entire file to see how many are unchecked
grep -c "cudaMemcpy" cpp/src/grpc/server/grpc_worker.cpp

Repository: NVIDIA/cuopt

Length of output: 57


🏁 Script executed:

# Look for error checking patterns after cudaMemcpy calls
grep -A 3 "cudaMemcpy" cpp/src/grpc/server/grpc_worker.cpp | head -40

Repository: NVIDIA/cuopt

Length of output: 977


🏁 Script executed:

# Check what includes are at the top of the file
head -30 cpp/src/grpc/server/grpc_worker.cpp

Repository: NVIDIA/cuopt

Length of output: 949


Check CUDA return codes and fail the job on copy errors

cudaMemcpy failures are currently ignored, which can yield corrupted results or hard-to-debug downstream failures. Since this worker already wraps work in a try/catch, you can throw on CUDA failures and surface a clean error message to clients.

Example patch (apply to all cudaMemcpy sites)
-        cudaMemcpy(host_solution.data(),
-                   device_solution.data(),
-                   device_solution.size() * sizeof(double),
-                   cudaMemcpyDeviceToHost);
+        {
+          auto err = cudaMemcpy(host_solution.data(),
+                                device_solution.data(),
+                                device_solution.size() * sizeof(double),
+                                cudaMemcpyDeviceToHost);
+          if (err != cudaSuccess) {
+            throw std::runtime_error(std::string("cudaMemcpy(device_solution) failed: ") +
+                                     cudaGetErrorString(err));
+          }
+        }

Also applies to: lines 296-307 (host_primal, host_dual, host_reduced_cost)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_worker.cpp` around lines 248 - 252, The cudaMemcpy
call copying device_solution to host_solution ignores return codes and can
silently fail; change the cudaMemcpy(...) call to capture the cudaError_t
result, check if it's not cudaSuccess, and throw a descriptive exception (e.g.,
std::runtime_error) including cudaGetErrorString(err) so the worker's try/catch
surfaces a clean error to clients; apply the same pattern to the other copies
that populate host_primal, host_dual, and host_reduced_cost (i.e., any uses of
cudaMemcpy with device_* -> host_* in grpc_worker.cpp).

Comment on lines +53 to +57
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "== Check required include =="
rg -n '^\s*#include <fcntl\.h>' cpp/tests/linear_programming/grpc/grpc_integration_test.cpp || echo "fcntl.h NOT found"
echo "== Check open/O_* usage =="
rg -n '\bopen\s*\(|\bO_WRONLY\b|\bO_CREAT\b|\bO_TRUNC\b' cpp/tests/linear_programming/grpc/grpc_integration_test.cpp || echo "No usage found"
echo "== Show context around line 53-57 =="
sed -n '50,60p' cpp/tests/linear_programming/grpc/grpc_integration_test.cpp
echo "== Show context around line 120 =="
sed -n '115,125p' cpp/tests/linear_programming/grpc/grpc_integration_test.cpp

Repository: NVIDIA/cuopt

Length of output: 840


Add the missing POSIX header for open/O_* usage.

The <fcntl.h> header is not included, but open() and flags like O_WRONLY, O_CREAT, and O_TRUNC are used at line 120. This can fail on stricter toolchains that don't include fcntl.h transitively.

🔧 Proposed fix
 #include <signal.h>
+#include <fcntl.h>
 #include <sys/types.h>
 #include <sys/wait.h>
 #include <unistd.h>
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <signal.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/grpc_integration_test.cpp` around lines 53
- 57, The test file is missing the POSIX header required for open() and O_*
flags; add the missing include by inserting `#include` <fcntl.h> near the other
system headers in grpc_integration_test.cpp so symbols like open(), O_WRONLY,
O_CREAT, and O_TRUNC used later (around the code that opens the log file) are
defined on strict toolchains.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (3)
cpp/src/grpc/client/test_grpc_client.cpp (1)

340-340: Print the client error on log-stream failure for parity with other modes.

At Line 340, a failed stream_logs returns non-zero but drops the error detail. Emit client.get_last_error() before returning to keep failure diagnosis consistent.

Proposed fix
-  return success ? 0 : 1;
+  if (!success) {
+    std::cerr << "Log stream failed: " << client.get_last_error() << "\n";
+    return 1;
+  }
+  return 0;

As per coding guidelines: "Verify error propagation from CUDA to user-facing APIs is complete; ensure CUDA errors are caught and mapped to meaningful user error codes".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/test_grpc_client.cpp` at line 340, The return path after
stream_logs currently returns non-zero on failure but drops the error detail;
update the failure branch surrounding stream_logs in test_grpc_client.cpp to
call and emit client.get_last_error() (or equivalent accessor) to the log/stream
before returning the non-zero exit code so the client error is logged for parity
with other modes and aids diagnosis (locate the code around stream_logs and the
return of success ? 0 : 1 and insert a process/log call that prints
client.get_last_error() when success is false).
cpp/src/grpc/server/grpc_pipe_serialization.hpp (1)

24-27: Consider making kPipeChunkBytes configurable.

The fixed 64 MiB chunk cap is a hard resource limit; exposing it via server config/env (with this as default) would improve deploy-time tuning.

As per coding guidelines: "Check that hard-coded GPU device IDs and resource limits are made configurable; abstract multi-backend support for different CUDA versions."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` around lines 24 - 27,
kPipeChunkBytes is a hard-coded constexpr limit; make it configurable by
replacing the fixed constexpr with a runtime-configured value (e.g., a
GetPipeChunkBytes() or PipeConfig::pipe_chunk_bytes() accessor) that reads an
environment variable or server config and falls back to 64ULL * 1024 * 1024 as
the default; update references to kPipeChunkBytes to call the accessor and
validate the parsed value (positive, reasonable upper bound) and document the
new env var name (e.g., PIPE_CHUNK_BYTES) and default behavior.
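A minimal sketch of the accessor this prompt describes, assuming a hypothetical `CUOPT_PIPE_CHUNK_BYTES` env var name and a 1 GiB upper bound (neither is in the codebase):

```cpp
#include <cstddef>
#include <cstdlib>

// Default matches the current hard-coded constant: 64 MiB.
constexpr std::size_t kDefaultPipeChunkBytes = 64ULL * 1024 * 1024;

// Upper bound to reject pathological settings (1 GiB is an assumption).
constexpr std::size_t kMaxPipeChunkBytes = 1024ULL * 1024 * 1024;

// Read the chunk size from CUOPT_PIPE_CHUNK_BYTES (hypothetical name),
// falling back to the default on absence or invalid input.
inline std::size_t get_pipe_chunk_bytes()
{
  const char* env = std::getenv("CUOPT_PIPE_CHUNK_BYTES");
  if (env == nullptr) { return kDefaultPipeChunkBytes; }
  char* end            = nullptr;
  unsigned long long v = std::strtoull(env, &end, 10);
  // Reject empty strings, trailing garbage, zero, and out-of-range values.
  if (end == env || *end != '\0' || v == 0 || v > kMaxPipeChunkBytes) {
    return kDefaultPipeChunkBytes;
  }
  return static_cast<std::size_t>(v);
}
```

The value is validated once at read time so callers never see zero or an absurd cap.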
cpp/tests/linear_programming/grpc/grpc_integration_test.cpp (1)

957-982: Assert the log-callback path in SolveMIPWithLogCallback.

received_logs is populated but never asserted, so this test can pass even if callback streaming regresses.

💡 Suggested assertion
   auto result = client->solve_mip(problem, settings, false);
   EXPECT_TRUE(result.success) << result.error_message;
+  EXPECT_FALSE(received_logs.empty()) << "Expected at least one streamed log line";
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/grpc_integration_test.cpp` around lines 957
- 982, The test registers a log callback (received_logs with log_mutex) but
never asserts anything, so add assertions after the call to client->solve_mip in
SolveMIPWithLogCallback to verify the callback was invoked and logs look
correct: acquire lock(log_mutex) and assert !received_logs.empty() (and
optionally that at least one entry contains an expected substring like
"objective" or "MIP" to ensure meaningful log content). If solve_mip is
asynchronous/streams logs, ensure you wait briefly or poll until received_logs
is non-empty before asserting to avoid flakes; reference received_logs,
log_mutex, log_callback, SolveMIPWithLogCallback, and client->solve_mip when
adding the checks.
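The "wait briefly or poll" suggestion above can be sketched as a small helper (not present in the test file; names and timeouts are illustrative):

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Poll until `pred` becomes true or `timeout` elapses, sleeping briefly
// between checks; returns the final predicate value. Useful before
// asserting on asynchronously streamed logs to avoid flakes.
inline bool wait_until(const std::function<bool()>& pred,
                       std::chrono::milliseconds timeout = std::chrono::milliseconds(2000))
{
  auto deadline = std::chrono::steady_clock::now() + timeout;
  while (std::chrono::steady_clock::now() < deadline) {
    if (pred()) { return true; }
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
  }
  return pred();  // one last check at the deadline
}
```

The test would then assert `wait_until([&]{ std::lock_guard<std::mutex> g(log_mutex); return !received_logs.empty(); })` instead of checking the vector immediately.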
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/CMakeLists.txt`:
- Around line 297-304: The CMake snippet only checks for the non-namespaced
target grpc_cpp_plugin; update the check to first test for the namespaced target
gRPC::grpc_cpp_plugin (like how gRPC::grpc++ is handled) and use its target file
if present, otherwise fall back to checking TARGET grpc_cpp_plugin and then
find_program for the executable; adjust references to set
_GRPC_CPP_PLUGIN_EXECUTABLE from "$<TARGET_FILE:gRPC::grpc_cpp_plugin>" when
available to ensure installed gRPC packages are recognized.

In `@cpp/tests/linear_programming/grpc/grpc_client_test.cpp`:
- Around line 1387-1402: The test currently expects GetResult to be called even
though CheckStatus returns PROCESSING (not complete); update the test so
GetResult is not expected for PROCESSING jobs: remove or replace the
EXPECT_CALL(*mock_stub_, GetResult(...)).WillOnce(...) with an expectation that
it is not called (e.g., EXPECT_CALL(*mock_stub_, GetResult(_, _, _)).Times(0))
or simply delete the GetResult expectation, keeping the CheckStatus stub that
sets resp->set_job_status(cuopt::remote::PROCESSING) so the test asserts the
intended "do not call GetResult for non-complete status" behavior.

In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp`:
- Around line 275-283: The server_log_count function can infinite-loop when
substring is empty; add a guard at the start of server_log_count (which calls
get_server_logs()) to return 0 (or handle as appropriate) if substring.empty()
before entering the while loop so pos will never rely on substring.length() == 0
to advance.
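The guard this prompt asks for can be sketched as follows (a standalone illustration, not the actual `server_log_count` implementation):

```cpp
#include <cstddef>
#include <string>

// Count non-overlapping occurrences of `substring` in `text`.
// An empty needle returns 0 instead of looping forever, because with
// substring.length() == 0 the position would never advance.
inline std::size_t count_occurrences(const std::string& text, const std::string& substring)
{
  if (substring.empty()) { return 0; }  // the guard the review asks for
  std::size_t count = 0;
  std::size_t pos   = 0;
  while ((pos = text.find(substring, pos)) != std::string::npos) {
    ++count;
    pos += substring.length();
  }
  return count;
}
```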

---

Duplicate comments:
In `@cpp/CMakeLists.txt`:
- Around line 279-294: The current CMake logic relies on IMPORTED_LOCATION*
which can be empty for non-imported targets; update the resolution to use
$<TARGET_FILE:protobuf::protoc> when the target exists: in the branch that tests
if(TARGET protobuf::protoc) set _PROTOBUF_PROTOC to the generator expression
$<TARGET_FILE:protobuf::protoc> (instead of querying IMPORTED_LOCATION*),
otherwise fall back to find_package(protobuf CONFIG REQUIRED) and then use
IMPORTED_LOCATION_RELEASE/IMPORTED_LOCATION as a fallback; keep the existing
fatal check for empty _PROTOBUF_PROTOC. Ensure you reference the symbols
protobuf::protoc and _PROTOBUF_PROTOC in the change so the target-file generator
expression is used for in-tree/FetchContent builds.

In `@cpp/src/grpc/client/grpc_client.cu`:
- Around line 258-265: grpc_client_t::check_status (and similar RPC helpers)
currently dereferences impl_->stub without checking connection state which can
crash if called before connect(); add a null-check for impl_ and impl_->stub at
the start of grpc_client_t::check_status and for the same pattern in the other
RPC helper methods (the ones calling impl_->stub->...). If either is null,
return an appropriate error job_status_result_t (populate error code/message or
a failed state consistent with existing error handling) instead of calling the
stub; otherwise proceed to build the request and call CheckStatus as before.
- Around line 994-995: The code computes max_polls by dividing by
config_.poll_interval_ms without checking for <=0; add a guard before any
division (both where max_polls is computed and the other occurrence around the
1132 region) to validate config_.poll_interval_ms > 0, and if not either (a)
return or raise a clear error/invalid-argument from the surrounding function, or
(b) log an error and use a safe default (e.g., 1 ms) before computing max_polls;
ensure you reference config_.poll_interval_ms and update the computations that
set max_polls to avoid divide-by-zero crashes.
- Around line 221-251: The read loop in start_log_streaming blocks on
reader->Read() so stop_log_streaming can hang; to fix, add a mutex-protected
pointer member (log_context_mutex_ and active_log_context_) and set
active_log_context_ = &context (protected by log_context_mutex_) right after
creating grpc::ClientContext context in start_log_streaming, clear it (set
nullptr) after reader->Finish(); then update stop_log_streaming to lock
log_context_mutex_ and if active_log_context_ is non-null call
active_log_context_->TryCancel() before setting stop_logs_ and joining the
thread so the blocking StreamLogs read is cancelled safely.
- Around line 703-712: Validate elem_size and guard against overflow before
dividing/allocating: check elem_size > 0 before computing elems_per_chunk =
chunk_data_budget / elem_size (return/continue or log and skip if zero), ensure
total_elems > 0 (already present) and verify total_elems <= SIZE_MAX / elem_size
(or use safe_mul_check) before computing total_elems * elem_size and allocating
array_bytes, and if the allocation would exceed a configured budget or SIZE_MAX
also error/skip; apply the same checks when computing per-chunk sizes and before
any memcpy/copy operations. Reference symbols: elem_size, total_elems,
chunk_data_budget, elems_per_chunk, array_bytes, arr_desc.total_elements(),
arr_desc.element_size_bytes() (also update the similar block around the other
occurrence).
- Around line 262-265: The CheckStatus RPC (and other RPCs that create
grpc::ClientContext, e.g., the sites calling impl_->stub->CheckStatus and the
other stub->* calls listed) currently create a ClientContext with no deadline;
fix by setting a per-RPC deadline on the context before the call using
context.set_deadline(std::chrono::system_clock::now() + configured_timeout).
Prefer reusing or adding a small helper (e.g., set_deadline(context, timeout_ms)
or a method on impl_ that reads a configured RPC timeout) and apply it to the
Call sites such as the CheckStatus call (impl_->stub->CheckStatus) and the other
grpc::ClientContext creation sites flagged in the review so all RPCs
consistently use the same configured deadline.
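The overflow guard requested for the chunking code above can be sketched as a helper (names are illustrative, not from the codebase):

```cpp
#include <cstddef>
#include <limits>

// Compute total_elems * elem_size with explicit guards: a zero element
// size, a zero element count, and a product that would exceed SIZE_MAX
// are all rejected. Returns false instead of allocating on bad input.
inline bool safe_array_bytes(std::size_t total_elems,
                             std::size_t elem_size,
                             std::size_t& out_bytes)
{
  if (elem_size == 0 || total_elems == 0) { return false; }
  if (total_elems > std::numeric_limits<std::size_t>::max() / elem_size) {
    return false;  // total_elems * elem_size would overflow
  }
  out_bytes = total_elems * elem_size;
  return true;
}
```

A caller would check the return value before `resize()`/`memcpy`, and also compare `out_bytes` against any configured budget before allocating.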

In `@cpp/src/grpc/client/grpc_client.hpp`:
- Around line 53-99: The grpc_client_config_t allows invalid values for
intervals/sizes (poll_interval_ms, incumbent_poll_interval_ms, timeout_seconds,
chunk_size_bytes, max_message_bytes, chunked_array_threshold_bytes) which can
break runtime behavior; add a validation method (e.g.,
grpc_client_config_t::Validate() or a constructor/static factory) that enforces
and documents invariants (positive non-zero ints for intervals/timeout, sensible
min/max for chunk_size_bytes and max_message_bytes, and explicit handling for
chunked_array_threshold_bytes semantics (-1 for auto, 0 for force-chunked, >0 as
threshold)), make the client call Validate() early (e.g., before using in
grpc_client constructors or SubmitJob path) and return/throw a clear error on
invalid config or clamp values where safe; update comments to list accepted
ranges for each field.

In `@cpp/src/grpc/client/solve_remote.cu`:
- Around line 55-57: The env-derived overrides for config.chunk_size_bytes and
config.max_message_bytes are applied without validation (using parse_env_int64),
allowing zero/negative/oversized values; update the logic where
config.chunk_size_bytes, config.max_message_bytes (and optionally
config.enable_debug_log) are set so that after calling parse_env_int64 you
validate the result against sane bounds (e.g., >0 and <= a defined
MAX_MESSAGE_BYTES/CHUNK_SIZE constant), ignore or clamp invalid values, and emit
a debug/warning log via the existing logger; specifically modify the section
using parse_env_int64 to perform the check and fallback to existing config
values when the parsed value is out-of-range.
- Around line 164-166: The callback is being passed the incumbent objective as
the bound (bound_copy = objective) which can show a false zero gap; change the
code so bound_copy is set to the solver's actual best bound variable (the
maintained global/best bound used by the B&B engine) instead of 'objective'
before calling get_callback->get_solution(...); locate where the solver stores
the current best bound (e.g., variables named best_bound, global_bound,
current_bound, or similar in the surrounding code/branch-and-bound context) and
use that identifier (preserving obj_copy, solution_copy, and
callback->get_user_data() usage).

In `@cpp/src/grpc/client/test_grpc_client.cpp`:
- Around line 401-403: The parsing of numeric CLI arguments uses std::stod and
std::stoll without guards and converts a double to int unsafely (variable
time_limit and the static_cast<int>(time_limit)); wrap each std::stod/std::stoll
call in a try/catch (catch std::invalid_argument and std::out_of_range),
validate the parsed value for positivity and reasonable bounds, and return/exit
with a clear error on bad input; before converting time_limit to int (the
static_cast<int>(time_limit) usage) check that time_limit is finite, >= 0 and <=
std::numeric_limits<int>::max() (or clamp/handle overflow) to avoid UB; apply
the same defensive checks to the stoll usages (ensure values fit in target
integer types) and use the same error path if validation fails.
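The guarded parsing described above can be sketched like this (a hypothetical helper, not the CLI's actual code):

```cpp
#include <cmath>
#include <limits>
#include <stdexcept>
#include <string>

// Parse a time limit given as a string, rejecting non-numeric input,
// trailing garbage, negatives, non-finite values, and anything that
// will not fit in int. Returns false so the caller can print usage
// and exit instead of invoking UB via static_cast<int>.
inline bool parse_time_limit(const std::string& arg, int& out_seconds)
{
  double v = 0.0;
  try {
    std::size_t consumed = 0;
    v = std::stod(arg, &consumed);
    if (consumed != arg.size()) { return false; }  // e.g. "30x"
  } catch (const std::invalid_argument&) {
    return false;
  } catch (const std::out_of_range&) {
    return false;
  }
  if (!std::isfinite(v) || v < 0.0 ||
      v > static_cast<double>(std::numeric_limits<int>::max())) {
    return false;
  }
  out_seconds = static_cast<int>(v);
  return true;
}
```

The same shape (try/catch plus range check before the narrowing cast) applies to the `std::stoll` call sites.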

In `@cpp/src/grpc/cuopt_remote.proto`:
- Line 6: The proto package declaration "package cuopt.remote" in
cuopt_remote.proto conflicts with its directory path (cpp/src/grpc) and triggers
Buf PACKAGE_DIRECTORY_MATCH; fix by making the package name match the file path
or by moving the file to match the package. Either change the package line in
cpp/src/grpc/cuopt_remote.proto from "package cuopt.remote" to "package
cpp.src.grpc" (matching the directory segments as dot-separated identifiers) or
relocate cuopt_remote.proto into a cuopt/remote directory so the existing
"package cuopt.remote" matches the filesystem; update any import/usage
references accordingly and re-run buf lint to verify.

In `@cpp/src/grpc/grpc_problem_mapper.cu`:
- Around line 175-182: The code assumes constraint_upper_bounds exists when
pb_problem.constraint_lower_bounds_size() > 0; instead validate both arrays and
their sizes before calling cpu_problem.set_constraint_lower_bounds and
cpu_problem.set_constraint_upper_bounds: check
pb_problem.constraint_lower_bounds_size() > 0 &&
pb_problem.constraint_upper_bounds_size() > 0 and that their sizes match, and
only then build con_lb/con_ub and call the two set_* methods; if sizes mismatch
or one is missing, handle the case (skip setting bounds or raise/log an error)
to avoid creating an inconsistent problem state.
- Around line 435-446: get_string_list can advance s past s_end when memchr
returns null (no trailing NUL); fix by computing remaining = s_end - s, call
memchr over remaining, and if memchr returns null push the final string using
the remaining length and then break (do not set s = nul + 1). Update the loop in
get_string_list to use it->second.data() and it->second.size() to compute s and
s_end, use the computed remaining length for std::memchr, and on null result
construct names.emplace_back(s, remaining) and break instead of setting s = nul
+ 1.
- Around line 399-411: The lambdas (notably get_doubles) compute element count
as it->second.size() / sizeof(double) then memcpy the full byte payload, which
can overflow if the byte length is not a multiple of sizeof(double); add an
explicit alignment/length check before any typed memcpy: verify
it->second.size() % sizeof(double) == 0 (or the equivalent check for the other
lambda’s element type), and if the check fails either return an empty vector or
log/throw an error; only then allocate the destination vector with n =
it->second.size() / sizeof(double) and perform memcpy; apply the same fix to the
other similar lambda at the noted region (around 414-426) to avoid misaligned
copies.

In `@cpp/src/grpc/grpc_service_mapper.cu`:
- Around line 23-55: Chunked wire format is inconsistent for floats: update
chunk_typed_array (and the client-side build_array_chunk_requests that calls it)
to always emit floating arrays as IEEE double on the wire (so server get_doubles
can continue to interpret chunks as sizeof(double)). Concretely, when T ==
float, convert the input vector to a temporary std::vector<double> and use that
buffer for chunking (set elem_size = sizeof(double) and point raw at the double
buffer) so byte offsets/counts match get_doubles; for non-floating types keep
the existing behavior. Ensure build_array_chunk_requests callers use the
adjusted chunk_typed_array semantics and do not send raw float bytes anymore.
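The float-to-double widening described above can be sketched as follows (assuming, per the review, that the server-side reader always interprets chunks as `sizeof(double)` elements; the function name is illustrative):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Widen float inputs to IEEE double before chunking so the server side,
// which assumes sizeof(double) elements, sees a consistent wire format.
inline std::vector<std::uint8_t> floats_as_double_bytes(const std::vector<float>& in)
{
  std::vector<double> widened(in.begin(), in.end());  // float -> double
  std::vector<std::uint8_t> bytes(widened.size() * sizeof(double));
  if (!bytes.empty()) { std::memcpy(bytes.data(), widened.data(), bytes.size()); }
  return bytes;
}
```

With this in place, byte offsets and counts computed against `sizeof(double)` stay correct for both float and double source arrays.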

In `@cpp/src/grpc/grpc_solution_mapper.cu`:
- Around line 520-534: The bytes_to_typed template is copying raw.size() bytes
into buffers sized for n elements, causing overflow when raw.size() isn't an
exact multiple of the element byte-size; update each branch (the
std::is_same_v<T,float> branch that builds tmp as double, the
std::is_same_v<T,double> branch that builds v as double, and the generic branch)
to: compute n = raw.size() / sizeof(ElemType), check that raw.size() %
sizeof(ElemType) == 0 and handle (throw/log/return error) if not, and use
memcpy(dst.data(), raw.data(), n * sizeof(ElemType)) instead of memcpy(...,
raw.size()) so you only copy the exact bytes that fit the destination (refer to
function bytes_to_typed and the template T branches).
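The exact-multiple check can be sketched for the double branch like this (a standalone illustration, not the templated `bytes_to_typed` itself):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Convert a raw byte payload into doubles, refusing payloads whose
// length is not an exact multiple of sizeof(double) -- the overflow
// the review flags. Copies exactly n * sizeof(double) bytes, never
// raw.size(), so the destination can never be overrun.
inline bool bytes_to_doubles(const std::vector<std::uint8_t>& raw, std::vector<double>& out)
{
  if (raw.size() % sizeof(double) != 0) { return false; }  // ragged payload
  const std::size_t n = raw.size() / sizeof(double);
  out.resize(n);
  if (n > 0) { std::memcpy(out.data(), raw.data(), n * sizeof(double)); }
  return true;
}
```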

In `@cpp/src/grpc/server/grpc_incumbent_proto.hpp`:
- Around line 46-49: The assignment vector is not cleared before appending
parsed values, so repeated calls to parse_incumbent_proto will accumulate stale
entries; fix by calling assignment.clear() (or assign empty) before
assignment.reserve(...) in the code that iterates over incumbent_msg
(referencing the assignment variable and the loop that uses
incumbent_msg.assignment_size() and incumbent_msg.assignment(i)) so the vector
starts empty each parse.
- Around line 30-32: The serialization ignores SerializeToArray's boolean return
and narrows ByteSizeLong() (size_t) to int unsafely: update the code that
constructs buffer and calls msg.SerializeToArray to first capture size_t sz =
msg.ByteSizeLong(), validate sz <= static_cast<size_t>(INT_MAX) before casting,
allocate buffer with that size, call bool ok =
msg.SerializeToArray(buffer.data(), static_cast<int>(sz)), and handle a false
result (e.g., throw or return an error) instead of ignoring it; reference the
buffer variable, msg.ByteSizeLong(), and msg.SerializeToArray(...) when making
the change.

In `@cpp/src/grpc/server/grpc_job_management.cpp`:
- Around line 107-113: The recv_incumbent_pipe function trusts the incoming size
prefix and can allocate arbitrarily large memory; add a sanity check before
resizing by comparing the read size (variable size) against a defined cap (e.g.,
a new constant like kMaxIncumbentSize or MAX_INCUMBENT_SIZE) and reject/return
false if size is 0 or exceeds that cap, then only call data.resize(size) and
read_from_pipe when the size is within limits; update recv_incumbent_pipe to
enforce this limit and document the constant choice.
- Around line 147-158: The loop in submit_job_async is racy because threads
perform a check-then-write on job_queue[i] (checking ready/claimed then
initializing), so change the slot acquisition to an atomic claim: use an atomic
flag (e.g., make job_queue[].claimed an std::atomic_bool or add a per-slot
std::atomic<int> state) and perform a compare_exchange_strong to transition from
false->true to reserve the slot; only the thread that succeeds should populate
the slot fields (job_id, problem_type, data_size, worker_pid, worker_index,
data_sent, cancelled) and set job_queue[i].ready = true as the final publish
step. Ensure submit_job_async uses this atomic CAS on job_queue[i].claimed and
does not set ready until after all other fields are initialized.
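The claim-then-publish protocol described above can be sketched as follows (field names mirror the review, not the real shared-memory struct):

```cpp
#include <array>
#include <atomic>
#include <cstddef>

struct slot_t {
  std::atomic<bool> claimed{false};
  std::atomic<bool> ready{false};
  int job_id{-1};
};

// Reserve a slot with compare_exchange_strong on `claimed`, fill in the
// fields, then store `ready` last so consumers never observe a
// half-initialized entry. Returns the slot index, or -1 if full.
template <std::size_t N>
int claim_slot(std::array<slot_t, N>& queue, int job_id)
{
  for (std::size_t i = 0; i < N; ++i) {
    bool expected = false;
    // Only one thread can flip claimed false -> true for this slot.
    if (queue[i].claimed.compare_exchange_strong(expected, true)) {
      queue[i].job_id = job_id;                               // populate fields
      queue[i].ready.store(true, std::memory_order_release);  // publish last
      return static_cast<int>(i);
    }
  }
  return -1;
}
```

The release store on `ready` pairs with an acquire load on the consumer side, so the populated fields are visible once `ready` is observed.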

In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp`:
- Around line 93-100: Validate and sanitize the metadata before resizing or
memcpy: check ac.total_elements() and ac.element_offset() are non-negative and
within reasonable caps, ensure chunk_data.size() + offset does not overflow and
fits within a bounded/validated size_t, and guard against pathological
total_elements to avoid huge allocations; in the functions using ac (e.g., where
dest.resize(...) and std::memcpy(...) are called), convert to size_t only after
validation, return or throw an error on invalid/too-large values instead of
resizing/copying, and ensure the memcpy condition uses the validated,
overflow-checked bounds.
- Line 90: The deserializers currently treat ParseDelimitedFromCodedStream
failures by breaking out and eventually returning true, accepting malformed
payloads; change the error handling around
google::protobuf::util::ParseDelimitedFromCodedStream(...) (the calls that pass
ac, coded, clean_eof) so that on a parse failure you immediately fail closed and
return false (or propagate an error) instead of break/continuing; apply the same
change to both deserializer sites that call ParseDelimitedFromCodedStream to
ensure truncated/malformed messages are rejected.

In `@cpp/src/grpc/server/grpc_server_main.cpp`:
- Around line 234-250: Builder::BuildAndStart() can return nullptr so avoid
dereferencing server without checking; after calling builder.BuildAndStart()
(unique symbols: builder.BuildAndStart(), std::unique_ptr<Server> server,
server->Wait(), shutdown_thread, keep_running), check if server is null and if
so log an error/print to stderr, ensure shutdown_thread is signaled/joined or
cleaned up, and exit or return with a non-zero status; if server is valid
continue to start the shutdown_thread and call server->Wait() as currently
implemented.
- Around line 89-90: The signal handler registered via signal(SIGINT,
signal_handler) performs non-async-signal-safe operations (std::cout and
result_cv.notify_all()) — change it to a minimal, async-signal-safe handler:
have signal_handler set a std::sig_atomic_t or std::atomic<bool> flag (e.g.,
shutdown_requested) and (optionally) write a byte to a dedicated pipe to wake
the main loop; then remove any std::cout or condition_variable use from
signal_handler and move the actual logging and result_cv.notify_all() into the
main thread’s event loop where you check shutdown_requested (or the pipe) and
call result_cv.notify_all() and perform printing. Update grpc_server_types.hpp
to expose the atomic flag (or pipe FD) and ensure only async-signal-safe
functions are called from signal_handler.
- Around line 156-168: The loops in grpc_server_main.cpp must not call memset on
JobQueueEntry and ResultQueueEntry because they contain std::atomic members;
remove the memset calls and instead default-construct or placement-new each
element and then explicitly initialize fields. Replace memset(&job_queue[i], 0,
sizeof(JobQueueEntry)) with either new (&job_queue[i]) JobQueueEntry() or ensure
the array is value-initialized, then call job_queue[i].ready.store(false),
.claimed.store(false), .cancelled.store(false) and
job_queue[i].worker_index.store(-1); do the same for result_queue elements
(value-initialize or placement-new, then .ready.store(false) and
.retrieved.store(false)). Ensure JobQueueEntry and ResultQueueEntry constructors
handle non-atomic fields if you prefer constructor-based initialization.
- Around line 186-212: The early-return TLS error paths currently exit
immediately after spawn_workers() and creating threads, leaking workers,
threads, and shared memory; change these branches to perform proper cleanup
before returning: after calling spawn_workers() and after launching the threads
(referencing spawn_workers() and the container holding created std::thread
objects), call the corresponding shutdown/stop routine for workers (e.g.,
stop_workers() or signal shutdown on the worker pool), join all threads in the
thread container, and release any shared memory/IPC resources (the handles
created earlier, and any variables like ssl_opts or pem_root_certs can be left
untouched) before returning an error code. Ensure every early-return path that
occurs after worker spawn or thread creation follows this cleanup sequence so no
threads or shared resources remain leaked.
- Around line 50-80: The CLI parsing loop currently in main (the for-loop that
processes argv and assigns to config) silently ignores unknown flags and missing
values; update it to validate every flag and required value: for each branch
that expects a value (flags that call std::stoi/std::stoll or assign argv[++i]
to config fields like config.port, config.num_workers, config.max_message_mb,
config.max_message_b, config.tls_cert_path, config.tls_key_path,
config.tls_root_path) ensure you check i+1<argc and on failure call
print_usage(argv[0]) and return a non-zero exit code; add an else branch for
unrecognized args (arg starts with '-') that also prints usage and exits
non-zero; additionally wrap stoi/stoll conversions in try/catch to detect
invalid numeric inputs and treat them as errors that print usage and exit
non-zero so the server cannot start with unintended defaults.

In `@cpp/src/grpc/server/grpc_server_threads.cpp`:
- Around line 166-229: When deserialize_result_pipe_blob(...) fails you mark
it->second.status = JobStatus::FAILED but you leave the local success flag true,
causing waiter->success to be set incorrectly; update the logic so that after a
deserialization failure you set success = false (or set waiter->success based on
it->second.status != JobStatus::FAILED) before assigning waiter->success in the
waiting_threads block; modify the code paths around
deserialize_result_pipe_blob, job_tracker, and the waiter assignment
(waiter->success) so waiters observe the actual failed state and error_message
from it->second.error_message.

In `@cpp/src/grpc/server/grpc_server_types.hpp`:
- Around line 300-307: signal_handler currently calls non-async-signal-safe APIs
(std::cout and result_cv.notify_all); change it to only perform
async-signal-safe operations: remove the std::cout and result_cv.notify_all
calls and instead set an atomic/sig_atomic_t flag (ensure keep_running is an
std::atomic_bool or set a sig_atomic_t atomic_shutdown) and, if needed, set
shm_ctrl->shutdown_requested using a sig_atomic_t field; add a normal runtime
watcher thread that observes keep_running/shutdown flag and performs logging and
calls result_cv.notify_all when it detects shutdown; refer to signal_handler,
keep_running, shm_ctrl, shutdown_requested and result_cv when making these
changes.
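The async-signal-safe pattern described above can be sketched as follows (hypothetical names; the real fix would pair this flag with a watcher thread that does the logging and calls result_cv.notify_all()):

```cpp
#include <atomic>
#include <csignal>

// The handler's only action is an async-signal-safe store to a
// sig_atomic_t flag; no iostream, no condition variables.
volatile std::sig_atomic_t shutdown_requested = 0;

extern "C" void handle_sigint(int)
{
  shutdown_requested = 1;
}

inline void install_handler()
{
  std::signal(SIGINT, handle_sigint);
}
```

The main loop (or a dedicated watcher thread) polls `shutdown_requested` and performs the printing and notification from normal thread context.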

In `@cpp/src/grpc/server/grpc_service_impl.cpp`:
- Around line 71-73: The code releases job_queue[job_idx].claimed before
publishing job_queue[job_idx].ready, creating a race where another producer can
claim the slot before consumers see ready; invert the operations so you
store(true) to job_queue[job_idx].ready before you store(false) to
job_queue[job_idx].claimed (apply same fix at the other occurrence around lines
with job_queue[...] claimed/ready, e.g., the site referenced as 265-267) to
ensure ready is visible prior to allowing re-claim.
- Around line 633-642: Hold on to waiter->mutex only while inspecting/modifying
waiter state; do not call check_job_status() while holding waiter->mutex to
avoid lock-order inversion with tracker_mutex used in cancel paths (e.g.,
cancel_job). In WaitForCompletion, restructure the loop so you acquire
unique_lock waiter's mutex, check waiter->ready and context->IsCancelled (and
adjust waiter->waiters on cancel) but before invoking check_job_status(job_id,
msg) release the lock (lock.unlock()), call check_job_status(), then re-lock
(lock.lock()) to re-evaluate waiter->ready and continue/wait; alternatively
adopt a consistent global lock order (acquire tracker_mutex before waiter->mutex
everywhere) — prefer the unlock/call/relock pattern around the call site to
quickly eliminate the deadlock risk.
- Around line 189-192: The code currently blindly adds each chunk to
meta.received_bytes, state.total_bytes, state.chunks and increments
state.total_chunks without enforcing a cumulative bound, so add a pre-check that
rejects or truncates a chunk when meta.received_bytes + raw.size() (or
state.total_bytes + raw.size()) would exceed the declared/expected field size
(use the existing expected-size field on meta or state), and only update
meta.received_bytes, state.total_bytes, state.chunks.push_back(ac) and
++state.total_chunks after the check succeeds; on failure return an error/status
indicating oversized upload to prevent unbounded memory growth.
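The unlock/call/relock pattern recommended for `WaitForCompletion` can be sketched as below. `Waiter`, `check_job_status`, and `tracker_mutex` are simplified stand-ins for the server types named in the finding, not the actual implementation; the point is that `waiter.mutex` is never held across the call that takes `tracker_mutex`.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <string>
#include <thread>

// Hypothetical stand-ins for the waiter/tracker types in the review.
struct Waiter {
  std::mutex mutex;
  std::condition_variable cv;
  bool ready = false;
};

std::mutex tracker_mutex;  // also taken by cancel paths, e.g. cancel_job

bool check_job_status(const std::string& /*job_id*/, std::string& msg)
{
  std::lock_guard<std::mutex> lk(tracker_mutex);  // would deadlock if the
  msg = "PROCESSING";                             // caller held waiter.mutex
  return true;                                    // in the opposite order
}

// Never hold waiter.mutex while calling check_job_status.
bool wait_for_completion(Waiter& waiter, const std::string& job_id)
{
  std::unique_lock<std::mutex> lock(waiter.mutex);
  while (!waiter.ready) {
    lock.unlock();  // drop waiter.mutex before taking tracker_mutex
    std::string msg;
    check_job_status(job_id, msg);
    lock.lock();    // re-lock, then re-evaluate waiter.ready
    if (waiter.ready) break;
    waiter.cv.wait_for(lock, std::chrono::milliseconds(5));
  }
  return waiter.ready;
}
```

Because each mutex is only ever held alone at the call boundary, the lock-order inversion with the cancel path cannot occur.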

In `@cpp/src/grpc/server/grpc_worker_infra.cpp`:
- Around line 120-124: The fork failure branch only calls
close_worker_pipes_server(worker_id) leaving child-end FDs open and leaking;
update the pid < 0 handling to also close the child-side pipe FDs (e.g., call
the corresponding cleanup function for child ends such as
close_worker_pipes_child(worker_id) or otherwise close the child file
descriptors associated with worker_id) before returning -1, ensuring both server
and child pipe endpoints are cleaned up when fork() fails; reference symbols:
fork, pid, worker_id, is_replacement, close_worker_pipes_server.
- Around line 165-168: The loop in wait_for_workers is calling waitpid
unconditionally for every entry in worker_pids; skip any non-positive PIDs (pid
<= 0) and call waitpid only for valid PIDs, and handle EINTR by retrying the
wait in a loop until waitpid succeeds or returns a permanent error; also handle
-1 results such that if errno == ECHILD you can break/continue safely and for
other errors log/propagate as appropriate. Locate the loop that iterates over
worker_pids and update it to validate pid > 0, wrap the waitpid call in a retry
loop that checks errno == EINTR, and handle other error cases (e.g., ECHILD)
instead of proceeding blindly.
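The `waitpid` hardening described above (skip non-positive PIDs, retry on `EINTR`, tolerate `ECHILD`) can be sketched like this; `reap_worker` and `wait_for_workers` are illustrative names under the same assumptions as the finding, not the file's actual signatures.

```cpp
#include <cassert>
#include <cerrno>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <vector>

// Reap one worker: skip sentinel PIDs, retry on EINTR, treat ECHILD as done.
// Returns true if the pid was reaped or is already gone, false on other errors.
bool reap_worker(pid_t pid)
{
  if (pid <= 0) return true;           // never call waitpid on a sentinel PID
  int status = 0;
  while (waitpid(pid, &status, 0) == -1) {
    if (errno == EINTR) continue;      // interrupted by a signal: retry
    if (errno == ECHILD) return true;  // already reaped elsewhere: not fatal
    return false;                      // real error: let the caller log it
  }
  return true;
}

void wait_for_workers(const std::vector<pid_t>& worker_pids)
{
  for (pid_t pid : worker_pids) reap_worker(pid);
}
```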

In `@cpp/src/grpc/server/grpc_worker.cpp`:
- Around line 56-74: store_simple_result currently scans result_queue for an
entry with !ready and writes multiple non-atomic fields, causing a race; change
the slot allocation to atomically claim a slot before writing (e.g., add or
reuse an atomic claim flag on result_queue[i] and use
compare_exchange_strong/compare_exchange_weak to transition from UNCLAIMED to
CLAIMED), only proceed to copy job_id/status/data_size/error_message and set
retrieved/ready after the CAS succeeds, and ensure you set the atomic ready flag
last so other threads see a fully-initialized entry; update store_simple_result
to loop until a CAS claim succeeds (or handle full-queue) and reference
result_queue, MAX_RESULTS, and the ready/claimed flags in your changes.
- Around line 44-48: Check and handle the return value of
send_incumbent_pipe(fd_, buffer): if it indicates failure (e.g., broken pipe or
write error) log an error including job_id_ and the returned errno/details and
avoid further work or retry/cleanup as appropriate; only emit the incumbent
logging (currently using std::cout with job_id_, objective, assignment.size())
when the send succeeded or gate it behind a debug/log-level flag or rate-limit
to prevent unbounded output. Locate this logic around the call sites using
send_incumbent_pipe, build_incumbent_proto, and the variables job_id_,
objective, assignment, and fd_ to add the error check and conditional logging.
- Around line 248-252: The cudaMemcpy call copying device_solution to
host_solution currently ignores CUDA return codes; change it to capture the
cudaMemcpy return value (cudaError_t err = cudaMemcpy(...)) and if err !=
cudaSuccess throw an exception (e.g., std::runtime_error) with a descriptive
message including cudaGetErrorString(err) so the existing try/catch surfaces a
clear failure to the client; apply the same pattern to all other cudaMemcpy
sites in this file (including the similar copies near lines 296-307) and
reference the same symbols (device_solution, host_solution, cudaMemcpy) when
making the fixes.
- Around line 341-363: The loop that finds an empty slot in result_queue and
then fills it (using result_slot, ResultQueueEntry, copy_cstr, MAX_RESULTS,
worker_id) has a race: another thread may grab the same slot after the ready
check but before setting ready, and if no slot is found the result is silently
dropped; fix by protecting the selection-and-write with the same synchronization
used in store_simple_result (e.g., a mutex or atomic compare-and-set on
result_queue[i].ready) so the check-and-set is atomic, and add explicit handling
for the case result_slot remains -1 (log an error via the worker logger /
std::cerr and return or propagate a failure code) so lost results are reported.

In `@cpp/tests/linear_programming/grpc/grpc_integration_test.cpp`:
- Around line 53-57: The test file uses open() and O_WRONLY/O_CREAT/O_TRUNC but
does not include `<fcntl.h>`, which is required on some platforms; add
`#include <fcntl.h>` to the top of grpc_integration_test.cpp alongside the existing system
headers (near the includes for <signal.h>, <sys/types.h>, <sys/wait.h>,
<unistd.h>) so that the open() call and O_* constants used by the test compile
portably.

In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp`:
- Around line 213-218: The setter set_server_log_path writes server_log_path_,
server_log_start_pos_, and test_start_marked_ without synchronization; protect
these shared members by acquiring mutex_ (e.g., std::lock_guard or
std::unique_lock on mutex_) at the start of set_server_log_path and similarly in
every reader/callback that touches server_log_path_, server_log_start_pos_, or
test_start_marked_ (the read blocks around lines ~229-244 and ~332-355). Under
the lock, perform updates or read-and-copy the values, and for readers prefer
copying the protected values into local variables while holding mutex_ then
release the lock before doing any IO/long operations to avoid holding mutex_
during blocking work.
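The copy-under-lock pattern for the setter and readers can be sketched as below; `LogCapture` and its members are a hypothetical slice of the helper, with names taken from the finding.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>

// Hypothetical slice of the log-capture helper; member names follow the review.
class LogCapture {
 public:
  void set_server_log_path(const std::string& path)
  {
    std::lock_guard<std::mutex> lk(mutex_);  // writers take the lock...
    server_log_path_      = path;
    server_log_start_pos_ = 0;
    test_start_marked_    = false;
  }

  std::string server_log_path() const
  {
    std::string copy;
    {
      std::lock_guard<std::mutex> lk(mutex_);  // ...readers copy under it,
      copy = server_log_path_;                 // then release before any IO
    }
    return copy;  // file reads happen on the copy, outside the lock
  }

 private:
  mutable std::mutex mutex_;
  std::string server_log_path_;
  std::size_t server_log_start_pos_ = 0;
  bool test_start_marked_           = false;
};
```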

In `@GRPC_ARCHITECTURE.md`:
- Around line 9-10: The phrase claiming "no custom serialization logic is
implemented" is too absolute; update the line to clarify the scope by stating
that protocol buffers generated by protoc and grpc_cpp_plugin are used for gRPC
wire payloads, and explicitly note the exception that
grpc_pipe_serialization.hpp provides custom pipe blob framing helpers (i.e., the
custom framing is outside gRPC wire payload serialization). Locate the sentence
in GRPC_ARCHITECTURE.md and reword it to mention "for gRPC wire payloads" and
add a parenthetical or following sentence referencing
grpc_pipe_serialization.hpp as the known custom framing helper.

---

Nitpick comments:
In `@cpp/src/grpc/client/test_grpc_client.cpp`:
- Line 340: The return path after stream_logs currently returns non-zero on
failure but drops the error detail; update the failure branch surrounding
stream_logs in test_grpc_client.cpp to call and emit client.get_last_error() (or
equivalent accessor) to the log/stream before returning the non-zero exit code
so the client error is logged for parity with other modes and aids diagnosis
(locate the code around stream_logs and the return of success ? 0 : 1 and insert
a process/log call that prints client.get_last_error() when success is false).

In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp`:
- Around line 24-27: kPipeChunkBytes is a hard-coded constexpr limit; make it
configurable by replacing the fixed constexpr with a runtime-configured value
(e.g., a GetPipeChunkBytes() or PipeConfig::pipe_chunk_bytes() accessor) that
reads an environment variable or server config and falls back to 64ULL * 1024 *
1024 as the default; update references to kPipeChunkBytes to call the accessor
and validate the parsed value (positive, reasonable upper bound) and document
the new env var name (e.g., PIPE_CHUNK_BYTES) and default behavior.
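An accessor along the lines suggested could look like this; the env var name `PIPE_CHUNK_BYTES` and the upper bound are the assumptions the nitpick itself proposes, not existing configuration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical replacement for a fixed kPipeChunkBytes constant. Reads
// PIPE_CHUNK_BYTES (assumed env var name), validates it, and falls back to
// the 64 MiB default from the review on any missing or bogus value.
inline std::size_t get_pipe_chunk_bytes()
{
  constexpr std::size_t kDefault = 64ULL * 1024 * 1024;
  constexpr std::size_t kMax     = 1024ULL * 1024 * 1024;  // sanity bound
  const char* env = std::getenv("PIPE_CHUNK_BYTES");
  if (env == nullptr) return kDefault;
  char* end            = nullptr;
  unsigned long long v = std::strtoull(env, &end, 10);
  // Reject non-numeric, zero, or unreasonably large values.
  if (end == env || *end != '\0' || v == 0 || v > kMax) return kDefault;
  return static_cast<std::size_t>(v);
}
```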

In `@cpp/tests/linear_programming/grpc/grpc_integration_test.cpp`:
- Around line 957-982: The test registers a log callback (received_logs with
log_mutex) but never asserts anything, so add assertions after the call to
client->solve_mip in SolveMIPWithLogCallback to verify the callback was invoked
and logs look correct: acquire lock(log_mutex) and assert !received_logs.empty()
(and optionally that at least one entry contains an expected substring like
"objective" or "MIP" to ensure meaningful log content). If solve_mip is
asynchronous/streams logs, ensure you wait briefly or poll until received_logs
is non-empty before asserting to avoid flakes; reference received_logs,
log_mutex, log_callback, SolveMIPWithLogCallback, and client->solve_mip when
adding the checks.
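The poll-until-non-empty guard suggested for the asynchronous case can be sketched as a small helper; `received_logs` and `log_mutex` match the names in the finding, and the timeout values are arbitrary.

```cpp
#include <cassert>
#include <chrono>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Poll until the log callback has delivered something, to avoid flaky
// assertions when logs arrive asynchronously. Names mirror the review text.
bool wait_for_logs(std::vector<std::string>& received_logs,
                   std::mutex& log_mutex,
                   std::chrono::milliseconds timeout)
{
  auto deadline = std::chrono::steady_clock::now() + timeout;
  while (std::chrono::steady_clock::now() < deadline) {
    {
      std::lock_guard<std::mutex> lk(log_mutex);
      if (!received_logs.empty()) return true;
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
  }
  return false;  // timed out: the assertion that follows should fail loudly
}
```

The test would then assert `wait_for_logs(...)` before inspecting entries for an expected substring.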

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8bfed585-0c3e-4c04-9fc0-af2be140a6f2

📥 Commits

Reviewing files that changed from the base of the PR and between e46028c and 6c06a40.

📒 Files selected for processing (39)
  • GRPC_ARCHITECTURE.md
  • SERVER_ARCHITECTURE.md
  • build.sh
  • conda/recipes/libcuopt/recipe.yaml
  • cpp/CMakeLists.txt
  • cpp/cuopt_grpc_server.cpp
  • cpp/src/grpc/client/grpc_client.cu
  • cpp/src/grpc/client/grpc_client.hpp
  • cpp/src/grpc/client/solve_remote.cu
  • cpp/src/grpc/client/test_grpc_client.cpp
  • cpp/src/grpc/cuopt_remote.proto
  • cpp/src/grpc/cuopt_remote_service.proto
  • cpp/src/grpc/grpc_problem_mapper.cu
  • cpp/src/grpc/grpc_problem_mapper.hpp
  • cpp/src/grpc/grpc_service_mapper.cu
  • cpp/src/grpc/grpc_service_mapper.hpp
  • cpp/src/grpc/grpc_settings_mapper.cu
  • cpp/src/grpc/grpc_settings_mapper.hpp
  • cpp/src/grpc/grpc_solution_mapper.cu
  • cpp/src/grpc/grpc_solution_mapper.hpp
  • cpp/src/grpc/server/grpc_field_element_size.hpp
  • cpp/src/grpc/server/grpc_incumbent_proto.hpp
  • cpp/src/grpc/server/grpc_job_management.cpp
  • cpp/src/grpc/server/grpc_pipe_serialization.hpp
  • cpp/src/grpc/server/grpc_server_main.cpp
  • cpp/src/grpc/server/grpc_server_threads.cpp
  • cpp/src/grpc/server/grpc_server_types.hpp
  • cpp/src/grpc/server/grpc_service_impl.cpp
  • cpp/src/grpc/server/grpc_worker.cpp
  • cpp/src/grpc/server/grpc_worker_infra.cpp
  • cpp/src/pdlp/CMakeLists.txt
  • cpp/src/pdlp/solve_remote.cu
  • cpp/tests/linear_programming/CMakeLists.txt
  • cpp/tests/linear_programming/grpc/CMakeLists.txt
  • cpp/tests/linear_programming/grpc/grpc_client_test.cpp
  • cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp
  • cpp/tests/linear_programming/grpc/grpc_integration_test.cpp
  • cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp
  • python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
💤 Files with no reviewable changes (2)
  • cpp/src/pdlp/CMakeLists.txt
  • cpp/src/pdlp/solve_remote.cu
🚧 Files skipped from review as they are similar to previous changes (5)
  • build.sh
  • cpp/tests/linear_programming/CMakeLists.txt
  • cpp/tests/linear_programming/grpc/CMakeLists.txt
  • python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
  • cpp/src/grpc/grpc_settings_mapper.cu

Comment on lines +1387 to +1402
EXPECT_CALL(*mock_stub_, CheckStatus(_, _, _))
  .WillOnce([](grpc::ClientContext*,
               const cuopt::remote::StatusRequest&,
               cuopt::remote::StatusResponse* resp) {
    resp->set_job_status(cuopt::remote::PROCESSING);
    resp->set_result_size_bytes(0);
    return grpc::Status::OK;
  });

// GetResult should be called because CheckStatus doesn't show large result
EXPECT_CALL(*mock_stub_, GetResult(_, _, _))
  .WillOnce([](grpc::ClientContext*,
               const cuopt::remote::GetResultRequest&,
               cuopt::remote::ResultResponse*) {
    return grpc::Status(grpc::StatusCode::UNAVAILABLE, "Result not ready");
  });

⚠️ Potential issue | 🟡 Minor

This test asserts the opposite of its stated behavior for PROCESSING jobs.

Line 1383-1385 says GetResult should not be attempted when status is not complete, but Line 1397 expects a GetResult RPC. That codifies an unnecessary call path.

🔧 Proposed fix
-  // GetResult should be called because CheckStatus doesn't show large result
-  EXPECT_CALL(*mock_stub_, GetResult(_, _, _))
-    .WillOnce([](grpc::ClientContext*,
-                 const cuopt::remote::GetResultRequest&,
-                 cuopt::remote::ResultResponse*) {
-      return grpc::Status(grpc::StatusCode::UNAVAILABLE, "Result not ready");
-    });
+  // For PROCESSING status, client should fail fast without issuing GetResult.
+  EXPECT_CALL(*mock_stub_, GetResult(_, _, _)).Times(0);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/grpc_client_test.cpp` around lines 1387 -
1402, The test currently expects GetResult to be called even though CheckStatus
returns PROCESSING (not complete); update the test so GetResult is not expected
for PROCESSING jobs: remove or replace the EXPECT_CALL(*mock_stub_,
GetResult(...)).WillOnce(...) with an expectation that it is not called (e.g.,
EXPECT_CALL(*mock_stub_, GetResult(_, _, _)).Times(0)) or simply delete the
GetResult expectation, keeping the CheckStatus stub that sets
resp->set_job_status(cuopt::remote::PROCESSING) so the test asserts the intended
"do not call GetResult for non-complete status" behavior.

Comment on lines +275 to +283
int server_log_count(const std::string& substring) const
{
  std::string logs = get_server_logs();
  int count        = 0;
  size_t pos       = 0;
  while ((pos = logs.find(substring, pos)) != std::string::npos) {
    ++count;
    pos += substring.length();
  }

⚠️ Potential issue | 🟠 Major

Handle empty substring in server_log_count to prevent infinite loop.

At Line 282, pos advances by substring.length(). If substring is empty, the loop never progresses.

🛠️ Suggested fix
 int server_log_count(const std::string& substring) const
 {
+  if (substring.empty()) { return 0; }
   std::string logs = get_server_logs();
   int count        = 0;
   size_t pos       = 0;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp` around lines 275
- 283, The server_log_count function can infinite-loop when substring is empty;
add a guard at the start of server_log_count (which calls get_server_logs()) to
return 0 (or handle as appropriate) if substring.empty() before entering the
while loop so pos will never rely on substring.length() == 0 to advance.

@tmckayus tmckayus force-pushed the grpc-server-v4 branch 2 times, most recently from a94be38 to 67e1796 Compare March 6, 2026 21:20
cuopt uses embedded grpc client to solve problems on a remote server
@anandhkb anandhkb added this to the 26.04 milestone Mar 8, 2026
@tmckayus tmckayus force-pushed the grpc-server-v4 branch 2 times, most recently from 46bcc8c to 3c1c06e Compare March 10, 2026 00:26

Labels

- feature request — New feature or request
- non-breaking — Introduces a non-breaking change
