Note: Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough

Adds a complete gRPC remote execution subsystem: protobuf schemas and codegen, a full C++ gRPC client API and CLI test client, a production gRPC server with worker subprocess IPC and chunked upload/download, bidirectional mappers and settings converters, build/test integrations, extensive unit/integration tests, and architecture docs.
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Actionable comments posted: 13
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py (1)
152-163: ⚠️ Potential issue | 🟠 Major

Strengthen this warmstart test to assert numerical correctness, not just non-None outputs.
After adding `CUOPT_PRESOLVE` at Line 152, the test still only checks that a primal solution exists (Line 162). Please also validate warmstart numerical correctness (e.g., objective consistency/improvement within tolerance).

Proposed test-strengthening diff:

```diff
 sol1 = linear_programming.Solve(dm, settings)
 ws = sol1.get_pdlp_warm_start_data()
+obj1 = sol1.get_primal_objective()
 if ws is not None:
     settings.set_pdlp_warm_start_data(ws)
     settings.set_parameter(CUOPT_ITERATION_LIMIT, 200)
     sol2 = linear_programming.Solve(dm, settings)
     assert sol2.get_primal_solution() is not None
+    obj2 = sol2.get_primal_objective()
+    assert obj2 is not None
+    assert abs(obj2 - obj1) / max(1.0, abs(obj1)) < 1e-6
```

As per coding guidelines, `**/*test*.{cpp,cu,py}` must “Write tests validating numerical correctness of optimization results (not just 'runs without error').”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py` around lines 152 - 163, The test currently only asserts sol2.get_primal_solution() is not None; strengthen it by checking numerical correctness: retrieve both primal solutions (via sol1.get_primal_solution() and sol2.get_primal_solution()), compute their objective values (either via solX.get_primal_objective() if available or by forming the dot product of the primal vector with the problem cost from dm), then assert the objective values are consistent within a small tolerance (e.g., abs(obj1 - obj2) <= 1e-6) or that sol2's objective is no worse (<=) than sol1's given the increased CUOPT_ITERATION_LIMIT; keep tolerance and exact assertion choice explicit in the test.
🟠 Major comments (18)
cpp/src/grpc/server/grpc_worker.cpp-56-74 (1)
56-74: ⚠️ Potential issue | 🟠 Major

Potential race condition in result queue access.
`store_simple_result` iterates through `result_queue` looking for an empty slot without synchronization. If multiple workers call this concurrently, they could both find the same slot with `!result_queue[i].ready` and overwrite each other's results.

Consider using an atomic compare-exchange on `ready`, similar to the job claiming pattern (lines 86-87), or protect access with a mutex.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_worker.cpp` around lines 56 - 74, store_simple_result currently scans result_queue and sets a slot when result_queue[i].ready is false without synchronization, causing a race where multiple callers can claim the same slot; update this function to claim a slot atomically (e.g., change result_queue[i].ready to an std::atomic<bool> and perform an atomic compare-exchange to flip false->true before writing the slot) or surround the scan-and-write with a mutex; ensure you only write the other fields (job_id, status, data_size, worker_index, error_message, retrieved) after successfully claiming the slot to avoid concurrent overwrites.

cpp/src/grpc/server/grpc_server_types.hpp-55-64 (1)
55-64: 🛠️ Refactor suggestion | 🟠 Major

Avoid `using` directives in headers.

The `using` declarations (lines 55-61) and `using namespace` directive (line 63) pollute the global namespace for any file that includes this header. This can cause unexpected name collisions.

Suggested fix: use qualified names instead. Remove these lines and use fully qualified names (e.g., `grpc::Server`, `grpc::Status`) in the implementation files that include this header. Alternatively, move these declarations inside the `#ifdef CUOPT_ENABLE_GRPC` guard's associated `.cpp` files only.

```diff
-using grpc::Server;
-using grpc::ServerBuilder;
-using grpc::ServerContext;
-using grpc::ServerReaderWriter;
-using grpc::ServerWriter;
-using grpc::Status;
-using grpc::StatusCode;
-
-using namespace cuopt::linear_programming;
-// Note: NOT using "using namespace cuopt::remote" to avoid JobStatus enum conflict
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_types.hpp` around lines 55 - 64, Remove the header-level using declarations for grpc types (grpc::Server, grpc::ServerBuilder, grpc::ServerContext, grpc::ServerReaderWriter, grpc::ServerWriter, grpc::Status, grpc::StatusCode) and the using namespace cuopt::linear_programming; to avoid leaking names (and the JobStatus enum conflict). Replace usages in this header with fully-qualified names (e.g., grpc::Server, cuopt::linear_programming::Whatever) or move the using declarations into the corresponding .cpp files (or inside the CUOPT_ENABLE_GRPC guard in .cpp) so only implementation files import those namespaces; ensure any references in grpc_server_types.hpp are updated to the qualified symbols.

cpp/src/grpc/server/grpc_worker_infra.cpp-194-205 (1)
194-205: ⚠️ Potential issue | 🟠 Major

Handle full `result_queue` when marking worker failures.

If all result slots are occupied, this path drops the terminal failure record. A waiting client can then stall even though the job already failed.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_worker_infra.cpp` around lines 194 - 205, The current loop that writes a terminal failure into result_queue stops if every slot is already marked ready, which can drop the failure; modify the failure handling logic so that if no slot with !ready is found it will (1) search for a slot with ready && retrieved (i.e., already consumed) and reuse it, using the same writes (copy_cstr to job_id/error_message, set status, data_size, worker_index, retrieved=false, ready=true), and (2) if none exist, as a last resort overwrite the oldest/first slot (e.g., index 0) to guarantee the client gets a terminal response; update the block around result_queue/MAX_RESULTS and reuse the existing copy_cstr calls so the behavior is deterministic.

cpp/src/grpc/server/grpc_server_main.cpp-50-60 (1)
50-60: ⚠️ Potential issue | 🟠 Major

Harden numeric CLI parsing and range validation.
Line 53/55/57/59 use `stoi`/`stoll` without exception handling or range checks. Invalid input or negative values (e.g., workers/port) can crash startup or create invalid runtime config.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 50 - 60, The CLI numeric parsing loop that assigns config.port, config.num_workers, config.max_message_mb and config.max_message_b uses std::stoi/std::stoll without error handling or range checks; wrap each conversion in a try/catch for std::invalid_argument and std::out_of_range, validate values (e.g., port > 0 && port <= 65535, num_workers > 0, max_message_mb >= 0, max_message_b >= 4096) and on invalid input log an error via the existing logger (or std::cerr) and return/exit with non‑zero status or fall back to safe defaults; update the parsing branches that set config.port, config.num_workers, config.max_message_mb and the --max-message-bytes branch that sets config.max_message_b to use these guarded conversions and checks.

cpp/src/grpc/server/grpc_worker_infra.cpp-120-124 (1)
120-124: ⚠️ Potential issue | 🟠 Major

Close child-end FDs on fork failure to prevent descriptor leaks.
On fork failure, only server-side pipe ends are closed. Parent-owned child ends (`worker_read_fd`, `worker_write_fd`, `worker_incumbent_write_fd`) remain open.

🔧 Minimal fix

```diff
 if (pid < 0) {
   std::cerr << "[Server] Failed to fork "
             << (is_replacement ? "replacement worker " : "worker ") << worker_id << "\n";
   close_worker_pipes_server(worker_id);
+  close_worker_pipes_child_ends(worker_id);
   return -1;
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_worker_infra.cpp` around lines 120 - 124, On fork failure in grpc_worker_infra.cpp (the pid < 0 branch), close the parent-owned child-end file descriptors to avoid leaks: ensure worker_read_fd, worker_write_fd, and worker_incumbent_write_fd are closed (if valid) in addition to calling close_worker_pipes_server(worker_id); update the pid < 0 error path in the code that creates workers so it explicitly closes those descriptors before returning -1.

cpp/src/grpc/client/solve_remote.cu-55-57 (1)
55-57: ⚠️ Potential issue | 🟠 Major

Validate env overrides before applying chunk/message sizes.
`CUOPT_CHUNK_SIZE` and `CUOPT_MAX_MESSAGE_BYTES` are accepted as-is; zero/negative/too-small values can break chunk sizing and message-limit behavior.

✅ Minimal validation example

```diff
 static void apply_env_overrides(grpc_client_config_t& config)
 {
-  config.chunk_size_bytes  = parse_env_int64("CUOPT_CHUNK_SIZE", config.chunk_size_bytes);
-  config.max_message_bytes = parse_env_int64("CUOPT_MAX_MESSAGE_BYTES", config.max_message_bytes);
+  auto chunk_size = parse_env_int64("CUOPT_CHUNK_SIZE", config.chunk_size_bytes);
+  auto max_bytes  = parse_env_int64("CUOPT_MAX_MESSAGE_BYTES", config.max_message_bytes);
+
+  config.chunk_size_bytes = std::max<int64_t>(4096, chunk_size);
+  config.max_message_bytes =
+    (max_bytes <= 0) ? config.max_message_bytes : std::max<int64_t>(4096, max_bytes);
   config.enable_debug_log = (parse_env_int64("CUOPT_GRPC_DEBUG", 0) != 0);
```
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/solve_remote.cu` around lines 55 - 57, The code currently applies CUOPT_CHUNK_SIZE and CUOPT_MAX_MESSAGE_BYTES directly which can allow zero/negative/too-small values to break chunking; after calling parse_env_int64 for these into config.chunk_size_bytes and config.max_message_bytes, validate each parsed value against sensible thresholds (e.g., >0 and >= a defined MIN_CHUNK_SIZE / MIN_MESSAGE_BYTES constant) and only assign the override if it passes validation; if invalid, keep the existing default and emit a warning/log mentioning the env var name and rejected value (use the same symbols config.chunk_size_bytes, config.max_message_bytes, parse_env_int64 and add MIN_CHUNK_SIZE / MIN_MESSAGE_BYTES constants or inline thresholds).

cpp/src/grpc/client/solve_remote.cu-153-166 (1)
153-166: ⚠️ Potential issue | 🟠 Major

Don’t forward incumbent objective as the callback bound.
Line 164 sets `bound_copy` equal to the incumbent objective. That can make gap-sensitive callback logic interpret the incumbent as already tightly bounded, which is semantically wrong when a true best bound is unavailable.

Based on learnings: objective direction is encoded by objective-sense metadata, and callback bounds should use direction-appropriate “no bound” values when the true bound is not available.
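A minimal sketch of one direction-aware sentinel choice, under the assumption that the callback bound represents the best known dual bound (the `ObjectiveSense` enum and helper name are illustrative, not cuOpt API):

```cpp
#include <cassert>
#include <limits>

// Illustrative sense enum; in the real code the direction comes from
// objective-sense metadata.
enum class ObjectiveSense { Minimize, Maximize };

// Direction-appropriate "no bound" sentinel, assuming the bound is the best
// known dual bound: for minimization the unknown best lower bound is -inf,
// for maximization the unknown best upper bound is +inf. Either way the
// reported gap stays open instead of looking artificially closed.
double no_bound_sentinel(ObjectiveSense sense) {
  return sense == ObjectiveSense::Minimize
             ? -std::numeric_limits<double>::infinity()
             : std::numeric_limits<double>::infinity();
}
```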
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/solve_remote.cu` around lines 153 - 166, The incumbent callback is incorrectly using the incumbent objective as the callback bound (bound_copy), which can mislead gap-sensitive logic; instead, set bound_copy to a direction-appropriate "no bound" sentinel (e.g., +infinity for minimization, -infinity for maximization) rather than objective. In the config.incumbent_callback lambda (referencing mip_callbacks and get_solution_callback_t::get_solution), query the objective sense from the callback or problem metadata (e.g., a get_objective_sense() or equivalent) and assign bound_copy = std::numeric_limits<double>::infinity() for minimization or -std::numeric_limits<double>::infinity() for maximization before calling get_solution; do not forward the incumbent objective as the bound.

cpp/src/grpc/server/grpc_server_threads.cpp-253-263 (1)
253-263: ⚠️ Potential issue | 🟠 Major

Synchronize `worker_pipes` access across polling and worker-restart paths.

This thread reads `worker_pipes` (lines 254–255) while the monitor thread replaces dead workers via `spawn_single_worker()` → `close_worker_pipes_server()`, which modifies the same entries without a shared lock. This is a data race: the vector can reallocate during concurrent modifications, and file descriptors can be closed while the incumbent thread polls them, causing undefined behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_threads.cpp` around lines 253 - 263, The polling code reads shared worker_pipes without synchronization, causing a data race with spawn_single_worker() / close_worker_pipes_server() which mutate that vector; fix by introducing and using a shared mutex (e.g., worker_pipes_mutex) that both the polling thread and the worker-restart code lock when accessing/modifying worker_pipes, and change the polling path (the loop that builds pfds) to take the mutex, copy the needed file descriptors into a local vector<int> (or local pfds) while holding the lock, then release the lock before calling poll() so the poll is not done while holding the mutex; also ensure close_worker_pipes_server() and spawn_single_worker() acquire the same mutex when closing/removing entries to avoid closing fds being polled.

cpp/src/grpc/grpc_settings_mapper.cu-144-146 (1)
144-146: ⚠️ Potential issue | 🟠 Major

Guard `iteration_limit` narrowing conversion.

Line 145 casts protobuf `int64` directly into `i_t`. Oversized values can overflow/truncate and corrupt solver limits.

Proposed fix:

```diff
-  if (pb_settings.iteration_limit() >= 0) {
-    settings.iteration_limit = static_cast<i_t>(pb_settings.iteration_limit());
-  }
+  if (pb_settings.iteration_limit() >= 0) {
+    const auto iter_limit = pb_settings.iteration_limit();
+    if (iter_limit > static_cast<int64_t>(std::numeric_limits<i_t>::max())) {
+      settings.iteration_limit = std::numeric_limits<i_t>::max();
+    } else {
+      settings.iteration_limit = static_cast<i_t>(iter_limit);
+    }
+  }
```

As per coding guidelines: “Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero, and use epsilon comparisons for floating-point equality checks.”
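The clamp can also be captured as a small standalone helper. This sketch assumes `i_t` is a 32-bit integer, which is an assumption about the solver's typedef, not confirmed here:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Assumption: i_t is 32-bit in the solver headers.
using i_t = int32_t;

// Clamp a wire-provided int64 limit into i_t's range instead of truncating.
// Negative wire values act as a "keep existing setting" sentinel.
i_t clamp_iteration_limit(int64_t wire_value, i_t fallback) {
  if (wire_value < 0) { return fallback; }
  if (wire_value > static_cast<int64_t>(std::numeric_limits<i_t>::max())) {
    return std::numeric_limits<i_t>::max();
  }
  return static_cast<i_t>(wire_value);
}
```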
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/grpc_settings_mapper.cu` around lines 144 - 146, The code currently casts pb_settings.iteration_limit() directly to i_t (setting settings.iteration_limit) which can overflow; update the assignment in the block that checks pb_settings.iteration_limit() to first compare the int64 value against the representable range of i_t (use std::numeric_limits<i_t>::max() and min()/0 as appropriate), and then either clamp to that range or return/log an error before assigning to settings.iteration_limit; reference pb_settings.iteration_limit(), settings.iteration_limit, and the i_t type when making this change so the narrowing conversion is guarded.

cpp/src/grpc/server/grpc_job_management.cpp-109-113 (1)
109-113: ⚠️ Potential issue | 🟠 Major

Cap incumbent payload size before `data.resize(size)`.

Line 111 trusts the wire-provided size and can allocate arbitrarily large memory.

Proposed fix:

```diff
 bool recv_incumbent_pipe(int fd, std::vector<uint8_t>& data)
 {
   uint64_t size;
   if (!read_from_pipe(fd, &size, sizeof(size))) return false;
+  constexpr uint64_t kMaxIncumbentBytes = 64ULL * 1024 * 1024;
+  if (size > kMaxIncumbentBytes) {
+    std::cerr << "[Server] Incumbent payload too large: " << size << "\n";
+    return false;
+  }
   data.resize(size);
   if (size > 0 && !read_from_pipe(fd, data.data(), size)) return false;
   return true;
 }
```

As per coding guidelines: “Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems.”
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_job_management.cpp` around lines 109 - 113, The code reads a wire-provided uint64_t size and calls data.resize(size) without validation; before calling data.resize (in the block that uses read_from_pipe, size and data), validate and cap size against a defined maximum payload (e.g. kMaxPayloadSize) and reject sizes that are zero or exceed the cap by returning false; use the same symbols from the snippet (uint64_t size, read_from_pipe(fd, &size,...), data.resize(size)) so the check occurs immediately after reading size and before any allocation or further read_from_pipe calls.

cpp/src/grpc/server/grpc_service_impl.cpp-699-716 (1)
699-716: ⚠️ Potential issue | 🟠 Major

`StreamLogs` can block indefinitely for terminal jobs without a log file.

The wait loop only exits on file existence or NOT_FOUND; a completed/failed/cancelled job with no log file never terminates the RPC.
Proposed fix:

```diff
     if (waited_ms >= 2000) {
       std::string msg;
       JobStatus s = check_job_status(job_id, msg);
       if (s == JobStatus::NOT_FOUND) {
         if (config.verbose) {
           std::cout << "[gRPC] StreamLogs job not found: " << job_id << std::endl;
         }
         return Status(grpc::StatusCode::NOT_FOUND, "Job not found: " + job_id);
       }
+      if (s == JobStatus::COMPLETED || s == JobStatus::FAILED || s == JobStatus::CANCELLED) {
+        cuopt::remote::LogMessage done;
+        done.set_line("");
+        done.set_byte_offset(from_byte);
+        done.set_job_complete(true);
+        writer->Write(done);
+        return Status::OK;
+      }
       waited_ms = 0;
     }
```

As per coding guidelines: “Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_service_impl.cpp` around lines 699 - 716, The StreamLogs loop can hang for terminal jobs that never produce a log file; modify the loop (in grpc_service_impl::StreamLogs) to call check_job_status(job_id, msg) when the file is still missing and, if the returned JobStatus is a terminal state (e.g., JobStatus::COMPLETED, JobStatus::FAILED, JobStatus::CANCELLED or other terminal enums your code defines), break out and return a gRPC error (for example Status(grpc::StatusCode::NOT_FOUND, "Log file not available for job: " + job_id)) or an appropriate Status; keep the existing verbose logging path (config.verbose) and context->IsCancelled() checks, and retain the waited_ms retry/backoff behavior but ensure terminal-state handling resets/short-circuits the wait to avoid indefinite blocking.

cpp/src/grpc/server/grpc_service_impl.cpp-189-193 (1)
189-193: ⚠️ Potential issue | 🟠 Major

`SendArrayChunk` allows unbounded duplicate/overlapping payload growth.

`meta.received_bytes` and `state.total_bytes` are incremented unconditionally; repeated overlapping chunks can inflate memory usage until OOM.

Proposed guard:

```diff
+  if (meta.received_bytes + static_cast<int64_t>(raw.size()) > array_bytes) {
+    return Status(StatusCode::INVALID_ARGUMENT, "Received bytes exceed declared array size");
+  }
   meta.received_bytes += static_cast<int64_t>(raw.size());
   state.total_bytes += static_cast<int64_t>(raw.size());
```

As per coding guidelines: “Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_service_impl.cpp` around lines 189 - 193, SendArrayChunk currently unconditionally increments meta.received_bytes and state.total_bytes and pushes state.chunks, allowing duplicate/overlapping chunks to inflate memory; modify SendArrayChunk to first validate the incoming chunk against already-received ranges (e.g., track per-array received byte ranges in state or use an interval set) and skip or trim duplicate/overlapping payloads before modifying meta.received_bytes, state.total_bytes, state.chunks, and state.total_chunks; also enforce a global per-array max size check and reject or cap chunks that would push state.total_bytes beyond that limit to prevent unbounded growth.

cpp/src/grpc/grpc_settings_mapper.cu-154-154 (1)
154-154: ⚠️ Potential issue | 🟠 Major

Validate the protobuf `presolver` value before casting to `presolver_t`.

Line 154 and Line 221 trust unvalidated wire values; invalid integers become unsupported internal enum states.

As per coding guidelines: “Validate algorithm correctness in optimization logic: simplex pivots, branch-and-bound decisions, routing heuristics, and constraint/objective handling must produce correct results.”
Also applies to: 221-221
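A switch-based validation sketch (the enum values here are illustrative placeholders, not the actual `presolver_t` definition):

```cpp
#include <cstdint>

// Illustrative enum; the real presolver_t values live in the solver headers.
enum class presolver_t : int32_t { None = 0, Basic = 1, Full = 2 };

// Validate a wire integer against the known enum values before casting;
// unknown values fall back to a safe default instead of becoming an
// unsupported internal state.
presolver_t to_presolver(int32_t wire_value, presolver_t fallback) {
  switch (wire_value) {
    case 0: return presolver_t::None;
    case 1: return presolver_t::Basic;
    case 2: return presolver_t::Full;
    default: return fallback;  // reject out-of-range wire values
  }
}
```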
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/grpc_settings_mapper.cu` at line 154, The code casts the protobuf integer pb_settings.presolver() directly into presolver_t (see assignment to settings.presolver and the similar use at the other spot), which can produce invalid enum states for out-of-range wire values; before static_casting to presolver_t, validate the integer is within the defined presolver_t range (e.g., compare against the enum's min/max or use a switch/case for known values), handle unknown values by logging an error/warning and selecting a safe default (or returning an error), and then assign the validated value to settings.presolver to avoid unsupported internal enum states.

cpp/tests/linear_programming/grpc/grpc_integration_test.cpp-673-699 (1)
673-699: ⚠️ Potential issue | 🟠 Major

`SolveInfeasibleLP` can pass without actually validating infeasibility behavior.

The assertion is gated by `if (result.success && result.solution)`, so transport or solve regressions can still produce a passing test.

Proposed fix:

```diff
   auto result = client->solve_lp(problem, settings);
-  if (result.success && result.solution) {
-    EXPECT_NE(result.solution->get_termination_status(), pdlp_termination_status_t::Optimal);
-  }
+  ASSERT_TRUE(result.success) << result.error_message;
+  ASSERT_NE(result.solution, nullptr);
+  EXPECT_NE(result.solution->get_termination_status(), pdlp_termination_status_t::Optimal);
```

As per coding guidelines: “`**/*test*.{cpp,cu,py}`: Write tests validating numerical correctness of optimization results (not just 'runs without error'); test degenerate cases (infeasible, unbounded, empty, singleton problems).”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/linear_programming/grpc/grpc_integration_test.cpp` around lines 673 - 699, The test SolveInfeasibleLP currently skips assertions when the solver failed or returned no solution; replace the guarded if-check with hard assertions so the test fails on transport/solve regressions: assert that result.success is true (ASSERT_TRUE(result.success)), assert that result.solution is non-null (ASSERT_NE(result.solution, nullptr) or ASSERT_TRUE(result.solution)), and then assert the termination status equals pdlp_termination_status_t::Infeasible (ASSERT_EQ(result.solution->get_termination_status(), pdlp_termination_status_t::Infeasible)) so the test verifies infeasibility behavior unconditionally.

cpp/src/grpc/grpc_problem_mapper.cu-175-182 (1)
175-182: ⚠️ Potential issue | 🟠 Major

Validate lower/upper bound array sizes before applying both.

Line 175 only checks `constraint_lower_bounds_size() > 0`; a malformed payload with a mismatched upper-bounds size still gets applied.

Proposed guard:

```diff
-  if (pb_problem.constraint_lower_bounds_size() > 0) {
+  if (pb_problem.constraint_lower_bounds_size() > 0 ||
+      pb_problem.constraint_upper_bounds_size() > 0) {
+    if (pb_problem.constraint_lower_bounds_size() != pb_problem.constraint_upper_bounds_size()) {
+      return;  // or propagate an explicit mapping error
+    }
     std::vector<f_t> con_lb(pb_problem.constraint_lower_bounds().begin(),
                             pb_problem.constraint_lower_bounds().end());
```

As per coding guidelines: “Ensure variables and constraints are accessed from the correct problem context (original vs presolve vs folded vs postsolve); verify index mapping consistency across problem transformations.”
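The paired-array precondition can be factored into a small predicate (a sketch; the helper name is illustrative):

```cpp
#include <vector>

// Lower/upper constraint bounds may only be applied together: either both
// absent, or both present with matching lengths.
bool bounds_consistent(const std::vector<double>& lower,
                       const std::vector<double>& upper) {
  if (lower.empty() && upper.empty()) { return true; }  // nothing to apply
  return !lower.empty() && lower.size() == upper.size();
}
```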
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/grpc_problem_mapper.cu` around lines 175 - 182, The code applies constraint lower and upper bounds when only constraint_lower_bounds_size() > 0 is checked, which allows mismatched arrays; update the guard to validate both pb_problem.constraint_lower_bounds_size() and pb_problem.constraint_upper_bounds_size() are > 0 and equal before calling cpu_problem.set_constraint_lower_bounds(...) and cpu_problem.set_constraint_upper_bounds(...); if sizes differ or one is zero, skip applying them (or log/return an error) to avoid out-of-bounds/misaligned mappings and ensure index consistency across transformations.

cpp/src/grpc/grpc_problem_mapper.cu-454-461 (1)
454-461: ⚠️ Potential issue | 🟠 Major

Require complete CSR/Q triplets before setting matrix data.

Line 454 and Line 526 gate only on values-array presence; applying matrices without matching indices/offsets can create malformed models.

Proposed guard pattern:

```diff
-  if (!a_values.empty()) {
+  if (!a_values.empty() || !a_indices.empty() || !a_offsets.empty()) {
+    if (a_values.empty() || a_indices.empty() || a_offsets.empty()) { return; }
     cpu_problem.set_csr_constraint_matrix(...);
   }

-  if (!q_values.empty()) {
+  if (!q_values.empty() || !q_indices.empty() || !q_offsets.empty()) {
+    if (q_values.empty() || q_indices.empty() || q_offsets.empty()) { return; }
     cpu_problem.set_quadratic_objective_matrix(...);
   }
```

As per coding guidelines: “Validate algorithm correctness in optimization logic: simplex pivots, branch-and-bound decisions, routing heuristics, and constraint/objective handling must produce correct results.”
Also applies to: 526-533
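A standalone sketch of a CSR completeness check along these lines (helper name and exact size rules are illustrative; the real setter may enforce additional invariants):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// A CSR triplet is acceptable only when all three arrays are present
// together, offsets cover every row (num_rows + 1 entries), and the last
// offset equals the number of stored values.
bool csr_complete(const std::vector<double>& values,
                  const std::vector<int32_t>& indices,
                  const std::vector<int32_t>& offsets,
                  std::size_t num_rows) {
  if (values.empty() && indices.empty() && offsets.empty()) { return true; }
  if (values.empty() || indices.empty() || offsets.empty()) { return false; }
  if (offsets.size() != num_rows + 1) { return false; }
  if (indices.size() != values.size()) { return false; }
  return static_cast<std::size_t>(offsets.back()) == values.size();
}
```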
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/grpc_problem_mapper.cu` around lines 454 - 461, The current guard only checks a_values before calling cpu_problem.set_csr_constraint_matrix, which can yield malformed CSR data if a_indices or a_offsets are missing or their sizes don't match; update the condition to require a_values, a_indices, and a_offsets to be non-empty and verify their sizes are consistent (e.g., a_indices.size() matches a_values.size() and a_offsets.size() matches expected row count + 1) before calling cpu_problem.set_csr_constraint_matrix, and apply the same strengthened guard/validation to the other matrix-setting block referenced around the second call (the similar call later in the file).

cpp/src/grpc/server/grpc_service_impl.cpp-823-824 (1)
823-824: ⚠️ Potential issue | 🟠 Major

Fix incumbent pagination cursor calculation.

`next_index` is always set to `available`, which skips data when `max_count` is smaller than the number of remaining incumbents.

Proposed fix:

```diff
-  response->set_next_index(available);
+  response->set_next_index(from_index + count);
```

As per coding guidelines: “Validate algorithm correctness in optimization logic: simplex pivots, branch-and-bound decisions, routing heuristics, and constraint/objective handling must produce correct results.”
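The corrected cursor arithmetic, isolated as a sketch (the `Page` struct is illustrative, not the protobuf response type):

```cpp
#include <algorithm>
#include <cstdint>

struct Page {
  int64_t count;       // items returned in this page
  int64_t next_index;  // cursor for the next request
  bool done;           // no more items after this page
};

// The next index advances by the number of items actually returned,
// never jumping straight to `available`.
Page paginate(int64_t from_index, int64_t max_count, int64_t available) {
  const int64_t count =
      std::max<int64_t>(0, std::min(max_count, available - from_index));
  const int64_t next = from_index + count;
  return {count, next, next >= available};
}
```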
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_service_impl.cpp` around lines 823 - 824, The pagination logic incorrectly sets response->set_next_index(available) which skips items when max_count < remaining; change it to compute the next cursor as the start index plus the number of items actually returned (e.g., start + returned_count) or equivalently start + std::min(max_count, available - start), then call response->set_next_index(...) with that value and set done when that next index >= available; update the code around response->set_next_index(available) and the subsequent bool done = ... expression accordingly.

cpp/src/grpc/client/grpc_client.cu-525-531 (1)
525-531: ⚠️ Potential issue | 🟠 Major

Use `compute_chunk_size()` to clamp the chunk payload to negotiated message-size limits.

`chunk_data_budget` is derived only from `config_.chunk_size_bytes` without enforcing server/client limits. If the user config exceeds server capabilities, chunk RPC calls will fail with message-size errors. Apply the existing `compute_chunk_size()` helper at both occurrences (lines 525-531 and 694-698):

🔧 Proposed fix

```diff
-  int64_t chunk_data_budget = config_.chunk_size_bytes;
-  if (chunk_data_budget <= 0) { chunk_data_budget = 1LL * 1024 * 1024; }
+  int64_t chunk_data_budget =
+    compute_chunk_size(server_max_message_bytes_, config_.max_message_bytes, config_.chunk_size_bytes);
   const int64_t proto_overhead = 64;
   if (chunk_data_budget > proto_overhead) { chunk_data_budget -= proto_overhead; }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/grpc_client.cu` around lines 525 - 531, The code sets chunk_data_budget directly from config_.chunk_size_bytes then subtracts proto_overhead before calling build_array_chunk_requests, which can exceed negotiated gRPC limits; replace the direct assignment with a call to compute_chunk_size(config_.chunk_size_bytes) (or the appropriate compute_chunk_size overload) to clamp to negotiated message-size limits, then subtract proto_overhead and pass that clamped chunk_data_budget into build_array_chunk_requests; apply the exact same change to the other place in this file where chunk_data_budget is computed later (the second occurrence) so both chunk payload computations use compute_chunk_size().
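A sketch of the clamping such a helper should perform (illustrative signature and constants; the real `compute_chunk_size()` in the client may differ):

```cpp
#include <algorithm>
#include <cstdint>

// Effective chunk size must respect both negotiated message limits
// (server and client) and never fall below a sane floor; non-positive
// configured values fall back to a default.
int64_t clamped_chunk_size(int64_t server_max_bytes, int64_t client_max_bytes,
                           int64_t configured_chunk_bytes) {
  constexpr int64_t kFloor = 4096;
  constexpr int64_t kDefault = 1LL * 1024 * 1024;
  int64_t chunk = configured_chunk_bytes > 0 ? configured_chunk_bytes : kDefault;
  if (server_max_bytes > 0) { chunk = std::min(chunk, server_max_bytes); }
  if (client_max_bytes > 0) { chunk = std::min(chunk, client_max_bytes); }
  return std::max(chunk, kFloor);
}
```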
🟡 Minor comments (10)
python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py-152-152 (1)
152-152: ⚠️ Potential issue | 🟡 Minor

Clarify the presolve mode and add numerical validation to the warmstart test.

At line 152, the literal `0` for `CUOPT_PRESOLVE` obscures intent: there is no `PresolverMode` enum exported to Python, so add an inline comment explaining that `0` disables presolve (required for warmstart). More importantly, the test at line 162 only checks that the solution is non-None; it should validate numerical correctness of the warmstart result by comparing it against a baseline solve or checking solution bounds.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py` at line 152, Replace the magic literal by documenting intent: add an inline comment next to settings.set_parameter(CUOPT_PRESOLVE, 0) stating that 0 disables presolve (required for warmstart) since no PresolverMode enum is exposed to Python. Then extend the warmstart test (the test that currently only asserts solution is not None) to perform numerical validation: run a baseline solve without warmstart (or use known optimal objective/variable bounds) and assert the warmstarted solution’s objective and key variable values are within a tight tolerance (e.g., via pytest.approx) of the baseline; reference the same settings and solution objects used in the test to obtain values for comparison.

cpp/src/grpc/server/grpc_incumbent_proto.hpp-30-32 (1)
30-32: ⚠️ Potential issue | 🟡 Minor

Unchecked serialization return value.

`SerializeToArray` can fail and returns a boolean, but the return value is ignored. This could lead to silently returning an invalid/incomplete buffer.

🛡️ Proposed fix

```diff
 std::vector<uint8_t> buffer(msg.ByteSizeLong());
-msg.SerializeToArray(buffer.data(), static_cast<int>(buffer.size()));
+if (!msg.SerializeToArray(buffer.data(), static_cast<int>(buffer.size()))) {
+  return {};  // Return empty vector on failure
+}
 return buffer;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_incumbent_proto.hpp` around lines 30 - 32, The code ignores the boolean result of msg.SerializeToArray which can fail; update the serialization block to check the return value of msg.SerializeToArray(buffer.data(), static_cast<int>(buffer.size())) and handle failure (e.g., throw a std::runtime_error or return an empty/optional buffer) rather than returning a potentially incomplete vector; keep the existing use of msg.ByteSizeLong() and the buffer variable but ensure you validate SerializeToArray's result and produce a clear error/alternative return before returning buffer.cpp/src/grpc/server/grpc_worker.cpp-248-251 (1)
248-251: ⚠️ Potential issue | 🟡 Minor — Missing CUDA error checking on cudaMemcpy.
Per coding guidelines, CUDA errors should be caught and mapped to meaningful error codes. These `cudaMemcpy` calls could fail (e.g., device out of memory) but errors are not checked.
🛡️ Proposed fix pattern

```diff
-cudaMemcpy(host_solution.data(),
-           device_solution.data(),
-           device_solution.size() * sizeof(double),
-           cudaMemcpyDeviceToHost);
+cudaError_t err = cudaMemcpy(host_solution.data(),
+                             device_solution.data(),
+                             device_solution.size() * sizeof(double),
+                             cudaMemcpyDeviceToHost);
+if (err != cudaSuccess) {
+  throw std::runtime_error(std::string("cudaMemcpy failed: ") + cudaGetErrorString(err));
+}
```

Apply the same pattern to the other `cudaMemcpy` calls on lines 296-307.
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_worker.cpp` around lines 248 - 251, The cudaMemcpy calls (e.g., the call copying device_solution -> host_solution) lack CUDA error checks; wrap each cudaMemcpy (including the other calls around lines 296-307) to capture the returned cudaError_t, check if it is not cudaSuccess, and convert it to a meaningful error/Status (use/create a helper like mapCudaErrorToStatus or mapCudaErrorToGrpcStatus) and return/log that error instead of proceeding; update the cudaMemcpy calls that reference device_solution, host_solution, and any other device/host buffers to use this pattern so failures (OOM, invalid value, etc.) are detected and mapped to the existing error-handling flow.cpp/src/grpc/client/test_grpc_client.cpp-401-403 (1)
401-403: ⚠️ Potential issue | 🟡 Minor — Unhandled exception from `std::stod`.
If the user provides an invalid `--time-limit` value (e.g., non-numeric), `std::stod` throws `std::invalid_argument` or `std::out_of_range`, resulting in an unhandled exception with a cryptic error message.
🛡️ Proposed fix

```diff
 } else if (arg == "--time-limit" && arg_idx + 1 < argc) {
-  time_limit = std::stod(argv[++arg_idx]);
+  try {
+    time_limit = std::stod(argv[++arg_idx]);
+  } catch (const std::exception& e) {
+    std::cerr << "ERROR: Invalid time limit value\n";
+    return 1;
+  }
   arg_idx++;
```

The same issue exists for `std::stoll` on line 531.
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/test_grpc_client.cpp` around lines 401 - 403, Wrap the std::stod call that parses --time-limit (and the std::stoll call at the other location) in a try/catch that catches std::invalid_argument and std::out_of_range, then log a clear error via the existing logging/exit path and/or set a safe default; specifically, around the parsing of argv in the branch handling "--time-limit" (variable time_limit and std::stod) catch parsing exceptions, call the same error-handling used elsewhere (print usage or error and exit with non-zero) and do the same for the std::stoll parse to avoid uncaught exceptions and cryptic crashes.cpp/src/grpc/cuopt_remote_service.proto-314-318 (1)
314-318: ⚠️ Potential issue | 🟡 Minor — Inconsistent `job_id` type: `bytes` here vs `string` elsewhere.
`IncumbentRequest.job_id` is defined as `bytes` (line 315), but all other request messages (e.g., `StatusRequest`, `GetResultRequest`, `CancelRequest`, `DeleteRequest`) use `string` for `job_id`. This inconsistency could cause confusion and require different handling on client/server.
Suggested fix for consistency

```diff
 message IncumbentRequest {
-  bytes job_id = 1;
+  string job_id = 1;
   int64 from_index = 2;  // Return incumbents starting from this index
   int32 max_count = 3;   // Optional limit (0 or negative => no limit)
 }
```

Similarly, `Incumbent.job_id` at line 324 should also be `string` for consistency.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/cuopt_remote_service.proto` around lines 314 - 318, IncumbentRequest.job_id (and matching Incumbent.job_id) are declared as bytes while all other request messages use string; change the field type from bytes to string in the proto for IncumbentRequest and Incumbent so they match StatusRequest/GetResultRequest/CancelRequest/DeleteRequest, then recompile/regenerate the gRPC/protobuf stubs (look for the IncumbentRequest and Incumbent message definitions) to ensure client/server code uses the consistent string type.cpp/src/grpc/server/grpc_server_types.hpp-263-265 (1)
263-265: ⚠️ Potential issue | 🟡 Minor — Hardcoded shared memory names may cause conflicts with concurrent server instances.
The shared memory segment names (`/cuopt_job_queue`, `/cuopt_result_queue`, `/cuopt_control`) are hardcoded constants. If multiple server instances run on the same host, they will conflict.
Consider incorporating the port number or a unique identifier into the segment names.
Suggested approach

```cpp
// Instead of inline constants, generate names dynamically:
inline std::string get_shm_job_queue_name(int port)
{
  return "/cuopt_job_queue_" + std::to_string(port);
}
```
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_types.hpp` around lines 263 - 265, The three hardcoded shared-memory name constants (SHM_JOB_QUEUE, SHM_RESULT_QUEUE, SHM_CONTROL) will collide across concurrent server instances; change them from inline const char* values to functions that generate unique names at runtime (e.g., append the server port, PID, or UUID) and update call sites to use the generator functions (e.g., get_shm_job_queue_name(port), get_shm_result_queue_name(port), get_shm_control_name(port)) so each server instance uses distinct segment names.GRPC_ARCHITECTURE.md-298-304 (1)
298-304: ⚠️ Potential issue | 🟡 Minor — Document required remote endpoint env vars in the Environment Variables section.
This table currently lists only transfer tuning variables. Please also include `CUOPT_REMOTE_HOST` and `CUOPT_REMOTE_PORT`, since remote execution depends on them.
Verify each finding against the current code and only fix it if needed. In `@GRPC_ARCHITECTURE.md` around lines 298 - 304, The Environment Variables table is missing the remote endpoint variables: add two rows for CUOPT_REMOTE_HOST and CUOPT_REMOTE_PORT in the "Environment Variables" section, specifying sensible defaults (e.g., blank or "localhost" for CUOPT_REMOTE_HOST and a default port like 50051 for CUOPT_REMOTE_PORT) and clear descriptions such as "Remote execution host" and "Remote execution port" so remote execution is documented alongside CUOPT_CHUNK_SIZE and CUOPT_MAX_MESSAGE_BYTES.GRPC_ARCHITECTURE.md-16-39 (1)
16-39: ⚠️ Potential issue | 🟡 Minor — Specify languages on fenced code blocks to satisfy markdown lint.
These blocks are missing a language identifier (`MD040`), which is currently failing lint hygiene.
📝 Example fix pattern

````diff
-```
+```text
 cpp/src/grpc/
 ...
 ```
````

Also applies to: 105-120, 136-151, 359-371
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@GRPC_ARCHITECTURE.md` around lines 16 - 39, The markdown fenced code blocks in GRPC_ARCHITECTURE.md are missing language identifiers (MD040); update each triple-backtick block (including the shown cpp/src/grpc/ tree block and the other blocks at ranges 105-120, 136-151, 359-371) to include a language token such as "text" (i.e., change ``` to ```text) so linting passes; ensure all similar block openings (```...) in the file are consistently annotated.cpp/src/grpc/grpc_solution_mapper.cu-559-560 (1)
559-560: ⚠️ Potential issue | 🟡 Minor — Warm-start presence check is too narrow.
Warm-start reconstruction only triggers when `RESULT_WS_CURRENT_PRIMAL` is non-empty; other warm-start fields can be present and still be dropped.
Suggested adjustment

```diff
-auto ws_primal = bytes_to_typed<f_t>(arrays, cuopt::remote::RESULT_WS_CURRENT_PRIMAL);
-if (!ws_primal.empty()) {
+auto ws_primal = bytes_to_typed<f_t>(arrays, cuopt::remote::RESULT_WS_CURRENT_PRIMAL);
+auto ws_dual   = bytes_to_typed<f_t>(arrays, cuopt::remote::RESULT_WS_CURRENT_DUAL);
+auto ws_init_p = bytes_to_typed<f_t>(arrays, cuopt::remote::RESULT_WS_INITIAL_PRIMAL_AVG);
+auto ws_init_d = bytes_to_typed<f_t>(arrays, cuopt::remote::RESULT_WS_INITIAL_DUAL_AVG);
+if (!ws_primal.empty() || !ws_dual.empty() || !ws_init_p.empty() || !ws_init_d.empty()) {
```
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/grpc_solution_mapper.cu` around lines 559 - 560, The code only triggers warm-start reconstruction when ws_primal (from bytes_to_typed<f_t>(arrays, cuopt::remote::RESULT_WS_CURRENT_PRIMAL)) is non-empty, which misses cases where other warm-start fields exist; change the presence check to detect any warm-start data key in arrays (e.g., cuopt::remote::RESULT_WS_CURRENT_PRIMAL, RESULT_WS_CURRENT_DUAL, RESULT_WS_PREV_PRIMAL, RESULT_WS_PREV_DUAL or other RESULT_WS_* entries) before skipping reconstruction. Implement a small helper or boolean that queries arrays for any of those RESULT_WS_* keys (or calls bytes_to_typed for each and checks non-empty) and use that combined condition instead of only ws_primal.empty() so reconstruction runs when any warm-start field is present.cpp/tests/linear_programming/grpc/grpc_integration_test.cpp-53-57 (1)
53-57: ⚠️ Potential issue | 🟡 Minor — Add `<fcntl.h>` include for `open()` and flag constants.
Line 120 uses `open()` with `O_WRONLY|O_CREAT|O_TRUNC`, but `<fcntl.h>` is not explicitly included. While some systems may include it transitively through `unistd.h`, this is not guaranteed by the POSIX standard and creates a portability risk.
File handle cleanup is correct; `fd` is properly closed at line 125.
Proposed fix

```diff
 #include <signal.h>
+#include <fcntl.h>
 #include <sys/types.h>
 #include <sys/wait.h>
 #include <unistd.h>
```
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/linear_programming/grpc/grpc_integration_test.cpp` around lines 53 - 57, The test uses open() with flags O_WRONLY|O_CREAT|O_TRUNC (call site: open(..., O_WRONLY|O_CREAT|O_TRUNC)) but does not include <fcntl.h>, which can break portability; add `#include` <fcntl.h> to the top includes so the O_* flag constants and open declaration are defined and the code compiles reliably across platforms.
🧹 Nitpick comments (10)
cpp/src/grpc/server/grpc_worker.cpp (1)
149-161: Duplicate error handling blocks.
Lines 149-151 and 153-161 both handle `!read_success`. The first block logs to stderr, while the second handles the failure logic. These could be consolidated.
♻️ Proposed consolidation

```diff
-if (!read_success) {
-  std::cerr << "[Worker " << worker_id << "] Failed to read job data from pipe\n";
-}
-
 if (!read_success) {
+  std::cerr << "[Worker " << worker_id << "] Failed to read job data from pipe\n";
   store_simple_result(job_id, worker_id, 1, "Failed to read job data");
   job.worker_pid = 0;
   // ... rest of cleanup
```
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_worker.cpp` around lines 149 - 161, The duplicate handling for !read_success should be consolidated into a single block: remove the first standalone stderr log and merge its message into the existing failure branch so that when read_success is false you both log the error (including worker_id) and execute the failure logic (call store_simple_result(job_id, worker_id, 1, "..."), reset job fields job.worker_pid, job.worker_index, job.data_sent, job.ready, job.claimed, and continue). Update the block around the read_success check in grpc_worker.cpp to perform logging and the cleanup exactly once to avoid duplicate branches.cpp/src/grpc/server/grpc_incumbent_proto.hpp (1)
44-49: Consider clearing the output vector before populating.
The `assignment` vector is passed by reference but not cleared before `push_back` calls. If a caller reuses the same vector across multiple calls, stale data will accumulate.
♻️ Proposed fix

```diff
 job_id    = incumbent_msg.job_id();
 objective = incumbent_msg.objective();
+assignment.clear();
 assignment.reserve(incumbent_msg.assignment_size());
 for (int i = 0; i < incumbent_msg.assignment_size(); ++i) {
   assignment.push_back(incumbent_msg.assignment(i));
 }
```
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_incumbent_proto.hpp` around lines 44 - 49, The assignment vector is populated via push_back without being cleared, so reuse will accumulate stale entries; before calling assignment.reserve(...) and the for-loop that pushes from incumbent_msg.assignment(i), call assignment.clear() (or use assignment.assign(...) from the repeated field) to ensure the vector is emptied first—update the code around the symbols assignment and incumbent_msg in grpc_incumbent_proto.hpp to clear assignment before reserving/pushing.cpp/src/grpc/server/grpc_pipe_serialization.hpp (1)
93-100: Verify semantic difference between result and request deserialization.
For result blobs (line 94), `total_elements` is used directly as the byte count for resize. For request blobs (lines 151-153), `total_elements * elem_size` is used. This asymmetry suggests `total_elements` has different semantics in result vs request contexts; consider adding a clarifying comment.
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` around lines 93 - 100, The resize logic for result deserialization uses ac.total_elements() directly as bytes while request deserialization multiplies total_elements by element size, so confirm whether ac.total_elements() represents bytes or element count and make semantics consistent: either change the result-path allocation (the dest.resize call) to use total_bytes = ac.total_elements() * ac.elem_size() (or equivalent accessor) before resizing and memcpy, or add a clear comment by the result-path block documenting that total_elements already includes element size in bytes; update any related uses (element_offset(), chunk_data.size(), memcpy) to match the chosen unit to avoid off-by-factor bugs.cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp (1)
16-20: Consider reordering includes for consistency.
Standard convention places system/library headers before local headers. The `<memory>` include should come before the local `"grpc_client.hpp"` include.
Suggested reorder

```diff
 #include <cuopt_remote_service.grpc.pb.h>
 #include <grpcpp/grpcpp.h>
-#include "grpc_client.hpp"
 #include <memory>
+
+#include "grpc_client.hpp"
```
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp` around lines 16 - 20, Reorder the includes so system/library headers come before local headers: move the <memory> include above the "grpc_client.hpp" include in grpc_client_test_helper.hpp; ensure the block now lists <cuopt_remote_service.grpc.pb.h>, <grpcpp/grpcpp.h>, <memory> and then "grpc_client.hpp" to follow the standard include ordering convention.cpp/tests/linear_programming/grpc/CMakeLists.txt (1)
24-33: Minor: Redundant GTest library linkage.
`GTest::gmock_main` already includes `GTest::gmock`, and `GTest::gtest_main` includes `GTest::gtest`. You can simplify to just the `_main` variants.
Simplified linkage

```diff
 target_link_libraries(GRPC_CLIENT_TEST
   PRIVATE cuopt
-          GTest::gmock
           GTest::gmock_main
-          GTest::gtest
           GTest::gtest_main
           gRPC::grpc++
           protobuf::libprotobuf
 )
```
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/linear_programming/grpc/CMakeLists.txt` around lines 24 - 33, Remove redundant GTest link entries in the target_link_libraries call for GRPC_CLIENT_TEST: keep only the main variants (GTest::gmock_main and GTest::gtest_main) and remove GTest::gmock and GTest::gtest to avoid duplicate linkage; update the target_link_libraries block that references GRPC_CLIENT_TEST, cuopt, gRPC::grpc++, and protobuf::libprotobuf accordingly so only the necessary _main symbols remain.cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp (2)
359-360: Remove unused `server_logs_` member.
The `server_logs_` vector is declared but never populated or used. Server logs are read directly from the file system via `get_server_logs()`. This dead code should be removed to avoid confusion.
Remove unused member

```diff
 private:
  mutable std::mutex mutex_;
  std::vector<LogEntry> client_logs_;
- std::vector<LogEntry> server_logs_;
  std::string server_log_path_;
  std::streampos server_log_start_pos_ = 0;  // Position in server log file when test started
  bool test_start_marked_ = false;
```
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp` around lines 359 - 360, Remove the dead member server_logs_ from the class: delete the declaration of std::vector<LogEntry> server_logs_ (it is never populated or used; server logs are accessed via get_server_logs()). After removing the member, run a quick search for server_logs_ to ensure there are no remaining references and update any tests or constructors that may have initialized it.
123-126: Callback captures `this` pointer; document the lifetime requirement.
The lambda returned by `client_callback()` captures `this` by pointer. If the callback is invoked after the `GrpcTestLogCapture` instance is destroyed, this will cause undefined behavior. Consider adding a brief note in the documentation that the `GrpcTestLogCapture` instance must outlive any client using this callback.
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp` around lines 123 - 126, The lambda returned by GrpcTestLogCapture::client_callback() captures this by pointer and can dangle if the GrpcTestLogCapture instance is destroyed, so add a clear lifetime note: update the documentation/comment for the GrpcTestLogCapture class and/or the client_callback() method to state that the GrpcTestLogCapture instance must outlive any client that retains or invokes the returned callback; reference the client_callback() function name in the comment so callers know the requirement and consider mentioning alternative designs (e.g., providing a weak_ptr-based factory) if you want to avoid the strict lifetime requirement.SERVER_ARCHITECTURE.md (1)
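Beyond documenting the lifetime rule, the hazard can be removed structurally. The sketch below is a hypothetical, gRPC-free stand-in for `GrpcTestLogCapture::client_callback()` (class and member names are invented for illustration): the log storage lives in a `shared_ptr`, the returned callback captures only a `weak_ptr`, so invoking it after the capture object is destroyed becomes a safe no-op instead of undefined behavior.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for GrpcTestLogCapture: log storage lives in a
// shared_ptr so the callback can observe its lifetime via weak_ptr.
class log_capture_t {
 public:
  log_capture_t() : logs_(std::make_shared<std::vector<std::string>>()) {}

  std::function<void(const std::string&)> client_callback() const
  {
    std::weak_ptr<std::vector<std::string>> weak_logs = logs_;
    return [weak_logs](const std::string& line) {
      if (auto logs = weak_logs.lock()) { logs->push_back(line); }
      // else: capture object already destroyed -> safe no-op
    };
  }

  std::size_t size() const { return logs_->size(); }

 private:
  std::shared_ptr<std::vector<std::string>> logs_;
};

// Exercises the callback both during and after the capture's lifetime.
inline bool callback_outlives_capture_safely()
{
  std::function<void(const std::string&)> cb;
  std::size_t logged_during_lifetime = 0;
  {
    log_capture_t capture;
    cb = capture.client_callback();
    cb("during lifetime");
    logged_during_lifetime = capture.size();
  }
  cb("after destruction");  // no UB: weak_ptr fails to lock
  return logged_during_lifetime == 1;
}
```

This trades a documentation requirement for a small allocation; either approach is acceptable, but the `weak_ptr` variant fails closed.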
15-46: Add language specifier to fenced code block.
Per markdownlint, fenced code blocks should have a language specified. For ASCII diagrams, use `text` or `plaintext`.
Fix markdown lint warning

````diff
-```
+```text
 ┌────────────────────────────────────────────────────────────────────┐
````
Verify each finding against the current code and only fix it if needed. In `@SERVER_ARCHITECTURE.md` around lines 15 - 46, The fenced ASCII diagram block in SERVER_ARCHITECTURE.md is missing a language specifier; change the opening fence from ``` to ```text (or ```plaintext) so the block becomes ```text and satisfies markdownlint for the diagram that starts with the "┌────────────────────────────────────────────────────────────────────┐" line.cpp/src/grpc/client/solve_remote.cu (1)
69-74: Mark currently-unused LP flags explicitly (or forward them).
`problem_checking` and `use_pdlp_solver_mode` are currently unused. Marking them explicitly avoids silent interface drift and warning churn.
Small cleanup option

```diff
 std::unique_ptr<lp_solution_interface_t<i_t, f_t>> solve_lp_remote(
   cpu_optimization_problem_t<i_t, f_t> const& cpu_problem,
   pdlp_solver_settings_t<i_t, f_t> const& settings,
-  bool problem_checking,
-  bool use_pdlp_solver_mode)
+  [[maybe_unused]] bool problem_checking,
+  [[maybe_unused]] bool use_pdlp_solver_mode)
```
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/solve_remote.cu` around lines 69 - 74, The function solve_lp_remote currently accepts parameters problem_checking and use_pdlp_solver_mode but never uses them, which can cause warnings and interface drift; update the function (solve_lp_remote) to explicitly mark these parameters as intentionally unused (e.g., cast to void or annotate with [[maybe_unused]]) or forward them to the underlying call that needs them, ensuring the symbols problem_checking and use_pdlp_solver_mode appear in the implementation so the intent is clear and compiler warnings are silenced.cpp/tests/linear_programming/grpc/grpc_client_test.cpp (1)
1086-1340: Add at least one degenerate solve-case assertion in this suite.
The end-to-end mock solve tests are strong, but adding explicit infeasible/unbounded/empty (or singleton edge) solve assertions would better protect status/result mapping regressions.
As per coding guidelines: "Write tests validating numerical correctness of optimization results (not just 'runs without error'); test degenerate cases (infeasible, unbounded, empty, singleton problems)".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/linear_programming/grpc/grpc_client_test.cpp` around lines 1086 - 1340, Add a new degenerate-case unit test (e.g., TEST_F(GrpcClientTest, SolveLP_Infeasible)) that uses the existing GrpcClientTest fixture and MockCuOptStub to exercise grpc_client_t::solve_lp for an infeasible LP: have SubmitJob(...) return a job_id, have CheckStatus/WaitForCompletion indicate COMPLETED (or skip if use_wait), and have GetResult(...) return a cuopt::remote::LPSolution with termination_status set to PDLP_INFEASIBLE (and resp->set_status(cuopt::remote::SUCCESS) if appropriate); then assert the returned result indicates failure (result.success == false) and that result.error_message contains a clear indicator of infeasibility (e.g., "INFEASIBLE" or "infeasible"); reuse symbols SubmitJob, CheckStatus/WaitForCompletion, GetResult, grpc_client_t::solve_lp, and GrpcClientTest to locate where to add the test.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 6984e3cb-4874-432a-8307-c8f3a64fd9ba
📒 Files selected for processing (39)
- GRPC_ARCHITECTURE.md
- SERVER_ARCHITECTURE.md
- build.sh
- cpp/CMakeLists.txt
- cpp/cuopt_grpc_server.cpp
- cpp/src/grpc/client/grpc_client.cu
- cpp/src/grpc/client/grpc_client.hpp
- cpp/src/grpc/client/solve_remote.cu
- cpp/src/grpc/client/test_grpc_client.cpp
- cpp/src/grpc/cuopt_remote.proto
- cpp/src/grpc/cuopt_remote_service.proto
- cpp/src/grpc/grpc_problem_mapper.cu
- cpp/src/grpc/grpc_problem_mapper.hpp
- cpp/src/grpc/grpc_service_mapper.cu
- cpp/src/grpc/grpc_service_mapper.hpp
- cpp/src/grpc/grpc_settings_mapper.cu
- cpp/src/grpc/grpc_settings_mapper.hpp
- cpp/src/grpc/grpc_solution_mapper.cu
- cpp/src/grpc/grpc_solution_mapper.hpp
- cpp/src/grpc/server/grpc_field_element_size.hpp
- cpp/src/grpc/server/grpc_incumbent_proto.hpp
- cpp/src/grpc/server/grpc_job_management.cpp
- cpp/src/grpc/server/grpc_pipe_serialization.hpp
- cpp/src/grpc/server/grpc_server_main.cpp
- cpp/src/grpc/server/grpc_server_threads.cpp
- cpp/src/grpc/server/grpc_server_types.hpp
- cpp/src/grpc/server/grpc_service_impl.cpp
- cpp/src/grpc/server/grpc_worker.cpp
- cpp/src/grpc/server/grpc_worker_infra.cpp
- cpp/src/pdlp/CMakeLists.txt
- cpp/src/pdlp/solve_remote.cu
- cpp/tests/linear_programming/CMakeLists.txt
- cpp/tests/linear_programming/grpc/CMakeLists.txt
- cpp/tests/linear_programming/grpc/grpc_client_test.cpp
- cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp
- cpp/tests/linear_programming/grpc/grpc_integration_test.cpp
- cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp
- python/cuopt/cuopt/routing/utils.py
- python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
💤 Files with no reviewable changes (2)
- cpp/src/pdlp/CMakeLists.txt
- cpp/src/pdlp/solve_remote.cu
```cpp
job_status_result_t grpc_client_t::check_status(const std::string& job_id)
{
  job_status_result_t result;

  grpc::ClientContext context;
  auto request = build_status_request(job_id);
  cuopt::remote::StatusResponse response;
  auto status = impl_->stub->CheckStatus(&context, request, &response);
```
Guard manual RPC APIs against null stub dereference.
These public methods call impl_->stub->... directly (e.g., Line 265). If called before a successful connect(), this can crash.
🔧 Proposed guard pattern
job_status_result_t grpc_client_t::check_status(const std::string& job_id)
{
job_status_result_t result;
+ if (!impl_->stub) {
+ result.error_message = "Not connected to server";
+ return result;
+ }
grpc::ClientContext context;
auto request = build_status_request(job_id);
cuopt::remote::StatusResponse response;Also applies to: 293-303, 325-333, 354-361, 381-395, 419-430
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/grpc/client/grpc_client.cu` around lines 258 - 266, The public RPC
methods (e.g., grpc_client_t::check_status) currently dereference impl_->stub
directly and can crash if called before connect(); add a guard at the start of
each affected method to check that impl_ and impl_->stub are non-null (and
optionally that impl_->connected flag is true) and return an appropriate failure
job_status_result_t (or set an error state) instead of proceeding; update all
other methods listed (those around lines 293-303, 325-333, 354-361, 381-395,
419-430) to use the same pattern so no method ever calls impl_->stub->...
without the null check.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# First, check the file exists and get basic info
if [ -f cpp/src/grpc/client/grpc_client.cu ]; then
echo "=== File size and structure ==="
wc -l cpp/src/grpc/client/grpc_client.cu
echo ""
echo "=== Code around line 258-266 (check_status function) ==="
sed -n '250,275p' cpp/src/grpc/client/grpc_client.cu
else
echo "File not found at expected path. Searching for grpc_client files..."
find . -name "*grpc_client*" -type f 2>/dev/null
fiRepository: NVIDIA/cuopt
Length of output: 1047
🏁 Script executed:
#!/bin/bash
# Search for all RPC call patterns and deadline usage in the codebase
echo "=== Searching for set_deadline usage ==="
rg "set_deadline" --type cpp --type cuda -B 2 -A 2
echo ""
echo "=== Searching for ClientContext usage ==="
rg "ClientContext" cpp/src/grpc/client/ --type cpp --type cuda -B 1 -A 3 | head -100Repository: NVIDIA/cuopt
Length of output: 214
🏁 Script executed:
#!/bin/bash
# Check for config_ member and timeout configuration
echo "=== Searching for config_.timeout_seconds ==="
rg "timeout_seconds|config_\.timeout" cpp/src/grpc/client/ --type cpp --type cuda -B 2 -A 2
echo ""
echo "=== Check class definition and constructor ==="
rg "class grpc_client|grpc_client_t\s*\(" cpp/src/grpc/client/ --type cpp --type cuda -B 2 -A 5Repository: NVIDIA/cuopt
Length of output: 224
🏁 Script executed:
#!/bin/bash
echo "=== Searching for set_deadline in grpc_client files ==="
rg "set_deadline" cpp/src/grpc/client/ -B 2 -A 2
echo ""
echo "=== Searching for timeout_seconds ==="
rg "timeout_seconds" cpp/src/grpc/client/ -B 2 -A 2
echo ""
echo "=== Searching for config_ initialization and class definition ==="
rg "class grpc_client_t|config_\s*=" cpp/src/grpc/client/ -B 1 -A 3Repository: NVIDIA/cuopt
Length of output: 3404
🏁 Script executed:
#!/bin/bash
echo "=== Examine lines 293-303 (second location) ==="
sed -n '290,305p' cpp/src/grpc/client/grpc_client.cu
echo ""
echo "=== Examine lines 329-333 ==="
sed -n '325,340p' cpp/src/grpc/client/grpc_client.cu
echo ""
echo "=== Examine lines 356-360 ==="
sed -n '352,365p' cpp/src/grpc/client/grpc_client.cuRepository: NVIDIA/cuopt
Length of output: 1533
🏁 Script executed:
#!/bin/bash
echo "=== Search for all ClientContext creations ==="
rg "grpc::ClientContext" cpp/src/grpc/client/
echo ""
echo "=== Check for any probe/connection code that might have deadlines ==="
rg -i "probe|connection|deadline" cpp/src/grpc/client/ -B 1 -A 1Repository: NVIDIA/cuopt
Length of output: 3593
🏁 Script executed:
#!/bin/bash
echo "=== Get grpc_client class definition and members ==="
sed -n '1,100p' cpp/src/grpc/client/grpc_client.cu | head -80
echo ""
echo "=== Search for config_ member definition in header files ==="
find cpp/src/grpc/client -name "*.h*" -o -name "*.hpp" | xargs grep -l "grpc_client\|config_" 2>/dev/nullRepository: NVIDIA/cuopt
Length of output: 2659
🏁 Script executed:
#!/bin/bash
echo "=== Check all remaining affected line ranges ==="
for range in "387-395" "429-430" "468-470" "509-510" "540-541" "625-626" "674-675" "724-725"; do
start=$(echo $range | cut -d'-' -f1)
end=$(echo $range | cut -d'-' -f2)
echo "Lines $range:"
sed -n "${start},$((end+3))p" cpp/src/grpc/client/grpc_client.cu | grep -A 10 "grpc::ClientContext"
echo ""
doneRepository: NVIDIA/cuopt
Length of output: 862
🏁 Script executed:
#!/bin/bash
echo "=== Verify that timeout_seconds is accessible in all RPC methods ==="
echo "Checking class context and member variables..."
sed -n '1,200p' cpp/src/grpc/client/grpc_client.hpp | grep -A 20 "^class grpc_client_t"Repository: NVIDIA/cuopt
Length of output: 171
🏁 Script executed:
#!/bin/bash
echo "=== Verify config_ field and method signatures ==="
rg "grpc_client_config_t|config_\s*grpc_client_config_t|private:|protected:" cpp/src/grpc/client/grpc_client.hpp -A 5 -B 2 | head -80Repository: NVIDIA/cuopt
Length of output: 1246
Add deadlines to all operational RPC contexts to prevent indefinite blocking on network stalls.
Most RPC calls lack deadlines and can block indefinitely, bypassing configured timeout behavior. The connection probe correctly demonstrates using set_deadline(), but this pattern must be applied to all operational RPCs (CheckStatus, WaitForCompletion, CancelJob, DeleteJob, GetIncumbents, SubmitJob, etc.). Use the existing config_.timeout_seconds field (default 1 hour, configurable) to set deadlines consistently across all RPC methods.
🔧 Proposed fix (apply deadline per RPC)
grpc::ClientContext context;
+if (config_.timeout_seconds > 0) {
+ context.set_deadline(std::chrono::system_clock::now() +
+ std::chrono::seconds(config_.timeout_seconds));
+}
auto status = impl_->stub->CheckStatus(&context, request, &response);Also applies to: 293-303 (WaitForCompletion), 329-333 (CancelJob), 356-360 (DeleteJob), 387-395 (GetIncumbents), 429-430, 468-470 (SubmitJob), 509-510, 540-541, 625-626, 674-675, 724-725
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/grpc/client/grpc_client.cu` around lines 258 - 266, RPC calls like
grpc_client_t::check_status (and other operational RPCs: WaitForCompletion,
CancelJob, DeleteJob, GetIncumbents, SubmitJob, etc.) currently create a
grpc::ClientContext but do not set a deadline, allowing indefinite blocking; fix
each RPC by calling context.set_deadline(std::chrono::system_clock::now() +
std::chrono::seconds(config_.timeout_seconds)) on the grpc::ClientContext before
invoking stub->... so every call uses the configured timeout
(config_.timeout_seconds) consistently, mirroring the connection probe pattern.
auto field_id = arr_desc.field_id();
int64_t total_elems = arr_desc.total_elements();
int64_t elem_size = arr_desc.element_size_bytes();
if (total_elems <= 0) continue;

int64_t elems_per_chunk = chunk_data_budget / elem_size;
if (elems_per_chunk <= 0) elems_per_chunk = 1;

std::vector<uint8_t> array_bytes(static_cast<size_t>(total_elems * elem_size));
Harden chunked-result metadata validation before math/alloc/copy.
element_size_bytes and total_elements are used unchecked. elem_size <= 0 causes division by zero (Line 708), and total_elems * elem_size can overflow before allocation, making the memcpy at Line 739 unsafe.
🔧 Proposed bounds checks
int64_t total_elems = arr_desc.total_elements();
int64_t elem_size = arr_desc.element_size_bytes();
if (total_elems <= 0) continue;
+ if (elem_size <= 0) {
+ last_error_ = "Invalid chunk metadata: non-positive element_size_bytes";
+ return false;
+ }
+ if (total_elems > std::numeric_limits<int64_t>::max() / elem_size) {
+ last_error_ = "Invalid chunk metadata: total byte size overflow";
+ return false;
+ }
+ const int64_t total_bytes = total_elems * elem_size;
+ if (total_bytes < 0 ||
+ static_cast<uint64_t>(total_bytes) > static_cast<uint64_t>(std::numeric_limits<size_t>::max())) {
+ last_error_ = "Invalid chunk metadata: byte size exceeds addressable memory";
+ return false;
+ }
int64_t elems_per_chunk = chunk_data_budget / elem_size;
if (elems_per_chunk <= 0) elems_per_chunk = 1;
- std::vector<uint8_t> array_bytes(static_cast<size_t>(total_elems * elem_size));
+ std::vector<uint8_t> array_bytes(static_cast<size_t>(total_bytes));

Also applies to: 734-740
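The overflow guard above can be isolated into a small helper for clarity; `checked_total_bytes` is an illustrative name, not part of the cuopt codebase:

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// Returns total_elems * elem_size only when both inputs are positive and the
// product fits in int64_t; otherwise std::nullopt, so callers can reject the
// chunk metadata before allocating or copying anything.
std::optional<int64_t> checked_total_bytes(int64_t total_elems, int64_t elem_size)
{
  if (total_elems <= 0 || elem_size <= 0) return std::nullopt;
  if (total_elems > std::numeric_limits<int64_t>::max() / elem_size) return std::nullopt;
  return total_elems * elem_size;
}
```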
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/grpc/client/grpc_client.cu` around lines 703 - 712, Validate and
sanity-check arr_desc.element_size_bytes() and arr_desc.total_elements() before
any math, allocation, or memcpy: ensure elem_size > 0 (avoid division-by-zero
when computing elems_per_chunk), ensure total_elems >= 0 and that total_elems *
elem_size does not overflow size_t (use a safe check like total_elems <=
SIZE_MAX / elem_size or similar), and recompute elems_per_chunk using the
validated elem_size and chunk_data_budget; if any check fails, skip the array or
return an error. Specifically update the logic around element_size_bytes(),
total_elements, elems_per_chunk, the allocation of array_bytes, and the
subsequent memcpy to rely only on validated, non-overflowing sizes.
cpp/src/grpc/client/grpc_client.cu
Outdated
int max_polls = (config_.timeout_seconds * 1000) / config_.poll_interval_ms;
Validate poll_interval_ms before dividing.
Line 994 and Line 1132 divide by config_.poll_interval_ms without checking <= 0. This can crash (division by zero) or produce invalid polling behavior.
🔧 Proposed validation
+if (config_.poll_interval_ms <= 0) {
+ stop_log_streaming();
+ result.error_message = "poll_interval_ms must be > 0";
+ return result;
+}
int poll_count = 0;
int max_polls = (config_.timeout_seconds * 1000) / config_.poll_interval_ms;

Also applies to: 1132-1132
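A minimal sketch of the validated computation; `compute_max_polls` is an assumed helper name that returns -1 on invalid input rather than dividing by zero:

```cpp
// Guard the divisor before computing the polling budget; -1 signals an
// invalid configuration to the caller instead of triggering UB.
int compute_max_polls(int timeout_seconds, int poll_interval_ms)
{
  if (poll_interval_ms <= 0) return -1;
  return (timeout_seconds * 1000) / poll_interval_ms;
}
```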
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/grpc/client/grpc_client.cu` around lines 994 - 995, Validate
config_.poll_interval_ms before using it in divisions that compute max_polls:
check if config_.poll_interval_ms <= 0 and handle it (log an error and
return/throw, or clamp to a safe default like 1 ms) prior to the line that
computes int max_polls = (config_.timeout_seconds * 1000) /
config_.poll_interval_ms; do the same validation wherever you compute max_polls
(the other division using config_.poll_interval_ms), and make sure to reference
config_.timeout_seconds, config_.poll_interval_ms, and the max_polls calculation
so the check is applied in both places.
auto get_doubles = [&](int32_t field_id) -> std::vector<f_t> {
  auto it = arrays.find(field_id);
  if (it == arrays.end() || it->second.empty()) return {};
  size_t n = it->second.size() / sizeof(double);
  if constexpr (std::is_same_v<f_t, double>) {
    std::vector<double> v(n);
    std::memcpy(v.data(), it->second.data(), it->second.size());
    return v;
  } else {
    std::vector<double> tmp(n);
    std::memcpy(tmp.data(), it->second.data(), it->second.size());
    return std::vector<f_t>(tmp.begin(), tmp.end());
  }
Add alignment checks before memcpy in chunked array decoding.
n is computed with integer division, but memcpy copies the full raw byte count. Misaligned payload sizes can therefore write past vector bounds.
Proposed fix
auto get_doubles = [&](int32_t field_id) -> std::vector<f_t> {
auto it = arrays.find(field_id);
if (it == arrays.end() || it->second.empty()) return {};
+ if (it->second.size() % sizeof(double) != 0) return {};
  size_t n = it->second.size() / sizeof(double);

 auto get_ints = [&](int32_t field_id) -> std::vector<i_t> {
auto it = arrays.find(field_id);
if (it == arrays.end() || it->second.empty()) return {};
+ if (it->second.size() % sizeof(int32_t) != 0) return {};
size_t n = it->second.size() / sizeof(int32_t);As per coding guidelines "Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero, and use epsilon comparisons for floating-point equality checks."
Also applies to: 414-426
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/grpc/grpc_problem_mapper.cu` around lines 399 - 411, The get_doubles
lambda copies bytes from arrays (lookup via arrays.find(field_id) / it->second)
into a vector using memcpy but computes n = it->second.size() / sizeof(double)
without validating that it->second.size() is a multiple of sizeof(double), which
can overflow the destination; before any memcpy in get_doubles (and the
analogous block handling f_t != double), check that it->second.size() is nonzero
and that it->second.size() % sizeof(double) == 0, compute n = it->second.size()
/ sizeof(double) from the validated byte size, and use that n to allocate the
target vector; if the size is misaligned, handle gracefully (return empty vector
or surface an error) instead of performing memcpy to avoid out-of-bounds writes.
There was a problem hiding this comment.
Actionable comments posted: 14
♻️ Duplicate comments (9)
cpp/src/grpc/server/grpc_server_main.cpp (2)
186-212: ⚠️ Potential issue | 🔴 Critical — Add teardown on TLS config failures after workers/threads are started.
These early returns occur after runtime startup and bypass shutdown/join/cleanup paths.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 186 - 212, The TLS validation branches that currently do "return 1" (around checks using config, ssl_opts and read_file_to_string for cert/key/root and require_client) can occur after workers/threads have been started; replace those early returns with calls to the centralized shutdown/teardown routine (e.g., call your existing server stop/cleanup functions such as StopServer(), shutdownWorkers()/joinThreads(), or a new cleanup_resources() that performs shutdown and joins) and then return the error code; ensure every failure path that currently returns from inside the TLS setup invokes that teardown routine before exiting so workers/threads and other resources are properly cleaned up.
234-251: ⚠️ Potential issue | 🔴 Critical — Guard BuildAndStart() failure before waiting on the server pointer.
server->Wait() is called unconditionally; if startup fails and returns null, this dereferences a null pointer.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 234 - 251, BuildAndStart() may return nullptr so avoid unguarded dereference of server; after std::unique_ptr<Server> server(builder.BuildAndStart()), check if (server) before calling server->Wait() and starting the shutdown logic that assumes a live server. If BuildAndStart() fails, log an error (or print to cerr) and exit/return non-zero. Also ensure the shutdown_thread's capture/Shutdown call remains safe by only invoking server->Shutdown() when server is non-null; move creation of shutdown_thread and the server->Wait() call inside the if (server) block and handle the failure path explicitly.
cpp/src/grpc/server/grpc_job_management.cpp (1)
147-158: ⚠️ Potential issue | 🔴 Critical — Synchronize all job_queue access; the current slot claim/update path is racy.
submit_job_async still does an unsynchronized check-then-write on shared queue state, and check_job_status/cancel_job concurrently read/write the same fields. This can corrupt queue ownership and status transitions under concurrent requests.
As per coding guidelines: "Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state."
Also applies to: 195-196, 267-270
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_job_management.cpp` around lines 147 - 158, The submit_job_async path performs an unsynchronized check-then-write on shared job_queue entries (e.g., checking job_queue[i].ready/claimed then writing fields like job_id, ready, claimed), which races with check_job_status and cancel_job; protect all accesses to job_queue (both scans and per-entry reads/writes) by introducing a synchronization primitive (e.g., a std::mutex or per-entry mutex array) and acquire the lock(s) around the loop that scans MAX_JOBS and before modifying fields (job_queue[i].ready, claimed, cancelled, etc.), and also ensure check_job_status and cancel_job acquire the same lock(s) before reading or updating those fields so ownership and state transitions are atomic and race-free.
cpp/src/grpc/client/grpc_client.cu (4)
258-266: ⚠️ Potential issue | 🔴 Critical — Guard public RPC methods before dereferencing impl_->stub.
These methods can be called before connect(). Dereferencing impl_->stub without a null check can crash.
🔧 Suggested guard pattern
job_status_result_t grpc_client_t::check_status(const std::string& job_id)
{
  job_status_result_t result;
+ if (!impl_ || !impl_->stub) {
+   result.error_message = "Not connected to server";
+   return result;
+ }
  grpc::ClientContext context;
  auto request = build_status_request(job_id);
  cuopt::remote::StatusResponse response;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/grpc_client.cu` around lines 258 - 266, The public RPC method grpc_client_t::check_status dereferences impl_->stub without ensuring impl_ is initialized or stub is non-null (can be called before connect()), which can crash; add a null-check guard at the start of check_status (and other public RPC methods) that verifies impl_ and impl_->stub are valid, and return an appropriate job_status_result_t error/failed state (or set result.error and result.code) when the client is not connected instead of dereferencing; refer to grpc_client_t::connect and impl_->stub when adding the guard so the check mirrors connect()'s initialization contract.
994-995: ⚠️ Potential issue | 🔴 Critical — Validate poll_interval_ms before using it as a divisor.
max_polls divides by config_.poll_interval_ms without checking <= 0, which can crash.
🔧 Suggested validation
+ if (config_.poll_interval_ms <= 0) {
+   stop_log_streaming();
+   result.error_message = "poll_interval_ms must be > 0";
+   return result;
+ }
  int poll_count = 0;
  int max_polls = (config_.timeout_seconds * 1000) / config_.poll_interval_ms;

Also applies to: 1132-1132
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/grpc_client.cu` around lines 994 - 995, The code computes max_polls by dividing by config_.poll_interval_ms without validating it; update the logic that sets max_polls (the line initializing int max_polls) to first validate config_.poll_interval_ms is > 0 (and also guard against extremely large/small values), returning an error or using a safe default/policy (e.g., set poll_interval_ms = 1 or return a failure status) when it's <= 0 to avoid division-by-zero and undefined behavior; apply the same validation where max_polls is computed elsewhere (the other occurrence around the code that uses config_.poll_interval_ms).
703-712: ⚠️ Potential issue | 🔴 Critical — Validate chunk metadata and chunk bounds before allocation/copy.
elem_size, total_elems, and elements_in_chunk are trusted too early. This can lead to division by zero, overflowed allocation sizes, or out-of-bounds memcpy.
🔧 Suggested hardening
for (const auto& arr_desc : header->arrays()) {
  auto field_id = arr_desc.field_id();
  int64_t total_elems = arr_desc.total_elements();
  int64_t elem_size = arr_desc.element_size_bytes();
  if (total_elems <= 0) continue;
+ if (elem_size <= 0) {
+   last_error_ = "Invalid chunk metadata: non-positive element size";
+   return false;
+ }
+ if (total_elems > std::numeric_limits<int64_t>::max() / elem_size) {
+   last_error_ = "Invalid chunk metadata: byte-size overflow";
+   return false;
+ }
+ int64_t total_bytes = total_elems * elem_size;
+ if (total_bytes < 0 ||
+     static_cast<uint64_t>(total_bytes) > static_cast<uint64_t>(std::numeric_limits<size_t>::max())) {
+   last_error_ = "Invalid chunk metadata: byte size exceeds addressable memory";
+   return false;
+ }
  int64_t elems_per_chunk = chunk_data_budget / elem_size;
  if (elems_per_chunk <= 0) elems_per_chunk = 1;
- std::vector<uint8_t> array_bytes(static_cast<size_t>(total_elems * elem_size));
+ std::vector<uint8_t> array_bytes(static_cast<size_t>(total_bytes));
  for (int64_t elem_offset = 0; elem_offset < total_elems; elem_offset += elems_per_chunk) {
    int64_t elems_wanted = std::min(elems_per_chunk, total_elems - elem_offset);
@@
  int64_t elems_received = chunk_resp.elements_in_chunk();
  const auto& data = chunk_resp.data();
  if (static_cast<int64_t>(data.size()) != elems_received * elem_size) {
    last_error_ = "GetResultChunk: data size mismatch";
    return false;
  }
+ if (elems_received < 0 || elem_offset + elems_received > total_elems) {
+   last_error_ = "GetResultChunk: elements_in_chunk out of bounds";
+   return false;
+ }
  std::memcpy(array_bytes.data() + elem_offset * elem_size, data.data(), data.size());

Also applies to: 731-740
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/grpc_client.cu` around lines 703 - 712, Validate and guard all chunk metadata before allocating or copying: ensure arr_desc.element_size_bytes() (elem_size) > 0 and arr_desc.total_elements() (total_elems) >= 0 before using them; ensure chunk_data_budget is >0 before computing elems_per_chunk and handle division safely; check for size_t/multiplication overflow when computing total_elems * elem_size before constructing array_bytes and use a safe cap; when iterating chunks validate each elements_in_chunk is within [0, total_elems] and compute copy_size = elements_in_chunk * elem_size with overflow checks before any memcpy; add clear error handling/logging and skip/abort the chunk on invalid metadata rather than performing the allocation or memcpy (refer to symbols: arr_desc, field_id, elem_size, total_elems, chunk_data_budget, elems_per_chunk, elements_in_chunk, array_bytes, memcpy).
247-251: ⚠️ Potential issue | 🔴 Critical — stop_log_streaming() can hang waiting on a blocked reader thread.
Setting stop_logs_ does not unblock a thread already stuck in a stream read, so join() can block indefinitely.
🔧 Suggested fix
void grpc_client_t::start_log_streaming(const std::string& job_id)
{
  if (!config_.stream_logs || !config_.log_callback) return;
  stop_logs_.store(false);
  log_thread_ = std::make_unique<std::thread>([this, job_id]() {
    grpc::ClientContext context;
+   {
+     std::lock_guard<std::mutex> lk(log_context_mutex_);
+     active_log_context_ = &context;
+   }
    auto request = build_stream_logs_request(job_id, 0);
    auto reader = impl_->stub->StreamLogs(&context, request);
    cuopt::remote::LogMessage log_msg;
    while (reader->Read(&log_msg)) {
      if (stop_logs_.load()) {
-       context.TryCancel();
        break;
      }
      if (config_.log_callback) { config_.log_callback(log_msg.line()); }
      if (log_msg.job_complete()) { break; }
    }
    reader->Finish();
+   {
+     std::lock_guard<std::mutex> lk(log_context_mutex_);
+     active_log_context_ = nullptr;
+   }
  });
}

void grpc_client_t::stop_log_streaming()
{
  stop_logs_.store(true);
+ {
+   std::lock_guard<std::mutex> lk(log_context_mutex_);
+   if (active_log_context_) { active_log_context_->TryCancel(); }
+ }
  if (log_thread_ && log_thread_->joinable()) { log_thread_->join(); }
  log_thread_.reset();
}
🤖 Prompt for AI Agents
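The shutdown hazard generalizes beyond gRPC: a flag alone cannot wake a blocked reader. A self-contained condition-variable sketch of the same wake-then-join pattern (names are illustrative; `cv.notify_all()` plays the role of `ClientContext::TryCancel()`):

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>

struct log_reader_t {
  std::atomic<bool> stop{false};
  std::mutex m;
  std::condition_variable cv;
  std::thread t;

  void start()
  {
    t = std::thread([this] {
      std::unique_lock<std::mutex> lk(m);
      // Stands in for the blocking stream read: sleeps until woken.
      cv.wait(lk, [this] { return stop.load(); });
    });
  }

  void shutdown()
  {
    {
      std::lock_guard<std::mutex> lk(m);
      stop.store(true);
    }
    cv.notify_all();             // actively wake the reader, analogous to TryCancel()
    if (t.joinable()) t.join();  // now join() cannot block forever
  }
};
```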
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/grpc_client.cu` around lines 247 - 251, stop_log_streaming currently only sets stop_logs_ and then joins log_thread_, which can hang if the thread is blocked in a gRPC read; change the shutdown sequence to (1) set stop_logs_, (2) explicitly cancel/close the gRPC stream used by the reader (e.g. call the grpc::ClientContext::TryCancel() or cancel/Finish the ClientReader/ReaderInterface instance used for log streaming), (3) ensure any CompletionQueue or async reader is shutdown/drained so the blocked read returns, and only then join log_thread_ (use a timed join as a fallback and log an error if join times out); update stop_log_streaming to reference stop_logs_, log_thread_ and the stream/reader/client-context member (the object that performs the stream read) when implementing the cancel/cleanup steps.
cpp/src/grpc/grpc_problem_mapper.cu (1)
399-411: ⚠️ Potential issue | 🔴 Critical — Validate byte alignment before memcpy in chunked array decoders.
n is computed with integer division, but memcpy copies the full byte count. Misaligned payload sizes can write past allocated vectors.
🔧 Suggested fix
auto get_doubles = [&](int32_t field_id) -> std::vector<f_t> {
  auto it = arrays.find(field_id);
  if (it == arrays.end() || it->second.empty()) return {};
+ if (it->second.size() % sizeof(double) != 0) return {};
  size_t n = it->second.size() / sizeof(double);
@@
auto get_ints = [&](int32_t field_id) -> std::vector<i_t> {
  auto it = arrays.find(field_id);
  if (it == arrays.end() || it->second.empty()) return {};
+ if (it->second.size() % sizeof(int32_t) != 0) return {};
  size_t n = it->second.size() / sizeof(int32_t);

Also applies to: 414-426
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/grpc_problem_mapper.cu` around lines 399 - 411, The lambda get_doubles uses n = it->second.size() / sizeof(double) but then memcpy's the full byte size, which can overflow if the payload isn't aligned; validate that it->second.size() is an exact multiple of sizeof(double) before copying, compute expected_bytes = n * sizeof(double) and only memcpy expected_bytes (or return/throw on misaligned size), and when converting to f_t ensure you only read expected_bytes into the temporary vector; apply the same check/fix to the other chunked-array decoder blocks that perform memcpy (the other double-to-f_t conversion in this file).
cpp/src/grpc/server/grpc_service_impl.cpp (1)
71-73: ⚠️ Potential issue | 🔴 Critical — Publish ready before releasing claimed to avoid queue slot races.
The current store order can allow another submitter to reserve the same slot before readiness is published.
As per coding guidelines: "Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state."
🔧 Suggested ordering fix
- job_queue[job_idx].claimed.store(false);
- job_queue[job_idx].ready.store(true);
+ job_queue[job_idx].ready.store(true, std::memory_order_release);
+ job_queue[job_idx].claimed.store(false, std::memory_order_release);

Also applies to: 265-267
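A self-contained sketch of the recommended publication order (the `slot_t` struct is hypothetical, mirroring the `ready`/`claimed` fields from the review):

```cpp
#include <atomic>

struct slot_t {
  std::atomic<bool> claimed{false};
  std::atomic<bool> ready{false};
  int payload{0};
};

// Write the payload and mark the slot ready *before* releasing the claim, so a
// concurrent submitter scanning for free slots can never grab an entry whose
// readiness is still unpublished.
void publish(slot_t& s, int value)
{
  s.payload = value;
  s.ready.store(true, std::memory_order_release);
  s.claimed.store(false, std::memory_order_release);
}
```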
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_service_impl.cpp` around lines 71 - 73, The code currently does job_queue[job_idx].claimed.store(false) before job_queue[job_idx].ready.store(true), which can let another submitter claim the slot before readiness is published; swap the two stores so ready.store(true) is executed before claimed.store(false) for the job_queue entry, and apply the same fix to the other identical location where claimed and ready are updated (the later occurrence around the second claim/release sequence) to prevent the race.
🧹 Nitpick comments (5)
cpp/src/grpc/client/test_grpc_client.cpp (1)
401-403: Consider graceful handling for invalid numeric CLI input instead of process abort.
std::stod/std::stoll at lines 402 and 531 throw on malformed input (e.g., --time-limit abc or an invalid incumbent index). While this is test code, a try-catch wrapper would provide a friendlier CLI experience instead of a crash.
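One way to sketch the suggested wrapper (the `parse_double_arg` name and the nullopt-on-failure policy are assumptions, not the project's API):

```cpp
#include <optional>
#include <string>

// Exception-safe replacement for a bare std::stod call in CLI parsing:
// returns std::nullopt on malformed, partial, or out-of-range input.
std::optional<double> parse_double_arg(const std::string& s)
{
  try {
    std::size_t pos = 0;
    double v = std::stod(s, &pos);
    if (pos != s.size()) return std::nullopt;  // trailing garbage like "1.5x"
    return v;
  } catch (const std::exception&) {
    return std::nullopt;  // std::invalid_argument or std::out_of_range
  }
}
```

The caller can then print a clear error naming the offending flag and exit non-zero, rather than letting the exception terminate the process.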
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/test_grpc_client.cpp` around lines 401 - 403, Replace direct calls to std::stod and std::stoll in the CLI parsing logic with exception-safe parsing: wrap the std::stod(std::string) used to set time_limit and the std::stoll used to set incumbent_index in try-catch blocks that catch std::invalid_argument and std::out_of_range, print a clear error message indicating the offending flag and value (e.g., "--time-limit" or "--incumbent-index"), and exit with a non-zero status instead of letting the process throw; update the parsing branches that reference time_limit and incumbent_index so they validate the parsed numeric ranges if applicable and fall back or exit cleanly on parse failure.
cpp/src/grpc/cuopt_remote_service.proto (1)
9-9: Standardize the job_id field type across the RPC schema.
Most request messages use string job_id (StatusRequest, GetResultRequest, DeleteRequest, CancelRequest, WaitRequest, StreamLogsRequest, etc.), but IncumbentRequest.job_id and Incumbent.job_id are defined as bytes. Similarly, cuopt_remote.proto uses bytes job_id in SubmitResponse. This type inconsistency can cause friction in client libraries expecting uniform scalar types across the API contract. Either standardize on string for consistency with the majority of the schema, or use bytes uniformly if avoiding UTF-8 validation is a requirement.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/cuopt_remote_service.proto` at line 9, IncumbentRequest.job_id, Incumbent.job_id and SubmitResponse.job_id are defined as bytes while most RPC messages (StatusRequest, GetResultRequest, DeleteRequest, CancelRequest, WaitRequest, StreamLogsRequest, etc.) use string; change those fields to type string to standardize the schema, update any proto message definitions where job_id is bytes to string (IncumbentRequest, Incumbent, SubmitResponse), run proto generation to refresh client/server stubs, and adjust any serialization/handler code that expects bytes to handle string instead.
cpp/src/grpc/grpc_solution_mapper.cu (2)
93-101: Consider bulk operations for repeated fields.
The element-by-element add_*() calls work correctly but may be slower for large solutions. Protobuf's mutable_primal_solution()->Reserve(primal.size()) followed by Add(), or mutable_primal_solution()->Assign(...), could improve performance for large-scale problems.
💡 Example using Reserve for primal_solution
  // Solution vectors - CPU solution already has data in host memory
  const auto& primal = solution.get_primal_solution_host();
  const auto& dual = solution.get_dual_solution_host();
  const auto& reduced_cost = solution.get_reduced_cost_host();
+ pb_solution->mutable_primal_solution()->Reserve(primal.size());
  for (const auto& v : primal) {
    pb_solution->add_primal_solution(static_cast<double>(v));
  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/grpc_solution_mapper.cu` around lines 93 - 101, The per-element protobuf adds (pb_solution->add_primal_solution, add_dual_solution, add_reduced_cost) should be replaced with bulk operations to improve performance for large vectors: call pb_solution->mutable_primal_solution()->Reserve(primal.size()) and then use Add() or assign via mutable_primal_solution()->Assign(...) to copy the primal data (cast to double as needed), and do the same for dual and reduced_cost using mutable_dual_solution()->Reserve(dual.size())/Assign and mutable_reduced_cost()->Reserve(reduced_cost.size())/Assign; update the casts so the bulk copy converts elements to double before Assign/adding.
559-598: Warm-start presence detection relies on an empty-array check.
Warm-start presence is inferred from ws_primal.empty(). This works correctly as long as any valid warm-start data always includes the primal solution. If there's ever a case where a warm start exists but current_primal is empty, this would be missed. Consider checking the header for a dedicated warm-start flag if available.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/grpc_solution_mapper.cu` around lines 559 - 598, The code currently detects warm-start presence by testing ws_primal.empty() in the block that constructs cpu_pdlp_warm_start_data_t (see bytes_to_typed, cpu_pdlp_warm_start_data_t, and the if (!ws_primal.empty()) guard); change that condition to prefer a dedicated header flag (e.g., use h.has_warm_start() or h.ws_present() if such a field exists) and fall back to the primal-array emptiness check if the flag is unavailable, e.g., replace if (!ws_primal.empty()) with if (h.has_warm_start() || !ws_primal.empty()) so warm-starts with an empty current_primal are still recognized.
cpp/src/grpc/client/grpc_client.hpp (1)
236-240: Non-movable semantics are intentional but could be revisited.
The class is non-copyable and non-movable due to its std::atomic<bool> and std::thread members (std::atomic is neither copyable nor movable, and std::thread is move-only). Together with the PIMPL pattern this is reasonable. If movability becomes needed, consider wrapping the thread in a nullable type and resetting stop_logs_ after a move.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/grpc_client.hpp` around lines 236 - 240, The class grpc_client_t is explicitly non-copyable and non-movable due to an std::atomic<bool> (stop_logs_) and an internal std::thread member; if you need movability, add a noexcept move constructor and move assignment that transfer ownership of the PIMPL, move the std::thread (std::move on the thread member) and ensure the source thread is left in a benign state (e.g., set to not joinable or detached), and transfer/reset the atomic flag (use stop_logs_.store(false) or std::exchange to set a sensible value in the moved-from object). Update grpc_client_t::grpc_client_t(grpc_client_t&&) and operator=(grpc_client_t&&) to move the pimpl pointer, move the thread, and handle stop_logs_ correctly; ensure proper synchronization when transferring thread ownership to avoid races.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/src/grpc/client/solve_remote.cu`:
- Around line 55-57: The env overrides for CUOPT_CHUNK_SIZE and
CUOPT_MAX_MESSAGE_BYTES must be validated before assignment: call
parse_env_int64 for each (as currently done), then only assign to
config.chunk_size_bytes and config.max_message_bytes when the parsed value is a
sane positive integer (e.g., > 0) and within expected bounds; otherwise ignore
the override (leave existing config value) and emit a warning via
process/logging (use config.enable_debug_log or existing logger). In short,
replace the direct assignments with guarded checks around parse_env_int64
results and clamp or reject values that are zero/negative or exceed your defined
upper limits before setting config.chunk_size_bytes and
config.max_message_bytes.
- Around line 164-166: The incumbent callback incorrectly uses the incumbent
objective as the "bound" (bound_copy = objective) when calling
get_callback->get_solution; change the flow so the actual best bound from the
remote solver (use get_solution_bound()) is forwarded through the incumbent
callback protocol and used as bound_copy when invoking
get_callback->get_solution; update the gRPC incumbent callback message/handler
(grpc_client.hpp protocol and the server-side code that calls the callback) to
include a bound field, populate it from get_solution_bound(), and adjust the
client-side call site in solve_remote.cu to read that bound and pass it instead
of objective to ensure callbacks receive the solver's best bound.
In `@cpp/src/grpc/cuopt_remote_service.proto`:
- Line 315: The proto uses two different types for job_id (string in
SubmitJobResponse, StatusRequest, GetResultRequest, StartChunkedDownloadRequest,
DeleteRequest, CancelRequest, WaitRequest, StreamLogsRequest vs bytes in
IncumbentRequest and Incumbent), so unify them: choose one canonical type
(prefer string for UTF-8 identifiers or bytes if binary IDs are required) and
update the job_id field in the outlier messages (IncumbentRequest and Incumbent
if switching to string, or update the eight string-typed messages if switching
to bytes) to match; after changing the proto make sure to regenerate language
bindings and add any necessary UTF-8 validation or encoding/decoding notes in
the RPC handling code.
In `@cpp/src/grpc/cuopt_remote.proto`:
- Line 6: The proto package declaration "package cuopt.remote" in
cuopt_remote.proto doesn't match Buf's package-directory rules; either change
the package line to match the file's directory-based package (so the package
mirrors the proto file location) or move cuopt_remote.proto into a directory
structure that matches "cuopt/remote"; update the package declaration (package
cuopt.remote) accordingly so the package and directory structure are consistent
and the PACKAGE_DIRECTORY_MATCH lint passes.
In `@cpp/src/grpc/grpc_settings_mapper.cu`:
- Around line 144-146: The protobuf iteration_limit can exceed the range of the
target type i_t and overflow when statically cast; before assigning
pb_settings.iteration_limit() to settings.iteration_limit (the
static_cast<i_t>), check that the value lies within numeric_limits<i_t>::min()
and numeric_limits<i_t>::max(); if it is out of range, either clamp to the
appropriate min/max or return/log an error and refuse the assignment, ensuring
you reference pb_settings.iteration_limit() and settings.iteration_limit when
implementing the guard to prevent narrowing overflow.
In `@cpp/src/grpc/server/grpc_job_management.cpp`:
- Around line 109-113: The code currently reads a uint64_t size from the pipe
and blindly calls data.resize(size) which permits unbounded allocation; before
resizing, validate and cap the incoming size (e.g. define a
MAX_INCUMBENT_PAYLOAD or MAX_INCUMBENT_SIZE constant) and return false if size
is zero or exceeds that limit, and ensure you convert/cast size to size_t
safely; update the recv_incumbent_pipe logic around the read_from_pipe(fd,
&size, ...) and data.resize(...) calls (and any callers using fd/data) to
enforce this limit and prevent resource exhaustion.
In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp`:
- Line 90: The loop currently treats any ParseDelimitedFromCodedStream(&ac,
&coded, &clean_eof) == false as a benign break and later returns true, which
accepts partial/corrupt payloads; change the handling so that when
ParseDelimitedFromCodedStream returns false you check clean_eof and only break
on a true clean_eof (clean EOF), otherwise return false to indicate a parse
error. Update both occurrences (the line with ParseDelimitedFromCodedStream(&ac,
&coded, &clean_eof) and the other similar occurrence) so parse failures produce
an immediate false return instead of falling through and returning true.
- Around line 151-154: The resize uses ac.total_elements() * elem_size without
validation which can overflow or trigger huge allocations; before calling
dest.resize validate that ac.total_elements() and elem_size are non-negative and
compute the product using a larger unsigned type (e.g., uint64_t) then check
product <= std::numeric_limits<size_t>::max() and <= a configured maximum
allocation threshold, and only then cast to size_t and call dest.resize; if the
checks fail, return/throw an error instead of resizing. Also apply the same
validation logic for the other occurrence around lines 158-163 (same variables:
ac, elem_size, array_field_element_size(ac.field_id()), dest).
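The overflow-safe size computation can look like this sketch. `MAX_ARRAY_BYTES` and `checked_alloc_size` are illustrative names under an assumed 4 GiB ceiling:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical ceiling on a single array allocation (4 GiB).
constexpr uint64_t MAX_ARRAY_BYTES = 1ull << 32;

// Compute total_elements * elem_size in 64 bits, rejecting overflow and
// oversized requests before any resize() happens.
inline bool checked_alloc_size(uint64_t total_elements, uint64_t elem_size, size_t& out_bytes)
{
  if (elem_size == 0) { return false; }
  if (total_elements > std::numeric_limits<uint64_t>::max() / elem_size) { return false; }
  const uint64_t bytes = total_elements * elem_size;
  if (bytes > MAX_ARRAY_BYTES) { return false; }
  out_bytes = static_cast<size_t>(bytes);
  return true;
}
```

`dest.resize` would then only run on a `true` return, with `ac.total_elements()` and the element size as inputs.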
In `@cpp/src/grpc/server/grpc_server_main.cpp`:
- Around line 50-80: The CLI parsing loop in main (grpc_server_main.cpp)
currently silently ignores unknown flags and missing values; update the for-loop
that inspects argv/argc to validate every flag: for each option that requires a
value (e.g., "-p"/"--port", "-w"/"--workers", "--max-message-mb",
"--max-message-bytes", "--tls-cert", "--tls-key", "--tls-root") check i+1 < argc
and if not, print an error to std::cerr (include the flag name), call
print_usage(argv[0]) and return a non-zero exit code; for unrecognized args (the
final else case) do the same (error + usage + non-zero exit); also wrap
stoi/stoll conversions in try/catch to handle invalid numbers and fail similarly
with an explanatory error. Ensure you reference and update the same config
fields (config.port, config.num_workers, config.max_message_mb,
config.max_message_b, config.tls_cert_path, config.tls_key_path,
config.tls_root_path) and keep behavior for boolean flags unchanged.
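The exception-safe numeric conversion piece can be sketched as a helper; `parse_int` is an illustrative name, not part of the server:

```cpp
#include <cassert>
#include <exception>
#include <string>

// Convert a flag value to int, reporting failure instead of throwing.
// Rejects trailing junk like "80x" as well as non-numeric input.
inline bool parse_int(const char* text, int& out)
{
  try {
    size_t pos    = 0;
    const int v   = std::stoi(text, &pos);
    if (pos != std::string(text).size()) { return false; }  // trailing junk
    out = v;
    return true;
  } catch (const std::exception&) {  // std::invalid_argument / std::out_of_range
    return false;
  }
}
```

In `main`, a `false` return (or a missing `i+1 < argc` value) would print the offending flag to `std::cerr`, call `print_usage(argv[0])`, and return non-zero.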
- Around line 157-167: The code currently calls memset on JobQueueEntry and
ResultQueueEntry which contain std::atomic members (job_queue and result_queue
initialization loop); replace the raw memset with explicit field initialization
or construction: for each JobQueueEntry/ResultQueueEntry set every member
explicitly (e.g., initialize non-atomic fields directly and initialize atomics
via .store(...) or use placement-new / a helper function that constructs the
object in shared memory) to avoid undefined behavior; update the loops that call
memset(&job_queue[i], 0, ...) and memset(&result_queue[i], 0, ...) to perform
per-field initialization instead, ensuring worker_index, ready, claimed,
cancelled, retrieved, etc. are correctly constructed and stored.
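A placement-new sketch of the replacement for `memset`. The struct below is a simplified stand-in for the real entry in `grpc_server_types.hpp` (cross-process use of `std::atomic` additionally assumes the atomics are lock-free):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <new>

// Simplified stand-in for the shared-memory queue entry.
struct JobQueueEntry {
  int32_t worker_index{-1};
  std::atomic<bool> ready{false};
  std::atomic<bool> claimed{false};
  std::atomic<bool> cancelled{false};
};

// Construct an entry in raw (e.g. mmap'd) storage instead of memset():
// placement-new runs the atomics' constructors, which memset bypasses.
inline JobQueueEntry* init_entry(void* raw)
{
  return new (raw) JobQueueEntry{};
}
```

The initialization loops would call `init_entry(&job_queue[i])` (and likewise for `result_queue`) rather than zeroing the bytes.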
In `@cpp/src/grpc/server/grpc_server_types.hpp`:
- Around line 300-307: signal_handler currently calls non-async-signal-safe
functions (std::cout and result_cv.notify_all()), which must be removed; change
signal_handler to only perform async-signal-safe writes: set keep_running to
false using a sig_atomic_t/atomic<bool> and, if needed, set
shm_ctrl->shutdown_requested using an atomic or sig_atomic_t field, and return
immediately (no I/O or condition_variable calls). Add a separate watcher
thread/function (e.g., monitor_shutdown or shutdown_watcher) that polls/blocks
on keep_running becoming false and then performs the logging (std::cout) and
calls result_cv.notify_all() and any other cleanup; update signal_handler,
keep_running's type, and shm_ctrl to use signal-safe atomics so the watcher can
safely observe the change.
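The async-signal-safe half of this split can be as small as the sketch below; the watcher thread that does the logging and `result_cv.notify_all()` is omitted for brevity:

```cpp
#include <atomic>
#include <cassert>
#include <csignal>

// Shutdown flag observed by a watcher thread; std::atomic<bool> store is
// async-signal-safe (lock-free on mainstream platforms).
std::atomic<bool> keep_running{true};

extern "C" void signal_handler(int /*signum*/)
{
  keep_running.store(false, std::memory_order_relaxed);
  // No I/O, no mutexes, no condition variables here.
}
```

A separate `shutdown_watcher` thread would poll `keep_running`, then perform the `std::cout` logging and `result_cv.notify_all()` in a normal execution context.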
In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp`:
- Around line 213-218: The setters and other methods access server_log_path_,
server_log_start_pos_, and test_start_marked_ without synchronization, causing
races; add a member mutex (e.g., server_log_mutex_) and use it to guard all
reads and writes of these fields (wrap accesses in std::lock_guard<std::mutex>),
update set_server_log_path to lock the mutex when updating server_log_path_,
server_log_start_pos_, and test_start_marked_, and ensure every other method
that reads or writes these members (the other server-log readers/pollers
mentioned) also locks the same mutex before accessing them so the state is
consistently protected.
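A condensed sketch of the guarded state; member names follow the review comment, and the getter returns by value so no reference escapes the lock:

```cpp
#include <cassert>
#include <mutex>
#include <string>

class log_capture_t {
 public:
  void set_server_log_path(const std::string& path)
  {
    std::lock_guard<std::mutex> lock(server_log_mutex_);
    server_log_path_      = path;
    server_log_start_pos_ = 0;
    test_start_marked_    = false;
  }

  std::string server_log_path() const
  {
    std::lock_guard<std::mutex> lock(server_log_mutex_);
    return server_log_path_;  // copy out; no reference escapes the lock
  }

 private:
  mutable std::mutex server_log_mutex_;
  std::string server_log_path_;
  long server_log_start_pos_{0};
  bool test_start_marked_{false};
};
```

Every other reader/poller of these fields would take the same `server_log_mutex_` before touching them.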
In `@GRPC_ARCHITECTURE.md`:
- Around line 9-10: The documentation sentence claiming “no custom serialization
logic is implemented” is too absolute; update the statement to acknowledge the
custom pipe blob framing helpers introduced in grpc_pipe_serialization.hpp (the
pipe blob framing helpers) and clarify scope by limiting the claim to gRPC wire
payloads (e.g., “All gRPC wire payload serialization uses protocol buffers
generated by protoc and grpc_cpp_plugin; separate pipe blob framing helpers in
grpc_pipe_serialization.hpp are used only for local pipe framing, not for gRPC
wire serialization.”).
In `@SERVER_ARCHITECTURE.md`:
- Around line 15-46: The markdown has untyped fenced code blocks (the ASCII
diagram block and other fenced regions) which trigger MD040; update each
triple-backtick fence (including the large ASCII diagram block shown and the
other blocks noted) to include an explicit language token such as text (e.g.,
change ``` to ```text) or another appropriate language (bash/cpp) so
markdownlint MD040 is satisfied across the file.
---
Duplicate comments:
In `@cpp/src/grpc/client/grpc_client.cu`:
- Around line 258-266: The public RPC method grpc_client_t::check_status
dereferences impl_->stub without ensuring impl_ is initialized or stub is
non-null (can be called before connect()), which can crash; add a null-check
guard at the start of check_status (and other public RPC methods) that verifies
impl_ and impl_->stub are valid, and return an appropriate job_status_result_t
error/failed state (or set result.error and result.code) when the client is not
connected instead of dereferencing; refer to grpc_client_t::connect and
impl_->stub when adding the guard so the check mirrors connect()'s
initialization contract.
- Around line 994-995: The code computes max_polls by dividing by
config_.poll_interval_ms without validating it; update the logic that sets
max_polls (the line initializing int max_polls) to first validate
config_.poll_interval_ms is > 0 (and also guard against extremely large/small
values), returning an error or using a safe default/policy (e.g., set
poll_interval_ms = 1 or return a failure status) when it's <= 0 to avoid
division-by-zero and undefined behavior; apply the same validation where
max_polls is computed elsewhere (the other occurrence around the code that uses
config_.poll_interval_ms).
- Around line 703-712: Validate and guard all chunk metadata before allocating
or copying: ensure arr_desc.element_size_bytes() (elem_size) > 0 and
arr_desc.total_elements() (total_elems) >= 0 before using them; ensure
chunk_data_budget is >0 before computing elems_per_chunk and handle division
safely; check for size_t/multiplication overflow when computing total_elems *
elem_size before constructing array_bytes and use a safe cap; when iterating
chunks validate each elements_in_chunk is within [0, total_elems] and compute
copy_size = elements_in_chunk * elem_size with overflow checks before any
memcpy; add clear error handling/logging and skip/abort the chunk on invalid
metadata rather than performing the allocation or memcpy (refer to symbols:
arr_desc, field_id, elem_size, total_elems, chunk_data_budget, elems_per_chunk,
elements_in_chunk, array_bytes, memcpy).
- Around line 247-251: stop_log_streaming currently only sets stop_logs_ and
then joins log_thread_, which can hang if the thread is blocked in a gRPC read;
change the shutdown sequence to (1) set stop_logs_, (2) explicitly cancel/close
the gRPC stream used by the reader (e.g. call the
grpc::ClientContext::TryCancel() or cancel/Finish the
ClientReader/ReaderInterface instance used for log streaming), (3) ensure any
CompletionQueue or async reader is shutdown/drained so the blocked read returns,
and only then join log_thread_ (use a timed join as a fallback and log an error
if join times out); update stop_log_streaming to reference stop_logs_,
log_thread_ and the stream/reader/client-context member (the object that
performs the stream read) when implementing the cancel/cleanup steps.
In `@cpp/src/grpc/grpc_problem_mapper.cu`:
- Around line 399-411: The lambda get_doubles uses n = it->second.size() /
sizeof(double) but then memcpy's the full byte size, which can overflow if the
payload isn't aligned; validate that it->second.size() is an exact multiple of
sizeof(double) before copying, compute expected_bytes = n * sizeof(double) and
only memcpy expected_bytes (or return/throw on misaligned size), and when
converting to f_t ensure you only read expected_bytes into the temporary vector;
apply the same check/fix to the other chunked-array decoder blocks that perform
memcpy (the other double-to-f_t conversion in this file).
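The multiple-of-`sizeof(double)` validation can be factored as below; `bytes_to_doubles` is a hypothetical helper name:

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Decode a byte blob into doubles only if its size is an exact multiple of
// sizeof(double); otherwise report failure instead of copying a partial tail.
inline bool bytes_to_doubles(const std::string& bytes, std::vector<double>& out)
{
  if (bytes.size() % sizeof(double) != 0) { return false; }
  out.resize(bytes.size() / sizeof(double));
  if (!out.empty()) { std::memcpy(out.data(), bytes.data(), bytes.size()); }
  return true;
}
```

Both chunked-array decoder blocks in the mapper would route through this check before the subsequent double-to-`f_t` conversion.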
In `@cpp/src/grpc/server/grpc_job_management.cpp`:
- Around line 147-158: The submit_job_async path performs an unsynchronized
check-then-write on shared job_queue entries (e.g., checking
job_queue[i].ready/claimed then writing fields like job_id, ready, claimed),
which races with check_job_status and cancel_job; protect all accesses to
job_queue (both scans and per-entry reads/writes) by introducing a
synchronization primitive (e.g., a std::mutex or per-entry mutex array) and
acquire the lock(s) around the loop that scans MAX_JOBS and before modifying
fields (job_queue[i].ready, claimed, cancelled, etc.), and also ensure
check_job_status and cancel_job acquire the same lock(s) before reading or
updating those fields so ownership and state transitions are atomic and
race-free.
In `@cpp/src/grpc/server/grpc_server_main.cpp`:
- Around line 186-212: The TLS validation branches that currently do "return 1"
(around checks using config, ssl_opts and read_file_to_string for cert/key/root
and require_client) can occur after workers/threads have been started; replace
those early returns with calls to the centralized shutdown/teardown routine
(e.g., call your existing server stop/cleanup functions such as StopServer(),
shutdownWorkers()/joinThreads(), or a new cleanup_resources() that performs
shutdown and joins) and then return the error code; ensure every failure path
that currently returns from inside the TLS setup invokes that teardown routine
before exiting so workers/threads and other resources are properly cleaned up.
- Around line 234-251: BuildAndStart() may return nullptr so avoid unguarded
dereference of server; after std::unique_ptr<Server>
server(builder.BuildAndStart()), check if (server) before calling server->Wait()
and starting the shutdown logic that assumes a live server. If BuildAndStart()
fails, log an error (or print to cerr) and exit/return non-zero. Also ensure the
shutdown_thread's capture/Shutdown call remains safe by only invoking
server->Shutdown() when server is non-null; move creation of shutdown_thread and
the server->Wait() call inside the if (server) block and handle the failure path
explicitly.
In `@cpp/src/grpc/server/grpc_service_impl.cpp`:
- Around line 71-73: The code currently does
job_queue[job_idx].claimed.store(false) before
job_queue[job_idx].ready.store(true), which can let another submitter claim the
slot before readiness is published; swap the two stores so ready.store(true) is
executed before claimed.store(false) for the job_queue entry, and apply the same
fix to the other identical location where claimed and ready are updated (the
later occurrence around the second claim/release sequence) to prevent the race.
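The corrected publish order can be sketched with two atomics. A submitter scans for entries with `ready == false && claimed == false`, so `claimed` must only drop after `ready` is visible (the struct here is a stand-in for the shared-memory entry):

```cpp
#include <atomic>
#include <cassert>

struct slot_t {
  std::atomic<bool> ready{false};
  std::atomic<bool> claimed{true};  // held by the current writer
};

inline void publish(slot_t& s)
{
  s.ready.store(true, std::memory_order_release);    // 1) publish readiness first
  s.claimed.store(false, std::memory_order_release); // 2) then release ownership
}
```

With the original order (claimed first, ready second) a concurrent submitter could briefly observe a free-looking slot and overwrite it.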
---
Nitpick comments:
In `@cpp/src/grpc/client/grpc_client.hpp`:
- Around line 236-240: The class grpc_client_t is explicitly non-copyable and
non-movable due to an std::atomic<bool> (stop_logs_) and an internal std::thread
member; if you need movability, add a noexcept move constructor and move
assignment that transfer ownership of the PIMPL, move the std::thread (std::move
on the thread member) and ensure the source thread is left in a benign state
(e.g., set to not joinable or detached), and transfer/reset the atomic flag (use
stop_logs_.store(false) or std::exchange to set a sensible value in the
moved-from object). Update grpc_client_t::grpc_client_t(grpc_client_t&&) and
operator=(grpc_client_t&&) to move the pimpl pointer, move the thread, and
handle stop_logs_ correctly; ensure proper synchronization when transferring
thread ownership to avoid races.
In `@cpp/src/grpc/client/test_grpc_client.cpp`:
- Around line 401-403: Replace direct calls to std::stod and std::stoll in the
CLI parsing logic with exception-safe parsing: wrap the std::stod(std::string)
used to set time_limit and the std::stoll used to set incumbent_index in
try-catch blocks that catch std::invalid_argument and std::out_of_range, print a
clear error message indicating the offending flag and value (e.g.,
"--time-limit" or "--incumbent-index"), and exit with a non-zero status instead
of letting the process throw; update the parsing branches that reference
time_limit and incumbent_index so they validate the parsed numeric ranges if
applicable and fall back or exit cleanly on parse failure.
In `@cpp/src/grpc/cuopt_remote_service.proto`:
- Line 9: IncumbentRequest.job_id, Incumbent.job_id and SubmitResponse.job_id
are defined as bytes while most RPC messages (StatusRequest, GetResultRequest,
DeleteRequest, CancelRequest, WaitRequest, StreamLogsRequest, etc.) use string;
change those fields to type string to standardize the schema, update any proto
message definitions where job_id is bytes to string (IncumbentRequest,
Incumbent, SubmitResponse), run proto generation to refresh client/server stubs,
and adjust any serialization/handler code that expects bytes to handle string
instead.
In `@cpp/src/grpc/grpc_solution_mapper.cu`:
- Around line 93-101: The per-element protobuf adds
(pb_solution->add_primal_solution, add_dual_solution, add_reduced_cost) should
be replaced with bulk operations to improve performance for large vectors: call
pb_solution->mutable_primal_solution()->Reserve(primal.size()) and then use
Add() or assign via mutable_primal_solution()->Assign(...) to copy the primal
data (cast to double as needed), and do the same for dual and reduced_cost using
mutable_dual_solution()->Reserve(dual.size())/Assign and
mutable_reduced_cost()->Reserve(reduced_cost.size())/Assign; update the casts so
the bulk copy converts elements to double before Assign/adding.
- Around line 559-598: The code currently detects warm-start presence by testing
ws_primal.empty() in the block that constructs cpu_pdlp_warm_start_data_t (see
bytes_to_typed, cpu_pdlp_warm_start_data_t, and the if (!ws_primal.empty())
guard); change that condition to prefer a dedicated header flag (e.g., use
h.has_warm_start() or h.ws_present() if such a field exists) and fall back to
the primal-array emptiness check if the flag is unavailable, e.g., replace if
(!ws_primal.empty()) with if (h.has_warm_start() || !ws_primal.empty()) so
warm-starts with an empty current_primal are still recognized.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 2a43112f-b25e-4054-b00f-1fee3a598df9
📒 Files selected for processing (38)
- GRPC_ARCHITECTURE.md
- SERVER_ARCHITECTURE.md
- build.sh
- cpp/CMakeLists.txt
- cpp/cuopt_grpc_server.cpp
- cpp/src/grpc/client/grpc_client.cu
- cpp/src/grpc/client/grpc_client.hpp
- cpp/src/grpc/client/solve_remote.cu
- cpp/src/grpc/client/test_grpc_client.cpp
- cpp/src/grpc/cuopt_remote.proto
- cpp/src/grpc/cuopt_remote_service.proto
- cpp/src/grpc/grpc_problem_mapper.cu
- cpp/src/grpc/grpc_problem_mapper.hpp
- cpp/src/grpc/grpc_service_mapper.cu
- cpp/src/grpc/grpc_service_mapper.hpp
- cpp/src/grpc/grpc_settings_mapper.cu
- cpp/src/grpc/grpc_settings_mapper.hpp
- cpp/src/grpc/grpc_solution_mapper.cu
- cpp/src/grpc/grpc_solution_mapper.hpp
- cpp/src/grpc/server/grpc_field_element_size.hpp
- cpp/src/grpc/server/grpc_incumbent_proto.hpp
- cpp/src/grpc/server/grpc_job_management.cpp
- cpp/src/grpc/server/grpc_pipe_serialization.hpp
- cpp/src/grpc/server/grpc_server_main.cpp
- cpp/src/grpc/server/grpc_server_threads.cpp
- cpp/src/grpc/server/grpc_server_types.hpp
- cpp/src/grpc/server/grpc_service_impl.cpp
- cpp/src/grpc/server/grpc_worker.cpp
- cpp/src/grpc/server/grpc_worker_infra.cpp
- cpp/src/pdlp/CMakeLists.txt
- cpp/src/pdlp/solve_remote.cu
- cpp/tests/linear_programming/CMakeLists.txt
- cpp/tests/linear_programming/grpc/CMakeLists.txt
- cpp/tests/linear_programming/grpc/grpc_client_test.cpp
- cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp
- cpp/tests/linear_programming/grpc/grpc_integration_test.cpp
- cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp
- python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
💤 Files with no reviewable changes (2)
- cpp/src/pdlp/CMakeLists.txt
- cpp/src/pdlp/solve_remote.cu
🚧 Files skipped from review as they are similar to previous changes (8)
- cpp/tests/linear_programming/CMakeLists.txt
- cpp/tests/linear_programming/grpc/CMakeLists.txt
- cpp/src/grpc/server/grpc_incumbent_proto.hpp
- cpp/src/grpc/server/grpc_worker_infra.cpp
- cpp/src/grpc/server/grpc_server_threads.cpp
- cpp/src/grpc/server/grpc_field_element_size.hpp
- cpp/src/grpc/server/grpc_worker.cpp
- python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
```cpp
config.chunk_size_bytes  = parse_env_int64("CUOPT_CHUNK_SIZE", config.chunk_size_bytes);
config.max_message_bytes = parse_env_int64("CUOPT_MAX_MESSAGE_BYTES", config.max_message_bytes);
config.enable_debug_log  = (parse_env_int64("CUOPT_GRPC_DEBUG", 0) != 0);
```

Validate env override ranges before applying gRPC sizing config.
CUOPT_CHUNK_SIZE and CUOPT_MAX_MESSAGE_BYTES are accepted as-is. Zero/negative values can break chunking/message-size logic at runtime.

Proposed guardrails

```diff
 static void apply_env_overrides(grpc_client_config_t& config)
 {
-  config.chunk_size_bytes  = parse_env_int64("CUOPT_CHUNK_SIZE", config.chunk_size_bytes);
-  config.max_message_bytes = parse_env_int64("CUOPT_MAX_MESSAGE_BYTES", config.max_message_bytes);
+  auto chunk_size = parse_env_int64("CUOPT_CHUNK_SIZE", config.chunk_size_bytes);
+  auto max_msg    = parse_env_int64("CUOPT_MAX_MESSAGE_BYTES", config.max_message_bytes);
+  if (chunk_size > 0) { config.chunk_size_bytes = chunk_size; }
+  if (max_msg > 0) { config.max_message_bytes = max_msg; }
   config.enable_debug_log = (parse_env_int64("CUOPT_GRPC_DEBUG", 0) != 0);
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```cpp
auto chunk_size = parse_env_int64("CUOPT_CHUNK_SIZE", config.chunk_size_bytes);
auto max_msg    = parse_env_int64("CUOPT_MAX_MESSAGE_BYTES", config.max_message_bytes);
if (chunk_size > 0) { config.chunk_size_bytes = chunk_size; }
if (max_msg > 0) { config.max_message_bytes = max_msg; }
config.enable_debug_log = (parse_env_int64("CUOPT_GRPC_DEBUG", 0) != 0);
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/grpc/client/solve_remote.cu` around lines 55 - 57, The env overrides
for CUOPT_CHUNK_SIZE and CUOPT_MAX_MESSAGE_BYTES must be validated before
assignment: call parse_env_int64 for each (as currently done), then only assign
to config.chunk_size_bytes and config.max_message_bytes when the parsed value is
a sane positive integer (e.g., > 0) and within expected bounds; otherwise ignore
the override (leave existing config value) and emit a warning via
process/logging (use config.enable_debug_log or existing logger). In short,
replace the direct assignments with guarded checks around parse_env_int64
results and clamp or reject values that are zero/negative or exceed your defined
upper limits before setting config.chunk_size_bytes and
config.max_message_bytes.
```cpp
double bound_copy = objective;  // Use objective as bound for incumbent
get_callback->get_solution(
  solution_copy.data(), &obj_copy, &bound_copy, callback->get_user_data());
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail
echo "== Remote callback context (solve_remote.cu lines 160-170) =="
sed -n '160,170p' cpp/src/grpc/client/solve_remote.cu | cat -n
echo "== All get_solution calls in solve_remote.cu =="
rg -n 'get_solution' cpp/src/grpc/client/solve_remote.cu -B 5 -A 5
echo "== Other get_solution callback implementations =="
rg -n 'get_solution\(' cpp/src -g '*.{cu,cpp,hpp,h}' -B 3 -A 3 | head -100
```

Repository: NVIDIA/cuopt
Length of output: 9221

🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail
echo "== Search for best_bound or incumbent_bound in solve_remote.cu =="
rg -n 'best_bound|incumbent_bound|lower_bound|upper_bound|dual_bound' cpp/src/grpc/client/solve_remote.cu -C 3
echo "== Check callback interface definition =="
rg -n 'get_solution|GetSolution' cpp/src/callbacks -g '*.{h,hpp}' -B 5 -A 10 | head -150
```

Repository: NVIDIA/cuopt
Length of output: 121

🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail
echo "== Search for callback setup/documentation in solve_remote.cu =="
cat -n cpp/src/grpc/client/solve_remote.cu | head -200
```

Repository: NVIDIA/cuopt
Length of output: 9190

🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail
echo "== Check incumbent_callback definition/signature =="
rg -n 'incumbent_callback' cpp/src/grpc -g '*.{hpp,h,cpp}' -B 2 -A 2
echo "== Check if best_bound is available in remote solve =="
rg -n 'best_bound|bestBound|best_lower_bound|lower_bound' cpp/src/grpc -g '*.{hpp,h,cpp}' -C 2
```

Repository: NVIDIA/cuopt
Length of output: 2037

🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail
echo "== Check config/client callback interface =="
fd 'grpc_client' cpp/src/grpc --type f -x head -100 {} \;
echo "== Search for MipSolveResult or similar that returns solver progress =="
rg -n 'MipSolveResult|incumbent|bound' cpp/src/grpc/server -g '*.{hpp,h,cpp}' | head -50
```

Repository: NVIDIA/cuopt
Length of output: 17291
Remote solver should pass the best bound to incumbent callbacks, not the incumbent objective.
The incumbent callback from the remote server provides only the objective value and solution, not the best lower bound from the MIP solver. Setting bound_copy = objective at line 164 creates an artificial zero gap in user callbacks, misreporting MIP progress.
The underlying issue is that the incumbent_callback protocol (grpc_client.hpp:66) does not include the solver's best bound, while the local MIP solver (solve.cu:117) passes a separate bound vector to callbacks. The server has access to get_solution_bound() but does not forward it through the incumbent callback protocol.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/grpc/client/solve_remote.cu` around lines 164 - 166, The incumbent
callback incorrectly uses the incumbent objective as the "bound" (bound_copy =
objective) when calling get_callback->get_solution; change the flow so the
actual best bound from the remote solver (use get_solution_bound()) is forwarded
through the incumbent callback protocol and used as bound_copy when invoking
get_callback->get_solution; update the gRPC incumbent callback message/handler
(grpc_client.hpp protocol and the server-side code that calls the callback) to
include a bound field, populate it from get_solution_bound(), and adjust the
client-side call site in solve_remote.cu to read that bound and pass it instead
of objective to ensure callbacks receive the solver's best bound.
```proto
// =============================================================================

message IncumbentRequest {
  bytes job_id = 1;
```
🛠️ Refactor suggestion | 🟠 Major
🧩 Analysis chain
🏁 Script executed:

```bash
cat -n cpp/src/grpc/cuopt_remote_service.proto | sed -n '300,335p'
```

Repository: NVIDIA/cuopt
Length of output: 1562

🏁 Script executed:

```bash
#!/bin/bash
echo "== All job_id field declarations in proto files =="
rg -nP '\bjob_id\b\s*=\s*\d+\s*;' cpp/src/grpc/*.proto
echo ""
echo "== Check context around job_id declarations =="
rg -B2 -A2 '\bjob_id\b\s*=\s*\d+\s*;' cpp/src/grpc/*.proto
```

Repository: NVIDIA/cuopt
Length of output: 4428

🏁 Script executed:

```bash
#!/bin/bash
echo "== GetIncumbents RPC definition =="
rg -B5 -A5 'GetIncumbents' cpp/src/grpc/cuopt_remote_service.proto | head -30
echo ""
echo "== Usage of job_id in test files =="
rg -n 'job_id' cpp/tests/linear_programming/grpc/ | head -20
echo ""
echo "== Check if bytes job_id is used elsewhere intentionally =="
rg -B3 -A3 'bytes job_id' cpp/src/grpc/
```

Repository: NVIDIA/cuopt
Length of output: 4404
Unify job_id type across RPC messages.
job_id is string in 8 request/response messages (SubmitJobResponse, StatusRequest, GetResultRequest, StartChunkedDownloadRequest, DeleteRequest, CancelRequest, WaitRequest, StreamLogsRequest) but bytes in IncumbentRequest and Incumbent. This inconsistency complicates cross-language clients and can introduce avoidable encoding bugs. If UTF-8 validation concerns require bytes encoding, the same should apply consistently across all job_id fields.
Suggested schema alignment

```diff
 message IncumbentRequest {
-  bytes job_id = 1;
+  string job_id = 1;
   int64 from_index = 2;  // Return incumbents starting from this index
   int32 max_count = 3;   // Optional limit (0 or negative => no limit)
 }

 message Incumbent {
   int64 index = 1;
   double objective = 2;
   repeated double assignment = 3;
-  bytes job_id = 4;
+  string job_id = 4;
 }
```
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/grpc/cuopt_remote_service.proto` at line 315, The proto uses two
different types for job_id (string in SubmitJobResponse, StatusRequest,
GetResultRequest, StartChunkedDownloadRequest, DeleteRequest, CancelRequest,
WaitRequest, StreamLogsRequest vs bytes in IncumbentRequest and Incumbent), so
unify them: choose one canonical type (prefer string for UTF-8 identifiers or
bytes if binary IDs are required) and update the job_id field in the outlier
messages (IncumbentRequest and Incumbent if switching to string, or update the
eight string-typed messages if switching to bytes) to match; after changing the
proto make sure to regenerate language bindings and add any necessary UTF-8
validation or encoding/decoding notes in the RPC handling code.
```proto
syntax = "proto3";

package cuopt.remote;
```
Align proto package path with Buf package-directory rules.
package cuopt.remote from cpp/src/grpc/cuopt_remote.proto triggers PACKAGE_DIRECTORY_MATCH. This will fail Buf lint unless the file path (or package naming) is adjusted consistently.
🧰 Tools
🪛 Buf (1.66.0)
[error] 6-6: Files with package "cuopt.remote" must be within a directory "cuopt/remote" relative to root but were in directory "cpp/src/grpc".
(PACKAGE_DIRECTORY_MATCH)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/grpc/cuopt_remote.proto` at line 6, The proto package declaration
"package cuopt.remote" in cuopt_remote.proto doesn't match Buf's
package-directory rules; either change the package line to match the file's
directory-based package (so the package mirrors the proto file location) or move
cuopt_remote.proto into a directory structure that matches "cuopt/remote";
update the package declaration (package cuopt.remote) accordingly so the package
and directory structure are consistent and the PACKAGE_DIRECTORY_MATCH lint
passes.
```cpp
if (pb_settings.iteration_limit() >= 0) {
  settings.iteration_limit = static_cast<i_t>(pb_settings.iteration_limit());
}
```
Range-check iteration_limit before narrowing to i_t.
A large protobuf iteration_limit can overflow the static_cast<i_t> and produce an invalid internal limit.
🔧 Suggested guard
```diff
 if (pb_settings.iteration_limit() >= 0) {
-  settings.iteration_limit = static_cast<i_t>(pb_settings.iteration_limit());
+  const auto limit = pb_settings.iteration_limit();
+  if (limit <= static_cast<int64_t>(std::numeric_limits<i_t>::max())) {
+    settings.iteration_limit = static_cast<i_t>(limit);
+  } else {
+    settings.iteration_limit = std::numeric_limits<i_t>::max();
+  }
 }
```

📝 Committable suggestion

```cpp
if (pb_settings.iteration_limit() >= 0) {
  const auto limit = pb_settings.iteration_limit();
  if (limit <= static_cast<int64_t>(std::numeric_limits<i_t>::max())) {
    settings.iteration_limit = static_cast<i_t>(limit);
  } else {
    settings.iteration_limit = std::numeric_limits<i_t>::max();
  }
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/grpc/grpc_settings_mapper.cu` around lines 144 - 146, The protobuf
iteration_limit can exceed the range of the target type i_t and overflow when
statically cast; before assigning pb_settings.iteration_limit() to
settings.iteration_limit (the static_cast<i_t>), check that the value lies
within numeric_limits<i_t>::min() and numeric_limits<i_t>::max(); if it is out
of range, either clamp to the appropriate min/max or return/log an error and
refuse the assignment, ensuring you reference pb_settings.iteration_limit() and
settings.iteration_limit when implementing the guard to prevent narrowing
overflow.
```cpp
void set_server_log_path(const std::string& path)
{
  server_log_path_      = path;
  server_log_start_pos_ = 0;
  test_start_marked_    = false;
}
```
Synchronize server-log state reads/writes consistently.
server_log_path_, server_log_start_pos_, and test_start_marked_ are written and read without a common lock in several methods. This can race in concurrent test callbacks/pollers.
As per coding guidelines: "Prevent thread-unsafe use of global and static variables; use proper mutex/synchronization in server code accessing shared solver state."
Also applies to: 229-244, 332-343
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp` around lines 213
- 218, The setters and other methods access server_log_path_,
server_log_start_pos_, and test_start_marked_ without synchronization, causing
races; add a member mutex (e.g., server_log_mutex_) and use it to guard all
reads and writes of these fields (wrap accesses in std::lock_guard<std::mutex>),
update set_server_log_path to lock the mutex when updating server_log_path_,
server_log_start_pos_, and test_start_marked_, and ensure every other method
that reads or writes these members (the other server-log readers/pollers
mentioned) also locks the same mutex before accessing them so the state is
consistently protected.
| ``` | ||
| ┌────────────────────────────────────────────────────────────────────┐ | ||
| │ Main Server Process │ | ||
| │ │ | ||
| │ ┌─────────────┐ ┌──────────────┐ ┌─────────────────────────────┐ │ | ||
| │ │ gRPC │ │ Job │ │ Background Threads │ │ | ||
| │ │ Service │ │ Tracker │ │ - Result retrieval │ │ | ||
| │ │ Handler │ │ (job status,│ │ - Incumbent retrieval │ │ | ||
| │ │ │ │ results) │ │ - Worker monitor │ │ | ||
| │ └─────────────┘ └──────────────┘ └─────────────────────────────┘ │ | ||
| │ │ ▲ │ | ||
| │ │ shared memory │ pipes │ | ||
| │ ▼ │ │ | ||
| │ ┌─────────────────────────────────────────────────────────────────┐│ | ||
| │ │ Shared Memory Queues ││ | ||
| │ │ ┌─────────────────┐ ┌─────────────────────┐ ││ | ||
| │ │ │ Job Queue │ │ Result Queue │ ││ | ||
| │ │ │ (MAX_JOBS=100) │ │ (MAX_RESULTS=100) │ ││ | ||
| │ │ └─────────────────┘ └─────────────────────┘ ││ | ||
| │ └─────────────────────────────────────────────────────────────────┘│ | ||
| └────────────────────────────────────────────────────────────────────┘ | ||
| │ ▲ | ||
| │ fork() │ | ||
| ▼ │ | ||
| ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ | ||
| │ Worker 0 │ │ Worker 1 │ │ Worker N │ | ||
| │ ┌───────────┐ │ │ ┌───────────┐ │ │ ┌───────────┐ │ | ||
| │ │ GPU Solve │ │ │ │ GPU Solve │ │ │ │ GPU Solve │ │ | ||
| │ └───────────┘ │ │ └───────────┘ │ │ └───────────┘ │ | ||
| │ (separate proc)│ │ (separate proc)│ │ (separate proc)│ | ||
| └─────────────────┘ └─────────────────┘ └─────────────────┘ | ||
| ``` |
Add fence languages to code blocks to satisfy markdownlint.
Several fenced blocks are untyped; add explicit languages (e.g., text, cpp, bash) to resolve MD040.
Also applies to: 105-120, 136-151
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)
[warning] 16-16: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@SERVER_ARCHITECTURE.md` around lines 15 - 46, The markdown has untyped fenced
code blocks (the ASCII diagram block and other fenced regions) which trigger
MD040; update each triple-backtick fence (including the large ASCII diagram
block shown and the other blocks noted) to include an explicit language token
such as text (e.g., change ``` to ```text) or another appropriate language
(bash/cpp) so markdownlint MD040 is satisfied across the file.
Actionable comments posted: 5
♻️ Duplicate comments (24)
cpp/src/grpc/cuopt_remote_service.proto (1)
314-325: ⚠️ Potential issue | 🟠 Major

Use one job_id scalar type across service messages.

Line 315 and Line 324 use bytes, while most service request/response messages use string for job_id. This inconsistency creates avoidable cross-language conversion edge cases.

Schema alignment sketch

```diff
 message IncumbentRequest {
-  bytes job_id = 1;
+  string job_id = 1;
   int64 from_index = 2;
   int32 max_count = 3;
 }
 message Incumbent {
   int64 index = 1;
   double objective = 2;
   repeated double assignment = 3;
-  bytes job_id = 4;
+  string job_id = 4;
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/cuopt_remote_service.proto` around lines 314 - 325, The IncumbentRequest.job_id and Incumbent.job_id fields use bytes while the rest of the API uses string for job_id; change both fields in messages IncumbentRequest and Incumbent from bytes to string to ensure a single scalar type for job_id across the service, and update any serialization/usage code that expects bytes to accept a string (or perform explicit encoding/decoding at the callsite) so cross-language conversions remain consistent.GRPC_ARCHITECTURE.md (1)
9-10: ⚠️ Potential issue | 🟡 Minor

Scope the serialization claim to gRPC wire payloads.

Line 9-Line 10 is too absolute; local pipe framing helpers exist. Please clarify this is specifically about gRPC wire serialization.
Line 9-Line 10 is too absolute; local pipe framing helpers exist. Please clarify this is specifically about gRPC wire serialization.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@GRPC_ARCHITECTURE.md` around lines 9 - 10, The statement "All serialization uses protocol buffers generated by `protoc` and `grpc_cpp_plugin` — no custom serialization logic is implemented." is too absolute; update it to explicitly scope serialization to gRPC wire payloads (e.g., "All gRPC wire payload serialization uses protocol buffers generated by `protoc` and `grpc_cpp_plugin`") and add a brief note acknowledging that local pipe framing helpers (custom framing/unframing code) are used outside the gRPC wire format; keep the original wording's intent but clarify that custom framing helpers exist for local pipes.SERVER_ARCHITECTURE.md (1)
15-46: ⚠️ Potential issue | 🟡 Minor

Add explicit fence languages to these code blocks.

markdownlint MD040 is still triggered by untyped fences in these sections. Use ```text for ASCII diagrams and table-like flows.

Also applies to: 110-118, 122-135, 139-148, 154-168, 196-198, 207-217
markdownlintMD040 is still triggered by untyped fences in these sections. Use```textfor ASCII diagrams and table-like flows.Also applies to: 110-118, 122-135, 139-148, 154-168, 196-198, 207-217
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@SERVER_ARCHITECTURE.md` around lines 15 - 46, The ASCII-art code fences (e.g., the block starting with the "Main Server Process" diagram that includes "Shared Memory Queues", "Job Queue", "Result Queue", and worker boxes) are untyped and trigger markdownlint MD040; update each such fence to use an explicit text language (replace ``` with ```text) for that diagram and the other similar untyped blocks (the blocks containing table/flow-style ASCII diagrams and blocks starting with the same diagram-like contents referenced in the comment). Ensure every ASCII/table-like fenced block (including the ones around the ranges called out in the review) is changed to ```text so markdownlint stops flagging MD040.cpp/src/grpc/server/grpc_worker_infra.cpp (1)
165-168: ⚠️ Potential issue | 🔴 Critical

Validate PID and handle EINTR in wait_for_workers().

The reap loop at Line 165-Line 168 should skip non-positive PIDs and retry interrupted waits; otherwise waitpid(0, ...)/signal interruption can lead to incorrect child reaping behavior.

Suggested fix

```diff
 void wait_for_workers()
 {
   for (pid_t pid : worker_pids) {
+    if (pid <= 0) continue;
     int status;
-    waitpid(pid, &status, 0);
+    while (waitpid(pid, &status, 0) < 0 && errno == EINTR) {}
   }
   worker_pids.clear();
 }
```
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_worker_infra.cpp` around lines 165 - 168, In wait_for_workers(), the reap loop currently calls waitpid(pid, &status, 0) for every pid in worker_pids without validating pid or handling EINTR; update the loop to skip non-positive PIDs (pid <= 0) and wrap waitpid(pid, &status, 0) in a retry loop that re-invokes waitpid when it returns -1 with errno == EINTR, and only break on a successful return or an error other than EINTR; reference the worker_pids iteration, the pid variable, the waitpid call and the status/errno checks when making the change.cpp/src/grpc/cuopt_remote.proto (1)
6-6: ⚠️ Potential issue | 🟠 Major

Align proto package and directory to satisfy Buf lint.

package cuopt.remote at Line 6 does not match the current file directory (cpp/src/grpc), which triggers PACKAGE_DIRECTORY_MATCH.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/cuopt_remote.proto` at line 6, The proto package declaration in cuopt_remote.proto ("package cuopt.remote;") does not match the file's directory and triggers PACKAGE_DIRECTORY_MATCH; update the package line in cuopt_remote.proto to reflect the directory structure (e.g., "package cpp.src.grpc;" or another package path that mirrors the directory segments) so the package-to-directory mapping aligns, then re-run buf lint to verify the PACKAGE_DIRECTORY_MATCH error is resolved.cpp/src/grpc/server/grpc_server_main.cpp (5)
157-167: ⚠️ Potential issue | 🔴 Critical

Replace raw memset on queue entries that include atomics.

JobQueueEntry/ResultQueueEntry contain std::atomic members; byte-wise zeroing those objects is undefined behavior.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 157 - 167, The code currently zeroes JobQueueEntry and ResultQueueEntry with memset (affecting job_queue and result_queue), which is undefined because those structs contain std::atomic members; replace the memset calls by value-initializing or explicitly initializing each member (e.g., construct entries using JobQueueEntry() or assign default-constructed instances) and then set the atomic flags via their store(...) calls (retain the existing store(false)/store(-1) lines for ready/claimed/cancelled/worker_index and retrieved), ensuring no byte-wise writes touch atomics; update the initialization loop to use assignment or placement-new of JobQueueEntry/ResultQueueEntry rather than memset to avoid UB.
50-80: ⚠️ Potential issue | 🟠 Major

Reject unknown or incomplete CLI flags explicitly.
The parser currently falls through for unrecognized flags and for value-requiring flags missing a value, so the server can continue with unintended defaults.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 50 - 80, The CLI parser in the main argument loop accepts unknown flags and silently ignores missing values for flags that require them; update the argument parsing logic (the for-loop handling argv in grpc_server_main.cpp and the handling of flags like "--port"/"--workers"/"--max-message-mb"/"--max-message-bytes"/"--tls-cert"/"--tls-key"/"--tls-root") so that when a flag that requires a value is encountered and i+1 >= argc you call print_usage(argv[0]) (or log an explicit error) and return a non-zero exit code, and when an unrecognized argument starting with '-' is seen you also call print_usage(argv[0]) (or emit an error) and return non-zero—this ensures unknown or incomplete CLI flags are rejected instead of silently falling back to defaults.
89-90: ⚠️ Potential issue | 🔴 Critical

Do not wire POSIX signals to the current async-unsafe handler.

signal_handler (in cpp/src/grpc/server/grpc_server_types.hpp) uses operations like condition-variable notification from signal context, which is not async-signal-safe.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 89 - 90, The current wiring of POSIX signals via signal(SIGINT, signal_handler) / signal(SIGTERM, signal_handler) is unsafe because signal_handler performs async-unsafe operations (e.g., condition-variable notification); instead, change to a safe signal handling pattern: install a minimal async-signal-safe handler that only sets a std::sig_atomic_t or std::atomic_bool shutdown flag (or writes a byte to a dedicated self-pipe) and return, and move any condition-variable notifications or complex shutdown logic out of signal_handler into the main thread or a dedicated watcher that polls the flag or reads the pipe; update references in grpc_server_main.cpp that call signal(...) and the implementation of signal_handler in grpc_server_types.hpp to follow this pattern so all non-async-safe actions occur off the signal context.
234-251: ⚠️ Potential issue | 🔴 Critical

Guard BuildAndStart() failure before calling server->Wait().

BuildAndStart() can return null, but Line 250 unconditionally dereferences server.
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 234 - 251, BuildAndStart() may return nullptr so avoid unguarded dereference of server; after calling std::unique_ptr<Server> server(builder.BuildAndStart()), check if (!server) and handle the failure (log an error via std::cerr or similar and return/exit with non-zero) before starting the shutdown_thread or calling server->Wait(); also ensure the shutdown_thread lambda still checks server before calling server->Shutdown() and join the thread if started (use a flag to indicate thread started or create the thread only when server is valid).
185-212: ⚠️ Potential issue | 🔴 Critical

TLS error branches leak workers/threads/shared memory after startup.

These early returns execute after spawn_workers() and thread creation, so teardown is skipped on TLS configuration/read failures.
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 185 - 212, The TLS error branches return early after spawn_workers()/thread/shared-memory setup and thus leak resources; either move the TLS validation block to run before spawn_workers(), or ensure every early-return path calls the teardown routines (e.g., shutdown_workers(), join_worker_threads(), release_shared_memory() or whatever the existing cleanup functions are) before returning. Locate the TLS block (uses config, ssl_opts, read_file_to_string) and change it so that validation happens prior to calling spawn_workers(), or add calls to the project's worker/shutdown/cleanup functions immediately before each return in this block to properly stop threads and free shared memory.cpp/src/grpc/server/grpc_pipe_serialization.hpp (2)
90-90: ⚠️ Potential issue | 🔴 Critical

Return failure on malformed chunk parse instead of breaking and succeeding.

A parse error currently breaks the loop and still returns true, which accepts partial/corrupt payloads as valid.

🛠️ Minimal fix

```diff
-  if (!google::protobuf::util::ParseDelimitedFromCodedStream(&ac, &coded, &clean_eof)) break;
+  if (!google::protobuf::util::ParseDelimitedFromCodedStream(&ac, &coded, &clean_eof)) {
+    if (clean_eof) { break; }
+    return false;
+  }
 ...
-  if (!google::protobuf::util::ParseDelimitedFromCodedStream(&ac, &coded, &clean_eof)) break;
+  if (!google::protobuf::util::ParseDelimitedFromCodedStream(&ac, &coded, &clean_eof)) {
+    if (clean_eof) { break; }
+    return false;
+  }
```

As per coding guidelines "Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files."
Also applies to: 148-148
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` at line 90, The loop currently treats any false from google::protobuf::util::ParseDelimitedFromCodedStream(&ac, &coded, &clean_eof) as a harmless break and ultimately returns true, which accepts corrupt partial payloads; change the logic so that when ParseDelimitedFromCodedStream returns false you check clean_eof and only break if clean_eof is true (clean EOF), otherwise immediately return false to signal a malformed chunk error; update this behavior for the ParseDelimitedFromCodedStream call at the shown line and the duplicate occurrence (around the other occurrence near line 148) so malformed parses fail instead of succeeding.
93-95: ⚠️ Potential issue | 🟠 Major

Validate total_elements size math before resize().

The current resize paths trust unvalidated total_elements (and multiplication with elem_size), which can overflow or trigger excessive allocation on malformed input.
Also applies to: 151-154
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` around lines 93 - 95, The resize call trusts ac.total_elements() (and the later elem_size * total_elements math) and can overflow or cause huge allocations; before calling dest.resize() (both at the dest.empty() check and the duplicate block at the later occurrence), validate ac.total_elements() and the computed byte_count: cast total_elements to a fixed-width unsigned type (e.g. uint64_t), check total_elements > 0 is within a sane upper bound and that elem_size * total_elements does not overflow size_t (e.g. byte_count = uint64_t(elem_size) * total_elements; if (byte_count > std::numeric_limits<size_t>::max() || byte_count > MAX_ALLOWED_BYTES) return/error), then perform resize using the checked size_t value; if the validation fails, return an error/handle gracefully instead of resizing.cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp (1)
213-218: ⚠️ Potential issue | 🟠 Major

Synchronize server-log state reads/writes consistently.

server_log_path_, server_log_start_pos_, and test_start_marked_ are accessed without a common lock across setters/readers. This can race when log polling and test setup run concurrently.

Suggested fix pattern

```diff
 void set_server_log_path(const std::string& path)
 {
+  std::lock_guard<std::mutex> lock(mutex_);
   server_log_path_      = path;
   server_log_start_pos_ = 0;
   test_start_marked_    = false;
 }

 std::string get_server_logs(bool since_test_start = true) const
 {
-  if (server_log_path_.empty()) { return ""; }
+  std::string path;
+  std::streampos start_pos = 0;
+  bool marked = false;
+  {
+    std::lock_guard<std::mutex> lock(mutex_);
+    path      = server_log_path_;
+    start_pos = server_log_start_pos_;
+    marked    = test_start_marked_;
+  }
+  if (path.empty()) { return ""; }
-  std::ifstream file(server_log_path_);
+  std::ifstream file(path);
   if (!file.is_open()) { return ""; }
-  if (since_test_start && test_start_marked_ && server_log_start_pos_ > 0) {
-    file.seekg(server_log_start_pos_);
+  if (since_test_start && marked && start_pos > 0) {
+    file.seekg(start_pos);
   }
```

As per coding guidelines: "Prevent thread-unsafe use of global and static variables; use proper mutex/synchronization in server code accessing shared solver state."
Also applies to: 229-244, 332-355
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp` around lines 213 - 218, The three server-log fields (server_log_path_, server_log_start_pos_, test_start_marked_) are accessed without synchronization; modify set_server_log_path to acquire a dedicated mutex (e.g., server_log_mutex_) and set all three fields while holding the lock, and update all other readers/writers that touch these members (the other setter/getter/polling functions referenced in the diff ranges) to also lock the same mutex when reading or writing those fields to prevent races.cpp/src/grpc/server/grpc_job_management.cpp (2)
109-113: ⚠️ Potential issue | 🟠 Major

Cap incumbent payload size before resizing the buffer.

This path trusts the size prefix and directly calls data.resize(size). A malformed payload can force large allocations and exhaust memory.

Suggested guardrail

```diff
 bool recv_incumbent_pipe(int fd, std::vector<uint8_t>& data)
 {
   uint64_t size;
   if (!read_from_pipe(fd, &size, sizeof(size))) return false;
+  if (size > static_cast<uint64_t>(config.max_message_bytes)) {
+    std::cerr << "[Worker] Incumbent payload too large: " << size << "\n";
+    return false;
+  }
   data.resize(size);
   if (size > 0 && !read_from_pipe(fd, data.data(), size)) return false;
   return true;
 }
```

As per coding guidelines: "Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_job_management.cpp` around lines 109 - 113, The code currently trusts the size prefix and calls data.resize(size) after read_from_pipe; introduce a validated upper bound (e.g., a constexpr size_t kMaxJobPayloadSize) and check that size is non-negative and <= kMaxJobPayloadSize before calling data.resize(size) and before the subsequent read_from_pipe; if the size is out of range, return false (or handle as a protocol error) to prevent unbounded allocation and potential memory exhaustion. Use the existing symbols size, data.resize, and read_from_pipe to locate and modify the logic.
147-158: ⚠️ Potential issue | 🔴 Critical

Fix racy queue-slot reservation in async submit.
Slot selection currently performs a check-then-write on shared queue state, so concurrent submissions can reserve the same slot.
Suggested approach (atomic claim)
```diff
 for (size_t i = 0; i < MAX_JOBS; ++i) {
-  if (!job_queue[i].ready && !job_queue[i].claimed) {
+  if (job_queue[i].ready.load()) { continue; }
+  bool expected = false;
+  if (!job_queue[i].claimed.compare_exchange_strong(expected, true)) { continue; }
   copy_cstr(job_queue[i].job_id, job_id);
   job_queue[i].problem_type = is_mip ? 1 : 0;
   ...
-  job_queue[i].claimed   = false;
-  job_queue[i].cancelled = false;
-  job_queue[i].ready     = true;
+  job_queue[i].cancelled.store(false);
+  job_queue[i].ready.store(true, std::memory_order_release);
+  job_queue[i].claimed.store(false, std::memory_order_release);
```

As per coding guidelines: "Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_job_management.cpp` around lines 147 - 158, The loop in submit that does check-then-write on job_queue slots is racy: change reservation to an atomic claim (e.g., make job_queue[].claimed an std::atomic<bool> or std::atomic_flag) and perform an atomic test-and-set (compare_exchange_strong or test_and_set) to reserve the slot before writing any fields; once the atomic claim succeeds, initialize job_queue[i].job_id, problem_type, data_size, worker_* etc., then set job_queue[i].ready = true (non-atomic write) only after fully initialized; ensure headers (<atomic>) and any needed memory-ordering are used and that no other code resets claimed without clearing ready in the reverse order.cpp/src/grpc/client/solve_remote.cu (2)
53-57: ⚠️ Potential issue | 🟠 Major

Guard invalid env overrides before applying gRPC sizes.

CUOPT_CHUNK_SIZE and CUOPT_MAX_MESSAGE_BYTES are applied without sanity checks. Zero/negative values can break transfer behavior at runtime.

Suggested fix

```diff
 static void apply_env_overrides(grpc_client_config_t& config)
 {
-  config.chunk_size_bytes  = parse_env_int64("CUOPT_CHUNK_SIZE", config.chunk_size_bytes);
-  config.max_message_bytes = parse_env_int64("CUOPT_MAX_MESSAGE_BYTES", config.max_message_bytes);
+  const auto chunk_size = parse_env_int64("CUOPT_CHUNK_SIZE", config.chunk_size_bytes);
+  const auto max_msg    = parse_env_int64("CUOPT_MAX_MESSAGE_BYTES", config.max_message_bytes);
+  if (chunk_size > 0) { config.chunk_size_bytes = chunk_size; }
+  if (max_msg > 0) { config.max_message_bytes = max_msg; }
   config.enable_debug_log = (parse_env_int64("CUOPT_GRPC_DEBUG", 0) != 0);
```

As per coding guidelines: "Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/solve_remote.cu` around lines 53 - 57, The env overrides in apply_env_overrides currently accept any parsed value from parse_env_int64 for chunk_size_bytes and max_message_bytes; update apply_env_overrides to validate the returned values before assignment: call parse_env_int64 for "CUOPT_CHUNK_SIZE" and "CUOPT_MAX_MESSAGE_BYTES", then only assign to config.chunk_size_bytes and config.max_message_bytes if the parsed value is > 0 and within sane bounds (e.g., enforce a minimum of 1 and a sensible upper cap to avoid resource exhaustion); leave enable_debug_log assignment using parse_env_int64 as-is. Ensure you reference apply_env_overrides, config.chunk_size_bytes, config.max_message_bytes, and parse_env_int64 when making the change.
163-166: ⚠️ Potential issue | 🟠 Major

Forward the real best bound to incumbent callbacks.

The current forwarding path sets bound_copy from the incumbent objective, which can report an artificial zero gap to user callbacks. This should carry the solver's actual best bound from the remote pipeline.
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/solve_remote.cu` around lines 163 - 166, The code currently sets bound_copy from objective, so get_callback->get_solution(...) forwards the incumbent objective as the bound and can report a zero gap; change the initialization of bound_copy to use the solver’s actual best-bound value from the remote pipeline (replace "bound_copy = objective" with the variable that holds the solver's best bound), then call get_callback->get_solution(solution_copy.data(), &obj_copy, &bound_copy, callback->get_user_data()); this ensures callbacks receive the real best bound instead of the incumbent objective.cpp/src/grpc/grpc_settings_mapper.cu (1)
144-146: ⚠️ Potential issue | 🟠 Major

Range-check iteration_limit before narrowing cast.

A large protobuf iteration_limit (int64_t) can overflow when cast to i_t (which may be int32_t), producing an invalid internal limit.

🔧 Suggested guard

```diff
 // proto3 defaults numeric fields to 0; treat negative iteration_limit as "unset"
 // so the server keeps the library default (typically max()).
 if (pb_settings.iteration_limit() >= 0) {
-  settings.iteration_limit = static_cast<i_t>(pb_settings.iteration_limit());
+  const auto limit = pb_settings.iteration_limit();
+  if (limit <= static_cast<int64_t>(std::numeric_limits<i_t>::max())) {
+    settings.iteration_limit = static_cast<i_t>(limit);
+  } else {
+    settings.iteration_limit = std::numeric_limits<i_t>::max();
+  }
 }
```

As per coding guidelines "Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/grpc_settings_mapper.cu` around lines 144 - 146, The narrowing cast from pb_settings.iteration_limit() to settings.iteration_limit (type i_t) can overflow; before the static_cast in the block that sets settings.iteration_limit, range-check the int64 value against the numeric limits of i_t and either clamp to std::numeric_limits<i_t>::max()/min() or return/report an error if out-of-range—update the code around the check of pb_settings.iteration_limit() to perform this validation and only then assign via static_cast<i_t>.cpp/src/grpc/client/grpc_client.cu (4)
258-266: ⚠️ Potential issue | 🔴 Critical

Add null stub guard and deadline to RPC methods.

Public RPC methods like check_status dereference impl_->stub without checking if a connection was established. Additionally, the ClientContext lacks a deadline, allowing indefinite blocking on network issues. The connection probe at line 191 correctly demonstrates both patterns.

🔧 Proposed fix pattern (apply to all RPC methods)

```diff
 job_status_result_t grpc_client_t::check_status(const std::string& job_id)
 {
   job_status_result_t result;
+  if (!impl_->stub) {
+    result.error_message = "Not connected to server";
+    return result;
+  }
   grpc::ClientContext context;
+  if (config_.timeout_seconds > 0) {
+    context.set_deadline(std::chrono::system_clock::now() +
+                         std::chrono::seconds(config_.timeout_seconds));
+  }
   auto request = build_status_request(job_id);
```

Apply the same pattern to: wait_for_completion, cancel_job, delete_job, get_incumbents, stream_logs, and all chunked upload/download methods.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/grpc_client.cu` around lines 258 - 266, The method grpc_client_t::check_status currently dereferences impl_->stub and creates a grpc::ClientContext without a deadline; add a null-stub guard at the start (check impl_ and impl_->stub and return a job_status_result_t error/result indicating "not connected") before calling build_status_request/CheckStatus, and set a deadline on the grpc::ClientContext (use std::chrono::... now()+configured RPC timeout) so the RPC won't block indefinitely; follow the same pattern for wait_for_completion, cancel_job, delete_job, get_incumbents, stream_logs and all chunked upload/download methods, using the same error/result handling style used by the connection probe at line 191.
991-995: ⚠️ Potential issue | 🔴 Critical

Validate poll_interval_ms before division.

Line 994 computes max_polls by dividing by config_.poll_interval_ms. If this value is <= 0, the division will either crash or produce invalid polling behavior. The same issue exists at line 1132.

🔧 Proposed validation

```diff
 } else {
   CUOPT_LOG_INFO("[grpc_client] Using polling (CheckStatus) for job %s", job_id.c_str());
+  if (config_.poll_interval_ms <= 0) {
+    stop_log_streaming();
+    result.error_message = "Invalid configuration: poll_interval_ms must be > 0";
+    return result;
+  }
   int poll_count = 0;
   int max_polls  = (config_.timeout_seconds * 1000) / config_.poll_interval_ms;
```
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/grpc_client.cu` around lines 991 - 995, The code computes max_polls by dividing by config_.poll_interval_ms without validating it; add a guard in the polling branch (where CUOPT_LOG_INFO logs "Using polling (CheckStatus) for job" and any other similar polling block around the second occurrence) to check config_.poll_interval_ms > 0 and handle invalid values: log an error via CUOPT_LOG_ERROR (include job_id/context), set a safe default (e.g., poll_interval_ms = 1) or return/fail early, then compute max_polls using the validated/clamped value; update both locations that reference config_.poll_interval_ms to use the validated variable.
702-712: ⚠️ Potential issue | 🔴 Critical

Validate chunked result metadata before arithmetic operations.

element_size_bytes is used at line 708 without checking for <= 0, causing division by zero. Additionally, total_elems * elem_size at line 711 can overflow before allocation, making subsequent memcpy operations unsafe.

🔧 Proposed bounds validation

```diff
 for (const auto& arr_desc : header->arrays()) {
   auto field_id       = arr_desc.field_id();
   int64_t total_elems = arr_desc.total_elements();
   int64_t elem_size   = arr_desc.element_size_bytes();
   if (total_elems <= 0) continue;
+  if (elem_size <= 0) {
+    last_error_ = "Invalid chunk metadata: non-positive element_size_bytes";
+    return false;
+  }
+  // Check for overflow before allocation
+  if (total_elems > std::numeric_limits<int64_t>::max() / elem_size) {
+    last_error_ = "Invalid chunk metadata: total byte size overflow";
+    return false;
+  }
+  int64_t total_bytes = total_elems * elem_size;
+  if (static_cast<uint64_t>(total_bytes) > std::numeric_limits<size_t>::max()) {
+    last_error_ = "Invalid chunk metadata: exceeds addressable memory";
+    return false;
+  }
   int64_t elems_per_chunk = chunk_data_budget / elem_size;
```

As per coding guidelines "Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/grpc_client.cu` around lines 702 - 712, The loop over header->arrays() uses arr_desc.element_size_bytes() and multiplies it by arr_desc.total_elements() without validation; add checks that element_size_bytes() > 0 (return/continue or log and skip if not) before computing elems_per_chunk to avoid division by zero, compute elems_per_chunk = max(1, chunk_data_budget / elem_size) only after that check, and validate that total_elements() * element_size_bytes() cannot overflow size_t before allocating array_bytes (e.g., if total_elems > SIZE_MAX / elem_size then handle as error/skip or cap/stream allocation); reference arr_desc.element_size_bytes(), arr_desc.total_elements(), chunk_data_budget, elems_per_chunk, and the allocation of array_bytes to implement these guards.
221-252:⚠️ Potential issue | 🔴 CriticalLog streaming thread can hang indefinitely on shutdown.
The
stop_log_streaming()method sets thestop_logs_flag and joins the thread, butreader->Read()at line 232 blocks indefinitely if the server stops sending messages. TheTryCancel()call at line 234 only executes afterRead()returns, not before. You need to callcontext.TryCancel()from the main thread before joining.🔧 Proposed fix using context cancellation from stop_log_streaming
Add member variables to the class (in grpc_client.hpp):
```cpp
std::mutex log_context_mutex_;
grpc::ClientContext* active_log_context_ = nullptr;
```

Then update the methods:

```diff
 void grpc_client_t::start_log_streaming(const std::string& job_id)
 {
   if (!config_.stream_logs || !config_.log_callback) return;
   stop_logs_.store(false);
   log_thread_ = std::make_unique<std::thread>([this, job_id]() {
     grpc::ClientContext context;
+    {
+      std::lock_guard<std::mutex> lk(log_context_mutex_);
+      active_log_context_ = &context;
+    }
     auto request = build_stream_logs_request(job_id, 0);
     auto reader  = impl_->stub->StreamLogs(&context, request);
     cuopt::remote::LogMessage log_msg;
     while (reader->Read(&log_msg)) {
       if (stop_logs_.load()) {
-        context.TryCancel();
         break;
       }
       if (config_.log_callback) { config_.log_callback(log_msg.line()); }
       if (log_msg.job_complete()) { break; }
     }
     reader->Finish();
+    {
+      std::lock_guard<std::mutex> lk(log_context_mutex_);
+      active_log_context_ = nullptr;
+    }
   });
 }

 void grpc_client_t::stop_log_streaming()
 {
   stop_logs_.store(true);
+  {
+    std::lock_guard<std::mutex> lk(log_context_mutex_);
+    if (active_log_context_) { active_log_context_->TryCancel(); }
+  }
   if (log_thread_ && log_thread_->joinable()) { log_thread_->join(); }
   log_thread_.reset();
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/grpc_client.cu` around lines 221 - 252, The log streaming thread can block in reader->Read() so add a cancellable context pointer protected by a mutex: introduce members std::mutex log_context_mutex_ and grpc::ClientContext* active_log_context_ = nullptr; in grpc_client_t, then in start_log_streaming() create the local grpc::ClientContext context as before but immediately register it by locking log_context_mutex_ and setting active_log_context_ = &context before calling impl_->stub->StreamLogs(...); after the read/Finish() loop clear the pointer (lock and set to nullptr). In stop_log_streaming() lock log_context_mutex_, if active_log_context_ != nullptr call active_log_context_->TryCancel() (to wake up reader->Read()) before joining the thread, then proceed to join/reset log_thread_ and set stop_logs_. Ensure all accesses to active_log_context_ are mutex-protected to avoid race conditions and that the pointer only points to the stack context while the thread is running.

cpp/src/grpc/server/grpc_server_types.hpp (1)
300-308: ⚠️ Potential issue | 🔴 Critical
Remove non-async-signal-safe operations from signal handler.

`std::cout` and `condition_variable::notify_all()` are not async-signal-safe and can deadlock or corrupt state when called from a POSIX signal handler. The signal handler should only perform atomic writes.

🔧 Suggested approach

```diff
 inline void signal_handler(int signal)
 {
   if (signal == SIGINT || signal == SIGTERM) {
-    std::cout << "\n[gRPC Server] Received shutdown signal\n";
     keep_running = false;
     if (shm_ctrl) { shm_ctrl->shutdown_requested = true; }
-    result_cv.notify_all();
   }
 }
```

Move logging and `result_cv.notify_all()` to a dedicated shutdown-watcher thread that polls `keep_running == false` and then performs the cleanup.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_types.hpp` around lines 300 - 308, The signal_handler currently performs non-async-signal-safe operations (std::cout and result_cv.notify_all()), which must be removed; change signal_handler to only perform atomic-safe actions: set keep_running = false and, if shm_ctrl exists, set shm_ctrl->shutdown_requested = true (both already present), and return immediately; then implement a separate shutdown-watcher thread (e.g., spawned at server startup) that polls or waits on an atomic/condition to detect keep_running == false and from that thread perform logging (std::cout or logging framework), call result_cv.notify_all(), and any other cleanup work — locate signal_handler, keep_running, shm_ctrl, and result_cv to update behavior and add the watcher thread to the server lifecycle.

cpp/src/grpc/grpc_problem_mapper.cu (1)
399-427: ⚠️ Potential issue | 🔴 Critical
Add alignment validation before `memcpy` in chunked array decoding.

The `get_doubles` and `get_ints` lambdas compute element count via integer division but then copy the full byte array. If the payload size isn't aligned to element size, `memcpy` will copy garbage or partial data.

🔧 Proposed alignment checks

```diff
 auto get_doubles = [&](int32_t field_id) -> std::vector<f_t> {
   auto it = arrays.find(field_id);
   if (it == arrays.end() || it->second.empty()) return {};
+  if (it->second.size() % sizeof(double) != 0) return {};  // Misaligned data
   size_t n = it->second.size() / sizeof(double);
   if constexpr (std::is_same_v<f_t, double>) {
     std::vector<double> v(n);
-    std::memcpy(v.data(), it->second.data(), it->second.size());
+    std::memcpy(v.data(), it->second.data(), n * sizeof(double));
     return v;
   } else {
     // ...
   }
 };

 auto get_ints = [&](int32_t field_id) -> std::vector<i_t> {
   auto it = arrays.find(field_id);
   if (it == arrays.end() || it->second.empty()) return {};
+  if (it->second.size() % sizeof(int32_t) != 0) return {};  // Misaligned data
   size_t n = it->second.size() / sizeof(int32_t);
   // ...
 };
```

As per coding guidelines "Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/grpc_problem_mapper.cu` around lines 399 - 427, The lambdas get_doubles and get_ints must validate byte alignment before memcpy: compute size_t bytes = it->second.size(); if bytes == 0 return {}; if (bytes % sizeof(double) != 0) (for get_doubles) or bytes % sizeof(int32_t) != 0 (for get_ints) then handle the misaligned payload (log/return error/throw) instead of blindly memcpy; use size_t n = bytes / sizeof(...) and pass bytes (or n * sizeof(...)) to memcpy rather than it->second.size(); update references in get_doubles/get_ints and ensure the branch where types differ still copies only the aligned byte count and converts only n elements.
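The alignment guard can be shown in isolation. A minimal sketch with a hypothetical `bytes_to_doubles` helper (not the mapper's real signature): reject payloads whose byte length is not a whole multiple of the element size before copying.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: decode a raw byte payload into doubles, returning an
// empty vector when the payload is empty or not a whole number of elements.
std::vector<double> bytes_to_doubles(const std::vector<uint8_t>& raw)
{
  if (raw.empty() || raw.size() % sizeof(double) != 0) return {};
  std::vector<double> out(raw.size() / sizeof(double));
  // Copy exactly n * sizeof(double) bytes, never the raw size directly.
  std::memcpy(out.data(), raw.data(), out.size() * sizeof(double));
  return out;
}
```

In the real mapper the empty return would be paired with an error status so the caller can distinguish "no data" from "corrupt data".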
🧹 Nitpick comments (3)
cpp/src/grpc/server/grpc_field_element_size.hpp (1)
17-29: Avoid silent fallback for unknown `ArrayFieldId`.

At Line 28, `default: return 8;` can mask unsupported/new enum values and mis-size chunk parsing. Prefer an explicit invalid path so callers can reject the request. As per coding guidelines: "Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files."

Suggested change

```diff
 inline int64_t array_field_element_size(cuopt::remote::ArrayFieldId field_id) {
   switch (field_id) {
 @@
-    default: return 8;
+    default: return -1;  // unknown field id; caller should fail validation
   }
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_field_element_size.hpp` around lines 17 - 29, The function array_field_element_size(cuopt::remote::ArrayFieldId) silently falls back to return 8 for unknown enum values; change it to explicitly handle unsupported/new enums by logging/propagating an error or failing fast (e.g., throw std::invalid_argument or assert) instead of returning 8 so callers can reject invalid requests and avoid mis-sized parsing; update array_field_element_size to remove the silent default and return/throw a clear failure for unexpected cuopt::remote::ArrayFieldId values and ensure callers handle that error path.

build.sh (1)
7-7: Enable unbound-variable checks in the script preamble.

At Line 7, prefer `set -eu` to catch unset variable usage early. Based on learnings: "In this repository, prefer using 'set -u' in Bash scripts to detect unbound variables and rely on the default unbound-variable error messages rather than implementing explicit guards with custom error messages."

Suggested change

```diff
-set -e
+set -eu
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@build.sh` at line 7, The script preamble currently uses "set -e"; change this to enable unbound-variable checks by replacing that occurrence with "set -eu" (i.e., add the -u flag) so the shell will error on unset variables; audit any places that intentionally rely on unset variables and initialize or guard them accordingly (locate the "set -e" token in build.sh and update it).

cpp/src/grpc/server/grpc_server_threads.cpp (1)
166-182: Deserialize outside `tracker_mutex` to reduce contention.

`deserialize_result_pipe_blob(...)` can be expensive; parsing before taking the lock and only committing parsed state under lock will improve concurrency.
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_threads.cpp` around lines 166 - 182, deserialize_result_pipe_blob(...) is done while holding tracker_mutex causing unnecessary contention; move the expensive deserialize step out of the critical section by first copying result_data and success locally, call deserialize_result_pipe_blob(result_data.data(), result_data.size(), hdr, arrays) before acquiring tracker_mutex, then acquire lock(tracker_mutex), find job_tracker entry by job_id, verify it still exists and update it: set status to COMPLETED or FAILED (JobStatus), set error_message on failure, or assign result_header, result_arrays and result_size_bytes on success; ensure hdr and arrays are moved into it->second to avoid copying and that result_size_bytes is set from result_data.size().
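The parse-then-commit pattern described in this nitpick can be sketched independently of the server's real types (the struct and decode helper below are assumed stand-ins, not cuOpt code): do the expensive decode before taking the mutex, and hold the lock only for the short publish step.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Assumed stand-ins for the server's tracker entry and decode step.
struct JobEntry {
  bool completed = false;
  std::vector<int> result;
};

std::map<std::string, JobEntry> job_tracker;
std::mutex tracker_mutex;

// Placeholder for deserialize_result_pipe_blob(...); imagine it is expensive.
std::vector<int> expensive_decode(const std::vector<int>& blob) { return blob; }

void commit_result(const std::string& job_id, const std::vector<int>& blob)
{
  // 1) Decode outside the lock so other threads are not blocked meanwhile.
  std::vector<int> parsed = expensive_decode(blob);

  // 2) Take the lock only to publish the already-parsed state.
  std::lock_guard<std::mutex> lock(tracker_mutex);
  auto it = job_tracker.find(job_id);
  if (it == job_tracker.end()) return;  // job may have been removed meanwhile
  it->second.result    = std::move(parsed);
  it->second.completed = true;
}
```

The re-lookup under the lock matters: because the decode ran unlocked, the tracker entry may have disappeared (e.g. cancellation) between the two steps.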
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: d12476d8-baa0-42cb-9006-1d214ffa2716
📒 Files selected for processing (38)
- GRPC_ARCHITECTURE.md
- SERVER_ARCHITECTURE.md
- build.sh
- cpp/CMakeLists.txt
- cpp/cuopt_grpc_server.cpp
- cpp/src/grpc/client/grpc_client.cu
- cpp/src/grpc/client/grpc_client.hpp
- cpp/src/grpc/client/solve_remote.cu
- cpp/src/grpc/client/test_grpc_client.cpp
- cpp/src/grpc/cuopt_remote.proto
- cpp/src/grpc/cuopt_remote_service.proto
- cpp/src/grpc/grpc_problem_mapper.cu
- cpp/src/grpc/grpc_problem_mapper.hpp
- cpp/src/grpc/grpc_service_mapper.cu
- cpp/src/grpc/grpc_service_mapper.hpp
- cpp/src/grpc/grpc_settings_mapper.cu
- cpp/src/grpc/grpc_settings_mapper.hpp
- cpp/src/grpc/grpc_solution_mapper.cu
- cpp/src/grpc/grpc_solution_mapper.hpp
- cpp/src/grpc/server/grpc_field_element_size.hpp
- cpp/src/grpc/server/grpc_incumbent_proto.hpp
- cpp/src/grpc/server/grpc_job_management.cpp
- cpp/src/grpc/server/grpc_pipe_serialization.hpp
- cpp/src/grpc/server/grpc_server_main.cpp
- cpp/src/grpc/server/grpc_server_threads.cpp
- cpp/src/grpc/server/grpc_server_types.hpp
- cpp/src/grpc/server/grpc_service_impl.cpp
- cpp/src/grpc/server/grpc_worker.cpp
- cpp/src/grpc/server/grpc_worker_infra.cpp
- cpp/src/pdlp/CMakeLists.txt
- cpp/src/pdlp/solve_remote.cu
- cpp/tests/linear_programming/CMakeLists.txt
- cpp/tests/linear_programming/grpc/CMakeLists.txt
- cpp/tests/linear_programming/grpc/grpc_client_test.cpp
- cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp
- cpp/tests/linear_programming/grpc/grpc_integration_test.cpp
- cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp
- python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
💤 Files with no reviewable changes (2)
- cpp/src/pdlp/CMakeLists.txt
- cpp/src/pdlp/solve_remote.cu
🚧 Files skipped from review as they are similar to previous changes (5)
- cpp/src/grpc/server/grpc_incumbent_proto.hpp
- python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
- cpp/src/grpc/server/grpc_service_impl.cpp
- cpp/src/grpc/server/grpc_worker.cpp
- cpp/src/grpc/grpc_solution_mapper.cu
```cmake
if(TARGET protobuf::protoc)
  get_target_property(_PROTOBUF_PROTOC protobuf::protoc IMPORTED_LOCATION_RELEASE)
  if(NOT _PROTOBUF_PROTOC)
    get_target_property(_PROTOBUF_PROTOC protobuf::protoc IMPORTED_LOCATION)
  endif()
else()
  find_package(protobuf CONFIG REQUIRED)
  get_target_property(_PROTOBUF_PROTOC protobuf::protoc IMPORTED_LOCATION_RELEASE)
  if(NOT _PROTOBUF_PROTOC)
    get_target_property(_PROTOBUF_PROTOC protobuf::protoc IMPORTED_LOCATION)
  endif()
endif()

if(NOT _PROTOBUF_PROTOC)
  message(FATAL_ERROR "protoc not found (Protobuf_PROTOC_EXECUTABLE is empty)")
endif()
```
🧩 Analysis chain

Verification scripts inspected cpp/CMakeLists.txt and how `_PROTOBUF_PROTOC` is resolved and consumed. Key findings: `IMPORTED_LOCATION`/`IMPORTED_LOCATION_<CONFIG>` properties are only set on IMPORTED targets, so a non-imported `protobuf::protoc` (built in-tree via FetchContent or `add_subdirectory`) leaves `_PROTOBUF_PROTOC` empty and triggers a false fatal error; `_PROTOBUF_PROTOC` is consumed in `add_custom_command(COMMAND ...)`, a context where generator expressions are evaluated, so `$<TARGET_FILE:protobuf::protoc>` works for both imported and non-imported targets. Verdict: the finding is substantively correct, and the final `if(NOT _PROTOBUF_PROTOC)` error check must remain.
Use $<TARGET_FILE:protobuf::protoc> to resolve protoc across both imported and non-imported target types.
At lines 279–294, the code assumes protobuf::protoc has IMPORTED_LOCATION* properties. This fails when protobuf is built in-tree (e.g., via FetchContent or add_subdirectory), where the target exists but lacks these properties, triggering a false fatal error. The $<TARGET_FILE:…> generator expression resolves the path for both imported and non-imported targets and simplifies the logic.
Suggested CMake adjustment
```diff
 if(TARGET protobuf::protoc)
-  get_target_property(_PROTOBUF_PROTOC protobuf::protoc IMPORTED_LOCATION_RELEASE)
-  if(NOT _PROTOBUF_PROTOC)
-    get_target_property(_PROTOBUF_PROTOC protobuf::protoc IMPORTED_LOCATION)
-  endif()
+  set(_PROTOBUF_PROTOC $<TARGET_FILE:protobuf::protoc>)
 else()
   find_package(protobuf CONFIG REQUIRED)
-  get_target_property(_PROTOBUF_PROTOC protobuf::protoc IMPORTED_LOCATION_RELEASE)
-  if(NOT _PROTOBUF_PROTOC)
-    get_target_property(_PROTOBUF_PROTOC protobuf::protoc IMPORTED_LOCATION)
-  endif()
+  if(TARGET protobuf::protoc)
+    set(_PROTOBUF_PROTOC $<TARGET_FILE:protobuf::protoc>)
+  endif()
 endif()

 if(NOT _PROTOBUF_PROTOC)
   message(FATAL_ERROR "protoc not found (Protobuf_PROTOC_EXECUTABLE is empty)")
 endif()
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/CMakeLists.txt` around lines 279 - 294, The current logic tries to read
IMPORTED_LOCATION/IMPORTED_LOCATION_RELEASE from the protobuf::protoc target and
falsely fails when protobuf is an in-tree target; replace the multi-branch
property lookups by resolving the executable via the generator expression
$<TARGET_FILE:protobuf::protoc> (assign to _PROTOBUF_PROTOC or set
Protobuf_PROTOC_EXECUTABLE) and keep the fallback find_package path only if that
result is empty, finally erroring if the resolved value (e.g. _PROTOBUF_PROTOC
or Protobuf_PROTOC_EXECUTABLE) is still empty; update references to
protobuf::protoc and _PROTOBUF_PROTOC accordingly.
```cpp
} else if (arg == "--time-limit" && arg_idx + 1 < argc) {
  time_limit = std::stod(argv[++arg_idx]);
  arg_idx++;
```
🧩 Analysis chain

Verification scripts located `cpp/src/grpc/client/test_grpc_client.cpp` and confirmed the only numeric parses are `std::stod` for `--time-limit` (around line 402) and `std::stoll` for the incumbent mode index (around line 531), neither of which is wrapped in a try-catch.
Add exception handling for numeric CLI argument parsing.
std::stod and std::stoll throw on invalid input; currently they can terminate this tool abruptly instead of returning a normal CLI error.
Add try-catch blocks around numeric parsing at line 402 (--time-limit) and line 531 (incumbent mode index) to handle std::invalid_argument and std::out_of_range, then return an error code with a descriptive message. Additionally, validate that time_limit is positive before accepting it.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/grpc/client/test_grpc_client.cpp` around lines 401 - 403, Wrap the
numeric CLI parses in main where std::stod is used for "--time-limit" and where
std::stoll is used for the incumbent mode index in try-catch blocks that catch
std::invalid_argument and std::out_of_range; on exception print a descriptive
error to stderr (including the offending argument text and which flag failed)
and return a non-zero exit code instead of allowing an exception to propagate.
After parsing time_limit (the variable assigned from std::stod), validate it is
> 0 and if not print an error and return non-zero. Apply the same guarded
pattern around the incumbent-mode parsing (the std::stoll call) so invalid or
out-of-range inputs are handled gracefully. Ensure messages clearly reference
the flag name ("--time-limit" or the incumbent-mode flag) and the invalid value.
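A sketch of the guarded parse (the helper `parse_positive_double` is hypothetical, not the tool's actual option handler): catch both exception types, reject trailing garbage, and validate positivity before accepting the value.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper for parsing a CLI value such as the --time-limit flag.
// Returns false (instead of throwing) on non-numeric, out-of-range,
// trailing-garbage, or non-positive input.
bool parse_positive_double(const std::string& text, double& out)
{
  try {
    size_t consumed = 0;
    double value = std::stod(text, &consumed);
    if (consumed != text.size()) return false;  // e.g. "1.5x"
    if (value <= 0.0) return false;             // a time limit must be positive
    out = value;
    return true;
  } catch (const std::invalid_argument&) {
    return false;  // no conversion could be performed
  } catch (const std::out_of_range&) {
    return false;  // value outside double's representable range
  }
}
```

In the CLI loop, a false return would print the flag name and the offending text to stderr and exit non-zero, as the review suggests.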
```cpp
if (pid < 0) {
  std::cerr << "[Server] Failed to fork " << (is_replacement ? "replacement worker " : "worker ")
            << worker_id << "\n";
  close_worker_pipes_server(worker_id);
  return -1;
```
🧩 Analysis chain

Verification scripts inspected `cpp/src/grpc/server/grpc_worker_infra.cpp` (pipe creation in `create_worker_pipes()`, the fork call, and the `close_worker_pipes_*` helpers) and confirmed that `close_worker_pipes_server(worker_id)` closes only the parent-end descriptors, so the child-end descriptors leak when `fork()` fails.
Close child-end FDs on fork() failure path.
When fork() fails at line 120, the pipes created at line 113 via create_worker_pipes() remain partially open in the parent. The call to close_worker_pipes_server(worker_id) at line 123 closes only the parent-end descriptors (to_worker_fd, from_worker_fd, incumbent_from_worker_fd). The child-end descriptors (worker_read_fd, worker_write_fd, worker_incumbent_write_fd) remain open and are leaked, since the child process was never created to inherit them.
Add close_worker_pipes_child_ends(worker_id) to properly clean up all file descriptors on fork failure, matching the cleanup pattern already present in the successful path (line 145).
Suggested fix

```diff
 if (pid < 0) {
   std::cerr << "[Server] Failed to fork " << (is_replacement ? "replacement worker " : "worker ")
             << worker_id << "\n";
   close_worker_pipes_server(worker_id);
+  close_worker_pipes_child_ends(worker_id);
   return -1;
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/grpc/server/grpc_worker_infra.cpp` around lines 120 - 124, On fork()
failure in the block handling pid < 0, call
close_worker_pipes_child_ends(worker_id) in addition to the existing
close_worker_pipes_server(worker_id) so the child-end FDs (worker_read_fd,
worker_write_fd, worker_incumbent_write_fd) are closed and not leaked; mirror
the cleanup done in the successful path (the same pattern used around the
fork-success handling) and then return -1 as before.
Actionable comments posted: 13
♻️ Duplicate comments (25)
cpp/src/grpc/server/grpc_job_management.cpp (2)
147-158: ⚠️ Potential issue | 🔴 Critical

`submit_job_async` still uses racy check-then-write slot acquisition.

Lines 147-158 can let concurrent submitters claim/overwrite the same queue slot. Use atomic reservation (`compare_exchange`) and publish ordering.

Proposed fix

```diff
-  for (size_t i = 0; i < MAX_JOBS; ++i) {
-    if (!job_queue[i].ready && !job_queue[i].claimed) {
+  for (size_t i = 0; i < MAX_JOBS; ++i) {
+    if (job_queue[i].ready.load()) { continue; }
+    bool expected_claimed = false;
+    if (job_queue[i].claimed.compare_exchange_strong(expected_claimed, true)) {
       copy_cstr(job_queue[i].job_id, job_id);
       job_queue[i].problem_type = is_mip ? 1 : 0;
       job_queue[i].data_size    = request_data.size();
       job_queue[i].worker_pid   = 0;
       job_queue[i].worker_index = -1;
-      job_queue[i].data_sent = false;
-      job_queue[i].claimed   = false;
-      job_queue[i].cancelled = false;
-      job_queue[i].ready     = true;
+      job_queue[i].data_sent.store(false);
+      job_queue[i].cancelled.store(false);
+      job_queue[i].ready.store(true, std::memory_order_release);
+      job_queue[i].claimed.store(false, std::memory_order_release);
```

As per coding guidelines "Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_job_management.cpp` around lines 147 - 158, submit_job_async currently uses a racy check-then-write loop over job_queue (iterating i from 0..MAX_JOBS) checking job_queue[i].ready and job_queue[i].claimed then writing the slot; change this to atomically reserve a slot using an atomic CAS on a dedicated slot state (e.g., atomically compare_exchange the claimed/empty flag for job_queue[i]) before writing any fields, and use release semantics when publishing ready (store with memory_order_release) and acquire semantics when consumers read (memory_order_acquire); update submit_job_async to attempt compare_exchange on job_queue[i].claimed (or a new atomic state) to transition from empty->reserved, only write the job fields after successful CAS, and then set ready=true with appropriate publish ordering so concurrent submitters cannot overwrite the same slot.
109-113: ⚠️ Potential issue | 🟠 Major

Bound the incumbent payload size before `resize` to prevent memory exhaustion.

Line 111 trusts a pipe-provided size and allocates directly. A malformed size prefix can force unbounded allocation.

Proposed fix

```diff
 bool recv_incumbent_pipe(int fd, std::vector<uint8_t>& data) {
   uint64_t size;
   if (!read_from_pipe(fd, &size, sizeof(size))) return false;
+  if (size > static_cast<uint64_t>(config.max_message_bytes)) {
+    std::cerr << "[Worker] Incumbent payload too large: " << size << "\n";
+    return false;
+  }
   data.resize(size);
   if (size > 0 && !read_from_pipe(fd, data.data(), size)) return false;
   return true;
 }
```

As per coding guidelines "Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_job_management.cpp` around lines 109 - 113, The code reads an untrusted uint64_t size from a pipe and calls data.resize(size) directly; bound the incoming size before resizing to prevent memory exhaustion by introducing a sensible constant (e.g. JOB_PAYLOAD_MAX or MAX_PAYLOAD_SIZE) and rejecting sizes > that limit (return false) or otherwise handling them; also validate zero/negative semantics if needed and use that check right after read_from_pipe(fd, &size, ...) and before data.resize(size) (referencing the local variable size, function read_from_pipe, and the vector/data.resize call).

cpp/src/grpc/cuopt_remote.proto (1)
6-6: ⚠️ Potential issue | 🟠 Major

Buf package-directory mismatch is still unresolved.

Line 6 (`package cuopt.remote;`) remains in a path that violates Buf's `PACKAGE_DIRECTORY_MATCH` rule for this module layout, so lint/CI will continue to fail until the package/path (or Buf roots) are aligned.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/cuopt_remote.proto` at line 6, The proto package declaration in cuopt_remote.proto ("package cuopt.remote;") doesn't match the file path, triggering Buf's PACKAGE_DIRECTORY_MATCH rule; fix by either updating the package to reflect the file's directory structure (e.g., change package to match the actual path) or move cuopt_remote.proto into a directory layout matching "cuopt/remote" (or adjust buf.yaml roots/module configuration to include the current path). Ensure the package token "cuopt.remote" and the filesystem layout are aligned so Buf lint passes.

cpp/src/grpc/server/grpc_service_impl.cpp (1)
71-73: ⚠️ Potential issue | 🔴 Critical

Publish `ready` before releasing `claimed` to prevent queue-slot double allocation.

At Line 71 and Line 265, releasing `claimed` before publishing `ready` re-opens a claim window for another submitter on the same slot.

Proposed fix (both locations)

```diff
-  job_queue[job_idx].claimed.store(false);
-  job_queue[job_idx].ready.store(true);
+  job_queue[job_idx].ready.store(true, std::memory_order_release);
+  job_queue[job_idx].claimed.store(false, std::memory_order_release);
```

As per coding guidelines "Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state".
Also applies to: 265-267
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_service_impl.cpp` around lines 71 - 73, The code currently clears job_queue[job_idx].claimed before publishing job_queue[job_idx].ready, which allows another submitter to claim the same slot between those two stores; swap the operations so you set job_queue[job_idx].ready.store(true) before job_queue[job_idx].claimed.store(false) (and apply the same swap at the other occurrence around lines 265-267) to ensure the slot is marked available only after the ready flag is visible; if using explicit memory orders, use release semantics for the store to ready to avoid the race.

SERVER_ARCHITECTURE.md (1)
15-46: ⚠️ Potential issue | 🟡 Minor

Add fence languages to satisfy markdownlint MD040.

Several fenced blocks are untyped; markdownlint MD040 will keep warning/failing until those fences specify a language (e.g., `text`).

Proposed patch

````diff
-```
+```text
 ┌────────────────────────────────────────────────────────────────────┐
 │                        Main Server Process                         │
 ...
 └─────────────────┘  └─────────────────┘  └─────────────────┘
-```
+```
@@
-```
+```text
 Client                 Server                 Worker
 ...
-```
+```
@@
-```
+```text
 Client                 Server                 Worker
 ...
-```
+```
@@
-```
+```text
 Client                 Server                 Worker
 ...
-```
+```
@@
-```
+```text
 Client                 Worker
 ...
-```
+```
@@
-```
+```text
 ┌─────────┐  submit   ┌───────────┐  claim   ┌────────────┐
 ...
 └───────────┘            └─────────┘
-```
+```
````
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@SERVER_ARCHITECTURE.md` around lines 15 - 46, The markdown has multiple untyped fenced code blocks (the large ASCII "Main Server Process" diagram and the smaller "Client / Server / Worker" and "submit / claim" pipeline diagrams) causing markdownlint MD040 failures; update each triple-backtick fence around these ASCII diagrams in SERVER_ARCHITECTURE.md to specify a language (use "text") so the fences read ```text ... ```; look for the big box diagram containing "Main Server Process", the repeated "Client Server Worker" blocks, and the "submit claim" pipeline block and add the language token to each opening fence.

cpp/src/grpc/server/grpc_server_main.cpp (3)
156-168: ⚠️ Potential issue | 🔴 Critical

Don't `memset` shared-memory structs that contain `std::atomic`.

`memset()` on a type with `std::atomic` members is undefined behavior in C++. Even if you immediately `.store()` afterwards, the atomic object's lifetime/rules are already violated.

Safer initialization pattern

```diff
 for (size_t i = 0; i < MAX_JOBS; ++i) {
-  memset(&job_queue[i], 0, sizeof(JobQueueEntry));
+  new (&job_queue[i]) JobQueueEntry{};
   job_queue[i].ready.store(false);
   job_queue[i].claimed.store(false);
   job_queue[i].cancelled.store(false);
   job_queue[i].worker_index.store(-1);
 }
 for (size_t i = 0; i < MAX_RESULTS; ++i) {
-  memset(&result_queue[i], 0, sizeof(ResultQueueEntry));
+  new (&result_queue[i]) ResultQueueEntry{};
   result_queue[i].ready.store(false);
   result_queue[i].retrieved.store(false);
 }
```

As per coding guidelines, "Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 156 - 168, The code currently calls memset on job_queue[i] and result_queue[i] which is undefined for types containing std::atomic; remove the memset calls and instead properly construct/value-initialize each entry (e.g., use placement new: new (&job_queue[i]) JobQueueEntry(); new (&result_queue[i]) ResultQueueEntry();) and then set the atomic fields via their .store(...) calls (worker_index.store(-1), ready.store(false), claimed.store(false), cancelled.store(false), retrieved.store(false)); update the loops that touch job_queue and result_queue (and references to MAX_JOBS / MAX_RESULTS) to use this safe initialization pattern.
173-212: ⚠️ Potential issue | 🔴 Critical

Avoid `std::terminate` on TLS error paths; guard `BuildAndStart()` failure.

TLS validation errors currently `return` after starting workers/threads (joinable thread dtors will call `std::terminate`). Separately, `BuildAndStart()` can return null and `server->Wait()` will crash. Both need a single cleanup/teardown path.

Minimal direction: introduce a shutdown helper and use it on all early-exit paths

```diff
   spawn_workers();
   std::thread result_thread(result_retrieval_thread);
   std::thread incumbent_thread(incumbent_retrieval_thread);
   std::thread monitor_thread(worker_monitor_thread);
   std::thread reaper_thread(session_reaper_thread);
+  auto shutdown_runtime = [&]() {
+    keep_running                 = false;
+    shm_ctrl->shutdown_requested = true;
+    result_cv.notify_all();
+    if (result_thread.joinable()) result_thread.join();
+    if (incumbent_thread.joinable()) incumbent_thread.join();
+    if (monitor_thread.joinable()) monitor_thread.join();
+    if (reaper_thread.joinable()) reaper_thread.join();
+    wait_for_workers();
+    cleanup_shared_memory();
+  };
@@
   if (config.enable_tls) {
     if (config.tls_cert_path.empty() || config.tls_key_path.empty()) {
       std::cerr << "[Server] TLS enabled but --tls-cert/--tls-key not provided\n";
-      return 1;
+      shutdown_runtime();
+      return 1;
     }
@@
     if (ssl_opts.pem_root_certs.empty()) {
       std::cerr << "[Server] --require-client-cert requires --tls-root\n";
-      return 1;
+      shutdown_runtime();
+      return 1;
     }
@@
   std::unique_ptr<Server> server(builder.BuildAndStart());
+  if (!server) {
+    std::cerr << "[gRPC Server] Failed to start server on " << server_address << "\n";
+    shutdown_runtime();
+    return 1;
+  }
@@
   server->Wait();
```

As per coding guidelines, "Prevent thread-unsafe use of global and static variables; use proper mutex/synchronization in server code accessing shared solver state."
Also applies to: 234-251
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 173 - 212, Spawned worker threads and early TLS/BuildAndStart error paths can return while threads are still joinable (causing std::terminate) and BuildAndStart() can return null causing a crash; introduce a single shutdown helper (e.g., shutdown_server_and_workers()) that stops workers, signals threads to exit, joins result_retrieval_thread, incumbent_retrieval_thread, worker_monitor_thread, session_reaper_thread and cleans up any resources and the grpc::Server pointer, call this helper on all early-exit/TLS error paths and when builder.BuildAndStart() returns nullptr before returning non-zero, and replace ad-hoc returns with calls to this helper; additionally ensure shared solver/global state access in server handlers uses proper mutex/synchronization (e.g., std::mutex/lock_guard around the solver state) to avoid thread-unsafe access.
50-80: ⚠️ Potential issue | 🟠 Major

CLI parsing should reject unknown/incomplete flags and handle `std::stoi` failures.

Right now unknown flags and missing values are silently ignored, and `std::stoi`/`std::stoll` can throw and abort the server. Also, `--chunk-timeout` is documented but not parsed at all.

Example patch sketch (apply pattern to all flags)

```diff
 for (int i = 1; i < argc; i++) {
   std::string arg = argv[i];
   if (arg == "-p" || arg == "--port") {
-    if (i + 1 < argc) { config.port = std::stoi(argv[++i]); }
+    if (i + 1 >= argc) {
+      std::cerr << "ERROR: " << arg << " requires a value\n";
+      print_usage(argv[0]);
+      return 1;
+    }
+    try {
+      config.port = std::stoi(argv[++i]);
+    } catch (const std::exception&) {
+      std::cerr << "ERROR: Invalid port value for " << arg << ": '" << argv[i] << "'\n";
+      return 1;
+    }
   } else if (arg == "--max-message-bytes") {
-    if (i + 1 < argc) { config.max_message_b = std::max(4096LL, std::stoll(argv[++i])); }
+    if (i + 1 >= argc) {
+      std::cerr << "ERROR: " << arg << " requires a value\n";
+      return 1;
+    }
+    try {
+      config.max_message_b = std::max<int64_t>(4096LL, std::stoll(argv[++i]));
+    } catch (const std::exception&) {
+      std::cerr << "ERROR: Invalid value for " << arg << ": '" << argv[i] << "'\n";
+      return 1;
+    }
+  } else if (arg == "--chunk-timeout") {
+    if (i + 1 >= argc) {
+      std::cerr << "ERROR: " << arg << " requires a value\n";
+      return 1;
+    }
+    try {
+      config.chunk_timeout_seconds = std::stoi(argv[++i]);
+    } catch (const std::exception&) {
+      std::cerr << "ERROR: Invalid value for " << arg << ": '" << argv[i] << "'\n";
+      return 1;
+    }
   } else if (arg == "-h" || arg == "--help") {
     print_usage(argv[0]);
     return 0;
+  } else {
+    std::cerr << "ERROR: Unknown option '" << arg << "'\n";
+    print_usage(argv[0]);
+    return 1;
   }
 }
```

As per coding guidelines, "Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 50 - 80, The CLI parsing loop in main currently silently ignores unknown flags and missing values, and calls std::stoi/std::stoll without catching exceptions; also the documented flag "--chunk-timeout" is not parsed. Update the argv parsing code (the for loop handling argv[], touching config.port, config.num_workers, config.max_message_mb, config.max_message_b, config.enable_transfer_hash, config.enable_tls, config.tls_cert_path, config.tls_key_path, config.tls_root_path, config.require_client, config.log_to_console, config.verbose) to: 1) validate that flags requiring a value check (i + 1 < argc) and on failure print_usage(argv[0]) and return a non-zero error code; 2) wrap std::stoi/std::stoll usages in try/catch(std::exception&) to log a clear error and exit non-zero on parse failure; 3) handle unknown flags by printing an error + usage and exiting non-zero; and 4) add parsing for "--chunk-timeout" to populate the appropriate config field (e.g., config.chunk_timeout) following the same validation and exception-handling pattern. Ensure error messages include the offending flag for easier debugging.

build.sh (1)
390-398: ⚠️ Potential issue | 🟠 Major

`cuopt_grpc_server` fast path still captures mixed-target builds.

As written, `build.sh cuopt_grpc_server cuopt` will still take the server-only branch and skip the requested install target(s). That's a surprising foot-gun for devs/CI.

Proposed patch

```diff
-    if hasArg cuopt_grpc_server && ! hasArg libcuopt && ! buildAll; then
+    if hasArg cuopt_grpc_server && ! hasArg libcuopt && ! buildAll \
+       && ! hasArg cuopt && ! hasArg cuopt_server && ! hasArg cuopt_mps_parser \
+       && ! hasArg cuopt_sh_client && ! hasArg docs && ! hasArg deb \
+       && ! hasArg libmps_parser; then
         # Build only the gRPC server (ninja resolves libcuopt as a dependency)
         cmake --build "${LIBCUOPT_BUILD_DIR}" --target cuopt_grpc_server ${VERBOSE_FLAG} ${JFLAG}
```
Verify each finding against the current code and only fix it if needed. In `@build.sh` around lines 390 - 398, The fast-path condition that builds only cuopt_grpc_server (the if branch using hasArg cuopt_grpc_server && ! hasArg libcuopt && ! buildAll) incorrectly triggers when other targets (e.g., cuopt) are also requested; change the guard so it only takes the server-only branch when cuopt_grpc_server is the sole target requested (for example replace the condition with a new predicate like hasOnlyArg cuopt_grpc_server or add explicit negations for all other possible target args / check INSTALL_TARGET equals cuopt_grpc_server), updating the if in build.sh that surrounds the cmake --build "${LIBCUOPT_BUILD_DIR}" --target cuopt_grpc_server block so mixed-target invocations fall through to the normal install/build branch.

cpp/src/grpc/client/test_grpc_client.cpp (1)
381-412: ⚠️ Potential issue | 🟠 Major

Harden numeric argument parsing and validate `--time-limit` / incumbent index.

`std::stod`/`std::stoll` throw on bad input, and `static_cast<int>(time_limit)` can be undefined if the value is out of range. This tool should fail with a clean CLI error instead of terminating.

Proposed patch

```diff
 while (arg_idx < argc && argv[arg_idx][0] == '-') {
   std::string arg = argv[arg_idx];
@@
   } else if (arg == "--time-limit" && arg_idx + 1 < argc) {
-    time_limit = std::stod(argv[++arg_idx]);
+    const char* value = argv[++arg_idx];
+    try {
+      time_limit = std::stod(value);
+    } catch (const std::exception&) {
+      std::cerr << "ERROR: Invalid value for --time-limit: '" << value << "'\n";
+      return 1;
+    }
+    if (!std::isfinite(time_limit) || time_limit <= 0.0) {
+      std::cerr << "ERROR: --time-limit must be a finite positive number (got '" << value
+                << "')\n";
+      return 1;
+    }
     arg_idx++;
@@
 grpc_client_config_t config;
 config.server_address = server_address;
-config.timeout_seconds = static_cast<int>(time_limit) + 120;
+if (time_limit > static_cast<double>(std::numeric_limits<int>::max() - 120)) {
+  std::cerr << "ERROR: --time-limit too large (max "
+            << (std::numeric_limits<int>::max() - 120) << ")\n";
+  return 1;
+}
+config.timeout_seconds = static_cast<int>(time_limit) + 120;
@@
 } else if (mode == "incumbent") {
@@
   std::string job_id = argv[arg_idx++];
-  int64_t from_index = (arg_idx < argc) ? std::stoll(argv[arg_idx]) : 0;
+  int64_t from_index = 0;
+  if (arg_idx < argc) {
+    const char* value = argv[arg_idx];
+    try {
+      from_index = std::stoll(value);
+    } catch (const std::exception&) {
+      std::cerr << "ERROR: Invalid incumbent index: '" << value << "'\n";
+      return 1;
+    }
+    if (from_index < 0) {
+      std::cerr << "ERROR: incumbent index must be >= 0 (got '" << value << "')\n";
+      return 1;
+    }
+  }
   return mode_incumbent(client, job_id, from_index);
 }
```

Also applies to: 425-426, 531-532
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/test_grpc_client.cpp` around lines 381 - 412, The CLI currently calls std::stod (for time_limit) and std::stoll (for the incumbent index elsewhere) which throw on invalid input and can lead to undefined behavior when casting large values; update the argument parsing in the main option loop (handling --time-limit / variable time_limit) to validate input robustly: parse numeric arguments using non-throwing parsing (e.g., std::from_chars for integers and a safe check for floating parsing or wrap std::stod in try/catch), detect conversion failure or out-of-range values, emit a clear error message via std::cerr and return a non-zero exit code, and before doing static_cast<int>(time_limit) or casting the stoll result ensure the parsed value fits the target range to avoid undefined behavior; apply the same validation/failure handling to the other occurrences parsing integers (the std::stoll sites noted in the review).

cpp/src/grpc/client/solve_remote.cu (2)
164-166: ⚠️ Potential issue | 🟠 Major

Do not pass the incumbent objective as best-bound in callbacks.

Lines 164-166 set `bound_copy` equal to `objective`, which reports a fake zero-gap state to callback consumers.

A safe interim behavior is to pass `NaN` (unknown bound) until the incumbent protocol includes a real bound field end-to-end.

🔧 Interim mitigation in this file

```diff
+#include <limits>
 ...
-  double bound_copy = objective;  // Use objective as bound for incumbent
+  double bound_copy = std::numeric_limits<double>::quiet_NaN();
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/solve_remote.cu` around lines 164 - 166, The callback currently passes the incumbent objective as the best-bound (bound_copy = objective) to get_callback->get_solution, which falsely reports a zero-gap; change the interim behavior to pass an unknown bound (use NaN) instead of objective when calling get_callback->get_solution (affecting bound_copy, get_callback->get_solution(..., &bound_copy, callback->get_user_data()) and related solution_copy/obj_copy variables) so callbacks receive NaN until the protocol supplies a real bound end-to-end.
55-57: ⚠️ Potential issue | 🟠 Major

Validate env sizing overrides before applying them.

Lines 55-57 still accept zero/negative values for `CUOPT_CHUNK_SIZE` and `CUOPT_MAX_MESSAGE_BYTES`, which can destabilize transfer behavior.

🔧 Proposed guardrails

```diff
-  config.chunk_size_bytes  = parse_env_int64("CUOPT_CHUNK_SIZE", config.chunk_size_bytes);
-  config.max_message_bytes = parse_env_int64("CUOPT_MAX_MESSAGE_BYTES", config.max_message_bytes);
+  auto chunk_size = parse_env_int64("CUOPT_CHUNK_SIZE", config.chunk_size_bytes);
+  auto max_msg    = parse_env_int64("CUOPT_MAX_MESSAGE_BYTES", config.max_message_bytes);
+  if (chunk_size > 0) { config.chunk_size_bytes = chunk_size; }
+  if (max_msg > 0) { config.max_message_bytes = max_msg; }
```

As per coding guidelines: "Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/solve_remote.cu` around lines 55 - 57, The env overrides for CUOPT_CHUNK_SIZE and CUOPT_MAX_MESSAGE_BYTES may be zero/negative; update the code that calls parse_env_int64 to validate the returned value before assigning to config.chunk_size_bytes and config.max_message_bytes (e.g., call parse_env_int64 for each into a temp variable, check temp > 0 and optionally cap to a safe maximum, then assign to config.*; if invalid, retain the existing default and emit a debug/warn via config.enable_debug_log or logger). Ensure you reference parse_env_int64, config.chunk_size_bytes, config.max_message_bytes, and the CUOPT_CHUNK_SIZE / CUOPT_MAX_MESSAGE_BYTES env names when implementing the guard.

cpp/CMakeLists.txt (1)
279-294: ⚠️ Potential issue | 🟠 Major

Resolve `protoc` via `$<TARGET_FILE:protobuf::protoc>` to support in-tree protobuf targets.

Lines 279-294 still assume `protobuf::protoc` has `IMPORTED_LOCATION*`. That fails for non-imported targets (e.g., FetchContent/add_subdirectory), causing false fatal errors.

🔧 Proposed CMake adjustment

```diff
 if(TARGET protobuf::protoc)
-  get_target_property(_PROTOBUF_PROTOC protobuf::protoc IMPORTED_LOCATION_RELEASE)
-  if(NOT _PROTOBUF_PROTOC)
-    get_target_property(_PROTOBUF_PROTOC protobuf::protoc IMPORTED_LOCATION)
-  endif()
+  set(_PROTOBUF_PROTOC $<TARGET_FILE:protobuf::protoc>)
 else()
   find_package(protobuf CONFIG REQUIRED)
-  get_target_property(_PROTOBUF_PROTOC protobuf::protoc IMPORTED_LOCATION_RELEASE)
-  if(NOT _PROTOBUF_PROTOC)
-    get_target_property(_PROTOBUF_PROTOC protobuf::protoc IMPORTED_LOCATION)
-  endif()
+  if(TARGET protobuf::protoc)
+    set(_PROTOBUF_PROTOC $<TARGET_FILE:protobuf::protoc>)
+  endif()
 endif()
```

In CMake, does `get_target_property(... IMPORTED_LOCATION)` work for non-imported targets (e.g., created via add_subdirectory/FetchContent), and is `$<TARGET_FILE:...>` the recommended way to reference executables in `add_custom_command(COMMAND ...)`?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/CMakeLists.txt` around lines 279 - 294, The current logic assumes protobuf::protoc has IMPORTED_LOCATION properties and can fail for in-tree targets; change the resolution to first check if the target exists and set _PROTOBUF_PROTOC to the generator expression $<TARGET_FILE:protobuf::protoc> when TARGET protobuf::protoc is true, otherwise fall back to get_target_property(... IMPORTED_LOCATION_RELEASE/IMPORTED_LOCATION) or the PROTOBUF_PROTOC_EXECUTABLE cache variable; update any add_custom_command(...) references to use ${_PROTOBUF_PROTOC} and keep the existing fatal error check if _PROTOBUF_PROTOC is still empty.

cpp/src/grpc/server/grpc_pipe_serialization.hpp (3)
96-100: ⚠️ Potential issue | 🔴 Critical

Validate `element_offset` and guard overflow before pointer arithmetic in result deserialization.

Lines 96-100 still allow negative offsets and unchecked `offset + chunk_size` arithmetic, which can underflow/overflow the bounds checks and lead to an unsafe `memcpy`.

🔧 Proposed fix

```diff
-  int64_t offset         = ac.element_offset();
-  const auto& chunk_data = ac.data();
-  if (offset + static_cast<int64_t>(chunk_data.size()) <= static_cast<int64_t>(dest.size())) {
-    std::memcpy(dest.data() + offset, chunk_data.data(), chunk_data.size());
-  }
+  int64_t offset         = ac.element_offset();
+  const auto& chunk_data = ac.data();
+  if (offset < 0) { return false; }
+  size_t uoffset = static_cast<size_t>(offset);
+  if (uoffset > dest.size()) { return false; }
+  if (chunk_data.size() > dest.size() - uoffset) { return false; }
+  std::memcpy(dest.data() + uoffset, chunk_data.data(), chunk_data.size());
```

As per coding guidelines: "Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` around lines 96 - 100, The current deserialization uses offset = ac.element_offset() and then does pointer arithmetic without validating offset or preventing overflow/underflow before memcpy; update the logic in grpc_pipe_serialization.hpp around the block using offset, chunk_data and dest so you first validate element_offset() is non-negative and within dest.size(), compute size_t chunk_size = chunk_data.size(), and perform a safe bounds check using size_t conversions (e.g., offset <= dest.size() && offset + chunk_size <= dest.size()) while guarding against overflow on offset + chunk_size; only call std::memcpy(dest.data() + offset, ...) when those checks pass and otherwise handle the error/return early.
90-90: ⚠️ Potential issue | 🔴 Critical

Return failure on non-EOF parse errors in both deserializers.

Line 90 and Line 148 still `break` on any parse failure and eventually return `true`, which can accept malformed/truncated payloads as valid.

🔧 Proposed fix

```diff
-  if (!google::protobuf::util::ParseDelimitedFromCodedStream(&ac, &coded, &clean_eof)) break;
+  if (!google::protobuf::util::ParseDelimitedFromCodedStream(&ac, &coded, &clean_eof)) {
+    if (clean_eof) { break; }  // normal end-of-stream
+    return false;              // malformed/truncated message
+  }
```

As per coding guidelines: "Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files."
Also applies to: 148-148
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` at line 90, The loops calling google::protobuf::util::ParseDelimitedFromCodedStream (using variables ac, coded, clean_eof) currently use "break" on parse failure which lets the function return true for truncated/malformed input; change those breaks to return false on a parse failure unless the failure was a clean EOF. Concretely, at both call sites (around the ParseDelimitedFromCodedStream lines referenced) check the clean_eof flag after a failed parse and return false for non-EOF failures (only allow breaking/ending the loop when clean_eof is true).
151-163: ⚠️ Potential issue | 🟠 Major

Guard `total_elements * elem_size` and byte-offset math before resize/copy.

Lines 151-163 still perform unchecked multiplication and resize from untrusted chunk metadata. A malformed payload can overflow the size math or trigger excessive allocation.

🔧 Proposed fix

```diff
+  if (ac.total_elements() < 0) { return false; }
   if (dest.empty() && ac.total_elements() > 0) {
     int64_t elem_size = array_field_element_size(ac.field_id());
-    dest.resize(static_cast<size_t>(ac.total_elements() * elem_size), 0);
+    if (elem_size <= 0) { return false; }
+    uint64_t total_bytes = static_cast<uint64_t>(ac.total_elements()) *
+                           static_cast<uint64_t>(elem_size);
+    if (total_bytes > static_cast<uint64_t>(std::numeric_limits<size_t>::max())) {
+      return false;
+    }
+    dest.resize(static_cast<size_t>(total_bytes), 0);
   }
```

As per coding guidelines: "Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` around lines 151 - 163, The code in the block that resizes dest and copies chunk bytes must validate sizes to avoid integer overflow and huge allocations: before calling dest.resize(ac.total_elements() * elem_size) and before computing byte_offset, check that ac.total_elements() and computed elem_size are > 0 and that static_cast<size_t>(ac.total_elements()) <= SIZE_MAX / static_cast<size_t>(elem_size); bound the requested allocation against a safe maximum (or return an error) to prevent resource exhaustion; verify chunk_data.size() is <= dest.size() and that byte_offset >= 0 and byte_offset <= dest.size() and that byte_offset + chunk_data.size() does not overflow (use size_t or checked arithmetic) before calling std::memcpy; refer to symbols dest, ac, array_field_element_size(), total_elements(), element_offset(), data(), elem_size, byte_offset, resize, and std::memcpy when making these checks.

cpp/src/grpc/server/grpc_server_types.hpp (1)
300-308: ⚠️ Potential issue | 🔴 Critical

Remove async-signal-unsafe operations from `signal_handler()`.

`std::cout` and `condition_variable::notify_all()` are not async-signal-safe and can deadlock/corrupt state when invoked from a POSIX signal handler.

Suggested minimal signal-safe handler

```diff
 inline void signal_handler(int signal)
 {
   if (signal == SIGINT || signal == SIGTERM) {
-    std::cout << "\n[gRPC Server] Received shutdown signal\n";
     keep_running = false;
     if (shm_ctrl) { shm_ctrl->shutdown_requested = true; }
-    result_cv.notify_all();
   }
 }
```

Then have a normal thread (or the main loop) observe `keep_running == false` and do the logging + `result_cv.notify_all()` outside the signal context.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_types.hpp` around lines 300 - 308, signal_handler currently performs async-unsafe operations (std::cout and result_cv.notify_all()); modify it to only perform signal-safe actions: set a sig_atomic_t/atomic<bool> flag (keep_running) and, if needed, set shm_ctrl->shutdown_requested (ensure that shutdown_requested is signal-safe or updated via atomic/sig_atomic_t), then return immediately—remove std::cout and result_cv.notify_all() from signal_handler. Update the main loop or a dedicated watcher thread to observe keep_running (or the atomic) and then perform logging and call result_cv.notify_all() outside the signal context (use the symbols signal_handler, keep_running, shm_ctrl, result_cv to locate changes).

cpp/src/grpc/client/grpc_client.cu (5)
702-745: ⚠️ Potential issue | 🔴 Critical

Harden chunked-result metadata validation before division/alloc/memcpy.

`element_size_bytes` and `total_elements` are trusted. `elem_size <= 0` triggers division-by-zero (Line 708), and `total_elems * elem_size` can overflow `size_t` before allocation, making the `memcpy` (Line 739) unsafe.

Proposed bounds checks
```diff
 for (const auto& arr_desc : header->arrays()) {
   auto field_id       = arr_desc.field_id();
   int64_t total_elems = arr_desc.total_elements();
   int64_t elem_size   = arr_desc.element_size_bytes();
   if (total_elems <= 0) continue;
+  if (elem_size <= 0) {
+    last_error_ = "Invalid chunk metadata: non-positive element_size_bytes";
+    return false;
+  }
+  if (total_elems > std::numeric_limits<int64_t>::max() / elem_size) {
+    last_error_ = "Invalid chunk metadata: total byte size overflow";
+    return false;
+  }
+  const int64_t total_bytes = total_elems * elem_size;
+  if (total_bytes < 0 ||
+      static_cast<uint64_t>(total_bytes) >
+        static_cast<uint64_t>(std::numeric_limits<size_t>::max())) {
+    last_error_ = "Invalid chunk metadata: byte size exceeds addressable memory";
+    return false;
+  }
   int64_t elems_per_chunk = chunk_data_budget / elem_size;
   if (elems_per_chunk <= 0) elems_per_chunk = 1;
-  std::vector<uint8_t> array_bytes(static_cast<size_t>(total_elems * elem_size));
+  std::vector<uint8_t> array_bytes(static_cast<size_t>(total_bytes));
```

As per coding guidelines "Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero, and use epsilon comparisons for floating-point equality checks".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/grpc_client.cu` around lines 702 - 745, The code trusts element_size_bytes() and total_elements() and must validate them before any division, allocation, or memcpy: check that elem_size = arr_desc.element_size_bytes() > 0 and total_elems = arr_desc.total_elements() >= 0, ensure elems_per_chunk = chunk_data_budget / elem_size is computed only after the elem_size > 0 check, and verify that total_elems * elem_size does not overflow size_t before constructing array_bytes; also validate that elems_wanted and elems_received are non-negative and that data.size() equals elems_received * elem_size and fits within array_bytes bounds before calling std::memcpy (references: arr_desc.field_id(), arr_desc.total_elements(), arr_desc.element_size_bytes(), elems_per_chunk, array_bytes, chunk_resp.elements_in_chunk(), chunk_resp.data(), and the std::memcpy call).
258-266: ⚠️ Potential issue | 🔴 Critical

Guard all public RPC methods against `impl_->stub` being null.

`check_status()`, `wait_for_completion()`, `cancel_job()`, `delete_job()`, `get_incumbents()`, and `stream_logs()` all dereference `impl_->stub` directly; calling these before `connect()` will crash.

Proposed guard pattern
```diff
 job_status_result_t grpc_client_t::check_status(const std::string& job_id)
 {
   job_status_result_t result;
+  if (!impl_ || !impl_->stub) {
+    result.error_message = "Not connected to server";
+    return result;
+  }
   grpc::ClientContext context;
   auto request = build_status_request(job_id);
   cuopt::remote::StatusResponse response;
   auto status = impl_->stub->CheckStatus(&context, request, &response);
```

Also applies to: 293-303, 325-333, 354-361, 381-395, 419-430
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/grpc_client.cu` around lines 258 - 266, All public RPC methods must guard against impl_ or impl_->stub being null to avoid crashes when called before connect(); update grpc_client_t::check_status, wait_for_completion, cancel_job, delete_job, get_incumbents, and stream_logs to early-return a sensible error result if impl_ == nullptr or impl_->stub == nullptr (e.g., set job_status_result_t to an error/failed state or populate the method's error/status fields and avoid calling stub->*), and add a clear logged message indicating the client is not connected; ensure you use the same guard pattern in the other listed methods (the ones around the ranges noted: check_status, wait_for_completion, cancel_job, delete_job, get_incumbents, stream_logs) so no method dereferences impl_->stub without this null check.
221-252: ⚠️ Potential issue | 🔴 Critical

`stop_log_streaming()` can hang indefinitely (blocked `reader->Read()`).

Setting `stop_logs_` doesn't unblock a synchronous server-streaming `Read()`, so `join()` can deadlock if the server stops emitting logs (or the stream is otherwise idle). This is still the same failure mode previously called out.

Proposed fix (make stop cancel the active ClientContext)
```diff
 void grpc_client_t::start_log_streaming(const std::string& job_id)
 {
   if (!config_.stream_logs || !config_.log_callback) return;
   stop_logs_.store(false);
   log_thread_ = std::make_unique<std::thread>([this, job_id]() {
-    grpc::ClientContext context;
+    grpc::ClientContext context;
+    {
+      std::lock_guard<std::mutex> lk(log_context_mutex_);
+      active_log_context_ = &context;
+    }
     auto request = build_stream_logs_request(job_id, 0);
     auto reader  = impl_->stub->StreamLogs(&context, request);
     cuopt::remote::LogMessage log_msg;
     while (reader->Read(&log_msg)) {
       if (stop_logs_.load()) {
-        context.TryCancel();
         break;
       }
       if (config_.log_callback) { config_.log_callback(log_msg.line()); }
       if (log_msg.job_complete()) { break; }
     }
     reader->Finish();
+    {
+      std::lock_guard<std::mutex> lk(log_context_mutex_);
+      active_log_context_ = nullptr;
+    }
   });
 }

 void grpc_client_t::stop_log_streaming()
 {
   stop_logs_.store(true);
+  {
+    std::lock_guard<std::mutex> lk(log_context_mutex_);
+    if (active_log_context_) { active_log_context_->TryCancel(); }
+  }
   if (log_thread_ && log_thread_->joinable()) { log_thread_->join(); }
   log_thread_.reset();
 }
```

You'll also need matching members on `grpc_client_t` (likely in `cpp/src/grpc/client/grpc_client.hpp`):

```cpp
mutable std::mutex log_context_mutex_;
grpc::ClientContext* active_log_context_ = nullptr;
```

As per coding guidelines "Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/grpc_client.cu` around lines 221 - 252, The stop_log_streaming() can block because reader->Read() is synchronous; modify start_log_streaming() and stop_log_streaming() to track and cancel the active grpc::ClientContext: add a mutable std::mutex log_context_mutex_ and grpc::ClientContext* active_log_context_ member, then in start_log_streaming() create the ClientContext, lock log_context_mutex_, set active_log_context_ = &context before calling impl_->stub->StreamLogs, unlock, and clear active_log_context_ (under the mutex) after reader->Finish(); in stop_log_streaming() lock log_context_mutex_, if active_log_context_ is non-null call active_log_context_->TryCancel(), then set stop_logs_ and join/reset the thread; keep the existing context.TryCancel() check inside the thread but ensure all access to active_log_context_ is synchronized to avoid races.
258-266: ⚠️ Potential issue | 🟠 Major

Add deadlines to operational RPCs (current code can block forever on network stalls).

Only the connection probe sets a deadline; most other RPC calls create a `grpc::ClientContext` without `set_deadline()`. That bypasses `config_.timeout_seconds` and can hang indefinitely (including inside polling loops).

Proposed helper + per-RPC use
```diff
+static inline void set_context_deadline(grpc::ClientContext& ctx, int timeout_seconds)
+{
+  if (timeout_seconds <= 0) return;
+  ctx.set_deadline(std::chrono::system_clock::now() +
+                   std::chrono::seconds(timeout_seconds));
+}
+
 job_status_result_t grpc_client_t::check_status(const std::string& job_id)
 {
   job_status_result_t result;
   grpc::ClientContext context;
+  set_context_deadline(context, config_.timeout_seconds);
   auto request = build_status_request(job_id);
   cuopt::remote::StatusResponse response;
   auto status = impl_->stub->CheckStatus(&context, request, &response);
```

Apply similarly to `WaitForCompletion`, `CancelJob`, `DeleteResult`, `GetIncumbents`, `GetResult`, and all chunked upload/download RPCs.

Also applies to: 293-303, 325-333, 354-361, 387-395, 419-443, 468-470, 509-510, 540-541, 561-562, 588-589, 625-626, 674-675, 724-725, 1000-1004, 1172-1176
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/grpc_client.cu` around lines 258 - 266, The RPCs in grpc_client_t (e.g., grpc_client_t::check_status) create grpc::ClientContext without deadlines and can block indefinitely; add a helper (e.g., make_client_context_with_deadline or set_deadline_on_context) that constructs a grpc::ClientContext and calls set_deadline(std::chrono::system_clock::now() + std::chrono::seconds(config_.timeout_seconds)) (or uses config_.timeout_seconds as a std::chrono duration) and use that helper everywhere you create a context for RPCs (check_status, WaitForCompletion, CancelJob, DeleteResult, GetIncumbents, GetResult, all chunked upload/download calls, and the other locations noted in the review) so each RPC honors config_.timeout_seconds and cannot hang forever.
992-995: ⚠️ Potential issue | 🔴 Critical

Validate `poll_interval_ms` before dividing (division-by-zero).

Both LP and MIP polling paths compute `max_polls` by dividing by `config_.poll_interval_ms` without guarding `<= 0`.

Proposed validation
```diff
 } else {
   CUOPT_LOG_INFO("[grpc_client] Using polling (CheckStatus) for job %s", job_id.c_str());
+  if (config_.poll_interval_ms <= 0) {
+    stop_log_streaming();
+    result.error_message = "poll_interval_ms must be > 0";
+    return result;
+  }
   int poll_count = 0;
   int max_polls  = (config_.timeout_seconds * 1000) / config_.poll_interval_ms;
```

As per coding guidelines "Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero, and use epsilon comparisons for floating-point equality checks".
Also applies to: 1130-1133
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/grpc_client.cu` around lines 992 - 995, Before computing max_polls in the LP and MIP polling paths, validate config_.poll_interval_ms > 0 to avoid division-by-zero: if config_.poll_interval_ms <= 0, emit a clear error log (e.g., CUOPT_LOG_ERROR) that includes the invalid value and either return an error status or substitute a safe fallback (e.g., set poll_interval_ms = 1) before computing max_polls = (config_.timeout_seconds * 1000) / config_.poll_interval_ms; apply this check around the existing uses of poll_count and max_polls so both the LP and MIP paths use the validated value.cpp/src/grpc/grpc_problem_mapper.cu (1)
399-427: ⚠️ Potential issue | 🔴 Critical

Fix potential out-of-bounds writes in `get_doubles`/`get_ints` (misaligned payload sizes).

`n` is computed with floor division, but `memcpy` copies `it->second.size()` bytes. If the byte payload isn't an exact multiple of the element size, this can overflow the destination vector.

Proposed fix (validate alignment and copy only validated byte count)
```diff
 auto get_doubles = [&](int32_t field_id) -> std::vector<f_t> {
   auto it = arrays.find(field_id);
   if (it == arrays.end() || it->second.empty()) return {};
+  if (it->second.size() % sizeof(double) != 0) return {};
   size_t n = it->second.size() / sizeof(double);
   if constexpr (std::is_same_v<f_t, double>) {
     std::vector<double> v(n);
-    std::memcpy(v.data(), it->second.data(), it->second.size());
+    std::memcpy(v.data(), it->second.data(), n * sizeof(double));
     return v;
   } else {
     std::vector<double> tmp(n);
-    std::memcpy(tmp.data(), it->second.data(), it->second.size());
+    std::memcpy(tmp.data(), it->second.data(), n * sizeof(double));
     return std::vector<f_t>(tmp.begin(), tmp.end());
   }
 };

 auto get_ints = [&](int32_t field_id) -> std::vector<i_t> {
   auto it = arrays.find(field_id);
   if (it == arrays.end() || it->second.empty()) return {};
+  if (it->second.size() % sizeof(int32_t) != 0) return {};
   size_t n = it->second.size() / sizeof(int32_t);
   if constexpr (std::is_same_v<i_t, int32_t>) {
     std::vector<int32_t> v(n);
-    std::memcpy(v.data(), it->second.data(), it->second.size());
+    std::memcpy(v.data(), it->second.data(), n * sizeof(int32_t));
     return v;
   } else {
     std::vector<int32_t> tmp(n);
-    std::memcpy(tmp.data(), it->second.data(), it->second.size());
+    std::memcpy(tmp.data(), it->second.data(), n * sizeof(int32_t));
     return std::vector<i_t>(tmp.begin(), tmp.end());
   }
 };
```

As per coding guidelines "Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero, and use epsilon comparisons for floating-point equality checks".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/grpc_problem_mapper.cu` around lines 399 - 427, The lambdas get_doubles and get_ints compute element count n via floor division but then memcpy the full it->second.size() bytes, risking out-of-bounds when payload size isn't a multiple of element size; fix by validating alignment (check it->second.size() % sizeof(double) / sizeof(int32_t) == 0) and compute a safe byte_count = n * sizeof(element_type), resize the destination vector to n, then memcpy only byte_count (or return an error/empty vector if misaligned); update both get_doubles (use sizeof(double) and f_t conversions) and get_ints (use sizeof(int32_t) and i_t conversions) accordingly so copying never exceeds the destination buffer.cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp (1)
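The alignment-validated decode above can be shown as a standalone template. `get_typed` is a hypothetical name for illustration (the real lambdas are `get_doubles`/`get_ints`); the invariant is: reject payloads whose byte size is not a multiple of the element size, and copy exactly `n * sizeof(T)` bytes so a truncated or padded payload can never overflow the destination:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

template <typename T>
std::vector<T> get_typed(const std::vector<uint8_t>& bytes)
{
  if (bytes.empty()) return {};
  if (bytes.size() % sizeof(T) != 0) return {};  // misaligned payload -> reject
  const size_t n = bytes.size() / sizeof(T);
  std::vector<T> v(n);
  std::memcpy(v.data(), bytes.data(), n * sizeof(T));  // exactly n elements
  return v;
}
```

The same shape fixes `bytes_to_typed()` in `grpc_solution_mapper.cu`, which has the identical floor-division-then-full-copy bug.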
213-218: ⚠️ Potential issue | 🟠 Major

Synchronize server-log state reads/writes consistently (data race).

`server_log_path_`, `server_log_start_pos_`, and `test_start_marked_` are accessed without a common lock in `set_server_log_path()`, `get_server_logs()`, `dump_logs()`, and the accessors. This can race if tests poll logs while another thread resets paths/markers.

Proposed approach (copy shared state under lock, then do I/O unlocked)
```diff
 void set_server_log_path(const std::string& path)
 {
-  server_log_path_      = path;
-  server_log_start_pos_ = 0;
-  test_start_marked_    = false;
+  std::lock_guard<std::mutex> lock(mutex_);
+  server_log_path_      = path;
+  server_log_start_pos_ = 0;
+  test_start_marked_    = false;
 }

 std::string get_server_logs(bool since_test_start = true) const
 {
-  if (server_log_path_.empty()) { return ""; }
+  std::string path;
+  std::streampos start_pos = 0;
+  bool marked = false;
+  {
+    std::lock_guard<std::mutex> lock(mutex_);
+    path      = server_log_path_;
+    start_pos = server_log_start_pos_;
+    marked    = test_start_marked_;
+  }
+
+  if (path.empty()) { return ""; }
-  std::ifstream file(server_log_path_);
+  std::ifstream file(path);
   if (!file.is_open()) { return ""; }
-  if (since_test_start && test_start_marked_ && server_log_start_pos_ > 0) {
-    file.seekg(server_log_start_pos_);
+  if (since_test_start && marked && start_pos > 0) {
+    file.seekg(start_pos);
   }
```

Consider also making `server_log_path()` return a copy (`std::string`) instead of a reference if you need it to be thread-safe.

As per coding guidelines "Prevent thread-unsafe use of global and static variables; use proper mutex/synchronization in server code accessing shared solver state".
Also applies to: 229-244, 332-345, 350-356
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp` around lines 213 - 218, There is a data race on server_log_path_, server_log_start_pos_, and test_start_marked_: protect all reads/writes with a single mutex (e.g., add std::mutex server_log_mutex_) and update set_server_log_path(), get_server_logs(), dump_logs(), and the accessors to acquire the lock when accessing those members; for I/O work, copy the guarded state (make server_log_path() return a std::string copy rather than a const ref) while holding the lock then release and perform file reading/writing unlocked to avoid long holds; ensure all places that currently read the fields directly use the mutex-protected copy to eliminate races.cpp/src/grpc/grpc_solution_mapper.cu (1)
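The "copy shared state under lock, do I/O unlocked" pattern recommended above can be sketched with an illustrative class (`log_state` is hypothetical, not the real log-capture helper): the snapshot returns copies so callers open and read the file without holding the lock, keeping lock hold times short while still eliminating the race:

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <utility>

class log_state {
 public:
  void set_path(const std::string& p)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    path_      = p;
    start_pos_ = 0;
  }

  // Returns copies of the shared state; file I/O then happens unlocked.
  std::pair<std::string, long> snapshot() const
  {
    std::lock_guard<std::mutex> lock(mutex_);
    return {path_, start_pos_};
  }

 private:
  mutable std::mutex mutex_;  // mutable so const snapshot() can lock
  std::string path_;
  long start_pos_ = 0;
};
```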
512-535: ⚠️ Potential issue | 🔴 Critical

Harden `bytes_to_typed()` against misaligned byte payloads (potential OOB write).

All branches compute `n` with floor division and then copy `raw.size()` bytes. If `raw.size()` is not an exact multiple of the expected element size, this overflows `tmp`/`v`.

Proposed fix
```diff
 const auto& raw = it->second;
 if constexpr (std::is_same_v<T, float>) {
+  if (raw.size() % sizeof(double) != 0) return {};
   size_t n = raw.size() / sizeof(double);
   std::vector<double> tmp(n);
-  std::memcpy(tmp.data(), raw.data(), raw.size());
+  std::memcpy(tmp.data(), raw.data(), n * sizeof(double));
   return std::vector<T>(tmp.begin(), tmp.end());
 } else if constexpr (std::is_same_v<T, double>) {
+  if (raw.size() % sizeof(double) != 0) return {};
   size_t n = raw.size() / sizeof(double);
   std::vector<double> v(n);
-  std::memcpy(v.data(), raw.data(), raw.size());
+  std::memcpy(v.data(), raw.data(), n * sizeof(double));
   return v;
 } else {
+  if (raw.size() % sizeof(T) != 0) return {};
   size_t n = raw.size() / sizeof(T);
   std::vector<T> v(n);
-  std::memcpy(v.data(), raw.data(), raw.size());
+  std::memcpy(v.data(), raw.data(), n * sizeof(T));
   return v;
 }
```

As per coding guidelines "Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero, and use epsilon comparisons for floating-point equality checks".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/grpc_solution_mapper.cu` around lines 512 - 535, The bytes_to_typed() implementation can write out-of-bounds because it computes n = raw.size()/sizeof(...) but then memcpy's raw.size() bytes; fix by always using sizeof(T) (not sizeof(double) in the float branch), compute size_t n = raw.size() / sizeof(T), and memcpy only n * sizeof(T) bytes into the destination buffers (and return the n-length vector); if raw.size() % sizeof(T) != 0, silently ignore trailing bytes or log a warning—ensure this same change is applied in the float branch (tmp) and the double/other branches (v) so all copies use n * sizeof(T) and use sizeof(T) to compute n.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 53823cec-ccad-4fbb-b90f-ec216b29d2e2
📒 Files selected for processing (39)
- `GRPC_ARCHITECTURE.md`
- `SERVER_ARCHITECTURE.md`
- `build.sh`
- `conda/recipes/libcuopt/recipe.yaml`
- `cpp/CMakeLists.txt`
- `cpp/cuopt_grpc_server.cpp`
- `cpp/src/grpc/client/grpc_client.cu`
- `cpp/src/grpc/client/grpc_client.hpp`
- `cpp/src/grpc/client/solve_remote.cu`
- `cpp/src/grpc/client/test_grpc_client.cpp`
- `cpp/src/grpc/cuopt_remote.proto`
- `cpp/src/grpc/cuopt_remote_service.proto`
- `cpp/src/grpc/grpc_problem_mapper.cu`
- `cpp/src/grpc/grpc_problem_mapper.hpp`
- `cpp/src/grpc/grpc_service_mapper.cu`
- `cpp/src/grpc/grpc_service_mapper.hpp`
- `cpp/src/grpc/grpc_settings_mapper.cu`
- `cpp/src/grpc/grpc_settings_mapper.hpp`
- `cpp/src/grpc/grpc_solution_mapper.cu`
- `cpp/src/grpc/grpc_solution_mapper.hpp`
- `cpp/src/grpc/server/grpc_field_element_size.hpp`
- `cpp/src/grpc/server/grpc_incumbent_proto.hpp`
- `cpp/src/grpc/server/grpc_job_management.cpp`
- `cpp/src/grpc/server/grpc_pipe_serialization.hpp`
- `cpp/src/grpc/server/grpc_server_main.cpp`
- `cpp/src/grpc/server/grpc_server_threads.cpp`
- `cpp/src/grpc/server/grpc_server_types.hpp`
- `cpp/src/grpc/server/grpc_service_impl.cpp`
- `cpp/src/grpc/server/grpc_worker.cpp`
- `cpp/src/grpc/server/grpc_worker_infra.cpp`
- `cpp/src/pdlp/CMakeLists.txt`
- `cpp/src/pdlp/solve_remote.cu`
- `cpp/tests/linear_programming/CMakeLists.txt`
- `cpp/tests/linear_programming/grpc/CMakeLists.txt`
- `cpp/tests/linear_programming/grpc/grpc_client_test.cpp`
- `cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp`
- `cpp/tests/linear_programming/grpc/grpc_integration_test.cpp`
- `cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp`
- `python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py`
💤 Files with no reviewable changes (2)
- cpp/src/pdlp/solve_remote.cu
- cpp/src/pdlp/CMakeLists.txt
✅ Files skipped from review due to trivial changes (1)
- cpp/tests/linear_programming/CMakeLists.txt
🚧 Files skipped from review as they are similar to previous changes (4)
- GRPC_ARCHITECTURE.md
- cpp/src/grpc/server/grpc_worker_infra.cpp
- python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
- cpp/src/grpc/grpc_settings_mapper.cu
```cpp
struct grpc_client_config_t {
  std::string server_address = "localhost:8765";
  int poll_interval_ms = 1000;  // How often to poll for status (when use_wait=false)
  int timeout_seconds  = 3600;  // Max time to wait for job completion (1 hour)
  bool use_wait    = false;     // Use WaitForCompletion RPC instead of polling CheckStatus
  bool stream_logs = false;     // Whether to stream logs from server
  std::function<void(const std::string&)> log_callback = nullptr;  // Called for each log line

  // Incumbent callback for MIP solves — invoked each time the server finds a
  // new best-feasible (incumbent) solution. Parameters: index, objective value,
  // solution vector. Return true to continue solving, or false to request
  // early termination (e.g. the objective is good enough for the caller's
  // purposes). Only supported in polling mode (use_wait=false).
  std::function<bool(int64_t index, double objective, const std::vector<double>& solution)>
    incumbent_callback = nullptr;
  int incumbent_poll_interval_ms = 1000;  // How often to poll for new incumbents

  // TLS configuration
  bool enable_tls = false;
  std::string tls_root_certs;   // PEM-encoded root CA certificates (for verifying server)
  std::string tls_client_cert;  // PEM-encoded client certificate (for mTLS)
  std::string tls_client_key;   // PEM-encoded client private key (for mTLS)

  // gRPC max message size (used for unary SubmitJob and result download decisions).
  int64_t max_message_bytes = 256LL * 1024 * 1024;  // 256 MiB

  // Chunk size for chunked array upload and chunked result download.
  int64_t chunk_size_bytes = 16LL * 1024 * 1024;  // 16 MiB

  // --- Test / debug options (not intended for production use) -----------------

  // Receives internal client debug messages (for test verification).
  std::function<void(const std::string&)> debug_log_callback = nullptr;

  // Enable debug / throughput logging to stderr.
  // Controlled by CUOPT_GRPC_DEBUG env var (0|1). Default: off.
  bool enable_debug_log = false;

  // Log FNV-1a hashes of uploaded/downloaded data on both client and server.
  // Comparing the two hashes confirms data was not corrupted in transit.
  bool enable_transfer_hash = false;

  // Override for the chunked upload threshold (bytes). Normally computed
  // automatically as 75% of max_message_bytes. Set to 0 to force chunked
  // upload for all problems, or a positive value to override. -1 = auto.
  int64_t chunked_array_threshold_bytes = -1;
};
```
Document/validate config invariants that can crash the client (e.g., poll_interval_ms > 0).
The implementation currently divides by poll_interval_ms and uses timeout_seconds for deadlines; it would help to either:
- validate/clamp these values in `grpc_client_t` constructors, or
- explicitly document "must be > 0" constraints here in the config struct.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/grpc/client/grpc_client.hpp` around lines 53 - 99, The config struct
grpc_client_config_t contains fields that must satisfy invariants (e.g.,
poll_interval_ms, incumbent_poll_interval_ms > 0; timeout_seconds > 0;
chunk_size_bytes > 0; max_message_bytes > 0; chunked_array_threshold_bytes >=
-1) but are not enforced; update grpc_client_t constructors to validate and
either clamp or reject invalid values: check each field (poll_interval_ms,
incumbent_poll_interval_ms, timeout_seconds, chunk_size_bytes,
max_message_bytes, and chunked_array_threshold_bytes) and if out of range either
set a safe default/clamp and emit a message via config.log_callback or
debug_log_callback or throw an std::invalid_argument with a clear message; also
document the enforced invariants in grpc_client_config_t comments so callers
know the constraints.
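A constructor-time validation of these invariants could look like the following sketch. The field names mirror `grpc_client_config_t`, but `demo_config` and `validate_config` are hypothetical illustrations, and the exact set of checks (throw vs. clamp) is a design choice left to the implementation:

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

struct demo_config {
  int poll_interval_ms           = 1000;
  int incumbent_poll_interval_ms = 1000;
  int timeout_seconds            = 3600;
  int64_t max_message_bytes      = 256LL * 1024 * 1024;
  int64_t chunk_size_bytes       = 16LL * 1024 * 1024;
  int64_t chunked_array_threshold_bytes = -1;  // -1 = auto
};

// Rejects configurations that would later cause division-by-zero or
// zero-length deadlines/chunks; called once from the client constructor.
void validate_config(const demo_config& c)
{
  if (c.poll_interval_ms <= 0)
    throw std::invalid_argument("poll_interval_ms must be > 0");
  if (c.incumbent_poll_interval_ms <= 0)
    throw std::invalid_argument("incumbent_poll_interval_ms must be > 0");
  if (c.timeout_seconds <= 0)
    throw std::invalid_argument("timeout_seconds must be > 0");
  if (c.max_message_bytes <= 0)
    throw std::invalid_argument("max_message_bytes must be > 0");
  if (c.chunk_size_bytes <= 0)
    throw std::invalid_argument("chunk_size_bytes must be > 0");
  if (c.chunked_array_threshold_bytes < -1)
    throw std::invalid_argument("chunked_array_threshold_bytes must be >= -1");
}
```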
```cpp
// Constraint bounds (prefer lower/upper bounds if available)
if (pb_problem.constraint_lower_bounds_size() > 0) {
  std::vector<f_t> con_lb(pb_problem.constraint_lower_bounds().begin(),
                          pb_problem.constraint_lower_bounds().end());
  std::vector<f_t> con_ub(pb_problem.constraint_upper_bounds().begin(),
                          pb_problem.constraint_upper_bounds().end());
  cpu_problem.set_constraint_lower_bounds(con_lb.data(), static_cast<i_t>(con_lb.size()));
  cpu_problem.set_constraint_upper_bounds(con_ub.data(), static_cast<i_t>(con_ub.size()));
} else if (pb_problem.b_size() > 0) {
  // Use b (RHS) + row_types format
  std::vector<f_t> b(pb_problem.b().begin(), pb_problem.b().end());
  cpu_problem.set_constraint_bounds(b.data(), static_cast<i_t>(b.size()));

  if (!pb_problem.row_types().empty()) {
    const std::string& row_types_str = pb_problem.row_types();
    cpu_problem.set_row_types(row_types_str.data(), static_cast<i_t>(row_types_str.size()));
  }
}
```
Constraint bounds decoding should validate both LB/UB are present (and sizes match).
constraint_lower_bounds_size() > 0 implicitly assumes constraint_upper_bounds() exists and matches; otherwise you can end up setting mismatched arrays with no explicit error.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/grpc/grpc_problem_mapper.cu` around lines 174 - 191, The code assumes
constraint_lower_bounds_size() > 0 implies matching upper bounds; update the
condition in the constraint decoding block to require both
pb_problem.constraint_lower_bounds_size() > 0 and
pb_problem.constraint_upper_bounds_size() > 0 and verify their sizes match
before calling cpu_problem.set_constraint_lower_bounds(...) and
cpu_problem.set_constraint_upper_bounds(...); if sizes differ or upper bounds
are missing, either fall back to the pb_problem.b()/row_types() branch or
return/log an explicit error (so you don’t call the setters with mismatched
arrays).
```cpp
auto get_string_list = [&](int32_t field_id) -> std::vector<std::string> {
  auto it = arrays.find(field_id);
  if (it == arrays.end() || it->second.empty()) return {};
  std::vector<std::string> names;
  const char* s     = reinterpret_cast<const char*>(it->second.data());
  const char* s_end = s + it->second.size();
  while (s < s_end) {
    const char* nul = static_cast<const char*>(std::memchr(s, '\0', s_end - s));
    if (!nul) nul = s_end;
    names.emplace_back(s, nul);
    s = nul + 1;
  }
  return names;
};
```
Avoid UB in get_string_list when the blob is missing a trailing NUL.
If memchr doesn’t find '\0', the code sets nul = s_end and then advances s = nul + 1, which forms a pointer beyond one-past-end. Even if producers always NUL-terminate, this is server-side parsing of untrusted payload bytes.
Proposed fix
```diff
 while (s < s_end) {
   const char* nul = static_cast<const char*>(std::memchr(s, '\0', s_end - s));
   if (!nul) nul = s_end;
   names.emplace_back(s, nul);
-  s = nul + 1;
+  if (nul == s_end) break;
+  s = nul + 1;
 }
```
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/grpc/grpc_problem_mapper.cu` around lines 435 - 448, get_string_list
currently advances s = nul + 1 when memchr returns nullptr, producing a pointer
one-past-end; instead, when memchr returns nullptr (nul == s_end) append the
final string using the length (s_end - s) and then break the loop rather than
doing s = nul + 1. Update the lambda (get_string_list) to compute the substring
length from s to nul (or s_end) and stop the loop when nul == s_end to avoid
forming pointers past the buffer while still handling a missing trailing NUL.
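The fixed loop can be verified with a standalone parser. `split_nul_delimited` is an illustrative free function (the real code is a lambda over `arrays`): when the final segment has no trailing NUL, it is emitted and the loop breaks instead of advancing one past the end of the buffer:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

std::vector<std::string> split_nul_delimited(const std::vector<uint8_t>& blob)
{
  std::vector<std::string> names;
  const char* s     = reinterpret_cast<const char*>(blob.data());
  const char* s_end = s + blob.size();
  while (s < s_end) {
    const char* nul = static_cast<const char*>(std::memchr(s, '\0', s_end - s));
    if (!nul) nul = s_end;
    names.emplace_back(s, nul);
    if (nul == s_end) break;  // missing trailing NUL: stop, don't walk past the end
    s = nul + 1;
  }
  return names;
}
```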
`cpp/src/grpc/grpc_service_mapper.cpp` (marked Outdated)
```cpp
// Append chunk requests for a single typed array.
template <typename T>
void chunk_typed_array(std::vector<cuopt::remote::SendArrayChunkRequest>& out,
                       cuopt::remote::ArrayFieldId field_id,
                       const std::vector<T>& data,
                       const std::string& upload_id,
                       int64_t chunk_data_budget)
{
  if (data.empty()) return;

  const int64_t elem_size      = static_cast<int64_t>(sizeof(T));
  const int64_t total_elements = static_cast<int64_t>(data.size());

  int64_t elems_per_chunk = chunk_data_budget / elem_size;
  if (elems_per_chunk <= 0) elems_per_chunk = 1;

  const auto* raw = reinterpret_cast<const uint8_t*>(data.data());

  for (int64_t offset = 0; offset < total_elements; offset += elems_per_chunk) {
    int64_t count       = std::min(elems_per_chunk, total_elements - offset);
    int64_t byte_offset = offset * elem_size;
    int64_t byte_count  = count * elem_size;

    cuopt::remote::SendArrayChunkRequest req;
    req.set_upload_id(upload_id);
    auto* ac = req.mutable_chunk();
    ac->set_field_id(field_id);
    ac->set_element_offset(offset);
    ac->set_total_elements(total_elements);
    ac->set_data(raw + byte_offset, byte_count);
    out.push_back(std::move(req));
  }
}
```
Potential float-path corruption: chunked upload sends f_t bytes, but server reconstruction looks double-based.
chunk_typed_array() writes raw sizeof(T) bytes. In build_array_chunk_requests(), all floating arrays (A_VALUES, C, bounds, etc.) use T = f_t. If f_t == float, this sends float bytes; however, map_chunked_arrays_to_problem() currently decodes floating arrays as double bytes (it divides by sizeof(double) and memcpys into double buffers). That combination would truncate element counts and can overflow copies (and/or silently corrupt values).
I’d strongly suggest verifying the intended chunked wire format:
- If the wire format is always `double` for floats, convert `std::vector<float>` → chunked `double` bytes before calling `chunk_typed_array` (ideally in a streaming way to avoid doubling peak memory).
- If the wire format is the native element size, then server-side chunked decode should use element-size metadata (e.g., `element_size_bytes`) instead of hard-coding `sizeof(double)`.
```shell
#!/bin/bash
set -euo pipefail
# 1) Inspect the proto schema: does ArrayChunk carry element_size_bytes?
rg -n "message\s+ArrayChunk|element_size_bytes|total_elements|field_id|element_offset" cpp/src/grpc -S
# 2) Find where SendArrayChunk is decoded/validated on server
rg -n "SendArrayChunk|StartChunkedUpload|FinishChunkedUpload|FIELD_A_VALUES|FIELD_C" cpp/src/grpc -S
# 3) Confirm how chunked arrays are decoded into cpu_optimization_problem_t
rg -n "map_chunked_arrays_to_problem|sizeof\(double\)|FIELD_A_VALUES" cpp/src/grpc -S
```

As per coding guidelines "Validate algorithm correctness in optimization logic: simplex pivots, branch-and-bound decisions, routing heuristics, and constraint/objective handling must produce correct results".
Also applies to: 121-173
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/grpc/grpc_service_mapper.cu` around lines 23 - 55, chunk_typed_array
currently writes raw sizeof(T) bytes (used by build_array_chunk_requests with T
= f_t), but map_chunked_arrays_to_problem decodes floats as sizeof(double) which
can corrupt/overflow; fix by making the wire-format consistent: either convert
float arrays to double bytes before calling chunk_typed_array (perform
conversion in build_array_chunk_requests or a streaming converter so you don't
double memory) or change the server-side decoder in
map_chunked_arrays_to_problem to use the element size of the incoming
SendArrayChunkRequest/ArrayChunk (use
element_count/element_offset/total_elements and element_size_bytes or add one)
and memcpy/interpret using that element size; update tests to cover f_t=float
and f_t=double roundtrip and ensure SendArrayChunkRequest uses/accepts the
chosen element-size convention.
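A minimal sketch of the second option (element-size-aware decode), assuming the chunk metadata carries `element_size_bytes`; `decode_float_chunk` is an illustrative free function, not the actual mapper API:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Decode a chunked byte payload using the element size carried in the chunk
// metadata instead of hard-coding sizeof(double). Widening float -> double on
// decode avoids truncated element counts and overflowing copies.
std::vector<double> decode_float_chunk(const std::vector<uint8_t>& bytes,
                                       size_t element_size_bytes)
{
  assert(element_size_bytes == sizeof(float) || element_size_bytes == sizeof(double));
  assert(bytes.size() % element_size_bytes == 0);
  const size_t n = bytes.size() / element_size_bytes;
  std::vector<double> out(n);
  if (element_size_bytes == sizeof(float)) {
    for (size_t i = 0; i < n; ++i) {
      float v;
      std::memcpy(&v, bytes.data() + i * sizeof(float), sizeof(float));
      out[i] = static_cast<double>(v);  // widen, never truncate
    }
  } else if (n > 0) {
    std::memcpy(out.data(), bytes.data(), bytes.size());
  }
  return out;
}
```

With this shape, f_t == float and f_t == double round-trip through the same decode path, which is what the suggested tests would cover.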
```cpp
  std::vector<uint8_t> buffer(msg.ByteSizeLong());
  msg.SerializeToArray(buffer.data(), static_cast<int>(buffer.size()));
  return buffer;
```
🧩 Analysis chain

```bash
# Locate and inspect the header under review
find . -name "grpc_incumbent_proto.hpp" -type f
cat -n ./cpp/src/grpc/server/grpc_incumbent_proto.hpp
# Compare SerializeToArray error handling across the codebase
rg "SerializeToArray" -B 2 -A 2
rg "SerializeToArray" cpp/cuopt_grpc_server.cpp -B 5 -A 5
```
Check SerializeToArray return value and validate size narrowing.

Line 31 silently ignores SerializeToArray failure and narrows ByteSizeLong() (returning size_t) to int without validation. Failed serialization returns an empty/invalid buffer indistinguishable from success, potentially causing data loss. Size narrowing could lose data for large payloads. Other parts of the codebase (e.g., cuopt_grpc_server.cpp) properly check this return value; apply the same error handling pattern here.

Proposed fix

```diff
+#include <limits>
@@
-  std::vector<uint8_t> buffer(msg.ByteSizeLong());
-  msg.SerializeToArray(buffer.data(), static_cast<int>(buffer.size()));
+  const auto size = msg.ByteSizeLong();
+  if (size > static_cast<size_t>(std::numeric_limits<int>::max())) { return {}; }
+  std::vector<uint8_t> buffer(size);
+  if (!msg.SerializeToArray(buffer.data(), static_cast<int>(buffer.size()))) { return {}; }
   return buffer;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/grpc/server/grpc_incumbent_proto.hpp` around lines 30 - 32, The
current serialize path allocates buffer using msg.ByteSizeLong(), then calls
msg.SerializeToArray(buffer.data(), static_cast<int>(buffer.size())) without
checking the int narrowing or the return value; update the code around
msg.SerializeToArray to first verify that buffer.size() fits into int (e.g.,
compare against std::numeric_limits<int>::max()) before narrowing, then capture
and check the boolean return of msg.SerializeToArray and handle failures
(log/error/throw) similar to cuopt_grpc_server.cpp; ensure the function that
returns the std::vector<uint8_t> (refer to variables buffer and the call
msg.SerializeToArray) does not return an empty/invalid buffer on serialize
failure.
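For reference, the checked-serialize pattern can be sketched self-contained; `FakeMessage` here stands in for a protobuf message (real code would call `ByteSizeLong()` / `SerializeToArray()` on the proto type), and returning `std::optional` makes failure distinguishable from a legitimately empty buffer:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>
#include <optional>
#include <vector>

// Stand-in for a protobuf message so the sketch compiles without protobuf.
struct FakeMessage {
  std::vector<uint8_t> payload;
  size_t ByteSizeLong() const { return payload.size(); }
  bool SerializeToArray(void* dst, int n) const
  {
    if (n < 0 || static_cast<size_t>(n) != payload.size()) return false;
    if (!payload.empty()) std::memcpy(dst, payload.data(), payload.size());
    return true;
  }
};

// Returns nullopt on serialize failure or oversized payload, so callers can
// tell an error apart from an intentionally empty buffer.
std::optional<std::vector<uint8_t>> serialize_checked(const FakeMessage& msg)
{
  const size_t size = msg.ByteSizeLong();
  if (size > static_cast<size_t>(std::numeric_limits<int>::max())) { return std::nullopt; }
  std::vector<uint8_t> buffer(size);
  if (!msg.SerializeToArray(buffer.data(), static_cast<int>(buffer.size()))) {
    return std::nullopt;
  }
  return buffer;
}
```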
```cpp
class IncumbentPipeCallback : public cuopt::internals::get_solution_callback_t {
 public:
  IncumbentPipeCallback(std::string job_id, int fd, size_t num_vars, bool is_float)
    : job_id_(std::move(job_id)), fd_(fd)
  {
    n_variables = num_vars;
    isFloat     = is_float;
  }

  void get_solution(void* data,
                    void* objective_value,
                    void* /*solution_bound*/,
                    void* /*user_data*/) override
  {
    if (fd_ < 0 || n_variables == 0) { return; }

    double objective = 0.0;
    std::vector<double> assignment;
    assignment.resize(n_variables);

    if (isFloat) {
      const float* float_data = static_cast<const float*>(data);
      for (size_t i = 0; i < n_variables; ++i) {
        assignment[i] = static_cast<double>(float_data[i]);
      }
      objective = static_cast<double>(*static_cast<const float*>(objective_value));
    } else {
      const double* double_data = static_cast<const double*>(data);
      std::copy(double_data, double_data + n_variables, assignment.begin());
      objective = *static_cast<const double*>(objective_value);
    }

    auto buffer = build_incumbent_proto(job_id_, objective, assignment);
    std::cout << "[Worker] Incumbent callback job_id=" << job_id_ << " obj=" << objective
              << " vars=" << assignment.size() << "\n";
    std::cout.flush();
    send_incumbent_pipe(fd_, buffer);
  }

 private:
  std::string job_id_;
  int fd_;
};
```
Incumbent callback: validate pipe-send result and gate logging to configured verbosity levels
The send_incumbent_pipe() return value must be checked—a broken pipe (e.g., client disconnection, EPIPE) will silently fail to deliver the incumbent solution, leaving the client without results. Consider: storing the pipe status state in the callback and disabling further sends on failure, or propagating the error to fail the job.
Additionally, the unconditional std::cout logging of every incumbent produces unbounded output when optimizers report many solutions (typical in combinatorial problems). Gate this logging behind a configuration flag (e.g., config.verbose or job.log_level) to prevent production logs from becoming unusable.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/grpc/server/grpc_worker.cpp` around lines 12 - 54, The incumbent
callback currently ignores send_incumbent_pipe() errors and always logs; update
IncumbentPipeCallback (constructor and get_solution) to accept a
verbosity/logging flag (e.g., bool verbose or int log_level) and to check the
return value of send_incumbent_pipe(fd_, buffer) on each call; on error, record
the pipe as closed (set fd_ = -1 or a boolean failed_ member) to stop future
sends and optionally propagate/fail the job, and wrap the std::cout logging
behind the new verbosity flag so it only prints when verbose is enabled;
reference get_solution, send_incumbent_pipe, job_id_, fd_, build_incumbent_proto
and add a member like verbose_ or failed_ to gate behavior.
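A sketch of the latching idea, with `send_fn` standing in for `send_incumbent_pipe()`; the class and member names (`failed_`, `verbose_`) follow the suggestion above and are illustrative, not the existing worker API:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <vector>

// Latches pipe failure so a broken pipe (e.g., EPIPE after client disconnect)
// disables all further sends instead of silently failing on every incumbent.
class IncumbentSender {
 public:
  IncumbentSender(int fd, bool verbose) : fd_(fd), verbose_(verbose) {}

  template <typename SendFn>
  void send(const std::vector<uint8_t>& buf, SendFn&& send_fn)
  {
    if (failed_ || fd_ < 0) { return; }  // stop after first failure
    if (!send_fn(fd_, buf)) {
      failed_ = true;  // record broken pipe; callers can fail the job
      fd_     = -1;
      return;
    }
    if (verbose_) {  // logging gated behind configured verbosity
      std::fprintf(stderr, "incumbent sent: %zu bytes\n", buf.size());
    }
  }

  bool failed() const { return failed_; }

 private:
  int fd_;
  bool verbose_;
  bool failed_ = false;
};
```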
```cpp
static void store_simple_result(const std::string& job_id,
                                int worker_id,
                                int status,
                                const char* error_message)
{
  for (size_t i = 0; i < MAX_RESULTS; ++i) {
    if (!result_queue[i].ready) {
      copy_cstr(result_queue[i].job_id, job_id);
      result_queue[i].status       = status;
      result_queue[i].data_size    = 0;
      result_queue[i].worker_index = worker_id;
      copy_cstr(result_queue[i].error_message, error_message);
      result_queue[i].error_message[sizeof(result_queue[i].error_message) - 1] = '\0';
      result_queue[i].retrieved = false;
      result_queue[i].ready     = true;
      break;
    }
  }
}
```
🧩 Analysis chain

```bash
# Locate the file and read the sections under review
find . -name "grpc_worker.cpp" -type f
sed -n '45,85p' cpp/src/grpc/server/grpc_worker.cpp | cat -n
sed -n '330,375p' cpp/src/grpc/server/grpc_worker.cpp | cat -n
# Where is store_simple_result called, and how is result_queue declared?
rg -n "store_simple_result" cpp/src/grpc/server/grpc_worker.cpp
rg -n "result_queue|result_queue\[" cpp/src/grpc/server/grpc_worker.cpp -B 2 -A 2
rg -n "struct ResultQueueEntry|class ResultQueueEntry" cpp/src/grpc/server/ -A 10
# Any synchronization (mutexes, atomics, CAS) protecting the queue?
rg -n "std::mutex|pthread_mutex|lock_guard|unique_lock" cpp/src/grpc/server/ -r
rg -n "CAS|compare_exchange|cas_flag" cpp/src/grpc/server/grpc_worker.cpp -i
# Compare with the job_queue claimed-flag pattern and check shared-memory usage
rg -n "job_queue\[.*\].claimed" cpp/src/grpc/server/grpc_worker.cpp -B 3 -A 3
grep -A 20 "struct JobQueueEntry" cpp/src/grpc/server/grpc_server_types.hpp
rg -n "shm_open|mmap.*result_queue" cpp/src/grpc/server/
rg -n "result_queue" cpp/src/grpc/server/grpc_server_main.cpp -B 3 -A 3
```
Multi-worker race condition on result queue slot allocation
Both store_simple_result() and the main enqueue path pick the first !ready slot without atomic claim. With multiple worker processes writing to shared memory (via mmap), two workers can read the same slot as available and clobber each other's data on non-atomic fields (job_id, status, data_size, error_message). The atomic ready flag alone does not protect concurrent writes to these fields. Additionally, when the queue is full, result_slot remains -1 and the job result is silently lost without error.
Use atomic compare-and-swap on a dedicated claim flag (following the JobQueueEntry.claimed pattern) to atomically reserve a slot before writing, or protect the entire slot allocation and write sequence with a mutex. Fail explicitly if no slots are available.
Applies to lines 56-74, 341-363, and similar pattern in grpc_worker_infra.cpp:194-206.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/grpc/server/grpc_worker.cpp` around lines 56 - 74,
store_simple_result currently scans result_queue for the first entry with
ready==false and writes non-atomic fields, which races with other workers;
replace that pattern with an atomic claim step (mirror JobQueueEntry.claimed
semantics) or protect allocation+write with a mutex: perform an atomic
compare-and-swap on a per-slot claim flag to reserve the slot, only then write
job_id/status/data_size/error_message and finally set ready; ensure the claim
flag is cleared on failure and set/cleared correctly on success; also make the
function return or log an explicit error when no slots can be claimed (do not
silently drop results). Apply the same fix where result_slot is selected in the
main enqueue path and in grpc_worker_infra.cpp where the same pattern appears.
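The claim-before-write pattern can be sketched as follows; `ResultSlot` and `claim_slot` are illustrative names mirroring the suggested per-slot `claimed` flag, not the existing structs:

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

// A per-slot `claimed` flag is reserved with compare_exchange_strong BEFORE any
// non-atomic fields are written; `ready` is published last. This follows the
// JobQueueEntry.claimed pattern cited in the review.
struct ResultSlot {
  std::atomic<bool> claimed{false};
  std::atomic<bool> ready{false};
  int status = 0;  // non-atomic payload, safe to write once the slot is claimed
};

template <std::size_t N>
int claim_slot(std::array<ResultSlot, N>& slots)
{
  for (std::size_t i = 0; i < N; ++i) {
    bool expected = false;
    if (slots[i].claimed.compare_exchange_strong(expected, true)) {
      return static_cast<int>(i);  // exclusively owned by this writer
    }
  }
  return -1;  // queue full: caller must report an explicit error, not drop the result
}
```

After writing the payload the producer sets `ready = true` (release order suffices); the consumer clears both flags only after copying the result out.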
```cpp
  cudaMemcpy(host_solution.data(),
             device_solution.data(),
             device_solution.size() * sizeof(double),
             cudaMemcpyDeviceToHost);
```
🧩 Analysis chain

```bash
# Locate grpc_worker.cpp and read the copy-out sections
fd "grpc_worker\.cpp"
sed -n '240,260p' cpp/src/grpc/server/grpc_worker.cpp | cat -n
sed -n '290,315p' cpp/src/grpc/server/grpc_worker.cpp | cat -n
# How many cudaMemcpy calls, and is any of them checked?
grep -c "cudaMemcpy" cpp/src/grpc/server/grpc_worker.cpp
grep -A 3 "cudaMemcpy" cpp/src/grpc/server/grpc_worker.cpp | head -40
grep -n -E "CUDA_CHECK|cudaGetLastError|cudaDeviceSynchronize" cpp/src/grpc/server/grpc_worker.cpp
# Is there a try/catch that could surface a thrown CUDA error?
grep -n -E "try|catch|throw|exception" cpp/src/grpc/server/grpc_worker.cpp | head -20
head -30 cpp/src/grpc/server/grpc_worker.cpp
```
Check CUDA return codes and fail the job on copy errors

cudaMemcpy failures are currently ignored, which can yield corrupted results or hard-to-debug downstream failures. Since this worker already wraps work in a try/catch, you can throw on CUDA failures and surface a clean error message to clients.

Example patch (apply to all cudaMemcpy sites)

```diff
-  cudaMemcpy(host_solution.data(),
-             device_solution.data(),
-             device_solution.size() * sizeof(double),
-             cudaMemcpyDeviceToHost);
+  {
+    auto err = cudaMemcpy(host_solution.data(),
+                          device_solution.data(),
+                          device_solution.size() * sizeof(double),
+                          cudaMemcpyDeviceToHost);
+    if (err != cudaSuccess) {
+      throw std::runtime_error(std::string("cudaMemcpy(device_solution) failed: ") +
+                               cudaGetErrorString(err));
+    }
+  }
```

Also applies to: lines 296-307 (host_primal, host_dual, host_reduced_cost)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/grpc/server/grpc_worker.cpp` around lines 248 - 252, The cudaMemcpy
call copying device_solution to host_solution ignores return codes and can
silently fail; change the cudaMemcpy(...) call to capture the cudaError_t
result, check if it's not cudaSuccess, and throw a descriptive exception (e.g.,
std::runtime_error) including cudaGetErrorString(err) so the worker's try/catch
surfaces a clean error to clients; apply the same pattern to the other copies
that populate host_primal, host_dual, and host_reduced_cost (i.e., any uses of
cudaMemcpy with device_* -> host_* in grpc_worker.cpp).
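The throw-on-error helper can be factored out once and reused at every copy site. The sketch below uses a stand-in `cuda_error_t` enum so it compiles without the CUDA runtime; real code would take `cudaError_t`, compare against `cudaSuccess`, and call `cudaGetErrorString`:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Stand-ins for cudaError_t / cudaGetErrorString (illustrative only).
enum class cuda_error_t { success, invalid_value };

const char* error_string(cuda_error_t e)
{
  return e == cuda_error_t::success ? "success" : "invalid value";
}

// Throws a descriptive runtime_error so the worker's existing try/catch can
// surface a clean message to clients instead of returning corrupted data.
void check_cuda(cuda_error_t err, const std::string& what)
{
  if (err != cuda_error_t::success) {
    throw std::runtime_error(what + " failed: " + error_string(err));
  }
}
```

Usage at a copy site would then read `check_cuda(cudaMemcpy(...), "cudaMemcpy(device_solution)");`.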
```cpp
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
```
🧩 Analysis chain

```bash
#!/bin/bash
set -euo pipefail
echo "== Check required include =="
rg -n '^\s*#include <fcntl\.h>' cpp/tests/linear_programming/grpc/grpc_integration_test.cpp || echo "fcntl.h NOT found"
echo "== Check open/O_* usage =="
rg -n '\bopen\s*\(|\bO_WRONLY\b|\bO_CREAT\b|\bO_TRUNC\b' cpp/tests/linear_programming/grpc/grpc_integration_test.cpp || echo "No usage found"
echo "== Show context around lines 53-57 and 120 =="
sed -n '50,60p' cpp/tests/linear_programming/grpc/grpc_integration_test.cpp
sed -n '115,125p' cpp/tests/linear_programming/grpc/grpc_integration_test.cpp
```
Add the missing POSIX header for open/O_* usage.

The <fcntl.h> header is not included, but open() and flags like O_WRONLY, O_CREAT, and O_TRUNC are used at line 120. This can fail on stricter toolchains that don't include fcntl.h transitively.

🔧 Proposed fix

```diff
 #include <signal.h>
+#include <fcntl.h>
 #include <sys/types.h>
 #include <sys/wait.h>
 #include <unistd.h>
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```cpp
#include <signal.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/tests/linear_programming/grpc/grpc_integration_test.cpp` around lines 53
- 57, The test file is missing the POSIX header required for open() and O_*
flags; add the missing include by inserting `#include` <fcntl.h> near the other
system headers in grpc_integration_test.cpp so symbols like open(), O_WRONLY,
O_CREAT, and O_TRUNC used later (around the code that opens the log file) are
defined on strict toolchains.
Actionable comments posted: 3
🧹 Nitpick comments (3)
cpp/src/grpc/client/test_grpc_client.cpp (1)
340-340: Print the client error on log-stream failure for parity with other modes.

At Line 340, a failed `stream_logs` returns non-zero but drops the error detail. Emit `client.get_last_error()` before returning to keep failure diagnosis consistent.

Proposed fix

```diff
-  return success ? 0 : 1;
+  if (!success) {
+    std::cerr << "Log stream failed: " << client.get_last_error() << "\n";
+    return 1;
+  }
+  return 0;
```

As per coding guidelines: "Verify error propagation from CUDA to user-facing APIs is complete; ensure CUDA errors are caught and mapped to meaningful user error codes".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/test_grpc_client.cpp` at line 340, The return path after stream_logs currently returns non-zero on failure but drops the error detail; update the failure branch surrounding stream_logs in test_grpc_client.cpp to call and emit client.get_last_error() (or equivalent accessor) to the log/stream before returning the non-zero exit code so the client error is logged for parity with other modes and aids diagnosis (locate the code around stream_logs and the return of success ? 0 : 1 and insert a process/log call that prints client.get_last_error() when success is false).cpp/src/grpc/server/grpc_pipe_serialization.hpp (1)
24-27: Consider making `kPipeChunkBytes` configurable.

The fixed 64 MiB chunk cap is a hard resource limit; exposing it via server config/env (with this as the default) would improve deploy-time tuning.
As per coding guidelines: "Check that hard-coded GPU device IDs and resource limits are made configurable; abstract multi-backend support for different CUDA versions."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` around lines 24 - 27, kPipeChunkBytes is a hard-coded constexpr limit; make it configurable by replacing the fixed constexpr with a runtime-configured value (e.g., a GetPipeChunkBytes() or PipeConfig::pipe_chunk_bytes() accessor) that reads an environment variable or server config and falls back to 64ULL * 1024 * 1024 as the default; update references to kPipeChunkBytes to call the accessor and validate the parsed value (positive, reasonable upper bound) and document the new env var name (e.g., PIPE_CHUNK_BYTES) and default behavior.cpp/tests/linear_programming/grpc/grpc_integration_test.cpp (1)
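One way to sketch the env-based override, with a validated fallback to the current 64 MiB default; the variable name `PIPE_CHUNK_BYTES` and the 1 GiB cap are illustrative choices, not existing cuOpt settings:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Resolve the pipe chunk size from the environment with validation; reject
// non-numeric, zero, or absurdly large values and fall back to the default.
uint64_t pipe_chunk_bytes()
{
  constexpr uint64_t kDefault = 64ULL * 1024 * 1024;  // current hard-coded cap
  constexpr uint64_t kMax     = 1024ULL * 1024 * 1024;  // 1 GiB sanity bound
  const char* env = std::getenv("PIPE_CHUNK_BYTES");
  if (env == nullptr) { return kDefault; }
  char* end                  = nullptr;
  const unsigned long long v = std::strtoull(env, &end, 10);
  if (end == env || *end != '\0' || v == 0 || v > kMax) { return kDefault; }
  return static_cast<uint64_t>(v);
}
```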
957-982: Assert the log-callback path in `SolveMIPWithLogCallback`.

`received_logs` is populated but never asserted, so this test can pass even if callback streaming regresses.

💡 Suggested assertion

```diff
   auto result = client->solve_mip(problem, settings, false);
   EXPECT_TRUE(result.success) << result.error_message;
+  EXPECT_FALSE(received_logs.empty()) << "Expected at least one streamed log line";
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/linear_programming/grpc/grpc_integration_test.cpp` around lines 957 - 982, The test registers a log callback (received_logs with log_mutex) but never asserts anything, so add assertions after the call to client->solve_mip in SolveMIPWithLogCallback to verify the callback was invoked and logs look correct: acquire lock(log_mutex) and assert !received_logs.empty() (and optionally that at least one entry contains an expected substring like "objective" or "MIP" to ensure meaningful log content). If solve_mip is asynchronous/streams logs, ensure you wait briefly or poll until received_logs is non-empty before asserting to avoid flakes; reference received_logs, log_mutex, log_callback, SolveMIPWithLogCallback, and client->solve_mip when adding the checks.
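If the log stream is asynchronous, a small poll-until helper avoids flaky asserts; `wait_until_true` below is an assumed test utility, not an existing gtest function:

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Poll a predicate until it holds or the deadline passes. Intended for use
// before asserting on asynchronously streamed logs, e.g.:
//   ASSERT_TRUE(wait_until_true([&]{ std::lock_guard<std::mutex> l(log_mutex);
//                                    return !received_logs.empty(); },
//                               std::chrono::seconds(5)));
bool wait_until_true(const std::function<bool()>& pred,
                     std::chrono::milliseconds timeout,
                     std::chrono::milliseconds interval = std::chrono::milliseconds(10))
{
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (std::chrono::steady_clock::now() < deadline) {
    if (pred()) { return true; }
    std::this_thread::sleep_for(interval);
  }
  return pred();  // one last check at the deadline
}
```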
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/CMakeLists.txt`:
- Around line 297-304: The CMake snippet only checks for the non-namespaced
target grpc_cpp_plugin; update the check to first test for the namespaced target
gRPC::grpc_cpp_plugin (like how gRPC::grpc++ is handled) and use its target file
if present, otherwise fall back to checking TARGET grpc_cpp_plugin and then
find_program for the executable; adjust references to set
_GRPC_CPP_PLUGIN_EXECUTABLE from "$<TARGET_FILE:gRPC::grpc_cpp_plugin>" when
available to ensure installed gRPC packages are recognized.
In `@cpp/tests/linear_programming/grpc/grpc_client_test.cpp`:
- Around line 1387-1402: The test currently expects GetResult to be called even
though CheckStatus returns PROCESSING (not complete); update the test so
GetResult is not expected for PROCESSING jobs: remove or replace the
EXPECT_CALL(*mock_stub_, GetResult(...)).WillOnce(...) with an expectation that
it is not called (e.g., EXPECT_CALL(*mock_stub_, GetResult(_, _, _)).Times(0))
or simply delete the GetResult expectation, keeping the CheckStatus stub that
sets resp->set_job_status(cuopt::remote::PROCESSING) so the test asserts the
intended "do not call GetResult for non-complete status" behavior.
In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp`:
- Around line 275-283: The server_log_count function can infinite-loop when
substring is empty; add a guard at the start of server_log_count (which calls
get_server_logs()) to return 0 (or handle as appropriate) if substring.empty()
before entering the while loop so pos will never rely on substring.length() == 0
to advance.
---
Duplicate comments:
In `@cpp/CMakeLists.txt`:
- Around line 279-294: The current CMake logic relies on IMPORTED_LOCATION*
which can be empty for non-imported targets; update the resolution to use
$<TARGET_FILE:protobuf::protoc> when the target exists: in the branch that tests
if(TARGET protobuf::protoc) set _PROTOBUF_PROTOC to the generator expression
$<TARGET_FILE:protobuf::protoc> (instead of querying IMPORTED_LOCATION*),
otherwise fall back to find_package(protobuf CONFIG REQUIRED) and then use
IMPORTED_LOCATION_RELEASE/IMPORTED_LOCATION as a fallback; keep the existing
fatal check for empty _PROTOBUF_PROTOC. Ensure you reference the symbols
protobuf::protoc and _PROTOBUF_PROTOC in the change so the target-file generator
expression is used for in-tree/FetchContent builds.
In `@cpp/src/grpc/client/grpc_client.cu`:
- Around line 258-265: grpc_client_t::check_status (and similar RPC helpers)
currently dereferences impl_->stub without checking connection state which can
crash if called before connect(); add a null-check for impl_ and impl_->stub at
the start of grpc_client_t::check_status and for the same pattern in the other
RPC helper methods (the ones calling impl_->stub->...). If either is null,
return an appropriate error job_status_result_t (populate error code/message or
a failed state consistent with existing error handling) instead of calling the
stub; otherwise proceed to build the request and call CheckStatus as before.
- Around line 994-995: The code computes max_polls by dividing by
config_.poll_interval_ms without checking for <=0; add a guard before any
division (both where max_polls is computed and the other occurrence around the
1132 region) to validate config_.poll_interval_ms > 0, and if not either (a)
return or raise a clear error/invalid-argument from the surrounding function, or
(b) log an error and use a safe default (e.g., 1 ms) before computing max_polls;
ensure you reference config_.poll_interval_ms and update the computations that
set max_polls to avoid divide-by-zero crashes.
- Around line 221-251: The read loop in start_log_streaming blocks on
reader->Read() so stop_log_streaming can hang; to fix, add a mutex-protected
pointer member (log_context_mutex_ and active_log_context_) and set
active_log_context_ = &context (protected by log_context_mutex_) right after
creating grpc::ClientContext context in start_log_streaming, clear it (set
nullptr) after reader->Finish(); then update stop_log_streaming to lock
log_context_mutex_ and if active_log_context_ is non-null call
active_log_context_->TryCancel() before setting stop_logs_ and joining the
thread so the blocking StreamLogs read is cancelled safely.
- Around line 703-712: Validate elem_size and guard against overflow before
dividing/allocating: check elem_size > 0 before computing elems_per_chunk =
chunk_data_budget / elem_size (return/continue or log and skip if zero), ensure
total_elems > 0 (already present) and verify total_elems <= SIZE_MAX / elem_size
(or use safe_mul_check) before computing total_elems * elem_size and allocating
array_bytes, and if the allocation would exceed a configured budget or SIZE_MAX
also error/skip; apply the same checks when computing per-chunk sizes and before
any memcpy/copy operations. Reference symbols: elem_size, total_elems,
chunk_data_budget, elems_per_chunk, array_bytes, arr_desc.total_elements(),
arr_desc.element_size_bytes() (also update the similar block around the other
occurrence).
- Around line 262-265: The CheckStatus RPC (and other RPCs that create
grpc::ClientContext, e.g., the sites calling impl_->stub->CheckStatus and the
other stub->* calls listed) currently create a ClientContext with no deadline;
fix by setting a per-RPC deadline on the context before the call using
context.set_deadline(std::chrono::system_clock::now() + configured_timeout).
Prefer reusing or adding a small helper (e.g., set_deadline(context, timeout_ms)
or a method on impl_ that reads a configured RPC timeout) and apply it to the
Call sites such as the CheckStatus call (impl_->stub->CheckStatus) and the other
grpc::ClientContext creation sites flagged in the review so all RPCs
consistently use the same configured deadline.
In `@cpp/src/grpc/client/grpc_client.hpp`:
- Around line 53-99: The grpc_client_config_t allows invalid values for
intervals/sizes (poll_interval_ms, incumbent_poll_interval_ms, timeout_seconds,
chunk_size_bytes, max_message_bytes, chunked_array_threshold_bytes) which can
break runtime behavior; add a validation method (e.g.,
grpc_client_config_t::Validate() or a constructor/static factory) that enforces
and documents invariants (positive non-zero ints for intervals/timeout, sensible
min/max for chunk_size_bytes and max_message_bytes, and explicit handling for
chunked_array_threshold_bytes semantics (-1 for auto, 0 for force-chunked, >0 as
threshold)), make the client call Validate() early (e.g., before using in
grpc_client constructors or SubmitJob path) and return/throw a clear error on
invalid config or clamp values where safe; update comments to list accepted
ranges for each field.
In `@cpp/src/grpc/client/solve_remote.cu`:
- Around line 55-57: The env-derived overrides for config.chunk_size_bytes and
config.max_message_bytes are applied without validation (using parse_env_int64),
allowing zero/negative/oversized values; update the logic where
config.chunk_size_bytes, config.max_message_bytes (and optionally
config.enable_debug_log) are set so that after calling parse_env_int64 you
validate the result against sane bounds (e.g., >0 and <= a defined
MAX_MESSAGE_BYTES/CHUNK_SIZE constant), ignore or clamp invalid values, and emit
a debug/warning log via the existing logger; specifically modify the section
using parse_env_int64 to perform the check and fallback to existing config
values when the parsed value is out-of-range.
- Around line 164-166: The callback is being passed the incumbent objective as
the bound (bound_copy = objective) which can show a false zero gap; change the
code so bound_copy is set to the solver's actual best bound variable (the
maintained global/best bound used by the B&B engine) instead of 'objective'
before calling get_callback->get_solution(...); locate where the solver stores
the current best bound (e.g., variables named best_bound, global_bound,
current_bound, or similar in the surrounding code/branch-and-bound context) and
use that identifier (preserving obj_copy, solution_copy, and
callback->get_user_data() usage).
In `@cpp/src/grpc/client/test_grpc_client.cpp`:
- Around line 401-403: The parsing of numeric CLI arguments uses std::stod and
std::stoll without guards and converts a double to int unsafely (variable
time_limit and the static_cast<int>(time_limit)); wrap each std::stod/std::stoll
call in a try/catch (catch std::invalid_argument and std::out_of_range),
validate the parsed value for positivity and reasonable bounds, and return/exit
with a clear error on bad input; before converting time_limit to int (the
static_cast<int>(time_limit) usage) check that time_limit is finite, >= 0 and <=
std::numeric_limits<int>::max() (or clamp/handle overflow) to avoid UB; apply
the same defensive checks to the stoll usages (ensure values fit in target
integer types) and use the same error path if validation fails.
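A minimal sketch of the guarded parse described above, assuming a hypothetical helper name (`parse_time_limit`): it catches `std::invalid_argument` / `std::out_of_range`, rejects trailing junk, and verifies the value can be safely narrowed to `int` before the cast.

```cpp
#include <cmath>
#include <limits>
#include <stdexcept>
#include <string>

// Hypothetical helper: parse a non-negative time limit that must fit in int.
// Returns false on any parse or range failure instead of propagating/UB.
bool parse_time_limit(const std::string& arg, int& out)
{
  double v = 0.0;
  try {
    size_t pos = 0;
    v = std::stod(arg, &pos);
    if (pos != arg.size()) return false;  // trailing junk, e.g. "10x"
  } catch (const std::invalid_argument&) {
    return false;
  } catch (const std::out_of_range&) {
    return false;
  }
  // static_cast<int> of a non-finite or out-of-range double is UB; check first.
  if (!std::isfinite(v) || v < 0.0 ||
      v > static_cast<double>(std::numeric_limits<int>::max())) {
    return false;
  }
  out = static_cast<int>(v);
  return true;
}
```

The same try/catch-plus-bounds pattern applies to the `std::stoll` call sites, with the bound chosen for the target integer type.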
In `@cpp/src/grpc/cuopt_remote.proto`:
- Line 6: The proto package declaration "package cuopt.remote" in
cuopt_remote.proto conflicts with its directory path (cpp/src/grpc) and triggers
Buf PACKAGE_DIRECTORY_MATCH; fix by making the package name match the file path
or by moving the file to match the package. Either change the package line in
cpp/src/grpc/cuopt_remote.proto from "package cuopt.remote" to "package
cpp.src.grpc" (matching the directory segments as dot-separated identifiers) or
relocate cuopt_remote.proto into a cuopt/remote directory so the existing
"package cuopt.remote" matches the filesystem; update any import/usage
references accordingly and re-run buf lint to verify.
In `@cpp/src/grpc/grpc_problem_mapper.cu`:
- Around line 175-182: The code assumes constraint_upper_bounds exists when
pb_problem.constraint_lower_bounds_size() > 0; instead validate both arrays and
their sizes before calling cpu_problem.set_constraint_lower_bounds and
cpu_problem.set_constraint_upper_bounds: check
pb_problem.constraint_lower_bounds_size() > 0 &&
pb_problem.constraint_upper_bounds_size() > 0 and that their sizes match, and
only then build con_lb/con_ub and call the two set_* methods; if sizes mismatch
or one is missing, handle the case (skip setting bounds or raise/log an error)
to avoid creating an inconsistent problem state.
- Around line 435-446: get_string_list can advance s past s_end when memchr
returns null (no trailing NUL); fix by computing remaining = s_end - s, call
memchr over remaining, and if memchr returns null push the final string using
the remaining length and then break (do not set s = nul + 1). Update the loop in
get_string_list to use it->second.data() and it->second.size() to compute s and
s_end, use the computed remaining length for std::memchr, and on null result
construct names.emplace_back(s, remaining) and break instead of setting s = nul
+ 1.
- Around line 399-411: The lambdas (notably get_doubles) compute element count
as it->second.size() / sizeof(double) then memcpy the full byte payload, which
can overflow if the byte length is not a multiple of sizeof(double); add an
explicit alignment/length check before any typed memcpy: verify
it->second.size() % sizeof(double) == 0 (or the equivalent check for the other
lambda’s element type), and if the check fails either return an empty vector or
log/throw an error; only then allocate the destination vector with n =
it->second.size() / sizeof(double) and perform memcpy; apply the same fix to the
other similar lambda at the noted region (around 414-426) to avoid misaligned
copies.
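The alignment check above can be sketched like this; the function name is illustrative and stands in for the `get_doubles` lambda, with the payload modeled as a `std::string` as protobuf bytes fields expose it.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Sketch of the guarded conversion: refuse payloads whose byte length is
// not a whole number of doubles instead of memcpy'ing past the buffer.
std::vector<double> bytes_to_doubles_checked(const std::string& raw)
{
  if (raw.empty() || raw.size() % sizeof(double) != 0) {
    return {};  // misaligned payload: return empty (or log/throw)
  }
  std::vector<double> out(raw.size() / sizeof(double));
  std::memcpy(out.data(), raw.data(), out.size() * sizeof(double));
  return out;
}
```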
In `@cpp/src/grpc/grpc_service_mapper.cu`:
- Around line 23-55: Chunked wire format is inconsistent for floats: update
chunk_typed_array (and the client-side build_array_chunk_requests that calls it)
to always emit floating arrays as IEEE double on the wire (so server get_doubles
can continue to interpret chunks as sizeof(double)). Concretely, when T ==
float, convert the input vector to a temporary std::vector<double> and use that
buffer for chunking (set elem_size = sizeof(double) and point raw at the double
buffer) so byte offsets/counts match get_doubles; for non-floating types keep
the existing behavior. Ensure build_array_chunk_requests callers use the
adjusted chunk_typed_array semantics and do not send raw float bytes anymore.
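The float-to-double widening step can be sketched as below; `widen_for_wire` is a hypothetical name for the conversion that would run before chunking, so that byte offsets and counts on the wire always use `sizeof(double)`.

```cpp
#include <vector>

// Sketch: normalize float arrays to IEEE double before chunking so the
// server-side reader can always assume sizeof(double) elements.
std::vector<double> widen_for_wire(const std::vector<float>& in)
{
  std::vector<double> out;
  out.reserve(in.size());
  for (float f : in) out.push_back(static_cast<double>(f));
  return out;
}
```

The chunker would then point its raw buffer at `out.data()` with `elem_size = sizeof(double)`, keeping offsets consistent with the server's `get_doubles` interpretation.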
In `@cpp/src/grpc/grpc_solution_mapper.cu`:
- Around line 520-534: The bytes_to_typed template is copying raw.size() bytes
into buffers sized for n elements, causing overflow when raw.size() isn't an
exact multiple of the element byte-size; update each branch (the
std::is_same_v<T,float> branch that builds tmp as double, the
std::is_same_v<T,double> branch that builds v as double, and the generic branch)
to: compute n = raw.size() / sizeof(ElemType), check that raw.size() %
sizeof(ElemType) == 0 and handle (throw/log/return error) if not, and use
memcpy(dst.data(), raw.data(), n * sizeof(ElemType)) instead of memcpy(...,
raw.size()) so you only copy the exact bytes that fit the destination (refer to
function bytes_to_typed and the template T branches).
In `@cpp/src/grpc/server/grpc_incumbent_proto.hpp`:
- Around line 46-49: The assignment vector is not cleared before appending
parsed values, so repeated calls to parse_incumbent_proto will accumulate stale
entries; fix by calling assignment.clear() (or assign empty) before
assignment.reserve(...) in the code that iterates over incumbent_msg
(referencing the assignment variable and the loop that uses
incumbent_msg.assignment_size() and incumbent_msg.assignment(i)) so the vector
starts empty each parse.
- Around line 30-32: The serialization ignores SerializeToArray's boolean return
and narrows ByteSizeLong() (size_t) to int unsafely: update the code that
constructs buffer and calls msg.SerializeToArray to first capture size_t sz =
msg.ByteSizeLong(), validate sz <= static_cast<size_t>(INT_MAX) before casting,
allocate buffer with that size, call bool ok =
msg.SerializeToArray(buffer.data(), static_cast<int>(sz)), and handle a false
result (e.g., throw or return an error) instead of ignoring it; reference the
buffer variable, msg.ByteSizeLong(), and msg.SerializeToArray(...) when making
the change.
In `@cpp/src/grpc/server/grpc_job_management.cpp`:
- Around line 107-113: The recv_incumbent_pipe function trusts the incoming size
prefix and can allocate arbitrarily large memory; add a sanity check before
resizing by comparing the read size (variable size) against a defined cap (e.g.,
a new constant like kMaxIncumbentSize or MAX_INCUMBENT_SIZE) and reject/return
false if size is 0 or exceeds that cap, then only call data.resize(size) and
read_from_pipe when the size is within limits; update recv_incumbent_pipe to
enforce this limit and document the constant choice.
- Around line 147-158: The loop in submit_job_async is racy because threads
perform a check-then-write on job_queue[i] (checking ready/claimed then
initializing), so change the slot acquisition to an atomic claim: use an atomic
flag (e.g., make job_queue[].claimed an std::atomic_bool or add a per-slot
std::atomic<int> state) and perform a compare_exchange_strong to transition from
false->true to reserve the slot; only the thread that succeeds should populate
the slot fields (job_id, problem_type, data_size, worker_pid, worker_index,
data_sent, cancelled) and set job_queue[i].ready = true as the final publish
step. Ensure submit_job_async uses this atomic CAS on job_queue[i].claimed and
does not set ready until after all other fields are initialized.
In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp`:
- Around line 93-100: Validate and sanitize the metadata before resizing or
memcpy: check ac.total_elements() and ac.element_offset() are non-negative and
within reasonable caps, ensure chunk_data.size() + offset does not overflow and
fits within a bounded/validated size_t, and guard against pathological
total_elements to avoid huge allocations; in the functions using ac (e.g., where
dest.resize(...) and std::memcpy(...) are called), convert to size_t only after
validation, return or throw an error on invalid/too-large values instead of
resizing/copying, and ensure the memcpy condition uses the validated,
overflow-checked bounds.
- Line 90: The deserializers currently treat ParseDelimitedFromCodedStream
failures by breaking out and eventually returning true, accepting malformed
payloads; change the error handling around
google::protobuf::util::ParseDelimitedFromCodedStream(...) (the calls that pass
ac, coded, clean_eof) so that on a parse failure you immediately fail closed and
return false (or propagate an error) instead of break/continuing; apply the same
change to both deserializer sites that call ParseDelimitedFromCodedStream to
ensure truncated/malformed messages are rejected.
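The metadata validation in the first bullet can be sketched as follows; the parameters stand in for the `ac.total_elements()` / `ac.element_offset()` accessors, and the element cap is an assumed sanity bound.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Sketch: validate untrusted chunk metadata before resize/memcpy.
// total_elements/element_offset mirror the protobuf accessors.
bool apply_chunk(std::vector<double>& dest,
                 int64_t total_elements,
                 int64_t element_offset,
                 const std::vector<double>& chunk)
{
  constexpr int64_t kMaxElements = 1 << 28;  // assumed allocation cap
  if (total_elements < 0 || total_elements > kMaxElements) return false;
  if (element_offset < 0 || element_offset > total_elements) return false;
  // Overflow-safe form of "offset + chunk must fit inside total".
  if (chunk.size() > static_cast<uint64_t>(total_elements - element_offset)) {
    return false;
  }
  if (dest.size() != static_cast<size_t>(total_elements)) {
    dest.resize(static_cast<size_t>(total_elements));
  }
  std::memcpy(dest.data() + element_offset,
              chunk.data(),
              chunk.size() * sizeof(double));
  return true;
}
```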
In `@cpp/src/grpc/server/grpc_server_main.cpp`:
- Around line 234-250: Builder::BuildAndStart() can return nullptr so avoid
dereferencing server without checking; after calling builder.BuildAndStart()
(unique symbols: builder.BuildAndStart(), std::unique_ptr<Server> server,
server->Wait(), shutdown_thread, keep_running), check if server is null and if
so log an error/print to stderr, ensure shutdown_thread is signaled/joined or
cleaned up, and exit or return with a non-zero status; if server is valid
continue to start the shutdown_thread and call server->Wait() as currently
implemented.
- Around line 89-90: The signal handler registered via signal(SIGINT,
signal_handler) performs non-async-signal-safe operations (std::cout and
result_cv.notify_all()) — change it to a minimal, async-signal-safe handler:
have signal_handler set a std::sig_atomic_t or std::atomic<bool> flag (e.g.,
shutdown_requested) and (optionally) write a byte to a dedicated pipe to wake
the main loop; then remove any std::cout or condition_variable use from
signal_handler and move the actual logging and result_cv.notify_all() into the
main thread’s event loop where you check shutdown_requested (or the pipe) and
call result_cv.notify_all() and perform printing. Update grpc_server_types.hpp
to expose the atomic flag (or pipe FD) and ensure only async-signal-safe
functions are called from signal_handler.
- Around line 156-168: The loops in grpc_server_main.cpp must not call memset on
JobQueueEntry and ResultQueueEntry because they contain std::atomic members;
remove the memset calls and instead default-construct or placement-new each
element and then explicitly initialize fields. Replace memset(&job_queue[i], 0,
sizeof(JobQueueEntry)) with either new (&job_queue[i]) JobQueueEntry() or ensure
the array is value-initialized, then call job_queue[i].ready.store(false),
.claimed.store(false), .cancelled.store(false) and
job_queue[i].worker_index.store(-1); do the same for result_queue elements
(value-initialize or placement-new, then .ready.store(false) and
.retrieved.store(false)). Ensure JobQueueEntry and ResultQueueEntry constructors
handle non-atomic fields if you prefer constructor-based initialization.
- Around line 186-212: The early-return TLS error paths currently exit
immediately after spawn_workers() and creating threads, leaking workers,
threads, and shared memory; change these branches to perform proper cleanup
before returning: after calling spawn_workers() and after launching the threads
(referencing spawn_workers() and the container holding created std::thread
objects), call the corresponding shutdown/stop routine for workers (e.g.,
stop_workers() or signal shutdown on the worker pool), join all threads in the
thread container, and release any shared memory/IPC resources (the handles
created earlier, and any variables like ssl_opts or pem_root_certs can be left
untouched) before returning an error code. Ensure every early-return path that
occurs after worker spawn or thread creation follows this cleanup sequence so no
threads or shared resources remain leaked.
- Around line 50-80: The CLI parsing loop currently in main (the for-loop that
processes argv and assigns to config) silently ignores unknown flags and missing
values; update it to validate every flag and required value: for each branch
that expects a value (flags that call std::stoi/std::stoll or assign argv[++i]
to config fields like config.port, config.num_workers, config.max_message_mb,
config.max_message_b, config.tls_cert_path, config.tls_key_path,
config.tls_root_path) ensure you check i+1<argc and on failure call
print_usage(argv[0]) and return a non-zero exit code; add an else branch for
unrecognized args (arg starts with '-') that also prints usage and exits
non-zero; additionally wrap stoi/stoll conversions in try/catch to detect
invalid numeric inputs and treat them as errors that print usage and exit
non-zero so the server cannot start with unintended defaults.
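The async-signal-safe handler shape requested in the second bullet can be sketched in a few lines; the flag name is illustrative. Only the `sig_atomic_t` store happens in the handler — logging and `result_cv.notify_all()` move to a normal thread that polls the flag.

```cpp
#include <csignal>

// Minimal async-signal-safe handler: touch only a sig_atomic_t flag.
// The main loop polls g_shutdown_requested and performs the logging and
// condition_variable notification outside signal context.
volatile std::sig_atomic_t g_shutdown_requested = 0;

extern "C" void shutdown_signal_handler(int /*signum*/)
{
  g_shutdown_requested = 1;  // no iostream, no mutexes, no CVs here
}
```

An optional refinement is the self-pipe trick: the handler additionally `write()`s one byte to a dedicated pipe so a blocked main loop wakes up immediately instead of on the next poll.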
In `@cpp/src/grpc/server/grpc_server_threads.cpp`:
- Around line 166-229: When deserialize_result_pipe_blob(...) fails you mark
it->second.status = JobStatus::FAILED but you leave the local success flag true,
causing waiter->success to be set incorrectly; update the logic so that after a
deserialization failure you set success = false (or set waiter->success based on
it->second.status != JobStatus::FAILED) before assigning waiter->success in the
waiting_threads block; modify the code paths around
deserialize_result_pipe_blob, job_tracker, and the waiter assignment
(waiter->success) so waiters observe the actual failed state and error_message
from it->second.error_message.
In `@cpp/src/grpc/server/grpc_server_types.hpp`:
- Around line 300-307: signal_handler currently calls non-async-signal-safe APIs
(std::cout and result_cv.notify_all); change it to only perform
async-signal-safe operations: remove the std::cout and result_cv.notify_all
calls and instead set an atomic/sig_atomic_t flag (ensure keep_running is an
std::atomic_bool or set a sig_atomic_t atomic_shutdown) and, if needed, set
shm_ctrl->shutdown_requested using a sig_atomic_t field; add a normal runtime
watcher thread that observes keep_running/shutdown flag and performs logging and
calls result_cv.notify_all when it detects shutdown; refer to signal_handler,
keep_running, shm_ctrl, shutdown_requested and result_cv when making these
changes.
In `@cpp/src/grpc/server/grpc_service_impl.cpp`:
- Around line 71-73: The code releases job_queue[job_idx].claimed before
publishing job_queue[job_idx].ready, creating a race where another producer can
claim the slot before consumers see ready; invert the operations so you
store(true) to job_queue[job_idx].ready before you store(false) to
job_queue[job_idx].claimed (apply same fix at the other occurrence around lines
with job_queue[...] claimed/ready, e.g., the site referenced as 265-267) to
ensure ready is visible prior to allowing re-claim.
- Around line 633-642: Hold on to waiter->mutex only while inspecting/modifying
waiter state; do not call check_job_status() while holding waiter->mutex to
avoid lock-order inversion with tracker_mutex used in cancel paths (e.g.,
cancel_job). In WaitForCompletion, restructure the loop so you acquire
unique_lock waiter's mutex, check waiter->ready and context->IsCancelled (and
adjust waiter->waiters on cancel) but before invoking check_job_status(job_id,
msg) release the lock (lock.unlock()), call check_job_status(), then re-lock
(lock.lock()) to re-evaluate waiter->ready and continue/wait; alternatively
adopt a consistent global lock order (acquire tracker_mutex before waiter->mutex
everywhere) — prefer the unlock/call/relock pattern around the call site to
quickly eliminate the deadlock risk.
- Around line 189-192: The code currently blindly adds each chunk to
meta.received_bytes, state.total_bytes, state.chunks and increments
state.total_chunks without enforcing a cumulative bound, so add a pre-check that
rejects or truncates a chunk when meta.received_bytes + raw.size() (or
state.total_bytes + raw.size()) would exceed the declared/expected field size
(use the existing expected-size field on meta or state), and only update
meta.received_bytes, state.total_bytes, state.chunks.push_back(ac) and
++state.total_chunks after the check succeeds; on failure return an error/status
indicating oversized upload to prevent unbounded memory growth.
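The unlock/call/relock pattern from the WaitForCompletion bullet can be sketched as below; `check_status` stands in for `check_job_status()`, which may take `tracker_mutex` internally, so the waiter mutex must not be held across the call.

```cpp
#include <functional>
#include <mutex>

// Sketch: never hold the waiter mutex while calling a function that
// acquires another lock internally; drop it, call, then re-acquire
// and re-check state before deciding to wait again.
bool wait_step(std::mutex& waiter_mutex,
               bool& ready,
               const std::function<bool()>& check_status)
{
  std::unique_lock<std::mutex> lock(waiter_mutex);
  if (ready) return true;
  lock.unlock();               // release before the cross-lock call
  bool done = check_status();  // may lock tracker_mutex internally
  lock.lock();                 // re-acquire before touching waiter state
  if (done) ready = true;
  return ready;
}
```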
In `@cpp/src/grpc/server/grpc_worker_infra.cpp`:
- Around line 120-124: The fork failure branch only calls
close_worker_pipes_server(worker_id) leaving child-end FDs open and leaking;
update the pid < 0 handling to also close the child-side pipe FDs (e.g., call
the corresponding cleanup function for child ends such as
close_worker_pipes_child(worker_id) or otherwise close the child file
descriptors associated with worker_id) before returning -1, ensuring both server
and child pipe endpoints are cleaned up when fork() fails; reference symbols:
fork, pid, worker_id, is_replacement, close_worker_pipes_server.
- Around line 165-168: The loop in wait_for_workers is calling waitpid
unconditionally for every entry in worker_pids; skip any non-positive PIDs (pid
<= 0) and call waitpid only for valid PIDs, and handle EINTR by retrying the
wait in a loop until waitpid succeeds or returns a permanent error; also handle
-1 results such that if errno == ECHILD you can break/continue safely and for
other errors log/propagate as appropriate. Locate the loop that iterates over
worker_pids and update it to validate pid > 0, wrap the waitpid call in a retry
loop that checks errno == EINTR, and handle other error cases (e.g., ECHILD)
instead of proceeding blindly.
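The defensive reap loop in the second bullet can be sketched as below. Skipping non-positive PIDs matters because `waitpid(-1, ...)` means "wait for any child" and `waitpid(0, ...)` waits on the process group — neither is what a per-worker reap intends.

```cpp
#include <cerrno>
#include <sys/types.h>
#include <sys/wait.h>

// Sketch: reap one worker PID, retrying on EINTR and treating ECHILD
// (already reaped / not a child) as a non-fatal miss.
bool reap_worker(pid_t pid)
{
  if (pid <= 0) return false;  // never pass 0/-1 sentinels to waitpid
  int status = 0;
  for (;;) {
    pid_t r = waitpid(pid, &status, 0);
    if (r == pid) return true;
    if (r == -1 && errno == EINTR) continue;       // interrupted: retry
    if (r == -1 && errno == ECHILD) return false;  // nothing to reap
    return false;  // other error: caller logs/propagates as appropriate
  }
}
```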
In `@cpp/src/grpc/server/grpc_worker.cpp`:
- Around line 56-74: store_simple_result currently scans result_queue for an
entry with !ready and writes multiple non-atomic fields, causing a race; change
the slot allocation to atomically claim a slot before writing (e.g., add or
reuse an atomic claim flag on result_queue[i] and use
compare_exchange_strong/compare_exchange_weak to transition from UNCLAIMED to
CLAIMED), only proceed to copy job_id/status/data_size/error_message and set
retrieved/ready after the CAS succeeds, and ensure you set the atomic ready flag
last so other threads see a fully-initialized entry; update store_simple_result
to loop until a CAS claim succeeds (or handle full-queue) and reference
result_queue, MAX_RESULTS, and the ready/claimed flags in your changes.
- Around line 44-48: Check and handle the return value of
send_incumbent_pipe(fd_, buffer): if it indicates failure (e.g., broken pipe or
write error) log an error including job_id_ and the returned errno/details and
avoid further work or retry/cleanup as appropriate; only emit the incumbent
logging (currently using std::cout with job_id_, objective, assignment.size())
when the send succeeded or gate it behind a debug/log-level flag or rate-limit
to prevent unbounded output. Locate this logic around the call sites using
send_incumbent_pipe, build_incumbent_proto, and the variables job_id_,
objective, assignment, and fd_ to add the error check and conditional logging.
- Around line 248-252: The cudaMemcpy call copying device_solution to
host_solution currently ignores CUDA return codes; change it to capture the
cudaMemcpy return value (cudaError_t err = cudaMemcpy(...)) and if err !=
cudaSuccess throw an exception (e.g., std::runtime_error) with a descriptive
message including cudaGetErrorString(err) so the existing try/catch surfaces a
clear failure to the client; apply the same pattern to all other cudaMemcpy
sites in this file (including the similar copies near lines 296-307) and
reference the same symbols (device_solution, host_solution, cudaMemcpy) when
making the fixes.
- Around line 341-363: The loop that finds an empty slot in result_queue and
then fills it (using result_slot, ResultQueueEntry, copy_cstr, MAX_RESULTS,
worker_id) has a race: another thread may grab the same slot after the ready
check but before setting ready, and if no slot is found the result is silently
dropped; fix by protecting the selection-and-write with the same synchronization
used in store_simple_result (e.g., a mutex or atomic compare-and-set on
result_queue[i].ready) so the check-and-set is atomic, and add explicit handling
for the case result_slot remains -1 (log an error via the worker logger /
std::cerr and return or propagate a failure code) so lost results are reported.
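The CAS-based slot claim that both bullets describe can be sketched as below; the `Slot` struct is illustrative, not the real `ResultQueueEntry` layout. The key ordering is: claim atomically first, fill non-atomic fields, publish `ready` last with release semantics, and report `-1` instead of silently dropping a result.

```cpp
#include <atomic>
#include <cstdint>

// Illustrative slot; the real entry carries status/data_size/error fields.
struct Slot {
  std::atomic<bool> claimed{false};
  std::atomic<bool> ready{false};
  uint64_t job_id = 0;
};

// Returns the claimed index, or -1 when the queue is full (which the
// caller should log rather than ignore).
int claim_and_publish(Slot* slots, int n, uint64_t job_id)
{
  for (int i = 0; i < n; ++i) {
    bool expected = false;
    if (slots[i].claimed.compare_exchange_strong(expected, true)) {
      slots[i].job_id = job_id;  // fill non-atomic fields while private
      slots[i].ready.store(true, std::memory_order_release);  // publish last
      return i;
    }
  }
  return -1;  // full: surface the failure to the caller
}
```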
In `@cpp/tests/linear_programming/grpc/grpc_integration_test.cpp`:
- Around line 53-57: The test file uses open() and O_WRONLY/O_CREAT/O_TRUNC but
does not include <fcntl.h>, which is required on some platforms; add a `#include`
<fcntl.h> to the top of grpc_integration_test.cpp alongside the existing system
headers (near the includes for <signal.h>, <sys/types.h>, <sys/wait.h>,
<unistd.h>) so that the open() call and O_* constants used by the test compile
portably.
In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp`:
- Around line 213-218: The setter set_server_log_path writes server_log_path_,
server_log_start_pos_, and test_start_marked_ without synchronization; protect
these shared members by acquiring mutex_ (e.g., std::lock_guard or
std::unique_lock on mutex_) at the start of set_server_log_path and similarly in
every reader/callback that touches server_log_path_, server_log_start_pos_, or
test_start_marked_ (the read blocks around lines ~229-244 and ~332-355). Under
the lock, perform updates or read-and-copy the values, and for readers prefer
copying the protected values into local variables while holding mutex_ then
release the lock before doing any IO/long operations to avoid holding mutex_
during blocking work.
In `@GRPC_ARCHITECTURE.md`:
- Around line 9-10: The phrase claiming "no custom serialization logic is
implemented" is too absolute; update the line to clarify the scope by stating
that protocol buffers generated by protoc and grpc_cpp_plugin are used for gRPC
wire payloads, and explicitly note the exception that
grpc_pipe_serialization.hpp provides custom pipe blob framing helpers (i.e., the
custom framing is outside gRPC wire payload serialization). Locate the sentence
in GRPC_ARCHITECTURE.md and reword it to mention "for gRPC wire payloads" and
add a parenthetical or following sentence referencing
grpc_pipe_serialization.hpp as the known custom framing helper.
---
Nitpick comments:
In `@cpp/src/grpc/client/test_grpc_client.cpp`:
- Line 340: The return path after stream_logs currently returns non-zero on
failure but drops the error detail; update the failure branch surrounding
stream_logs in test_grpc_client.cpp to call and emit client.get_last_error() (or
equivalent accessor) to the log/stream before returning the non-zero exit code
so the client error is logged for parity with other modes and aids diagnosis
(locate the code around stream_logs and the return of success ? 0 : 1 and insert
a process/log call that prints client.get_last_error() when success is false).
In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp`:
- Around line 24-27: kPipeChunkBytes is a hard-coded constexpr limit; make it
configurable by replacing the fixed constexpr with a runtime-configured value
(e.g., a GetPipeChunkBytes() or PipeConfig::pipe_chunk_bytes() accessor) that
reads an environment variable or server config and falls back to 64ULL * 1024 *
1024 as the default; update references to kPipeChunkBytes to call the accessor
and validate the parsed value (positive, reasonable upper bound) and document
the new env var name (e.g., PIPE_CHUNK_BYTES) and default behavior.
In `@cpp/tests/linear_programming/grpc/grpc_integration_test.cpp`:
- Around line 957-982: The test registers a log callback (received_logs with
log_mutex) but never asserts anything, so add assertions after the call to
client->solve_mip in SolveMIPWithLogCallback to verify the callback was invoked
and logs look correct: acquire lock(log_mutex) and assert !received_logs.empty()
(and optionally that at least one entry contains an expected substring like
"objective" or "MIP" to ensure meaningful log content). If solve_mip is
asynchronous/streams logs, ensure you wait briefly or poll until received_logs
is non-empty before asserting to avoid flakes; reference received_logs,
log_mutex, log_callback, SolveMIPWithLogCallback, and client->solve_mip when
adding the checks.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8bfed585-0c3e-4c04-9fc0-af2be140a6f2
📒 Files selected for processing (39)
- GRPC_ARCHITECTURE.md
- SERVER_ARCHITECTURE.md
- build.sh
- conda/recipes/libcuopt/recipe.yaml
- cpp/CMakeLists.txt
- cpp/cuopt_grpc_server.cpp
- cpp/src/grpc/client/grpc_client.cu
- cpp/src/grpc/client/grpc_client.hpp
- cpp/src/grpc/client/solve_remote.cu
- cpp/src/grpc/client/test_grpc_client.cpp
- cpp/src/grpc/cuopt_remote.proto
- cpp/src/grpc/cuopt_remote_service.proto
- cpp/src/grpc/grpc_problem_mapper.cu
- cpp/src/grpc/grpc_problem_mapper.hpp
- cpp/src/grpc/grpc_service_mapper.cu
- cpp/src/grpc/grpc_service_mapper.hpp
- cpp/src/grpc/grpc_settings_mapper.cu
- cpp/src/grpc/grpc_settings_mapper.hpp
- cpp/src/grpc/grpc_solution_mapper.cu
- cpp/src/grpc/grpc_solution_mapper.hpp
- cpp/src/grpc/server/grpc_field_element_size.hpp
- cpp/src/grpc/server/grpc_incumbent_proto.hpp
- cpp/src/grpc/server/grpc_job_management.cpp
- cpp/src/grpc/server/grpc_pipe_serialization.hpp
- cpp/src/grpc/server/grpc_server_main.cpp
- cpp/src/grpc/server/grpc_server_threads.cpp
- cpp/src/grpc/server/grpc_server_types.hpp
- cpp/src/grpc/server/grpc_service_impl.cpp
- cpp/src/grpc/server/grpc_worker.cpp
- cpp/src/grpc/server/grpc_worker_infra.cpp
- cpp/src/pdlp/CMakeLists.txt
- cpp/src/pdlp/solve_remote.cu
- cpp/tests/linear_programming/CMakeLists.txt
- cpp/tests/linear_programming/grpc/CMakeLists.txt
- cpp/tests/linear_programming/grpc/grpc_client_test.cpp
- cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp
- cpp/tests/linear_programming/grpc/grpc_integration_test.cpp
- cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp
- python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
💤 Files with no reviewable changes (2)
- cpp/src/pdlp/CMakeLists.txt
- cpp/src/pdlp/solve_remote.cu
🚧 Files skipped from review as they are similar to previous changes (5)
- build.sh
- cpp/tests/linear_programming/CMakeLists.txt
- cpp/tests/linear_programming/grpc/CMakeLists.txt
- python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
- cpp/src/grpc/grpc_settings_mapper.cu
```cpp
EXPECT_CALL(*mock_stub_, CheckStatus(_, _, _))
  .WillOnce([](grpc::ClientContext*,
               const cuopt::remote::StatusRequest&,
               cuopt::remote::StatusResponse* resp) {
    resp->set_job_status(cuopt::remote::PROCESSING);
    resp->set_result_size_bytes(0);
    return grpc::Status::OK;
  });

// GetResult should be called because CheckStatus doesn't show large result
EXPECT_CALL(*mock_stub_, GetResult(_, _, _))
  .WillOnce([](grpc::ClientContext*,
               const cuopt::remote::GetResultRequest&,
               cuopt::remote::ResultResponse*) {
    return grpc::Status(grpc::StatusCode::UNAVAILABLE, "Result not ready");
  });
```
This test asserts the opposite of its stated behavior for PROCESSING jobs.
Line 1383-1385 says GetResult should not be attempted when status is not complete, but Line 1397 expects a GetResult RPC. That codifies an unnecessary call path.
🔧 Proposed fix
```diff
- // GetResult should be called because CheckStatus doesn't show large result
- EXPECT_CALL(*mock_stub_, GetResult(_, _, _))
-   .WillOnce([](grpc::ClientContext*,
-                const cuopt::remote::GetResultRequest&,
-                cuopt::remote::ResultResponse*) {
-     return grpc::Status(grpc::StatusCode::UNAVAILABLE, "Result not ready");
-   });
+ // For PROCESSING status, client should fail fast without issuing GetResult.
+ EXPECT_CALL(*mock_stub_, GetResult(_, _, _)).Times(0);
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/tests/linear_programming/grpc/grpc_client_test.cpp` around lines 1387 -
1402, The test currently expects GetResult to be called even though CheckStatus
returns PROCESSING (not complete); update the test so GetResult is not expected
for PROCESSING jobs: remove or replace the EXPECT_CALL(*mock_stub_,
GetResult(...)).WillOnce(...) with an expectation that it is not called (e.g.,
EXPECT_CALL(*mock_stub_, GetResult(_, _, _)).Times(0)) or simply delete the
GetResult expectation, keeping the CheckStatus stub that sets
resp->set_job_status(cuopt::remote::PROCESSING) so the test asserts the intended
"do not call GetResult for non-complete status" behavior.
```cpp
int server_log_count(const std::string& substring) const
{
  std::string logs = get_server_logs();
  int count        = 0;
  size_t pos       = 0;
  while ((pos = logs.find(substring, pos)) != std::string::npos) {
    ++count;
    pos += substring.length();
  }
```
Handle empty substring in server_log_count to prevent infinite loop.
At Line 282, pos advances by substring.length(). If substring is empty, the loop never progresses.
🛠️ Suggested fix
```diff
  int server_log_count(const std::string& substring) const
  {
+   if (substring.empty()) { return 0; }
    std::string logs = get_server_logs();
    int count = 0;
    size_t pos = 0;
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp` around lines 275
- 283, The server_log_count function can infinite-loop when substring is empty;
add a guard at the start of server_log_count (which calls get_server_logs()) to
return 0 (or handle as appropriate) if substring.empty() before entering the
while loop so pos will never rely on substring.length() == 0 to advance.
Force-pushed from a94be38 to 67e1796.
Commit: cuopt uses embedded grpc client to solve problems on a remote server
Force-pushed from 46bcc8c to 3c1c06e.
Force-pushed from 3c1c06e to c809dd4.
This change replaces the solve_lp_remote and solve_mip_remote stubs with real routines that use an embedded gRPC client to communicate with a remote cuopt server.
Two documents are included, GRPC_ARCHITECTURE.md and SERVER_ARCHITECTURE.md, which act as developer guides.
The server is built by build.sh and is called cuopt_grpc_server.
Remote execution is still enabled via the env vars CUOPT_REMOTE_HOST and CUOPT_REMOTE_PORT.
To try this feature, check out the branch/PR and do the following:
$ build.sh
$ cuopt_grpc_server   # defaults are fine; run with -h for help
$ CUOPT_REMOTE_HOST=localhost CUOPT_REMOTE_PORT=8765 cuopt_cli myproblem.mps
All cuopt APIs pick up remote execution when the env vars are set: cuopt_cli, the C API, and the Python API will all solve problems on the server if the env vars are set and the server is running. Just use cuopt tools and APIs as you normally would for a local solve.