If we had a suitable OpenAI/Anthropic token to use in CI, we could set up an E2E run that goes through BYOK.
I think this would work by setting a flag/param (e.g., an env var) that means:
- In E2E tests, all sessions get a provider injected that configures them to use BYOK with the token
- That provider sets the endpoint to be our record/replay proxy
- Our proxy has a separate snapshot store for BYOK+OpenAI/Anthropic, but otherwise uses the same logic as the CAPI proxy: resolve each call from the snapshot, or pass it on to the underlying OpenAI/Anthropic endpoint with the token and capture the result
- Note this means the proxy would have to be expanded to work in terms of Anthropic-formatted data as well as OpenAI. Likely we need some base implementation and then per-provider specializations.
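
To make the shape of this concrete, here is a rough sketch of the "base implementation plus per-provider specializations" idea. Everything named here is hypothetical and for illustration only: the env vars `E2E_BYOK_PROVIDER`, `E2E_BYOK_TOKEN` and `E2E_PROXY_URL`, the class names, and the choice of which request fields go into the snapshot key. It is not how the existing CAPI proxy is implemented.

```python
import hashlib
import json
from typing import Callable, Dict, Mapping, Optional


def byok_provider_from_env(env: Mapping[str, str]) -> Optional[dict]:
    """Build the provider config injected into every E2E session.

    The env var names are assumptions, not existing flags.
    """
    provider = env.get("E2E_BYOK_PROVIDER")  # e.g. "openai" or "anthropic"
    if not provider:
        return None  # flag not set: sessions keep their normal (CAPI) setup
    return {
        "type": provider,
        "api_key": env["E2E_BYOK_TOKEN"],
        # Point the session at the record/replay proxy, not the real endpoint.
        "base_url": env.get("E2E_PROXY_URL", "http://localhost:8080"),
    }


class ReplayProxy:
    """Provider-independent record/replay logic (hypothetical base class)."""

    provider = "base"

    def __init__(self, store: Dict[str, dict], forward: Callable[[dict], dict]):
        self.store = store      # snapshot store dedicated to this provider
        self.forward = forward  # sends the request to the real endpoint

    def normalize(self, request: dict) -> dict:
        """Pick the fields that identify a request; provider-specific."""
        raise NotImplementedError

    def key(self, request: dict) -> str:
        canonical = json.dumps(self.normalize(request), sort_keys=True)
        digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        return f"{self.provider}:{digest}"

    def handle(self, request: dict) -> dict:
        k = self.key(request)
        if k not in self.store:
            # Snapshot miss: call the real provider once and capture the result.
            self.store[k] = self.forward(request)
        return self.store[k]


class OpenAIProxy(ReplayProxy):
    provider = "openai"

    def normalize(self, request: dict) -> dict:
        return {
            "model": request["model"],
            "messages": request["messages"],
            "tools": request.get("tools", []),
        }


class AnthropicProxy(ReplayProxy):
    provider = "anthropic"

    def normalize(self, request: dict) -> dict:
        # Anthropic-formatted requests carry the system prompt as a
        # top-level field rather than as a message.
        return {
            "model": request["model"],
            "system": request.get("system", ""),
            "messages": request["messages"],
            "tools": request.get("tools", []),
        }
```

In a real run `forward` would make the HTTP call with the CI token. In replay-only mode it could instead raise on a miss, so a missing snapshot fails the test rather than silently hitting the provider.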
Why not share snapshots with the CAPI variant?
We could, but then nothing would prove we've ever successfully completed any of these requests against a real Anthropic/OpenAI endpoint. For example, Anthropic might reject certain tool calls that OpenAI accepts. With a shared snapshot, we would never detect that.
Obviously, because we're replaying from snapshots, we only observe the underlying provider's response when we first generate the snapshots or refresh them. But that's enough to prove the provider accepted our requests at least once, which is a lot more than never having seen them accepted at all.