First Langfuse evaluations by lotif · Pull Request #19 · VectorInstitute/eval-agents

lotif · 2026-01-28T21:23:56Z

Summary

Adds Langfuse integration code and an evaluation script for the report generation agent.

Clickup Ticket(s): NA

Type of Change

🐛 Bug fix (non-breaking change that fixes an issue)
✨ New feature (non-breaking change that adds functionality)
💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
📝 Documentation update
🔧 Refactoring (no functional changes)
⚡ Performance improvement
🧪 Test improvements
🔒 Security fix

Changes Made

Small refactorings and fixes
Adding common Langfuse code in the langfuse.py file
Adding a ground truth dataset
Adding a script to upload the dataset to Langfuse
Adding an evaluation script to run an LLM-as-a-judge against the Report Generation Agent and the ground truth dataset
Updating the instructions in the README.md file
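
The dataset upload step listed above can be sketched roughly as follows. This is a minimal illustration, not the script from this PR: the record layout (a JSON list of objects with `input` and `expected_output` keys), the `upload_dataset` helper, and the `RecordingClient` stand-in are all assumptions made for the example. The client is passed in so the sketch runs without credentials. To my knowledge the Langfuse Python SDK client exposes `create_dataset` and `create_dataset_item` with these keyword arguments, but check the SDK version in use.

```python
import json
from typing import Any, Protocol


class DatasetClient(Protocol):
    """The subset of a Langfuse-style client this sketch relies on."""

    def create_dataset(self, *, name: str) -> Any: ...

    def create_dataset_item(
        self, *, dataset_name: str, input: Any, expected_output: Any
    ) -> Any: ...


def upload_dataset(client: DatasetClient, dataset_name: str, records_json: str) -> int:
    """Create the dataset and upload one item per ground-truth record.

    `records_json` is assumed to be a JSON list of objects with `input`
    and `expected_output` keys (a hypothetical layout for this example).
    Returns the number of items uploaded.
    """
    records = json.loads(records_json)
    client.create_dataset(name=dataset_name)
    for record in records:
        client.create_dataset_item(
            dataset_name=dataset_name,
            input=record["input"],
            expected_output=record["expected_output"],
        )
    return len(records)


class RecordingClient:
    """In-memory stand-in (hypothetical) that records calls instead of sending them."""

    def __init__(self) -> None:
        self.datasets: list[str] = []
        self.items: list[dict[str, Any]] = []

    def create_dataset(self, *, name: str) -> None:
        self.datasets.append(name)

    def create_dataset_item(
        self, *, dataset_name: str, input: Any, expected_output: Any
    ) -> None:
        self.items.append(
            {"dataset_name": dataset_name, "input": input, "expected_output": expected_output}
        )
```

Against a real project, a configured Langfuse client would be passed in place of `RecordingClient`; keeping the client as a parameter also makes the upload logic easy to unit test.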

For an example of what the evaluation results look like:
https://us.cloud.langfuse.com/project/cmkwsswke005dad07gxujnipq/datasets/cmkyev4nd000nad084ds2xm30/runs/27328bba-9843-4ccb-940f-6fe1b9e3b0ea
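
For readers unfamiliar with the pattern, an LLM-as-a-judge run over a ground truth dataset might look roughly like the sketch below. It is an illustration under stated assumptions, not the evaluation script from this PR: `evaluate`, `call_with_retry`, `JudgeResult`, and the item layout are hypothetical names, and the agent and judge are plain callables standing in for the real model calls. The commit history mentions tenacity for retries; this sketch swaps in a small standard-library retry loop with exponential backoff so it stays self-contained.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class JudgeResult:
    """Score assigned by the judge to one dataset item."""

    item_input: str
    score: float


def call_with_retry(
    fn: Callable[..., Any], *args: Any, attempts: int = 3, base_delay: float = 0.0
) -> Any:
    """Call `fn(*args)`, retrying on any exception with exponential backoff.

    Re-raises the last exception once `attempts` calls have failed.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn(*args)
        except Exception:
            if attempt == attempts - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            time.sleep(base_delay * (2**attempt))
    raise AssertionError("unreachable")


def evaluate(
    items: list[dict[str, str]],
    agent: Callable[[str], str],
    judge: Callable[[str, str, str], float],
    attempts: int = 3,
) -> list[JudgeResult]:
    """Run the agent on each item and have the judge score its output.

    Each item is assumed to have `input` and `expected_output` keys.
    The judge receives (input, agent output, expected output) and
    returns a numeric score.
    """
    results = []
    for item in items:
        output = call_with_retry(agent, item["input"], attempts=attempts)
        score = call_with_retry(
            judge, item["input"], output, item["expected_output"], attempts=attempts
        )
        results.append(JudgeResult(item_input=item["input"], score=float(score)))
    return results
```

In a real run the judge would prompt a model with the question, the agent's report, and the expected answer, and the resulting scores would be attached to the dataset run in Langfuse rather than returned as a list.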

Testing

Tests pass locally (uv run pytest tests/)
Type checking passes (uv run mypy <src_dir>)
Linting passes (uv run ruff check <src_dir>)
Manual testing performed (describe below)

Manual testing details:

Performed manual testing by following the instructions in the README.md file.

Checklist

Code follows the project's style guidelines
Self-review of code completed
Documentation updated (if applicable)
No sensitive information (API keys, credentials) exposed

fcogidi

A couple of comments:

Will you switch to google-adk later since Amrit and I are using that? Asking 'cause the langfuse integration may be different for that.
The langfuse_upload script may be general enough to be in the aieng-eval-agents package

lotif · 2026-01-30T22:04:30Z

@fcogidi

Will you switch to google-adk later since Amrit and I are using that? Asking 'cause the langfuse integration may be different for that.

Yes. I will put out one more PR on Monday with the trajectory evals, and next in line is the move to google-adk.

The langfuse_upload script may be general enough to be in the aieng-eval-agents package

Good point. I'm planning to move things around in follow-up PRs as well and will keep that in mind.

fcogidi

LGTM

lotif and others added 30 commits January 16, 2026 17:56

WIp trying to make it work

a77a60f

Adding data import for the online retail dataset and some more instru…

9e6ce2e

…ctions

Weaviate local and remote scripts

0098f7d

Deleting weaviate stuff, using Online Retail dataset instead

6592a1c

Adding more report examples

22fc569

Generating xlsx reports

3458565

Movign files around, adding the ddl file and the import script

37b4000

One more readme paragraph

6e3c4c2

Merge branch 'main' into marcelo/report-agent

7bb081f

Adding a couple more vulnerabilities to the skip list

530360e

Grammar fixes

dc02ff2

Merge branch 'main' into marcelo/report-agent

f9d7862

CR by Amrit

66a4494

Moving env and logging config to the top of the file

ee8b854

Small refactor

20e4ec5

Parsing client responses into langfuse traces

40dfc6f

Some more langfuse things

53d0589

CR by Franklin

534f8e5

Merge branch 'marcelo/report-agent' into marcelo/langfuse-integration

cdf0647

CR by Franklin

7a2a57f

CR by Franklin

efd80cb

Merge branch 'marcelo/report-agent' into marcelo/langfuse-integration

a39ac1d

Merge branch 'main' into marcelo/langfuse-integration

d029285

Reporting to langfuse and removed clutter

93ee157

Moving forward with the evaluation script + some more refactorings

f0af403

Finished using LLMs to evaluate result

da9b0c9

Added code comments

02c3ac5

Adding the eval dataset and making changes to the eval script. Adding…

c1980fe

… tenacity for retrying mechanism

Using langfuse to upload a dataset and run the evaluation

5af7152

Addingh evaluator and retry mechanism

9fdc71d

lotif added 3 commits January 28, 2026 15:53

Minor improvements

37348c0

Adding readme instructions

285591b

Merge branch 'main' into marcelo/langfuse-integration

2906b36

lotif requested review from amasin2111, amrit110 and fcogidi January 28, 2026 21:23

Upgrading python-multipart + small improvements

7d59004

fcogidi reviewed Jan 29, 2026

lotif added 2 commits January 29, 2026 17:00

Small fixes, additional logging and updated groud truth

b4e124d

Merge branch 'main' into marcelo/langfuse-integration

4507d52

lotif requested a review from fcogidi January 30, 2026 22:05

fcogidi approved these changes Feb 2, 2026

lotif merged commit a2db835 into main Feb 2, 2026
3 checks passed

lotif deleted the marcelo/langfuse-integration branch February 2, 2026 16:03

lotif merged 36 commits into main from marcelo/langfuse-integration
