Populate sender email and recipients in threads output#78
Populate sender email and recipients in threads output#78cpinto wants to merge 1 commit intobasecamp:mainfrom
Conversation
`hey threads <id> --json` was returning empty `creator.email_address` and `recipients` for every entry because the HTML parser only captured the sender's display name and ignored the rest of the sender link. Scrape the sender email from the `<span class="entry__sender-email">` inside each sender anchor, and extract per-entry recipients by slicing the HTML between entry anchors and reusing the existing `fullRecipientsRe` + `extractEmails` helpers. Dedupe recipients by address so a repeat in the HTML doesn't produce duplicate contacts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR improves the HTML-based thread entry parsing so hey threads <id> --json includes each entry’s sender email and recipients, enabling downstream flows (like “reply all”) without re-fetching the topic page.
Changes:
- Scrape
creator.email_addressfor each entry from the sender markup (entry__sender-email). - Populate per-entry
recipients[]by extracting emails from the entry-scopedentry__full-recipientssection and deduping by email.
Tip
If you aren't ready for review, convert to a draft PR.
Click "Convert to draft" or run gh pr ready --undo.
Click "Ready for review" or run gh pr ready to reengage.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| var ( | ||
| entryBlockRe = regexp.MustCompile(`(?s)data-entry-id="(\d+)"`) | ||
| senderRe = regexp.MustCompile(`id="sender_entry_(\d+)"[^>]*>\s*([^<]+?)\s*<`) | ||
| senderEmailRe = regexp.MustCompile(`(?s)sender_entry_(\d+).*?entry__sender-email[^>]*><span[^>]*>[^<]*</span>([^<]+)<`) |
There was a problem hiding this comment.
senderEmailRe is very loosely scoped: it matches sender_entry_(\d+) and then uses .*? with DOTALL to find the next entry__sender-email anywhere later in the document. If any sender block is missing the expected entry__sender-email markup (or if sender_entry_### appears outside the sender element), this can mis-associate an email with the wrong entry ID. Consider tightening the regex to anchor on id="sender_entry_(\d+)" and constrain the match to within the sender element (e.g., stop at </a>), or extract the sender block first and then parse the email within that substring.
| senderEmailRe = regexp.MustCompile(`(?s)sender_entry_(\d+).*?entry__sender-email[^>]*><span[^>]*>[^<]*</span>([^<]+)<`) | |
| senderEmailRe = regexp.MustCompile(`(?s)id="sender_entry_(\d+)"[^>]*>.*?entry__sender-email[^>]*><span[^>]*>[^<]*</span>\s*([^<]+)\s*</a>`) |
| for i, eid := range entryIDs { | ||
| anchor := fmt.Sprintf(`id="entry_%s"`, eid) | ||
| start := strings.Index(html, anchor) | ||
| if start < 0 { | ||
| continue | ||
| } | ||
| end := len(html) | ||
| if i+1 < len(entryIDs) { | ||
| nextAnchor := fmt.Sprintf(`id="entry_%s"`, entryIDs[i+1]) | ||
| if n := strings.Index(html[start:], nextAnchor); n > 0 { | ||
| end = start + n | ||
| } | ||
| } |
There was a problem hiding this comment.
The recipients extraction loop does a full strings.Index(html, anchor) scan for every entry ID, making parsing O(n*m) over the HTML size. Since entryIDs are already in document order, consider tracking the current offset (search from the previous anchor forward) or precomputing anchor indices once, then slice based on those positions. This keeps performance predictable for large threads and also avoids accidentally matching an earlier occurrence of the same anchor substring.
| // Associate recipients with entries by slicing between entry anchors. | ||
| entryRecipients := map[string][]models.Contact{} | ||
| for i, eid := range entryIDs { | ||
| anchor := fmt.Sprintf(`id="entry_%s"`, eid) | ||
| start := strings.Index(html, anchor) | ||
| if start < 0 { | ||
| continue | ||
| } | ||
| end := len(html) | ||
| if i+1 < len(entryIDs) { | ||
| nextAnchor := fmt.Sprintf(`id="entry_%s"`, entryIDs[i+1]) | ||
| if n := strings.Index(html[start:], nextAnchor); n > 0 { | ||
| end = start + n | ||
| } | ||
| } | ||
| m := fullRecipientsRe.FindStringSubmatch(html[start:end]) | ||
| if m == nil { | ||
| continue | ||
| } | ||
| seen := map[string]bool{} | ||
| for _, addr := range extractEmails(m[1]) { | ||
| if seen[addr] { | ||
| continue | ||
| } | ||
| seen[addr] = true | ||
| entryRecipients[eid] = append(entryRecipients[eid], models.Contact{EmailAddress: addr}) | ||
| } | ||
| } |
There was a problem hiding this comment.
New behavior is being added to ParseTopicEntriesHTML (sender email scraping, per-entry recipients parsing + deduping), but there are currently no unit tests covering this HTML parsing. Adding a focused test with a minimal HTML fixture would help prevent silent regressions when HEY’s markup changes again (e.g., ensure creator.email_address and recipients[].email_address populate as expected per entry).
Summary
hey threads <id> --jsonwas returning emptycreator.email_addressand an emptyrecipientsarray for every entry. The HTML parser only captured the sender's display name and discarded the rest of the sender link.<span class="entry__sender-email">element inside the sender anchor.fullRecipientsRe+extractEmailshelpers. Dedupe recipients by email so a repeat in the HTML does not produce duplicate contacts.ParseTopicEntriesHTMLkeep working; empty fields become populated.Why it matters: the empty recipients list makes any "reply all" flow downstream impossible without re-fetching and re-parsing the topic page, because there is no way to know who else was addressed on the entry.
Test plan
go build ./...go test ./internal/htmlutil/... ./internal/cmd/...hey threads <real-thread-id> --json— verifiedcreator.email_addressandrecipients[].email_addressare populated on all entries (previously empty)hey threads <real-thread-id>(styled) — output unchanged🤖 Generated with Claude Code
Summary by cubic
Populates
creator.email_addressand per-entryrecipientsinhey threads <id> --jsonso reply-all flows have the data they need. No changes to styled output; existing callers keep working.<span class="entry__sender-email">inside the sender anchor and maps it by entry id.fullRecipientsRe+extractEmails; dedupes by email.ParseTopicEntriesHTMLsignature and behavior additive; previously empty fields are now populated.Written for commit eeb083d. Summary will update on new commits.