41 changes: 40 additions & 1 deletion internal/htmlutil/entries.go
@@ -12,6 +12,7 @@ import (
var (
entryBlockRe = regexp.MustCompile(`(?s)data-entry-id="(\d+)"`)
senderRe = regexp.MustCompile(`id="sender_entry_(\d+)"[^>]*>\s*([^<]+?)\s*<`)
senderEmailRe = regexp.MustCompile(`(?s)sender_entry_(\d+).*?entry__sender-email[^>]*><span[^>]*>[^<]*</span>([^<]+)<`)
Copilot AI Apr 14, 2026

senderEmailRe is very loosely scoped: it matches sender_entry_(\d+) and then uses .*? with DOTALL to find the next entry__sender-email anywhere later in the document. If any sender block is missing the expected entry__sender-email markup (or if sender_entry_### appears outside the sender element), this can mis-associate an email with the wrong entry ID. Consider tightening the regex to anchor on id="sender_entry_(\d+)" and constrain the match to within the sender element (e.g., stop at </a>), or extract the sender block first and then parse the email within that substring.

Suggested change
- senderEmailRe = regexp.MustCompile(`(?s)sender_entry_(\d+).*?entry__sender-email[^>]*><span[^>]*>[^<]*</span>([^<]+)<`)
+ senderEmailRe = regexp.MustCompile(`(?s)id="sender_entry_(\d+)"[^>]*>.*?entry__sender-email[^>]*><span[^>]*>[^<]*</span>\s*([^<]+)\s*</a>`)
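
The second option mentioned in this comment (extract the sender block first, then parse the email inside it) might look roughly like the sketch below. It assumes the sender element is an `<a>` carrying `id="sender_entry_N"` with no nested `<a>`, and that the email follows a `<span>` inside an `entry__sender-email` element; both assumptions are inferred from the regexes in this diff, not checked against real HEY markup. The names `senderBlockRe`, `emailInBlockRe`, and `senderEmails` are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// senderBlockRe captures one sender element at a time: the entry ID and the
// element's inner HTML up to the first closing </a>.
// Assumption: the sender element is an <a> with no nested <a>.
var senderBlockRe = regexp.MustCompile(`(?s)id="sender_entry_(\d+)"[^>]*>(.*?)</a>`)

// emailInBlockRe is only ever applied to a single sender block, so it cannot
// reach into a later entry's markup.
var emailInBlockRe = regexp.MustCompile(`entry__sender-email[^>]*>\s*<span[^>]*>[^<]*</span>([^<]+)`)

// senderEmails maps entry ID to sender email. Entries whose sender block has
// no email markup are simply absent from the result.
func senderEmails(html string) map[string]string {
	emails := map[string]string{}
	for _, block := range senderBlockRe.FindAllStringSubmatch(html, -1) {
		id, inner := block[1], block[2]
		if _, exists := emails[id]; exists {
			continue // keep the first occurrence, as the diff does
		}
		if m := emailInBlockRe.FindStringSubmatch(inner); m != nil {
			emails[id] = strings.TrimSpace(m[1])
		}
	}
	return emails
}

func main() {
	// Hypothetical fixture: entry 1 has no email markup, entry 2 does.
	html := `<a id="sender_entry_1" href="#">Alice</a>
<a id="sender_entry_2" href="#">Bob <span class="entry__sender-email"><span>&lt;</span>bob@example.com<span>&gt;</span></span></a>`
	fmt.Println(senderEmails(html)) // prints map[2:bob@example.com]
}
```

With this fixture the loosely scoped pattern would attribute Bob's address to entry 1, because `.*?` is free to run past Alice's closing tag; scoping the second regex to one block avoids that.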

timeRe = regexp.MustCompile(`<time[^>]*datetime="([^"]+)"`)
srcdocRe = regexp.MustCompile(`(?s)srcdoc="([^"]*trix-content[^"]*)"`)
fullRecipientsRe = regexp.MustCompile(`(?s)entry__full-recipients[^>]*>(.*?)</span>`)
@@ -86,6 +87,12 @@ func ParseTopicEntriesHTML(html string) []models.Entry {
senders[m[1]] = m[2]
}
}
senderEmails := map[string]string{}
for _, m := range senderEmailRe.FindAllStringSubmatch(html, -1) {
if _, exists := senderEmails[m[1]]; !exists {
senderEmails[m[1]] = strings.TrimSpace(m[2])
}
}

// Associate times with entries by finding the first <time> after each entry anchor
entryTimes := map[string]string{}
@@ -100,6 +107,35 @@ func ParseTopicEntriesHTML(html string) []models.Entry {
}
}

// Associate recipients with entries by slicing between entry anchors.
entryRecipients := map[string][]models.Contact{}
for i, eid := range entryIDs {
anchor := fmt.Sprintf(`id="entry_%s"`, eid)
start := strings.Index(html, anchor)
if start < 0 {
continue
}
end := len(html)
if i+1 < len(entryIDs) {
nextAnchor := fmt.Sprintf(`id="entry_%s"`, entryIDs[i+1])
if n := strings.Index(html[start:], nextAnchor); n > 0 {
end = start + n
}
}
Comment on lines +112 to +124
Copilot AI Apr 14, 2026

The recipients extraction loop does a full strings.Index(html, anchor) scan for every entry ID, making parsing O(n*m) over the HTML size. Since entryIDs are already in document order, consider tracking the current offset (search from the previous anchor forward) or precomputing anchor indices once, then slice based on those positions. This keeps performance predictable for large threads and also avoids accidentally matching an earlier occurrence of the same anchor substring.
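
A minimal sketch of the offset-tracking idea, assuming `entryIDs` really are in document order and every entry begins at an `id="entry_<id>"` anchor as in the loop above. The helper name `entrySpans` and its return shape are hypothetical, not part of the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// entrySpans returns the [start, end) byte range of each entry in html.
// Each search resumes after the previous anchor, so an anchor that is found
// never causes the earlier part of the document to be rescanned.
func entrySpans(html string, entryIDs []string) map[string][2]int {
	starts := make([]int, len(entryIDs))
	offset := 0
	for i, eid := range entryIDs {
		anchor := fmt.Sprintf(`id="entry_%s"`, eid)
		n := strings.Index(html[offset:], anchor)
		if n < 0 {
			starts[i] = -1 // anchor missing: skip it and keep the offset
			continue
		}
		starts[i] = offset + n
		offset = starts[i] + len(anchor)
	}

	// Walk backwards so each entry ends where the next located entry starts.
	spans := map[string][2]int{}
	end := len(html)
	for i := len(entryIDs) - 1; i >= 0; i-- {
		if starts[i] < 0 {
			continue
		}
		spans[entryIDs[i]] = [2]int{starts[i], end}
		end = starts[i]
	}
	return spans
}

func main() {
	// Hypothetical fixture; entry "9" has no anchor and is skipped.
	html := `<div id="entry_1">a</div><div id="entry_2">b</div>`
	spans := entrySpans(html, []string{"1", "9", "2"})
	for _, eid := range []string{"1", "2"} {
		s := spans[eid]
		fmt.Printf("%s: %q\n", eid, html[s[0]:s[1]])
	}
}
```

The recipients regex can then run on `html[span[0]:span[1]]` per entry. Because the search only moves forward, an earlier occurrence of the same anchor substring cannot be matched by a later entry.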

m := fullRecipientsRe.FindStringSubmatch(html[start:end])
if m == nil {
continue
}
seen := map[string]bool{}
for _, addr := range extractEmails(m[1]) {
if seen[addr] {
continue
}
seen[addr] = true
entryRecipients[eid] = append(entryRecipients[eid], models.Contact{EmailAddress: addr})
}
}
Comment on lines +110 to +137
Copilot AI Apr 14, 2026

New behavior is being added to ParseTopicEntriesHTML (sender email scraping, per-entry recipients parsing + deduping), but there are currently no unit tests covering this HTML parsing. Adding a focused test with a minimal HTML fixture would help prevent silent regressions when HEY’s markup changes again (e.g., ensure creator.email_address and recipients[].email_address populate as expected per entry).
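
As a rough illustration of the behavior such a test could pin down, the standalone sketch below runs the recipients regex from this diff against a minimal fixture and checks the order-preserving dedupe. The fixture markup is an assumption, `emailRe` is a deliberately simple stand-in for the repository's `extractEmails` (whose implementation is not shown here), and `recipientsIn` is a hypothetical helper. In the repository this would more naturally be a `_test.go` file that calls `ParseTopicEntriesHTML` directly.

```go
package main

import (
	"fmt"
	"regexp"
)

// fullRecipientsRe is copied from the diff above.
var fullRecipientsRe = regexp.MustCompile(`(?s)entry__full-recipients[^>]*>(.*?)</span>`)

// emailRe is a simplified stand-in for extractEmails (assumption).
var emailRe = regexp.MustCompile(`[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}`)

// recipientsIn mirrors the dedupe loop from the diff for one entry's HTML,
// returning addresses in first-seen order.
func recipientsIn(entryHTML string) []string {
	m := fullRecipientsRe.FindStringSubmatch(entryHTML)
	if m == nil {
		return nil
	}
	seen := map[string]bool{}
	var out []string
	for _, addr := range emailRe.FindAllString(m[1], -1) {
		if !seen[addr] {
			seen[addr] = true
			out = append(out, addr)
		}
	}
	return out
}

func main() {
	// Hypothetical fixture: the real HEY markup may differ.
	fixture := `<span class="entry__full-recipients">a@example.com, b@example.com, a@example.com</span>`
	fmt.Println(recipientsIn(fixture)) // prints [a@example.com b@example.com]
}
```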


// Extract bodies from srcdoc iframes - they appear in entry order
type body struct{ html, text string }
bodyMatches := srcdocRe.FindAllStringSubmatch(html, -1)
@@ -122,7 +158,10 @@ func ParseTopicEntriesHTML(html string) []models.Entry {
CreatedAt: entryTimes[eid],
}
if name, ok := senders[eid]; ok {
- e.Creator = models.Contact{Name: name}
+ e.Creator = models.Contact{Name: name, EmailAddress: senderEmails[eid]}
}
if recips, ok := entryRecipients[eid]; ok {
e.Recipients = recips
}
if i < len(bodies) {
e.Body = bodies[i].text