Part of the PurelyManage series.
The Naive Approach Breaks at Scale
The first version of the Domains page rechecked every domain on every page load. The logic was simple: list the domains, call updateDomainSettings with recheckDns: true for each one, then call listDomains again to get the fresh DNS status.
That is N+2 API calls per page load, where N is the number of domains. With 50 domains and the dashboard set to auto-refresh every 60 seconds:
52 API calls per load × 60 loads/hour = 3,120, or roughly 3,000 API calls/hour
This is both wasteful and fragile. Most domains are healthy most of the time. Rechecking 50 domains every minute to confirm they are still green is unnecessary. And if PurelyMail ever introduces rate limits, this approach is the first to break.
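The arithmetic is easy to re-run for other domain counts or refresh intervals. A minimal sketch: the helper name is made up for illustration, and it only encodes the N+2 calls per load stated above.

```typescript
// Hypothetical helper: hourly API calls when every page load rechecks
// every domain (N + 2 calls per load, as described in the text).
function naiveCallsPerHour(domainCount: number, refreshSeconds: number): number {
  const callsPerLoad = domainCount + 2;
  const loadsPerHour = 3600 / refreshSeconds;
  return callsPerLoad * loadsPerHour;
}

console.log(naiveCallsPerHour(50, 60)); // 3120
```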
The fix is a priority-based scheduler that rechecks domains based on how urgently they need attention.
The Cache Table
A local domain_dns_cache table tracks the state of each domain between rechecks:
CREATE TABLE IF NOT EXISTS domain_dns_cache (
  name            TEXT PRIMARY KEY,
  last_checked_at TIMESTAMPTZ,
  failing_count   INTEGER NOT NULL DEFAULT 4,
  dns_summary     JSONB
)
last_checked_at: when this domain was last rechecked. NULL means it has never been checked.
failing_count: how many of the four records (MX, SPF, DKIM, DMARC) are currently failing. Ranges from 0 (fully healthy) to 4 (all failing). Defaults to 4 on insert so new domains are treated as red until proven otherwise.
dns_summary: the raw DNS check result from PurelyMail, stored as JSON.
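To make failing_count concrete, here is a rough sketch of how it could be derived from a check result. The DnsSummary shape below is an assumption for illustration only. The actual PurelyMail payload stored in dns_summary may be structured differently.

```typescript
// Assumed shape: one pass/fail flag per record type. This is NOT the
// documented PurelyMail format, just a stand-in for the example.
type DnsSummary = { mx: boolean; spf: boolean; dkim: boolean; dmarc: boolean };

// Counts how many of the four records are failing (0 = green, 4 = all red).
function countFailing(summary: DnsSummary): number {
  const flags = [summary.mx, summary.spf, summary.dkim, summary.dmarc];
  return flags.filter((passed) => !passed).length;
}
```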
New domains are seeded into the cache with last_checked_at = NULL whenever listDomains returns a domain not yet in the table. Deleted domains are removed from the cache.
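The seed-and-prune step boils down to a set difference between what listDomains returns and what the cache already holds. A minimal sketch, assuming both sides are available as plain arrays of domain names; diffCache is a hypothetical helper, not a function from the codebase.

```typescript
// Hypothetical helper: compares the names returned by listDomains with
// the names already in domain_dns_cache.
function diffCache(
  listed: string[],
  cached: string[],
): { toSeed: string[]; toRemove: string[] } {
  const listedSet = new Set(listed);
  const cachedSet = new Set(cached);
  return {
    // Present upstream but not cached: insert with last_checked_at = NULL.
    toSeed: listed.filter((name) => !cachedSet.has(name)),
    // Cached but no longer upstream: delete from the cache.
    toRemove: cached.filter((name) => !listedSet.has(name)),
  };
}
```

Because new rows default to failing_count = 4 and a NULL last_checked_at, anything in toSeed lands in tier 1 of the scheduler as soon as it is inserted.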
The Priority Algorithm
The scheduler picks domains eligible for a recheck using three tiers, evaluated in order:
Tier 1: Never checked (last_checked_at IS NULL). Always eligible. These are new domains that have never had a DNS check run against them. They get absolute priority.
Tier 2: Red domains (failing_count > 0). Eligible after 30 minutes. A domain with at least one failing record needs attention but does not need to be checked every second. 30 minutes is frequent enough to detect a fix shortly after DNS propagates.
Tier 3: Green domains (failing_count = 0). Eligible after 24 hours. A fully healthy domain changes rarely. Daily verification is enough.
Within each tier, domains are ordered by most failing first, then oldest last_checked_at first. This means a domain that is completely broken (4 failing) gets picked before one that is mostly healthy (1 failing), and among domains with the same failing count the longest-unverified one goes first.
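// WHERE: the three eligibility tiers (never checked, red after 30 minutes,
// green after 24 hours).
// ORDER BY: unchecked first, then most failing, then longest-unverified.
const rows = await sql`
  SELECT name FROM domain_dns_cache
  WHERE
    last_checked_at IS NULL
    OR (failing_count > 0 AND last_checked_at < NOW() - INTERVAL '30 minutes')
    OR (failing_count = 0 AND last_checked_at < NOW() - INTERVAL '24 hours')
  ORDER BY
    (last_checked_at IS NULL) DESC,
    failing_count DESC,
    last_checked_at ASC NULLS FIRST
  LIMIT ${count}
`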
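The same selection can be mirrored in plain TypeScript, which is handy for unit-testing the tier rules without a database. This is a sketch only: the CacheRow type and function names are invented for the example, and the production path is the SQL query.

```typescript
// In-memory mirror of the SQL above. All names here are hypothetical.
type CacheRow = { name: string; lastCheckedAt: Date | null; failingCount: number };

const RED_COOLDOWN_MS = 30 * 60 * 1000;        // tier 2: 30 minutes
const GREEN_COOLDOWN_MS = 24 * 60 * 60 * 1000; // tier 3: 24 hours

function isEligible(row: CacheRow, now: Date): boolean {
  if (row.lastCheckedAt === null) return true; // tier 1: never checked
  const ageMs = now.getTime() - row.lastCheckedAt.getTime();
  return row.failingCount > 0 ? ageMs > RED_COOLDOWN_MS : ageMs > GREEN_COOLDOWN_MS;
}

function pickEligible(rows: CacheRow[], now: Date, count: number): string[] {
  return rows
    .filter((row) => isEligible(row, now))
    .sort((a, b) => {
      const aUnchecked = a.lastCheckedAt === null;
      const bUnchecked = b.lastCheckedAt === null;
      // Unchecked rows first.
      if (aUnchecked !== bUnchecked) return aUnchecked ? -1 : 1;
      // Then most failing first.
      if (a.failingCount !== b.failingCount) return b.failingCount - a.failingCount;
      // Then oldest check first (two unchecked rows compare as equal).
      return (a.lastCheckedAt?.getTime() ?? 0) - (b.lastCheckedAt?.getTime() ?? 0);
    })
    .slice(0, count)
    .map((row) => row.name);
}
```

Passing now as a parameter, rather than reading the clock inside the function, keeps the cooldown checks deterministic in tests.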
When Rechecks Happen
Rechecks are not only triggered by the cron job. Several user-facing actions also trigger them, using the same pick-and-recheck logic:
Domains page load (GET /pm/domains): picks 1 domain to recheck on every request. API cost is at most 3 calls (listDomains + updateDomainSettings + listDomains). If nothing is eligible, 1 call only. This means every time an admin visits the Domains page, one domain gets checked for free without any dedicated scheduling overhead.
Dashboard load (GET /pm/account): picks 1 domain and triggers a recheck in the background. The dashboard response is not delayed by the recheck. It fires and resolves independently.
Manual recheck button (POST /pm/domains/:name/recheck): force-rechecks a specific domain immediately, bypassing the cooldown thresholds. Useful when you have just updated a DNS record and want to confirm it took effect without waiting 30 minutes.
Add domain (POST /pm/domains): seeds the new domain into the cache as unchecked, making it immediately eligible for the next recheck cycle.
Delete domain: removes the entry from the cache.
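The dashboard's background recheck is a fire-and-forget call. One way to sketch that pattern is below. recheckInBackground and its parameters are hypothetical, and the real handler may wire this up differently.

```typescript
// Hypothetical stand-in for the shared pick-and-recheck routine.
type Recheck = () => Promise<void>;

// Schedules the recheck without awaiting it. Failures are routed to
// onError so they never surface in the dashboard response.
function recheckInBackground(recheck: Recheck, onError: (err: unknown) => void): void {
  // Promise.resolve().then(...) defers the start to a microtask and also
  // turns a synchronous throw inside recheck into a rejection.
  void Promise.resolve().then(recheck).catch(onError);
}
```

The caller returns its response right after invoking recheckInBackground. The recheck result only shows up later, through the updated cache row.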
Background Cron
A cron job runs every 15 minutes and triggers rechecks independently of user activity. It has two modes:
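First-run mode: if any domain has last_checked_at IS NULL, the cron processes 3 per cycle. With 50 domains starting unchecked, that is 17 cycles, so the cron alone hydrates the cache in a little over 4 hours, and the page-load triggers shorten that further.
Steady state: once all domains have been checked at least once, the cron processes 1 per cycle. At 15-minute intervals this is 4 checks per hour, or 96 per day: enough for one daily pass over 50 green domains with room left for red ones. A red domain becomes eligible again 30 minutes after its last check and sorts ahead of every green domain, so it is picked at the next cycle after that unless another red domain is ahead of it in the queue.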
// Returns 3 while any domain is still unchecked (first-run mode),
// otherwise 1 (steady state).
export async function cronBatchSize(): Promise<number> {
  const [row] = await sql`
    SELECT COUNT(*)::int AS count FROM domain_dns_cache WHERE last_checked_at IS NULL
  `
  return (row?.count ?? 0) > 0 ? 3 : 1
}
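The first-run hydration time follows directly from the batch size and the cron interval. A small worked calculation: the helper is hypothetical and ignores the extra rechecks contributed by page loads, so it gives the cron-only upper bound.

```typescript
// Hypothetical helper: hours for the cron alone to check every
// unchecked domain once.
function hydrationHours(unchecked: number, batchSize: number, intervalMinutes: number): number {
  const cycles = Math.ceil(unchecked / batchSize); // 50 / 3 rounds up to 17
  return (cycles * intervalMinutes) / 60;
}

console.log(hydrationHours(50, 3, 15)); // 4.25
```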
Green to Red Alerts
When a domain transitions from fully healthy to having at least one failing record, every sysadmin gets an email alert. The alert fires only on the green-to-red transition, not on subsequent rechecks while the domain remains red. This avoids alert fatigue.
// Only fires when: old count was 0 AND new count is > 0
if (oldCount === 0 && count > 0) {
  notifyDnsRed(name, count, dnsSummary)
}
The email includes the domain name, the failing count, and a per-record pass/fail breakdown so the admin knows exactly which record to look at without logging into the panel first.
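As a rough illustration of the transition check and the per-record breakdown, here is a sketch of how the alert body could be assembled. The RecordStatus type, both function names, and the text layout are assumptions for the example. The real notifyDnsRed may format things differently.

```typescript
// Assumed per-record shape, for illustration only.
type RecordStatus = { record: "MX" | "SPF" | "DKIM" | "DMARC"; passed: boolean };

// True only on the green-to-red edge, not while a domain stays red.
function isGreenToRed(oldCount: number, newCount: number): boolean {
  return oldCount === 0 && newCount > 0;
}

// Builds a plain-text body: domain, failing count, one line per record.
function formatAlertBody(domain: string, records: RecordStatus[]): string {
  const failing = records.filter((r) => !r.passed).length;
  const lines = records.map((r) => `${r.record}: ${r.passed ? "pass" : "FAIL"}`);
  return [
    `Domain: ${domain}`,
    `Failing records: ${failing} of ${records.length}`,
    ...lines,
  ].join("\n");
}
```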
API Call Budget
Comparing the two approaches at 50 domains with a 60-second dashboard auto-refresh:
| Approach | API calls/hour |
|---|---|
| Naive (recheck all on every load) | ~3,000 |
| Priority scheduler (page-load triggers + 15min cron) | ~20-30 |
The scheduler adds a bit of complexity but keeps the API call count proportional to the number of domains that actually need attention rather than the total domain count.
The next post covers the IMAP migration credential gathering system: how users submit their own credentials securely through a public form.