Part of the MediaBridge series.
The Problem
S3 Glacier and Glacier Deep Archive are cheap cold storage tiers. With standard retrieval, files in Glacier take 3-5 hours to restore and files in Deep Archive take up to 12 hours. When a user tries to read an archived file, S3 rejects the request with a 403 InvalidObjectState error (or the object is simply missing from the listing, depending on how the bucket is configured). The user gets no feedback about what happened or when the file will be available.
The goal was to make this invisible to the user: detect the access attempt, trigger the restore automatically, notify the user when it starts, and notify them again when the file is ready. No manual steps, no support tickets.
The Pipeline
CloudTrail -> EventBridge -> Lambda A -> POST /restore/request
                                               |
                                     backend crons (5 min)
                                               |
                                     AWS RestoreObject API
                                               |
                                   S3 ObjectRestore:Completed
                                               |
                                Lambda B -> POST /restore/completed
                                               |
                                    backend crons -> email
With S3 data events enabled on the trail, CloudTrail records object-level API calls, including failed GetObject attempts on archived objects. (Data events are off by default; a trail that only logs management events will not see these calls.) EventBridge filters for the specific error code that S3 returns when access to an archived object is attempted, and routes those events to Lambda A. Lambda A calls POST /restore/request on the backend. The backend stores the request, resolves the user, and five background crons take it from there.
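As a rough illustration of what Lambda A has to pull out of each event, here is a minimal sketch. It assumes the event is a CloudTrail record delivered through EventBridge, that the error code on the record is InvalidObjectState, and that the backend accepts the three fields shown; the function name and payload type are hypothetical, not taken from the MediaBridge code.

```typescript
// Only the CloudTrail fields this sketch reads; real events carry many more.
interface CloudTrailS3Event {
  detail?: {
    eventName?: string;
    errorCode?: string;
    requestParameters?: { bucketName?: string; key?: string } | null;
    userIdentity?: { accessKeyId?: string };
  };
}

// Hypothetical body for POST /restore/request.
interface RestoreRequestPayload {
  s3_bucket: string;
  s3_key: string;
  access_key_id: string | null;
}

// Returns a payload for failed GetObject calls on archived objects, else null.
function toRestoreRequest(event: CloudTrailS3Event): RestoreRequestPayload | null {
  const detail = event.detail;
  if (!detail) return null;
  if (detail.eventName !== "GetObject") return null;
  // Assumption: this is the error code recorded for reads of archived objects.
  if (detail.errorCode !== "InvalidObjectState") return null;

  const bucket = detail.requestParameters?.bucketName;
  const key = detail.requestParameters?.key;
  if (!bucket || !key) return null;

  return {
    s3_bucket: bucket,
    s3_key: key,
    access_key_id: detail.userIdentity?.accessKeyId ?? null,
  };
}
```

Re-checking the event name and error code inside the Lambda is a cheap safeguard in case the EventBridge rule is ever loosened.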
Lambda B is triggered by the S3 ObjectRestore:Completed event, which S3 fires automatically when a Glacier restore finishes. It calls POST /restore/completed on the backend to mark the object as available.
Both Lambda endpoints authenticate with a shared secret in the X-Restore-Secret header, compared with timingSafeEqual.
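One detail worth knowing about timingSafeEqual is that it throws when the two buffers differ in length. A minimal sketch of one way to handle that, assuming the secret arrives as a plain header string: hash both values first so the buffers are always the same size. The function name secretsMatch is hypothetical; the actual verifyWebhookSecret helper may work differently.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Compares a header value against the expected secret in constant time.
// Hashing both sides gives equal-length buffers, so timingSafeEqual never
// throws on a length mismatch.
function secretsMatch(provided: unknown, expected: string): boolean {
  if (typeof provided !== "string") return false; // missing or repeated header
  const a = createHash("sha256").update(provided).digest();
  const b = createHash("sha256").update(expected).digest();
  return timingSafeEqual(a, b);
}
```

A caller would pass the X-Restore-Secret header value and the configured secret, and reject the request with a 401 when the function returns false.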
The restore_requests Table
Each detected access attempt creates a row:
s3_bucket | s3_key | access_key_id | email | user_resolved | restore_status | pre_notified | post_notified
access_key_id comes from CloudTrail - it is the IAM key that made the access attempt. email is resolved by looking up the key in an aws_key_user_map table that maps IAM keys to user emails. If the key is not in the map, user_resolved is still set to true (the row is not blocked), but email remains null - no email will be sent for this request.
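The resolution rule can be written as a small pure function. This is a sketch only: the Map stands in for the aws_key_user_map table, and the type and function names are hypothetical.

```typescript
// Result of looking up one access key; mirrors the email and
// user_resolved columns of restore_requests.
interface UserResolution {
  email: string | null;
  user_resolved: boolean;
}

// keyMap stands in for the aws_key_user_map table (access key id -> email).
function resolveUser(
  accessKeyId: string | null,
  keyMap: ReadonlyMap<string, string>,
): UserResolution {
  const email = accessKeyId ? (keyMap.get(accessKeyId) ?? null) : null;
  // The row is marked resolved either way, so an unmapped key never blocks
  // the restore; it only means no email is sent.
  return { email, user_resolved: true };
}
```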
Five Crons
All five crons start immediately on backend startup and run on intervals.
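A minimal sketch of a "run now, then on an interval" helper follows. The name startCron, the overlap guard, and the error logging are assumptions added for illustration; the article does not show how the crons are actually scheduled.

```typescript
type CronTask = () => Promise<void>;

// Runs the task once immediately, then every intervalMs.
// Returns a function that stops the schedule.
function startCron(name: string, intervalMs: number, task: CronTask): () => void {
  let running = false;

  const tick = async (): Promise<void> => {
    if (running) return; // assumption: skip a run if the previous one is still going
    running = true;
    try {
      await task();
    } catch (err) {
      // A failed run is logged and retried on the next interval.
      console.error(`[cron:${name}] run failed`, err);
    } finally {
      running = false;
    }
  };

  void tick(); // first run on startup
  const timer = setInterval(() => void tick(), intervalMs);
  return () => clearInterval(timer);
}
```

With a helper like this, the 5-minute crons would be started with an interval of 5 * 60 * 1000 and the fallback poll with 23 * 60 * 60 * 1000.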
resolveUsers (every 5 minutes): For any row where user_resolved = false, looks up the access_key_id in aws_key_user_map and fills in the email. This handles the case where the Lambda fires before the backend has processed the user mapping.
triggerRestores (every 5 minutes): Finds all rows with restore_status = 'not_started' and calls RestoreObjectCommand for each:
    await s3.send(new RestoreObjectCommand({
      Bucket: row.s3_bucket,
      Key: row.s3_key,
      RestoreRequest: {},
    }));
    await sql`UPDATE restore_requests SET restore_status = 'pending' WHERE id = ${row.id}`;
S3 returns 409 RestoreAlreadyInProgress if a restore for that object is already underway. This is not an error - it means someone else already triggered it. The status is still set to pending:
    if (err?.Code === 'RestoreAlreadyInProgress') {
      await sql`UPDATE restore_requests SET restore_status = 'pending' WHERE id = ${row.id}`;
    }
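The RestoreObjectCommand call above passes an empty RestoreRequest, which may be a simplification for the article. For objects in Glacier or Deep Archive, the S3 API documents Days (how long the restored copy stays available) as required, and GlacierJobParameters.Tier selects the retrieval speed. A sketch of a parameter builder, where the function name and the defaults of 7 days and the Standard tier are assumptions:

```typescript
type RetrievalTier = "Standard" | "Bulk" | "Expedited";

// Mirrors the subset of RestoreObject input used here.
interface RestoreParams {
  Bucket: string;
  Key: string;
  RestoreRequest: {
    Days: number;
    GlacierJobParameters: { Tier: RetrievalTier };
  };
}

// days and tier defaults are illustrative, not values from the article.
function buildRestoreParams(
  bucket: string,
  key: string,
  days = 7,
  tier: RetrievalTier = "Standard",
): RestoreParams {
  if (!Number.isInteger(days) || days < 1) {
    throw new RangeError("days must be a positive integer");
  }
  return {
    Bucket: bucket,
    Key: key,
    RestoreRequest: { Days: days, GlacierJobParameters: { Tier: tier } },
  };
}
```

The result has the shape RestoreObjectCommand expects as input. Note that the Expedited tier is not available for Deep Archive objects.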
preNotify (every 5 minutes): Sends the “restore started” email for rows that have been user-resolved, have an email, have not yet been pre-notified, and have been sitting for at least 120 seconds:
    const groups = await sql`
      SELECT email, array_agg(s3_bucket) as buckets, array_agg(s3_key) as keys
      FROM restore_requests
      WHERE user_resolved = true
        AND pre_notified = false
        AND email IS NOT NULL
      GROUP BY email
      HAVING MAX(requested_at) < NOW() - INTERVAL '120 seconds'
    `;
The 120-second wait is a debounce. If a user accesses 10 archived files in quick succession, each access attempt creates a row. Without the wait, they would get 10 separate “restore started” emails. With the wait, all 10 requests accumulate before preNotify fires, and they receive a single email listing all 10 files.
The GROUP BY email groups all pending files for the same user into one batch. sendRestoreInitiatedBatch formats them into one email.
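To make the debounce semantics concrete, here is the same GROUP BY and HAVING logic as an in-memory sketch. It is only an illustration of what the query computes; the type and function names are hypothetical. A user's batch is released only once their newest request is older than the quiet window.

```typescript
interface PendingRow {
  email: string;
  s3_bucket: string;
  s3_key: string;
  requested_at: Date;
}

interface EmailBatch {
  email: string;
  files: { bucket: string; key: string }[];
}

// Groups rows by email and keeps only groups whose most recent request is
// older than quietMs, mirroring HAVING MAX(requested_at) < NOW() - 120s.
function batchesReadyToNotify(rows: PendingRow[], now: Date, quietMs = 120000): EmailBatch[] {
  const groups = new Map<string, { newest: number; files: EmailBatch["files"] }>();
  for (const row of rows) {
    const group = groups.get(row.email) ?? { newest: 0, files: [] };
    group.newest = Math.max(group.newest, row.requested_at.getTime());
    group.files.push({ bucket: row.s3_bucket, key: row.s3_key });
    groups.set(row.email, group);
  }

  const cutoff = now.getTime() - quietMs;
  const ready: EmailBatch[] = [];
  groups.forEach((group, email) => {
    if (group.newest < cutoff) ready.push({ email, files: group.files });
  });
  return ready;
}
```

Because the condition uses the newest request, every new access attempt by the same user pushes their email back by another 120 seconds, so a burst of accesses collapses into one message.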
postNotify (every 5 minutes): Same debounce pattern and batching logic, but for the “files are ready” email. It only considers rows whose restore_status is 'restored' and that have not yet been post-notified.
pollPendingRestores (every 23 hours): A fallback poll. ObjectRestore:Completed is the primary signal that a restore finished, but S3 event delivery can occasionally be delayed or missed. The poll calls HeadObject on every pending restore and checks the x-amz-restore header:
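    const restoreHeader = head.Restore ?? '';
    // Only an explicit ongoing-request="false" means the restored copy is ready.
    // A missing header means S3 has no restore on record, so the row is left alone.
    if (restoreHeader.includes('ongoing-request="false"')) {
      await sql`UPDATE restore_requests SET restore_status = 'restored', restored_at = NOW() WHERE id = ${row.id}`;
    }
S3 reports ongoing-request="true" while the restore is running and ongoing-request="false" (with an expiry-date) once the restored copy is available. The check looks for the explicit "false" rather than the absence of "true", because a missing header would otherwise be mistaken for a finished restore. The 23-hour interval is intentional - it runs roughly once per day as a catch-up, not as the primary detection mechanism.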
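The x-amz-restore value has three cases: ongoing-request="true" while the restore is in progress, ongoing-request="false" plus an expiry-date once the temporary copy is available, and no header at all when no restore is on record. A small parser that distinguishes them might look like this; the function and type names are hypothetical.

```typescript
interface RestoreState {
  ongoing: boolean;
  expiry: Date | null; // when the restored copy expires, if S3 reported it
}

// Parses the Restore field of a HeadObject response.
// Returns null when the header is absent or not in the expected shape.
function parseRestoreHeader(header: string | undefined): RestoreState | null {
  if (!header) return null;
  const ongoingMatch = /ongoing-request="(true|false)"/.exec(header);
  if (!ongoingMatch) return null;
  const expiryText = /expiry-date="([^"]+)"/.exec(header)?.[1];
  return {
    ongoing: ongoingMatch[1] === "true",
    expiry: expiryText ? new Date(expiryText) : null,
  };
}
```

A poll built on this would mark a row as restored only when the result is non-null and ongoing is false, and could also store the expiry date so users can be told how long the file stays available.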
Lambda B as the Primary Completion Signal
The poll is a fallback. The primary completion path is Lambda B, triggered by ObjectRestore:Completed:
    app.post('/completed', async (req, reply) => {
      if (!verifyWebhookSecret(req, reply)) return;
      const { s3_bucket, s3_key } = req.body;
      await sql`
        UPDATE restore_requests
        SET restore_status = 'restored', restored_at = NOW()
        WHERE s3_bucket = ${s3_bucket} AND s3_key = ${s3_key}
      `;
      return reply.status(200).send({ ok: true });
    });
S3 fires this event as soon as the restore completes. The backend updates the status, and postNotify picks it up on its next 5-minute run.
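On the Lambda B side, the bucket and key have to be extracted from the event before calling the endpoint. The sketch below assumes Lambda B is wired directly to an S3 event notification (rather than through EventBridge, which uses a different event shape); the function name is hypothetical. One easy-to-miss detail is that object keys in S3 notifications are URL-encoded, with spaces sent as "+", so the key must be decoded or it will not match the s3_key stored in restore_requests.

```typescript
// Only the S3 notification fields this sketch reads.
interface S3EventNotification {
  Records?: Array<{
    eventName?: string;
    s3?: { bucket?: { name?: string }; object?: { key?: string } };
  }>;
}

interface CompletedRestore {
  s3_bucket: string;
  s3_key: string;
}

// Returns one entry per ObjectRestore:Completed record, with the key decoded.
function completedRestores(event: S3EventNotification): CompletedRestore[] {
  const out: CompletedRestore[] = [];
  for (const record of event.Records ?? []) {
    if (record.eventName !== "ObjectRestore:Completed") continue;
    const bucket = record.s3?.bucket?.name;
    const rawKey = record.s3?.object?.key;
    if (!bucket || !rawKey) continue;
    out.push({
      s3_bucket: bucket,
      // Keys arrive URL-encoded, with "+" standing in for spaces.
      s3_key: decodeURIComponent(rawKey.replace(/\+/g, " ")),
    });
  }
  return out;
}
```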
What the User Experiences
A user accesses a file in Glacier. The access fails silently at the S3 level. Within a few minutes, they receive an email: “Your files are being restored from archive. You will receive another email when they are ready.” A few hours later (or up to 12 for Deep Archive), they receive a second email: “Your files are now available.”
The user did not file a ticket. They did not wait on hold. They got an email and then they got their files.