ScalePostBot — How ScalePost crawls the web

What is ScalePostBot?

ScalePostBot is an automated agent operated by ScalePost Corporation (scalepost.ai). It fetches publicly available web pages so that our customers can analyze how their own content is being crawled, cited, and referenced by AI systems.

How to identify ScalePostBot

Every request we make includes the following headers:

Request headers

User-Agent: Mozilla/5.0 (compatible; ScalePostBot/1.0; +https://scalepost.ai/bot)
From: abuse@scalepost.ai
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.9

The +https://scalepost.ai/bot token in the User-Agent is the canonical link back to this page. Any request whose User-Agent names ScalePostBot but omits that link, or links to a different domain, is not from us.
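As an illustration of the check described above, here is a minimal Python sketch that a site operator could adapt for log analysis. The function names and the three labels are hypothetical, and accepting version numbers other than 1.0 is an assumption, not something this page promises. A User-Agent header can be spoofed, so a match shows only that a request follows the documented format, not that it came from us.

```python
import re

# Pattern derived from the User-Agent string documented above.
# Assumption: any dotted version number is accepted, not only 1.0.
SCALEPOSTBOT_UA = re.compile(
    r"\(compatible; ScalePostBot/\d+(?:\.\d+)*; \+https://scalepost\.ai/bot\)"
)


def claims_scalepostbot(user_agent: str) -> bool:
    """True if the header mentions the ScalePostBot product token at all."""
    return "scalepostbot" in user_agent.lower()


def matches_documented_format(user_agent: str) -> bool:
    """True if the header carries both the token and the canonical link."""
    return SCALEPOSTBOT_UA.search(user_agent) is not None


def classify_user_agent(user_agent: str) -> str:
    """Return 'documented', 'suspicious', or 'other' (hypothetical labels)."""
    if matches_documented_format(user_agent):
        return "documented"
    if claims_scalepostbot(user_agent):
        # Names the bot but lacks the canonical link or uses another domain.
        return "suspicious"
    return "other"
```

A request classified as "suspicious" names ScalePostBot without the canonical link, which is the case described above as not being us.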

What ScalePostBot does

Fetches the URL with a single GET request.
Reads the returned HTML to extract metadata.
Does not execute JavaScript, submit forms, follow login flows, scrape contact details, or scan for vulnerabilities.
Does not crawl recursively from the page it fetched. We do not follow links from the body of the document.
Does not retain page bodies long-term — only the extracted metadata is stored against the customer that requested the URL.

Volume is light and bursty: we fetch on demand when a customer requests an analysis, not on a continuous schedule. Most domains will see a small number of requests at most, paced according to the rules below.

How ScalePostBot behaves

ScalePostBot is built to be a polite citizen of the web.

robots.txt: Before fetching any URL on a host, we retrieve robots.txt and obey it. We honor both User-agent: ScalePostBot directives and User-agent: * fallbacks, including Disallow. If robots.txt is unreachable (network error, 5xx), we treat the host as allow-all. Note that RFC 9309 §2.3.1.3 defines allow-all for the "unavailable" case (4xx responses); the unreachable case is covered by §2.3.1.4, which calls for complete disallow.
Crawl-Delay: If your robots.txt declares a Crawl-Delay, we enforce it across all of our parallel workers for that host.
Retry-After: On HTTP 429 or HTTP 503, we read the Retry-After header (numeric seconds or HTTP-date) and back off for at least that long before retrying.
Permanent failures: On HTTP 403, 404, 410, and 451, we mark the URL as permanently unavailable and stop trying.
Caching: We cache robots.txt for one hour per origin so we don't re-fetch it on every request.
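The Retry-After and permanent-failure rules above can be sketched roughly as follows. This is an illustrative Python sketch, not our production code: the function names, the action labels, and the 60-second fallback used when no usable Retry-After header is present are all assumptions made for the example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional, Tuple

PERMANENT_FAILURES = {403, 404, 410, 451}  # statuses listed above
BACKOFF_STATUSES = {429, 503}              # statuses listed above


def retry_after_seconds(value: str, now: Optional[datetime] = None) -> Optional[float]:
    """Parse a Retry-After value given as numeric seconds or an HTTP-date.

    Returns the seconds to wait (never negative), or None if unparseable.
    """
    value = value.strip()
    if value.isascii() and value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        # Treat a date without zone information as UTC.
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def plan_next_step(
    status: int,
    retry_after: Optional[str] = None,
    default_backoff: float = 60.0,  # assumed fallback, not a documented value
) -> Tuple[str, Optional[float]]:
    """Map a response to ('drop', None), ('retry', seconds) or ('other', None)."""
    if status in PERMANENT_FAILURES:
        return ("drop", None)
    if status in BACKOFF_STATUSES:
        wait = retry_after_seconds(retry_after) if retry_after else None
        return ("retry", wait if wait is not None else default_backoff)
    return ("other", None)
```

For example, plan_next_step(429, "5") yields a retry after 5 seconds, while plan_next_step(410) drops the URL.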

How to block or limit ScalePostBot

Add a directive to your robots.txt. For example, to block ScalePostBot entirely:

robots.txt — block entirely

User-agent: ScalePostBot
Disallow: /

To slow it down:

robots.txt — rate limit

User-agent: ScalePostBot
Crawl-Delay: 5

To allow most of your site but exclude a section:

robots.txt — partial block

User-agent: ScalePostBot
Disallow: /private/
Disallow: /admin/

Changes to robots.txt are picked up within an hour.
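To check how a rule set like the ones above would be read before publishing it, one option is Python's standard urllib.robotparser module. The sketch below is only a stand-in: it shows how Python's parser interprets the rules, which may differ in edge cases from our own parser, and the example.com URLs and the load_rules helper are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Combines the rate-limit and partial-block examples above.
ROBOTS_TXT = """\
User-agent: ScalePostBot
Crawl-delay: 5
Disallow: /private/
Disallow: /admin/
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Paths under /private/ are blocked for ScalePostBot.
print(rules.can_fetch("ScalePostBot", "https://example.com/private/report.html"))  # False
# Everything not listed stays allowed.
print(rules.can_fetch("ScalePostBot", "https://example.com/blog/post.html"))  # True
# The declared delay, in seconds.
print(rules.crawl_delay("ScalePostBot"))  # 5
```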

Reporting a problem

If ScalePostBot is misbehaving on your site — or if you have any questions — email abuse@scalepost.ai with:

the affected hostname(s),
a sample User-Agent and a few timestamps from your access logs,
and a brief description of the problem.
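If it helps when gathering that information, the following Python sketch pulls a few matching entries out of an access log. It assumes the "combined" log layout commonly used by Apache and nginx; other formats would need a different pattern. The function name is hypothetical and the sample lines are invented for illustration.

```python
import re
from typing import Iterable, List, Tuple

# Assumed layout ("combined" format):
# host ident user [timestamp] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def scalepostbot_hits(lines: Iterable[str], limit: int = 5) -> List[Tuple[str, str, str]]:
    """Return up to `limit` (timestamp, request, user_agent) tuples."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match and "ScalePostBot" in match.group("ua"):
            hits.append((match.group("ts"), match.group("request"), match.group("ua")))
            if len(hits) >= limit:
                break
    return hits


# Invented sample lines; in practice, pass an open log file instead.
SAMPLE = [
    '203.0.113.7 - - [10/Oct/2024:13:55:36 +0000] "GET /a HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; ScalePostBot/1.0; +https://scalepost.ai/bot)"',
    '203.0.113.8 - - [10/Oct/2024:13:55:37 +0000] "GET /b HTTP/1.1" 200 512 "-" "curl/8.0"',
]

for timestamp, request, user_agent in scalepostbot_hits(SAMPLE):
    print(timestamp, request, user_agent)
```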

We respond within one business day and will pause crawling of the affected host(s) while we investigate.