AgentProbe runs 12 checks across 4 categories, each targeting a specific aspect of how AI agents interact with your site. Here's exactly what we test, how we score it, and why each check matters.
Each check returns one of three results: pass (100), warn (50), or fail (0). Scores are averaged within each category, then combined using category weights into an overall 0–100 score.
We weight Discovery and Access higher because reachability is the foundation — if agents can't find or access your content, nothing else matters.
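The two-level scoring can be sketched like this. Note the weight values below are illustrative assumptions — the description only says Discovery and Access weigh more, not the exact numbers:

```python
# Sketch of the scoring model: per-check results (pass/warn/fail) are
# averaged within each category, then combined with category weights.

RESULT_POINTS = {"pass": 100, "warn": 50, "fail": 0}

# Hypothetical weights (must sum to 1.0); Discovery and Access are
# weighted higher, per the description, but these exact values are assumed.
CATEGORY_WEIGHTS = {
    "discovery": 0.3,
    "protocols": 0.2,
    "access": 0.3,
    "structure": 0.2,
}

def category_score(results):
    """Average the point values of one category's check results."""
    points = [RESULT_POINTS[r] for r in results]
    return sum(points) / len(points)

def overall_score(results_by_category):
    """Weighted combination of category averages into a 0-100 score."""
    return round(sum(
        CATEGORY_WEIGHTS[cat] * category_score(results)
        for cat, results in results_by_category.items()
    ))

score = overall_score({
    "discovery": ["pass", "pass", "warn", "fail"],
    "protocols": ["fail", "warn", "pass"],
    "access": ["pass", "pass", "pass"],
    "structure": ["warn", "pass"],
})
```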
7 resources are fetched in parallel — the homepage, robots.txt, sitemap.xml, llms.txt, llms-full.txt, ai-plugin.json, and mcp.json. Results are cached so checks read from memory.
All 12 checks run concurrently against cached responses. Some checks make additional targeted fetches (e.g., testing bot User-Agents or probing API endpoints).
Category averages are weighted into the final score. Failed and warned checks generate actionable recommendations explaining what to fix and how.
How easily AI agents can find and understand your site
Checks for a /llms.txt file — a plain-text guide that tells AI models what your site is about, its structure, and where to find key resources. Also checks for /llms-full.txt as a bonus extended version.
Fetches /llms.txt and validates it has at least 10 characters, a Markdown title (line starting with #), and at least one URL. If /llms-full.txt exists with 100+ characters, it's noted as a bonus in the report.
LLMs have limited context windows and can't crawl your entire site. A curated /llms.txt gives them a concise summary — what you do, what pages matter, and where to find structured content. Without it, agents guess from fragments of HTML.
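The validation rules above boil down to three string checks; a minimal sketch (not AgentProbe's actual code):

```python
import re

def validate_llms_txt(text):
    """llms.txt passes when it has 10+ characters, a Markdown title
    (a line starting with '#'), and at least one URL."""
    has_length = len(text) >= 10
    has_title = any(line.lstrip().startswith("#") for line in text.splitlines())
    has_url = re.search(r"https?://\S+", text) is not None
    return has_length and has_title and has_url

# Hypothetical example file that satisfies all three rules.
sample = """# Example Corp
> Tools for example workflows.

## Docs
- https://example.com/docs: API reference
"""
```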
Parses your robots.txt and checks whether 8 known AI crawlers are allowed or blocked: GPTBot, ClaudeBot, GoogleBot, Bingbot, PerplexityBot, ChatGPT-User, Amazonbot, and cohere-ai.
Parses robots.txt into User-agent groups with their Disallow directives. For each AI bot, checks if there's a matching group with Disallow: / (blocks all paths). Falls back to the wildcard (*) group if no bot-specific rules exist.
Many sites unintentionally block AI crawlers with broad Disallow: / rules added years ago for different bots. If agents can't crawl your content, they can't index, cite, or recommend it — you become invisible to AI-powered search and assistants.
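The grouping-and-fallback logic reads roughly like this — a simplified sketch that skips real-world robots.txt edge cases such as Allow directives and path prefix matching:

```python
AI_BOTS = ["GPTBot", "ClaudeBot", "GoogleBot", "Bingbot",
           "PerplexityBot", "ChatGPT-User", "Amazonbot", "cohere-ai"]

def parse_robots(text):
    """Group Disallow directives by User-agent (lowercased)."""
    groups = {}           # agent -> list of Disallow paths
    agents = []           # agents of the group currently being read
    reading_rules = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if reading_rules:                 # a new group starts
                agents, reading_rules = [], False
            agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field == "disallow":
            reading_rules = True
            for agent in agents:
                groups[agent].append(value)
    return groups

def is_blocked(groups, bot):
    """Blocked if the bot's own group -- or the wildcard (*) group as a
    fallback -- contains 'Disallow: /'."""
    rules = groups.get(bot.lower(), groups.get("*", []))
    return "/" in rules

robots = """User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin
"""
blocked = {bot: is_blocked(parse_robots(robots), bot) for bot in AI_BOTS}
```

Here GPTBot is blocked by its own group, while the other bots fall back to the wildcard group, which only disallows /admin.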
Checks for a valid XML sitemap at /sitemap.xml. Validates the XML structure and counts the number of URLs listed.
Fetches /sitemap.xml and looks for <urlset> (standard sitemap) or <sitemapindex> (sitemap of sitemaps) tags. Counts all <loc> entries to report the number of indexed URLs.
A sitemap gives crawlers a complete directory of your pages with priority and update frequency. Without one, bots have to discover pages by following links — slow, incomplete, and they'll miss orphaned or deep-nested pages.
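The structural validation can be sketched with a standard XML parser; namespace handling is simplified here:

```python
import xml.etree.ElementTree as ET

def inspect_sitemap(xml_text):
    """Accept a <urlset> or <sitemapindex> root and count <loc> entries."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return {"valid": False, "urls": 0}
    tag = root.tag.split("}")[-1]          # strip the XML namespace
    if tag not in ("urlset", "sitemapindex"):
        return {"valid": False, "urls": 0}
    locs = [el for el in root.iter() if el.tag.split("}")[-1] == "loc"]
    return {"valid": True, "urls": len(locs)}

sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs</loc></url>
</urlset>"""
```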
Looks for two machine-readable manifests: ai-plugin.json (OpenAI's plugin specification with name_for_model and schema_version) and mcp.json (Model Context Protocol configuration).
Fetches both files from /.well-known/ and validates them as parseable JSON. For ai-plugin.json, checks for name_for_model or schema_version fields. For mcp.json, any valid JSON structure counts.
These manifests let AI platforms auto-discover your site's capabilities. An ai-plugin.json tells ChatGPT-style agents what your service does. An mcp.json advertises tools and data sources that MCP-compatible agents can use. Without them, AI platforms can't programmatically integrate with your site.
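Applied to fetched response bodies, the two validation rules look roughly like this (a sketch of the described logic):

```python
import json

def check_manifests(ai_plugin_body, mcp_body):
    """ai-plugin.json must be valid JSON with name_for_model or
    schema_version; for mcp.json, any valid JSON counts."""
    results = {}
    try:
        doc = json.loads(ai_plugin_body)
        results["ai-plugin.json"] = "name_for_model" in doc or "schema_version" in doc
    except (json.JSONDecodeError, TypeError):
        results["ai-plugin.json"] = False
    try:
        json.loads(mcp_body)
        results["mcp.json"] = True
    except (json.JSONDecodeError, TypeError):
        results["mcp.json"] = False
    return results

# Hypothetical bodies: a valid plugin manifest and a 404 HTML page.
results = check_manifests(
    '{"schema_version": "v1", "name_for_model": "demo"}',
    "<html>Not Found</html>",
)
```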
Support for agent-specific protocols and formats
Tests whether your server can serve content as Markdown via HTTP content negotiation (Accept header) or a format query parameter.
First requests your homepage with the Accept: text/markdown header and checks if the response Content-Type is text/markdown or text/x-markdown. If that fails, tries appending ?format=markdown to the URL and checks if the response either has a Markdown content type or starts with # and is longer than 50 characters.
LLMs process Markdown natively — it's their preferred input format. When a server can return Markdown directly, agents get clean, structured content instead of parsing through HTML tags, navigation menus, cookie banners, and footer links. This dramatically improves content quality for AI consumption.
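The two-step decision can be sketched as a pure function over the two responses; each response is modeled as a (content_type, body) tuple, and the names are illustrative:

```python
def markdown_check(accept_response, query_response):
    """Step 1: the Accept: text/markdown request must return a Markdown
    content type. Step 2: the ?format=markdown request passes on content
    type OR a body that starts with '#' and exceeds 50 characters."""
    md_types = ("text/markdown", "text/x-markdown")

    ctype, _ = accept_response
    if any(ctype.startswith(t) for t in md_types):
        return "pass"

    ctype, body = query_response
    if any(ctype.startswith(t) for t in md_types):
        return "pass"
    if body.startswith("#") and len(body) > 50:
        return "pass"
    return "fail"

# Content negotiation fails, but ?format=markdown returns Markdown-shaped text.
result = markdown_check(
    ("text/html", "<html>...</html>"),
    ("text/html", "# Docs\n" + "Getting started with the API. " * 3),
)
```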
Checks for a Model Context Protocol configuration file that defines servers, tools, or resources AI agents can interact with.
Fetches /.well-known/mcp.json and validates it as JSON. Then checks for the presence of servers, tools, or resources keys — the three core MCP primitives that define what an agent can do with your site.
MCP is an open protocol that lets AI agents go beyond reading pages. They can call APIs, query databases, trigger actions, and access structured data through your exposed tools and resources. An MCP config transforms your site from a passive document into an active service agents can use.
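A minimal sketch of that validation; the pass/warn mapping for valid-but-empty JSON is an assumption, not necessarily how AgentProbe scores it:

```python
import json

MCP_PRIMITIVES = ("servers", "tools", "resources")

def check_mcp(body):
    """Valid JSON plus at least one of the three core MCP keys."""
    try:
        doc = json.loads(body)
    except json.JSONDecodeError:
        return "fail"
    if not isinstance(doc, dict):
        return "warn"          # valid JSON but no recognizable structure
    return "pass" if any(key in doc for key in MCP_PRIMITIVES) else "warn"

result = check_mcp('{"tools": [{"name": "search"}]}')
```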
Searches for an OpenAPI or Swagger specification that describes your API endpoints in a machine-readable format.
Probes four common paths in parallel: /openapi.json, /openapi.yaml, /swagger.json, and /api-docs (each with a 5-second timeout). Looks for "openapi", "swagger", or "paths" keywords in the response. If nothing is found at standard paths, checks the homepage's Link headers for rel="api" or rel="service-desc" references, and finally tries requesting the homepage with Accept: application/json.
A structured API spec tells agents exactly what endpoints exist, what parameters they accept, and what responses look like. Without one, an agent has to reverse-engineer your API from documentation pages or trial and error — unreliable and error-prone.
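The keyword test at the heart of the probe is simple string matching; a sketch under the description above:

```python
# The four standard paths probed in parallel.
SPEC_PATHS = ["/openapi.json", "/openapi.yaml", "/swagger.json", "/api-docs"]

def looks_like_api_spec(body):
    """A response counts as a spec if it mentions one of the
    characteristic OpenAPI/Swagger keywords."""
    lowered = body.lower()
    return any(keyword in lowered for keyword in ("openapi", "swagger", "paths"))

found = looks_like_api_spec('{"openapi": "3.1.0", "paths": {"/search": {}}}')
```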
Whether AI bots can actually reach your content
Tests whether AI bots receive the same content as regular browsers by comparing responses across different User-Agent strings.
Fetches your homepage with a standard Chrome browser User-Agent to establish a baseline. Then fetches the same page as GPTBot, ClaudeBot, and GoogleBot (8-second timeout each). A bot is considered blocked if it gets a 401, 403, or 429 status, or if the response body is less than 30% of the browser version's size and under 500 characters.
WAFs, CDNs, and bot protection services often serve different content to known bot User-Agents — 403 errors, CAPTCHAs, or stripped-down HTML. If AI crawlers see a blank page or an access denied error while browsers see your full site, you're invisible to AI-powered search, citation, and recommendations.
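The blocking heuristic reduces to a few comparisons; a sketch with hypothetical byte counts:

```python
def bot_blocked(status, bot_len, browser_len):
    """Blocked on 401/403/429, or when the bot body is under 30% of the
    browser baseline AND under 500 characters."""
    if status in (401, 403, 429):
        return True
    return bot_len < 0.3 * browser_len and bot_len < 500

# Illustrative scenarios against a 42,000-character browser baseline.
checks = {
    "GPTBot": bot_blocked(403, 0, 42_000),       # hard block by a WAF
    "ClaudeBot": bot_blocked(200, 120, 42_000),  # stripped-down challenge page
    "GoogleBot": bot_blocked(200, 41_500, 42_000),  # same content as browsers
}
```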
Evaluates whether your homepage delivers meaningful content in the initial HTML response, without requiring JavaScript execution.
Analyzes the raw HTML: extracts body text length (excluding script and style tags), counts headings (h1–h6), and detects SPA patterns (elements with id="root", id="app", or id="__next" combined with under 200 characters of text content). Also checks for <noscript> fallback content.
AI crawlers, LLM fetchers, and most automation tools don't run JavaScript. They see only the initial HTML your server sends. If your site is a single-page app that renders everything client-side, agents see an empty div and a bundle URL — no content, no headings, no useful information.
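A rough sketch of that static analysis using regexes (real HTML parsing would be more robust):

```python
import re

def analyze_html(html):
    """Strip script/style, measure visible text, count headings, and
    detect a bare SPA mount point with almost no server-rendered text."""
    body = re.sub(r"<(script|style)\b.*?</\1>", "", html,
                  flags=re.DOTALL | re.IGNORECASE)
    text = re.sub(r"<[^>]+>", " ", body)
    text = re.sub(r"\s+", " ", text).strip()
    headings = len(re.findall(r"<h[1-6]\b", body, flags=re.IGNORECASE))
    spa_root = re.search(r'id="(root|app|__next)"', body) is not None
    return {
        "text_length": len(text),
        "headings": headings,
        "looks_like_spa": spa_root and len(text) < 200,
    }

# What an agent sees from a client-rendered single-page app.
spa_page = ('<html><body><div id="root"></div>'
            '<script src="/bundle.js"></script></body></html>')
report = analyze_html(spa_page)
```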
Measures how fast your server responds to the initial page request and counts any redirect hops along the way.
Records the total time from request to full response for your homepage. Tracks the HTTP status code and counts redirects (e.g., HTTP → HTTPS, www → non-www, or vanity URL redirects).
AI agents process pages at scale — they're fetching thousands of URLs with strict per-request timeouts. A 3-second response time means your pages will be deprioritized or skipped entirely. Redirect chains add latency and sometimes break bot crawlers that don't follow all hops.
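Summarizing a fetch comes down to adding up the hops; a sketch where each hop is a hypothetical (status, elapsed_ms) pair recorded while following redirects:

```python
def summarize_fetch(hops):
    """Total elapsed time, final status, and redirect count for a
    request that may have passed through several redirects."""
    total = sum(ms for _, ms in hops)
    redirects = sum(1 for status, _ in hops if 300 <= status < 400)
    return {"total_ms": total, "final_status": hops[-1][0], "redirects": redirects}

# e.g. HTTP -> HTTPS, then www -> apex, then the final page.
summary = summarize_fetch([(301, 80), (302, 95), (200, 240)])
```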
Structured data and metadata quality
Extracts and validates JSON-LD structured data blocks embedded in your page, looking for recognized Schema.org types.
Parses all <script type="application/ld+json"> blocks from the HTML. Extracts @type values from each block, including nested items within @graph arrays. Checks for "important" types: Organization, WebSite, WebPage, Article, and Product.
JSON-LD gives agents machine-readable facts about your page — who published it, what type of content it is, when it was updated, what it's about. Without structured data, agents have to infer this from HTML patterns, which is unreliable and lossy. Well-structured JSON-LD means better citations, richer context, and more accurate AI-generated summaries.
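The extraction described above — including @graph unwrapping — can be sketched like this:

```python
import json
import re

IMPORTANT_TYPES = {"Organization", "WebSite", "WebPage", "Article", "Product"}

def extract_jsonld_types(html):
    """Parse each ld+json script block and collect @type values,
    including items nested inside @graph arrays."""
    types = set()
    pattern = r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>'
    for block in re.findall(pattern, html, flags=re.DOTALL | re.IGNORECASE):
        try:
            doc = json.loads(block)
        except json.JSONDecodeError:
            continue
        items = doc.get("@graph", [doc]) if isinstance(doc, dict) else doc
        for item in items:
            t = item.get("@type") if isinstance(item, dict) else None
            if isinstance(t, str):
                types.add(t)
            elif isinstance(t, list):   # @type may be an array
                types.update(t)
    return types

html = '''<script type="application/ld+json">
{"@context": "https://schema.org",
 "@graph": [{"@type": "Organization", "name": "Example"},
            {"@type": "WebSite", "url": "https://example.com"}]}
</script>'''
found = extract_jsonld_types(html) & IMPORTANT_TYPES
```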
Checks for 6 essential HTML metadata elements that help agents quickly understand and categorize your page.
Scans the HTML for: (1) lang attribute on the <html> tag, (2) <title> tag with more than 5 characters, (3) <meta name="description"> with more than 20 characters, (4) <link rel="canonical"> URL, (5) at least one <h1> heading, and (6) Open Graph og:title meta tag.
These metadata basics are the first thing any crawler reads. The lang attribute tells agents what language to expect. The title and description give a summary without parsing content. The canonical URL prevents duplicate indexing. The h1 signals the primary topic. OG tags provide the social/sharing context. Missing any of these forces agents to guess — and guesses are often wrong.
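All six checks can be run over the raw HTML; a regex-based sketch (a production checker would use a real HTML parser):

```python
import re

def check_metadata(html):
    """The six metadata checks: lang attribute, title > 5 chars,
    description > 20 chars, canonical link, an h1, and og:title."""
    title = re.search(r"<title[^>]*>(.*?)</title>", html, re.DOTALL)
    desc = re.search(r'<meta[^>]+name="description"[^>]+content="([^"]*)"', html)
    return {
        "lang": bool(re.search(r"<html[^>]+\blang=", html)),
        "title": bool(title and len(title.group(1).strip()) > 5),
        "description": bool(desc and len(desc.group(1)) > 20),
        "canonical": 'rel="canonical"' in html,
        "h1": bool(re.search(r"<h1\b", html)),
        "og_title": 'property="og:title"' in html,
    }

# A hypothetical page that passes all six checks.
html = '''<html lang="en"><head>
<title>Example Corp: Tools for Agents</title>
<meta name="description" content="We build agent-ready web tooling and docs.">
<link rel="canonical" href="https://example.com/">
<meta property="og:title" content="Example Corp">
</head><body><h1>Example Corp</h1></body></html>'''
report = check_metadata(html)
```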