AgentProbe runs 12 checks across 4 categories, each targeting a specific aspect of how AI agents interact with your site. Here's exactly what we test, how we score it, and why each check matters.
Each check returns one of three results: pass (100), warn (50), or fail (0). Scores are averaged within each category, then combined using category weights into an overall 0–100 score.
We weight Discovery and Access higher because reachability is the foundation — if agents can't find or access your content, nothing else matters.
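The two-level scoring can be sketched like this. Note the weight values below are illustrative assumptions — the description only says Discovery and Access weigh more, not the exact numbers:

```python
# Sketch of the scoring model: per-check results (pass/warn/fail) are
# averaged within each category, then combined with category weights.

RESULT_POINTS = {"pass": 100, "warn": 50, "fail": 0}

# Hypothetical weights (must sum to 1.0); Discovery and Access are
# weighted higher, per the description, but these exact values are assumed.
CATEGORY_WEIGHTS = {
    "discovery": 0.3,
    "protocols": 0.2,
    "access": 0.3,
    "structure": 0.2,
}

def category_score(results):
    """Average the point values of one category's check results."""
    points = [RESULT_POINTS[r] for r in results]
    return sum(points) / len(points)

def overall_score(results_by_category):
    """Weighted combination of category averages into a 0-100 score."""
    return round(sum(
        CATEGORY_WEIGHTS[cat] * category_score(results)
        for cat, results in results_by_category.items()
    ))

score = overall_score({
    "discovery": ["pass", "pass", "warn", "fail"],
    "protocols": ["fail", "warn", "pass"],
    "access": ["pass", "pass", "pass"],
    "structure": ["warn", "pass"],
})
```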
7 resources are fetched in parallel — the homepage, robots.txt, sitemap.xml, llms.txt, llms-full.txt, ai-plugin.json, and mcp.json. Results are cached so checks read from memory.
All 12 checks run concurrently against cached responses. Some checks make additional targeted fetches (e.g., testing bot User-Agents or probing API endpoints).
Category averages are weighted into the final score. Failed and warned checks generate actionable recommendations explaining what to fix and how.
How easily AI agents can find and understand your site
Checks for a /llms.txt file — a plain-text guide that tells AI models what your site is about, its structure, and where to find key resources. Also checks for /llms-full.txt as a bonus extended version.
Fetches /llms.txt and validates it has at least 10 characters, a Markdown title (line starting with #), and at least one URL. If /llms-full.txt exists with 100+ characters, it's noted as a bonus in the report.
LLMs have limited context windows and can't crawl your entire site. A curated /llms.txt gives them a concise summary — what you do, what pages matter, and where to find structured content. Without it, agents guess from fragments of HTML.
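The validation rules above boil down to three string checks; a minimal sketch (not AgentProbe's actual code):

```python
import re

def validate_llms_txt(text):
    """llms.txt passes when it has 10+ characters, a Markdown title
    (a line starting with '#'), and at least one URL."""
    has_length = len(text) >= 10
    has_title = any(line.lstrip().startswith("#") for line in text.splitlines())
    has_url = re.search(r"https?://\S+", text) is not None
    return has_length and has_title and has_url

# Hypothetical example file that satisfies all three rules.
sample = """# Example Corp
> Tools for example workflows.

## Docs
- https://example.com/docs: API reference
"""
```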
Parses your robots.txt and checks whether 8 known AI crawlers are allowed or blocked: GPTBot, ClaudeBot, GoogleBot, Bingbot, PerplexityBot, ChatGPT-User, Amazonbot, and cohere-ai.
Parses robots.txt into User-agent groups with their Disallow directives. For each AI bot, checks if there's a matching group with Disallow: / (blocks all paths). Falls back to the wildcard (*) group if no bot-specific rules exist.
Many sites unintentionally block AI crawlers with broad Disallow: / rules added years ago for different bots. If agents can't crawl your content, they can't index, cite, or recommend it — you become invisible to AI-powered search and assistants.
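The grouping-and-fallback logic reads roughly like this — a simplified sketch that skips real-world robots.txt edge cases such as Allow directives and path prefix matching:

```python
AI_BOTS = ["GPTBot", "ClaudeBot", "GoogleBot", "Bingbot",
           "PerplexityBot", "ChatGPT-User", "Amazonbot", "cohere-ai"]

def parse_robots(text):
    """Group Disallow directives by User-agent (lowercased)."""
    groups = {}           # agent -> list of Disallow paths
    agents = []           # agents of the group currently being read
    reading_rules = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if reading_rules:                 # a new group starts
                agents, reading_rules = [], False
            agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field == "disallow":
            reading_rules = True
            for agent in agents:
                groups[agent].append(value)
    return groups

def is_blocked(groups, bot):
    """Blocked if the bot's own group -- or the wildcard (*) group as a
    fallback -- contains 'Disallow: /'."""
    rules = groups.get(bot.lower(), groups.get("*", []))
    return "/" in rules

robots = """User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin
"""
blocked = {bot: is_blocked(parse_robots(robots), bot) for bot in AI_BOTS}
```

Here GPTBot is blocked by its own group, while the other bots fall back to the wildcard group, which only disallows /admin.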
Checks for a valid XML sitemap at /sitemap.xml. Validates the XML structure and counts the number of URLs listed.
Fetches /sitemap.xml and looks for <urlset> (standard sitemap) or <sitemapindex> (sitemap of sitemaps) tags. Counts all <loc> entries to report the number of indexed URLs.
A sitemap gives crawlers a complete directory of your pages with priority and update frequency. Without one, bots have to discover pages by following links — slow, incomplete, and they'll miss orphaned or deep-nested pages.
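The structural validation can be sketched with a standard XML parser; namespace handling is simplified here:

```python
import xml.etree.ElementTree as ET

def inspect_sitemap(xml_text):
    """Accept a <urlset> or <sitemapindex> root and count <loc> entries."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return {"valid": False, "urls": 0}
    tag = root.tag.split("}")[-1]          # strip the XML namespace
    if tag not in ("urlset", "sitemapindex"):
        return {"valid": False, "urls": 0}
    locs = [el for el in root.iter() if el.tag.split("}")[-1] == "loc"]
    return {"valid": True, "urls": len(locs)}

sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs</loc></url>
</urlset>"""
```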
Looks for two machine-readable manifests: ai-plugin.json (OpenAI's plugin specification with name_for_model and schema_version) and mcp.json (Model Context Protocol configuration).
Fetches both files from /.well-known/ and validates them as parseable JSON. For ai-plugin.json, checks for name_for_model or schema_version fields. For mcp.json, any valid JSON structure counts.
These manifests let AI platforms auto-discover your site's capabilities. An ai-plugin.json tells ChatGPT-style agents what your service does. An mcp.json advertises tools and data sources that MCP-compatible agents can use. Without them, AI platforms can't programmatically integrate with your site.
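Applied to fetched response bodies, the two validation rules look roughly like this (a sketch of the described logic):

```python
import json

def check_manifests(ai_plugin_body, mcp_body):
    """ai-plugin.json must be valid JSON with name_for_model or
    schema_version; for mcp.json, any valid JSON counts."""
    results = {}
    try:
        doc = json.loads(ai_plugin_body)
        results["ai-plugin.json"] = "name_for_model" in doc or "schema_version" in doc
    except (json.JSONDecodeError, TypeError):
        results["ai-plugin.json"] = False
    try:
        json.loads(mcp_body)
        results["mcp.json"] = True
    except (json.JSONDecodeError, TypeError):
        results["mcp.json"] = False
    return results

# Hypothetical bodies: a valid plugin manifest and a 404 HTML page.
results = check_manifests(
    '{"schema_version": "v1", "name_for_model": "demo"}',
    "<html>Not Found</html>",
)
```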
Support for agent-specific protocols and formats
Tests whether your server can serve content as Markdown via HTTP content negotiation (Accept header) or a format query parameter.
First requests your homepage with the Accept: text/markdown header and checks if the response Content-Type is text/markdown or text/x-markdown. If that fails, tries appending ?format=markdown to the URL and checks if the response either has a Markdown content type or starts with # and is longer than 50 characters.
LLMs process Markdown natively — it's their preferred input format. When a server can return Markdown directly, agents get clean, structured content instead of parsing through HTML tags, navigation menus, cookie banners, and footer links. This dramatically improves content quality for AI consumption.
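The two-step decision can be sketched as a pure function over the two responses; each response is modeled as a (content_type, body) tuple, and the names are illustrative:

```python
def markdown_check(accept_response, query_response):
    """Step 1: the Accept: text/markdown request must return a Markdown
    content type. Step 2: the ?format=markdown request passes on content
    type OR a body that starts with '#' and exceeds 50 characters."""
    md_types = ("text/markdown", "text/x-markdown")

    ctype, _ = accept_response
    if any(ctype.startswith(t) for t in md_types):
        return "pass"

    ctype, body = query_response
    if any(ctype.startswith(t) for t in md_types):
        return "pass"
    if body.startswith("#") and len(body) > 50:
        return "pass"
    return "fail"

# Content negotiation fails, but ?format=markdown returns Markdown-shaped text.
result = markdown_check(
    ("text/html", "<html>...</html>"),
    ("text/html", "# Docs\n" + "Getting started with the API. " * 3),
)
```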
Checks for a Model Context Protocol configuration file that defines servers, tools, or resources AI agents can interact with.
Fetches /.well-known/mcp.json and validates it as JSON. Then checks for the presence of servers, tools, or resources keys — the three core MCP primitives that define what an agent can do with your site.
MCP is an open protocol that lets AI agents go beyond reading pages. They can call APIs, query databases, trigger actions, and access structured data through your exposed tools and resources. An MCP config transforms your site from a passive document into an active service agents can use.
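A minimal sketch of that validation; the pass/warn mapping for valid-but-empty JSON is an assumption, not necessarily how AgentProbe scores it:

```python
import json

MCP_PRIMITIVES = ("servers", "tools", "resources")

def check_mcp(body):
    """Valid JSON plus at least one of the three core MCP keys."""
    try:
        doc = json.loads(body)
    except json.JSONDecodeError:
        return "fail"
    if not isinstance(doc, dict):
        return "warn"          # valid JSON but no recognizable structure
    return "pass" if any(key in doc for key in MCP_PRIMITIVES) else "warn"

result = check_mcp('{"tools": [{"name": "search"}]}')
```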
Searches for an OpenAPI or Swagger specification that describes your API endpoints in a machine-readable format.
Probes four common paths in parallel: /openapi.json, /openapi.yaml, /swagger.json, and /api-docs (each with a 5-second timeout). Looks for "openapi", "swagger", or "paths" keywords in the response. If nothing is found at standard paths, checks the homepage's Link headers for rel="api" or rel="service-desc" references, and finally tries requesting the homepage with Accept: application/json.
A structured API spec tells agents exactly what endpoints exist, what parameters they accept, and what responses look like. Without one, an agent has to reverse-engineer your API from documentation pages or trial and error — unreliable and error-prone.
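The keyword test at the heart of the probe is simple string matching; a sketch under the description above:

```python
# The four standard paths probed in parallel.
SPEC_PATHS = ["/openapi.json", "/openapi.yaml", "/swagger.json", "/api-docs"]

def looks_like_api_spec(body):
    """A response counts as a spec if it mentions one of the
    characteristic OpenAPI/Swagger keywords."""
    lowered = body.lower()
    return any(keyword in lowered for keyword in ("openapi", "swagger", "paths"))

found = looks_like_api_spec('{"openapi": "3.1.0", "paths": {"/search": {}}}')
```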
Whether AI bots can actually reach your content
Tests whether AI bots receive the same content as regular browsers by comparing responses across different User-Agent strings.
Fetches your homepage with a standard Chrome browser User-Agent to establish a baseline. Then fetches the same page as GPTBot, ClaudeBot, and GoogleBot (8-second timeout each). A bot is considered blocked if it gets a 401, 403, or 429 status, or if the response body is less than 30% of the browser version's size and under 500 characters.
WAFs, CDNs, and bot protection services often serve different content to known bot User-Agents — 403 errors, CAPTCHAs, or stripped-down HTML. If AI crawlers see a blank page or an access denied error while browsers see your full site, you're invisible to AI-powered search, citation, and recommendations.
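The blocking heuristic reduces to a few comparisons; a sketch with hypothetical byte counts:

```python
def bot_blocked(status, bot_len, browser_len):
    """Blocked on 401/403/429, or when the bot body is under 30% of the
    browser baseline AND under 500 characters."""
    if status in (401, 403, 429):
        return True
    return bot_len < 0.3 * browser_len and bot_len < 500

# Illustrative scenarios against a 42,000-character browser baseline.
checks = {
    "GPTBot": bot_blocked(403, 0, 42_000),       # hard block by a WAF
    "ClaudeBot": bot_blocked(200, 120, 42_000),  # stripped-down challenge page
    "GoogleBot": bot_blocked(200, 41_500, 42_000),  # same content as browsers
}
```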
Evaluates whether your homepage delivers meaningful content in the initial HTML response, without requiring JavaScript execution.
Analyzes the raw HTML: extracts body text length (excluding script and style tags), counts headings (h1–h6), and detects SPA patterns (elements with id="root", id="app", or id="__next" combined with under 200 characters of text content). Also checks for <noscript> fallback content.
AI crawlers, LLM fetchers, and most automation tools don't run JavaScript. They see only the initial HTML your server sends. If your site is a single-page app that renders everything client-side, agents see an empty div and a bundle URL — no content, no headings, no useful information.
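A rough sketch of that static analysis using regexes (real HTML parsing would be more robust):

```python
import re

def analyze_html(html):
    """Strip script/style, measure visible text, count headings, and
    detect a bare SPA mount point with almost no server-rendered text."""
    body = re.sub(r"<(script|style)\b.*?</\1>", "", html,
                  flags=re.DOTALL | re.IGNORECASE)
    text = re.sub(r"<[^>]+>", " ", body)
    text = re.sub(r"\s+", " ", text).strip()
    headings = len(re.findall(r"<h[1-6]\b", body, flags=re.IGNORECASE))
    spa_root = re.search(r'id="(root|app|__next)"', body) is not None
    return {
        "text_length": len(text),
        "headings": headings,
        "looks_like_spa": spa_root and len(text) < 200,
    }

# What an agent sees from a client-rendered single-page app.
spa_page = ('<html><body><div id="root"></div>'
            '<script src="/bundle.js"></script></body></html>')
report = analyze_html(spa_page)
```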
Measures how fast your server responds to the initial page request and counts any redirect hops along the way.
Records the total time from request to full response for your homepage. Tracks the HTTP status code and counts redirects (e.g., HTTP → HTTPS, www → non-www, or vanity URL redirects).
AI agents process pages at scale — they're fetching thousands of URLs with strict per-request timeouts. A 3-second response time means your pages will be deprioritized or skipped entirely. Redirect chains add latency and sometimes break bot crawlers that don't follow all hops.
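Summarizing a fetch comes down to adding up the hops; a sketch where each hop is a hypothetical (status, elapsed_ms) pair recorded while following redirects:

```python
def summarize_fetch(hops):
    """Total elapsed time, final status, and redirect count for a
    request that may have passed through several redirects."""
    total = sum(ms for _, ms in hops)
    redirects = sum(1 for status, _ in hops if 300 <= status < 400)
    return {"total_ms": total, "final_status": hops[-1][0], "redirects": redirects}

# e.g. HTTP -> HTTPS, then www -> apex, then the final page.
summary = summarize_fetch([(301, 80), (302, 95), (200, 240)])
```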
Structured data and metadata quality
Extracts and validates JSON-LD structured data blocks embedded in your page, looking for recognized Schema.org types.
Parses all <script type="application/ld+json"> blocks from the HTML. Extracts @type values from each block, including nested items within @graph arrays. Checks for "important" types: Organization, WebSite, WebPage, Article, and Product.
JSON-LD gives agents machine-readable facts about your page — who published it, what type of content it is, when it was updated, what it's about. Without structured data, agents have to infer this from HTML patterns, which is unreliable and lossy. Well-structured JSON-LD means better citations, richer context, and more accurate AI-generated summaries.
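The extraction described above — including @graph unwrapping — can be sketched like this:

```python
import json
import re

IMPORTANT_TYPES = {"Organization", "WebSite", "WebPage", "Article", "Product"}

def extract_jsonld_types(html):
    """Parse each ld+json script block and collect @type values,
    including items nested inside @graph arrays."""
    types = set()
    pattern = r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>'
    for block in re.findall(pattern, html, flags=re.DOTALL | re.IGNORECASE):
        try:
            doc = json.loads(block)
        except json.JSONDecodeError:
            continue
        items = doc.get("@graph", [doc]) if isinstance(doc, dict) else doc
        for item in items:
            t = item.get("@type") if isinstance(item, dict) else None
            if isinstance(t, str):
                types.add(t)
            elif isinstance(t, list):   # @type may be an array
                types.update(t)
    return types

html = '''<script type="application/ld+json">
{"@context": "https://schema.org",
 "@graph": [{"@type": "Organization", "name": "Example"},
            {"@type": "WebSite", "url": "https://example.com"}]}
</script>'''
found = extract_jsonld_types(html) & IMPORTANT_TYPES
```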
Checks for 6 essential HTML metadata elements that help agents quickly understand and categorize your page.
Scans the HTML for: (1) lang attribute on the <html> tag, (2) <title> tag with more than 5 characters, (3) <meta name="description"> with more than 20 characters, (4) <link rel="canonical"> URL, (5) at least one <h1> heading, and (6) Open Graph og:title meta tag.
These metadata basics are the first thing any crawler reads. The lang attribute tells agents what language to expect. The title and description give a summary without parsing content. The canonical URL prevents duplicate indexing. The h1 signals the primary topic. OG tags provide the social/sharing context. Missing any of these forces agents to guess — and guesses are often wrong.
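All six checks can be run over the raw HTML; a regex-based sketch (a production checker would use a real HTML parser):

```python
import re

def check_metadata(html):
    """The six metadata checks: lang attribute, title > 5 chars,
    description > 20 chars, canonical link, an h1, and og:title."""
    title = re.search(r"<title[^>]*>(.*?)</title>", html, re.DOTALL)
    desc = re.search(r'<meta[^>]+name="description"[^>]+content="([^"]*)"', html)
    return {
        "lang": bool(re.search(r"<html[^>]+\blang=", html)),
        "title": bool(title and len(title.group(1).strip()) > 5),
        "description": bool(desc and len(desc.group(1)) > 20),
        "canonical": 'rel="canonical"' in html,
        "h1": bool(re.search(r"<h1\b", html)),
        "og_title": 'property="og:title"' in html,
    }

# A hypothetical page that passes all six checks.
html = '''<html lang="en"><head>
<title>Example Corp: Tools for Agents</title>
<meta name="description" content="We build agent-ready web tooling and docs.">
<link rel="canonical" href="https://example.com/">
<meta property="og:title" content="Example Corp">
</head><body><h1>Example Corp</h1></body></html>'''
report = check_metadata(html)
```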