SEO Crawler Test Checklist
This page outlines the technical SEO features and scenarios implemented across this test website.
On-Page Elements
- Basic Tags:
  - Presence/content of `<title>` (e.g., `index.html`; see the extraction sketch after this list)
  - Handling of missing `<title>` (`page4-no-title.html`)
  - Handling of very long `<title>` (`content-1.html`)
  - Presence/content of `<meta name="description">` (e.g., `index.html`)
  - Presence/content of `<h1>` (e.g., `index.html`)
  - Handling of multiple `<h1>` tags (`content-2.html`)
  - Use of other heading tags (`<h2>`, etc.)
- Canonicals:
  - Presence/correct URL in `<link rel="canonical">` (e.g., `index.html`)
  - Self-referencing canonical (`page3-duplicate.html`)
  - Canonical pointing to destination on redirects (`redirect-page.html`, `page6-js-redirect.html`, `page12-delayed-refresh.html`)
  - Handling of multiple `<link rel="canonical">` tags (`page9-multi-canonical.html`)
  - Handling canonical to a different domain (`content-4.html`)
- Meta Robots:
  - Parsing `index, follow` (`index.html`)
  - Parsing `noindex, follow` (`page2.html`, redirects)
  - Parsing specific bot directives (`<meta name="google" content="nosnippet">` on `page2.html`)
- Links:
  - Internal/external link identification
  - Broken internal link detection (`/nonexistent-page.html`)
  - Parsing `rel="nofollow"` (internal/external on `index.html`)
  - Parsing `rel="sponsored"` (`index.html`)
  - Parsing `rel="ugc"` (`index.html`)
  - Links within `<iframe>` (`iframe-content.html`)
  - Links added via JavaScript (`page5-js-content.html`, hidden link)
  - Handling links with empty anchor text (`content-5.html`)
  - Handling links to non-HTML resources (PDF link on `content-6.html`)
  - Handling pages with high outgoing link counts (`content-7.html`)
  - Handling links with mixed-case URLs (`content-8.html`)
  - Relative link resolution with `<base>` tag (`page7-base-tag.html`)
  - Relative link resolution from deep paths (`level1/level2/deep-page.html`)
  - Following internal links
  - Handling fragment identifiers (`#`) (`index.html`, `faq.html#question1`)
- Images:
  - Presence of `<img>` tags
  - Presence/content of `alt` text
  - Discovery of images in CSS (`style.css` -> `index.html`)
  - Handling references to large image files (`content-3.html`)
- Internationalization:
  - Parsing/validation of `hreflang` tags (`index.html`, `index-gb.html`)
- Structured Data (Schema.org):
  - Parsing JSON-LD
  - Validation of `WebSite` schema (`index.html`)
  - Validation of `Article` schema (`blog-1.html`)
  - Validation of `FAQPage` schema (`faq.html`)
  - Validation of `BreadcrumbList` schema (`level1/level2/deep-page.html`)
- Pagination:
  - Detection/understanding of `rel="next"`/`rel="prev"` (`blog-*.html`)
- Mobile Friendliness:
  - Detection of missing `<meta name="viewport">` (`page11-no-viewport.html`)
- Social Tags:
  - Parsing Open Graph tags (`index.html`)
  - Parsing Twitter Card tags (`index.html`)
- Content:
  - Extraction of main page content (paragraphs, lists)
  - Handling content hidden via CSS (`display: none` on `index.html`)
  - Detection of identical content across multiple pages (`duplicate-*.html`)
  - Handling of canonical tags on duplicate pages (`duplicate-2.html`, `duplicate-3.html`)
- Alternate Links:
  - Parsing `rel="alternate"` for feeds (RSS link on `index.html`)
Crawling & Indexing Control
- `robots.txt`:
  - Fetching/parsing `robots.txt` (see the sketch after this list)
  - Handling different `User-agent` blocks
  - Respecting `Disallow:` (including wildcards)
  - Respecting `Allow:`
  - Parsing `Crawl-delay:`
  - Parsing `Sitemap:` directive
- Sitemaps:
  - Fetching/parsing Sitemap Index files (`sitemap-index.xml`)
  - Fetching/parsing XML Sitemaps (`sitemap-pages.xml`, `sitemap-blog.xml`)
  - Extracting standard URL data
  - Handling URLs blocked by `robots.txt`
  - Parsing Sitemap extensions (Image sitemap in `sitemap-blog.xml`)
  - Cross-referencing sitemap/crawled URLs
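As a rough illustration of these crawling-control checks, the sketch below parses `robots.txt` with Python's standard `urllib.robotparser`, walks the sitemap index, and cross-references sitemap URLs against the robots rules. The base URL is a hypothetical local server; note that `robotparser` does not understand wildcard `Disallow` patterns, so a real crawler would need its own matcher for that item.

```python
# Illustrative only: the base URL is an assumption, not part of the test site.
from urllib import robotparser
from urllib.request import urlopen
import xml.etree.ElementTree as ET

BASE = "http://localhost:8000"
SM_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

rp = robotparser.RobotFileParser()
rp.set_url(f"{BASE}/robots.txt")
rp.read()

print("crawl-delay:", rp.crawl_delay("*"))
print("sitemaps declared:", rp.site_maps())
print("may fetch /page2.html:", rp.can_fetch("*", f"{BASE}/page2.html"))

def sitemap_urls(sitemap_url: str):
    """Yield <loc> values from a sitemap or sitemap index, recursively."""
    root = ET.parse(urlopen(sitemap_url)).getroot()
    for loc in root.iter(f"{{{SM_NS}}}loc"):
        url = loc.text.strip()
        if url.endswith(".xml"):  # simplification: treat .xml entries as nested sitemaps
            yield from sitemap_urls(url)
        else:
            yield url

for url in sitemap_urls(f"{BASE}/sitemap-index.xml"):
    # Cross-reference: a URL listed in a sitemap but disallowed in robots.txt is a flag.
    if not rp.can_fetch("*", url):
        print("listed in sitemap but blocked by robots.txt:", url)
```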
Technical Health
- Redirects:
  - Handling meta refresh redirects (immediate & delayed; see the sketch after this list)
  - Handling JavaScript redirects
  - Handling HTTP 301 Permanent Redirects (server-side config for `/old-page.html`)
  - Handling HTTP 302 Temporary Redirects (server-side config for `/temporary-promo.html`)
  - Detection of simulated Soft 404s (`soft-404-example.html` with 200 status)
- JavaScript:
  - Rendering JS-dependent content/links
  - Handling JS redirects
  - Respecting `robots.txt` for JS resources
- HTML Validity:
  - Handling URLs with excessive length/keywords (`a-very-long-page-name...html`)
  - Robustness with parsing errors (`page8-html-errors.html`)
- Character Encoding:
  - Handling encoding conflicts (`page10-encoding-error.html`)
- Resource Loading:
  - Fetching external CSS/JS (respecting `robots.txt`)
- iFrames:
  - Detection and handling of `<iframe>` content
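For the redirect and soft-404 items, the sketch below follows HTTP redirects with `requests`, records the chain, looks for a meta refresh tag in the body, and applies a crude soft-404 heuristic (200 status plus error-like wording in a thin page). The base URL, thresholds, and phrases are assumptions for illustration; the actual test pages may use different wording.

```python
# Illustrative only: classifies redirects and flags possible soft 404s.
import re
import requests

def check_url(url: str) -> dict:
    resp = requests.get(url, allow_redirects=True, timeout=10)
    report = {
        "final_url": resp.url,
        "status": resp.status_code,
        # e.g. [(301, ".../redirect-target.html")] for /old-page.html
        "redirect_chain": [(r.status_code, r.headers.get("Location")) for r in resp.history],
    }

    # Meta refresh redirects (immediate or delayed) are not HTTP redirects,
    # so they have to be detected in the HTML body.
    meta_refresh = re.search(
        r'<meta[^>]+http-equiv=["\']refresh["\']', resp.text, re.I)
    report["meta_refresh"] = bool(meta_refresh)

    # Crude soft-404 heuristic: 200 status, thin page, error-like wording.
    looks_like_error = re.search(
        r"not found|no longer exists|page unavailable", resp.text, re.I)
    report["possible_soft_404"] = (
        resp.status_code == 200 and looks_like_error is not None and len(resp.text) < 2048
    )
    return report

for path in ["/old-page.html", "/temporary-promo.html", "/soft-404-example.html"]:
    print(path, check_url("http://localhost:8000" + path))
```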