Robots.txt is a simple but powerful file that controls which pages search engine crawlers can and cannot access. Proper robots.txt configuration preserves crawl budget for your most important pages, keeps crawlers out of non-public content, and prevents duplicate-content issues from parameter URLs and internal systems.
Robots.txt Syntax
The file lives at the root of your domain: yourdomain.com/robots.txt
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /internal-search/
Disallow: /*?sort=
Disallow: /*?filter=

Sitemap: https://yourdomain.com/sitemap.xml
Key Directives
- User-agent: Specifies which bot the rules apply to (* = all bots)
- Disallow: Blocks the specified path from crawling
- Allow: Explicitly permits crawling (overrides Disallow for specific paths; see the example after this list)
- Sitemap: Points crawlers to your XML sitemap
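These rules combine. In Google's implementation, the most specific (longest) matching rule wins, so a narrow Allow can carve an exception out of a broader Disallow. A sketch using a hypothetical /admin/help/ section:

User-agent: *
Disallow: /admin/
Allow: /admin/help/

Here /admin/settings stays blocked, while /admin/help/getting-started remains crawlable because the Allow rule matches it more specifically.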
What to Block
- Admin and login pages: /admin/, /wp-admin/, /login/
- Internal search results: /search?, /internal-search/
- Parameter URLs: ?sort=, ?filter=, ?page= (if canonical tags handle these; see the wildcard example after this list)
- Development and staging paths: /staging/, /test/, /dev/
- Thank-you and confirmation pages: /thank-you/, /order-confirmation/
- User-generated parameter pages with thin content
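Googlebot and most major crawlers support two pattern characters in these rules: * matches any sequence of characters, and $ anchors a rule to the end of the URL. A sketch with illustrative paths (the .pdf rule is included purely to show the $ syntax):

User-agent: *
# Match sort= and filter= anywhere in the URL, not just as the first parameter
Disallow: /*sort=
Disallow: /*filter=
# Block internal search result pages
Disallow: /search?
# $ anchors the match: this blocks URLs ending in .pdf and nothing else
Disallow: /*.pdf$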
What NOT to Block
- CSS and JavaScript files: Googlebot needs these to render pages properly (see the Allow pattern after this list)
- Image directories: Unless you deliberately want to exclude images from Google Images
- Your sitemap: Always allow crawling of your sitemap
- Any page you want to rank: Disallow prevents crawling, so Google can never see the content it would need to rank the page
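If a broad Disallow would otherwise catch render-critical assets, a common pattern is to carve them back out with more specific Allow rules. For example, WordPress sites routinely block /wp-admin/ while re-allowing the one file that front-end features depend on; the wildcard rules below are an extra safeguard, not a requirement:

User-agent: *
Disallow: /wp-admin/
# Front-end features on WordPress sites call this endpoint, so keep it crawlable
Allow: /wp-admin/admin-ajax.php
# Re-allow stylesheets and scripts in case any blocked path contains them
Allow: /*.css$
Allow: /*.js$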
Robots.txt vs. Noindex
Important distinction:
- Robots.txt Disallow: Prevents crawling but does NOT remove pages from the index. If a page has backlinks, Google may index the URL without crawling it.
- Noindex meta tag: Tells Google to remove the page from the index. Requires the page to be crawlable (not blocked by robots.txt).
To remove a page from search results: use noindex, NOT robots.txt. To save crawl budget on non-content pages: use robots.txt.
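For reference, noindex lives in the page itself rather than in robots.txt, either as a meta tag in the HTML head or, for non-HTML files such as PDFs, as an X-Robots-Tag HTTP response header:

<meta name="robots" content="noindex">

Remember: if the same URL is also disallowed in robots.txt, Google never fetches the page and therefore never sees this tag.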
Testing Robots.txt
Use the robots.txt report in Google Search Console (the successor to the legacy robots.txt Tester) to verify your rules work as intended. Test specific URLs to confirm they’re allowed or blocked. After making changes, monitor Search Console’s crawl stats for unexpected shifts in crawl behavior.
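For a quick local sanity check, Python's standard-library urllib.robotparser can fetch the file and evaluate URLs against it. One caveat: it implements the original robots exclusion protocol and does not understand Google's * and $ wildcard extensions, so wildcard rules still need verification in Search Console. The domain and URLs below are placeholders:

from urllib.robotparser import RobotFileParser

# Fetch and parse the live robots.txt file
parser = RobotFileParser()
parser.set_url("https://yourdomain.com/robots.txt")
parser.read()

# Check whether specific URLs are crawlable for a given user agent
for url in (
    "https://yourdomain.com/blog/example-post/",
    "https://yourdomain.com/admin/settings",
):
    allowed = parser.can_fetch("Googlebot", url)
    print("ALLOW" if allowed else "BLOCK", url)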
Review your robots.txt quarterly. As your site grows, new URL patterns emerge that may need blocking. A well-maintained robots.txt ensures Googlebot spends its crawl budget on pages that drive your SEO performance.