What is the difference between sitemap_urls and sitemap_follow?

In Scrapy's SitemapSpider, sitemap_urls and sitemap_follow are two attributes that control how the spider discovers and processes URLs from sitemaps. Here is how they differ and how they work together:

1. sitemap_urls

  • Purpose: Specifies the initial sitemap URLs that the spider will start crawling from.

  • Type: A list of URLs (strings).

  • Behavior:

    • The spider will download and parse the sitemap files listed in sitemap_urls.

    • These sitemaps can be either:

      • Sitemap Index Files: Files that contain links to other sitemaps (e.g., sitemap.xml).

      • URLset Files: Files that contain direct links to web pages (e.g., product-sitemap.xml).
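The two kinds of file can be told apart by their root element. The following is a minimal sketch using only the standard library, not Scrapy's own parsing code; the helper name sitemap_kind and the example.com URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_kind(xml_text: str) -> str:
    """Return the root element name: 'sitemapindex' or 'urlset'."""
    root = ET.fromstring(xml_text)
    # root.tag looks like '{namespace}urlset'; strip the namespace part
    return root.tag.replace(SITEMAP_NS, "")


# Hypothetical sitemap index: it links to another sitemap
index_xml = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/product-sitemap.xml</loc></sitemap>
</sitemapindex>"""

# Hypothetical URL set: it links directly to a page
urlset_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/products/widget</loc></url>
</urlset>"""

print(sitemap_kind(index_xml))
print(sitemap_kind(urlset_xml))
```

A sitemap index is the case where sitemap_follow comes into play; a URL set leads straight to page URLs.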


2. sitemap_follow

  • Purpose: Filters which sitemaps the spider should follow when encountering a sitemap index file.

  • Type: A list of regex patterns (strings).

  • Behavior:

    • If a sitemap index file is encountered, the spider will only follow links to nested sitemaps that match one of the regex patterns in sitemap_follow.

    • If sitemap_follow is not specified, the spider will follow all nested sitemaps.


For example, https://global.1more.com/sitemap.xml is a sitemap index that looks like this:
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<!-- This is the parent sitemap linking to additional sitemaps for products, collections and pages as shown below. The sitemap can not be edited manually, but is kept up to date in real time. -->
<sitemap>
<loc>https://global.1more.com/sitemap_products_1.xml?from=6633159229540&to=7350287007844</loc>
</sitemap>
<sitemap>
<loc>https://global.1more.com/sitemap_pages_1.xml?from=81376641124&to=81377722468</loc>
</sitemap>
<sitemap>
<loc>https://global.1more.com/sitemap_collections_1.xml?from=262647939172&to=262782058596</loc>
</sitemap>
<sitemap>
<loc>https://global.1more.com/sitemap_blogs_1.xml</loc>
</sitemap>
</sitemapindex>
To follow only the products sitemap (sitemap_products_1) and skip the pages, collections, and blogs sitemaps, the spider looks like this:
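from scrapy.spiders import SitemapSpider


class MoreSpider(SitemapSpider):
    name = "1more.com"

    # Entry point: the sitemap index shown above
    sitemap_urls = ["https://global.1more.com/sitemap.xml"]

    # Only follow nested sitemaps whose URL matches this regex
    sitemap_follow = ["sitemap_products_1"]

    def parse(self, response):
        # Called for each page URL found in the followed sitemaps
        pass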
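To check which entries of the index the pattern selects, the filter can be reproduced outside Scrapy. This is a sketch that assumes each pattern is applied with a regex search against the nested sitemap's URL; the helper name should_follow is made up for illustration.

```python
import re

# The <loc> values from the sitemap index shown above
nested_sitemaps = [
    "https://global.1more.com/sitemap_products_1.xml?from=6633159229540&to=7350287007844",
    "https://global.1more.com/sitemap_pages_1.xml?from=81376641124&to=81377722468",
    "https://global.1more.com/sitemap_collections_1.xml?from=262647939172&to=262782058596",
    "https://global.1more.com/sitemap_blogs_1.xml",
]

sitemap_follow = ["sitemap_products_1"]
patterns = [re.compile(p) for p in sitemap_follow]


def should_follow(loc: str) -> bool:
    """True if any pattern matches anywhere in the URL (assumed search semantics)."""
    return any(p.search(loc) for p in patterns)


followed = [loc for loc in nested_sitemaps if should_follow(loc)]
for loc in followed:
    print(loc)
```

Because the match is a substring-style search, the pattern sitemap_products_1 would also match a URL containing sitemap_products_10. A stricter pattern such as r"sitemap_products_1\.xml" avoids that.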
