What is the difference between sitemap_urls and sitemap_follow?
In Scrapy's SitemapSpider, sitemap_urls and sitemap_follow are two attributes that control how the spider discovers and processes URLs from sitemaps. Here is how they differ and how they work together:
1. sitemap_urls
Purpose: Specifies the initial sitemap URLs that the spider will start crawling from.
Type: A list of URLs (strings).
Behavior:
The spider downloads and parses the sitemap files listed in sitemap_urls.
These sitemaps can be either:
Sitemap Index Files: Files that contain links to other sitemaps (e.g., sitemap.xml).
URLset Files: Files that contain direct links to web pages (e.g., product-sitemap.xml).
2. sitemap_follow
Purpose: Filters which sitemaps the spider should follow when encountering a sitemap index file.
Type: A list of regex patterns (strings).
Behavior:
If a sitemap index file is encountered, the spider will only follow links to nested sitemaps that match one of the regex patterns in sitemap_follow.
If sitemap_follow is not specified, the spider will follow all nested sitemaps.
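The filtering rule above can be sketched with the standard re module. This is only an illustration of the matching behavior, not Scrapy's own code: the helper name should_follow and the example.com URLs are hypothetical, and the sketch assumes patterns are applied with search semantics (a match anywhere in the URL), which is why an unanchored pattern such as "products" is enough.

```python
import re


def should_follow(loc, patterns):
    """Return True if any regex pattern is found anywhere in the sitemap URL."""
    return any(re.search(pattern, loc) for pattern in patterns)


# Hypothetical nested sitemap URLs, used only for illustration.
nested = [
    "https://example.com/sitemap_products_1.xml",
    "https://example.com/sitemap_pages_1.xml",
    "https://example.com/sitemap_blogs_1.xml",
]

# One plain substring pattern and one anchored pattern.
patterns = ["products", r"blogs_\d+\.xml$"]

followed = [loc for loc in nested if should_follow(loc, patterns)]
print(followed)
```

Because an empty pattern matches every string, a filter list of [""] behaves like "follow everything", which is one way to picture the no-filter case.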
Example:
The sitemap at https://global.1more.com/sitemap.xml is a sitemap index that looks like this:
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- This is the parent sitemap linking to additional sitemaps for products, collections and pages as shown below. The sitemap can not be edited manually, but is kept up to date in real time. -->
  <sitemap><loc>https://global.1more.com/sitemap_products_1.xml?from=6633159229540&to=7350287007844</loc></sitemap>
  <sitemap><loc>https://global.1more.com/sitemap_pages_1.xml?from=81376641124&to=81377722468</loc></sitemap>
  <sitemap><loc>https://global.1more.com/sitemap_collections_1.xml?from=262647939172&to=262782058596</loc></sitemap>
  <sitemap><loc>https://global.1more.com/sitemap_blogs_1.xml</loc></sitemap>
</sitemapindex>
Here we want to follow only sitemap_products_1, so the spider looks like this:
from scrapy.spiders import SitemapSpider


class MoreSpider(SitemapSpider):
    name = "1more.com"
    # Entry point: the sitemap index file
    sitemap_urls = [
        'https://global.1more.com/sitemap.xml'
    ]
    # Regex filter: only nested sitemaps whose URL matches are followed
    sitemap_follow = ['sitemap_products_1']

    def parse(self, response):
        # Called for each page URL found in the followed sitemaps
        pass
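To see what this filter selects without running a crawl, the index shown earlier can be parsed with the standard library. This is a rough sketch under stated assumptions, not what Scrapy does internally: INDEX_XML is a trimmed copy of the index with the query strings removed, and the helpers sitemap_type and nested_sitemaps are hypothetical names introduced for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Namespace used by the sitemap protocol, in ElementTree's {uri}tag form.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Trimmed copy of the sitemap index (query strings dropped for brevity).
INDEX_XML = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://global.1more.com/sitemap_products_1.xml</loc></sitemap>
  <sitemap><loc>https://global.1more.com/sitemap_pages_1.xml</loc></sitemap>
  <sitemap><loc>https://global.1more.com/sitemap_collections_1.xml</loc></sitemap>
  <sitemap><loc>https://global.1more.com/sitemap_blogs_1.xml</loc></sitemap>
</sitemapindex>"""


def sitemap_type(root):
    """Return the root tag without its namespace: 'sitemapindex' or 'urlset'."""
    return root.tag.replace(NS, "")


def nested_sitemaps(root):
    """Collect the text of every <loc> element in the document."""
    return [loc.text.strip() for loc in root.iter(NS + "loc")]


root = ET.fromstring(INDEX_XML)
kind = sitemap_type(root)

# Same pattern as sitemap_follow in the spider above.
selected = [url for url in nested_sitemaps(root)
            if re.search("sitemap_products_1", url)]

print(kind)
print(selected)
```

Only the products sitemap passes the filter, so the pages, collections and blogs sitemaps would never be downloaded.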