Web Scraping with Python
Learn to extract data from websites using Python — requests and BeautifulSoup for static pages, handling pagination, and ethical scraping practices with AI assistance.
🔄 Recall Bridge: In the previous lesson, you automated data processing with pandas — reading, cleaning, and transforming spreadsheets. Now let’s get data from the web itself: scraping information from websites when no API or download is available.
Web scraping turns unstructured web pages into structured data. Price monitoring, job listing aggregation, news tracking, competitive analysis — if the data is on a web page, Python can extract it.
The Scraping Stack
pip install requests beautifulsoup4 lxml
| Library | Purpose |
|---|---|
| requests | Download web pages (HTTP requests) |
| BeautifulSoup | Parse HTML and extract data |
| lxml | Fast parser backend BeautifulSoup can use (optional; falls back to the built-in html.parser) |
Script 1: Basic Page Scraper
AI prompt:
Write a Python web scraper using requests and BeautifulSoup that: (1) Fetches a webpage at a given URL, (2) Extracts all product names and prices from the page (assume products are in elements with class “product” containing an h2 for the name and a span.price for the price), (3) Saves the data as a CSV with columns: name, price, url, scraped_date, (4) Handles: HTTP errors (non-200 status codes), connection timeouts, missing elements (some products might not have prices). Include proper headers with a User-Agent string.
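A sketch of what a response to that prompt might look like, trimmed to essentials. The class names (`product`, `span.price`), the `User-Agent` string, and the CSV columns follow the prompt's assumptions; swap in `"lxml"` for `"html.parser"` if you installed it:

```python
import csv
from datetime import date

import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "MyProjectName (email@example.com)"}

def fetch(url, timeout=10):
    """Download a page; raise on timeouts and non-200 status codes."""
    resp = requests.get(url, headers=HEADERS, timeout=timeout)
    resp.raise_for_status()  # non-200 -> requests.HTTPError
    return resp.text

def parse_products(html, url):
    """Extract one row per element with class 'product'."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for product in soup.find_all(class_="product"):
        name = product.find("h2")
        price = product.select_one("span.price")  # may be missing
        rows.append({
            "name": name.get_text(strip=True) if name else "",
            "price": price.get_text(strip=True) if price else "",
            "url": url,
            "scraped_date": date.today().isoformat(),
        })
    return rows

def save_csv(rows, path):
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f, fieldnames=["name", "price", "url", "scraped_date"])
        writer.writeheader()
        writer.writerows(rows)
```

Note how `parse_products` never assumes an element exists: a product without a price yields an empty string instead of crashing the whole run.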
Core BeautifulSoup operations:
| Operation | Code | What It Does |
|---|---|---|
| Find one element | soup.find("h2", class_="title") | First matching element |
| Find all elements | soup.find_all("div", class_="product") | All matching elements |
| Get text | element.get_text(strip=True) | Text content, whitespace stripped |
| Get attribute | element["href"] or element.get("href") | HTML attribute value |
| CSS selector | soup.select("div.product h2") | CSS selector syntax |
| Navigate | element.parent, element.next_sibling | Move through the DOM |
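Here are the operations from the table run against a small inline snippet (the HTML is made up for illustration):

```python
from bs4 import BeautifulSoup

html = """
<div class="product">
  <h2 class="title">Blue Mug</h2>
  <span class="price">$12.50</span>
  <a href="/products/blue-mug">Details</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

product = soup.find("div", class_="product")       # first matching element
title = product.find("h2", class_="title")
print(title.get_text(strip=True))                  # Blue Mug
print(product.find("a")["href"])                   # /products/blue-mug
print(soup.select("div.product h2")[0].get_text(strip=True))  # Blue Mug
print(title.parent["class"])                       # ['product'] — up the DOM
```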
Script 2: AI-Assisted Selector Discovery
When you don’t know the HTML structure, AI can help:
AI prompt:
Here’s the HTML of a product listing page: [PASTE THE HTML]. I need to extract: product name, price, rating, and product URL. Identify the CSS selectors or BeautifulSoup patterns to extract each piece of data. Show me a working code snippet.
This is where AI shines — parsing HTML structure and finding the right selectors is tedious for humans but trivial for AI.
Ethical Scraping Practices
| Practice | Why | How |
|---|---|---|
| Check robots.txt | Sites specify what’s allowed to scrape | urllib.robotparser, or fetch url + "/robots.txt" and read it |
| Respect rate limits | Don’t overload servers | time.sleep(random.uniform(1, 3)) between requests |
| Identify yourself | Let site owners contact you | Set User-Agent: MyProjectName (email@example.com) |
| Cache responses | Don’t re-request the same page | Save HTML locally, check before requesting |
| Check for APIs | APIs are more reliable and permitted | Search for site.com/api or developer docs |
| Check terms of service | Some sites explicitly prohibit scraping | Read ToS before building a scraper |
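The first two rows of the table can be combined with the standard library's `urllib.robotparser`. The rules below are a made-up example parsed from a string so the sketch stays offline; against a real site you would call `rp.set_url(...)` followed by `rp.read()`:

```python
import random
import time
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Live usage: rp.set_url("https://example.com/robots.txt"); rp.read()
# Here we parse hypothetical rules directly instead:
rp.parse("""
User-agent: *
Disallow: /admin/
Crawl-delay: 2
""".splitlines())

agent = "MyProjectName (email@example.com)"
print(rp.can_fetch(agent, "https://example.com/products"))     # True
print(rp.can_fetch(agent, "https://example.com/admin/users"))  # False

# Be polite between requests regardless of what robots.txt says:
time.sleep(random.uniform(1, 3))
```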
Script 3: Pagination Handler
AI prompt:
Write a pagination-aware scraper using requests and BeautifulSoup: (1) Start at a listing page URL, (2) Extract data from all products on the page, (3) Find and follow the “next page” link, (4) Stop when no next page exists, (5) Add 2-second random delays between pages, (6) Save results incrementally to CSV after each page (resume-safe), (7) Track progress: print page number, total items scraped, (8) If a page returns 0 items, log a warning but continue. Accept the starting URL and output CSV path as arguments.
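The core loop from that prompt, sketched with a fake fetch function standing in for `requests.get` so it runs without a network. The two-page site, the `rel="next"` link convention, and the `h2`-per-product layout are all assumptions:

```python
from bs4 import BeautifulSoup

# Stand-in for requests.get: two hypothetical listing pages.
FAKE_SITE = {
    "/page/1": '<div class="product"><h2>A</h2></div>'
               '<a rel="next" href="/page/2">Next</a>',
    "/page/2": '<div class="product"><h2>B</h2></div>',  # no next link
}

def scrape_all(start_url, fetch):
    items, url, page = [], start_url, 1
    while url:
        soup = BeautifulSoup(fetch(url), "html.parser")
        found = [p.find("h2").get_text(strip=True)
                 for p in soup.find_all(class_="product")]
        if not found:
            print(f"WARNING: page {page} returned 0 items")
        items.extend(found)
        print(f"page {page}: {len(items)} items total")
        next_link = soup.find("a", rel="next")
        url = next_link["href"] if next_link else None  # stop when absent
        page += 1
        # In a real scraper: time.sleep(random.uniform(1, 3)) here,
        # and append `found` to the CSV before moving on (resume-safe).
    return items

print(scrape_all("/page/1", FAKE_SITE.get))  # ['A', 'B']
```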
When NOT to Scrape
| Situation | Better Alternative |
|---|---|
| Site has a public API | Use the API — faster, more reliable, explicitly allowed |
| Data is behind login | Likely violates ToS — check first |
| Data is copyrighted content | Legal risk — consult ToS and applicable law |
| Site blocks all scrapers | They’ve explicitly said no — respect it |
| Data updates in real-time | APIs or websockets are more appropriate |
✅ Quick Check: You’re scraping a page, but requests.get() returns HTML that doesn’t contain the data you see in your browser. The page loads its data via JavaScript after the initial HTML arrives. What do you need? (Answer: requests + BeautifulSoup only see the initial HTML, before any JavaScript runs. Options: (1) Check whether the JavaScript fetches its data from an API — use the browser DevTools Network tab to find the endpoint, then call it directly with requests. This is usually the best approach. (2) Use Selenium or Playwright to render the page in a real browser — slower, but works for any page.)
Structuring Your Scraper
A well-structured scraper separates concerns:
# Structure your scraper into clear functions:
# 1. fetch_page(url) - handles HTTP requests, retries, delays
# 2. parse_products(html) - extracts data from HTML
# 3. save_results(data, filepath) - writes to CSV/Excel
# 4. main() - orchestrates the flow, handles pagination
AI prompt for structuring:
Refactor my scraper into clean functions: (1) fetch_page() that handles retries (3 attempts), delays, and error logging, (2) parse_page() that extracts data and returns a list of dictionaries, (3) save_data() that appends to a CSV incrementally, (4) main() that handles pagination and progress tracking. Add a --test flag that only scrapes the first page for testing.
Key Takeaways
- Always scrape ethically: check robots.txt, add delays between requests (1-3 seconds), set an honest User-Agent header, and check if an API exists first — aggressive scraping gets your IP blocked and may violate terms of service
- Web scrapers are inherently fragile because HTML structure changes without notice: build resilience with fallback selectors, result validation (alert if zero items found), and raw HTML logging so you can quickly update selectors when they break
- AI excels at the most tedious part of scraping — figuring out CSS selectors: paste the HTML into AI, describe what data you need, and AI identifies the exact selectors and writes the extraction code
Up Next
In the next lesson, you’ll learn API integration — connecting to web services directly (no HTML parsing needed) for more reliable data automation.