Lesson 4 · 12 min

Web Scraping with Python

Learn to extract data from websites using Python — requests and BeautifulSoup for static pages, handling pagination, and ethical scraping practices with AI assistance.

🔄 Recall Bridge: In the previous lesson, you automated data processing with pandas — reading, cleaning, and transforming spreadsheets. Now let’s get data from the web itself: scraping information from websites when no API or download is available.

Web scraping turns unstructured web pages into structured data. Price monitoring, job listing aggregation, news tracking, competitive analysis — if the data is on a web page, Python can extract it.

The Scraping Stack

pip install requests beautifulsoup4 lxml
| Library | Purpose |
|---|---|
| `requests` | Download web pages (HTTP requests) |
| `BeautifulSoup` | Parse HTML and extract data |
| `lxml` | Fast HTML parser (used by BeautifulSoup) |

Script 1: Basic Page Scraper

AI prompt:

Write a Python web scraper using requests and BeautifulSoup that: (1) Fetches a webpage at a given URL, (2) Extracts all product names and prices from the page (assume products are in elements with class “product” containing an h2 for the name and a span.price for the price), (3) Saves the data as a CSV with columns: name, price, url, scraped_date, (4) Handles: HTTP errors (non-200 status codes), connection timeouts, missing elements (some products might not have prices). Include proper headers with a User-Agent string.
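A sketch of what the generated script might look like. The `.product`, `h2`, and `span.price` selectors come from the hypothetical page structure described in the prompt; swap in the real site's selectors:

```python
import csv
from datetime import date

import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "MyProjectName (email@example.com)"}


def parse_products(html, page_url):
    """Extract product rows from HTML; tolerate missing prices."""
    soup = BeautifulSoup(html, "lxml")
    rows = []
    for product in soup.find_all(class_="product"):
        name = product.find("h2")
        price = product.find("span", class_="price")  # may be absent
        rows.append({
            "name": name.get_text(strip=True) if name else "",
            "price": price.get_text(strip=True) if price else "",
            "url": page_url,
            "scraped_date": date.today().isoformat(),
        })
    return rows


def scrape(url, out_path="products.csv"):
    try:
        resp = requests.get(url, headers=HEADERS, timeout=10)
        resp.raise_for_status()  # error on non-200 status codes
    except requests.RequestException as exc:
        print(f"Request failed: {exc}")
        return []
    rows = parse_products(resp.text, url)
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f, fieldnames=["name", "price", "url", "scraped_date"])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Keeping parsing separate from fetching makes the scraper testable against saved HTML, without hitting the network.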

Core BeautifulSoup operations:

| Operation | Code | What It Does |
|---|---|---|
| Find one element | `soup.find("h2", class_="title")` | First matching element |
| Find all elements | `soup.find_all("div", class_="product")` | All matching elements |
| Get text | `element.get_text(strip=True)` | Text content, whitespace stripped |
| Get attribute | `element["href"]` or `element.get("href")` | HTML attribute value |
| CSS selector | `soup.select("div.product h2")` | CSS selector syntax |
| Navigate | `element.parent`, `element.next_sibling` | Move through the DOM |
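These operations can be tried out on a small snippet; the markup below is made up for illustration:

```python
from bs4 import BeautifulSoup

html = """
<div class="product">
  <h2 class="title">Blue Mug</h2>
  <span class="price">$12.50</span>
  <a href="/products/blue-mug">Details</a>
</div>
"""

soup = BeautifulSoup(html, "lxml")

title = soup.find("h2", class_="title")            # first matching element
products = soup.find_all("div", class_="product")  # all matching elements
name = title.get_text(strip=True)                  # "Blue Mug"
href = soup.find("a")["href"]                      # attribute value
prices = soup.select("div.product span.price")     # CSS selector syntax
container = title.parent                           # DOM navigation

print(name, href, prices[0].get_text(strip=True))
```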

Script 2: AI-Assisted Selector Discovery

When you don’t know the HTML structure, AI can help:

AI prompt:

Here’s the HTML of a product listing page: [PASTE THE HTML]. I need to extract: product name, price, rating, and product URL. Identify the CSS selectors or BeautifulSoup patterns to extract each piece of data. Show me a working code snippet.

This is where AI shines — parsing HTML structure and finding the right selectors is tedious for humans but trivial for AI.

Ethical Scraping Practices

| Practice | Why | How |
|---|---|---|
| Check robots.txt | Sites specify what's allowed to scrape | `requests.get(url + "/robots.txt")` |
| Respect rate limits | Don't overload servers | `time.sleep(random.uniform(1, 3))` between requests |
| Identify yourself | Let site owners contact you | Set `User-Agent: MyProjectName (email@example.com)` |
| Cache responses | Don't re-request the same page | Save HTML locally, check before requesting |
| Check for APIs | APIs are more reliable and permitted | Search for `site.com/api` or developer docs |
| Check terms of service | Some sites explicitly prohibit scraping | Read the ToS before building a scraper |
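The robots.txt check doesn't need a third-party library: the standard library's `urllib.robotparser` parses the rules for you. A small sketch, assuming you've already fetched the robots.txt text:

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, url, user_agent="MyProjectName"):
    """Return True if robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

In practice you would fetch the rules once per site with `requests.get(url + "/robots.txt")` and cache the parser rather than re-downloading it for every request.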

Script 3: Pagination Handler

AI prompt:

Write a pagination-aware scraper using requests and BeautifulSoup: (1) Start at a listing page URL, (2) Extract data from all products on the page, (3) Find and follow the “next page” link, (4) Stop when no next page exists, (5) Add 2-second random delays between pages, (6) Save results incrementally to CSV after each page (resume-safe), (7) Track progress: print page number, total items scraped, (8) If a page returns 0 items, log a warning but continue. Accept the starting URL and output CSV path as arguments.
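A sketch of the loop this prompt describes. The `a.next` link and `.product` structure are assumptions; adjust them to the real page:

```python
import csv
import random
import time

import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "MyProjectName (email@example.com)"}


def parse_listing(html):
    """Return (product names, next-page URL or None) for one listing page."""
    soup = BeautifulSoup(html, "lxml")
    names = [p.find("h2").get_text(strip=True)
             for p in soup.find_all(class_="product")]
    next_link = soup.find("a", class_="next")  # assumed "next page" link
    return names, (next_link["href"] if next_link else None)


def scrape_all(start_url, csv_path):
    url, page, total = start_url, 1, 0
    while url:
        resp = requests.get(url, headers=HEADERS, timeout=10)
        resp.raise_for_status()
        names, url = parse_listing(resp.text)
        if not names:
            print(f"WARNING: page {page} returned 0 items")
        # Append after every page so an interrupted run keeps its progress
        with open(csv_path, "a", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows([n] for n in names)
        total += len(names)
        print(f"Page {page}: {total} items scraped")
        page += 1
        if url:
            time.sleep(2 + random.random())  # ~2-3 s between pages
```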

When NOT to Scrape

| Situation | Better Alternative |
|---|---|
| Site has a public API | Use the API — faster, more reliable, explicitly allowed |
| Data is behind login | Likely violates ToS — check first |
| Data is copyrighted content | Legal risk — consult ToS and applicable law |
| Site blocks all scrapers | They've explicitly said no — respect it |
| Data updates in real-time | APIs or websockets are more appropriate |

Quick Check: You’re scraping a page, but requests.get() returns HTML that doesn’t contain the data you see in your browser. The page loads its data via JavaScript after the initial HTML arrives. What do you need?

Answer: requests + BeautifulSoup only see the initial HTML, before any JavaScript runs, so they can’t extract JavaScript-rendered data. Options: (1) Check whether the JavaScript fetches its data from an API — use the browser DevTools Network tab to find the endpoint, then call it directly with requests. This is the best approach. (2) Use Selenium or Playwright to render the page with a real browser. This is slower but works for any page.

Structuring Your Scraper

A well-structured scraper separates concerns:

# Structure your scraper into clear functions:
# 1. fetch_page(url) - handles HTTP requests, retries, delays
# 2. parse_products(html) - extracts data from HTML
# 3. save_results(data, filepath) - writes to CSV/Excel
# 4. main() - orchestrates the flow, handles pagination
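A minimal skeleton of those four functions. The retry count, delay range, and `div.product h2` selector are placeholders to replace with your site's specifics:

```python
import csv
import random
import time

import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "MyProjectName (email@example.com)"}


def fetch_page(url, retries=3):
    """Download a page, retrying on failure with a polite delay."""
    for attempt in range(1, retries + 1):
        try:
            resp = requests.get(url, headers=HEADERS, timeout=10)
            resp.raise_for_status()
            return resp.text
        except requests.RequestException as exc:
            print(f"Attempt {attempt}/{retries} failed: {exc}")
            time.sleep(random.uniform(1, 3))
    return None


def parse_products(html):
    """Turn HTML into a list of dicts (selector is a placeholder)."""
    soup = BeautifulSoup(html, "lxml")
    return [{"name": p.get_text(strip=True)}
            for p in soup.select("div.product h2")]


def save_results(rows, filepath):
    """Append rows to CSV, writing the header only for a new file."""
    with open(filepath, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["name"])
        if f.tell() == 0:
            writer.writeheader()
        writer.writerows(rows)


def main(start_url, out_path):
    html = fetch_page(start_url)
    if html is not None:
        save_results(parse_products(html), out_path)
```

Because each function does one job, you can unit-test parsing on saved HTML and swap the fetch layer (e.g. for Playwright) without touching the rest.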

AI prompt for structuring:

Refactor my scraper into clean functions: (1) fetch_page() that handles retries (3 attempts), delays, and error logging, (2) parse_page() that extracts data and returns a list of dictionaries, (3) save_data() that appends to a CSV incrementally, (4) main() that handles pagination and progress tracking. Add a --test flag that only scrapes the first page for testing.

Key Takeaways

  • Always scrape ethically: check robots.txt, add delays between requests (1-3 seconds), set an honest User-Agent header, and check if an API exists first — aggressive scraping gets your IP blocked and may violate terms of service
  • Web scrapers are inherently fragile because HTML structure changes without notice: build resilience with fallback selectors, result validation (alert if zero items found), and raw HTML logging so you can quickly update selectors when they break
  • AI excels at the most tedious part of scraping — figuring out CSS selectors: paste the HTML into AI, describe what data you need, and AI identifies the exact selectors and writes the extraction code

Up Next

In the next lesson, you’ll learn API integration — connecting to web services directly (no HTML parsing needed) for more reliable data automation.

Knowledge Check

1. You want to scrape product prices from an online store to track price changes. You write a script that sends 1,000 requests in rapid succession. The website blocks your IP address after 50 requests. What went wrong and how do you fix it?

2. You're scraping a product listing page. You ask AI to write the scraper. It generates code targeting `div.product-card h2.title`. The script works today but returns empty results next week. What happened?

3. A website has product data across 50 pages (pagination). Each page shows 20 products. You need all 1,000 products. What's the best approach?
