Web Scraping with Python
Learn to extract data from websites using Python — requests and BeautifulSoup for static pages, handling pagination, and ethical scraping practices with AI assistance.
🔄 Recall Bridge: In the previous lesson, you automated data processing with pandas — reading, cleaning, and transforming spreadsheets. Now let’s get data from the web itself: scraping information from websites when no API or download is available.
Web scraping turns unstructured web pages into structured data. Price monitoring, job listing aggregation, news tracking, competitive analysis — if the data is on a web page, Python can extract it.
The Scraping Stack
pip install requests beautifulsoup4 lxml
| Library | Purpose |
|---|---|
| requests | Download web pages (HTTP requests) |
| BeautifulSoup | Parse HTML and extract data |
| lxml | Fast parser backend BeautifulSoup can use (optional; falls back to the built-in html.parser) |
Script 1: Basic Page Scraper
AI prompt:
Write a Python web scraper using requests and BeautifulSoup that: (1) Fetches a webpage at a given URL, (2) Extracts all product names and prices from the page (assume products are in elements with class “product” containing an h2 for the name and a span.price for the price), (3) Saves the data as a CSV with columns: name, price, url, scraped_date, (4) Handles: HTTP errors (non-200 status codes), connection timeouts, missing elements (some products might not have prices). Include proper headers with a User-Agent string.
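A sketch of what a response to that prompt might look like, trimmed to essentials. The class names (`product`, `span.price`), the `User-Agent` string, and the CSV columns follow the prompt's assumptions; swap in `"lxml"` for `"html.parser"` if you installed it:

```python
import csv
from datetime import date

import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "MyProjectName (email@example.com)"}

def fetch(url, timeout=10):
    """Download a page; raise on timeouts and non-200 status codes."""
    resp = requests.get(url, headers=HEADERS, timeout=timeout)
    resp.raise_for_status()  # non-200 -> requests.HTTPError
    return resp.text

def parse_products(html, url):
    """Extract one row per element with class 'product'."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for product in soup.find_all(class_="product"):
        name = product.find("h2")
        price = product.select_one("span.price")  # may be missing
        rows.append({
            "name": name.get_text(strip=True) if name else "",
            "price": price.get_text(strip=True) if price else "",
            "url": url,
            "scraped_date": date.today().isoformat(),
        })
    return rows

def save_csv(rows, path):
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f, fieldnames=["name", "price", "url", "scraped_date"])
        writer.writeheader()
        writer.writerows(rows)
```

Note how `parse_products` never assumes an element exists: a product without a price yields an empty string instead of crashing the whole run.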
Core BeautifulSoup operations:
| Operation | Code | What It Does |
|---|---|---|
| Find one element | soup.find("h2", class_="title") | First matching element |
| Find all elements | soup.find_all("div", class_="product") | All matching elements |
| Get text | element.get_text(strip=True) | Text content, whitespace stripped |
| Get attribute | element["href"] or element.get("href") | HTML attribute value |
| CSS selector | soup.select("div.product h2") | CSS selector syntax |
| Navigate | element.parent, element.next_sibling | Move through the DOM |
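Here are the operations from the table run against a small inline snippet (the HTML is made up for illustration):

```python
from bs4 import BeautifulSoup

html = """
<div class="product">
  <h2 class="title">Blue Mug</h2>
  <span class="price">$12.50</span>
  <a href="/products/blue-mug">Details</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

product = soup.find("div", class_="product")       # first matching element
title = product.find("h2", class_="title")
print(title.get_text(strip=True))                  # Blue Mug
print(product.find("a")["href"])                   # /products/blue-mug
print(soup.select("div.product h2")[0].get_text(strip=True))  # Blue Mug
print(title.parent["class"])                       # ['product'] — up the DOM
```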
Script 2: AI-Assisted Selector Discovery
When you don’t know the HTML structure, AI can help:
AI prompt:
Here’s the HTML of a product listing page: [PASTE THE HTML]. I need to extract: product name, price, rating, and product URL. Identify the CSS selectors or BeautifulSoup patterns to extract each piece of data. Show me a working code snippet.
This is where AI shines — parsing HTML structure and finding the right selectors is tedious for humans but trivial for AI.
Ethical Scraping Practices
| Practice | Why | How |
|---|---|---|
| Check robots.txt | Sites specify what’s allowed to scrape | urllib.robotparser, or fetch url + "/robots.txt" and read it |
| Respect rate limits | Don’t overload servers | time.sleep(random.uniform(1, 3)) between requests |
| Identify yourself | Let site owners contact you | Set User-Agent: MyProjectName (email@example.com) |
| Cache responses | Don’t re-request the same page | Save HTML locally, check before requesting |
| Check for APIs | APIs are more reliable and permitted | Search for site.com/api or developer docs |
| Check terms of service | Some sites explicitly prohibit scraping | Read ToS before building a scraper |
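The first two rows of the table can be combined with the standard library's `urllib.robotparser`. The rules below are a made-up example parsed from a string so the sketch stays offline; against a real site you would call `rp.set_url(...)` followed by `rp.read()`:

```python
import random
import time
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Live usage: rp.set_url("https://example.com/robots.txt"); rp.read()
# Here we parse hypothetical rules directly instead:
rp.parse("""
User-agent: *
Disallow: /admin/
Crawl-delay: 2
""".splitlines())

agent = "MyProjectName (email@example.com)"
print(rp.can_fetch(agent, "https://example.com/products"))     # True
print(rp.can_fetch(agent, "https://example.com/admin/users"))  # False

# Be polite between requests regardless of what robots.txt says:
time.sleep(random.uniform(1, 3))
```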
Script 3: Pagination Handler
AI prompt:
Write a pagination-aware scraper using requests and BeautifulSoup: (1) Start at a listing page URL, (2) Extract data from all products on the page, (3) Find and follow the “next page” link, (4) Stop when no next page exists, (5) Add 2-second random delays between pages, (6) Save results incrementally to CSV after each page (resume-safe), (7) Track progress: print page number, total items scraped, (8) If a page returns 0 items, log a warning but continue. Accept the starting URL and output CSV path as arguments.
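The core loop from that prompt, sketched with a fake fetch function standing in for `requests.get` so it runs without a network. The two-page site, the `rel="next"` link convention, and the `h2`-per-product layout are all assumptions:

```python
from bs4 import BeautifulSoup

# Stand-in for requests.get: two hypothetical listing pages.
FAKE_SITE = {
    "/page/1": '<div class="product"><h2>A</h2></div>'
               '<a rel="next" href="/page/2">Next</a>',
    "/page/2": '<div class="product"><h2>B</h2></div>',  # no next link
}

def scrape_all(start_url, fetch):
    items, url, page = [], start_url, 1
    while url:
        soup = BeautifulSoup(fetch(url), "html.parser")
        found = [p.find("h2").get_text(strip=True)
                 for p in soup.find_all(class_="product")]
        if not found:
            print(f"WARNING: page {page} returned 0 items")
        items.extend(found)
        print(f"page {page}: {len(items)} items total")
        next_link = soup.find("a", rel="next")
        url = next_link["href"] if next_link else None  # stop when absent
        page += 1
        # In a real scraper: time.sleep(random.uniform(1, 3)) here,
        # and append `found` to the CSV before moving on (resume-safe).
    return items

print(scrape_all("/page/1", FAKE_SITE.get))  # ['A', 'B']
```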
When NOT to Scrape
| Situation | Better Alternative |
|---|---|
| Site has a public API | Use the API — faster, more reliable, explicitly allowed |
| Data is behind login | Likely violates ToS — check first |
| Data is copyrighted content | Legal risk — consult ToS and applicable law |
| Site blocks all scrapers | They’ve explicitly said no — respect it |
| Data updates in real-time | APIs or websockets are more appropriate |
✅ Quick Check: You’re scraping a page, but requests.get() returns HTML that doesn’t contain the data you see in your browser. The page loads its data via JavaScript after the initial HTML arrives. What do you need? (Answer: requests + BeautifulSoup only see the initial HTML, before any JavaScript runs. Options: (1) Check whether the JavaScript fetches its data from an API — use the browser DevTools Network tab to find the endpoint, then call it directly with requests. This is usually the best approach. (2) Use Selenium or Playwright to render the page in a real browser — slower, but works for any page.)
Structuring Your Scraper
A well-structured scraper separates concerns:
# Structure your scraper into clear functions:
# 1. fetch_page(url) - handles HTTP requests, retries, delays
# 2. parse_products(html) - extracts data from HTML
# 3. save_results(data, filepath) - writes to CSV/Excel
# 4. main() - orchestrates the flow, handles pagination
AI prompt for structuring:
Refactor my scraper into clean functions: (1) fetch_page() that handles retries (3 attempts), delays, and error logging, (2) parse_page() that extracts data and returns a list of dictionaries, (3) save_data() that appends to a CSV incrementally, (4) main() that handles pagination and progress tracking. Add a --test flag that only scrapes the first page for testing.
Key Takeaways
- Always scrape ethically: check robots.txt, add delays between requests (1-3 seconds), set an honest User-Agent header, and check if an API exists first — aggressive scraping gets your IP blocked and may violate terms of service
- Web scrapers are inherently fragile because HTML structure changes without notice: build resilience with fallback selectors, result validation (alert if zero items found), and raw HTML logging so you can quickly update selectors when they break
- AI excels at the most tedious part of scraping — figuring out CSS selectors: paste the HTML into AI, describe what data you need, and AI identifies the exact selectors and writes the extraction code
Up Next
In the next lesson, you’ll learn API integration — connecting to web services directly (no HTML parsing needed) for more reliable data automation.