CLAUDE.md rules for Python web scraping projects using httpx, BeautifulSoup, Playwright, and Scrapy
# Python Web Scraping
## Stack
- Python 3.12+
- httpx for HTTP requests (async client by default)
- BeautifulSoup4 + lxml for HTML parsing
- Playwright for JavaScript-rendered pages
- Scrapy for large-scale crawls
- Pydantic for data validation
- uv for dependency management
## Project Structure
```
src/
  scrapers/    # One module per target site
  models/      # Pydantic models for scraped data
  pipelines/   # Data cleaning and export (CSV, JSON, DB)
  middleware/  # Retry logic, proxy rotation, rate limiting
  utils/       # URL helpers, text normalization
tests/
  fixtures/    # Saved HTML snapshots for offline tests
```
## HTTP Requests
- Always use a single shared `httpx.AsyncClient` instance for connection pooling
- Set a descriptive `User-Agent` header — never use the default
- Set explicit timeouts with a default, e.g. `httpx.Timeout(30.0, connect=10.0)` (`httpx.Timeout` requires either a default or all four values)
- Handle status codes explicitly — do not silently ignore 4xx/5xx
- Use `response.raise_for_status()` then catch `httpx.HTTPStatusError`
- Prefer `response.text` over `response.content` for HTML
```python
async with httpx.AsyncClient(
    headers={"User-Agent": "MyBot/1.0 (+https://example.com/bot)"},
    timeout=httpx.Timeout(30.0, connect=10.0),  # 10s connect, 30s default for read/write/pool
    follow_redirects=True,
    limits=httpx.Limits(max_connections=20, max_keepalive_connections=10),
) as client:
    response = await client.get(url)
    response.raise_for_status()
```
## Parsing
- Use `lxml` parser for speed: `BeautifulSoup(html, "lxml")`
- Prefer CSS selectors (`select` / `select_one`) over `find` / `find_all`
- Always check for `None` before accessing `.text` or attributes
- Extract data into Pydantic models immediately — no raw dicts
- Strip and normalize whitespace: `.get_text(strip=True)`
```python
from pydantic import BaseModel

class Product(BaseModel):
    name: str
    price: float
    url: str
    in_stock: bool

el = soup.select_one("h1.product-title")
name = el.get_text(strip=True) if el else ""
```
## Rate Limiting & Politeness
- Respect `robots.txt` — check with `urllib.robotparser` before crawling (see the sketch after the code block below)
- Add a delay between requests: `asyncio.sleep(1)` minimum
- Use exponential backoff on retries (tenacity or manual)
- Limit concurrent requests with `asyncio.Semaphore`
- Check for and respect `Retry-After` headers on 429 responses
- Stop immediately if you receive a 403 with a CAPTCHA page
```python
semaphore = asyncio.Semaphore(5)

async def fetch(client: httpx.AsyncClient, url: str) -> str:
    async with semaphore:
        await asyncio.sleep(1)
        response = await client.get(url)
        response.raise_for_status()
        return response.text
```
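To honor the `robots.txt` rule above, a minimal per-host check with the standard library's `urllib.robotparser`; the bot name and URLs are placeholders:
```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "MyBot/1.0"  # placeholder; keep consistent with the client's User-Agent header

robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()  # fetch and parse once per host, before crawling

if not robots.can_fetch(USER_AGENT, "https://example.com/products/1"):
    raise RuntimeError("URL disallowed by robots.txt")

# If the site declares a crawl delay, use it instead of the 1-second default
delay = robots.crawl_delay(USER_AGENT) or 1
```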
## JavaScript-Rendered Pages
- Only use Playwright when content is not in the initial HTML
- Always run headless: `chromium.launch(headless=True)` (headless is the Playwright default)
- Wait for specific selectors, not arbitrary timeouts: `page.wait_for_selector(".data")`
- Block unnecessary resources to speed up loads:
```python
await page.route("**/*.{png,jpg,jpeg,gif,svg,css,woff,woff2}", lambda route: route.abort())
```
- Close browser contexts after each task to avoid memory leaks
- Extract HTML with `page.content()` then parse with BeautifulSoup — avoid chaining Playwright selectors for complex extraction
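A sketch of the flow described above, assuming a headless Chromium launch and a placeholder `.data` selector:
```python
from bs4 import BeautifulSoup
from playwright.async_api import async_playwright

async def fetch_rendered(url: str) -> BeautifulSoup:
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=True)
        context = await browser.new_context()
        page = await context.new_page()
        # Skip images, stylesheets, and fonts to speed up the load
        await page.route(
            "**/*.{png,jpg,jpeg,gif,svg,css,woff,woff2}",
            lambda route: route.abort(),
        )
        await page.goto(url)
        await page.wait_for_selector(".data")  # placeholder selector for the rendered content
        html = await page.content()
        await context.close()
        await browser.close()
        return BeautifulSoup(html, "lxml")
```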
## Error Handling
- Retry transient failures (429, 500, 502, 503, 504, timeouts) up to 3 times
- Log the URL and status code on every failure
- Never crash on a single page failure — skip and continue
- Collect errors in a list and report a summary at the end
- Save failed URLs to a file for re-processing
```python
RETRYABLE = {429, 500, 502, 503, 504}

async def fetch_with_retry(client, url, max_retries=3):
    for attempt in range(max_retries):
        try:
            resp = await client.get(url)
            resp.raise_for_status()
            return resp.text
        except httpx.TimeoutException:
            pass  # timeouts are transient: fall through to the backoff below
        except httpx.HTTPStatusError as e:
            if e.response.status_code not in RETRYABLE:
                raise
        await asyncio.sleep(2 ** attempt)
    raise RuntimeError(f"Failed after {max_retries} retries: {url}")
```
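As an alternative to the manual loop above, a tenacity-based sketch that retries transport-level failures (timeouts, connection errors); retrying specific status codes would still need a custom retry predicate:
```python
import httpx
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

@retry(
    retry=retry_if_exception_type(httpx.TransportError),  # timeouts, connection errors
    wait=wait_exponential(multiplier=1, min=1, max=30),   # exponential backoff, capped at 30s
    stop=stop_after_attempt(3),
    reraise=True,  # surface the original exception after the last attempt
)
async def fetch_once(client: httpx.AsyncClient, url: str) -> str:
    resp = await client.get(url)
    resp.raise_for_status()
    return resp.text
```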
## Data Validation & Export
- Validate every scraped record with Pydantic — drop or flag invalid rows
- Export to structured formats: JSON Lines (`.jsonl`), CSV, or directly to a database
- Include metadata: scrape timestamp, source URL, page hash
- Deduplicate by a natural key (URL, product ID, etc.)
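A validate-then-export sketch to JSON Lines, reusing the `Product` model from the Parsing section and assuming Pydantic v2 (`model_dump`); the row format (parsed fields paired with page HTML) and the dedup key are illustrative:
```python
import hashlib
import json
import logging
from datetime import datetime, timezone

from pydantic import ValidationError

def export_jsonl(rows: list[tuple[dict, str]], path: str) -> None:
    """Write validated records as JSON Lines; each row is a (raw_fields, page_html) pair."""
    seen: set[str] = set()
    with open(path, "w", encoding="utf-8") as f:
        for raw, html in rows:
            try:
                product = Product(**raw)  # the Pydantic model from the Parsing section
            except ValidationError as exc:
                logging.warning("Dropping invalid row %s: %s", raw.get("url"), exc)
                continue
            if product.url in seen:  # deduplicate on the natural key (URL here)
                continue
            seen.add(product.url)
            record = product.model_dump() | {
                "scraped_at": datetime.now(timezone.utc).isoformat(),
                "source_url": product.url,
                "page_hash": hashlib.sha256(html.encode()).hexdigest(),
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```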
## Testing
- Save HTML fixtures for offline testing — never hit live sites in CI
- Test parsing logic against fixtures with known expected output
- Test edge cases: missing fields, empty pages, malformed HTML, different page layouts
- Use `respx` to mock httpx requests in tests
- Run `mypy --strict` and `ruff check` in CI
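Sketches of both testing approaches, assuming pytest-asyncio for the async test; the `parse_product` / `fetch` helpers and module path are hypothetical stand-ins for this project's own code:
```python
from pathlib import Path

import httpx
import pytest
import respx
from bs4 import BeautifulSoup

from src.scrapers.example_site import fetch, parse_product  # hypothetical module

FIXTURES = Path(__file__).parent / "fixtures"

def test_parse_product_from_fixture():
    # Parse a saved HTML snapshot; expected values depend on the snapshot's contents
    html = (FIXTURES / "product_page.html").read_text(encoding="utf-8")
    product = parse_product(BeautifulSoup(html, "lxml"))
    assert product.name
    assert product.price > 0

@pytest.mark.asyncio
@respx.mock
async def test_fetch_against_mocked_route():
    respx.get("https://example.com/p/1").mock(
        return_value=httpx.Response(200, text="<html></html>")
    )
    async with httpx.AsyncClient() as client:
        body = await fetch(client, "https://example.com/p/1")
    assert body == "<html></html>"
```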
## Anti-Patterns to Avoid
- Do not use `requests` — use `httpx` (async, HTTP/2 support)
- Do not use regex to parse HTML — use a proper parser
- Do not hardcode delays with `time.sleep` in async code — use `asyncio.sleep`
- Do not store raw HTML blobs in the database — extract structured data
- Do not ignore `robots.txt` or ToS
- Do not run scrapers without a rate limit