CLAUDE.md rules for Python web scraping projects using httpx, BeautifulSoup, Playwright, and Scrapy
# Python Web Scraping
## Stack
- Python 3.12+
- httpx for HTTP requests (async client by default)
- BeautifulSoup4 + lxml for HTML parsing
- Playwright for JavaScript-rendered pages
- Scrapy for large-scale crawls
- Pydantic for data validation
- uv for dependency management
## Project Structure
```
src/
  scrapers/    # One module per target site
  models/      # Pydantic models for scraped data
  pipelines/   # Data cleaning and export (CSV, JSON, DB)
  middleware/  # Retry logic, proxy rotation, rate limiting
  utils/       # URL helpers, text normalization
tests/
  fixtures/    # Saved HTML snapshots for offline tests
```
## HTTP Requests
- Always use a single shared `httpx.AsyncClient` instance for connection pooling
- Set a descriptive `User-Agent` header — never use the default
- Set explicit timeouts with a default, e.g. `httpx.Timeout(30.0, connect=10.0)` (`httpx.Timeout` requires either a default or all four values)
- Handle status codes explicitly — do not silently ignore 4xx/5xx
- Use `response.raise_for_status()` then catch `httpx.HTTPStatusError`
- Prefer `response.text` over `response.content` for HTML
```python
async with httpx.AsyncClient(
    headers={"User-Agent": "MyBot/1.0 (+https://example.com/bot)"},
    timeout=httpx.Timeout(30.0, connect=10.0),  # 10s connect, 30s default for read/write/pool
    follow_redirects=True,
    limits=httpx.Limits(max_connections=20, max_keepalive_connections=10),
) as client:
    response = await client.get(url)
    response.raise_for_status()
```
## Parsing
- Use `lxml` parser for speed: `BeautifulSoup(html, "lxml")`
- Prefer CSS selectors (`select` / `select_one`) over `find` / `find_all`
- Always check for `None` before accessing `.text` or attributes
- Extract data into Pydantic models immediately — no raw dicts
- Strip and normalize whitespace: `.get_text(strip=True)`
```python
from pydantic import BaseModel

class Product(BaseModel):
    name: str
    price: float
    url: str
    in_stock: bool

el = soup.select_one("h1.product-title")
name = el.get_text(strip=True) if el else ""
```
## Rate Limiting & Politeness
- Respect `robots.txt` — check with `urllib.robotparser` before crawling (see the sketch after the code block below)
- Add a delay between requests: `asyncio.sleep(1)` minimum
- Use exponential backoff on retries (tenacity or manual)
- Limit concurrent requests with `asyncio.Semaphore`
- Check for and respect `Retry-After` headers on 429 responses
- Stop immediately if you receive a 403 with a CAPTCHA page
```python
semaphore = asyncio.Semaphore(5)

async def fetch(client: httpx.AsyncClient, url: str) -> str:
    async with semaphore:
        await asyncio.sleep(1)
        response = await client.get(url)
        response.raise_for_status()
        return response.text
```
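To honor the `robots.txt` rule above, a minimal per-host check with the standard library's `urllib.robotparser`; the bot name and URLs are placeholders:
```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "MyBot/1.0"  # placeholder; keep consistent with the client's User-Agent header

robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()  # fetch and parse once per host, before crawling

if not robots.can_fetch(USER_AGENT, "https://example.com/products/1"):
    raise RuntimeError("URL disallowed by robots.txt")

# If the site declares a crawl delay, use it instead of the 1-second default
delay = robots.crawl_delay(USER_AGENT) or 1
```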
## JavaScript-Rendered Pages
- Only use Playwright when content is not in the initial HTML
- Always run headless: `chromium.launch(headless=True)` (headless is the Playwright default)
- Wait for specific selectors, not arbitrary timeouts: `page.wait_for_selector(".data")`
- Block unnecessary resources to speed up loads:
```python
await page.route("**/*.{png,jpg,jpeg,gif,svg,css,woff,woff2}", lambda route: route.abort())
```
- Close browser contexts after each task to avoid memory leaks
- Extract HTML with `page.content()` then parse with BeautifulSoup — avoid chaining Playwright selectors for complex extraction
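A sketch of the flow described above, assuming a headless Chromium launch and a placeholder `.data` selector:
```python
from bs4 import BeautifulSoup
from playwright.async_api import async_playwright

async def fetch_rendered(url: str) -> BeautifulSoup:
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=True)
        context = await browser.new_context()
        page = await context.new_page()
        # Skip images, stylesheets, and fonts to speed up the load
        await page.route(
            "**/*.{png,jpg,jpeg,gif,svg,css,woff,woff2}",
            lambda route: route.abort(),
        )
        await page.goto(url)
        await page.wait_for_selector(".data")  # placeholder selector for the rendered content
        html = await page.content()
        await context.close()
        await browser.close()
        return BeautifulSoup(html, "lxml")
```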
## Error Handling
- Retry transient failures (429, 500, 502, 503, 504, timeouts) up to 3 times
- Log the URL and status code on every failure
- Never crash on a single page failure — skip and continue
- Collect errors in a list and report a summary at the end
- Save failed URLs to a file for re-processing
```python
RETRYABLE = {429, 500, 502, 503, 504}

async def fetch_with_retry(client, url, max_retries=3):
    for attempt in range(max_retries):
        try:
            resp = await client.get(url)
            resp.raise_for_status()
            return resp.text
        except httpx.TimeoutException:
            pass  # timeouts are transient: fall through to the backoff below
        except httpx.HTTPStatusError as e:
            if e.response.status_code not in RETRYABLE:
                raise
        await asyncio.sleep(2 ** attempt)
    raise RuntimeError(f"Failed after {max_retries} retries: {url}")
```
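As an alternative to the manual loop above, a tenacity-based sketch that retries transport-level failures (timeouts, connection errors); retrying specific status codes would still need a custom retry predicate:
```python
import httpx
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

@retry(
    retry=retry_if_exception_type(httpx.TransportError),  # timeouts, connection errors
    wait=wait_exponential(multiplier=1, min=1, max=30),   # exponential backoff, capped at 30s
    stop=stop_after_attempt(3),
    reraise=True,  # surface the original exception after the last attempt
)
async def fetch_once(client: httpx.AsyncClient, url: str) -> str:
    resp = await client.get(url)
    resp.raise_for_status()
    return resp.text
```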
## Data Validation & Export
- Validate every scraped record with Pydantic — drop or flag invalid rows
- Export to structured formats: JSON Lines (`.jsonl`), CSV, or directly to a database
- Include metadata: scrape timestamp, source URL, page hash
- Deduplicate by a natural key (URL, product ID, etc.)
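A validate-then-export sketch to JSON Lines, reusing the `Product` model from the Parsing section and assuming Pydantic v2 (`model_dump`); the row format (parsed fields paired with page HTML) and the dedup key are illustrative:
```python
import hashlib
import json
import logging
from datetime import datetime, timezone

from pydantic import ValidationError

def export_jsonl(rows: list[tuple[dict, str]], path: str) -> None:
    """Write validated records as JSON Lines; each row is a (raw_fields, page_html) pair."""
    seen: set[str] = set()
    with open(path, "w", encoding="utf-8") as f:
        for raw, html in rows:
            try:
                product = Product(**raw)  # the Pydantic model from the Parsing section
            except ValidationError as exc:
                logging.warning("Dropping invalid row %s: %s", raw.get("url"), exc)
                continue
            if product.url in seen:  # deduplicate on the natural key (URL here)
                continue
            seen.add(product.url)
            record = product.model_dump() | {
                "scraped_at": datetime.now(timezone.utc).isoformat(),
                "source_url": product.url,
                "page_hash": hashlib.sha256(html.encode()).hexdigest(),
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```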
## Testing
- Save HTML fixtures for offline testing — never hit live sites in CI
- Test parsing logic against fixtures with known expected output
- Test edge cases: missing fields, empty pages, malformed HTML, different page layouts
- Use `respx` to mock httpx requests in tests
- Run `mypy --strict` and `ruff check` in CI
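Sketches of both testing approaches, assuming pytest-asyncio for the async test; the `parse_product` / `fetch` helpers and module path are hypothetical stand-ins for this project's own code:
```python
from pathlib import Path

import httpx
import pytest
import respx
from bs4 import BeautifulSoup

from src.scrapers.example_site import fetch, parse_product  # hypothetical module

FIXTURES = Path(__file__).parent / "fixtures"

def test_parse_product_from_fixture():
    # Parse a saved HTML snapshot; expected values depend on the snapshot's contents
    html = (FIXTURES / "product_page.html").read_text(encoding="utf-8")
    product = parse_product(BeautifulSoup(html, "lxml"))
    assert product.name
    assert product.price > 0

@pytest.mark.asyncio
@respx.mock
async def test_fetch_against_mocked_route():
    respx.get("https://example.com/p/1").mock(
        return_value=httpx.Response(200, text="<html></html>")
    )
    async with httpx.AsyncClient() as client:
        body = await fetch(client, "https://example.com/p/1")
    assert body == "<html></html>"
```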
## Anti-Patterns to Avoid
- Do not use `requests` — use `httpx` (async, HTTP/2 support)
- Do not use regex to parse HTML — use a proper parser
- Do not hardcode delays with `time.sleep` in async code — use `asyncio.sleep`
- Do not store raw HTML blobs in the database — extract structured data
- Do not ignore `robots.txt` or ToS
- Do not run scrapers without a rate limit