Python BeautifulSoup Module - Complete Tutorial
Table of Contents
- What Is BeautifulSoup
- Installation
- HTML Basics
- Parsing HTML
- Finding Elements
- Navigating the Tree
- Extracting Content
- Modifying HTML
- Web Scraping Patterns
- Ethics and Safety
- Quick Reference
What Is BeautifulSoup
BeautifulSoup (bs4) is a Python library for parsing HTML and XML. It builds a parse tree that makes it easy to find and extract data.
What it can do:
- Parse messy HTML
- Find elements by tag, class, id, or CSS selectors
- Navigate parent, child, and sibling relationships
- Extract text and attributes
- Modify HTML when needed
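The capabilities above can be seen in a minimal sketch; the HTML snippet here is hypothetical, made up just to exercise each feature:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only to demonstrate the features listed above
html = '<div class="card" id="intro"><h1>Title</h1><p>Hi</p><a href="/about">About</a></div>'
soup = BeautifulSoup(html, "html.parser")

print(soup.h1.get_text())                  # find by tag → Title
print(soup.find(id="intro")["class"])      # attributes; class is always a list → ['card']
print(soup.select_one("a[href]")["href"])  # CSS selector plus attribute → /about
```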
Installation
```shell
pip install beautifulsoup4
```

Optional parsers:

```shell
pip install lxml html5lib
```

Parser comparison:
| Parser | Pros | Cons |
|---|---|---|
| html.parser | Built in | Slower on large pages |
| lxml | Fast | External dependency |
| html5lib | Most tolerant | Slowest |
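The parser is chosen by name in the `BeautifulSoup` constructor. A small sketch with deliberately broken markup (the comments about lxml and html5lib assume those packages are installed):

```python
from bs4 import BeautifulSoup

broken = "<p>unclosed paragraph"

# html.parser ships with Python and repairs the missing </p>
soup = BeautifulSoup(broken, "html.parser")
print(soup)  # → <p>unclosed paragraph</p>

# With lxml or html5lib installed, the same call takes the parser name:
#   BeautifulSoup(broken, "lxml")      # fast C parser
#   BeautifulSoup(broken, "html5lib")  # builds a full <html><body> tree, like a browser
```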
HTML Basics
HTML is a tree. Elements can be nested and have attributes.
```html
<div class="card" id="intro">
  <h1>Title</h1>
  <p>Text</p>
  <a href="/about">About</a>
</div>
```

Key terms:
- Tag: `div`, `h1`, `a`
- Attribute: `class`, `id`, `href`
- Text: content inside a tag
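Each key term maps directly to a property on a parsed tag. A quick sketch using a hypothetical one-line document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a class="nav" href="/about">About</a>', "html.parser")
tag = soup.a

print(tag.name)        # tag → a
print(tag["href"])     # attribute → /about
print(tag.get_text())  # text → About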
Parsing HTML
From a string
```python
from bs4 import BeautifulSoup

html = """<html>
  <body>
    <h1>Hello</h1>
    <p class="msg">Welcome</p>
  </body>
</html>"""

soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())
```

From a file

```python
from bs4 import BeautifulSoup

with open("page.html", "r", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

print(soup.title)
```

From a URL with requests

```python
import requests
from bs4 import BeautifulSoup

url = "https://example.com"
resp = requests.get(url, timeout=10)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
print(soup.title.text)
```

Finding Elements
By tag
```python
soup.find("h1")
soup.find_all("p")
```

By class or id

```python
soup.find(class_="card")
soup.find(id="intro")
```

With CSS selectors

```python
soup.select_one("div.card > h1")
soup.select("a[href]")
```

CSS selector tips:
| Selector | Meaning | Example |
|---|---|---|
| tag | Tag name | p |
| .class | Class | .card |
| #id | ID | #intro |
| a[href] | Attribute exists | a[href] |
| parent child | Descendant | .card a |
| parent > child | Direct child | .card > h1 |
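Each row of the table can be checked against a small hypothetical document; note how the descendant and direct-child selectors differ when the link is nested one level deeper:

```python
from bs4 import BeautifulSoup

html = '<div class="card" id="intro"><h1>Hi</h1><span><a href="/x">x</a></span></div>'
soup = BeautifulSoup(html, "html.parser")

print(len(soup.select("h1")))         # tag name → 1
print(len(soup.select(".card")))      # class → 1
print(len(soup.select("#intro")))     # id → 1
print(len(soup.select("a[href]")))    # attribute exists → 1
print(len(soup.select(".card a")))    # descendant, any depth → 1
print(len(soup.select(".card > a")))  # direct child only → 0 (a is inside span)
```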
Navigating the Tree
```python
container = soup.find("div", class_="card")
heading = container.find("h1")

print(heading.parent.name)
print([child.name for child in container.children if child.name])
print(container.next_sibling)
```

Common navigation helpers: `parent`, `parents`, `children`, `descendants`, `next_sibling`, and `previous_sibling`.
Extracting Content
Text
```python
title = soup.find("h1")
print(title.get_text(strip=True))
```

Attributes

```python
link = soup.find("a")
print(link["href"])
print(link.get("class"))
```

Links and images

```python
links = [a["href"] for a in soup.find_all("a", href=True)]
images = [img.get("src") for img in soup.find_all("img")]
```

Modifying HTML

```python
p = soup.find("p")
p.string = "Updated text"
p["class"] = "highlight"
```

Add a new element:

```python
new_tag = soup.new_tag("span")
new_tag.string = "New"
soup.body.append(new_tag)
```

Web Scraping Patterns
Table extraction
```python
table = soup.find("table")
headers = [th.get_text(strip=True) for th in table.find_all("th")]
rows = []
for tr in table.find_all("tr")[1:]:
    rows.append([td.get_text(strip=True) for td in tr.find_all("td")])
```

Pagination

```python
import requests
from bs4 import BeautifulSoup

base = "https://example.com/page/{}"
for page in range(1, 4):
    resp = requests.get(base.format(page), timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    for item in soup.select(".item .title"):
        print(item.get_text(strip=True))
```

Extracting forms

```python
forms = soup.find_all("form")
for form in forms:
    print(form.get("action"), form.get("method"))
    for inp in form.find_all("input"):
        print(inp.get("name"), inp.get("type"))
```

Ethics and Safety
Always:
- Read `robots.txt` and site terms
- Use timeouts and respect rate limits
- Scrape only public data you have permission to access
Never:
- Bypass authentication without authorization
- Overload servers
- Collect sensitive data without consent
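The "read robots.txt" and "respect rate limits" rules can be sketched with the standard library's `urllib.robotparser`. The robots rules below are hypothetical; in practice you would call `rp.set_url("https://example.com/robots.txt")` and `rp.read()` instead of `parse()`:

```python
import time
import urllib.robotparser

# Hypothetical robots.txt rules, parsed directly for illustration
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("MyBot/1.0", "https://example.com/page/1"))     # → True
print(rp.can_fetch("MyBot/1.0", "https://example.com/private/x"))  # → False

time.sleep(0.5)  # a fixed delay between requests keeps server load light
```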
Quick Reference
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

soup.find("div")
soup.find_all("a")
soup.find(class_="card")
soup.find(id="main")
soup.select_one(".card > h1")
soup.select("a[href]")
element.get_text(strip=True)
link.get("href")
```