Python BeautifulSoup Module - Complete Tutorial

Table of Contents#

  1. What Is BeautifulSoup
  2. Installation
  3. HTML Basics
  4. Parsing HTML
  5. Finding Elements
  6. Navigating the Tree
  7. Extracting Content
  8. Modifying HTML
  9. Web Scraping Patterns
  10. Ethics and Safety
  11. Quick Reference

What Is BeautifulSoup#

BeautifulSoup (bs4) is a Python library for parsing HTML and XML. It builds a parse tree that makes it easy to find and extract data.

What it can do:

  • Parse messy HTML
  • Find elements by tag, class, id, or CSS selectors
  • Navigate parent, child, and sibling relationships
  • Extract text and attributes
  • Modify HTML when needed

Installation#

pip install beautifulsoup4

Optional parsers:

pip install lxml html5lib

Parser comparison:

Parser        Pros            Cons
html.parser   Built in        Slower on large pages
lxml          Fast            External dependency
html5lib      Most tolerant   Slowest
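
One way to act on this comparison is to prefer the faster parser when it is installed and fall back to the built-in one otherwise. The sketch below is only an illustration: `make_soup` and its `preferred` parameter are hypothetical names, not part of bs4.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    Hypothetical helper for illustration; not part of bs4.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser is missing.
            continue
    raise RuntimeError("none of the requested parsers is installed")


# The closing </p> is missing on purpose; either parser repairs it.
soup = make_soup("<p>Unclosed paragraph")
print(soup.p.get_text())
```

Different parsers can build slightly different trees from invalid markup, so it is probably safest to pin one parser once a scraper depends on the exact structure.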

HTML Basics#

HTML is a tree. Elements can be nested and have attributes.

<div class="card" id="intro">
<h1>Title</h1>
<p>Text</p>
<a href="/about">About</a>
</div>

Key terms:

  • Tag: div, h1, a
  • Attribute: class, id, href
  • Text: content inside a tag
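
To connect these terms to bs4, the following sketch parses the snippet above and reads the tag name, two attributes, and the text of one element. It assumes the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

html = """
<div class="card" id="intro">
<h1>Title</h1>
<p>Text</p>
<a href="/about">About</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div")

print(div.name)       # tag name
print(div["id"])      # single-valued attribute, returned as a string
print(div["class"])   # class is multi-valued, so bs4 returns a list
print(div.h1.string)  # text inside the <h1> tag
```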

Parsing HTML#

From a string#

from bs4 import BeautifulSoup
html = """
<html>
<body>
<h1>Hello</h1>
<p class="msg">Welcome</p>
</body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())

From a file#

from bs4 import BeautifulSoup
with open("page.html", "r", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")
print(soup.title)

From a URL with requests#

import requests
from bs4 import BeautifulSoup
url = "https://example.com"
resp = requests.get(url, timeout=10)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")
print(soup.title.text)

Finding Elements#

By tag#

soup.find("h1")
soup.find_all("p")

By class or id#

soup.find(class_="card")
soup.find(id="intro")

With CSS selectors#

soup.select_one("div.card > h1")
soup.select("a[href]")

CSS selector tips:

Selector         Meaning            Example
tag              Tag name           p
.class           Class              .card
#id              ID                 #intro
a[href]          Attribute exists   a[href]
parent child     Descendant         .card a
parent > child   Direct child       .card > h1
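
As a quick check of the selectors listed here, this sketch runs each example against the card snippet from HTML Basics and records which tags match. The `matches` dictionary is just an illustrative name.

```python
from bs4 import BeautifulSoup

html = """
<div class="card" id="intro">
<h1>Title</h1>
<p>Text</p>
<a href="/about">About</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

selectors = ["p", ".card", "#intro", "a[href]", ".card a", ".card > h1"]

# Map each selector to the names of the tags it matches.
matches = {sel: [el.name for el in soup.select(sel)] for sel in selectors}

for sel, names in matches.items():
    print(f"{sel!r:14} -> {names}")
```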

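Navigating the Tree#

container = soup.find("div", class_="card")
heading = container.find("h1")
print(heading.parent.name)
# Text nodes have no tag name, so this keeps only child tags
print([child.name for child in container.children if child.name])
# The next sibling may be a whitespace text node, or None if nothing follows
print(container.next_sibling)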

Common navigation helpers:

  • parent
  • parents
  • children
  • descendants
  • next_sibling and previous_sibling
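
The helpers above can be sketched on the card snippet from HTML Basics. One detail worth seeing in practice is that `next_sibling` often returns the whitespace between tags, while `find_next_sibling()` skips to the next tag. The variable names are illustrative.

```python
from bs4 import BeautifulSoup, Tag

html = """
<div class="card" id="intro">
<h1>Title</h1>
<p>Text</p>
<a href="/about">About</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
heading = soup.find("h1")

# .parents walks upward; the last entry is the BeautifulSoup object itself.
ancestors = [parent.name for parent in heading.parents]

# .next_sibling here is the newline between </h1> and <p>, not the <p> tag.
raw_sibling = heading.next_sibling

# find_next_sibling() skips text nodes and returns the next tag.
next_tag = heading.find_next_sibling()

# .descendants yields every nested node, text included; keep only tags.
nested = [node.name for node in soup.div.descendants if isinstance(node, Tag)]

print(ancestors)
print(repr(raw_sibling))
print(next_tag.name)
print(nested)
```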

Extracting Content#

Text#

title = soup.find("h1")
print(title.get_text(strip=True))
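
When a tag contains nested tags, the text helpers behave differently, which the sketch below tries to show. The sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup

html = "<p>  Hello, <b>World</b>!  </p>"
soup = BeautifulSoup(html, "html.parser")
p = soup.p

# .string is None because <p> has more than one child node.
print(repr(p.string))

# get_text() concatenates every nested string, whitespace included.
print(repr(p.get_text()))

# strip=True strips each piece before joining, so the space after the comma is lost.
print(repr(p.get_text(strip=True)))

# A separator is placed between the stripped pieces.
print(repr(p.get_text(" ", strip=True)))
```

If spacing between inline elements matters, passing a separator is usually the safer choice.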

Attributes#

link = soup.find("a")
print(link["href"])
print(link.get("class"))
links = [a["href"] for a in soup.find_all("a", href=True)]
images = [img.get("src") for img in soup.find_all("img")]
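
The two access styles differ when an attribute is missing: indexing raises `KeyError`, while `.get()` returns `None` or a default. A minimal sketch with made-up markup:

```python
from bs4 import BeautifulSoup

html = '<a class="nav primary" href="/about">About</a><a name="top">Top</a>'
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("a")

print(first["href"])           # attribute present, returned as a string
print(first.get("class"))      # multi-valued attribute, returned as a list
print(second.get("href"))      # missing attribute, .get() returns None
print(second.get("href", "#")) # or a default of your choice

try:
    second["href"]
    missing_raises = False
except KeyError:
    # Indexing a missing attribute raises KeyError.
    missing_raises = True
print(missing_raises)
```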

Modifying HTML#

p = soup.find("p")
p.string = "Updated text"
p["class"] = "highlight"

Add a new element:

new_tag = soup.new_tag("span")
new_tag.string = "New"
soup.body.append(new_tag)
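
Put together, the edits above can be run end to end and serialized back to a string. This sketch uses a small made-up document so the result is easy to inspect.

```python
from bs4 import BeautifulSoup

html = '<html><body><p class="msg">Welcome</p></body></html>'
soup = BeautifulSoup(html, "html.parser")

# Replace the paragraph text and its class attribute.
p = soup.find("p")
p.string = "Updated text"
p["class"] = "highlight"

# Create a new tag and append it as the last child of <body>.
new_tag = soup.new_tag("span")
new_tag.string = "New"
soup.body.append(new_tag)

# str(soup) serializes the modified tree.
result = str(soup)
print(result)
```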

Web Scraping Patterns#

Table extraction#

table = soup.find("table")
headers = [th.get_text(strip=True) for th in table.find_all("th")]
rows = []
for tr in table.find_all("tr")[1:]:
    rows.append([td.get_text(strip=True) for td in tr.find_all("td")])
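
A common follow-up is to pair each row with the headers so every record is a dictionary. The sketch below assumes a simple table whose first row holds the headers. The sample data is invented.

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr><th>Name</th><th>Price</th></tr>
<tr><td>Widget</td><td>4.50</td></tr>
<tr><td>Gadget</td><td>12.00</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

headers = [th.get_text(strip=True) for th in table.find_all("th")]

records = []
for tr in table.find_all("tr")[1:]:  # skip the header row
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    # zip pairs each header with the cell in the same column.
    records.append(dict(zip(headers, cells)))

print(records)
```

Cell values stay strings. Convert them, for example with `float()`, only when the column is known to be numeric.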

Pagination#

import requests
from bs4 import BeautifulSoup
base = "https://example.com/page/{}"
for page in range(1, 4):
    resp = requests.get(base.format(page), timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    for item in soup.select(".item .title"):
        print(item.get_text(strip=True))
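
A variation on this loop separates fetching from parsing, pauses between requests, and stops at the first page with no items. This is a sketch under assumptions: `extract_titles`, `scrape_titles`, the `fetch` callable, and the fake pages are hypothetical names introduced so the example runs without network access.

```python
import time

from bs4 import BeautifulSoup


def extract_titles(html):
    """Return the text of every `.item .title` element in one page."""
    soup = BeautifulSoup(html, "html.parser")
    return [item.get_text(strip=True) for item in soup.select(".item .title")]


def scrape_titles(fetch, base, max_pages=3, delay=0.0):
    """Collect titles page by page; `fetch` is any callable mapping a URL to HTML."""
    titles = []
    for page in range(1, max_pages + 1):
        found = extract_titles(fetch(base.format(page)))
        if not found:
            break  # an empty page is treated as the end of the listing
        titles.extend(found)
        time.sleep(delay)  # pause between requests to limit load on the server
    return titles


# Stand-in for real HTTP responses so the sketch runs offline.
fake_pages = {
    "https://example.com/page/1": '<div class="item"><span class="title">A</span></div>',
    "https://example.com/page/2": '<div class="item"><span class="title">B</span></div>',
}
titles = scrape_titles(lambda url: fake_pages.get(url, ""), "https://example.com/page/{}")
print(titles)
```

With real pages, `fetch` could wrap `requests.get(url, timeout=10)` and return `resp.text` after `resp.raise_for_status()`, with a non-zero `delay`.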

Extracting forms#

forms = soup.find_all("form")
for form in forms:
    print(form.get("action"), form.get("method"))
    for inp in form.find_all("input"):
        print(inp.get("name"), inp.get("type"))
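
Building on the loop above, named inputs can be collected into a dictionary of default values, for instance to inspect hidden fields. The form markup and the `fields` name are invented for illustration.

```python
from bs4 import BeautifulSoup

html = """
<form action="/login" method="post">
<input type="hidden" name="csrf" value="abc123">
<input type="text" name="user">
<input type="submit" value="Go">
</form>
"""
soup = BeautifulSoup(html, "html.parser")
form = soup.find("form")

# attrs={"name": True} keeps only inputs that have a name attribute;
# the unnamed submit button is skipped.
named_inputs = form.find_all("input", attrs={"name": True})

# Inputs without a value attribute default to an empty string.
fields = {inp["name"]: inp.get("value", "") for inp in named_inputs}

print(form.get("action"), form.get("method"))
print(fields)
```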

Ethics and Safety#

Always:

  • Read robots.txt and site terms
  • Use timeouts and respect rate limits
  • Scrape only public data you have permission to access

Never:

  • Bypass authentication without authorization
  • Overload servers
  • Collect sensitive data without consent
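
Checking robots.txt can be automated with the standard library's `urllib.robotparser`. The sketch below parses an invented robots.txt from a string so it runs offline. The user agent name `my-scraper` is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Invented rules for illustration; a real scraper would load the site's own file.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

agent = "my-scraper"  # placeholder user agent
allowed_public = rp.can_fetch(agent, "https://example.com/page/1")
allowed_private = rp.can_fetch(agent, "https://example.com/private/x")
delay = rp.crawl_delay(agent)  # seconds to wait between requests, or None

print(allowed_public, allowed_private, delay)
```

For a live site, `rp.set_url(...)` followed by `rp.read()` fetches the file instead. robots.txt does not replace reading the site's terms.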

Quick Reference#

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
soup.find("div")
soup.find_all("a")
soup.find(class_="card")
soup.find(id="main")
soup.select_one(".card > h1")
soup.select("a[href]")
element.get_text(strip=True)
link.get("href")