Python BeautifulSoup Module - Complete Tutorial
Table of Contents
- What Is BeautifulSoup
- Installation
- HTML Basics
- Parsing HTML
- Finding Elements
- Navigating the Tree
- Extracting Content
- Modifying HTML
- Web Scraping Patterns
- Ethics and Safety
- Quick Reference
What Is BeautifulSoup
BeautifulSoup (bs4) is a Python library for parsing HTML and XML. It builds a parse tree that makes it easy to find and extract data.
What it can do:
- Parse messy HTML
- Find elements by tag, class, id, or CSS selectors
- Navigate parent, child, and sibling relationships
- Extract text and attributes
- Modify HTML when needed
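The capabilities above can be seen in a minimal sketch; the HTML snippet here is hypothetical, made up just to exercise each feature:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only to demonstrate the features listed above
html = '<div class="card" id="intro"><h1>Title</h1><p>Hi</p><a href="/about">About</a></div>'
soup = BeautifulSoup(html, "html.parser")

print(soup.h1.get_text())                  # find by tag → Title
print(soup.find(id="intro")["class"])      # attributes; class is always a list → ['card']
print(soup.select_one("a[href]")["href"])  # CSS selector plus attribute → /about
```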
Installation
```shell
pip install beautifulsoup4
```

Optional parsers:

```shell
pip install lxml html5lib
```

Parser comparison:
| Parser | Pros | Cons |
|---|---|---|
| html.parser | Built in | Slower on large pages |
| lxml | Fast | External dependency |
| html5lib | Most tolerant | Slowest |
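The parser is chosen by name in the `BeautifulSoup` constructor. A small sketch with deliberately broken markup (the comments about lxml and html5lib assume those packages are installed):

```python
from bs4 import BeautifulSoup

broken = "<p>unclosed paragraph"

# html.parser ships with Python and repairs the missing </p>
soup = BeautifulSoup(broken, "html.parser")
print(soup)  # → <p>unclosed paragraph</p>

# With lxml or html5lib installed, the same call takes the parser name:
#   BeautifulSoup(broken, "lxml")      # fast C parser
#   BeautifulSoup(broken, "html5lib")  # builds a full <html><body> tree, like a browser
```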
HTML Basics
HTML is a tree. Elements can be nested and have attributes.
```html
<div class="card" id="intro">
  <h1>Title</h1>
  <p>Text</p>
  <a href="/about">About</a>
</div>
```

Key terms:
- Tag: `div`, `h1`, `a`
- Attribute: `class`, `id`, `href`
- Text: content inside a tag
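Each key term maps directly to a property on a parsed tag. A quick sketch using a hypothetical one-line document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a class="nav" href="/about">About</a>', "html.parser")
tag = soup.a

print(tag.name)        # tag → a
print(tag["href"])     # attribute → /about
print(tag.get_text())  # text → About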
Parsing HTML
From a string
```python
from bs4 import BeautifulSoup

html = """<html>
  <body>
    <h1>Hello</h1>
    <p class="msg">Welcome</p>
  </body>
</html>"""

soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())
```

From a file

```python
from bs4 import BeautifulSoup

with open("page.html", "r", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

print(soup.title)
```

From a URL with requests

```python
import requests
from bs4 import BeautifulSoup

url = "https://example.com"
resp = requests.get(url, timeout=10)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
print(soup.title.text)
```

Finding Elements
By tag
```python
soup.find("h1")
soup.find_all("p")
```

By class or id

```python
soup.find(class_="card")
soup.find(id="intro")
```

With CSS selectors

```python
soup.select_one("div.card > h1")
soup.select("a[href]")
```

CSS selector tips:
| Selector | Meaning | Example |
|---|---|---|
| tag | Tag name | p |
| .class | Class | .card |
| #id | ID | #intro |
| a[href] | Attribute exists | a[href] |
| parent child | Descendant | .card a |
| parent > child | Direct child | .card > h1 |
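Each row of the table can be checked against a small hypothetical document; note how the descendant and direct-child selectors differ when the link is nested one level deeper:

```python
from bs4 import BeautifulSoup

html = '<div class="card" id="intro"><h1>Hi</h1><span><a href="/x">x</a></span></div>'
soup = BeautifulSoup(html, "html.parser")

print(len(soup.select("h1")))         # tag name → 1
print(len(soup.select(".card")))      # class → 1
print(len(soup.select("#intro")))     # id → 1
print(len(soup.select("a[href]")))    # attribute exists → 1
print(len(soup.select(".card a")))    # descendant, any depth → 1
print(len(soup.select(".card > a")))  # direct child only → 0 (a is inside span)
```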
Navigating the Tree
```python
container = soup.find("div", class_="card")
heading = container.find("h1")

print(heading.parent.name)
print([child.name for child in container.children if child.name])
print(container.next_sibling)
```

Common navigation helpers: `parent`, `parents`, `children`, `descendants`, `next_sibling`, and `previous_sibling`.
Extracting Content
Text
```python
title = soup.find("h1")
print(title.get_text(strip=True))
```

Attributes

```python
link = soup.find("a")
print(link["href"])
print(link.get("class"))
```

Links and images

```python
links = [a["href"] for a in soup.find_all("a", href=True)]
images = [img.get("src") for img in soup.find_all("img")]
```

Modifying HTML

```python
p = soup.find("p")
p.string = "Updated text"
p["class"] = "highlight"
```

Add a new element:

```python
new_tag = soup.new_tag("span")
new_tag.string = "New"
soup.body.append(new_tag)
```

Web Scraping Patterns
Table extraction
```python
table = soup.find("table")
headers = [th.get_text(strip=True) for th in table.find_all("th")]
rows = []
for tr in table.find_all("tr")[1:]:
    rows.append([td.get_text(strip=True) for td in tr.find_all("td")])
```

Pagination

```python
import requests
from bs4 import BeautifulSoup

base = "https://example.com/page/{}"
for page in range(1, 4):
    resp = requests.get(base.format(page), timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    for item in soup.select(".item .title"):
        print(item.get_text(strip=True))
```

Extracting forms

```python
forms = soup.find_all("form")
for form in forms:
    print(form.get("action"), form.get("method"))
    for inp in form.find_all("input"):
        print(inp.get("name"), inp.get("type"))
```

Ethics and Safety
Always:
- Read `robots.txt` and site terms
- Use timeouts and respect rate limits
- Scrape only public data you have permission to access
Never:
- Bypass authentication without authorization
- Overload servers
- Collect sensitive data without consent
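The "read robots.txt" and "respect rate limits" rules can be sketched with the standard library's `urllib.robotparser`. The robots rules below are hypothetical; in practice you would call `rp.set_url("https://example.com/robots.txt")` and `rp.read()` instead of `parse()`:

```python
import time
import urllib.robotparser

# Hypothetical robots.txt rules, parsed directly for illustration
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("MyBot/1.0", "https://example.com/page/1"))     # → True
print(rp.can_fetch("MyBot/1.0", "https://example.com/private/x"))  # → False

time.sleep(0.5)  # a fixed delay between requests keeps server load light
```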
Quick Reference
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

soup.find("div")
soup.find_all("a")
soup.find(class_="card")
soup.find(id="main")
soup.select_one(".card > h1")
soup.select("a[href]")
element.get_text(strip=True)
link.get("href")
```