beautifulsoup4

Parse, search, and modify HTML/XML documents by building a navigable tree of tags and text.

by diegosouzapw

Best use case

beautifulsoup4 is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using beautifulsoup4 should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/beautifulsoup4/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/development/beautifulsoup4/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it in .claude/skills/beautifulsoup4/SKILL.md inside your project.
3. Restart your AI agent; it will auto-discover the skill.

How beautifulsoup4 Compares

| Feature / Agent | beautifulsoup4 | Standard Approach |
| --- | --- | --- |
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

Parse, search, and modify HTML/XML documents by building a navigable tree of tags and text.

Where can I find the source code?

The source is in the diegosouzapw/awesome-omni-skill repository on GitHub, under skills/development/beautifulsoup4/.

SKILL.md Source

## Imports

```python
import bs4
from bs4 import BeautifulSoup, Tag, Comment
from bs4.exceptions import FeatureNotFound, ParserRejectedMarkup
from bs4.dammit import UnicodeDammit
```

## Core Patterns

### Parse markup with an explicit parser ✅ Current
```python
from __future__ import annotations

from bs4 import BeautifulSoup

html_doc = "<html><body><p class='body strikeout'>Hello</p></body></html>"

# Always choose the parser explicitly for consistent behavior across environments.
soup = BeautifulSoup(html_doc, "html.parser")

p = soup.find("p")
assert p is not None
print(p.name)          # "p"
print(p.get_text())    # "Hello"
```
* Pass the parser name explicitly: `"html.parser"` (built in), `"lxml"`, `"html5lib"`, or `"xml"` / `"lxml-xml"` for XML. Different parsers can produce different trees for invalid documents.
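
Optional parsers such as `lxml` may not be installed everywhere. The sketch below shows one possible way to prefer a parser and fall back to the built-in one; `make_soup` is a hypothetical helper, not part of Beautiful Soup, and the fallback trades tree consistency for availability.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup: str, preferred: str = "lxml") -> BeautifulSoup:
    """Hypothetical helper: try the preferred parser, else use html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The requested parser is unavailable; html.parser ships with Python.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p class='body'>Hello</p>")
fallback = make_soup("<p>Fallback</p>", preferred="no-such-parser")

print(soup.find("p").get_text())
print(fallback.find("p").get_text())
```

If identical trees across machines matter more than speed, pinning `"html.parser"` everywhere is the simpler choice.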

### Parse from a file handle (context manager) ✅ Current
```python
from __future__ import annotations

from pathlib import Path
from bs4 import BeautifulSoup

path = Path("example.html")
path.write_text("<html><body><a href='/x'>Link</a></body></html>", encoding="utf-8")

with path.open("r", encoding="utf-8") as fp:
    soup = BeautifulSoup(fp, "html.parser")

a = soup.find("a")
assert a is not None
print(a.get("href"))  # "/x"
```
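* Pass an open file handle directly to `BeautifulSoup`; it reads the handle's full contents before parsing. A text-mode handle is already decoded by Python, so open the file in binary mode (`"rb"`) if you want Beautiful Soup to detect the encoding itself.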
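
As a minimal sketch of the binary-mode variant, the example below writes UTF-8 bytes to a temporary file and lets Beautiful Soup decode them. The file name and sample text are illustrative assumptions; `from_encoding` is passed only to make the result deterministic.

```python
from pathlib import Path
from tempfile import TemporaryDirectory

from bs4 import BeautifulSoup

with TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.html"  # illustrative file name
    path.write_bytes("<html><body><p>café</p></body></html>".encode("utf-8"))

    # Binary mode hands raw bytes to Beautiful Soup, which decodes them itself.
    with path.open("rb") as fp:
        soup = BeautifulSoup(fp, "html.parser", from_encoding="utf-8")

text = soup.find("p").get_text()
print(text)
print(soup.original_encoding)  # the encoding Beautiful Soup settled on
```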

### Find elements and navigate relatives ✅ Current
```python
from __future__ import annotations

from typing import Optional
from bs4 import BeautifulSoup, Tag

html_doc = """
<div id="root">
  <h1>Title</h1>
  <p>First</p>
  <p>Second <span>inner</span></p>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

root: Optional[Tag] = soup.find(id="root")
assert root is not None

h1: Optional[Tag] = root.find("h1")
assert h1 is not None

# Navigate: find_next returns the first matching element after h1 in document order.
first_p: Optional[Tag] = h1.find_next("p")
assert first_p is not None
print(first_p.get_text(strip=True))  # "First"

all_ps = root.find_all("p")
print([p.get_text(" ", strip=True) for p in all_ps])  # ["First", "Second inner"]
```
* Use `find`, `find_all`, and the `find_next*` / `find_previous*` / sibling / parent variants for tree navigation.
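
The sibling and parent variants mentioned above can be sketched as follows. The markup is a compact version of the earlier example, written without whitespace between tags so that no whitespace-only text nodes sit between siblings.

```python
from bs4 import BeautifulSoup

html_doc = (
    '<div id="root"><h1>Title</h1><p>First</p>'
    "<p>Second <span>inner</span></p></div>"
)
soup = BeautifulSoup(html_doc, "html.parser")

h1 = soup.find("h1")
span = soup.find("span")
assert h1 is not None and span is not None

first_p = h1.find_next_sibling("p")      # nearest following sibling named "p"
following = h1.find_next_siblings("p")   # all following siblings named "p"
enclosing_p = span.find_parent("p")      # nearest ancestor named "p"
enclosing_div = span.find_parent("div")  # search continues up the ancestors
assert first_p is not None and enclosing_p is not None and enclosing_div is not None

print(first_p.get_text())
print(len(following))
print(enclosing_p.get_text())
print(enclosing_div["id"])
```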

### Work with tag attributes (including multi-valued `class`) ✅ Current
```python
from __future__ import annotations

from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup("<p id='x' class='body strikeout'></p>", "html.parser")
p = soup.find("p")
assert isinstance(p, Tag)

# Dict-like access
print(p["id"])         # "x"
print(p.get("id"))     # "x"

# Multi-valued HTML attributes like class are lists by default.
print(p["class"])      # ["body", "strikeout"]

# If you always want a list (even for non-multivalued attrs), use get_attribute_list.
print(p.get_attribute_list("id"))     # ["x"]
print(p.get_attribute_list("class"))  # ["body", "strikeout"]

# Mutation
p["data-role"] = "demo"
del p["id"]
print(p.attrs)  # {'class': ['body', 'strikeout'], 'data-role': 'demo'}
```
* In HTML mode, `class`, `rel`, etc. are typically stored as `list[str]`. Use `Tag.get_attribute_list(name)` to normalize to a list.
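
Because `class` is multi-valued, searching by class matches any one of the values. The sketch below, with made-up sample markup, uses `class_` with `find_all` and a CSS selector via `select` to require both classes.

```python
from bs4 import BeautifulSoup

html_doc = (
    "<p class='body strikeout'>one</p>"
    "<p class='body'>two</p>"
    "<p class='note'>three</p>"
)
soup = BeautifulSoup(html_doc, "html.parser")

# class_ matches a tag if any one of its class values equals the argument.
body = [p.get_text() for p in soup.find_all("p", class_="body")]
struck = [p.get_text() for p in soup.find_all("p", class_="strikeout")]

# A CSS selector can require several classes at once.
both = [p.get_text() for p in soup.select("p.body.strikeout")]

print(body, struck, both)
```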

### Handle text nodes and comments safely ✅ Current
```python
from __future__ import annotations

from bs4 import BeautifulSoup, Comment
from bs4.element import NavigableString

soup = BeautifulSoup("<p>Hello<!--secret--></p>", "html.parser")
p = soup.find("p")
assert p is not None

# Comments are special text nodes.
comment = p.find(string=lambda s: isinstance(s, Comment))
assert isinstance(comment, Comment)
print(comment)  # "secret"

# NavigableString is immutable; replace the node instead of editing in place.
text = p.find(string=lambda s: isinstance(s, NavigableString) and not isinstance(s, Comment))
assert isinstance(text, NavigableString)
text.replace_with("Hi")

print(p.get_text())  # "Hi"
```
* Treat `NavigableString` as immutable; use `replace_with(...)` to change text.
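
A common follow-up is stripping every comment from a document. This is a minimal sketch using the same `Comment` filter as above together with `extract()`, which detaches a node from the tree; the markup is illustrative.

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup(
    "<div><p>Hello<!--secret--></p><!--todo--></div>", "html.parser"
)

# Collect the comments first, then detach them from the tree.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
removed = [str(c) for c in comments]
for c in comments:
    c.extract()

print(removed)
print(soup)
```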

## Configuration

- **Parser selection (`features`)**:
  - `"html.parser"`: built-in, decent baseline.
  - `"lxml"`: fast (requires `lxml`).
  - `"html5lib"`: most lenient (slow; requires `html5lib`).
  - `"xml"` / `"lxml-xml"`: XML parsing mode (requires `lxml`; attribute handling differs from HTML).
- **`parse_only`**: pass a `SoupStrainer` (not covered here) to parse only parts of a document for speed/memory.
- **`from_encoding` / `exclude_encodings`**: hint or restrict encoding detection when input is bytes.
- **Large text nodes with lxml**: when using an lxml builder on documents that may contain a single text node larger than 10,000,000 bytes, pass `huge_tree=True` to `BeautifulSoup(...)`; otherwise lxml's security limits can cut the parse short (see the 4.14.3 note under Migration).
- **Multi-valued attributes**:
  - Default (HTML): `class`/`rel` etc. become lists.
  - To disable list conversion: `BeautifulSoup(markup, "html.parser", multi_valued_attributes=None)`
  - In XML mode, multi-valued attributes are not enabled by default; you can opt in via `multi_valued_attributes={'*': 'class'}`.
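
Two of the options above can be sketched briefly: `parse_only` with a `SoupStrainer` (which the list mentions but does not cover) and `multi_valued_attributes=None`. The markup is an illustrative assumption, and `html.parser` is used because `html5lib` does not support `parse_only`.

```python
from bs4 import BeautifulSoup, SoupStrainer

html_doc = (
    "<div><p class='body strikeout'>Text</p>"
    "<a href='/x'>X</a><a href='/y'>Y</a></div>"
)

# parse_only: build a tree containing only the <a> tags.
only_links = BeautifulSoup(html_doc, "html.parser", parse_only=SoupStrainer("a"))
hrefs = [a["href"] for a in only_links.find_all("a")]

# multi_valued_attributes=None: keep class as the raw attribute string.
plain = BeautifulSoup(html_doc, "html.parser", multi_valued_attributes=None)
class_value = plain.p["class"]

print(hrefs)
print(class_value)
```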

## Pitfalls

### Wrong: Not specifying a parser (inconsistent trees)
```python
from bs4 import BeautifulSoup

html_doc = "<p><b>badly nested</p></b>"
soup = BeautifulSoup(html_doc)  # parser not specified: bs4 guesses, and the guess depends on what is installed
print(soup.find("b"))
```

### Right: Choose a parser explicitly
```python
from bs4 import BeautifulSoup

html_doc = "<p><b>badly nested</p></b>"
soup = BeautifulSoup(html_doc, "html.parser")
print(soup.find("b"))
```

### Wrong: Treating `class` as a string in HTML mode
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p class='body strikeout'></p>", "html.parser")
# In HTML mode, soup.p["class"] is a list, so this fails.
classes = soup.p["class"].split()  # type: ignore[attr-defined]
print(classes)
```

### Right: Use the list directly (or normalize with `get_attribute_list`)
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p class='body strikeout'></p>", "html.parser")
classes = soup.p["class"]
print(classes)  # ["body", "strikeout"]

ids = soup.p.get_attribute_list("id")
print(ids)  # []
```

### Wrong: Assuming multi-valued attributes exist in XML mode
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p class='body strikeout'></p>", "xml")
# In XML mode, "class" is a string by default; indexing returns a character.
first = soup.p["class"][0]
print(first)  # "b" (not "body")
```

### Right: Opt in to multi-valued attributes when parsing XML
```python
from bs4 import BeautifulSoup

class_is_multi = {"*": "class"}
soup = BeautifulSoup("<p class='body strikeout'></p>", "xml", multi_valued_attributes=class_is_multi)
first = soup.p["class"][0]
print(first)  # "body"
```

### Wrong: Editing a `NavigableString` “in place”
```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<p>Hello</p>", "html.parser")
text = soup.p.string
assert isinstance(text, NavigableString)

# Rebinding the local name creates a new, detached string; the parse tree is unchanged.
text = NavigableString("Hi")
print(soup.p.get_text())  # still "Hello"
```

### Right: Replace the existing node with `replace_with`
```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<p>Hello</p>", "html.parser")
text = soup.p.string
assert isinstance(text, NavigableString)

text.replace_with("Hi")
print(soup.p.get_text())  # "Hi"
```
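
As an alternative sketch, assigning to `Tag.string` replaces a tag's entire contents with a single string. Note the assumption built into this shortcut: any child tags are discarded as well, so it only suits tags whose whole content should be replaced.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")
p = soup.find("p")
assert p is not None

# Assigning .string replaces everything inside <p>, including the <b> child.
p.string = "Hi"

print(soup)
```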

### Wrong: lxml builder truncation with huge text nodes (missing `huge_tree=True`)
```python
from bs4 import BeautifulSoup

# If this markup contains a single >10,000,000 byte text node, lxml may stop early.
markup_with_huge_text = "<root>" + ("x" * 11_000_000) + "</root>"
soup = BeautifulSoup(markup_with_huge_text, "lxml")
print(soup.find("root") is not None)
```

### Right: Enable huge tree support when needed
```python
from bs4 import BeautifulSoup

markup_with_huge_text = "<root>" + ("x" * 11_000_000) + "</root>"
soup = BeautifulSoup(markup_with_huge_text, "lxml", huge_tree=True)
print(soup.find("root") is not None)
```

## References

- [Download](https://www.crummy.com/software/BeautifulSoup/bs4/download/)
- [Homepage](https://www.crummy.com/software/BeautifulSoup/bs4/)

## Migration from v4.13.x

- **Typing changes (4.14.0+)**: `find_*` methods gained overloads to improve type safety.
  - Prefer annotating results as `Optional[Tag]`, `Optional[NavigableString]`, `Sequence[Tag]`, etc.
  - Known edge case: `find_all("a", string="...")` may still confuse type checkers; refactor or use `typing.cast`.

```python
from __future__ import annotations

from typing import Optional, Sequence, cast
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup("<a>b</a>", "html.parser")

# Preferred: reflect optionality
a: Optional[Tag] = soup.find("a")

# Edge case: mixed filters may require a cast for static type checkers
tags = cast(Sequence[Tag], soup.find_all("a", string="b"))
print([t.get_text() for t in tags])
```

- **`ResultSet` typing churn across 4.14.x**: inheritance changed in 4.14.0/4.14.1/4.14.2; avoid depending on specific ABC inheritance.
  - If you need a stable container type at boundaries: `results = list(soup.find_all(...))`.

- **lxml huge text nodes (4.14.3 note)**: if using an lxml builder and expecting extremely large text nodes, pass `huge_tree=True`.
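
The `ResultSet` advice above can be sketched as a boundary function that returns plain lists, so callers never depend on what `ResultSet` inherits from. `link_targets` is a hypothetical helper introduced only for illustration.

```python
from __future__ import annotations

from bs4 import BeautifulSoup, Tag


def link_targets(markup: str) -> list[str]:
    """Hypothetical helper: return href values as a plain list of strings."""
    soup = BeautifulSoup(markup, "html.parser")
    # Copy into a list at the boundary instead of passing the ResultSet around.
    anchors: list[Tag] = [t for t in list(soup.find_all("a")) if isinstance(t, Tag)]
    return [str(a.get("href", "")) for a in anchors]


targets = link_targets('<a href="/x">X</a><a>no href</a>')
print(targets)
```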

## API Reference

- **BeautifulSoup(markup, features=..., parse_only=..., from_encoding=..., exclude_encodings=..., element_classes=..., \*\*kwargs)** - parse markup into a tree; specify `features` (parser) explicitly.
- **BeautifulSoup.find(...)** - return the first matching element (often `Tag | None`); supports tag name, attrs, and other filters.
- **BeautifulSoup.find_all(...)** - return all matching elements (list-like result set); convert to `list(...)` if you need a stable container type.
- **BeautifulSoup.find_next(...) / find_all_next(...)** - search forward in document order from a starting node.
- **BeautifulSoup.find_previous(...) / find_all_previous(...)** - search backward in document order from a starting node.
- **BeautifulSoup.find_next_sibling(...) / find_next_siblings(...)** - search among following siblings.
- **BeautifulSoup.find_previous_sibling(...) / find_previous_siblings(...)** - search among preceding siblings.
- **BeautifulSoup.find_parent(...) / find_parents(...)** - search upward to parents/ancestors.
- **BeautifulSoup.get_text(separator="", strip=False)** - extract combined text content from a subtree; `separator` is inserted between text fragments.
- **BeautifulSoup.prettify()** - render formatted markup for debugging/inspection.
- **BeautifulSoup.contains_replacement_characters** - flag indicating replacement characters were introduced during entity/encoding handling (builder-dependent).
- **Tag.name** - the tag’s name (e.g., `"a"`, `"p"`).
- **Tag.attrs** - dict of attributes; multi-valued HTML attributes may be lists.
- **Tag.get(key, default=None)** - safe attribute lookup.
- **Tag.get_attribute_list(name)** - normalize an attribute to a list regardless of internal storage.
- **UnicodeDammit(...)** - helper for detecting/decoding unknown encodings before parsing.
- **Comment** - class for HTML/XML comment nodes (a specialized string-like node).
- **ParserRejectedMarkup** - exception raised when the underlying parser rejects markup.
- **FeatureNotFound** - exception raised when the requested parser feature/builder is unavailable.
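
As a small illustration of `UnicodeDammit` from the list above, the sketch below decodes bytes before parsing. The sample bytes and the candidate encoding list are assumptions chosen so the outcome is deterministic; with real input, the candidates would come from whatever you know about the source.

```python
from bs4 import BeautifulSoup
from bs4.dammit import UnicodeDammit

raw = "<p>Sacré bleu!</p>".encode("latin-1")

# Candidate encodings are tried in order before any guessing takes place.
dammit = UnicodeDammit(raw, ["latin-1"])
decoded = dammit.unicode_markup

soup = BeautifulSoup(decoded, "html.parser")
print(soup.find("p").get_text())
print(dammit.original_encoding)  # the encoding that succeeded
```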

Related Skills

bgo

from diegosouzapw/awesome-omni-skill

Automates the complete Blender build-go workflow, from building and packaging your extension/add-on to removing old versions, installing, enabling, and launching Blender for quick testing and iteration.

Coding & Development

Buffer Overflow Payload Generator

from diegosouzapw/awesome-omni-skill

Generates a buffer overflow attack payload with a specific stack layout (padding, return address, NOP sled, shellcode) and saves it to a file.

browser-testing

from diegosouzapw/awesome-omni-skill

Use when testing web applications, debugging browser console errors, automating form interactions, or verifying UI implementations. Load for localhost testing, authenticated app testing (Gmail, Notion), or recording demo GIFs. Requires Chrome extension 1.0.36+, Claude Code 2.0.73+, paid plan.

browser-fetch

from diegosouzapw/awesome-omni-skill

Delegate browser automation to a lightweight subagent (Haiku) to reduce context consumption. Also provides web clipping (HTML→Markdown) via clipper.

Browser Automation Expert

from diegosouzapw/awesome-omni-skill

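Browser automation and web testing expert. Supports real-time interaction through MCP tools (Puppeteer/Playwright) as well as complex automation flows implemented with Python scripts.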

bronze-layer-setup

from diegosouzapw/awesome-omni-skill

End-to-end Bronze layer creation for testing and demos. Creates table DDLs, generates fake data with Faker, copies from existing sources, and configures Asset Bundle jobs. Covers Unity Catalog compliance, Change Data Feed, automatic liquid clustering, and governance metadata. Use when setting up Bronze layer tables, creating test/demo data, rapid prototyping Medallion Architecture, or bootstrapping a new Databricks project. For Faker-specific patterns (corruption rates, function signatures, provider examples), load the faker-data-generation skill.

brand-identity

from diegosouzapw/awesome-omni-skill

Provides the single source of truth for brand guidelines, design tokens, technology choices, and voice/tone. Use this skill whenever generating UI components, styling applications, writing copy, or creating user-facing assets to ensure brand consistency.

brainstorming

from diegosouzapw/awesome-omni-skill

Use when creating or developing anything, before writing code or implementation plans - refines rough ideas into fully-formed designs through structured Socratic questioning, alternative exploration, and incremental validation

boxlog-frontend-design

from diegosouzapw/awesome-omni-skill

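Frontend design skill dedicated to BoxLog. UI design guidelines for achieving an "undecorated, essential experience". Complements STYLE_GUIDE.md and provides criteria for fonts, animation, and design decisions.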

bounty-hunter

from diegosouzapw/awesome-omni-skill

Find, evaluate, and submit online bounties and hackathons for prize money. Use when user mentions "bounties", "hackathon", "earn money", "Superteam Earn", "prize money", "submissions", "freelance bounties", or asks to find paid opportunities. Covers discovery, eligibility filtering, content drafting, and submission workflows.

bootstrap-phase-workflow

from diegosouzapw/awesome-omni-skill

Integrate the vibe/mature phase workflow into a project

bootstrap-auto

from diegosouzapw/awesome-omni-skill

[Implementation] Bootstrap a new project automatically