beautifulsoup4
Parse, search, and modify HTML/XML documents by building a navigable tree of tags and text.
Best use case
beautifulsoup4 is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Parse, search, and modify HTML/XML documents by building a navigable tree of tags and text.
Teams using beautifulsoup4 should expect a more consistent output, faster repeated execution, less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/beautifulsoup4/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How beautifulsoup4 Compares
| Feature / Agent | beautifulsoup4 | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Parse, search, and modify HTML/XML documents by building a navigable tree of tags and text.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
## Imports
```python
import bs4
from bs4 import BeautifulSoup, Tag, Comment
from bs4.exceptions import FeatureNotFound, ParserRejectedMarkup
from bs4.dammit import UnicodeDammit
```
## Core Patterns
### Parse markup with an explicit parser ✅ Current
```python
from __future__ import annotations
from bs4 import BeautifulSoup
html_doc = "<html><body><p class='body strikeout'>Hello</p></body></html>"
# Always choose the parser explicitly for consistent behavior across environments.
soup = BeautifulSoup(html_doc, "html.parser")
p = soup.find("p")
assert p is not None
print(p.name) # "p"
print(p.get_text()) # "Hello"
```
* Prefer `BeautifulSoup(markup, "html.parser")`, `"lxml"`, `"html5lib"`, or `"xml"/"lxml-xml"` depending on your needs; different parsers can produce different trees for invalid documents.
### Parse from a file handle (context manager) ✅ Current
```python
from __future__ import annotations
from pathlib import Path
from bs4 import BeautifulSoup
path = Path("example.html")
path.write_text("<html><body><a href='/x'>Link</a></body></html>", encoding="utf-8")
with path.open("r", encoding="utf-8") as fp:
soup = BeautifulSoup(fp, "html.parser")
a = soup.find("a")
assert a is not None
print(a.get("href")) # "/x"
```
* Pass an open file handle directly to `BeautifulSoup` to let the builder stream/handle encodings appropriately.
### Find elements and navigate relatives ✅ Current
```python
from __future__ import annotations
from typing import Optional
from bs4 import BeautifulSoup, Tag
html_doc = """
<div id="root">
<h1>Title</h1>
<p>First</p>
<p>Second <span>inner</span></p>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
root: Optional[Tag] = soup.find(id="root")
assert root is not None
h1: Optional[Tag] = root.find("h1")
assert h1 is not None
# Navigate
second_p: Optional[Tag] = h1.find_next("p")
assert second_p is not None
print(second_p.get_text(strip=True)) # "First"
all_ps = root.find_all("p")
print([p.get_text(" ", strip=True) for p in all_ps]) # ["First", "Second inner"]
```
* Use `find`, `find_all`, and the `find_next*` / `find_previous*` / sibling / parent variants for tree navigation.
### Work with tag attributes (including multi-valued `class`) ✅ Current
```python
from __future__ import annotations
from bs4 import BeautifulSoup, Tag
soup = BeautifulSoup("<p id='x' class='body strikeout'></p>", "html.parser")
p = soup.find("p")
assert isinstance(p, Tag)
# Dict-like access
print(p["id"]) # "x"
print(p.get("id")) # "x"
# Multi-valued HTML attributes like class are lists by default.
print(p["class"]) # ["body", "strikeout"]
# If you always want a list (even for non-multivalued attrs), use get_attribute_list.
print(p.get_attribute_list("id")) # ["x"]
print(p.get_attribute_list("class")) # ["body", "strikeout"]
# Mutation
p["data-role"] = "demo"
del p["id"]
print(p.attrs) # {'class': ['body', 'strikeout'], 'data-role': 'demo'}
```
* In HTML mode, `class`, `rel`, etc. are typically stored as `list[str]`. Use `Tag.get_attribute_list(name)` to normalize to a list.
### Handle text nodes and comments safely ✅ Current
```python
from __future__ import annotations
from bs4 import BeautifulSoup, Comment
from bs4.element import NavigableString
soup = BeautifulSoup("<p>Hello<!--secret--></p>", "html.parser")
p = soup.find("p")
assert p is not None
# Comments are special text nodes.
comment = p.find(string=lambda s: isinstance(s, Comment))
assert isinstance(comment, Comment)
print(comment) # "secret"
# NavigableString is immutable; replace the node instead of editing in place.
text = p.find(string=lambda s: isinstance(s, NavigableString) and not isinstance(s, Comment))
assert isinstance(text, NavigableString)
text.replace_with("Hi")
print(p.get_text()) # "Hi"
```
* Treat `NavigableString` as immutable; use `replace_with(...)` to change text.
## Configuration
- **Parser selection (`features`)**:
- `"html.parser"`: built-in, decent baseline.
- `"lxml"`: fast (requires `lxml`).
- `"html5lib"`: most lenient (slow; requires `html5lib`).
- `"xml"` / `"lxml-xml"`: XML parsing mode (attribute handling differs from HTML).
- **`parse_only`**: pass a `SoupStrainer` (not covered here) to parse only parts of a document for speed/memory.
- **`from_encoding` / `exclude_encodings`**: hint or restrict encoding detection when input is bytes.
- **Large text nodes with lxml**: when using an lxml builder and documents may contain a single text node > 10,000,000 bytes, pass `huge_tree=True` to `BeautifulSoup(...)` to avoid lxml security limits truncating the parse.
- **Multi-valued attributes**:
- Default (HTML): `class`/`rel` etc. become lists.
- To disable list conversion: `BeautifulSoup(markup, "html.parser", multi_valued_attributes=None)`
- In XML mode, multi-valued attributes are not enabled by default; you can opt in via `multi_valued_attributes={'*': 'class'}`.
## Pitfalls
### Wrong: Not specifying a parser (inconsistent trees)
```python
from bs4 import BeautifulSoup
html_doc = "<p><b>badly nested</p></b>"
soup = BeautifulSoup(html_doc) # parser not specified
print(soup.find("b"))
```
### Right: Choose a parser explicitly
```python
from bs4 import BeautifulSoup
html_doc = "<p><b>badly nested</p></b>"
soup = BeautifulSoup(html_doc, "html.parser")
print(soup.find("b"))
```
### Wrong: Treating `class` as a string in HTML mode
```python
from bs4 import BeautifulSoup
soup = BeautifulSoup("<p class='body strikeout'></p>", "html.parser")
# In HTML mode, soup.p["class"] is a list, so this fails.
classes = soup.p["class"].split() # type: ignore[attr-defined]
print(classes)
```
### Right: Use the list directly (or normalize with `get_attribute_list`)
```python
from bs4 import BeautifulSoup
soup = BeautifulSoup("<p class='body strikeout'></p>", "html.parser")
classes = soup.p["class"]
print(classes) # ["body", "strikeout"]
ids = soup.p.get_attribute_list("id")
print(ids) # []
```
### Wrong: Assuming multi-valued attributes exist in XML mode
```python
from bs4 import BeautifulSoup
soup = BeautifulSoup("<p class='body strikeout'></p>", "xml")
# In XML mode, "class" is a string by default; indexing returns a character.
first = soup.p["class"][0]
print(first) # "b" (not "body")
```
### Right: Opt in to multi-valued attributes when parsing XML
```python
from bs4 import BeautifulSoup
class_is_multi = {"*": "class"}
soup = BeautifulSoup("<p class='body strikeout'></p>", "xml", multi_valued_attributes=class_is_multi)
first = soup.p["class"][0]
print(first) # "body"
```
### Wrong: Editing a `NavigableString` “in place”
```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString
soup = BeautifulSoup("<p>Hello</p>", "html.parser")
text = soup.p.string
assert isinstance(text, NavigableString)
# Strings are immutable; this does not update the parse tree.
text = NavigableString("Hi")
print(soup.p.get_text()) # still "Hello"
```
### Right: Replace the existing node with `replace_with`
```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString
soup = BeautifulSoup("<p>Hello</p>", "html.parser")
text = soup.p.string
assert isinstance(text, NavigableString)
text.replace_with("Hi")
print(soup.p.get_text()) # "Hi"
```
### Wrong: lxml builder truncation with huge text nodes (missing `huge_tree=True`)
```python
from bs4 import BeautifulSoup
# If this markup contains a single >10,000,000 byte text node, lxml may stop early.
markup_with_huge_text = "<root>" + ("x" * 11_000_000) + "</root>"
soup = BeautifulSoup(markup_with_huge_text, "lxml")
print(soup.find("root") is not None)
```
### Right: Enable huge tree support when needed
```python
from bs4 import BeautifulSoup
markup_with_huge_text = "<root>" + ("x" * 11_000_000) + "</root>"
soup = BeautifulSoup(markup_with_huge_text, "lxml", huge_tree=True)
print(soup.find("root") is not None)
```
## References
- [Download](https://www.crummy.com/software/BeautifulSoup/bs4/download/)
- [Homepage](https://www.crummy.com/software/BeautifulSoup/bs4/)
## Migration from v4.13.x
- **Typing changes (4.14.0+)**: `find_*` methods gained overloads to improve type safety.
- Prefer annotating results as `Optional[Tag]`, `Optional[NavigableString]`, `Sequence[Tag]`, etc.
- Known edge case: `find_all("a", string="...")` may still confuse type checkers; refactor or use `typing.cast`.
```python
from __future__ import annotations
from typing import Optional, Sequence, cast
from bs4 import BeautifulSoup, Tag
soup = BeautifulSoup("<a>b</a>", "html.parser")
# Preferred: reflect optionality
a: Optional[Tag] = soup.find("a")
# Edge case: mixed filters may require a cast for static type checkers
tags = cast(Sequence[Tag], soup.find_all("a", string="b"))
print([t.get_text() for t in tags])
```
- **`ResultSet` typing churn across 4.14.x**: inheritance changed in 4.14.0/4.14.1/4.14.2; avoid depending on specific ABC inheritance.
- If you need a stable container type at boundaries: `results = list(soup.find_all(...))`.
- **lxml huge text nodes (4.14.3 note)**: if using an lxml builder and expecting extremely large text nodes, pass `huge_tree=True`.
## API Reference
- **BeautifulSoup(markup, features=..., parse_only=..., from_encoding=..., exclude_encodings=..., element_classes=..., \*\*kwargs)** - parse markup into a tree; specify `features` (parser) explicitly.
- **BeautifulSoup.find(...)** - return the first matching element (often `Tag | None`); supports tag name, attrs, and other filters.
- **BeautifulSoup.find_all(...)** - return all matching elements (list-like result set); convert to `list(...)` if you need a stable container type.
- **BeautifulSoup.find_next(...) / find_all_next(...)** - search forward in document order from a starting node.
- **BeautifulSoup.find_previous(...) / find_all_previous(...)** - search backward in document order from a starting node.
- **BeautifulSoup.find_next_sibling(...) / find_next_siblings(...)** - search among following siblings.
- **BeautifulSoup.find_previous_sibling(...) / find_previous_siblings(...)** - search among preceding siblings.
- **BeautifulSoup.find_parent(...) / find_parents(...)** - search upward to parents/ancestors.
- **BeautifulSoup.get_text(separator="...", strip=False)** - extract combined text content from a subtree.
- **BeautifulSoup.prettify()** - render formatted markup for debugging/inspection.
- **BeautifulSoup.contains_replacement_characters** - flag indicating replacement characters were introduced during entity/encoding handling (builder-dependent).
- **Tag.name** - the tag’s name (e.g., `"a"`, `"p"`).
- **Tag.attrs** - dict of attributes; multi-valued HTML attributes may be lists.
- **Tag.get(key, default=None)** - safe attribute lookup.
- **Tag.get_attribute_list(name)** - normalize an attribute to a list regardless of internal storage.
- **UnicodeDammit(...)** - helper for detecting/decoding unknown encodings before parsing.
- **Comment** - class for HTML/XML comment nodes (a specialized string-like node).
- **ParserRejectedMarkup** - exception raised when the underlying parser rejects markup.
- **FeatureNotFound** - exception raised when the requested parser feature/builder is unavailable.Related Skills
bgo
Automates the complete Blender build-go workflow, from building and packaging your extension/add-on to removing old versions, installing, enabling, and launching Blender for quick testing and iteration.
Buffer Overflow Payload Generator
Generates a buffer overflow attack payload with a specific stack layout (padding, return address, NOP sled, shellcode) and saves it to a file.
browser-testing
Use when testing web applications, debugging browser console errors, automating form interactions, or verifying UI implementations. Load for localhost testing, authenticated app testing (Gmail, Notion), or recording demo GIFs. Requires Chrome extension 1.0.36+, Claude Code 2.0.73+, paid plan.
browser-fetch
Delegate browser automation to a lightweight subagent (Haiku) to reduce context consumption. Also provides web clipping (HTML→Markdown) via clipper.
Browser Automation Expert
浏览器自动化与网页测试专家。支持基于 MCP 工具(Puppeteer/Playwright)的实时交互,以及基于 Python 脚本的复杂自动化流实现。
bronze-layer-setup
End-to-end Bronze layer creation for testing and demos. Creates table DDLs, generates fake data with Faker, copies from existing sources, and configures Asset Bundle jobs. Covers Unity Catalog compliance, Change Data Feed, automatic liquid clustering, and governance metadata. Use when setting up Bronze layer tables, creating test/demo data, rapid prototyping Medallion Architecture, or bootstrapping a new Databricks project. For Faker-specific patterns (corruption rates, function signatures, provider examples), load the faker-data-generation skill.
brand-identity
Provides the single source of truth for brand guidelines, design tokens, technology choices, and voice/tone. Use this skill whenever generating UI components, styling applications, writing copy, or creating user-facing assets to ensure brand consistency.
brainstorming
Use when creating or developing anything, before writing code or implementation plans - refines rough ideas into fully-formed designs through structured Socratic questioning, alternative exploration, and incremental validation
boxlog-frontend-design
BoxLog専用のフロントエンドデザインスキル。「装飾のない基本体験」を実現するためのUI設計ガイドライン。STYLE_GUIDE.mdを補完し、フォント・アニメーション・デザイン判断基準を提供。
bounty-hunter
Find, evaluate, and submit online bounties and hackathons for prize money. Use when user mentions "bounties", "hackathon", "earn money", "Superteam Earn", "prize money", "submissions", "freelance bounties", or asks to find paid opportunities. Covers discovery, eligibility filtering, content drafting, and submission workflows.
bootstrap-phase-workflow
Integrate the vibe/mature phase workflow into a project
bootstrap-auto
[Implementation] Bootstrap a new project automatically