hugging-face-dataset-viewer
Query Hugging Face datasets through the Dataset Viewer API for splits, rows, search, filters, and parquet links.
About this skill
This skill lets AI agents perform read-only exploration and extraction from Hugging Face datasets using the Dataset Viewer API. It provides a structured way to access dataset metadata (configurations and splits), preview initial rows, paginate through content, and run text searches within datasets. Agents can also retrieve direct links to parquet files for heavier data processing. This makes on-demand data inspection and validation possible without downloading entire datasets, which is useful for data scientists, researchers, and developers working with Hugging Face's large collection of public datasets.
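For instance, validating that a dataset is viewable before deeper work is a single unauthenticated GET; a minimal sketch using the endpoint documented in the SKILL.md source below:

```bash
# Check whether the Dataset Viewer can serve this dataset; for public
# datasets the response carries per-capability flags (e.g. viewer, search).
curl -s "https://datasets-server.huggingface.co/is-valid?dataset=squad"
```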
Best use case
- Exploring unknown or newly encountered Hugging Face datasets to understand their content and structure.
- Validating the availability and configurations of a specific dataset before deeper analysis.
- Previewing initial dataset content to quickly grasp its schema, data types, and typical entries.
- Retrieving specific data rows for quick insights, examples, or to debug data-related issues.
- Searching for particular keywords or patterns within dataset content to find relevant information.
- Obtaining direct links to parquet files for programmatic download or integration into external data pipelines.
The AI agent will successfully retrieve requested dataset information, such as available splits, configurations, initial rows, paginated content, search results matching specific queries, or direct parquet file links. This enables the agent to provide accurate, on-demand data insights to the user without needing to manually browse the Hugging Face website or download full datasets.
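For example, a keyword search against a single split is one GET request; a minimal sketch, with the endpoint and parameters as listed in the SKILL.md source further down:

```bash
# Search the string columns of squad's validation split for "Michigan".
# offset/length paginate the matches; length is capped at 100 per request.
curl -s "https://datasets-server.huggingface.co/search?dataset=squad&config=plain_text&split=validation&query=Michigan&offset=0&length=10"
```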
Practical example
Example input
Can you show me the first 5 rows of the 'squad' dataset, specifically for the 'plain_text' configuration and 'validation' split? Also, list all available splits for this dataset.
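Behind the scenes, this request maps onto two Dataset Viewer calls, sketched here with curl (endpoints per the SKILL.md source below; `/first-rows` returns a fixed preview, so the agent trims the result to 5 rows client-side):

```bash
# 1. List all config/split pairs for 'squad'.
curl -s "https://datasets-server.huggingface.co/splits?dataset=squad"

# 2. Preview the requested config/split; the response's "rows" array is
#    then truncated to the first 5 entries.
curl -s "https://datasets-server.huggingface.co/first-rows?dataset=squad&config=plain_text&split=validation"
```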
Example output
```json
{
  "message": "Successfully retrieved splits and first 5 rows for the 'squad' dataset.",
  "splits": [
    {"config": "plain_text", "split": "train"},
    {"config": "plain_text", "split": "validation"}
  ],
  "first_rows": [
    {
      "id": "56be4db0acb8001400a502ee",
      "title": "University_of_Michigan",
      "context": "The University of Michigan is a public research university in Ann Arbor, Michigan.",
      "question": "In what city is the University of Michigan?",
      "answers": {"text": ["Ann Arbor"], "answer_start": [62]}
    },
    {
      "id": "56be4db0acb8001400a502ef",
      "title": "University_of_Michigan",
      "context": "It is the state's oldest university and the flagship institution of the University of Michigan system.",
      "question": "What is the flagship institution of the University of Michigan system?",
      "answers": {"text": ["University of Michigan"], "answer_start": [72]}
    }
    // ... (up to 5 rows)
  ]
}
```
When to use this skill
- Use this skill when your AI agent needs to perform read-only exploration of a Hugging Face dataset, retrieve specific data points, validate dataset structure, or search for content directly through the Dataset Viewer API. It is ideal for agents requiring quick access and inspection of public datasets without full download or complex setup.
When not to use this skill
- This skill is not suitable for modifying Hugging Face datasets, uploading new data, training machine learning models directly (though it can provide the data), or performing complex, custom data transformations that require more than simple filtering or pagination. It is strictly for read-only data access and exploration.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/hugging-face-dataset-viewer/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
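These steps can also be scripted; a minimal sketch, assuming a POSIX shell, with the raw GitHub URL left as a placeholder since the page only says to download SKILL.md from GitHub:

```bash
# Create the skill directory inside your project and fetch SKILL.md.
# <owner>/<repo> is a placeholder: use the repository linked at the
# top of this page.
mkdir -p .claude/skills/hugging-face-dataset-viewer
curl -fsSL "https://raw.githubusercontent.com/<owner>/<repo>/main/SKILL.md" \
  -o .claude/skills/hugging-face-dataset-viewer/SKILL.md
```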
How hugging-face-dataset-viewer Compares
| Feature / Agent | hugging-face-dataset-viewer | Standard Approach |
|---|---|---|
| Platform Support | Claude | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | easy | N/A |
Frequently Asked Questions
What does this skill do?
Query Hugging Face datasets through the Dataset Viewer API for splits, rows, search, filters, and parquet links.
Which AI agents support this skill?
This skill is designed for Claude.
How difficult is it to install?
The installation complexity is rated as easy. You can find the installation instructions above.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
Related Guides
Best AI Skills for Claude
Explore the best AI skills for Claude and Claude Code across coding, research, workflow automation, documentation, and agent operations.
AI Agents for Coding
Browse AI agent skills for coding, debugging, testing, refactoring, code review, and developer workflows across Claude, Cursor, and Codex.
ChatGPT vs Claude for Agent Skills
Compare ChatGPT and Claude for AI agent skills across coding, writing, research, and reusable workflow execution.
SKILL.md Source
# Hugging Face Dataset Viewer

## When to Use

Use this skill when you need read-only exploration of a Hugging Face dataset through the Dataset Viewer API. Use this skill to execute read-only Dataset Viewer API calls for dataset exploration and extraction.

## Core workflow

1. Optionally validate dataset availability with `/is-valid`.
2. Resolve `config` + `split` with `/splits`.
3. Preview with `/first-rows`.
4. Paginate content with `/rows` using `offset` and `length` (max 100).
5. Use `/search` for text matching and `/filter` for row predicates.
6. Retrieve parquet links via `/parquet` and totals/metadata via `/size` and `/statistics`.

## Defaults

- Base URL: `https://datasets-server.huggingface.co`
- Default API method: `GET`
- Query params should be URL-encoded.
- `offset` is 0-based.
- `length` max is usually `100` for row-like endpoints.
- Gated/private datasets require `Authorization: Bearer <HF_TOKEN>`.

## Dataset Viewer

- `Validate dataset`: `/is-valid?dataset=<namespace/repo>`
- `List subsets and splits`: `/splits?dataset=<namespace/repo>`
- `Preview first rows`: `/first-rows?dataset=<namespace/repo>&config=<config>&split=<split>`
- `Paginate rows`: `/rows?dataset=<namespace/repo>&config=<config>&split=<split>&offset=<int>&length=<int>`
- `Search text`: `/search?dataset=<namespace/repo>&config=<config>&split=<split>&query=<text>&offset=<int>&length=<int>`
- `Filter with predicates`: `/filter?dataset=<namespace/repo>&config=<config>&split=<split>&where=<predicate>&orderby=<sort>&offset=<int>&length=<int>`
- `List parquet shards`: `/parquet?dataset=<namespace/repo>`
- `Get size totals`: `/size?dataset=<namespace/repo>`
- `Get column statistics`: `/statistics?dataset=<namespace/repo>&config=<config>&split=<split>`
- `Get Croissant metadata (if available)`: `/croissant?dataset=<namespace/repo>`

Pagination pattern:

```bash
curl "https://datasets-server.huggingface.co/rows?dataset=stanfordnlp/imdb&config=plain_text&split=train&offset=0&length=100"
curl "https://datasets-server.huggingface.co/rows?dataset=stanfordnlp/imdb&config=plain_text&split=train&offset=100&length=100"
```

When pagination is partial, use response fields such as `num_rows_total`, `num_rows_per_page`, and `partial` to drive continuation logic.

Search/filter notes:

- `/search` matches string columns (full-text style behavior is internal to the API).
- `/filter` requires predicate syntax in `where` and optional sort in `orderby`.
- Keep filtering and searches read-only and side-effect free.

## Querying Datasets

Use `npx parquetlens` with Hub parquet alias paths for SQL querying.

Parquet alias shape:

```text
hf://datasets/<namespace>/<repo>@~parquet/<config>/<split>/<shard>.parquet
```

Derive `<config>`, `<split>`, and `<shard>` from Dataset Viewer `/parquet`:

```bash
curl -s "https://datasets-server.huggingface.co/parquet?dataset=cfahlgren1/hub-stats" \
  | jq -r '.parquet_files[] | "hf://datasets/\(.dataset)@~parquet/\(.config)/\(.split)/\(.filename)"'
```

Run SQL query:

```bash
npx -y -p parquetlens -p @parquetlens/sql parquetlens \
  "hf://datasets/<namespace>/<repo>@~parquet/<config>/<split>/<shard>.parquet" \
  --sql "SELECT * FROM data LIMIT 20"
```

### SQL export

- CSV: `--sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.csv' (FORMAT CSV, HEADER, DELIMITER ',')"`
- JSON: `--sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.json' (FORMAT JSON)"`
- Parquet: `--sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.parquet' (FORMAT PARQUET)"`

## Creating and Uploading Datasets

Use one of these flows depending on dependency constraints.

Zero local dependencies (Hub UI):

- Create dataset repo in browser: `https://huggingface.co/new-dataset`
- Upload parquet files in the repo "Files and versions" page.
- Verify shards appear in Dataset Viewer:

```bash
curl -s "https://datasets-server.huggingface.co/parquet?dataset=<namespace>/<repo>"
```

Low dependency CLI flow (`npx @huggingface/hub` / `hfjs`):

- Set auth token:

```bash
export HF_TOKEN=<your_hf_token>
```

- Upload parquet folder to a dataset repo (auto-creates repo if missing):

```bash
npx -y @huggingface/hub upload datasets/<namespace>/<repo> ./local/parquet-folder data
```

- Upload as private repo on creation:

```bash
npx -y @huggingface/hub upload datasets/<namespace>/<repo> ./local/parquet-folder data --private
```

After upload, call `/parquet` to discover `<config>/<split>/<shard>` values for querying with `@~parquet`.
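As a worked example of the continuation logic described in the SKILL.md above, here is a minimal sketch of a pagination loop driven by `num_rows_total` (assumes `jq` is available; it ignores the `partial` flag and does no retries):

```bash
#!/usr/bin/env bash
# Page through a split 100 rows at a time until num_rows_total is reached.
BASE="https://datasets-server.huggingface.co/rows"
DATASET="stanfordnlp/imdb"; CONFIG="plain_text"; SPLIT="train"

# Read the total row count from the first page's metadata.
total=$(curl -s "$BASE?dataset=$DATASET&config=$CONFIG&split=$SPLIT&offset=0&length=1" \
  | jq -r '.num_rows_total')

offset=0
while [ "$offset" -lt "$total" ]; do
  curl -s "$BASE?dataset=$DATASET&config=$CONFIG&split=$SPLIT&offset=$offset&length=100" \
    | jq -c '.rows[]'   # emit one JSON object per row wrapper
  offset=$((offset + 100))
done
```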
Related Skills
hugging-face-vision-trainer
Train or fine-tune vision models on Hugging Face Jobs for detection, classification, and SAM or SAM2 segmentation.
hugging-face-trackio
Track ML experiments with Trackio using Python logging, alerts, and CLI metric retrieval.
hugging-face-tool-builder
Your purpose is to create reusable command line scripts and utilities for using the Hugging Face API, allowing chaining, piping, and intermediate processing where helpful. You can access the API directly, as well as use the hf command line tool.
hugging-face-papers
Read and analyze Hugging Face paper pages or arXiv papers with markdown and papers API metadata.
hugging-face-paper-publisher
Publish and manage research papers on Hugging Face Hub. Supports creating paper pages, linking papers to models/datasets, claiming authorship, and generating professional markdown-based research articles.
hugging-face-model-trainer
Train or fine-tune TRL language models on Hugging Face Jobs, including SFT, DPO, GRPO, and GGUF export.
hugging-face-jobs
Run workloads on Hugging Face Jobs with managed CPUs, GPUs, TPUs, secrets, and Hub persistence.
hugging-face-evaluation
Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from Artificial Analysis API, and running custom model evaluations with vLLM/lighteval. Works with the model-index metadata format.
hugging-face-datasets
Create and manage datasets on Hugging Face Hub. Supports initializing repos, defining configs/system prompts, streaming row updates, and SQL-based dataset querying/transformation. Designed to work alongside HF MCP server for comprehensive dataset workflows.
hugging-face-community-evals
Run local evaluations for Hugging Face Hub models with inspect-ai or lighteval.
hugging-face-cli
Use the Hugging Face Hub CLI (`hf`) to download, upload, and manage models, datasets, and Spaces.
code-reviewer
Elite code review expert specializing in modern AI-powered code