academic-benchmark-researcher
Use this skill when the user requests information about academic benchmarks, datasets, or research papers, particularly in machine learning, deep learning, or logical reasoning domains. This skill enables systematic research of academic benchmarks by searching web sources, downloading and analyzing arXiv papers, extracting key metadata (number of tasks, training availability, difficulty levels), and compiling comparative summaries. It triggers on requests involving dataset comparisons, benchmark analysis, or academic paper research aimed at building comparison tables.
Best use case
academic-benchmark-researcher is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using academic-benchmark-researcher should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it at .claude/skills/academic-benchmark-researcher/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How academic-benchmark-researcher Compares
| Feature / Agent | academic-benchmark-researcher | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
SKILL.md Source
# Instructions
## Primary Objective
Systematically research academic benchmarks, datasets, or research papers to extract and compile comparative information (e.g., into a summary table). The core workflow has three steps: (1) identifying relevant sources, (2) extracting key metadata, and (3) synthesizing the findings into a structured output (such as a LaTeX table).
## Core Workflow
1. **Clarify & Parse Request:** Identify the specific benchmarks/datasets/papers mentioned by the user. Note any required output format (e.g., LaTeX table with specific columns) and constraints (e.g., "no commented lines").
2. **Initial Information Gathering:** For each identified entity (dataset/paper):
* Use `local-web_search` to find general information, official pages (GitHub, project sites), and relevant arXiv IDs.
* For arXiv papers, use `arxiv_local-download_paper` or `fetch-fetch_markdown` to obtain the paper content.
* Search for specific attributes requested by the user (e.g., "number of tasks," "training set," "difficulty levels").
3. **Deep Dive & Verification:** Read paper abstracts, introductions, and methodology sections (using `arxiv_local-read_paper` or parsed markdown) to confirm key details. Cross-reference information from multiple sources (official site, paper, blog posts) for accuracy.
4. **Information Synthesis:** Compile the extracted metadata into a structured format aligned with the user's request. Resolve any ambiguities (e.g., if a "task" count refers to broad categories or individual instances) based on the most authoritative source (typically the original paper).
5. **Output Generation:** Create the final deliverable (e.g., a `.tex` file). Ensure it strictly adheres to the user's formatting specifications. Optionally, provide a concise textual summary of the findings.
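Step 4's rule of resolving disagreements in favor of the most authoritative source can be sketched as a small helper. This is a minimal illustration, not part of the skill's tooling: the source labels (`paper`, `official_site`, `blog`), their ranking, and the name `resolve_attribute` are assumptions chosen for the example.

```python
from __future__ import annotations

from typing import Optional

# Hypothetical source labels, ordered from most to least authoritative
# (assumption: original paper > official project page/repo > blog post).
SOURCE_PRIORITY = ["paper", "official_site", "blog"]


def resolve_attribute(
    observations: dict[str, Optional[str]],
) -> tuple[Optional[str], Optional[str], bool]:
    """Pick one value for an attribute from per-source observations.

    Returns (chosen value, source it came from, whether any other source disagreed).
    """
    chosen_value, chosen_source = None, None
    for source in SOURCE_PRIORITY:
        value = observations.get(source)
        if value is not None:
            chosen_value, chosen_source = value, source
            break
    # Flag disagreements so the summary can mention them explicitly.
    conflicts = any(v is not None and v != chosen_value for v in observations.values())
    return chosen_value, chosen_source, conflicts
```

Returning the conflict flag, rather than silently discarding lower-ranked values, makes it easy to note in the final summary when sources disagreed.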
## Key Metadata to Extract
When researching a benchmark/dataset, prioritize finding:
* **Full Name & Acronym**
* **Number of Tasks/Categories:** Distinguish between broad task categories and individual task instances.
* **Training Data Availability:** Does it include a dedicated training set, or is it for evaluation only?
* **Difficulty Levels:** Does it feature adjustable or tiered difficulty levels?
* **Core Purpose/Description**
* **Primary Source (arXiv ID, GitHub repo)**
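One way to keep these fields consistent across benchmarks is a small record type in which `None` means "not explicitly stated in any source". The class name, field names, and the `missing_fields` helper below are hypothetical; treat this as a sketch of the bookkeeping, not a required schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BenchmarkRecord:
    """Metadata for one benchmark; None means 'not explicitly stated in sources'."""
    full_name: str
    acronym: str
    num_tasks: Optional[int] = None
    task_count_unit: str = "tasks"  # e.g. "task categories" vs individual "tasks"
    has_training_set: Optional[bool] = None
    has_difficulty_levels: Optional[bool] = None
    description: str = ""
    arxiv_id: Optional[str] = None
    github_repo: Optional[str] = None
    notes: list[str] = field(default_factory=list)  # assumptions to repeat in the summary

    def missing_fields(self) -> list[str]:
        """Key attributes that still need a source or an explicitly stated assumption."""
        missing = [
            name
            for name in ("num_tasks", "has_training_set", "has_difficulty_levels")
            if getattr(self, name) is None
        ]
        # Either an arXiv ID or a GitHub repo counts as a primary source.
        if self.arxiv_id is None and self.github_repo is None:
            missing.append("primary_source")
        return missing
```

Keeping `task_count_unit` next to `num_tasks` preserves the distinction between broad categories and individual task instances.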
## Tool Usage Guidelines
* `local-web_search`: Use for initial discovery and finding high-level descriptions. Employ specific queries combining the dataset name and target attributes (e.g., "BBH training set few-shot examples").
* `arxiv_local-download_paper` / `fetch-fetch_markdown`: Use to access the canonical source for detailed information. Prefer `arxiv_local-download_paper` when full-text analysis is needed.
* `filesystem-write_file` / `filesystem-read_file`: Use for creating and verifying final output files in the workspace.
* `local-claim_done`: Use only after successfully delivering the requested output and providing a final summary.
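The discovery step usually means building targeted queries and then pulling arXiv identifiers out of search results so the paper (or its abstract page) can be fetched. The helpers below are a sketch under stated assumptions: they only recognize new-style arXiv IDs (`YYMM.NNNN` or `YYMM.NNNNN`, optional `vN` suffix), ignore old-style IDs such as `cs/0112017`, and the function names are hypothetical.

```python
from __future__ import annotations

import re

# New-style arXiv IDs: two-digit year, month 01-12, a dot, 4 or 5 digits, optional version.
# Assumption: old-style IDs (e.g. "cs/0112017") are out of scope for this sketch.
ARXIV_ID_RE = re.compile(r"\b(\d{2}(?:0[1-9]|1[0-2])\.\d{4,5})(?:v\d+)?\b")


def build_queries(dataset: str, attributes: list[str]) -> list[str]:
    """One targeted search query per attribute, e.g. 'BBH training set'."""
    return [f"{dataset} {attr}" for attr in attributes]


def extract_arxiv_ids(text: str) -> list[str]:
    """Unique arXiv IDs (version suffix dropped) in order of first appearance."""
    seen: list[str] = []
    for match in ARXIV_ID_RE.finditer(text):
        arxiv_id = match.group(1)
        if arxiv_id not in seen:
            seen.append(arxiv_id)
    return seen


def abstract_url(arxiv_id: str) -> str:
    """arXiv abstract page, usable as the target of a markdown-fetch fallback."""
    return f"https://arxiv.org/abs/{arxiv_id}"
```

Dropping the version suffix lets mentions of `2401.01234v2` and `2401.01234` collapse to one paper; fetch a specific version only if the user asks for it.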
## Output Standards
* **LaTeX Tables:** Ensure the output contains only the specified table content, without extra comments, document headers, or unrelated text.
* **Summaries:** Be concise but complete, highlighting the sourced information for each dataset.
* **Accuracy:** Base conclusions on the original paper or official project documentation where possible. Acknowledge if information is not explicitly stated.
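The table standard, together with the missing-information rule below, can be sketched as a renderer that emits only a `tabular` environment, with no preamble and no `%` comment lines. Assumptions: the column set is illustrative (use the user's requested columns), the user's preamble loads `pifont` (for `\ding{51}`/`\ding{55}`) and `booktabs` (for the rules), and all function names are hypothetical.

```python
from __future__ import annotations

from typing import Optional

# Characters with special meaning in LaTeX text mode and their escaped forms.
LATEX_ESCAPES = {
    "\\": r"\textbackslash{}", "&": r"\&", "%": r"\%", "$": r"\$",
    "#": r"\#", "_": r"\_", "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}


def escape_latex(text: str) -> str:
    """Escape special characters one at a time, so nothing is escaped twice."""
    return "".join(LATEX_ESCAPES.get(ch, ch) for ch in text)


def mark(flag: Optional[bool]) -> str:
    # \ding{51} is a check mark and \ding{55} a cross (pifont). An unstated
    # attribute (None) is rendered as a cross; state that assumption in the summary.
    return r"\ding{51}" if flag else r"\ding{55}"


def render_table(rows: list[tuple[str, Optional[int], Optional[bool], Optional[bool]]]) -> str:
    """Rows are (name, task count, has training set, has difficulty levels)."""
    lines = [
        r"\begin{tabular}{lccc}",
        r"\toprule",
        r"Benchmark & \#Tasks & Train Set & Difficulty Levels \\",
        r"\midrule",
    ]
    for name, tasks, train, levels in rows:
        task_cell = str(tasks) if tasks is not None else "--"
        lines.append(f"{escape_latex(name)} & {task_cell} & {mark(train)} & {mark(levels)} \\\\")
    lines += [r"\bottomrule", r"\end{tabular}"]
    return "\n".join(lines)
```

Writing the returned string directly with `filesystem-write_file` keeps the `.tex` file free of headers and comments, as the standard requires.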
## Common Pitfalls & Resolutions
* **Ambiguous Task Counts:** If a paper reports "5 task categories" (as KOR-Bench does), use that as the task count unless the user specifies otherwise. Clarify in the summary if needed.
* **Missing Information:** If a key attribute (e.g., training set) is not mentioned in primary sources, infer based on benchmark type (e.g., many evaluation benchmarks lack training sets) and denote with `\ding{55}`. State the assumption in your summary.
* **arXiv Paper Processing:** If `arxiv_local-download_paper` returns a "converting" status, use `fetch-fetch_markdown` on the arXiv abstract page as a reliable fallback to get the paper's metadata and abstract.