python-data-analytics
Use when computing complex analytics, KPIs, cohort/funnel/retention metrics, financial math (IRR/NPV/amortization), statistical tests, anomaly detection, or geospatial analytics in Python — for cases where SQL alone gets unwieldy.
Best use case
python-data-analytics is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Use when computing complex analytics, KPIs, cohort/funnel/retention metrics, financial math (IRR/NPV/amortization), statistical tests, anomaly detection, or geospatial analytics in Python — for cases where SQL alone gets unwieldy.
Teams using python-data-analytics should expect a more consistent output, faster repeated execution, less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/python-data-analytics/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How python-data-analytics Compares
| Feature / Agent | python-data-analytics | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Use when computing complex analytics, KPIs, cohort/funnel/retention metrics, financial math (IRR/NPV/amortization), statistical tests, anomaly detection, or geospatial analytics in Python — for cases where SQL alone gets unwieldy.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
Related Guides
SKILL.md Source
# Python Data Analytics
Acknowledgement: Shared by Peter Bamuhigire, techguypeter.com, +256 784 464178.
<!-- dual-compat-start -->
## Use When
- Use when computing complex analytics, KPIs, cohort/funnel/retention metrics, financial math (IRR/NPV/amortization), statistical tests, anomaly detection, or geospatial analytics in Python — for cases where SQL alone gets unwieldy.
- The task needs reusable judgment, domain constraints, or a proven workflow rather than ad hoc advice.
## Do Not Use When
- The task is unrelated to `python-data-analytics` or would be better handled by a more specific companion skill.
- The request only needs a trivial answer and none of this skill's constraints or references materially help.
## Required Inputs
- Gather relevant project context, constraints, and the concrete problem to solve; load `references` only as needed.
- Confirm the desired deliverable: design, code, review, migration plan, audit, or documentation.
## Workflow
- Read this `SKILL.md` first, then load only the referenced deep-dive files that are necessary for the task.
- Apply the ordered guidance, checklists, and decision rules in this skill instead of cherry-picking isolated snippets.
- Produce the deliverable with assumptions, risks, and follow-up work made explicit when they matter.
## Quality Standards
- Keep outputs execution-oriented, concise, and aligned with the repository's baseline engineering standards.
- Preserve compatibility with existing project conventions unless the skill explicitly requires a stronger standard.
- Prefer deterministic, reviewable steps over vague advice or tool-specific magic.
## Anti-Patterns
- Treating examples as copy-paste truth without checking fit, constraints, or failure modes.
- Loading every reference file by default instead of using progressive disclosure.
## Outputs
- A concrete result that fits the task: implementation guidance, review findings, architecture decisions, templates, or generated artifacts.
- Clear assumptions, tradeoffs, or unresolved gaps when the task cannot be completed from available context alone.
- References used, companion skills, or follow-up actions when they materially improve execution.
## Evidence Produced
| Category | Artifact | Format | Example |
|----------|----------|--------|---------|
| Correctness | Analytics test plan | Markdown doc covering KPI computation, cohort/funnel/retention, and edge-case (empty / sparse) tests | `docs/python/analytics-tests.md` |
| Performance | Pandas/Polars performance note | Markdown doc covering DataFrame size, vectorisation choices, and memory footprint per query | `docs/python/analytics-perf-note.md` |
## References
- Use the `references/` directory for deep detail after reading the core workflow below.
<!-- dual-compat-end -->
Use Python when SQL stops being the right tool: multi-step transformations, statistical tests, time-series analysis, anomaly detection, complex financial math, or geospatial computation. Pandas / numpy / Polars as the compute engine; results feed back to PHP via FastAPI sidecar or into Excel/PDF via document-generation.
**Prerequisites:** Load `python-modern-standards` and `python-saas-integration` before this skill.
## When this skill applies
- Computing cohort retention, funnel conversion, churn curves.
- Financial math: IRR, NPV, amortization, depreciation schedules at scale.
- Accounting analytics: trial balance, AR/AP aging, bank reconciliation, inventory valuation, fixed-asset reconciliation, cost variance, and finance dashboard calculations.
- Statistical tests: A/B test significance, distribution comparisons, trend detection.
- Analytics method selection: descriptive, diagnostic, predictive, and prescriptive
analysis with explicit data-quality and decision-handoff controls.
- Outlier/anomaly detection on SaaS metrics.
- Geospatial analytics: territory stats, route distances, geofence compliance reports.
- Any workload where SQL requires >3 subqueries or a window function chain that's hard to read.
## When Python beats raw SQL (decision rule)
```text
Single-table aggregation, < 3 group-bys -> SQL
Joins + window functions, still readable -> SQL
Iterative calculation (running state, rolling) -> Python
Statistical test / hypothesis / distribution -> Python
Financial schedule (amortization, IRR) -> Python
Multi-source merge (DB + API + file) -> Python
Requires chart output -> Python
Result is a matrix / pivot > 10 columns -> Python
```
Rule of thumb: if explaining the SQL to a teammate takes more than 60 seconds, move to Python.
## Core stack
- **pandas 2.x** — default DataFrame library. Arrow-backed dtypes on 3.11+.
- **numpy** — array math, used by pandas under the hood.
- **Polars** — use for datasets > 1M rows or where pandas is too slow. Lazy API is great.
- **SQLAlchemy 2.x** — for DB access. Never mix with raw DB cursors in the same module.
- **Shapely + GeoPandas** — for geospatial work.
- **statsmodels + scipy** — statistical tests and time-series.
- **Decimal** (stdlib) — for all currency math; never `float`.
## Loading data
Use SQLAlchemy engine + `pandas.read_sql` with parameters. Chunk for large results.
```python
from sqlalchemy import create_engine, text
import pandas as pd
engine = create_engine(settings.database_url, pool_pre_ping=True)
sql = text("""
SELECT customer_id, invoice_date, total_amount
FROM invoices
WHERE tenant_id = :tenant_id
AND invoice_date >= :start
AND invoice_date < :end
""")
df = pd.read_sql(sql, engine, params={"tenant_id": tenant_id, "start": start, "end": end},
parse_dates=["invoice_date"])
```
For > 1M rows, chunked reads with `chunksize=100_000` and a generator pipeline. See `references/loading-data.md`.
## Pandas idioms (the ones that matter)
**Vectorize, don't loop.**
```python
# SLOW
df["total"] = df.apply(lambda r: r["qty"] * r["unit_price"], axis=1)
# FAST
df["total"] = df["qty"] * df["unit_price"]
```
**Avoid `inplace=True`.** Leads to chained assignment bugs in 2.x. Use explicit reassignment.
**Use dtypes on load.** `pd.read_sql(..., dtype={"customer_id": "int64"})`. Cast dates, categoricals, booleans early.
**`copy()` when slicing, always.** `df2 = df[df.status == "paid"].copy()` — prevents `SettingWithCopyWarning`.
**`groupby().agg()` with named columns** is clearer than dict form:
```python
summary = (
df.groupby("tenant_id", as_index=False)
.agg(
revenue=("total_amount", "sum"),
invoices=("id", "count"),
avg_invoice=("total_amount", "mean"),
)
)
```
See `references/pandas-idioms.md` for the full catalog (copy vs view, dtypes, datetime handling, common pitfalls).
## Cohort, funnel, retention
The three most-requested SaaS analytics patterns. Full recipes in `references/cohort-funnel-retention.md`. Summary here:
**Cohort retention:**
1. Assign each customer to their signup cohort (month).
2. For each month after signup, count active customers per cohort.
3. Divide by cohort size → retention %.
```python
customers["cohort"] = customers["signup_date"].dt.to_period("M")
activity = activity.merge(customers[["customer_id", "cohort"]], on="customer_id")
activity["period"] = activity["activity_date"].dt.to_period("M")
activity["offset"] = (activity["period"] - activity["cohort"]).apply(lambda x: x.n)
cohort_size = customers.groupby("cohort").size()
retention = (
activity.groupby(["cohort", "offset"])["customer_id"].nunique()
.unstack(fill_value=0)
.div(cohort_size, axis=0) * 100
)
```
**Funnel conversion:** sequential step counts → percentage drop-offs.
**Retention curves:** for each cohort, plot offset vs %retained — identify where users churn.
## Financial math — always Decimal, never float
Currency math is Decimal. Period.
```python
from decimal import Decimal, ROUND_HALF_UP
def amortization_schedule(principal: Decimal, annual_rate: Decimal, months: int) -> list[dict]:
monthly_rate = annual_rate / Decimal(12) / Decimal(100)
# Payment formula using Decimal throughout
...
```
Recipes for IRR, NPV, amortization, depreciation (straight-line, declining, units-of-production), and day-count conventions (30/360, actual/365) in `references/financial-math.md`. For accounting and bookkeeping analytics, load `references/accounting-finance-analytics.md` before calculating trial balances, aging schedules, reconciliations, cost variances, or finance dashboards.
## Statistical tests and anomalies
- **A/B test significance:** two-sample t-test, chi-square for proportions, Mann-Whitney U when non-normal.
- **Outliers:** z-score (thresholds: 2.5–3), IQR method (1.5×IQR), Modified Z-score with median.
- **Trend detection:** Mann-Kendall for monotonic trends; decompose seasonality with STL.
- **Anomaly detection:** rolling-window mean ± k·std; isolation forest when rules aren't enough.
See `references/statistics-and-anomalies.md` for choosing the right test per question, and `python-ml-predictive` for model-based anomaly detection.
## Geospatial analytics
Use Shapely for geometry, GeoPandas when you have many features. Coordinate reference systems matter — always tag CRS (EPSG:4326 for WGS84 lat/lng).
```python
import geopandas as gpd
from shapely.geometry import Point
deliveries = gpd.GeoDataFrame(
df, geometry=gpd.points_from_xy(df.lng, df.lat), crs="EPSG:4326"
).to_crs("EPSG:3857") # project to meters for distance
zones = gpd.read_file("zones.geojson").to_crs("EPSG:3857")
joined = gpd.sjoin(deliveries, zones, how="left", predicate="within")
```
Distance calculations in EPSG:3857 (meters), not 4326 (degrees — degrees ≠ distance). See `references/geospatial-analytics.md`.
## Performance — when to leave pandas
- **Rows < 100K:** pandas is fine.
- **Rows 100K–5M:** pandas works; profile and optimize dtypes (`category`, Arrow-backed).
- **Rows > 5M or memory tight:** switch to **Polars**. Its lazy API (`pl.scan_csv`/`scan_parquet`) beats pandas substantially on aggregations.
- **Truly large (> 100M rows):** this doesn't belong in a sidecar. Move to worker + chunked processing, or to a warehouse (ClickHouse / DuckDB / BigQuery) and let Python orchestrate.
Always profile with `cProfile` or `py-spy` before optimizing. See `references/performance-and-polars.md`.
## Output handoff
Analytics results don't live in Python — they return to PHP or become documents.
- **JSON back to PHP:** Pydantic models → FastAPI response. Decimal → string (IEEE754 will bite you).
- **Persist to MySQL:** `df.to_sql(...)` is OK for < 10K rows; beyond that, use `SQLAlchemy Core` with `INSERT ... ON DUPLICATE KEY UPDATE` and chunked inserts.
- **Feed into document-generation:** pass the DataFrame directly to the skill's Excel/Word/PDF generators. See `python-document-generation`.
## Pitfalls specific to SaaS analytics
- **Timezone:** store UTC, report in tenant-local. `df["created_at"].dt.tz_convert(tenant_tz)`.
- **Currency:** if invoices have mixed currencies, convert to a common reporting currency at the tenant's chosen rate (spot vs transaction-date rate — decide and document).
- **Dates vs datetimes:** trimming to midnight UTC can shift dates by a day in the user's TZ. Always round-trip through the tenant TZ before truncating.
- **Division by zero in rates:** guard every rate calc (`retention_rate = active / cohort_size if cohort_size else 0`).
- **Tenant isolation:** every DataFrame loaded must be filtered by tenant at SQL level. Never filter in Python after loading — you'll paginate the wrong data.
## Read next
- `python-document-generation` — render the DataFrame as a branded Excel dashboard or PDF.
- `python-ml-predictive` — when the question is "what happens next" instead of "what happened".
- `python-data-pipelines` — when data needs to be fetched/cleaned before analysis.
## References
- `references/analytics-method-selection-and-governance.md`
- `references/pandas-idioms.md`
- `references/loading-data.md`
- `references/cohort-funnel-retention.md`
- `references/financial-math.md`
- `references/accounting-finance-analytics.md`
- `references/statistics-and-anomalies.md`
- `references/geospatial-analytics.md`
- `references/performance-and-polars.md`Related Skills
saas-tenant-data-portability-and-erasure
Use when designing the GDPR/POPIA/CCPA-compliant data export (right to portability) and erasure (right to be forgotten) workflows for a multi-tenant SaaS — cascade through every data store including warehouse/backups, retention policy, requester verification, audit trail, multi-tenant nuances of erasing one tenant's data without affecting others, and the engineering for African market regulations (Uganda DPPA, Kenya DPA, POPIA).
python-modern-standards
Use when writing or reviewing any Python code in our SaaS projects — defines Python version, project layout, tooling (uv, ruff, mypy), typing, Pydantic v2, logging, configuration, async rules, error handling, testing, and security baseline. Load this before any other Python skill.
python-ml-predictive
Use when adding forecasting, classification, regression, or anomaly detection to a SaaS feature — demand/sales/cash-flow forecasting, churn and risk scoring, anomaly detection — with scikit-learn, Prophet, and statsmodels. Covers data prep, model serving, monitoring, and explainability.
python-data-pipelines
Use when building ETL jobs, document intelligence pipelines, OCR, PDF/Excel ingestion, image/media processing, or external-API sync pipelines in Python — idempotent scheduled jobs with validation, dead-letter queues, and multi-tenant isolation.
ios-data-persistence
iOS data persistence standards with SwiftData as primary local storage and custom API backends for cloud sync. Covers UserDefaults, Keychain, SwiftData (models, queries, relationships, migrations), file storage, offline-first architecture, and...
data-visualization
Data visualization patterns for transforming raw data into compelling visual stories, dashboards, charts, and decision-ready analytical exhibits.
database-reliability
Database reliability engineering: SLI/SLO design and error-budget policy for the data tier, blameless postmortems, escalation tiers and on-call hand-off, game days for MySQL/PostgreSQL, operational runbooks, change management, capacity planning, and backup verification. Use when setting up production database SRE practice, defining database SLOs/error budgets, running database postmortems, or hardening on-call for MySQL/PostgreSQL.
database-design-engineering
Use when designing or reviewing relational or document-backed data architecture for SaaS platforms, ERP systems, APIs, analytics stores, or mobile sync. Covers domain modeling, tenancy, indexing, migrations, integrity, retention, and performance tradeoffs.
android-data-persistence
Android data persistence standards with Room as primary local storage and custom API backends for cloud sync. Covers SharedPreferences, DataStore, Room (entities, DAOs, relations, migrations), file storage, offline-first architecture, and...
ai-analytics
Use when designing AI analytics, dashboards, SaaS AI metrics, NLP analytics, predictive analytics, or executive AI insight workflows. Orchestrates the former granular AI analytics skills as references.
web-app-security-audit
Use when auditing a PHP/JavaScript/HTML web application for security vulnerabilities. Covers configuration, authentication, authorization, input validation, XSS, API security, HTTP headers, and dependency scanning. Produces a severity-rated audit...
vibe-security-skill
Use when designing or reviewing security for a web application, API, or multi-tenant SaaS — produces threat model, abuse case list, auth/authz matrix, and secret handling plan; covers OWASP Top 10 2025 and the AI-code-generation blind spots. Neighbours — api-design-first owns auth model fields, deployment-release-engineering owns secret rotation choreography, ai-security and llm-security own model-specific threats.