python-data-analytics

Use when computing complex analytics, KPIs, cohort/funnel/retention metrics, financial math (IRR/NPV/amortization), statistical tests, anomaly detection, or geospatial analytics in Python — for cases where SQL alone gets unwieldy.

8 stars

bypeterbamuhigire

View on GitHub Installation ↓

Best use case

python-data-analytics is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using python-data-analytics should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/python-data-analytics/SKILL.md --create-dirs "https://raw.githubusercontent.com/peterbamuhigire/skills-web-dev/main/skills/languages/python-data-analytics/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/python-data-analytics/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How python-data-analytics Compares

Feature / Agent	python-data-analytics	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

Related Guides

AI Agents for Coding

Browse AI agent skills for coding, debugging, testing, refactoring, code review, and developer workflows across Claude, Cursor, and Codex.

SKILL.md Source

# Python Data Analytics
Acknowledgement: Shared by Peter Bamuhigire, techguypeter.com, +256 784 464178.

<!-- dual-compat-start -->
## Use When

- Use when computing complex analytics, KPIs, cohort/funnel/retention metrics, financial math (IRR/NPV/amortization), statistical tests, anomaly detection, or geospatial analytics in Python — for cases where SQL alone gets unwieldy.
- The task needs reusable judgment, domain constraints, or a proven workflow rather than ad hoc advice.

## Do Not Use When

- The task is unrelated to `python-data-analytics` or would be better handled by a more specific companion skill.
- The request only needs a trivial answer and none of this skill's constraints or references materially help.

## Required Inputs

- Gather relevant project context, constraints, and the concrete problem to solve; load `references` only as needed.
- Confirm the desired deliverable: design, code, review, migration plan, audit, or documentation.

## Workflow

- Read this `SKILL.md` first, then load only the referenced deep-dive files that are necessary for the task.
- Apply the ordered guidance, checklists, and decision rules in this skill instead of cherry-picking isolated snippets.
- Produce the deliverable with assumptions, risks, and follow-up work made explicit when they matter.

## Quality Standards

- Keep outputs execution-oriented, concise, and aligned with the repository's baseline engineering standards.
- Preserve compatibility with existing project conventions unless the skill explicitly requires a stronger standard.
- Prefer deterministic, reviewable steps over vague advice or tool-specific magic.

## Anti-Patterns

- Treating examples as copy-paste truth without checking fit, constraints, or failure modes.
- Loading every reference file by default instead of using progressive disclosure.

## Outputs

- A concrete result that fits the task: implementation guidance, review findings, architecture decisions, templates, or generated artifacts.
- Clear assumptions, tradeoffs, or unresolved gaps when the task cannot be completed from available context alone.
- References used, companion skills, or follow-up actions when they materially improve execution.

## Evidence Produced

| Category | Artifact | Format | Example |
|----------|----------|--------|---------|
| Correctness | Analytics test plan | Markdown doc covering KPI computation, cohort/funnel/retention, and edge-case (empty / sparse) tests | `docs/python/analytics-tests.md` |
| Performance | Pandas/Polars performance note | Markdown doc covering DataFrame size, vectorisation choices, and memory footprint per query | `docs/python/analytics-perf-note.md` |

## References

- Use the `references/` directory for deep detail after reading the core workflow below.
<!-- dual-compat-end -->
Use Python when SQL stops being the right tool: multi-step transformations, statistical tests, time-series analysis, anomaly detection, complex financial math, or geospatial computation. Pandas / numpy / Polars as the compute engine; results feed back to PHP via FastAPI sidecar or into Excel/PDF via document-generation.

**Prerequisites:** Load `python-modern-standards` and `python-saas-integration` before this skill.

## When this skill applies

- Computing cohort retention, funnel conversion, churn curves.
- Financial math: IRR, NPV, amortization, depreciation schedules at scale.
- Accounting analytics: trial balance, AR/AP aging, bank reconciliation, inventory valuation, fixed-asset reconciliation, cost variance, and finance dashboard calculations.
- Statistical tests: A/B test significance, distribution comparisons, trend detection.
- Analytics method selection: descriptive, diagnostic, predictive, and prescriptive
  analysis with explicit data-quality and decision-handoff controls.
- Outlier/anomaly detection on SaaS metrics.
- Geospatial analytics: territory stats, route distances, geofence compliance reports.
- Any workload where SQL requires >3 subqueries or a window function chain that's hard to read.

## When Python beats raw SQL (decision rule)

```text
Single-table aggregation, < 3 group-bys       -> SQL
Joins + window functions, still readable       -> SQL
Iterative calculation (running state, rolling) -> Python
Statistical test / hypothesis / distribution  -> Python
Financial schedule (amortization, IRR)         -> Python
Multi-source merge (DB + API + file)           -> Python
Requires chart output                          -> Python
Result is a matrix / pivot > 10 columns        -> Python
```

Rule of thumb: if explaining the SQL to a teammate takes more than 60 seconds, move to Python.

## Core stack

- **pandas 2.x** — default DataFrame library. Arrow-backed dtypes on 3.11+.
- **numpy** — array math, used by pandas under the hood.
- **Polars** — use for datasets > 1M rows or where pandas is too slow. Lazy API is great.
- **SQLAlchemy 2.x** — for DB access. Never mix with raw DB cursors in the same module.
- **Shapely + GeoPandas** — for geospatial work.
- **statsmodels + scipy** — statistical tests and time-series.
- **Decimal** (stdlib) — for all currency math; never `float`.

## Loading data

Use SQLAlchemy engine + `pandas.read_sql` with parameters. Chunk for large results.

```python
from sqlalchemy import create_engine, text
import pandas as pd

engine = create_engine(settings.database_url, pool_pre_ping=True)

sql = text("""
    SELECT customer_id, invoice_date, total_amount
    FROM invoices
    WHERE tenant_id = :tenant_id
      AND invoice_date >= :start
      AND invoice_date <  :end
""")
df = pd.read_sql(sql, engine, params={"tenant_id": tenant_id, "start": start, "end": end},
                 parse_dates=["invoice_date"])
```

For > 1M rows, chunked reads with `chunksize=100_000` and a generator pipeline. See `references/loading-data.md`.

## Pandas idioms (the ones that matter)

**Vectorize, don't loop.**

```python
# SLOW
df["total"] = df.apply(lambda r: r["qty"] * r["unit_price"], axis=1)

# FAST
df["total"] = df["qty"] * df["unit_price"]
```

**Avoid `inplace=True`.** Leads to chained assignment bugs in 2.x. Use explicit reassignment.

**Use dtypes on load.** `pd.read_sql(..., dtype={"customer_id": "int64"})`. Cast dates, categoricals, booleans early.

**`copy()` when slicing, always.** `df2 = df[df.status == "paid"].copy()` — prevents `SettingWithCopyWarning`.

**`groupby().agg()` with named columns** is clearer than dict form:

```python
summary = (
    df.groupby("tenant_id", as_index=False)
      .agg(
          revenue=("total_amount", "sum"),
          invoices=("id", "count"),
          avg_invoice=("total_amount", "mean"),
      )
)
```

See `references/pandas-idioms.md` for the full catalog (copy vs view, dtypes, datetime handling, common pitfalls).

## Cohort, funnel, retention

The three most-requested SaaS analytics patterns. Full recipes in `references/cohort-funnel-retention.md`. Summary here:

**Cohort retention:**
1. Assign each customer to their signup cohort (month).
2. For each month after signup, count active customers per cohort.
3. Divide by cohort size → retention %.

```python
customers["cohort"] = customers["signup_date"].dt.to_period("M")
activity = activity.merge(customers[["customer_id", "cohort"]], on="customer_id")
activity["period"] = activity["activity_date"].dt.to_period("M")
activity["offset"] = (activity["period"] - activity["cohort"]).apply(lambda x: x.n)

cohort_size = customers.groupby("cohort").size()
retention = (
    activity.groupby(["cohort", "offset"])["customer_id"].nunique()
            .unstack(fill_value=0)
            .div(cohort_size, axis=0) * 100
)
```

**Funnel conversion:** sequential step counts → percentage drop-offs.

**Retention curves:** for each cohort, plot offset vs %retained — identify where users churn.

## Financial math — always Decimal, never float

Currency math is Decimal. Period.

```python
from decimal import Decimal, ROUND_HALF_UP

def amortization_schedule(principal: Decimal, annual_rate: Decimal, months: int) -> list[dict]:
    monthly_rate = annual_rate / Decimal(12) / Decimal(100)
    # Payment formula using Decimal throughout
    ...
```

Recipes for IRR, NPV, amortization, depreciation (straight-line, declining, units-of-production), and day-count conventions (30/360, actual/365) in `references/financial-math.md`. For accounting and bookkeeping analytics, load `references/accounting-finance-analytics.md` before calculating trial balances, aging schedules, reconciliations, cost variances, or finance dashboards.

## Statistical tests and anomalies

- **A/B test significance:** two-sample t-test, chi-square for proportions, Mann-Whitney U when non-normal.
- **Outliers:** z-score (thresholds: 2.5–3), IQR method (1.5×IQR), Modified Z-score with median.
- **Trend detection:** Mann-Kendall for monotonic trends; decompose seasonality with STL.
- **Anomaly detection:** rolling-window mean ± k·std; isolation forest when rules aren't enough.

See `references/statistics-and-anomalies.md` for choosing the right test per question, and `python-ml-predictive` for model-based anomaly detection.

## Geospatial analytics

Use Shapely for geometry, GeoPandas when you have many features. Coordinate reference systems matter — always tag CRS (EPSG:4326 for WGS84 lat/lng).

```python
import geopandas as gpd
from shapely.geometry import Point

deliveries = gpd.GeoDataFrame(
    df, geometry=gpd.points_from_xy(df.lng, df.lat), crs="EPSG:4326"
).to_crs("EPSG:3857")   # project to meters for distance

zones = gpd.read_file("zones.geojson").to_crs("EPSG:3857")
joined = gpd.sjoin(deliveries, zones, how="left", predicate="within")
```

Distance calculations in EPSG:3857 (meters), not 4326 (degrees — degrees ≠ distance). See `references/geospatial-analytics.md`.

## Performance — when to leave pandas

- **Rows < 100K:** pandas is fine.
- **Rows 100K–5M:** pandas works; profile and optimize dtypes (`category`, Arrow-backed).
- **Rows > 5M or memory tight:** switch to **Polars**. Its lazy API (`pl.scan_csv`/`scan_parquet`) beats pandas substantially on aggregations.
- **Truly large (> 100M rows):** this doesn't belong in a sidecar. Move to worker + chunked processing, or to a warehouse (ClickHouse / DuckDB / BigQuery) and let Python orchestrate.

Always profile with `cProfile` or `py-spy` before optimizing. See `references/performance-and-polars.md`.

## Output handoff

Analytics results don't live in Python — they return to PHP or become documents.

- **JSON back to PHP:** Pydantic models → FastAPI response. Decimal → string (IEEE754 will bite you).
- **Persist to MySQL:** `df.to_sql(...)` is OK for < 10K rows; beyond that, use `SQLAlchemy Core` with `INSERT ... ON DUPLICATE KEY UPDATE` and chunked inserts.
- **Feed into document-generation:** pass the DataFrame directly to the skill's Excel/Word/PDF generators. See `python-document-generation`.

## Pitfalls specific to SaaS analytics

- **Timezone:** store UTC, report in tenant-local. `df["created_at"].dt.tz_convert(tenant_tz)`.
- **Currency:** if invoices have mixed currencies, convert to a common reporting currency at the tenant's chosen rate (spot vs transaction-date rate — decide and document).
- **Dates vs datetimes:** trimming to midnight UTC can shift dates by a day in the user's TZ. Always round-trip through the tenant TZ before truncating.
- **Division by zero in rates:** guard every rate calc (`retention_rate = active / cohort_size if cohort_size else 0`).
- **Tenant isolation:** every DataFrame loaded must be filtered by tenant at SQL level. Never filter in Python after loading — you'll paginate the wrong data.

## Read next

- `python-document-generation` — render the DataFrame as a branded Excel dashboard or PDF.
- `python-ml-predictive` — when the question is "what happens next" instead of "what happened".
- `python-data-pipelines` — when data needs to be fetched/cleaned before analysis.

## References

- `references/analytics-method-selection-and-governance.md`
- `references/pandas-idioms.md`
- `references/loading-data.md`
- `references/cohort-funnel-retention.md`
- `references/financial-math.md`
- `references/accounting-finance-analytics.md`
- `references/statistics-and-anomalies.md`
- `references/geospatial-analytics.md`
- `references/performance-and-polars.md`

Related Skills

saas-tenant-data-portability-and-erasure

from peterbamuhigire/skills-web-dev

Use when designing the GDPR/POPIA/CCPA-compliant data export (right to portability) and erasure (right to be forgotten) workflows for a multi-tenant SaaS — cascade through every data store including warehouse/backups, retention policy, requester verification, audit trail, multi-tenant nuances of erasing one tenant's data without affecting others, and the engineering for African market regulations (Uganda DPPA, Kenya DPA, POPIA).

python-modern-standards

from peterbamuhigire/skills-web-dev

Use when writing or reviewing any Python code in our SaaS projects — defines Python version, project layout, tooling (uv, ruff, mypy), typing, Pydantic v2, logging, configuration, async rules, error handling, testing, and security baseline. Load this before any other Python skill.

python-ml-predictive

from peterbamuhigire/skills-web-dev

Use when adding forecasting, classification, regression, or anomaly detection to a SaaS feature — demand/sales/cash-flow forecasting, churn and risk scoring, anomaly detection — with scikit-learn, Prophet, and statsmodels. Covers data prep, model serving, monitoring, and explainability.

python-data-pipelines

from peterbamuhigire/skills-web-dev

Use when building ETL jobs, document intelligence pipelines, OCR, PDF/Excel ingestion, image/media processing, or external-API sync pipelines in Python — idempotent scheduled jobs with validation, dead-letter queues, and multi-tenant isolation.

ios-data-persistence

from peterbamuhigire/skills-web-dev

iOS data persistence standards with SwiftData as primary local storage and custom API backends for cloud sync. Covers UserDefaults, Keychain, SwiftData (models, queries, relationships, migrations), file storage, offline-first architecture, and...

data-visualization

from peterbamuhigire/skills-web-dev

Data visualization patterns for transforming raw data into compelling visual stories, dashboards, charts, and decision-ready analytical exhibits.

database-reliability

from peterbamuhigire/skills-web-dev

Database reliability engineering: SLI/SLO design and error-budget policy for the data tier, blameless postmortems, escalation tiers and on-call hand-off, game days for MySQL/PostgreSQL, operational runbooks, change management, capacity planning, and backup verification. Use when setting up production database SRE practice, defining database SLOs/error budgets, running database postmortems, or hardening on-call for MySQL/PostgreSQL.

database-design-engineering

from peterbamuhigire/skills-web-dev

Use when designing or reviewing relational or document-backed data architecture for SaaS platforms, ERP systems, APIs, analytics stores, or mobile sync. Covers domain modeling, tenancy, indexing, migrations, integrity, retention, and performance tradeoffs.

android-data-persistence

from peterbamuhigire/skills-web-dev

Android data persistence standards with Room as primary local storage and custom API backends for cloud sync. Covers SharedPreferences, DataStore, Room (entities, DAOs, relations, migrations), file storage, offline-first architecture, and...

ai-analytics

from peterbamuhigire/skills-web-dev

Use when designing AI analytics, dashboards, SaaS AI metrics, NLP analytics, predictive analytics, or executive AI insight workflows. Orchestrates the former granular AI analytics skills as references.

web-app-security-audit

from peterbamuhigire/skills-web-dev

Use when auditing a PHP/JavaScript/HTML web application for security vulnerabilities. Covers configuration, authentication, authorization, input validation, XSS, API security, HTTP headers, and dependency scanning. Produces a severity-rated audit...

vibe-security-skill

from peterbamuhigire/skills-web-dev

Use when designing or reviewing security for a web application, API, or multi-tenant SaaS — produces threat model, abuse case list, auth/authz matrix, and secret handling plan; covers OWASP Top 10 2025 and the AI-code-generation blind spots. Neighbours — api-design-first owns auth model fields, deployment-release-engineering owns secret rotation choreography, ai-security and llm-security own model-specific threats.