python-data-analytics

Use when computing complex analytics, KPIs, cohort/funnel/retention metrics, financial math (IRR/NPV/amortization), statistical tests, anomaly detection, or geospatial analytics in Python — for cases where SQL alone gets unwieldy.

Best use case

python-data-analytics is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Use when computing complex analytics, KPIs, cohort/funnel/retention metrics, financial math (IRR/NPV/amortization), statistical tests, anomaly detection, or geospatial analytics in Python — for cases where SQL alone gets unwieldy.

Teams using python-data-analytics should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/python-data-analytics/SKILL.md --create-dirs "https://raw.githubusercontent.com/peterbamuhigire/skills-web-dev/main/skills/languages/python-data-analytics/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/python-data-analytics/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How python-data-analytics Compares

Feature / Agentpython-data-analyticsStandard Approach
Platform SupportNot specifiedLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

Use when computing complex analytics, KPIs, cohort/funnel/retention metrics, financial math (IRR/NPV/amortization), statistical tests, anomaly detection, or geospatial analytics in Python — for cases where SQL alone gets unwieldy.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

Related Guides

SKILL.md Source

# Python Data Analytics
Acknowledgement: Shared by Peter Bamuhigire, techguypeter.com, +256 784 464178.

<!-- dual-compat-start -->
## Use When

- Use when computing complex analytics, KPIs, cohort/funnel/retention metrics, financial math (IRR/NPV/amortization), statistical tests, anomaly detection, or geospatial analytics in Python — for cases where SQL alone gets unwieldy.
- The task needs reusable judgment, domain constraints, or a proven workflow rather than ad hoc advice.

## Do Not Use When

- The task is unrelated to `python-data-analytics` or would be better handled by a more specific companion skill.
- The request only needs a trivial answer and none of this skill's constraints or references materially help.

## Required Inputs

- Gather relevant project context, constraints, and the concrete problem to solve; load `references` only as needed.
- Confirm the desired deliverable: design, code, review, migration plan, audit, or documentation.

## Workflow

- Read this `SKILL.md` first, then load only the referenced deep-dive files that are necessary for the task.
- Apply the ordered guidance, checklists, and decision rules in this skill instead of cherry-picking isolated snippets.
- Produce the deliverable with assumptions, risks, and follow-up work made explicit when they matter.

## Quality Standards

- Keep outputs execution-oriented, concise, and aligned with the repository's baseline engineering standards.
- Preserve compatibility with existing project conventions unless the skill explicitly requires a stronger standard.
- Prefer deterministic, reviewable steps over vague advice or tool-specific magic.

## Anti-Patterns

- Treating examples as copy-paste truth without checking fit, constraints, or failure modes.
- Loading every reference file by default instead of using progressive disclosure.

## Outputs

- A concrete result that fits the task: implementation guidance, review findings, architecture decisions, templates, or generated artifacts.
- Clear assumptions, tradeoffs, or unresolved gaps when the task cannot be completed from available context alone.
- References used, companion skills, or follow-up actions when they materially improve execution.

## Evidence Produced

| Category | Artifact | Format | Example |
|----------|----------|--------|---------|
| Correctness | Analytics test plan | Markdown doc covering KPI computation, cohort/funnel/retention, and edge-case (empty / sparse) tests | `docs/python/analytics-tests.md` |
| Performance | Pandas/Polars performance note | Markdown doc covering DataFrame size, vectorisation choices, and memory footprint per query | `docs/python/analytics-perf-note.md` |

## References

- Use the `references/` directory for deep detail after reading the core workflow below.
<!-- dual-compat-end -->
Use Python when SQL stops being the right tool: multi-step transformations, statistical tests, time-series analysis, anomaly detection, complex financial math, or geospatial computation. Pandas / numpy / Polars as the compute engine; results feed back to PHP via FastAPI sidecar or into Excel/PDF via document-generation.

**Prerequisites:** Load `python-modern-standards` and `python-saas-integration` before this skill.

## When this skill applies

- Computing cohort retention, funnel conversion, churn curves.
- Financial math: IRR, NPV, amortization, depreciation schedules at scale.
- Accounting analytics: trial balance, AR/AP aging, bank reconciliation, inventory valuation, fixed-asset reconciliation, cost variance, and finance dashboard calculations.
- Statistical tests: A/B test significance, distribution comparisons, trend detection.
- Analytics method selection: descriptive, diagnostic, predictive, and prescriptive
  analysis with explicit data-quality and decision-handoff controls.
- Outlier/anomaly detection on SaaS metrics.
- Geospatial analytics: territory stats, route distances, geofence compliance reports.
- Any workload where SQL requires >3 subqueries or a window function chain that's hard to read.

## When Python beats raw SQL (decision rule)

```text
Single-table aggregation, < 3 group-bys       -> SQL
Joins + window functions, still readable       -> SQL
Iterative calculation (running state, rolling) -> Python
Statistical test / hypothesis / distribution  -> Python
Financial schedule (amortization, IRR)         -> Python
Multi-source merge (DB + API + file)           -> Python
Requires chart output                          -> Python
Result is a matrix / pivot > 10 columns        -> Python
```

Rule of thumb: if explaining the SQL to a teammate takes more than 60 seconds, move to Python.

## Core stack

- **pandas 2.x** — default DataFrame library. Arrow-backed dtypes on 3.11+.
- **numpy** — array math, used by pandas under the hood.
- **Polars** — use for datasets > 1M rows or where pandas is too slow. Lazy API is great.
- **SQLAlchemy 2.x** — for DB access. Never mix with raw DB cursors in the same module.
- **Shapely + GeoPandas** — for geospatial work.
- **statsmodels + scipy** — statistical tests and time-series.
- **Decimal** (stdlib) — for all currency math; never `float`.

## Loading data

Use SQLAlchemy engine + `pandas.read_sql` with parameters. Chunk for large results.

```python
from sqlalchemy import create_engine, text
import pandas as pd

engine = create_engine(settings.database_url, pool_pre_ping=True)

sql = text("""
    SELECT customer_id, invoice_date, total_amount
    FROM invoices
    WHERE tenant_id = :tenant_id
      AND invoice_date >= :start
      AND invoice_date <  :end
""")
df = pd.read_sql(sql, engine, params={"tenant_id": tenant_id, "start": start, "end": end},
                 parse_dates=["invoice_date"])
```

For > 1M rows, chunked reads with `chunksize=100_000` and a generator pipeline. See `references/loading-data.md`.

## Pandas idioms (the ones that matter)

**Vectorize, don't loop.**

```python
# SLOW
df["total"] = df.apply(lambda r: r["qty"] * r["unit_price"], axis=1)

# FAST
df["total"] = df["qty"] * df["unit_price"]
```

**Avoid `inplace=True`.** Leads to chained assignment bugs in 2.x. Use explicit reassignment.

**Use dtypes on load.** `pd.read_sql(..., dtype={"customer_id": "int64"})`. Cast dates, categoricals, booleans early.

**`copy()` when slicing, always.** `df2 = df[df.status == "paid"].copy()` — prevents `SettingWithCopyWarning`.

**`groupby().agg()` with named columns** is clearer than dict form:

```python
summary = (
    df.groupby("tenant_id", as_index=False)
      .agg(
          revenue=("total_amount", "sum"),
          invoices=("id", "count"),
          avg_invoice=("total_amount", "mean"),
      )
)
```

See `references/pandas-idioms.md` for the full catalog (copy vs view, dtypes, datetime handling, common pitfalls).

## Cohort, funnel, retention

The three most-requested SaaS analytics patterns. Full recipes in `references/cohort-funnel-retention.md`. Summary here:

**Cohort retention:**
1. Assign each customer to their signup cohort (month).
2. For each month after signup, count active customers per cohort.
3. Divide by cohort size → retention %.

```python
customers["cohort"] = customers["signup_date"].dt.to_period("M")
activity = activity.merge(customers[["customer_id", "cohort"]], on="customer_id")
activity["period"] = activity["activity_date"].dt.to_period("M")
activity["offset"] = (activity["period"] - activity["cohort"]).apply(lambda x: x.n)

cohort_size = customers.groupby("cohort").size()
retention = (
    activity.groupby(["cohort", "offset"])["customer_id"].nunique()
            .unstack(fill_value=0)
            .div(cohort_size, axis=0) * 100
)
```

**Funnel conversion:** sequential step counts → percentage drop-offs.

**Retention curves:** for each cohort, plot offset vs %retained — identify where users churn.

## Financial math — always Decimal, never float

Currency math is Decimal. Period.

```python
from decimal import Decimal, ROUND_HALF_UP

def amortization_schedule(principal: Decimal, annual_rate: Decimal, months: int) -> list[dict]:
    monthly_rate = annual_rate / Decimal(12) / Decimal(100)
    # Payment formula using Decimal throughout
    ...
```

Recipes for IRR, NPV, amortization, depreciation (straight-line, declining, units-of-production), and day-count conventions (30/360, actual/365) in `references/financial-math.md`. For accounting and bookkeeping analytics, load `references/accounting-finance-analytics.md` before calculating trial balances, aging schedules, reconciliations, cost variances, or finance dashboards.

## Statistical tests and anomalies

- **A/B test significance:** two-sample t-test, chi-square for proportions, Mann-Whitney U when non-normal.
- **Outliers:** z-score (thresholds: 2.5–3), IQR method (1.5×IQR), Modified Z-score with median.
- **Trend detection:** Mann-Kendall for monotonic trends; decompose seasonality with STL.
- **Anomaly detection:** rolling-window mean ± k·std; isolation forest when rules aren't enough.

See `references/statistics-and-anomalies.md` for choosing the right test per question, and `python-ml-predictive` for model-based anomaly detection.

## Geospatial analytics

Use Shapely for geometry, GeoPandas when you have many features. Coordinate reference systems matter — always tag CRS (EPSG:4326 for WGS84 lat/lng).

```python
import geopandas as gpd
from shapely.geometry import Point

deliveries = gpd.GeoDataFrame(
    df, geometry=gpd.points_from_xy(df.lng, df.lat), crs="EPSG:4326"
).to_crs("EPSG:3857")   # project to meters for distance

zones = gpd.read_file("zones.geojson").to_crs("EPSG:3857")
joined = gpd.sjoin(deliveries, zones, how="left", predicate="within")
```

Distance calculations in EPSG:3857 (meters), not 4326 (degrees — degrees ≠ distance). See `references/geospatial-analytics.md`.

## Performance — when to leave pandas

- **Rows < 100K:** pandas is fine.
- **Rows 100K–5M:** pandas works; profile and optimize dtypes (`category`, Arrow-backed).
- **Rows > 5M or memory tight:** switch to **Polars**. Its lazy API (`pl.scan_csv`/`scan_parquet`) beats pandas substantially on aggregations.
- **Truly large (> 100M rows):** this doesn't belong in a sidecar. Move to worker + chunked processing, or to a warehouse (ClickHouse / DuckDB / BigQuery) and let Python orchestrate.

Always profile with `cProfile` or `py-spy` before optimizing. See `references/performance-and-polars.md`.

## Output handoff

Analytics results don't live in Python — they return to PHP or become documents.

- **JSON back to PHP:** Pydantic models → FastAPI response. Decimal → string (IEEE754 will bite you).
- **Persist to MySQL:** `df.to_sql(...)` is OK for < 10K rows; beyond that, use `SQLAlchemy Core` with `INSERT ... ON DUPLICATE KEY UPDATE` and chunked inserts.
- **Feed into document-generation:** pass the DataFrame directly to the skill's Excel/Word/PDF generators. See `python-document-generation`.

## Pitfalls specific to SaaS analytics

- **Timezone:** store UTC, report in tenant-local. `df["created_at"].dt.tz_convert(tenant_tz)`.
- **Currency:** if invoices have mixed currencies, convert to a common reporting currency at the tenant's chosen rate (spot vs transaction-date rate — decide and document).
- **Dates vs datetimes:** trimming to midnight UTC can shift dates by a day in the user's TZ. Always round-trip through the tenant TZ before truncating.
- **Division by zero in rates:** guard every rate calc (`retention_rate = active / cohort_size if cohort_size else 0`).
- **Tenant isolation:** every DataFrame loaded must be filtered by tenant at SQL level. Never filter in Python after loading — you'll paginate the wrong data.

## Read next

- `python-document-generation` — render the DataFrame as a branded Excel dashboard or PDF.
- `python-ml-predictive` — when the question is "what happens next" instead of "what happened".
- `python-data-pipelines` — when data needs to be fetched/cleaned before analysis.

## References

- `references/analytics-method-selection-and-governance.md`
- `references/pandas-idioms.md`
- `references/loading-data.md`
- `references/cohort-funnel-retention.md`
- `references/financial-math.md`
- `references/accounting-finance-analytics.md`
- `references/statistics-and-anomalies.md`
- `references/geospatial-analytics.md`
- `references/performance-and-polars.md`

Related Skills

saas-tenant-data-portability-and-erasure

8
from peterbamuhigire/skills-web-dev

Use when designing the GDPR/POPIA/CCPA-compliant data export (right to portability) and erasure (right to be forgotten) workflows for a multi-tenant SaaS — cascade through every data store including warehouse/backups, retention policy, requester verification, audit trail, multi-tenant nuances of erasing one tenant's data without affecting others, and the engineering for African market regulations (Uganda DPPA, Kenya DPA, POPIA).

python-modern-standards

8
from peterbamuhigire/skills-web-dev

Use when writing or reviewing any Python code in our SaaS projects — defines Python version, project layout, tooling (uv, ruff, mypy), typing, Pydantic v2, logging, configuration, async rules, error handling, testing, and security baseline. Load this before any other Python skill.

python-ml-predictive

8
from peterbamuhigire/skills-web-dev

Use when adding forecasting, classification, regression, or anomaly detection to a SaaS feature — demand/sales/cash-flow forecasting, churn and risk scoring, anomaly detection — with scikit-learn, Prophet, and statsmodels. Covers data prep, model serving, monitoring, and explainability.

python-data-pipelines

8
from peterbamuhigire/skills-web-dev

Use when building ETL jobs, document intelligence pipelines, OCR, PDF/Excel ingestion, image/media processing, or external-API sync pipelines in Python — idempotent scheduled jobs with validation, dead-letter queues, and multi-tenant isolation.

ios-data-persistence

8
from peterbamuhigire/skills-web-dev

iOS data persistence standards with SwiftData as primary local storage and custom API backends for cloud sync. Covers UserDefaults, Keychain, SwiftData (models, queries, relationships, migrations), file storage, offline-first architecture, and...

data-visualization

8
from peterbamuhigire/skills-web-dev

Data visualization patterns for transforming raw data into compelling visual stories, dashboards, charts, and decision-ready analytical exhibits.

database-reliability

8
from peterbamuhigire/skills-web-dev

Database reliability engineering: SLI/SLO design and error-budget policy for the data tier, blameless postmortems, escalation tiers and on-call hand-off, game days for MySQL/PostgreSQL, operational runbooks, change management, capacity planning, and backup verification. Use when setting up production database SRE practice, defining database SLOs/error budgets, running database postmortems, or hardening on-call for MySQL/PostgreSQL.

database-design-engineering

8
from peterbamuhigire/skills-web-dev

Use when designing or reviewing relational or document-backed data architecture for SaaS platforms, ERP systems, APIs, analytics stores, or mobile sync. Covers domain modeling, tenancy, indexing, migrations, integrity, retention, and performance tradeoffs.

android-data-persistence

8
from peterbamuhigire/skills-web-dev

Android data persistence standards with Room as primary local storage and custom API backends for cloud sync. Covers SharedPreferences, DataStore, Room (entities, DAOs, relations, migrations), file storage, offline-first architecture, and...

ai-analytics

8
from peterbamuhigire/skills-web-dev

Use when designing AI analytics, dashboards, SaaS AI metrics, NLP analytics, predictive analytics, or executive AI insight workflows. Orchestrates the former granular AI analytics skills as references.

web-app-security-audit

8
from peterbamuhigire/skills-web-dev

Use when auditing a PHP/JavaScript/HTML web application for security vulnerabilities. Covers configuration, authentication, authorization, input validation, XSS, API security, HTTP headers, and dependency scanning. Produces a severity-rated audit...

vibe-security-skill

8
from peterbamuhigire/skills-web-dev

Use when designing or reviewing security for a web application, API, or multi-tenant SaaS — produces threat model, abuse case list, auth/authz matrix, and secret handling plan; covers OWASP Top 10 2025 and the AI-code-generation blind spots. Neighbours — api-design-first owns auth model fields, deployment-release-engineering owns secret rotation choreography, ai-security and llm-security own model-specific threats.