feature-engineering

Feature construction from market data for ML trading models including price, volume, on-chain, and microstructure features

by agiprolabs

Best use case

feature-engineering is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using feature-engineering should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/feature-engineering/SKILL.md --create-dirs "https://raw.githubusercontent.com/agiprolabs/claude-trading-skills/main/skills/feature-engineering/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/feature-engineering/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

Frequently Asked Questions

What does this skill do?

Feature construction from market data for ML trading models including price, volume, on-chain, and microstructure features

Where can I find the source code?

The source code is in the agiprolabs/claude-trading-skills repository on GitHub, under skills/feature-engineering.

SKILL.md Source

# Feature Engineering for Trading ML

Feature engineering is usually the highest-leverage activity in building ML trading
models. Model selection (XGBoost vs. neural net vs. logistic regression) typically
matters far less than the quality and diversity of input features. A simple model on
well-built features will generally outperform a complex model on raw prices.

This skill covers constructing, validating, and selecting features from market data
for use in classification (signal-classification) and regression models targeting
crypto/Solana token trading.

## Why Features Beat Models

Raw OHLCV data is non-stationary, noisy, and high-dimensional. Models trained
directly on price series will overfit. Feature engineering transforms raw data into
stationary, informative signals that capture distinct aspects of market behavior:

- **Compression**: Reduce thousands of price bars to dozens of descriptive statistics
- **Stationarity**: Convert non-stationary prices into stationary returns and ratios
- **Domain knowledge**: Encode trader intuition (support/resistance, volume climax)
  as computable quantities
- **Regime awareness**: Features that behave differently in trending vs. ranging
  markets help models adapt

## Feature Categories

### 1. Price Features

Derived purely from OHLCV price columns. These capture trend, momentum, and
volatility from the price series itself.

| Feature | Formula | Lookback |
|---------|---------|----------|
| `log_return` | `ln(close_t / close_{t-1})` | 1 bar |
| `abs_return` | `abs(log_return)` | 1 bar |
| `return_volatility` | `std(log_return, N)` | 20 bars |
| `momentum_N` | `close_t / close_{t-N} - 1` | 5, 10, 20 |
| `acceleration` | `momentum_5_t - momentum_5_{t-5}` | 10 bars |
| `high_low_range` | `(high - low) / close` | 1 bar |
| `close_position` | `(close - low) / (high - low)` | 1 bar |
| `gap` | `open_t / close_{t-1} - 1` | 1 bar |
| `rolling_skew` | `skew(log_return, N)` | 20 bars |
| `rolling_kurtosis` | `kurtosis(log_return, N)` | 20 bars |

### 2. Volume Features

Volume confirms or contradicts price movements. Divergences between price and
volume are among the most reliable signals in short-term trading.

| Feature | Formula | Lookback |
|---------|---------|----------|
| `volume_ratio` | `volume_t / mean(volume, N)` | 20 bars |
| `volume_ma_ratio` | `sma(volume, 5) / sma(volume, 20)` | 20 bars |
| `obv_slope` | `slope(OBV, N)` | 10 bars |
| `vwap_deviation` | `(close - VWAP) / VWAP` | intraday |
| `volume_acceleration` | `volume_ratio_t - volume_ratio_{t-1}` | 21 bars |
| `buy_volume_ratio` | `buy_volume / total_volume` | 1 bar |
| `dollar_volume` | `close * volume` | 1 bar |
| `volume_cv` | `std(volume, N) / mean(volume, N)` | 20 bars |

### 3. Technical Features

Standard technical indicators computed via `pandas-ta`. Use the `pandas-ta` skill
for full parameter documentation.

| Feature | Source | Lookback |
|---------|--------|----------|
| `rsi` | RSI(14) | 14 bars |
| `macd_histogram` | MACD(12,26,9) histogram | 33 bars |
| `bb_position` | `(close - BB_lower) / (BB_upper - BB_lower)` | 20 bars |
| `bb_width` | `(BB_upper - BB_lower) / BB_mid` | 20 bars |
| `atr_ratio` | `ATR(14) / close` | 14 bars |
| `adx` | ADX(14) | ~28 bars |
| `stoch_k` | Stochastic %K(14,3) | 14 bars |
| `cci` | CCI(20) | 20 bars |
| `mfi` | MFI(14) | 14 bars |
| `supertrend_direction` | Supertrend direction (+1/-1) | 10 bars |

### 4. Microstructure Features

Derived from trade-level data (individual swaps/transactions). Require on-chain
or DEX API data.

| Feature | Description |
|---------|-------------|
| `trade_count_ratio` | Trades this bar / avg trades per bar |
| `avg_trade_size` | Mean trade size in USD |
| `large_trade_pct` | % of volume from trades > $10k |
| `unique_traders` | Count of distinct wallet addresses |
| `buy_count_ratio` | Buy trades / total trades |
| `trade_size_entropy` | Shannon entropy of trade size distribution |
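
As a rough sketch, these features could be aggregated per bar from a trade-level DataFrame. The schema here (`timestamp`, `wallet`, `side`, `usd_value`), the bucket edges used for entropy, and the function names are all assumptions for illustration; real DEX API responses will need mapping into this shape first.

```python
import numpy as np
import pandas as pd

# Illustrative USD bucket edges for the entropy calculation (an assumption)
SIZE_EDGES = np.array([10, 100, 1_000, 10_000, 100_000])


def size_entropy(sizes: pd.Series) -> float:
    """Shannon entropy (bits) of trade sizes bucketed by SIZE_EDGES."""
    buckets = np.digitize(sizes.to_numpy(), SIZE_EDGES)
    counts = np.bincount(buckets, minlength=len(SIZE_EDGES) + 1)
    p = counts[counts > 0] / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))


def microstructure_features(trades: pd.DataFrame, freq: str = "1h",
                            large_usd: float = 10_000.0,
                            avg_window: int = 20) -> pd.DataFrame:
    """Aggregate trade-level rows into per-bar microstructure features."""
    rows = {}
    for bar_start, g in trades.groupby(pd.Grouper(key="timestamp", freq=freq)):
        if g.empty:
            continue  # bars with no trades are skipped
        total = g["usd_value"].sum()
        large = g.loc[g["usd_value"] > large_usd, "usd_value"].sum()
        rows[bar_start] = {
            "trade_count": len(g),
            "avg_trade_size": g["usd_value"].mean(),
            "large_trade_pct": large / total,
            "unique_traders": g["wallet"].nunique(),
            "buy_count_ratio": (g["side"] == "buy").mean(),
            "trade_size_entropy": size_entropy(g["usd_value"]),
        }
    out = pd.DataFrame.from_dict(rows, orient="index")
    # Compare against the average of prior bars only (shifted to exclude the current bar)
    prior_avg = out["trade_count"].rolling(avg_window, min_periods=1).mean().shift(1)
    out["trade_count_ratio"] = out["trade_count"] / prior_avg
    return out


# Tiny hand-made example
trades = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-01 00:05", "2024-01-01 00:20",
                                 "2024-01-01 00:40", "2024-01-01 01:10"]),
    "wallet": ["a", "b", "a", "c"],
    "side": ["buy", "sell", "buy", "buy"],
    "usd_value": [50.0, 20_000.0, 500.0, 5.0],
})
micro = microstructure_features(trades)
```

In the first bar, the three trades fall into three different size buckets, so the entropy is `log2(3)`; a bar with a single trade has entropy 0.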

### 5. On-Chain Features

Derived from blockchain state changes. Require Helius or Solana RPC data.

| Feature | Description |
|---------|-------------|
| `holder_count_change` | Change in unique holders over N periods |
| `whale_net_flow` | Net tokens moved by top-10 holders |
| `token_velocity` | Transfer volume / circulating supply |
| `liquidity_change` | Change in DEX liquidity pool TVL |

### 6. Cross-Asset Features

Capture relationships between the target token and broader market.

| Feature | Description |
|---------|-------------|
| `sol_correlation` | Rolling correlation with SOL price |
| `btc_beta` | Rolling beta to BTC returns |
| `sector_momentum` | Average return of tokens in same sector |
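
Rolling correlation and beta can be sketched with pandas rolling windows. The function names, the 60-bar window, and the assumption that all three close series share one aligned index are illustrative choices, not part of the original skill.

```python
import numpy as np
import pandas as pd


def rolling_beta(asset_ret: pd.Series, bench_ret: pd.Series, window: int = 60) -> pd.Series:
    """Rolling OLS beta: cov(asset, benchmark) / var(benchmark)."""
    cov = asset_ret.rolling(window).cov(bench_ret)
    var = bench_ret.rolling(window).var()
    return cov / var


def cross_asset_features(token_close: pd.Series, sol_close: pd.Series,
                         btc_close: pd.Series, window: int = 60) -> pd.DataFrame:
    """Correlation with SOL and beta to BTC, both computed on log returns."""
    token_ret = np.log(token_close / token_close.shift(1))
    sol_ret = np.log(sol_close / sol_close.shift(1))
    btc_ret = np.log(btc_close / btc_close.shift(1))
    return pd.DataFrame({
        "sol_correlation": token_ret.rolling(window).corr(sol_ret),
        "btc_beta": rolling_beta(token_ret, btc_ret, window),
    })
```

Both features are computed on returns rather than prices; correlating two trending price series directly tends to produce spuriously high values.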

### 7. Time Features

Cyclical encoding of calendar time. Use sin/cos encoding to preserve cyclical
continuity (hour 23 is close to hour 0).

```python
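import numpy as np

# `hour` (0-23) and `day` (0-6) come from the bar timestamps, e.g. df.index.hour
hour_sin = np.sin(2 * np.pi * hour / 24)
hour_cos = np.cos(2 * np.pi * hour / 24)
# Day of week needs both components; sin alone maps distinct days to the same value
day_sin = np.sin(2 * np.pi * day / 7)
day_cos = np.cos(2 * np.pi * day / 7)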
```

## Stationarity

**Non-stationary features will cause your model to fail on new data.** A feature
is stationary if its statistical properties (mean, variance) don't change over time.

### Testing for Stationarity

Use the Augmented Dickey-Fuller (ADF) test:

```python
from statsmodels.tsa.stattools import adfuller

result = adfuller(feature_series.dropna())
p_value = result[1]
is_stationary = p_value < 0.05
```

### Making Features Stationary

| Non-Stationary | Stationary Transform |
|----------------|---------------------|
| Price | Log return |
| Volume | Volume ratio (vol / avg vol) |
| OBV | OBV slope (regression coefficient) |
| Holder count | Holder count change |
| RSI | Already stationary (bounded 0-100) |
| Dollar volume | Dollar volume / rolling mean |

**Rule**: If a feature trends upward or downward over time, it is non-stationary.
Transform it into a ratio, difference, or rate of change.

## Normalization

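After computing features, normalize them so that all features have comparable
scales. This is critical for distance-based models (KNN, SVM). Tree models are
insensitive to monotonic rescaling, but rolling normalization can still help them
by removing slow drift in a feature's level.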

| Method | Formula | When to Use |
|--------|---------|-------------|
| Z-score | `(x - mean) / std` | Gaussian-like distributions |
| Min-max | `(x - min) / (max - min)` | Bounded features (RSI, BB position) |
| Rank | `rank(x) / len(x)` | Heavy-tailed distributions |

**Critical**: Use **rolling** statistics for normalization. Never use full-sample
mean/std — that introduces lookahead bias.

```python
# CORRECT: rolling z-score
z = (feature - feature.rolling(60).mean()) / feature.rolling(60).std()

# WRONG: full-sample z-score (lookahead bias!)
z = (feature - feature.mean()) / feature.std()
```
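
The min-max and rank methods from the table can be given the same rolling treatment. The sketch below is one way to do it; the function names and the zero-division guards are additions for illustration.

```python
import numpy as np
import pandas as pd


def rolling_zscore(x: pd.Series, window: int = 60) -> pd.Series:
    mean = x.rolling(window).mean()
    std = x.rolling(window).std().replace(0, np.nan)  # constant windows become NaN
    return (x - mean) / std


def rolling_minmax(x: pd.Series, window: int = 60) -> pd.Series:
    lo = x.rolling(window).min()
    hi = x.rolling(window).max()
    return (x - lo) / (hi - lo).replace(0, np.nan)


def rolling_rank(x: pd.Series, window: int = 60) -> pd.Series:
    # Fraction of the trailing window (current bar included) at or below the current value
    return x.rolling(window).apply(lambda w: (w <= w[-1]).mean(), raw=True)
```

Because each function only looks at the trailing window, the value at bar `t` is the same whether or not later bars exist, which is the property the full-sample version violates.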

## No-Lookahead Guarantee

The most dangerous bug in trading ML is lookahead bias — using future information
to compute features or targets. Follow these rules absolutely:

1. **Rolling calculations only**: Never use `.mean()` or `.std()` on the full
   series. Always use `.rolling(N).mean()`.
2. **Shift targets forward, not features backward**: The target is
   `close.shift(-N) / close - 1` (future return), not `close / close.shift(N) - 1`
   (past return used as target).
3. **No future index alignment**: When joining feature and target DataFrames,
   verify that feature row `t` is paired with target row `t` (where target already
   contains the forward shift).
4. **Train/test split by time**: Never random split. Always
   `train = data[:split_idx]`, `test = data[split_idx:]`.
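
Two of these rules lend themselves to small helpers, sketched here under stated assumptions. `time_split` adds an optional purge: with an N-bar forward target, the last N training rows have labels that depend on prices inside the test period, so dropping them is a common precaution (this goes slightly beyond rule 4 as written). `has_lookahead` is a hypothetical check that recomputes features on truncated data and flags any value that changes. Both function names are illustrative.

```python
import numpy as np
import pandas as pd


def time_split(data: pd.DataFrame, train_frac: float = 0.8, horizon: int = 0):
    """Chronological split; drops the last `horizon` train rows whose labels overlap the test period."""
    split_idx = int(len(data) * train_frac)
    train = data.iloc[: max(split_idx - horizon, 0)]
    test = data.iloc[split_idx:]
    return train, test


def has_lookahead(feature_fn, raw: pd.DataFrame, cutoff: int) -> bool:
    """Return True if features at or before `cutoff` change when later rows are removed."""
    full = feature_fn(raw).iloc[: cutoff + 1]
    truncated = feature_fn(raw.iloc[: cutoff + 1])
    same = np.allclose(full.to_numpy(dtype=float),
                       truncated.to_numpy(dtype=float), equal_nan=True)
    return not same


# Synthetic prices for demonstration only
rng = np.random.default_rng(0)
raw = pd.DataFrame({"close": 100 * np.exp(np.cumsum(rng.normal(0, 0.01, 200)))})


def safe_features(d: pd.DataFrame) -> pd.DataFrame:
    # Rolling volatility of log returns: uses past bars only
    return np.log(d[["close"]]).diff().rolling(20).std()


def leaky_features(d: pd.DataFrame) -> pd.DataFrame:
    # Full-sample z-score: every row depends on the whole series
    return (d[["close"]] - d[["close"]].mean()) / d[["close"]].std()


train, test = time_split(raw, train_frac=0.8, horizon=5)
```

Running `has_lookahead(leaky_features, raw, 150)` flags the full-sample z-score, while `safe_features` passes the same check.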

## Feature Selection

After computing many features, select the most predictive and least redundant:

### Step 1: Remove Low-Variance Features

```python
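from sklearn.feature_selection import VarianceThreshold

# Fit on the training window only; the threshold is scale-dependent,
# so apply it to features on comparable scales
selector = VarianceThreshold(threshold=0.01)
X_train_filtered = selector.fit_transform(X_train)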
```

### Step 2: Correlation Filter

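Remove features with > 0.9 absolute correlation to another feature. The snippet
below is the simple version: it drops the later column of each highly correlated
pair. To keep the member with the higher target correlation instead, compare each
pair's correlation with the target before choosing which column to drop.

```python
import numpy as np

corr_matrix = X.corr().abs()
# Keep only the upper triangle so each pair is checked once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if any(upper[col] > 0.9)]
X_reduced = X.drop(columns=to_drop)
```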
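
A target-aware variant of the correlation filter might look like the following. It is a greedy sketch, not the logic of `scripts/feature_importance.py`: the function name `drop_correlated` is hypothetical, and it assumes `X` and `y` come from the training window only.

```python
import pandas as pd


def drop_correlated(X: pd.DataFrame, y: pd.Series, threshold: float = 0.9) -> list:
    """Return columns to drop, keeping the more target-correlated member of each correlated group."""
    corr = X.corr().abs()
    target_corr = X.corrwith(y).abs()
    kept = []
    dropped = []
    # Visit features from most to least target-correlated; keep a feature
    # only if it is not too correlated with one that was already kept
    for col in target_corr.sort_values(ascending=False).index:
        if any(corr.loc[col, k] > threshold for k in kept):
            dropped.append(col)
        else:
            kept.append(col)
    return dropped
```

Usage would be along the lines of `X_train.drop(columns=drop_correlated(X_train, y_train))`, then applying the same column list to the test set.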

### Step 3: Feature Importance

Train a random forest and rank by importance:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
# Index by the training columns so names line up with the fitted model
importances = pd.Series(rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)
```

### Step 4: Mutual Information

Non-linear alternative to correlation:

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

mi = mutual_info_classif(X_train, y_train, random_state=42)
# Index by the training columns so scores line up with the inputs
mi_scores = pd.Series(mi, index=X_train.columns).sort_values(ascending=False)
```

## Label Creation

Labels (targets) define what the model learns to predict.

### Binary Classification

```python
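forward_return = close.shift(-N) / close - 1
label = (forward_return > threshold).astype(int)  # 1 = up, 0 = not up
# The last N bars have no future return; drop them rather than labeling them 0
label = label[forward_return.notna()]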
```

Typical thresholds: 1% for 1h bars, 3% for 4h bars, 5% for daily bars.

### Multi-Class Classification

```python
label = pd.cut(forward_return,
               bins=[-np.inf, -threshold, threshold, np.inf],
               labels=[0, 1, 2])  # 0=down, 1=flat, 2=up
```

### Regression

```python
target = forward_return  # Predict exact return magnitude
```

Binary classification is recommended for initial models — it's simpler and
more robust to noise.

## Integration with Other Skills

- **`pandas-ta`**: Compute technical indicators that become features
- **`birdeye-api`**: Fetch OHLCV and trade data for feature computation
- **`helius-api`**: Fetch on-chain data for holder/whale features
- **`signal-classification`**: Use engineered features as model inputs
- **`regime-detection`**: Regime labels as features or for regime-conditional models
- **`ohlcv-processing`**: Clean and resample raw data before feature computation

## Files

### References
- `references/feature_catalog.md` — Complete catalog of ~40 features with formulas,
  lookbacks, stationarity status, and interpretation notes
- `references/pitfalls.md` — Common mistakes in trading feature engineering:
  lookahead bias, overfitting, survivorship bias, data snooping, non-stationarity

### Scripts
- `scripts/build_features.py` — Compute 25+ features from OHLCV data with
  stationarity testing and quality reporting. Supports demo mode with synthetic data
  or live data via Birdeye API.
- `scripts/feature_importance.py` — Rank features by predictive power using
  tree-based importance and permutation importance. Identifies redundant features
  via correlation analysis.
