Lambda Labs GPU Cloud

Comprehensive guide to running ML workloads on Lambda Labs GPU cloud with on-demand instances and 1-Click Clusters.

Best use case

Lambda Labs GPU Cloud is best used when you need a repeatable AI agent workflow rather than a one-off prompt.

Teams using Lambda Labs GPU Cloud should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

curl -o ~/.claude/skills/lambda-labs/SKILL.md --create-dirs "https://raw.githubusercontent.com/ComeOnOliver/skillshub/main/skills/Orchestra-Research/AI-Research-SKILLs/lambda-labs/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/lambda-labs/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill


Frequently Asked Questions

What does this skill do?

Comprehensive guide to running ML workloads on Lambda Labs GPU cloud with on-demand instances and 1-Click Clusters.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Lambda Labs GPU Cloud

Comprehensive guide to running ML workloads on Lambda Labs GPU cloud with on-demand instances and 1-Click Clusters.

## When to use Lambda Labs

**Use Lambda Labs when:**
- Need dedicated GPU instances with full SSH access
- Running long training jobs (hours to days)
- Want simple pricing with no egress fees
- Need persistent storage across sessions
- Require high-performance multi-node clusters (16-512 GPUs)
- Want pre-installed ML stack (Lambda Stack with PyTorch, CUDA, NCCL)

**Key features:**
- **GPU variety**: B200, H100, GH200, A100, A10, A6000, V100
- **Lambda Stack**: Pre-installed PyTorch, TensorFlow, CUDA, cuDNN, NCCL
- **Persistent filesystems**: Keep data across instance restarts
- **1-Click Clusters**: 16-512 GPU Slurm clusters with InfiniBand
- **Simple pricing**: Pay-per-minute, no egress fees
- **Global regions**: 12+ regions worldwide

**Use alternatives instead:**
- **Modal**: For serverless, auto-scaling workloads
- **SkyPilot**: For multi-cloud orchestration and cost optimization
- **RunPod**: For cheaper spot instances and serverless endpoints
- **Vast.ai**: For GPU marketplace with lowest prices

## Quick start

### Account setup

1. Create account at https://lambda.ai
2. Add payment method
3. Generate API key from dashboard
4. Add SSH key (required before launching instances)
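
The API key is sent as the username of an HTTP Basic auth header with an empty password (the same scheme the `curl -u $LAMBDA_API_KEY:` examples below use). A stdlib-only sketch for building that header; the demo key and the commented-out request are illustrative:

```python
import base64

def lambda_auth_header(api_key: str) -> str:
    """HTTP Basic auth header for the Lambda Cloud API: key as user, empty password."""
    token = base64.b64encode(f"{api_key}:".encode()).decode()
    return f"Basic {token}"

# Example with a hypothetical key:
hdr = lambda_auth_header("sk-demo")

# To verify a real key, attach the header to any authenticated endpoint, e.g.:
# import urllib.request
# req = urllib.request.Request(
#     "https://cloud.lambdalabs.com/api/v1/instance-types",
#     headers={"Authorization": hdr},
# )
```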

### Launch via console

1. Go to https://cloud.lambda.ai/instances
2. Click "Launch instance"
3. Select GPU type and region
4. Choose SSH key
5. Optionally attach filesystem
6. Launch and wait 3-15 minutes

### Connect via SSH

```bash
# Get instance IP from console
ssh ubuntu@<INSTANCE-IP>

# Or with specific key
ssh -i ~/.ssh/lambda_key ubuntu@<INSTANCE-IP>
```

## GPU instances

### Available GPUs

| GPU | VRAM | Price/GPU/hr | Best For |
|-----|------|--------------|----------|
| B200 SXM6 | 180 GB | $4.99 | Largest models, fastest training |
| H100 SXM | 80 GB | $2.99-3.29 | Large model training |
| H100 PCIe | 80 GB | $2.49 | Cost-effective H100 |
| GH200 | 96 GB | $1.49 | Single-GPU large models |
| A100 80GB | 80 GB | $1.79 | Production training |
| A100 40GB | 40 GB | $1.29 | Standard training |
| A10 | 24 GB | $0.75 | Inference, fine-tuning |
| A6000 | 48 GB | $0.80 | Good VRAM/price ratio |
| V100 | 16 GB | $0.55 | Budget training |
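
The table above makes rough run-cost estimates straightforward: multiply the per-GPU hourly rate by GPU count and hours. A minimal sketch; `PRICE_PER_GPU_HR` and `estimate_cost` are illustrative helpers (a subset of the table, not a Lambda API), and rates change over time:

```python
# Per-GPU hourly prices, copied from a subset of the table above (USD).
PRICE_PER_GPU_HR = {
    "H100 SXM": 3.29,
    "A100 40GB": 1.29,
    "A10": 0.75,
}

def estimate_cost(gpu: str, num_gpus: int, hours: float) -> float:
    """Estimated cost in USD for a run at the listed on-demand rate."""
    return PRICE_PER_GPU_HR[gpu] * num_gpus * hours

# Example: an 8x H100 SXM instance for a 24-hour training run
print(f"${estimate_cost('H100 SXM', 8, 24):.2f}")  # → $631.68
```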

### Instance configurations

```
8x GPU: Best for distributed training (DDP, FSDP)
4x GPU: Large models, multi-GPU training
2x GPU: Medium workloads
1x GPU: Fine-tuning, inference, development
```

### Launch times

- Single-GPU: 3-5 minutes
- Multi-GPU: 10-15 minutes
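
Because boot can take several minutes, automation scripts typically poll until the instance reports `active` before attempting SSH. A minimal sketch: `get_status` is a placeholder callable (e.g. a wrapper around the `list_instances` API call shown later), not part of any SDK:

```python
import time

def wait_until_active(get_status, timeout_s=1200, poll_s=30):
    """Poll get_status() until it returns 'active' or the timeout expires.

    get_status: zero-argument callable returning the instance's status
    string (e.g. extracted from the API's list-instances response).
    Returns True once active, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if get_status() == "active":
            return True
        time.sleep(poll_s)
    return False
```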

## Lambda Stack

All instances come with Lambda Stack pre-installed:

- Ubuntu 22.04 LTS
- NVIDIA drivers (latest)
- CUDA 12.x
- cuDNN 8.x
- NCCL (for multi-GPU)
- PyTorch (latest)
- TensorFlow (latest)
- JAX
- JupyterLab

### Verify installation

```bash
# Check GPU
nvidia-smi

# Check PyTorch
python -c "import torch; print(torch.cuda.is_available())"

# Check CUDA version
nvcc --version
```

## Python API

### Installation

```bash
pip install lambda-cloud-client
```

### Authentication

```python
import os
import lambda_cloud_client

# Configure with API key
configuration = lambda_cloud_client.Configuration(
    host="https://cloud.lambdalabs.com/api/v1",
    access_token=os.environ["LAMBDA_API_KEY"]
)
```

### List available instances

```python
with lambda_cloud_client.ApiClient(configuration) as api_client:
    api = lambda_cloud_client.DefaultApi(api_client)

    # Get available instance types
    types = api.instance_types()
    for name, info in types.data.items():
        print(f"{name}: {info.instance_type.description}")
```

### Launch instance

```python
from lambda_cloud_client.models import LaunchInstanceRequest

request = LaunchInstanceRequest(
    region_name="us-west-1",
    instance_type_name="gpu_1x_h100_sxm5",
    ssh_key_names=["my-ssh-key"],
    file_system_names=["my-filesystem"],  # Optional
    name="training-job"
)

response = api.launch_instance(request)
instance_id = response.data.instance_ids[0]
print(f"Launched: {instance_id}")
```

### List running instances

```python
instances = api.list_instances()
for instance in instances.data:
    print(f"{instance.name}: {instance.ip} ({instance.status})")
```

### Terminate instance

```python
from lambda_cloud_client.models import TerminateInstanceRequest

request = TerminateInstanceRequest(
    instance_ids=[instance_id]
)
api.terminate_instance(request)
```

### SSH key management

```python
from lambda_cloud_client.models import AddSshKeyRequest

# Add SSH key
request = AddSshKeyRequest(
    name="my-key",
    public_key="ssh-rsa AAAA..."
)
api.add_ssh_key(request)

# List keys
keys = api.list_ssh_keys()

# Delete key
api.delete_ssh_key(key_id)
```

## CLI with curl

### List instance types

```bash
curl -u $LAMBDA_API_KEY: \
  https://cloud.lambdalabs.com/api/v1/instance-types | jq
```

### Launch instance

```bash
curl -u $LAMBDA_API_KEY: \
  -X POST https://cloud.lambdalabs.com/api/v1/instance-operations/launch \
  -H "Content-Type: application/json" \
  -d '{
    "region_name": "us-west-1",
    "instance_type_name": "gpu_1x_h100_sxm5",
    "ssh_key_names": ["my-key"]
  }' | jq
```

### Terminate instance

```bash
curl -u $LAMBDA_API_KEY: \
  -X POST https://cloud.lambdalabs.com/api/v1/instance-operations/terminate \
  -H "Content-Type: application/json" \
  -d '{"instance_ids": ["<INSTANCE-ID>"]}' | jq
```

## Persistent storage

### Filesystems

Filesystems persist data across instance restarts:

```bash
# Mount location
/lambda/nfs/<FILESYSTEM_NAME>

# Example: save checkpoints
python train.py --checkpoint-dir /lambda/nfs/my-storage/checkpoints
```

### Create filesystem

1. Go to Storage in Lambda console
2. Click "Create filesystem"
3. Select region (must match instance region)
4. Name and create

### Attach to instance

Filesystems must be attached at instance launch time:
- Via console: Select filesystem when launching
- Via API: Include `file_system_names` in launch request

### Best practices

```bash
# Store on filesystem (persists)
/lambda/nfs/storage/
  ├── datasets/
  ├── checkpoints/
  ├── models/
  └── outputs/

# Local SSD (faster, ephemeral)
/home/ubuntu/
  └── working/  # Temporary files
```

## SSH configuration

### Add SSH key

```bash
# Generate key locally
ssh-keygen -t ed25519 -f ~/.ssh/lambda_key

# Add public key to Lambda console
# Or via API
```

### Multiple keys

```bash
# On instance, add more keys
echo 'ssh-rsa AAAA...' >> ~/.ssh/authorized_keys
```

### Import from GitHub

```bash
# On instance
ssh-import-id gh:username
```

### SSH tunneling

```bash
# Forward Jupyter
ssh -L 8888:localhost:8888 ubuntu@<IP>

# Forward TensorBoard
ssh -L 6006:localhost:6006 ubuntu@<IP>

# Multiple ports
ssh -L 8888:localhost:8888 -L 6006:localhost:6006 ubuntu@<IP>
```

## JupyterLab

### Launch from console

1. Go to Instances page
2. Click "Launch" in Cloud IDE column
3. JupyterLab opens in browser

### Manual access

```bash
# On instance
jupyter lab --ip=0.0.0.0 --port=8888

# From local machine with tunnel
ssh -L 8888:localhost:8888 ubuntu@<IP>
# Open http://localhost:8888
```

## Training workflows

### Single-GPU training

```bash
# SSH to instance
ssh ubuntu@<IP>

# Clone repo
git clone https://github.com/user/project
cd project

# Install dependencies
pip install -r requirements.txt

# Train
python train.py --epochs 100 --checkpoint-dir /lambda/nfs/storage/checkpoints
```

### Multi-GPU training (single node)

```python
# train_ddp.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    device = rank % torch.cuda.device_count()

    model = MyModel().to(device)
    model = DDP(model, device_ids=[device])

    # Training loop...

if __name__ == "__main__":
    main()
```

```bash
# Launch with torchrun (8 GPUs)
torchrun --nproc_per_node=8 train_ddp.py
```

### Checkpoint to filesystem

```python
import os

checkpoint_dir = "/lambda/nfs/my-storage/checkpoints"
os.makedirs(checkpoint_dir, exist_ok=True)

# Save checkpoint
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
}, f"{checkpoint_dir}/checkpoint_{epoch}.pt")
```
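
To resume after an interruption, load the most recent checkpoint back from the filesystem. A sketch under the `checkpoint_<epoch>.pt` naming used above; `find_latest_checkpoint` and `resume` are hypothetical helpers, and the `torch` import is deferred so the file-scanning part has no GPU dependencies:

```python
import os
import re

def find_latest_checkpoint(checkpoint_dir):
    """Return (path, epoch) for the highest-numbered checkpoint_<epoch>.pt, or (None, -1)."""
    pattern = re.compile(r"checkpoint_(\d+)\.pt")
    best_path, best_epoch = None, -1
    for name in os.listdir(checkpoint_dir):
        m = pattern.fullmatch(name)
        if m and int(m.group(1)) > best_epoch:
            best_epoch = int(m.group(1))
            best_path = os.path.join(checkpoint_dir, name)
    return best_path, best_epoch

def resume(model, optimizer, checkpoint_dir):
    """Load the latest checkpoint into model/optimizer; return the epoch to start from."""
    import torch  # deferred: the scan above works without torch installed
    path, _ = find_latest_checkpoint(checkpoint_dir)
    if path is None:
        return 0  # no checkpoint yet, start from scratch
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model_state_dict"])
    optimizer.load_state_dict(state["optimizer_state_dict"])
    return state["epoch"] + 1
```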

## 1-Click Clusters

### Overview

High-performance Slurm clusters with:
- 16-512 NVIDIA H100 or B200 GPUs
- NVIDIA Quantum-2 400 Gb/s InfiniBand
- GPUDirect RDMA at 3200 Gb/s
- Pre-installed distributed ML stack

### Included software

- Ubuntu 22.04 LTS + Lambda Stack
- NCCL, Open MPI
- PyTorch with DDP and FSDP
- TensorFlow
- OFED drivers

### Storage

- 24 TB NVMe per compute node (ephemeral)
- Lambda filesystems for persistent data

### Multi-node training

```bash
# On Slurm cluster
srun --nodes=4 --ntasks-per-node=8 --gpus-per-node=8 \
  torchrun --nnodes=4 --nproc_per_node=8 \
  --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29500 \
  train.py
```

## Networking

### Bandwidth

- Inter-instance (same region): up to 200 Gbps
- Internet outbound: 20 Gbps max

### Firewall

- Default: Only port 22 (SSH) open
- Configure additional ports in Lambda console
- ICMP traffic allowed by default

### Private IPs

```bash
# Find private IP
ip addr show | grep 'inet '
```

## Common workflows

### Workflow 1: Fine-tuning LLM

```bash
# 1. Launch 8x H100 instance with filesystem

# 2. SSH and setup
ssh ubuntu@<IP>
pip install transformers accelerate peft

# 3. Download model to filesystem
python -c "
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-hf')
model.save_pretrained('/lambda/nfs/storage/models/llama-2-7b')
"

# 4. Fine-tune with checkpoints on filesystem
accelerate launch --num_processes 8 train.py \
  --model_path /lambda/nfs/storage/models/llama-2-7b \
  --output_dir /lambda/nfs/storage/outputs \
  --checkpoint_dir /lambda/nfs/storage/checkpoints
```

### Workflow 2: Batch inference

```bash
# 1. Launch A10 instance (cost-effective for inference)

# 2. Run inference
python inference.py \
  --model /lambda/nfs/storage/models/fine-tuned \
  --input /lambda/nfs/storage/data/inputs.jsonl \
  --output /lambda/nfs/storage/data/outputs.jsonl
```

## Cost optimization

### Choose right GPU

| Task | Recommended GPU |
|------|-----------------|
| LLM fine-tuning (7B) | A100 40GB |
| LLM fine-tuning (70B) | 8x H100 |
| Inference | A10, A6000 |
| Development | V100, A10 |
| Maximum performance | B200 |

### Reduce costs

1. **Use filesystems**: Avoid re-downloading data
2. **Checkpoint frequently**: Resume interrupted training
3. **Right-size**: Don't over-provision GPUs
4. **Terminate idle**: There is no auto-stop; terminate unused instances manually
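
The last point can be scripted against the API. A sketch of the selection step only; the instance-dict shape is a simplification of a list-instances response, and the `dev-` naming convention is an assumption for illustration:

```python
def select_terminable(instances, name_prefix="dev-"):
    """Pick IDs of active instances whose name marks them as disposable.

    instances: list of dicts with 'id', 'name', and 'status' keys
    (a simplified view of a list-instances response).
    """
    return [
        inst["id"]
        for inst in instances
        if inst["status"] == "active" and (inst["name"] or "").startswith(name_prefix)
    ]

# The resulting IDs would then go to the terminate call shown earlier, e.g.:
# api.terminate_instance(TerminateInstanceRequest(instance_ids=ids))
```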

### Monitor usage

- Dashboard shows real-time GPU utilization
- API for programmatic monitoring

## Common issues

| Issue | Solution |
|-------|----------|
| Instance won't launch | Check region availability, try different GPU |
| SSH connection refused | Wait for instance to initialize (3-15 min) |
| Data lost after terminate | Use persistent filesystems |
| Slow data transfer | Use filesystem in same region |
| GPU not detected | Reboot instance, check drivers |

## References

- **[Advanced Usage](references/advanced-usage.md)** - Multi-node training, API automation
- **[Troubleshooting](references/troubleshooting.md)** - Common issues and solutions

## Resources

- **Documentation**: https://docs.lambda.ai
- **Console**: https://cloud.lambda.ai
- **Pricing**: https://lambda.ai/instances
- **Support**: https://support.lambdalabs.com
- **Blog**: https://lambda.ai/blog
