implementing-cloud-dlp-for-data-protection
Implementing Cloud Data Loss Prevention (DLP) using Amazon Macie, Azure Information Protection, and Google Cloud DLP API to discover, classify, and protect sensitive data across cloud storage, databases, and data pipelines.
Best use case
implementing-cloud-dlp-for-data-protection is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Implementing Cloud Data Loss Prevention (DLP) using Amazon Macie, Azure Information Protection, and Google Cloud DLP API to discover, classify, and protect sensitive data across cloud storage, databases, and data pipelines.
Teams using implementing-cloud-dlp-for-data-protection should expect a more consistent output, faster repeated execution, less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/implementing-cloud-dlp-for-data-protection/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How implementing-cloud-dlp-for-data-protection Compares
| Feature / Agent | implementing-cloud-dlp-for-data-protection | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Implementing Cloud Data Loss Prevention (DLP) using Amazon Macie, Azure Information Protection, and Google Cloud DLP API to discover, classify, and protect sensitive data across cloud storage, databases, and data pipelines.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
Related Guides
AI Agents for Coding
Browse AI agent skills for coding, debugging, testing, refactoring, code review, and developer workflows across Claude, Cursor, and Codex.
AI Agents for Marketing
Discover AI agents for marketing workflows, from SEO and content production to campaign research, outreach, and analytics.
AI Agents for Startups
Explore AI agent skills for startup validation, product research, growth experiments, documentation, and fast execution with small teams.
SKILL.md Source
# Implementing Cloud DLP for Data Protection
## When to Use
- When compliance frameworks (GDPR, HIPAA, PCI DSS) require automated sensitive data discovery and protection
- When building data governance programs that classify and label data across cloud storage
- When implementing data loss prevention controls for cloud-based data pipelines
- When auditing cloud environments for unprotected sensitive data (PII, PHI, financial data)
- When integrating DLP scanning into CI/CD pipelines to prevent sensitive data from reaching production
**Do not use** for endpoint DLP (use Microsoft Purview or Symantec DLP agents), for email DLP (use Microsoft 365 DLP or Google Workspace DLP), or for network-level data exfiltration prevention (use VPC endpoint policies and network firewalls).
## Prerequisites
- Amazon Macie enabled with appropriate S3 bucket permissions
- Google Cloud DLP API enabled (`gcloud services enable dlp.googleapis.com`)
- Azure Information Protection or Microsoft Purview configured
- IAM permissions for DLP service administration and data access
- Knowledge of data sensitivity categories relevant to the organization (PII, PHI, PCI, proprietary)
## Workflow
### Step 1: Deploy Amazon Macie for S3 Data Discovery
Enable Macie and configure automated sensitive data discovery jobs for S3 buckets.
```bash
# Enable Amazon Macie
aws macie2 enable-macie
# List all S3 buckets Macie can scan
aws macie2 describe-buckets \
--query 'buckets[*].[bucketName,classifiableSizeInBytes,unclassifiableObjectCount.total]' \
--output table
# Create a classification job for specific buckets
aws macie2 create-classification-job \
--job-type SCHEDULED \
--name "weekly-pii-scan" \
--schedule-frequency-details '{"weekly":{"dayOfWeek":"MONDAY"}}' \
--s3-job-definition '{
"bucketDefinitions": [{
"accountId": "ACCOUNT_ID",
"buckets": ["customer-data-bucket", "analytics-data-lake", "backup-bucket"]
}],
"scoping": {
"includes": {
"and": [{
"simpleScopeTerm": {
"key": "OBJECT_EXTENSION",
"values": ["csv", "json", "parquet", "txt", "xlsx"],
"comparator": "EQ"
}
}]
}
}
}' \
--managed-data-identifier-ids '["SSN","CREDIT_CARD_NUMBER","EMAIL_ADDRESS","AWS_CREDENTIALS","PHONE_NUMBER"]'
# Create custom data identifier for internal employee IDs
aws macie2 create-custom-data-identifier \
--name "EmployeeID" \
--regex "EMP-[0-9]{6}" \
--description "Internal employee ID format"
# Check job status and results
aws macie2 list-classification-jobs \
--query 'items[*].[name,jobStatus,statistics.approximateNumberOfObjectsToProcess]' \
--output table
```
### Step 2: Configure Google Cloud DLP API for Data Inspection
Use Google Cloud DLP to inspect and de-identify sensitive data across GCP resources.
```bash
# Inspect a Cloud Storage bucket for sensitive data
gcloud dlp inspect-content \
--content-type=TEXT_PLAIN \
--min-likelihood=LIKELY \
--info-types=PHONE_NUMBER,EMAIL_ADDRESS,CREDIT_CARD_NUMBER,US_SOCIAL_SECURITY_NUMBER \
--storage-type=CLOUD_STORAGE \
--gcs-uri="gs://sensitive-data-bucket/data/*.csv"
# Create an inspection job for BigQuery
cat > dlp-job.json << 'EOF'
{
"inspectJob": {
"storageConfig": {
"bigQueryOptions": {
"tableReference": {
"projectId": "PROJECT_ID",
"datasetId": "customer_data",
"tableId": "transactions"
},
"sampleMethod": "RANDOM_START",
"rowsLimit": 10000
}
},
"inspectConfig": {
"infoTypes": [
{"name": "CREDIT_CARD_NUMBER"},
{"name": "US_SOCIAL_SECURITY_NUMBER"},
{"name": "EMAIL_ADDRESS"},
{"name": "PHONE_NUMBER"},
{"name": "PERSON_NAME"}
],
"minLikelihood": "LIKELY",
"limits": {"maxFindingsPerRequest": 1000}
},
"actions": [{
"saveFindings": {
"outputConfig": {
"table": {
"projectId": "PROJECT_ID",
"datasetId": "dlp_results",
"tableId": "findings"
}
}
}
}]
}
}
EOF
gcloud dlp jobs create --project=PROJECT_ID --body-from-file=dlp-job.json
```
### Step 3: Implement Data De-identification with Cloud DLP
Configure de-identification transforms to mask, tokenize, or redact sensitive data.
```python
# deidentify_pipeline.py - De-identify sensitive data using Google Cloud DLP
from google.cloud import dlp_v2
def deidentify_data(project_id, text):
"""De-identify PII in text using Cloud DLP."""
client = dlp_v2.DlpServiceClient()
inspect_config = {
"info_types": [
{"name": "EMAIL_ADDRESS"},
{"name": "PHONE_NUMBER"},
{"name": "CREDIT_CARD_NUMBER"},
{"name": "US_SOCIAL_SECURITY_NUMBER"},
],
"min_likelihood": dlp_v2.Likelihood.LIKELY,
}
deidentify_config = {
"info_type_transformations": {
"transformations": [
{
"info_types": [{"name": "EMAIL_ADDRESS"}],
"primitive_transformation": {
"character_mask_config": {
"masking_character": "*",
"number_to_mask": 0,
"characters_to_ignore": [
{"common_characters_to_ignore": "PUNCTUATION"}
],
}
},
},
{
"info_types": [{"name": "CREDIT_CARD_NUMBER"}],
"primitive_transformation": {
"crypto_replace_ffx_fpe_config": {
"crypto_key": {
"kms_wrapped": {
"wrapped_key": "WRAPPED_KEY_BASE64",
"crypto_key_name": "projects/PROJECT/locations/global/keyRings/dlp/cryptoKeys/tokenization",
}
},
"common_alphabet": "NUMERIC",
}
},
},
{
"info_types": [{"name": "US_SOCIAL_SECURITY_NUMBER"}],
"primitive_transformation": {
"redact_config": {}
},
},
]
}
}
item = {"value": text}
parent = f"projects/{project_id}/locations/global"
response = client.deidentify_content(
request={
"parent": parent,
"deidentify_config": deidentify_config,
"inspect_config": inspect_config,
"item": item,
}
)
return response.item.value
```
### Step 4: Configure Azure Information Protection
Set up sensitivity labels and DLP policies in Microsoft Purview for Azure resources.
```powershell
# Connect to Microsoft Purview compliance
Connect-IPPSSession
# Create sensitivity labels
New-Label -DisplayName "Confidential - PII" \
-Name "Confidential-PII" \
-Tooltip "Contains personally identifiable information" \
-ContentType "File, Email"
New-Label -DisplayName "Highly Confidential - Financial" \
-Name "HighlyConfidential-Financial" \
-Tooltip "Contains financial data subject to PCI DSS" \
-ContentType "File, Email"
# Create auto-labeling policy for Azure Storage
New-AutoSensitivityLabelPolicy -Name "Auto-Label-PII" \
-ExchangeLocation All \
-SharePointLocation All \
-OneDriveLocation All \
-Mode Enable
New-AutoSensitivityLabelRule -Policy "Auto-Label-PII" \
-Name "Detect-SSN" \
-ContentContainsSensitiveInformation @{
Name = "U.S. Social Security Number (SSN)";
MinCount = 1;
MinConfidence = 85
} \
-ApplySensitivityLabel "Confidential-PII"
```
```bash
# Azure: Configure DLP policy for Storage accounts
az security assessment create \
--name "storage-sensitive-data" \
--assessed-resource-type "Microsoft.Storage/storageAccounts"
# Enable Microsoft Defender for Storage with sensitive data threat detection
az security pricing create --name StorageAccounts --tier standard \
--subplan DefenderForStorageV2 \
--extensions '[{"name":"SensitiveDataDiscovery","isEnabled":"True"}]'
```
### Step 5: Integrate DLP into Data Pipelines
Add DLP scanning to ETL and data pipeline workflows to prevent sensitive data leakage.
```python
# pipeline_dlp_gate.py - DLP gate for data pipelines
import boto3
import json
macie_client = boto3.client('macie2')
s3_client = boto3.client('s3')
def scan_pipeline_output(bucket, prefix):
"""Scan pipeline output data for sensitive content before promotion."""
job_response = macie_client.create_classification_job(
jobType='ONE_TIME',
name=f'pipeline-scan-{prefix}',
s3JobDefinition={
'bucketDefinitions': [{
'accountId': boto3.client('sts').get_caller_identity()['Account'],
'buckets': [bucket]
}],
'scoping': {
'includes': {
'and': [{
'simpleScopeTerm': {
'key': 'OBJECT_KEY',
'comparator': 'STARTS_WITH',
'values': [prefix]
}
}]
}
}
},
managedDataIdentifierSelector='ALL'
)
return job_response['jobId']
def check_scan_results(job_id):
"""Check if DLP scan found sensitive data."""
response = macie_client.list_findings(
findingCriteria={
'criterion': {
'classificationDetails.jobId': {'eq': [job_id]},
'severity.description': {'eq': ['High', 'Critical']}
}
}
)
return len(response.get('findingIds', [])) > 0
def gate_decision(bucket, prefix):
"""DLP gate: block pipeline if sensitive data found."""
job_id = scan_pipeline_output(bucket, prefix)
has_sensitive_data = check_scan_results(job_id)
if has_sensitive_data:
return {
'decision': 'BLOCK',
'reason': 'Sensitive data detected in pipeline output',
'action': 'Apply de-identification before promoting to production'
}
return {'decision': 'ALLOW', 'reason': 'No sensitive data detected'}
```
### Step 6: Monitor DLP Findings and Generate Reports
Aggregate DLP findings across cloud providers and generate compliance reports.
```bash
# Macie: Get finding statistics
aws macie2 get-finding-statistics \
--group-by "severity.description" \
--finding-criteria '{"criterion":{"category":{"eq":["CLASSIFICATION"]}}}'
# Macie: List findings by sensitivity type
aws macie2 list-findings \
--finding-criteria '{
"criterion": {
"classificationDetails.result.sensitiveData.category": {"eq": ["PERSONAL_INFORMATION"]},
"severity.description": {"eq": ["High"]}
}
}' \
--sort-criteria '{"attributeName": "updatedAt", "orderBy": "DESC"}'
# GCP DLP: List job results
gcloud dlp jobs list --project=PROJECT_ID --filter="state=DONE" \
--format="table(name, createTime, inspectDetails.result.processedBytes, inspectDetails.result.totalEstimatedTransformations)"
# Export Macie findings to S3 for compliance reporting
aws macie2 create-findings-report \
--finding-criteria '{"criterion":{"category":{"eq":["CLASSIFICATION"]}}}' \
--sort-criteria '{"attributeName":"severity.score","orderBy":"DESC"}'
```
## Key Concepts
| Term | Definition |
|------|------------|
| Data Loss Prevention | Security controls and technologies that detect and prevent unauthorized disclosure of sensitive data from cloud environments |
| Amazon Macie | AWS service using machine learning to discover, classify, and protect sensitive data stored in S3 buckets |
| Google Cloud DLP | GCP API for inspecting, classifying, and de-identifying sensitive data across Cloud Storage, BigQuery, and Datastore |
| Data De-identification | Transforming sensitive data using masking, tokenization, encryption, or redaction to remove identifying characteristics while preserving utility |
| Sensitivity Label | Classification tag applied to data (Confidential, Highly Confidential) that triggers DLP policy enforcement and access controls |
| Custom Data Identifier | Organization-specific pattern (regex or keyword) added to DLP services to detect proprietary sensitive data formats |
## Tools & Systems
- **Amazon Macie**: ML-powered sensitive data discovery and classification for S3 with automated finding generation
- **Google Cloud DLP API**: Programmable API for inspecting, classifying, de-identifying, and redacting sensitive data
- **Microsoft Purview**: Data governance platform with sensitivity labeling, auto-classification, and DLP policy enforcement
- **Azure Information Protection**: Data classification and labeling service integrated with Microsoft 365 and Azure storage
- **Nightfall AI**: Third-party cloud DLP tool supporting scanning across SaaS applications and cloud infrastructure
## Common Scenarios
### Scenario: Discovering PII in an Unprotected S3 Data Lake
**Context**: A compliance audit reveals that the analytics team's S3 data lake contains customer PII (names, emails, SSNs) in CSV files without encryption or access controls. The organization must classify all data and implement DLP controls.
**Approach**:
1. Enable Macie and create a one-time classification job against the data lake bucket
2. Review Macie findings to identify which objects contain PII and what types
3. Create custom data identifiers for organization-specific formats (employee IDs, account numbers)
4. Implement a weekly scheduled Macie job for ongoing discovery
5. Build a data pipeline gate that scans new data before promotion to the data lake
6. Apply de-identification transforms (masking SSNs, tokenizing emails) for analytics use cases
7. Configure S3 bucket policies to restrict access to classified data to authorized roles only
**Pitfalls**: Macie charges per GB scanned. Large data lakes can generate significant costs. Use scoping rules to focus on high-risk object types (CSV, JSON, Parquet) and exclude known-safe formats (compressed archives, binary files). De-identification must preserve data utility for analytics while removing re-identification risk.
## Output Format
```
Cloud DLP Compliance Report
==============================
Organization: Acme Corp
Scan Period: 2026-02-01 to 2026-02-23
Environments: AWS (12 buckets), GCP (3 datasets), Azure (5 storage accounts)
DATA DISCOVERY SUMMARY:
Total objects/records scanned: 2,847,000
Objects with sensitive data: 45,200 (1.6%)
Unique sensitivity categories: 8
SENSITIVE DATA FINDINGS:
PII (names, emails, phone): 23,400 objects
Financial (credit cards, bank): 8,700 objects
Health (PHI, medical records): 3,200 objects
Credentials (API keys, tokens): 1,400 objects
Government ID (SSN, passport): 5,800 objects
Custom (employee ID, account): 2,700 objects
FINDINGS BY SEVERITY:
Critical: 1,400 (exposed credentials)
High: 14,200 (unprotected PII/PHI)
Medium: 18,600 (standard PII)
Low: 11,000 (non-sensitive patterns)
PROTECTION STATUS:
Data with encryption at rest: 78%
Data with access controls: 65%
Data with sensitivity labels: 12%
Pipeline data with DLP gates: 30%
REMEDIATION ACTIONS:
Objects quarantined: 1,400
De-identification applied: 8,200
Access controls tightened: 14,200
Sensitivity labels applied: 45,200
```Related Skills
testing-for-sensitive-data-exposure
Identifying sensitive data exposure vulnerabilities including API key leakage, PII in responses, insecure storage, and unprotected data transmission during security assessments.
securing-kubernetes-on-cloud
This skill covers hardening managed Kubernetes clusters on EKS, AKS, and GKE by implementing Pod Security Standards, network policies, workload identity, RBAC scoping, image admission controls, and runtime security monitoring. It addresses cloud-specific security features including IRSA for EKS, Workload Identity for GKE, and Managed Identities for AKS.
performing-sqlite-database-forensics
Perform forensic analysis of SQLite databases to recover deleted records from freelists and WAL files, decode encoded timestamps, and extract evidence from browser history, messaging apps, and mobile device databases.
performing-cloud-storage-forensic-acquisition
Perform forensic acquisition and analysis of cloud storage services including Google Drive, OneDrive, Dropbox, and Box by collecting both API-based remote data and local sync client artifacts from endpoint devices.
performing-cloud-penetration-testing-with-pacu
Performing authorized AWS penetration testing using Pacu, the open-source AWS exploitation framework, to enumerate IAM configurations, discover privilege escalation paths, test credential harvesting, and validate security controls through systematic attack simulation.
performing-cloud-native-forensics-with-falco
Uses Falco YAML rules for runtime threat detection in containers and Kubernetes, monitoring syscalls for shell spawns, file tampering, network anomalies, and privilege escalation. Manages Falco rules via the Falco gRPC API and parses Falco alert output. Use when building container runtime security or investigating k8s cluster compromises.
performing-cloud-log-forensics-with-athena
Uses AWS Athena to query CloudTrail, VPC Flow Logs, S3 access logs, and ALB logs for forensic investigation. Covers CREATE TABLE DDL with partition projection, forensic SQL queries for detecting unauthorized access, data exfiltration, lateral movement, and privilege escalation. Use when investigating AWS security incidents or building cloud-native forensic workflows at scale.
performing-cloud-incident-containment-procedures
Execute cloud-native incident containment across AWS, Azure, and GCP by isolating compromised resources, revoking credentials, preserving forensic evidence, and applying security group restrictions to prevent lateral movement.
performing-cloud-forensics-with-aws-cloudtrail
Perform forensic investigation of AWS environments using CloudTrail logs to reconstruct attacker activity, identify compromised credentials, and analyze API call patterns.
performing-cloud-forensics-investigation
Conduct forensic investigations in cloud environments by collecting and analyzing logs, snapshots, and metadata from AWS, Azure, and GCP services.
performing-cloud-asset-inventory-with-cartography
Perform comprehensive cloud asset inventory and relationship mapping using Cartography to build a Neo4j security graph of infrastructure assets, IAM permissions, and attack paths across AWS, GCP, and Azure.
managing-cloud-identity-with-okta
This skill covers implementing Okta as a centralized identity provider for cloud environments, configuring SSO integration with AWS, Azure, and GCP, deploying phishing- resistant MFA with Okta FastPass, managing lifecycle automation for user provisioning and deprovisioning, and enforcing adaptive access policies based on device posture and risk signals.