Archive Acquisition

Patterns for acquiring content from Internet Archive and archival sources

104 stars

Best use case

Archive Acquisition is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Patterns for acquiring content from Internet Archive and archival sources

Teams using Archive Acquisition should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/archive-acquisition/SKILL.md --create-dirs "https://raw.githubusercontent.com/jmagly/aiwg/main/agentic/code/frameworks/media-curator/skills/archive-acquisition/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/archive-acquisition/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How Archive Acquisition Compares

Feature / AgentArchive AcquisitionStandard Approach
Platform SupportNot specifiedLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

Patterns for acquiring content from Internet Archive and archival sources

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Archive Acquisition Skill

Specialized patterns and techniques for acquiring media content from Internet Archive (archive.org) and other archival sources. Focuses on bulk downloads, quality selection, format filtering, and collection management.

## Internet Archive Overview

Internet Archive hosts massive collections of audio, video, text, and software. Key characteristics:

- **Open access**: Most content freely downloadable
- **Multiple formats**: Same item often available in FLAC, MP3, OGG, etc.
- **Collections**: Items grouped by uploader, topic, or curator
- **Metadata**: Rich JSON metadata for every item
- **API access**: Full programmatic access via REST API

## Discovery Patterns

### Search API

Find items matching specific criteria:

```bash
# Search for items by keyword and mediatype
search_archive() {
  local query="$1"
  local mediatype="$2"  # audio, movies, texts, etc.
  local rows="${3:-100}"  # Results per page

  curl -s "https://archive.org/advancedsearch.php" \
    -d "q=$query AND mediatype:$mediatype" \
    -d "fl[]=identifier,title,creator,date,format" \
    -d "rows=$rows" \
    -d "output=json" \
    | jq '.response.docs[]'
}

# Example: Search for Grateful Dead concerts
search_archive "grateful dead" "audio" 50
```

### Response Format

```json
{
  "identifier": "gd1977-05-08.sbd.miller.97065.flac16",
  "title": "Grateful Dead Live at Barton Hall on 1977-05-08",
  "creator": "Grateful Dead",
  "date": "1977-05-08",
  "format": ["Flac", "VBR MP3", "Ogg Vorbis", "Metadata"]
}
```

### Collection Browsing

List all items in a collection:

```bash
# Get items from a specific collection
get_collection_items() {
  local collection="$1"
  local rows="${2:-1000}"

  curl -s "https://archive.org/advancedsearch.php" \
    -d "q=collection:$collection" \
    -d "fl[]=identifier,title,format" \
    -d "rows=$rows" \
    -d "output=json" \
    | jq -r '.response.docs[] | .identifier'
}

# Example: Get all items from GratefulDead collection
get_collection_items "GratefulDead" 500
```

### Advanced Query Patterns

```bash
# Find items by date range
search_by_date() {
  local creator="$1"
  local start_date="$2"  # YYYY-MM-DD
  local end_date="$3"

  curl -s "https://archive.org/advancedsearch.php" \
    -d "q=creator:\"$creator\" AND date:[$start_date TO $end_date]" \
    -d "fl[]=identifier,title,date" \
    -d "rows=100" \
    -d "output=json" \
    | jq '.response.docs[]'
}

# Find items with specific format available
search_by_format() {
  local query="$1"
  local format="$2"  # FLAC, MP3, etc.

  curl -s "https://archive.org/advancedsearch.php" \
    -d "q=$query AND format:$format" \
    -d "fl[]=identifier,title,format" \
    -d "rows=100" \
    -d "output=json" \
    | jq '.response.docs[]'
}

# Example: Find all FLAC recordings of specific artist
search_by_format "pink floyd" "Flac"
```

## Item Metadata Retrieval

### Get Item Details

```bash
# Retrieve full metadata for an item
get_item_metadata() {
  local identifier="$1"

  curl -s "https://archive.org/metadata/$identifier" \
    | jq '.'
}

# Extract specific metadata fields
get_item_files() {
  local identifier="$1"

  curl -s "https://archive.org/metadata/$identifier" \
    | jq -r '.files[] | select(.format != "Metadata") | "\(.name)\t\(.format)\t\(.size)"'
}

# Example output:
# gd77-05-08d1t01.flac    Flac    45123456
# gd77-05-08d1t01.mp3     VBR MP3 12345678
```

### Filter by Format

```bash
# Get only FLAC files from an item
get_flac_files() {
  local identifier="$1"

  curl -s "https://archive.org/metadata/$identifier" \
    | jq -r '.files[] | select(.format == "Flac") | .name'
}

# Get best quality audio files (prefer FLAC, fallback to 320K MP3)
get_best_audio_files() {
  local identifier="$1"

  # Try FLAC first
  local flac_files=$(curl -s "https://archive.org/metadata/$identifier" \
    | jq -r '.files[] | select(.format == "Flac") | .name')

  if [[ -n "$flac_files" ]]; then
    echo "$flac_files"
    return 0
  fi

  # Fallback to highest bitrate MP3
  curl -s "https://archive.org/metadata/$identifier" \
    | jq -r '.files[] | select(.format == "VBR MP3" or .format == "320Kbps MP3") | .name'
}
```

## Download Patterns

### Single Item Download

```bash
# Download all files from a single item
download_item() {
  local identifier="$1"
  local output_dir="$2"
  local format_filter="$3"  # Optional: FLAC, MP3, etc.

  mkdir -p "$output_dir"

  # Get file list
  local files
  if [[ -n "$format_filter" ]]; then
    files=$(curl -s "https://archive.org/metadata/$identifier" \
      | jq -r ".files[] | select(.format == \"$format_filter\") | .name")
  else
    files=$(curl -s "https://archive.org/metadata/$identifier" \
      | jq -r '.files[] | select(.format != "Metadata") | .name')
  fi

  # Download each file
  while IFS= read -r file; do
    echo "Downloading: $file"
    wget -q --show-progress \
      "https://archive.org/download/$identifier/$file" \
      -P "$output_dir"
  done <<< "$files"
}

# Example: Download all FLAC files from an item
download_item "gd1977-05-08.sbd.miller.97065.flac16" "/mnt/archive/grateful-dead/1977-05-08" "Flac"
```

### Recursive Collection Download

```bash
# Download entire collection with format filtering
download_collection() {
  local collection="$1"
  local output_base="$2"
  local format="$3"  # FLAC, MP3, etc.
  local max_items="${4:-100}"

  # Get collection items
  local items=$(get_collection_items "$collection" "$max_items")

  local count=0
  while IFS= read -r identifier; do
    count=$((count + 1))
    echo "[$count/$max_items] Downloading: $identifier"

    # Create item directory
    local item_dir="$output_base/$identifier"
    mkdir -p "$item_dir"

    # Download with format filter
    download_item "$identifier" "$item_dir" "$format"

    # Download metadata
    curl -s "https://archive.org/metadata/$identifier" \
      > "$item_dir/.curator/metadata.json"

    # Rate limiting (be nice to archive.org)
    sleep 2
  done <<< "$items"
}

# Example: Download first 50 FLAC items from collection
download_collection "GratefulDead" "/mnt/archive/grateful-dead" "Flac" 50
```

### Parallel Collection Download

```bash
# Download collection items in parallel (with concurrency limit)
download_collection_parallel() {
  local collection="$1"
  local output_base="$2"
  local format="$3"
  local max_concurrent="${4:-3}"

  # Get collection items
  local items=$(get_collection_items "$collection" 1000)

  # Create job queue
  local queue_file="/tmp/curator-archive-queue-$$.txt"
  echo "$items" > "$queue_file"

  # Process queue with concurrency limit
  cat "$queue_file" | xargs -P "$max_concurrent" -I {} bash -c "
    identifier={}
    item_dir=\"$output_base/\$identifier\"
    mkdir -p \"\$item_dir\"

    echo \"Downloading: \$identifier\"

    # Download best quality audio
    files=\$(curl -s \"https://archive.org/metadata/\$identifier\" \
      | jq -r '.files[] | select(.format == \"$format\") | .name')

    while IFS= read -r file; do
      wget -q --show-progress \
        \"https://archive.org/download/\$identifier/\$file\" \
        -P \"\$item_dir\"
    done <<< \"\$files\"

    # Save metadata
    curl -s \"https://archive.org/metadata/\$identifier\" \
      > \"\$item_dir/.curator/metadata.json\"

    sleep 2  # Rate limiting
  "

  rm "$queue_file"
}
```

### wget Recursive Download

```bash
# Use wget's recursive mode for efficient bulk download
download_with_wget_recursive() {
  local identifier="$1"
  local output_dir="$2"
  local format_pattern="$3"  # e.g., "*.flac" or "*.mp3"

  mkdir -p "$output_dir"

  wget --recursive \
       --no-parent \
       --no-directories \
       --accept "$format_pattern" \
       --directory-prefix="$output_dir" \
       --wait=2 \
       --random-wait \
       "https://archive.org/download/$identifier/"
}

# Example: Download all FLAC files from item
download_with_wget_recursive "gd1977-05-08.sbd.miller.97065.flac16" \
  "/mnt/archive/gd-1977-05-08" \
  "*.flac"
```

## Quality Selection Strategy

### Format Preference Hierarchy

For audio archival, prefer formats in this order:

1. **FLAC** - Lossless, widely supported
2. **320Kbps MP3** - High quality lossy, universal compatibility
3. **VBR MP3** - Variable bitrate, good quality/size ratio
4. **Ogg Vorbis** - Good quality, open format
5. **128Kbps MP3** - Acceptable for spoken word, not music

### Automatic Quality Selection

```bash
# Select best available format for an item
select_best_format() {
  local identifier="$1"

  # Get available formats
  local formats=$(curl -s "https://archive.org/metadata/$identifier" \
    | jq -r '.files[] | .format' | sort -u)

  # Check in preference order
  if echo "$formats" | grep -q "Flac"; then
    echo "Flac"
  elif echo "$formats" | grep -q "320Kbps MP3"; then
    echo "320Kbps MP3"
  elif echo "$formats" | grep -q "VBR MP3"; then
    echo "VBR MP3"
  elif echo "$formats" | grep -q "Ogg Vorbis"; then
    echo "Ogg Vorbis"
  else
    echo "$(echo "$formats" | grep -i mp3 | head -1)"
  fi
}

# Download using best available format
download_best_quality() {
  local identifier="$1"
  local output_dir="$2"

  local best_format=$(select_best_format "$identifier")

  echo "Selected format: $best_format"
  download_item "$identifier" "$output_dir" "$best_format"
}
```

### Size vs Quality Trade-offs

```bash
# Estimate download size before acquiring
estimate_download_size() {
  local identifier="$1"
  local format="$2"

  local total_size=$(curl -s "https://archive.org/metadata/$identifier" \
    | jq "[.files[] | select(.format == \"$format\") | .size | tonumber] | add")

  # Convert to human-readable
  local size_mb=$((total_size / 1048576))
  local size_gb=$((size_mb / 1024))

  if [[ $size_gb -gt 0 ]]; then
    echo "${size_gb}GB"
  else
    echo "${size_mb}MB"
  fi
}

# Check if size is acceptable before downloading
download_if_acceptable() {
  local identifier="$1"
  local output_dir="$2"
  local format="$3"
  local max_size_gb="${4:-5}"  # Default 5GB limit

  local estimated_size=$(estimate_download_size "$identifier" "$format")
  local size_value=$(echo "$estimated_size" | sed 's/[^0-9]//g')
  local size_unit=$(echo "$estimated_size" | sed 's/[0-9]//g')

  if [[ "$size_unit" == "GB" && $size_value -gt $max_size_gb ]]; then
    echo "SKIPPED: $identifier ($estimated_size exceeds ${max_size_gb}GB limit)"
    return 1
  fi

  echo "Downloading: $identifier ($estimated_size)"
  download_item "$identifier" "$output_dir" "$format"
}
```

## Metadata Extraction and Storage

### Extract Archive Metadata

```bash
# Convert Archive.org metadata to curator format
extract_archive_metadata() {
  local identifier="$1"
  local output_file="$2"

  curl -s "https://archive.org/metadata/$identifier" | jq '{
    source: "archive.org",
    identifier: .metadata.identifier,
    title: .metadata.title,
    creator: .metadata.creator,
    date: .metadata.date,
    venue: .metadata.venue,
    coverage: .metadata.coverage,
    description: .metadata.description,
    collection: .metadata.collection,
    downloads: .item_size.downloads,
    avg_rating: .reviews.info.avg_rating,
    num_reviews: .reviews.info.num_reviews,
    files: [
      .files[] | {
        name: .name,
        format: .format,
        size: .size,
        md5: .md5,
        sha1: .sha1
      }
    ]
  }' > "$output_file"
}

# Example usage
extract_archive_metadata "gd1977-05-08.sbd.miller.97065.flac16" \
  ".curator/metadata/gd-1977-05-08.json"
```

### Merge with Curator Metadata

```bash
# Merge Archive.org metadata with existing curator metadata
merge_metadata() {
  local curator_metadata="$1"
  local archive_metadata="$2"
  local output_file="$3"

  jq -s '.[0] * .[1]' "$curator_metadata" "$archive_metadata" > "$output_file"
}
```

## Verification and Quality Assurance

### Checksum Verification

Archive.org provides MD5 and SHA1 checksums for all files:

```bash
# Verify downloaded files against Archive.org checksums
verify_archive_downloads() {
  local identifier="$1"
  local download_dir="$2"

  # Get checksums from metadata
  local checksums=$(curl -s "https://archive.org/metadata/$identifier" \
    | jq -r '.files[] | "\(.md5)  \(.name)"')

  # Verify each file
  local failed=0
  while IFS= read -r line; do
    local expected_md5="${line%%  *}"
    local filename="${line#*  }"
    local filepath="$download_dir/$filename"

    if [[ ! -f "$filepath" ]]; then
      echo "MISSING: $filename"
      failed=$((failed + 1))
      continue
    fi

    local actual_md5=$(md5sum "$filepath" | awk '{print $1}')

    if [[ "$actual_md5" != "$expected_md5" ]]; then
      echo "CHECKSUM FAILED: $filename"
      echo "  Expected: $expected_md5"
      echo "  Actual:   $actual_md5"
      failed=$((failed + 1))
    else
      echo "VERIFIED: $filename"
    fi
  done <<< "$checksums"

  if [[ $failed -gt 0 ]]; then
    echo "Verification failed for $failed files"
    return 1
  fi

  echo "All files verified successfully"
  return 0
}
```

### File Integrity Check

```bash
# Verify audio file integrity after download
verify_audio_integrity() {
  local file="$1"

  # Check file is not zero-byte
  if [[ ! -s "$file" ]]; then
    echo "FAILED: Zero-byte file"
    return 1
  fi

  # Verify with ffmpeg
  if command -v ffmpeg >/dev/null; then
    if ffmpeg -v error -i "$file" -f null - 2>&1 | grep -q "error"; then
      echo "FAILED: Corrupted audio file"
      return 1
    fi
  fi

  # FLAC specific: verify with flac tool
  if [[ "$file" == *.flac ]] && command -v flac >/dev/null; then
    if ! flac -t "$file" 2>&1 | grep -q "ok"; then
      echo "FAILED: FLAC verification failed"
      return 1
    fi
  fi

  echo "VERIFIED: $file"
  return 0
}

# Verify all files in directory
verify_directory_integrity() {
  local dir="$1"

  local total=0
  local verified=0
  local failed=0

  find "$dir" -type f \( -name "*.flac" -o -name "*.mp3" -o -name "*.ogg" \) | while read -r file; do
    total=$((total + 1))

    if verify_audio_integrity "$file"; then
      verified=$((verified + 1))
    else
      failed=$((failed + 1))
    fi
  done

  echo "Integrity check complete: $verified/$total verified, $failed failed"
}
```

## Rate Limiting and Politeness

Archive.org requests responsible use of their bandwidth:

```bash
# Implement polite download behavior
polite_download() {
  local identifier="$1"
  local output_dir="$2"
  local format="$3"

  # Add delays between requests
  local wait_time=2  # seconds

  # Get file list
  local files=$(curl -s "https://archive.org/metadata/$identifier" \
    | jq -r ".files[] | select(.format == \"$format\") | .name")

  while IFS= read -r file; do
    echo "Downloading: $file"

    wget --quiet \
         --show-progress \
         --wait="$wait_time" \
         --random-wait \
         --user-agent="AIWG-MediaCurator/1.0 (research/personal use)" \
         "https://archive.org/download/$identifier/$file" \
         -P "$output_dir"

  done <<< "$files"
}
```

### Bulk Download Best Practices

1. **Rate limit**: Wait 2-5 seconds between requests
2. **Random wait**: Add jitter to avoid patterns
3. **User agent**: Identify your tool
4. **Off-peak hours**: Download during US night hours
5. **Resume support**: Use `wget -c` to resume interrupted downloads
6. **Error handling**: Retry failed downloads after delay

## Collection Management

### Build Local Collection Index

```bash
# Create searchable index of downloaded items
build_collection_index() {
  local collection_dir="$1"
  local index_file="$2"

  echo '{"items": []}' > "$index_file"

  find "$collection_dir" -name "metadata.json" -type f | while read -r metadata_file; do
    local item_dir=$(dirname "$metadata_file")

    # Extract key metadata fields
    local item_data=$(jq '{
      identifier: .metadata.identifier,
      title: .metadata.title,
      creator: .metadata.creator,
      date: .metadata.date,
      local_path: "'"$item_dir"'",
      file_count: (.files | length),
      total_size: ([.files[] | .size | tonumber] | add)
    }' "$metadata_file")

    # Append to index
    jq ".items += [$item_data]" "$index_file" > "$index_file.tmp"
    mv "$index_file.tmp" "$index_file"
  done

  echo "Index built: $index_file"
}

# Search local collection
search_local_collection() {
  local index_file="$1"
  local query="$2"

  jq -r ".items[] | select(.title | test(\"$query\"; \"i\")) | \"\(.date) - \(.creator) - \(.title) (\(.local_path))\"" \
    "$index_file"
}
```

### Deduplicate Collection

```bash
# Find and remove duplicate items in collection
deduplicate_collection() {
  local collection_dir="$1"

  # Build checksum database
  local checksum_file="/tmp/curator-checksums-$$.txt"

  find "$collection_dir" -type f \( -name "*.flac" -o -name "*.mp3" \) | while read -r file; do
    local checksum=$(md5sum "$file" | awk '{print $1}')
    echo "$checksum  $file" >> "$checksum_file"
  done

  # Find duplicates
  sort "$checksum_file" | uniq -d -w 32 | while read -r line; do
    local checksum="${line%%  *}"

    echo "Duplicate found (checksum: $checksum):"

    grep "^$checksum" "$checksum_file" | while read -r dup_line; do
      local dup_file="${dup_line#*  }"
      echo "  - $dup_file"
    done

    # Offer to delete all but first
    local files=($(grep "^$checksum" "$checksum_file" | cut -d' ' -f3-))
    local keep="${files[0]}"

    echo "Keep: $keep"

    for ((i=1; i<${#files[@]}; i++)); do
      echo "Delete: ${files[$i]}"
      rm "${files[$i]}"
    done
  done

  rm "$checksum_file"
}
```

## Advanced Patterns

### Incremental Collection Sync

```bash
# Download only new items added to collection since last sync
sync_collection_incremental() {
  local collection="$1"
  local output_base="$2"
  local format="$3"
  local sync_state_file="$output_base/.curator/sync-state.json"

  # Get last sync timestamp
  local last_sync
  if [[ -f "$sync_state_file" ]]; then
    last_sync=$(jq -r '.last_sync' "$sync_state_file")
  else
    last_sync="1970-01-01"
  fi

  # Search for items added/updated since last sync
  local new_items=$(curl -s "https://archive.org/advancedsearch.php" \
    -d "q=collection:$collection AND publicdate:[$last_sync TO null]" \
    -d "fl[]=identifier" \
    -d "rows=1000" \
    -d "output=json" \
    | jq -r '.response.docs[] | .identifier')

  # Download new items
  local count=0
  while IFS= read -r identifier; do
    count=$((count + 1))
    echo "Downloading new item [$count]: $identifier"

    download_item "$identifier" "$output_base/$identifier" "$format"
    sleep 2
  done <<< "$new_items"

  # Update sync state
  echo "{\"last_sync\": \"$(date -Iseconds)\", \"items_added\": $count}" > "$sync_state_file"

  echo "Sync complete: $count new items added"
}
```

### Filtered Batch Download

```bash
# Download items matching complex filter criteria
download_filtered_batch() {
  local query="$1"
  local output_base="$2"
  local format="$3"
  local min_rating="${4:-4.0}"
  local min_year="${5:-1960}"

  # Search with filters
  local items=$(curl -s "https://archive.org/advancedsearch.php" \
    -d "q=$query AND avg_rating:[$min_rating TO 5.0] AND year:[$min_year TO 2026]" \
    -d "fl[]=identifier,avg_rating,year" \
    -d "rows=500" \
    -d "output=json")

  # Download each item
  echo "$items" | jq -r '.response.docs[] | .identifier' | while read -r identifier; do
    local rating=$(echo "$items" | jq -r ".response.docs[] | select(.identifier == \"$identifier\") | .avg_rating")
    local year=$(echo "$items" | jq -r ".response.docs[] | select(.identifier == \"$identifier\") | .year")

    echo "Downloading: $identifier (rating: $rating, year: $year)"

    download_item "$identifier" "$output_base/$identifier" "$format"
    sleep 2
  done
}
```

## Integration with Media Curator Framework

### Source Plan Generation

```bash
# Generate source plan for /acquire command
generate_archive_source_plan() {
  local collection="$1"
  local output_file="$2"
  local format="${3:-Flac}"

  # Get collection items
  local items=$(get_collection_items "$collection" 100)

  # Build YAML plan
  cat > "$output_file" <<EOF
source_plan:
  id: archive-$(date +%Y%m%d-%H%M%S)
  source: archive.org
  collection: $collection
  format: $format
  created_at: $(date -Iseconds)

sources:
EOF

  while IFS= read -r identifier; do
    # Get item metadata
    local metadata=$(curl -s "https://archive.org/metadata/$identifier")
    local title=$(echo "$metadata" | jq -r '.metadata.title')
    local creator=$(echo "$metadata" | jq -r '.metadata.creator')
    local date=$(echo "$metadata" | jq -r '.metadata.date')

    cat >> "$output_file" <<EOF
  - identifier: $identifier
    title: "$title"
    creator: "$creator"
    date: "$date"
    url: "https://archive.org/details/$identifier"
    format: $format
EOF

  done <<< "$items"

  echo "Source plan generated: $output_file"
}

# Usage with /acquire
# generate_archive_source_plan "GratefulDead" ".curator/sources/gd-plan.yaml" "Flac"
# /acquire --plan .curator/sources/gd-plan.yaml
```

## Troubleshooting

### Common Issues and Solutions

| Issue | Cause | Solution |
|-------|-------|----------|
| **Download fails immediately** | Invalid identifier | Verify identifier on archive.org website |
| **Slow download speeds** | Network throttling | Use `--limit-rate` with wget |
| **Incomplete files** | Connection timeout | Use `wget -c` to resume |
| **Missing files** | Format not available | Check available formats with metadata API |
| **Checksum mismatch** | Corrupted download | Delete and re-download |
| **Rate limiting** | Too many requests | Increase wait time between downloads |

### Debug Mode

```bash
# Enable verbose output for troubleshooting
export CURATOR_DEBUG=1

download_item_debug() {
  local identifier="$1"
  local output_dir="$2"

  if [[ "$CURATOR_DEBUG" == "1" ]]; then
    set -x  # Enable bash tracing
  fi

  # Get metadata with verbose curl
  curl -v "https://archive.org/metadata/$identifier" 2>&1 | tee "$output_dir/.curator/metadata-debug.log"

  # Download with verbose wget
  wget --verbose \
       --debug \
       "https://archive.org/download/$identifier/" \
       2>&1 | tee "$output_dir/.curator/download-debug.log"

  set +x
}
```

## Performance Optimization

### Parallel Download Tuning

```bash
# Optimal concurrent downloads based on connection speed
determine_optimal_parallelism() {
  local speed_mbps="$1"

  if [[ $speed_mbps -lt 10 ]]; then
    echo 2  # Slow connection
  elif [[ $speed_mbps -lt 50 ]]; then
    echo 3  # Medium connection
  elif [[ $speed_mbps -lt 100 ]]; then
    echo 5  # Fast connection
  else
    echo 8  # Very fast connection
  fi
}

# Test connection speed
test_connection_speed() {
  # Download test file and measure speed
  local test_file="https://archive.org/download/test_item/test.mp3"
  local start_time=$(date +%s)

  wget -q -O /tmp/speed-test.tmp "$test_file"

  local end_time=$(date +%s)
  local duration=$((end_time - start_time))
  local size=$(stat -c%s /tmp/speed-test.tmp 2>/dev/null || stat -f%z /tmp/speed-test.tmp)
  local speed_mbps=$(( (size * 8) / (duration * 1000000) ))

  rm /tmp/speed-test.tmp

  echo "$speed_mbps"
}
```

### Caching Strategy

```bash
# Cache metadata to reduce API calls
cache_metadata() {
  local identifier="$1"
  local cache_dir=".curator/cache/metadata"
  local cache_file="$cache_dir/$identifier.json"

  mkdir -p "$cache_dir"

  # Check cache age
  if [[ -f "$cache_file" ]]; then
    local cache_age=$(( $(date +%s) - $(stat -c%Y "$cache_file" 2>/dev/null || stat -f%m "$cache_file") ))

    # Cache valid for 7 days
    if [[ $cache_age -lt 604800 ]]; then
      cat "$cache_file"
      return 0
    fi
  fi

  # Fetch and cache
  curl -s "https://archive.org/metadata/$identifier" | tee "$cache_file"
}
```

## Summary

Archive Acquisition skill provides:

- **Discovery**: Search API patterns for finding content
- **Quality Selection**: Automatic best-format selection (prefer FLAC, accept MP3 320K)
- **Bulk Download**: Recursive wget and parallel download patterns
- **Verification**: Checksum validation and integrity checks
- **Collection Management**: Indexing, deduplication, incremental sync
- **Integration**: Source plan generation for `/acquire` command
- **Performance**: Parallel tuning, caching, rate limiting

Use this skill whenever working with Internet Archive content in the Media Curator framework.

## References

- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/human-authorization.md — Seek explicit authorization before bulk downloads and irreversible archive operations
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/research-before-decision.md — Evaluate archive item quality and metadata before committing to download
- @$AIWG_ROOT/agentic/code/frameworks/media-curator/skills/acquire/SKILL.md — General acquisition skill that delegates to archive-acquisition for archive.org sources
- @$AIWG_ROOT/agentic/code/frameworks/media-curator/skills/integrity-verification/SKILL.md — Verify checksums after archive downloads
- @$AIWG_ROOT/agentic/code/frameworks/media-curator/skills/quality-filtering/SKILL.md — Quality scoring used to select best format from archive items

Related Skills

YouTube Acquisition

104
from jmagly/aiwg

yt-dlp patterns for acquiring content from YouTube and video platforms

verify-archive

104
from jmagly/aiwg

Verify archive integrity with self-verifying SHA-256 checksums, generate VERIFY.md, and optionally create W3C PROV provenance

Codex

research-archive

104
from jmagly/aiwg

Package research artifacts for long-term archival

Codex

archive-answer

104
from jmagly/aiwg

Capture a query answer and file it as a persistent artifact in .aiwg/working/answers/ with promotion guidance to a permanent destination.

Codex

aiwg-orchestrate

104
from jmagly/aiwg

Route structured artifact work to AIWG workflows via MCP with zero parent context cost

venv-manager

104
from jmagly/aiwg

Create, manage, and validate Python virtual environments. Use for project isolation and dependency management.

pytest-runner

104
from jmagly/aiwg

Execute Python tests with pytest, supporting fixtures, markers, coverage, and parallel execution. Use for Python test automation.

vitest-runner

104
from jmagly/aiwg

Execute JavaScript/TypeScript tests with Vitest, supporting coverage, watch mode, and parallel execution. Use for JS/TS test automation.

eslint-checker

104
from jmagly/aiwg

Run ESLint for JavaScript/TypeScript code quality and style enforcement. Use for static analysis and auto-fixing.

repo-analyzer

104
from jmagly/aiwg

Analyze GitHub repositories for structure, documentation, dependencies, and contribution patterns. Use for codebase understanding and health assessment.

pr-reviewer

104
from jmagly/aiwg

Review GitHub pull requests for code quality, security, and best practices. Use for automated PR feedback and approval workflows.

Quality Filtering

104
from jmagly/aiwg

Accept/reject logic and quality scoring heuristics for media content