Architecture

Cornucopia system architecture and design

Cornucopia Architecture

Overview

Cornucopia is a high-performance PyPI mirror written in Go. It proxies packages from PyPI with automatic caching, checksum validation, and support for publishing internal packages.

Layered Architecture

The codebase follows a clean, layered architecture:

┌────────────────────┐
│   cmd/cornucopia   │  Entrypoint: wiring, main()
└────────┬───────────┘

┌────────▼──────────────┐
│  server/server.go     │  HTTP routing, middleware, graceful shutdown
└────────┬──────────────┘

┌────────▼──────────────────────────────────┐
│  handler/                                  │  HTTP handlers
│  - simple.go (PEP 503/691 indexes)        │
│  - packages.go (file serving)             │
│  - upload.go (twine compatibility)        │
│  - health.go (liveness)                   │
└────────┬──────────────────────────────────┘

┌────────▼──────────────────────────────────┐
│  service/                                  │  Business logic
│  - proxy.go (PyPI caching + singleflight) │
│  - index.go (PEP 503/691 formatting)      │
│  - upload.go (internal package publishing)│
│  - checksum.go (validation)               │
└────────┬──────────────────────────────────┘

┌────────▼──────────────────────────────────┐
│  repository/                               │  Data access
│  - interfaces.go (MetadataRepository,     │
│                   FileRepository)         │
│  - disk/ (filesystem storage)             │
│  - memory/ (TTL cache)                    │
└────────┬──────────────────────────────────┘

┌────────▼──────────────────────────────────┐
│  domain/                                   │  Core types
│  - package.go (Package, PackageFile)      │
│  - errors.go (sentinel errors)            │
│  - normalize.go (PEP 503 normalization)   │
└────────────────────────────────────────────┘

Key Design Patterns

1. singleflight Deduplication

When 50 concurrent pip install commands fetch metadata for the same package:

Request 1 ──┐
Request 2 ──┤─→ singleflight.Do("requests") ─→ [single upstream call]
Request 3 ──┘    ↓
             All 3 requests get the same result

Implementation: service/proxy.go uses golang.org/x/sync/singleflight.Group

Benefit: Reduces upstream PyPI load by orders of magnitude during mirror bootstrap.

2. Atomic Disk Writes

Prevents readers from seeing partial files:

1. Write to /cache/packages/requests-2.31.0.tar.gz.tmp
2. Validate checksum
3. os.Rename(tmp, final)  ← atomic on POSIX

Implementation: repository/disk/cache.go

Benefit: No lock contention; race-free caching in high concurrency.

3. Concurrent Checksum Validation

Computes SHA256 and MD5 in a single pass:

         ┌─→ crypto/sha256.Hash
io.Reader─┤
         └─→ crypto/md5.Hash
          (via io.TeeReader + io.MultiWriter)

Implementation: service/checksum.go ValidatingReader

Benefit: No re-reading for checksum validation; constant memory.

4. Write-Through Cache

In-memory cache layered over disk storage:

GetPackage(ctx, "requests")

  Memory (RWMutex + map[name]*entry)
    ├─ Cache hit? → return
    └─ Cache miss?

      Disk (JSON files)
        ├─ Found? → update memory, return
        └─ Not found?

          Upstream PyPI (singleflight)
            ├─ Success? → update both, return
            └─ Failure?
                ├─ Stale in memory? → serve stale
                └─ Otherwise → error

Benefit: Sub-millisecond metadata lookups; graceful degradation.

5. Interface-Based Repositories

type MetadataRepository interface {
    GetPackage(ctx context.Context, name NormalizedName) (*Package, error)
    PutPackage(ctx context.Context, pkg *Package) error
    ListPackages(ctx context.Context) ([]NormalizedName, error)
    DeletePackage(ctx context.Context, name NormalizedName) error
}

Benefit: Easy to test (mock), easy to swap implementations (disk → database).

Request Flow

pip install requests

1. pip requests /simple/requests/

2. SimpleHandler.PackageIndex(w, r)
   ├─ Normalize name: "requests" → "requests"
   └─ Call IndexService.BuildPackageIndex(ctx, "requests")

3. ProxyService.GetPackageMetadata(ctx, "requests")
   ├─ Check memory cache (fast)
   ├─ If miss, singleflight.Do("requests", fetchFromUpstream)
   │  ├─ PyPI Client fetches https://pypi.org/pypi/requests/json
   │  ├─ Parse response, build Package type
   │  └─ Store in disk + memory repos
   └─ Return Package

4. SimpleHandler formats as HTML (PEP 503) or JSON (PEP 691)
   ├─ Include each file with #sha256={hash} fragment
   └─ Return with 200 OK

5. pip downloads /packages/requests-2.31.0.tar.gz

6. PackageHandler.ServeFile(w, r)
   ├─ Check disk cache
   ├─ If miss:
   │  ├─ ProxyService.FetchFile(ctx, pkg, "requests-2.31.0.tar.gz")
   │  ├─ Create ValidatingReader
   │  ├─ Stream from upstream to cache (computing hashes)
   │  ├─ Validate checksums
   │  └─ Serve from cache
   └─ Otherwise: stream from cache

7. pip verifies hash, extracts, installs

twine upload dist/*

1. twine POSTs multipart to /legacy/
   ├─ :action=file_upload
   ├─ name=mypackage
   ├─ version=1.0.0
   ├─ sha256_digest={computed by twine}
   ├─ content={file bytes}
   └─ Authorization: Basic __token__:{token}

2. UploadHandler.Upload(w, r)
   ├─ Authenticate (HTTP Basic Auth)
   ├─ Parse multipart form
   ├─ Call UploadService.Upload(ctx, req)

3. UploadService.Upload(ctx, req)
   ├─ Create ValidatingReader
   ├─ Store file to disk
   ├─ Validate checksum
   ├─ Fetch or create Package metadata
   ├─ Add PackageFile with IsLocal=true
   ├─ Store metadata to disk + memory
   └─ Return success

4. twine prints "uploaded successfully"

Concurrency Model

Goroutines

  • Main thread: signal handling, graceful shutdown
  • HTTP server: one goroutine per request (stdlib default)
  • No explicit worker pools (unbounded concurrency)

Synchronization

ComponentSync PrimitiveReason
In-memory metadata cachesync.RWMutexAllow concurrent reads, exclusive writes
Disk file writesos.Rename() + atomic opsAtomic on POSIX, no locking needed
Upstream dedupsingleflight.GroupShare single fetch across N concurrent requests
Context propagationcontext.ContextTimeout, cancellation propagated through stack

Lock Contention

Low by design:

  • Disk writes are atomic (no locks)
  • Memory cache uses RWMutex (readers don’t block each other)
  • singleflight avoids thundering herd on upstream

Testing Strategy

Unit Tests

  • Domain logic (normalization, errors)
  • Checksum validation
  • Config loading
  • HTTP handler content negotiation

Integration Tests

  • Full proxy flow: metadata fetch + cache + validation + serving
  • pip install via index
  • twine upload flow
  • Concurrent request deduplication
  • Cache TTL expiration
  • Stale-while-revalidate fallback

Location: internal/integration_test.go (run with -tags=integration)

Performance Characteristics

MetricValueNotes
Metadata latency (cache hit)<1msIn-memory map lookup + JSON unmarshal
Metadata latency (disk miss)10-50msDisk I/O + singleflight
Metadata latency (upstream)200-500msNetwork round-trip to pypi.org
File throughput~100 MB/sLimited by upstream + disk I/O
Memory (baseline)~50 MBGo runtime + dependencies
Memory (per 1000 packages)~1 MBMetadata cache size
Concurrent capacityUnboundedGoroutines, no artificial limits

Security Considerations

What We Validate

  • ✅ SHA256 hash of every downloaded file (matches PyPI)
  • ✅ MD5 hash (legacy, for completeness)
  • ✅ HTTP Basic Auth for uploads (token-based)

What We Don’t Validate

  • ✗ GPG signatures (delegated to pip)
  • ✗ Package dependencies (delegated to pip)
  • ✗ Python version specifiers (served as-is from PyPI)

Implications

This is a transport-layer mirror, not a security gateway. It ensures files match PyPI’s checksums, but doesn’t validate package contents or build integrity. Trust in PyPI security is inherited.

Deployment Considerations

Scaling

Vertical: Increase Go runtime GOMAXPROCS for CPU-bound operations (hash validation)

Horizontal: Stateless service; front with reverse proxy; shared cache volume required

Storage

Cache growth: ~10 GB per 1000 popular packages. Use local SSD or network block storage.

Monitoring

  • /health endpoint for liveness probes
  • Structured logging via log/slog (Go 1.21+)
  • Request latency via middleware

HA Setup

         ┌─→ cornucopia:1 ─┐
Reverse Proxy ─┤─→ cornucopia:2 ├→ Shared Cache Volume
         └─→ cornucopia:3 ─┘

All instances share a cache volume (NFS, EBS, etc.) to avoid re-downloads.