Architecture
Cornucopia system architecture and design
Cornucopia Architecture
Overview
Cornucopia is a high-performance PyPI mirror written in Go. It proxies packages from PyPI with automatic caching, checksum validation, and support for publishing internal packages.
Layered Architecture
The codebase follows a clean, layered architecture:
┌────────────────────┐
│ cmd/cornucopia │ Entrypoint: wiring, main()
└────────┬───────────┘
│
┌────────▼──────────────┐
│ server/server.go │ HTTP routing, middleware, graceful shutdown
└────────┬──────────────┘
│
┌────────▼──────────────────────────────────┐
│ handler/ │ HTTP handlers
│ - simple.go (PEP 503/691 indexes) │
│ - packages.go (file serving) │
│ - upload.go (twine compatibility) │
│ - health.go (liveness) │
└────────┬──────────────────────────────────┘
│
┌────────▼──────────────────────────────────┐
│ service/ │ Business logic
│ - proxy.go (PyPI caching + singleflight) │
│ - index.go (PEP 503/691 formatting) │
│ - upload.go (internal package publishing)│
│ - checksum.go (validation) │
└────────┬──────────────────────────────────┘
│
┌────────▼──────────────────────────────────┐
│ repository/ │ Data access
│ - interfaces.go (MetadataRepository, │
│ FileRepository) │
│ - disk/ (filesystem storage) │
│ - memory/ (TTL cache) │
└────────┬──────────────────────────────────┘
│
┌────────▼──────────────────────────────────┐
│ domain/ │ Core types
│ - package.go (Package, PackageFile) │
│ - errors.go (sentinel errors) │
│ - normalize.go (PEP 503 normalization) │
└────────────────────────────────────────────┘
Key Design Patterns
1. singleflight Deduplication
When 50 concurrent pip install commands fetch metadata for the same package:
Request 1 ──┐
Request 2 ──┤─→ singleflight.Do("requests") ─→ [single upstream call]
Request 3 ──┘ ↓
All 3 requests get the same result
Implementation: service/proxy.go uses golang.org/x/sync/singleflight.Group
Benefit: Reduces upstream PyPI load by orders of magnitude during mirror bootstrap.
2. Atomic Disk Writes
Prevents readers from seeing partial files:
1. Write to /cache/packages/requests-2.31.0.tar.gz.tmp
2. Validate checksum
3. os.Rename(tmp, final) ← atomic on POSIX
Implementation: repository/disk/cache.go
Benefit: No lock contention; race-free caching in high concurrency.
3. Concurrent Checksum Validation
Computes SHA256 and MD5 in a single pass:
┌─→ crypto/sha256.Hash
io.Reader─┤
└─→ crypto/md5.Hash
(via io.TeeReader + io.MultiWriter)
Implementation: service/checksum.go ValidatingReader
Benefit: No re-reading for checksum validation; constant memory.
4. Write-Through Cache
In-memory cache layered over disk storage:
GetPackage(ctx, "requests")
↓
Memory (RWMutex + map[name]*entry)
├─ Cache hit? → return
└─ Cache miss?
↓
Disk (JSON files)
├─ Found? → update memory, return
└─ Not found?
↓
Upstream PyPI (singleflight)
├─ Success? → update both, return
└─ Failure?
├─ Stale in memory? → serve stale
└─ Otherwise → error
Benefit: Sub-millisecond metadata lookups; graceful degradation.
5. Interface-Based Repositories
type MetadataRepository interface {
GetPackage(ctx context.Context, name NormalizedName) (*Package, error)
PutPackage(ctx context.Context, pkg *Package) error
ListPackages(ctx context.Context) ([]NormalizedName, error)
DeletePackage(ctx context.Context, name NormalizedName) error
}
Benefit: Easy to test (mock), easy to swap implementations (disk → database).
Request Flow
pip install requests
1. pip requests /simple/requests/
↓
2. SimpleHandler.PackageIndex(w, r)
├─ Normalize name: "requests" → "requests"
└─ Call IndexService.BuildPackageIndex(ctx, "requests")
↓
3. ProxyService.GetPackageMetadata(ctx, "requests")
├─ Check memory cache (fast)
├─ If miss, singleflight.Do("requests", fetchFromUpstream)
│ ├─ PyPI Client fetches https://pypi.org/pypi/requests/json
│ ├─ Parse response, build Package type
│ └─ Store in disk + memory repos
└─ Return Package
↓
4. SimpleHandler formats as HTML (PEP 503) or JSON (PEP 691)
├─ Include each file with #sha256={hash} fragment
└─ Return with 200 OK
↓
5. pip downloads /packages/requests-2.31.0.tar.gz
↓
6. PackageHandler.ServeFile(w, r)
├─ Check disk cache
├─ If miss:
│ ├─ ProxyService.FetchFile(ctx, pkg, "requests-2.31.0.tar.gz")
│ ├─ Create ValidatingReader
│ ├─ Stream from upstream to cache (computing hashes)
│ ├─ Validate checksums
│ └─ Serve from cache
└─ Otherwise: stream from cache
↓
7. pip verifies hash, extracts, installs
twine upload dist/*
1. twine POSTs multipart to /legacy/
├─ :action=file_upload
├─ name=mypackage
├─ version=1.0.0
├─ sha256_digest={computed by twine}
├─ content={file bytes}
└─ Authorization: Basic __token__:{token}
↓
2. UploadHandler.Upload(w, r)
├─ Authenticate (HTTP Basic Auth)
├─ Parse multipart form
├─ Call UploadService.Upload(ctx, req)
↓
3. UploadService.Upload(ctx, req)
├─ Create ValidatingReader
├─ Store file to disk
├─ Validate checksum
├─ Fetch or create Package metadata
├─ Add PackageFile with IsLocal=true
├─ Store metadata to disk + memory
└─ Return success
↓
4. twine prints "uploaded successfully"
Concurrency Model
Goroutines
- Main thread: signal handling, graceful shutdown
- HTTP server: one goroutine per request (stdlib default)
- No explicit worker pools (unbounded concurrency)
Synchronization
| Component | Sync Primitive | Reason |
|---|---|---|
| In-memory metadata cache | sync.RWMutex | Allow concurrent reads, exclusive writes |
| Disk file writes | os.Rename() + atomic ops | Atomic on POSIX, no locking needed |
| Upstream dedup | singleflight.Group | Share single fetch across N concurrent requests |
| Context propagation | context.Context | Timeout, cancellation propagated through stack |
Lock Contention
Low by design:
- Disk writes are atomic (no locks)
- Memory cache uses RWMutex (readers don’t block each other)
- singleflight avoids thundering herd on upstream
Testing Strategy
Unit Tests
- Domain logic (normalization, errors)
- Checksum validation
- Config loading
- HTTP handler content negotiation
Integration Tests
- Full proxy flow: metadata fetch + cache + validation + serving
- pip install via index
- twine upload flow
- Concurrent request deduplication
- Cache TTL expiration
- Stale-while-revalidate fallback
Location: internal/integration_test.go (run with -tags=integration)
Performance Characteristics
| Metric | Value | Notes |
|---|---|---|
| Metadata latency (cache hit) | <1ms | In-memory map lookup + JSON unmarshal |
| Metadata latency (disk miss) | 10-50ms | Disk I/O + singleflight |
| Metadata latency (upstream) | 200-500ms | Network round-trip to pypi.org |
| File throughput | ~100 MB/s | Limited by upstream + disk I/O |
| Memory (baseline) | ~50 MB | Go runtime + dependencies |
| Memory (per 1000 packages) | ~1 MB | Metadata cache size |
| Concurrent capacity | Unbounded | Goroutines, no artificial limits |
Security Considerations
What We Validate
- ✅ SHA256 hash of every downloaded file (matches PyPI)
- ✅ MD5 hash (legacy, for completeness)
- ✅ HTTP Basic Auth for uploads (token-based)
What We Don’t Validate
- ✗ GPG signatures (delegated to pip)
- ✗ Package dependencies (delegated to pip)
- ✗ Python version specifiers (served as-is from PyPI)
Implications
This is a transport-layer mirror, not a security gateway. It ensures files match PyPI’s checksums, but doesn’t validate package contents or build integrity. Trust in PyPI security is inherited.
Deployment Considerations
Scaling
Vertical: Increase Go runtime GOMAXPROCS for CPU-bound operations (hash validation)
Horizontal: Stateless service; front with reverse proxy; shared cache volume required
Storage
Cache growth: ~10 GB per 1000 popular packages. Use local SSD or network block storage.
Monitoring
/healthendpoint for liveness probes- Structured logging via log/slog (Go 1.21+)
- Request latency via middleware
HA Setup
┌─→ cornucopia:1 ─┐
Reverse Proxy ─┤─→ cornucopia:2 ├→ Shared Cache Volume
└─→ cornucopia:3 ─┘
All instances share a cache volume (NFS, EBS, etc.) to avoid re-downloads.