Caching

How to use content-addressed caching for improved performance.

When to Use Caching

Caching improves performance in these scenarios:

  • Repeated access: Reading the same files multiple times (e.g., rebuilding a project)
  • Shared content: Multiple archives containing identical files (automatic deduplication via content hashing)
  • Remote archives: Avoiding repeated network round trips to OCI registries

The cache uses SHA256 hashes of uncompressed file content as keys. This provides:

  • Automatic deduplication across archives
  • Implicit integrity verification on cache hits
  • Efficient storage of shared dependencies
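
The keying scheme can be illustrated with the standard library alone. This is a minimal sketch, not the library's code: `contentKey` is a hypothetical helper name, and the hex encoding is an assumption made for readability.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// contentKey returns the hex-encoded SHA256 digest of uncompressed content.
// (Hypothetical helper; the library may represent keys differently.)
func contentKey(content []byte) string {
	sum := sha256.Sum256(content)
	return hex.EncodeToString(sum[:])
}

func main() {
	a := contentKey([]byte("package utils\n"))
	b := contentKey([]byte("package utils\n")) // same bytes from another archive
	c := contentKey([]byte("package main\n"))

	fmt.Println(a == b) // identical content maps to one key
	fmt.Println(a == c) // different content maps to a different key
}
```

Because the key is derived only from the bytes, two archives that ship the same file resolve to the same cache entry, and a cached entry whose bytes no longer hash to its key can be detected as corrupt.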

Disk Cache

To create a disk-backed cache:

import (
	"github.com/meigma/blob/cache/disk"
)

diskCache, err := disk.New("/path/to/cache")
if err != nil {
	return err
}

The disk cache creates the directory if it does not exist. It stores entries in a sharded directory structure, which avoids the slowdowns many filesystems show when a single directory holds a very large number of files.

Cache Options

Configure the disk cache with options:

diskCache, err := disk.New("/path/to/cache",
	disk.WithShardPrefixLen(3), // Use 3-character prefix for sharding (default: 2)
	disk.WithDirPerm(0o750),    // Set directory permissions (default: 0o700)
)

The shard prefix length determines directory distribution:

  • 2 (default): Creates 256 subdirectories (00-ff)
  • 3: Creates 4096 subdirectories (000-fff)
  • 0: Disables sharding (all files in one directory)

Managing Cache Size

The disk package provides two caches: a content cache (disk.New) and a block cache (disk.NewBlockCache). Both support size limits and automatic pruning.

Setting Size Limits

Specify a maximum cache size when creating the cache:

// Content cache with 1 GB limit
diskCache, err := disk.New("/path/to/cache",
	disk.WithMaxBytes(1 << 30), // 1 GB
)

// Block cache with 256 MB limit
blockCache, err := disk.NewBlockCache("/path/to/blocks",
	disk.WithBlockMaxBytes(256 << 20), // 256 MB
)

When a new entry would push the cache past its limit, the cache automatically prunes old entries before storing it.

How Pruning Works

Pruning removes entries by modification time (LRU-style eviction):

  1. Entries are sorted by modification time (oldest first)
  2. Oldest entries are removed until the cache is under the target size
  3. The cache tracks its size in memory for fast capacity checks

Manual Pruning

To manually prune a cache to a specific size:

// Prune to 100 MB
freed, err := diskCache.Prune(100 << 20)
if err != nil {
	return err
}
fmt.Printf("Freed %d bytes\n", freed)

Monitoring Cache Size

Check the current cache size:

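// Get configured limit (0 = unlimited)
maxBytes := diskCache.MaxBytes()

// Get current size
currentBytes := diskCache.SizeBytes()

if maxBytes > 0 {
	fmt.Printf("Cache: %d / %d bytes (%.1f%%)\n",
		currentBytes, maxBytes,
		float64(currentBytes)/float64(maxBytes)*100,
	)
} else {
	// No limit configured: a percentage would divide by zero
	fmt.Printf("Cache: %d bytes (unlimited)\n", currentBytes)
}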

Sizing Guidelines

Use Case                  Recommended Size   Rationale
Development workstation   256 MB - 1 GB      Balance performance with disk usage
CI/CD ephemeral           0 (unlimited)      Disk is reclaimed after job
Production server         2-10 GB            Based on working set size
Memory-constrained        64-128 MB          Minimum useful size

The optimal size depends on your access patterns. Monitor cache hit rates and adjust accordingly.
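
One way to measure hit rates is to wrap the cache in a counting decorator. The sketch below is an assumption-laden illustration: it redeclares the two-method Cache interface shown under Custom Cache Implementations, and `countingCache`, `HitRate`, and `mapCache` are hypothetical names, not part of the library.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Cache mirrors the two-method interface described in this guide.
type Cache interface {
	Get(hash []byte) ([]byte, bool)
	Put(hash []byte, content []byte) error
}

// countingCache is a hypothetical wrapper that records hits and misses.
type countingCache struct {
	inner  Cache
	hits   atomic.Int64
	misses atomic.Int64
}

func (c *countingCache) Get(hash []byte) ([]byte, bool) {
	content, ok := c.inner.Get(hash)
	if ok {
		c.hits.Add(1)
	} else {
		c.misses.Add(1)
	}
	return content, ok
}

func (c *countingCache) Put(hash, content []byte) error {
	return c.inner.Put(hash, content)
}

// HitRate returns hits / (hits + misses), or 0 before any lookup.
func (c *countingCache) HitRate() float64 {
	h, m := c.hits.Load(), c.misses.Load()
	if h+m == 0 {
		return 0
	}
	return float64(h) / float64(h+m)
}

// mapCache is a tiny stand-in backend for the demo (not concurrency-safe).
type mapCache map[string][]byte

func (m mapCache) Get(hash []byte) ([]byte, bool) {
	v, ok := m[string(hash)]
	return v, ok
}

func (m mapCache) Put(hash, content []byte) error {
	m[string(hash)] = content
	return nil
}

func main() {
	c := &countingCache{inner: mapCache{}}
	key := []byte("k")

	c.Get(key) // miss
	_ = c.Put(key, []byte("v"))
	c.Get(key) // hit
	c.Get(key) // hit

	fmt.Printf("hit rate: %.2f\n", c.HitRate())
}
```

A persistently low hit rate alongside frequent pruning suggests the limit is smaller than the working set.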

Wrapping a Blob with Caching

To add caching to an existing blob:

import (
	"github.com/meigma/blob"
	"github.com/meigma/blob/cache"
	"github.com/meigma/blob/cache/disk"
)

func openCachedArchive(indexData []byte, source blob.ByteSource) (*cache.Blob, error) {
	// Create the base blob
	base, err := blob.New(indexData, source)
	if err != nil {
		return nil, err
	}

	// Create disk cache
	diskCache, err := disk.New("/var/cache/blob")
	if err != nil {
		return nil, err
	}

	// Wrap with caching
	return cache.New(base, diskCache), nil
}

Using BlobFile with Caching

When using OpenFile, extract the embedded *Blob for caching:

blobFile, err := blob.OpenFile("index.blob", "data.blob")
if err != nil {
	return nil, err
}
// Note: caller is responsible for closing blobFile when done

diskCache, err := disk.New("/var/cache/blob")
if err != nil {
	blobFile.Close()
	return nil, err
}

// Wrap the embedded Blob with caching
cached := cache.New(blobFile.Blob, diskCache)

The cached blob implements the same fs.FS interfaces as the base blob, so you can use it as a drop-in replacement.

Reading Files

The cached blob automatically handles cache lookups:

// First read: fetches from source, caches result
content, err := cachedBlob.ReadFile("lib/utils.go")

// Second read: returns from cache, no network request
content, err = cachedBlob.ReadFile("lib/utils.go")

For streaming reads via Open(), behavior depends on the cache type:

  • Disk cache (StreamingCache): Content streams directly to cache during read
  • Basic cache: Content is buffered in memory, then cached on Close

Prefetching

To warm the cache with files you will access soon, use prefetch:

// Prefetch specific files
err := cachedBlob.Prefetch("go.mod", "go.sum", "main.go")

// Prefetch an entire directory
err = cachedBlob.PrefetchDir("pkg")

Prefetching is especially useful for remote archives because:

  • Adjacent files are fetched with batched range requests
  • Content is cached for subsequent access
  • You can prefetch during idle time
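
The benefit of batching comes from merging nearby byte ranges into a single request. How the library groups ranges is not documented here, so the following is only a sketch of the general technique: `byteRange`, `coalesce`, and the `maxGap` parameter are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// byteRange is a half-open [start, end) span within the data blob.
type byteRange struct{ start, end int64 }

// coalesce merges ranges whose gap is at most maxGap bytes, so that
// adjacent files can be fetched with one range request instead of several.
func coalesce(ranges []byteRange, maxGap int64) []byteRange {
	if len(ranges) == 0 {
		return nil
	}

	// Sort a copy by start offset.
	sorted := append([]byteRange(nil), ranges...)
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].start < sorted[j].start
	})

	merged := []byteRange{sorted[0]}
	for _, r := range sorted[1:] {
		last := &merged[len(merged)-1]
		if r.start-last.end <= maxGap {
			// Close enough: extend the current request.
			if r.end > last.end {
				last.end = r.end
			}
			continue
		}
		merged = append(merged, r)
	}
	return merged
}

func main() {
	files := []byteRange{
		{4096, 4200}, // a file stored far from the others
		{0, 100},     // go.mod
		{100, 250},   // go.sum, directly after go.mod
	}
	fmt.Println(coalesce(files, 0))
}
```

With a gap of 0, the first two files collapse into one request and the distant file stays separate. A larger gap trades some wasted bytes for fewer round trips.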

Prefetch Concurrency

By default, prefetch runs serially. To parallelize:

cachedBlob := cache.New(base, diskCache,
	cache.WithPrefetchConcurrency(4), // Use 4 workers
)

Custom Cache Implementations

To implement a custom cache, satisfy the cache.Cache interface:

type Cache interface {
	// Get retrieves content by its SHA256 hash.
	// Returns nil, false if the content is not cached.
	Get(hash []byte) ([]byte, bool)

	// Put stores content indexed by its SHA256 hash.
	Put(hash []byte, content []byte) error
}

Example in-memory cache:

type MemoryCache struct {
	mu   sync.RWMutex
	data map[string][]byte
}

func NewMemoryCache() *MemoryCache {
	return &MemoryCache{data: make(map[string][]byte)}
}

func (c *MemoryCache) Get(hash []byte) ([]byte, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	content, ok := c.data[string(hash)]
	return content, ok
}

func (c *MemoryCache) Put(hash, content []byte) error {
	// Copy the content so later changes to the caller's slice
	// cannot alter the cached entry.
	stored := append([]byte(nil), content...)
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[string(hash)] = stored
	return nil
}

Streaming Cache Interface

For large files, implement cache.StreamingCache to avoid buffering entire files in memory:

type StreamingCache interface {
	Cache

	// Writer returns a Writer for streaming content into the cache.
	// The hash is the expected SHA256 of the content being written.
	Writer(hash []byte) (Writer, error)
}

type Writer interface {
	io.Writer

	// Commit finalizes the cache entry after successful verification.
	Commit() error

	// Discard aborts the write and cleans up temporary data.
	Discard() error
}

The disk cache implements this interface, writing to a temporary file and atomically renaming on commit.

Complete Example

A complete setup with disk caching and prefetch:

func setupCachedArchive(indexData []byte, dataURL, token string) (*cache.Blob, error) {
	// Create HTTP source (http is the blob library's HTTP source package,
	// not net/http), authenticating with the caller-supplied bearer token
	source, err := http.NewSource(dataURL,
		http.WithHeader("Authorization", "Bearer "+token),
	)
	if err != nil {
		return nil, fmt.Errorf("create source: %w", err)
	}

	// Create base blob
	base, err := blob.New(indexData, source)
	if err != nil {
		return nil, fmt.Errorf("open archive: %w", err)
	}

	// Create disk cache in user cache directory,
	// falling back to the system temp directory
	cacheDir, err := os.UserCacheDir()
	if err != nil {
		cacheDir = os.TempDir()
	}
	diskCache, err := disk.New(filepath.Join(cacheDir, "blob"))
	if err != nil {
		return nil, fmt.Errorf("create cache: %w", err)
	}

	// Wrap with caching
	cached := cache.New(base, diskCache,
		cache.WithPrefetchConcurrency(4),
	)

	// Prefetch commonly accessed directories
	if err := cached.PrefetchDir("src"); err != nil {
		// Non-fatal: prefetch is opportunistic
		log.Printf("prefetch warning: %v", err)
	}

	return cached, nil
}

See Also