System Design Guide

Object Storage: Scalable Unstructured Data Storage

Object storage is a data storage architecture that manages data as objects rather than blocks or files. Each object contains data, metadata, and a unique identifier, making object storage ideal for large volumes of unstructured data like images, videos, backups, and logs. Cloud services like Amazon S3, Google Cloud Storage, and Azure Blob Storage have made object storage the de facto standard for cloud-based data storage.

Core Concepts

Objects are the fundamental storage units, consisting of the data itself (the object’s payload), metadata (key-value pairs describing the object), and a unique identifier (key) used to retrieve the object. Objects can range from bytes to terabytes in size.

Buckets (or containers) are top-level namespaces that organize objects. Buckets are loosely analogous to file system directories, but objects within a bucket are stored in a flat namespace with no physical hierarchy; keys can simulate one using path-like naming.

Keys uniquely identify objects within buckets. They’re essentially the object’s name and can include path-like components (e.g., images/2024/01/photo.jpg) to simulate directory structures, though these are logical rather than physical hierarchies.

Metadata provides context about objects: content type, creation time, custom application data, or caching directives. Some metadata is system-defined (size, checksum); other metadata is user-defined (tags, descriptions).
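
As a concrete sketch, here is how these pieces map onto the AWS SDK for Python (boto3); the bucket name, key, and metadata values are hypothetical:

    import boto3

    s3 = boto3.client("s3")  # credentials are read from the environment

    # Store an object: payload, system metadata, and user-defined metadata
    # under a unique key within a bucket.
    with open("photo.jpg", "rb") as f:
        s3.put_object(
            Bucket="example-assets",          # hypothetical bucket
            Key="images/2024/01/photo.jpg",   # key with path-like components
            Body=f,                           # the object's payload
            ContentType="image/jpeg",         # system-defined metadata
            Metadata={"camera": "x100v"},     # user-defined metadata
        )

    # Retrieve the object and its metadata by key.
    obj = s3.get_object(Bucket="example-assets", Key="images/2024/01/photo.jpg")
    payload = obj["Body"].read()
    print(obj["ContentType"], obj["Metadata"])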

Architecture

Object storage uses a flat address space rather than hierarchical file systems. Objects are accessed via HTTP APIs using URLs that combine bucket name and object key. There’s no concept of mounting or directory traversal—each object is independently addressable.
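
Because every object is addressable by a URL that combines the service endpoint, bucket name, and key, any HTTP client can fetch a publicly readable object without a storage driver. A sketch with a hypothetical URL:

    import requests

    # Virtual-hosted-style URL: https://<bucket>.s3.<region>.amazonaws.com/<key>
    url = "https://example-assets.s3.us-east-1.amazonaws.com/images/2024/01/photo.jpg"
    resp = requests.get(url)   # succeeds only if the object is publicly readable
    resp.raise_for_status()
    print(resp.headers["Content-Type"], len(resp.content))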

Distributed Storage underlies object storage systems, with objects automatically replicated across multiple storage nodes and availability zones for durability and availability. This distribution is transparent to users who simply store and retrieve objects via APIs.

Eventually Consistent semantics have historically been common in object storage. After uploading or updating an object, there may be brief periods where different requests see old versions. Many systems provide read-after-write consistency for new objects but eventual consistency for overwrites and deletes (see Consistency Models below).

Benefits

Scalability is virtually unlimited. Object storage scales to billions of objects and exabytes of data without performance degradation. There’s no need to provision capacity—just store data and pay for what you use.

Durability is exceptionally high, typically 99.999999999% (11 nines). Objects are automatically replicated across multiple devices and facilities (such as availability zones within a region), protecting against hardware failures and facility outages; cross-region replication can be enabled for disaster recovery.

Cost-Effectiveness results from efficient operation at massive scale. Object storage is typically much cheaper than block storage, especially for infrequently accessed data. Storage tiers provide further cost optimization.

HTTP Access through REST APIs makes object storage universally accessible. Any language with HTTP libraries can access object storage without specialized drivers or protocols.

Use Cases

Static Asset Hosting for websites serves images, videos, CSS, and JavaScript directly from object storage. CDNs integrate seamlessly, caching objects globally for low-latency delivery.

Backup and Archive leverage object storage’s durability and low cost. Backups stored in object storage benefit from automatic replication and long-term retention without tape management complexity.

Media Storage for streaming services, photo sharing, and content platforms stores massive media libraries. Object storage handles billions of files economically while supporting high-throughput access.

Data Lakes build on object storage for big data analytics. Store raw data in object storage and process it with frameworks like Spark or Presto, separating storage from compute for cost and flexibility.

Application Data, including user uploads, generated reports, and temporary files, fits naturally in object storage, offloading file management from application servers and databases.

Storage Classes and Tiers

Standard/Hot Storage provides immediate access with low latency, suitable for frequently accessed data. This is the most expensive tier but offers the best performance.

Infrequent Access tiers cost less for storage but charge for retrieval. They’re suitable for data accessed occasionally—monthly or quarterly. Retrieval latency is typically similar to standard storage.

Archive Storage (like Amazon S3 Glacier) offers the lowest cost for rarely accessed data, with retrieval times ranging from minutes to hours. This suits compliance archives and long-term backups where rapid access isn’t required.

Intelligent Tiering automatically moves objects between tiers based on access patterns, optimizing cost without manual intervention. Objects are automatically moved to cheaper tiers after periods of inactivity.

Versioning

Object Versioning retains multiple versions of objects, protecting against accidental deletions or overwrites. Each write creates a new version, and all versions are retained until explicitly deleted.

This enables recovering previous object versions, implementing audit trails, and protecting against application bugs or user errors. However, versioning increases storage costs since multiple versions consume space.
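
A short boto3 sketch of enabling versioning and recovering an older version; the bucket and key are hypothetical:

    import boto3

    s3 = boto3.client("s3")

    # One-time configuration: every subsequent write to a key in this bucket
    # creates a new version rather than overwriting in place.
    s3.put_bucket_versioning(
        Bucket="example-assets",
        VersioningConfiguration={"Status": "Enabled"},
    )

    # List the versions of a key.
    listing = s3.list_object_versions(Bucket="example-assets", Prefix="report.csv")
    for v in listing.get("Versions", []):
        print(v["VersionId"], v["IsLatest"], v["LastModified"])

    # Recover a specific previous version by its VersionId.
    oldest = listing["Versions"][-1]
    old = s3.get_object(Bucket="example-assets", Key="report.csv",
                        VersionId=oldest["VersionId"])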

Security

Access Control uses bucket policies and access control lists (ACLs) defining who can read, write, or delete objects. Policies can grant public access, require authentication, or limit access to specific users or roles.
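
As an illustration, a bucket policy granting anonymous read access to a single prefix could be applied as below; the bucket name and prefix are hypothetical:

    import json
    import boto3

    s3 = boto3.client("s3")

    # Allow anyone to GET objects under the public/ prefix, and nothing else.
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example-assets/public/*",
        }],
    }
    s3.put_bucket_policy(Bucket="example-assets", Policy=json.dumps(policy))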

Encryption protects data at rest and in transit. Server-side encryption automatically encrypts objects before storing them. Client-side encryption encrypts data before sending it to object storage for maximum security.
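
With boto3, server-side encryption can be requested per object (or set as a bucket default); the KMS key alias below is hypothetical:

    import boto3

    s3 = boto3.client("s3")

    # The service encrypts the object at rest before storing it.
    s3.put_object(
        Bucket="example-assets",
        Key="secrets/config.json",
        Body=b"{}",
        ServerSideEncryption="aws:kms",    # or "AES256" for S3-managed keys
        SSEKMSKeyId="alias/example-key",   # hypothetical KMS key alias
    )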

Signed URLs provide temporary, time-limited access to private objects without requiring authentication credentials. Generate a signed URL for a private file, and anyone with the URL can access it until expiration.
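
In boto3, generating a signed (presigned) URL is a single call; the bucket, key, and one-hour expiry are hypothetical:

    import boto3

    s3 = boto3.client("s3")

    # Anyone holding this URL can GET the private object until it expires.
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "example-assets", "Key": "private/invoice.pdf"},
        ExpiresIn=3600,  # lifetime in seconds
    )
    print(url)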

Audit Logging tracks all access and operations for compliance and security monitoring. Logs capture who accessed which objects, when, and what operations were performed.

Performance Considerations

Parallel Transfers improve throughput for large objects. Upload or download in multiple parts concurrently to saturate network bandwidth and overcome single-connection limits.

Multipart Upload breaks large objects into parts uploaded independently and assembled into a single object. This enables resumable uploads, parallel transfer, and uploading objects larger than what a single HTTP request can handle.
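
SDKs typically hide the part bookkeeping behind a transfer manager. A boto3 sketch with hypothetical tuning values: uploads above 64 MB are split into 16 MB parts sent over eight concurrent threads:

    import boto3
    from boto3.s3.transfer import TransferConfig

    s3 = boto3.client("s3")

    config = TransferConfig(
        multipart_threshold=64 * 1024 * 1024,  # switch to multipart above 64 MB
        multipart_chunksize=16 * 1024 * 1024,  # 16 MB parts
        max_concurrency=8,                     # parts uploaded in parallel
    )
    s3.upload_file("backup.tar", "example-assets", "backups/backup.tar",
                   Config=config)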

Request Rate limits can apply per bucket or per key prefix. Because the keyspace is served by underlying partitions with finite capacity, high request rates concentrated on a small portion of the keyspace may be throttled.

Key Design affects performance. Sequential keys (timestamps, counters) concentrate writes on a single partition and create hotspots. Randomized key prefixes distribute load across storage partitions for better throughput.
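
One common trick is to derive a short hash prefix from the logical key, as in this hypothetical helper:

    import hashlib

    def distributed_key(logical_key: str) -> str:
        """Prepend a short hash so sequential names spread across partitions."""
        prefix = hashlib.md5(logical_key.encode()).hexdigest()[:4]
        return f"{prefix}/{logical_key}"

    # Consecutive timestamps now map to different prefixes instead of
    # hammering the same partition.
    print(distributed_key("logs/2024-01-15T10:00:00Z"))
    print(distributed_key("logs/2024-01-15T10:00:01Z"))

The trade-off is that hash prefixes make simple prefix listings harder, so apply this only where request rates demand it.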

Consistency Models

Read-After-Write Consistency for new objects means immediately after uploading an object, you can read it back. This is provided by most modern object storage systems.

Eventual Consistency for updates and deletes means after modifying or deleting an object, there might be brief periods where requests return old versions. Most systems propagate changes within seconds, but no immediate consistency guarantee exists.

Strong Consistency is offered by some systems (Amazon S3 has provided it for all operations since December 2020), eliminating eventual consistency delays. This simplifies application logic but may have slight performance implications.

Lifecycle Management

Lifecycle Policies automate object management based on rules. Automatically delete objects after specified periods, transition to cheaper storage tiers as objects age, or delete previous versions.

This simplifies data retention compliance, reduces storage costs, and eliminates manual data management. Configure policies once, and the system applies them continuously.
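
A boto3 sketch of such a policy, with a hypothetical prefix and retention schedule (cheaper tier at 30 days, archive at 90, delete at 365):

    import boto3

    s3 = boto3.client("s3")

    s3.put_bucket_lifecycle_configuration(
        Bucket="example-assets",
        LifecycleConfiguration={"Rules": [{
            "ID": "age-out-logs",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},  # delete objects after a year
        }]},
    )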

Best Practices

Use object storage for data that fits its access patterns: large files, immutable data, or data accessed via keys rather than queries. Don’t use it as a database; its lack of transactional semantics and query capabilities makes it unsuitable for structured, frequently updated data.

Implement retry logic with exponential backoff for transient failures. Object storage systems may throttle requests or experience brief service disruptions. Retries ensure resilience.
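
With boto3 this need not be hand-rolled; the client’s built-in retry modes implement exponential backoff, as in this sketch:

    import boto3
    from botocore.config import Config

    # "adaptive" retries use exponential backoff with jitter and also
    # slow the client down when the service signals throttling.
    s3 = boto3.client(
        "s3",
        config=Config(retries={"max_attempts": 10, "mode": "adaptive"}),
    )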

Leverage CDNs for frequently accessed content. Serving directly from object storage works, but CDN caching reduces latency and egress cost for hot content.

Monitor costs carefully, especially data transfer costs. Transferring data out of object storage can be expensive. Optimize data locality and transfer patterns to control costs.

Design keys thoughtfully. Use prefixes for logical organization, but avoid purely sequential keys that create hotspots. Include randomness in key prefixes for better distribution.

Enable versioning for critical data to protect against accidental deletion or corruption. Combined with lifecycle policies to eventually remove old versions, this provides safety without unbounded cost growth.

Object storage has become fundamental infrastructure for modern applications, providing scalable, durable, cost-effective storage for unstructured data. Understanding its characteristics (flat addressing, consistency model, HTTP access) and its best practices lets you apply object storage effectively across diverse use cases, from static website hosting to data lakes. While not suitable for all data types, object storage excels at what it’s designed for: managing massive volumes of unstructured data economically and reliably.