System Design Guide

API Pagination: Handling Large Data Sets

Pagination is the practice of dividing large data sets into smaller chunks or “pages” that can be retrieved sequentially. It’s essential for APIs returning potentially large collections, protecting both servers and clients from excessive data transfer, memory consumption, and processing time. Choosing an appropriate pagination strategy impacts performance, user experience, and implementation complexity.

Why Paginate?

Returning all results for large collections is impractical. Imagine an API endpoint returning every user in a system with millions of users. The response could run to gigabytes, take minutes to generate, consume large amounts of server memory and network bandwidth, and likely time out before completing.

Even for smaller collections, pagination improves performance. Fetching 20 items is faster than fetching 10,000, reduces time-to-first-byte, and allows clients to display results immediately while loading additional pages in the background. Users see responsive interfaces instead of staring at loading spinners.

Pagination also enables efficient resource utilization. Databases can use indexes effectively for small page queries but struggle with massive result sets. Network bandwidth is conserved, server memory isn’t exhausted, and clients handle manageable data volumes.

Offset-Based Pagination

Offset-based pagination uses limit and offset parameters: /users?limit=20&offset=40 returns 20 users starting from the 41st. This is intuitive, allows random page access (jump directly to page 5), and maps naturally to SQL’s LIMIT and OFFSET clauses.

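Implementation is straightforward: SELECT * FROM users ORDER BY id LIMIT 20 OFFSET 40. The ORDER BY clause matters: without it, the database is free to return rows in any order, so pages are not guaranteed to be stable between requests. Most databases and ORMs support this pattern, making it easy to add pagination to existing endpoints.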
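The query above can be tried end to end with a minimal sketch. It assumes an in-memory SQLite database and a hypothetical users table with id and name columns; the function name fetch_users_offset is illustrative and not part of any particular framework.

```python
import sqlite3


def fetch_users_offset(conn, limit=20, offset=0):
    """Return one page of users using LIMIT/OFFSET."""
    # ORDER BY gives the pages a stable, well-defined order.
    return conn.execute(
        "SELECT id, name FROM users ORDER BY id LIMIT ? OFFSET ?",
        (limit, offset),
    ).fetchall()


# Hypothetical demo data: 100 users with ids 1..100.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO users (id, name) VALUES (?, ?)",
    [(i, f"user{i}") for i in range(1, 101)],
)

# Equivalent to /users?limit=20&offset=40
page = fetch_users_offset(conn, limit=20, offset=40)
print(page[0], page[-1])  # (41, 'user41') (60, 'user60')
```

Binding limit and offset as parameters, rather than formatting them into the SQL string, keeps client-supplied values out of the query text.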

However, offset-based pagination has significant drawbacks. Performance degrades with large offsets since databases must scan and discard all of the skipped rows before returning results. With 20 items per page, fetching page 1,000 (offset 19,980) is much slower than fetching page 1 (offset 0).

Inconsistent Results occur when data changes between page requests. If an item is inserted ahead of the client’s current position, every later item shifts down by one, so the last item of the previous page reappears at the top of the next page. If an item is deleted ahead of the current position, later items shift up by one and the item that moves across the page boundary is never returned.
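The shifting effect is easy to reproduce without a database. The sketch below uses a plain Python list as a stand-in for a newest-first collection; page_of is a hypothetical helper that mimics LIMIT/OFFSET.

```python
def page_of(items, limit, offset):
    """Mimic LIMIT/OFFSET over an in-memory list."""
    return items[offset:offset + limit]


# Case 1: an insertion at the front duplicates an item.
feed = ["e", "d", "c", "b", "a"]
page1 = page_of(feed, limit=2, offset=0)   # ['e', 'd']
feed.insert(0, "f")                        # new item arrives before page 2 is fetched
page2 = page_of(feed, limit=2, offset=2)   # ['d', 'c'] -> 'd' is returned twice

# Case 2: a deletion at the front causes an item to be skipped.
feed = ["e", "d", "c", "b", "a"]
first = page_of(feed, limit=2, offset=0)   # ['e', 'd']
feed.remove("e")                           # item deleted before page 2 is fetched
second = page_of(feed, limit=2, offset=2)  # ['b', 'a'] -> 'c' is never returned

print(page1, page2)
print(first, second)
```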

Offset-based pagination is also a poor fit for real-time data, since results become stale quickly in frequently updated datasets. By the time users reach page 10, the first page’s data may have changed significantly.

Cursor-Based Pagination

Cursor-based pagination uses an opaque cursor identifying the position in the result set: /users?cursor=eyJpZCI6MTAwfQ&limit=20. The cursor encodes enough information to resume from that position, typically the last item’s identifier or sorting key.

After retrieving a page, the response includes a cursor for the next page. Clients use this cursor to fetch subsequent pages. The cursor is opaque to clients—implementation details can change without affecting clients.
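One possible cursor encoding, shown here as a sketch rather than a prescribed format, is URL-safe base64 over a compact JSON object. The helper names encode_cursor and decode_cursor are hypothetical. The sample cursor used earlier, eyJpZCI6MTAwfQ, happens to decode to {"id": 100} under this scheme.

```python
import base64
import json


def encode_cursor(position: dict) -> str:
    """Serialize a position to a URL-safe token without '=' padding."""
    raw = json.dumps(position, separators=(",", ":")).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")


def decode_cursor(cursor: str) -> dict:
    """Reverse encode_cursor, restoring the stripped padding first."""
    padded = cursor + "=" * (-len(cursor) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


token = encode_cursor({"id": 100})
print(token)                 # eyJpZCI6MTAwfQ
print(decode_cursor(token))  # {'id': 100}
```

Base64 only obscures the contents; it does not protect them. A server that needs tamper resistance would have to sign or encrypt the token, which this sketch does not attempt.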

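Advantages include consistent results despite data changes. Because each page starts strictly after the last item already returned, insertions or deletions elsewhere in the collection do not shift later pages, so items are neither duplicated nor skipped. Performance is stable regardless of position since queries seek on indexed columns (typically IDs) rather than scanning past an offset.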

Implementation typically encodes the last item’s ID or sorting keys. For example, WHERE id > last_id ORDER BY id LIMIT 20. The database can seek directly to last_id in the index on id and read the next 20 rows, instead of scanning and discarding skipped rows.
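A minimal sketch of that query, again assuming SQLite and a hypothetical users table. It fetches one row more than requested so the server can tell whether another page exists without running a separate count; that trick is a common convention, not something the query itself requires.

```python
import sqlite3


def fetch_users_after(conn, last_id=0, limit=20):
    """Return (items, next_last_id); next_last_id is None on the final page."""
    rows = conn.execute(
        "SELECT id, name FROM users WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, limit + 1),  # one extra row signals that more data exists
    ).fetchall()
    has_more = len(rows) > limit
    items = rows[:limit]
    next_last_id = items[-1][0] if has_more else None
    return items, next_last_id


# Hypothetical demo data: 45 users with ids 1..45.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO users (id, name) VALUES (?, ?)",
    [(i, f"user{i}") for i in range(1, 46)],
)

# Walk the whole collection, recording the size of each page.
last_id, sizes = 0, []
while last_id is not None:
    items, last_id = fetch_users_after(conn, last_id, limit=20)
    sizes.append(len(items))
print(sizes)  # [20, 20, 5]
```

In a real endpoint, next_last_id would be wrapped in an opaque cursor before being returned to the client.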

Limitations include inability to jump to arbitrary pages—you can only navigate forward (and sometimes backward with previous cursors). This prevents “jump to page X” UI patterns but is fine for infinite scrolling or sequential navigation.

Keyset Pagination

Keyset pagination is a specific cursor-based approach using the actual sorting keys rather than opaque cursors. For users sorted by creation date: /users?created_after=2024-01-15T10:30:00Z&limit=20.

This is more transparent than opaque-cursor pagination, since the “cursor” is meaningful data. Clients can construct queries without needing a previous response, enabling bookmarking and sharing.

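Implementation requires careful handling of ties (multiple items with identical sorting keys) by including secondary sorting keys. For example, sort by creation date then ID: WHERE (created_at, id) > (last_created_at, last_id) ORDER BY created_at, id LIMIT 20. On databases that lack row-value comparison, the same condition can be written as created_at > last_created_at OR (created_at = last_created_at AND id > last_id).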
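The tie-breaking condition can be sketched as follows. The example assumes SQLite, a hypothetical users table, and ISO 8601 UTC timestamps stored as text in one fixed format (so that string order matches time order). It uses the expanded form of the row-value comparison, which is logically equivalent and portable across databases.

```python
import sqlite3


def fetch_users_keyset(conn, last_created_at, last_id, limit=20):
    """Return rows strictly after (last_created_at, last_id) in sort order."""
    return conn.execute(
        """
        SELECT id, created_at FROM users
        WHERE created_at > ?
           OR (created_at = ? AND id > ?)
        ORDER BY created_at, id
        LIMIT ?
        """,
        (last_created_at, last_created_at, last_id, limit),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, created_at TEXT)")
# A composite index matching the ORDER BY lets the database seek to the key.
conn.execute("CREATE INDEX idx_users_created_id ON users (created_at, id)")
conn.executemany(
    "INSERT INTO users (id, created_at) VALUES (?, ?)",
    [
        (1, "2024-01-15T10:30:00Z"),
        (2, "2024-01-15T10:30:00Z"),
        (3, "2024-01-15T10:30:00Z"),  # three rows share one timestamp
        (4, "2024-01-15T10:31:00Z"),
        (5, "2024-01-15T10:31:00Z"),
    ],
)

# Page through with limit=2; the empty string sorts before every timestamp.
seen, key = [], ("", 0)
while True:
    rows = fetch_users_keyset(conn, key[0], key[1], limit=2)
    if not rows:
        break
    seen.extend(r[0] for r in rows)
    key = (rows[-1][1], rows[-1][0])
print(seen)  # [1, 2, 3, 4, 5]
```

Filtering on created_at alone would resume after 10:30:00 on the second page and silently drop the row with id 3.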

This approach performs excellently with proper indexes but exposes more implementation details to clients. Changing the sorting strategy requires client changes since the pagination parameters are tied to the sort keys.

Page Number-Based Pagination

Page number pagination uses explicit page numbers: /users?page=3&page_size=20. This is familiar to users from websites with numbered page controls and allows direct navigation to any page.

Implementation calculates offset from page number: offset = (page - 1) * page_size. This is essentially offset-based pagination with more user-friendly parameters.
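The conversion can be written as a small helper. The function name and the choice to reject non-positive values are assumptions for this sketch; an API could equally clamp them to 1.

```python
def page_to_offset(page: int, page_size: int) -> int:
    """Translate a 1-based page number into a row offset."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    return (page - 1) * page_size


print(page_to_offset(3, 20))     # 40
print(page_to_offset(1000, 20))  # 19980
```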

It inherits offset-based pagination’s performance issues with high page numbers and inconsistent results with changing data. However, for moderately sized datasets where performance isn’t critical and page number UI is desired, it’s a reasonable choice.

Hypermedia Pagination

Hypermedia pagination includes navigation links in responses: first, prev, next, and last URLs. Clients follow these links rather than constructing URLs themselves.

{
  "data": [...],
  "links": {
    "first": "/users?page=1",
    "prev": "/users?page=2",
    "next": "/users?page=4",
    "last": "/users?page=10"
  }
}

This follows REST’s HATEOAS constraint, making APIs more discoverable and evolvable. Server-side URL structure can change without breaking clients. However, it increases response payload size and isn’t universally supported by client libraries.
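A sketch of how a server might generate such links for page-number pagination. The build_links helper is hypothetical, and the URLs omit page_size on the assumption that the server applies a default; an API that lets clients choose the page size would carry it in every link.

```python
import math
from urllib.parse import urlencode


def build_links(base_path, page, page_size, total_count):
    """Build first/prev/next/last links; prev and next are None at the edges."""
    total_pages = max(1, math.ceil(total_count / page_size))

    def url(p):
        return f"{base_path}?{urlencode({'page': p})}"

    return {
        "first": url(1),
        "prev": url(page - 1) if page > 1 else None,
        "next": url(page + 1) if page < total_pages else None,
        "last": url(total_pages),
    }


links = build_links("/users", page=3, page_size=20, total_count=200)
print(links)
```

With these inputs the result matches the sample response above: prev points at page 2, next at page 4, and last at page 10.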

Pagination Metadata

Include metadata about the result set and pagination state: total count, current page, page size, whether more results exist. This enables clients to display “showing X of Y” or implement progress indicators.

{
  "data": [...],
  "pagination": {
    "page": 2,
    "page_size": 20,
    "total_pages": 10,
    "total_count": 200,
    "has_next": true,
    "has_prev": true
  }
}
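The derived fields in this object follow from three inputs. A minimal sketch, with a hypothetical build_pagination helper that assumes the total count is already known:

```python
import math


def build_pagination(page, page_size, total_count):
    """Derive total_pages, has_next and has_prev from the three inputs."""
    total_pages = math.ceil(total_count / page_size)
    return {
        "page": page,
        "page_size": page_size,
        "total_pages": total_pages,
        "total_count": total_count,
        "has_next": page < total_pages,
        "has_prev": page > 1,
    }


meta = build_pagination(page=2, page_size=20, total_count=200)
print(meta["total_pages"], meta["has_next"], meta["has_prev"])  # 10 True True
```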

Total Count is expensive to compute for large datasets, potentially requiring a full table scan. Consider caching counts, estimating them, or omitting them entirely for very large datasets. Sometimes “more than 10,000 results” is sufficient rather than the exact count.
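One way to produce a “more than N results” answer is to count at most N + 1 rows. The sketch below assumes SQLite and a hypothetical users table; capped_count is an illustrative name, and how much work the inner LIMIT actually saves depends on the database and query plan.

```python
import sqlite3


def capped_count(conn, cap=10_000):
    """Return (count, exceeded): count is at most cap; exceeded is True if more rows exist."""
    (n,) = conn.execute(
        # The inner LIMIT lets the database stop after cap + 1 rows.
        "SELECT COUNT(*) FROM (SELECT 1 FROM users LIMIT ?)",
        (cap + 1,),
    ).fetchone()
    return min(n, cap), n > cap


# Hypothetical demo data: 50 users.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO users (id) VALUES (?)", [(i,) for i in range(1, 51)])

print(capped_count(conn, cap=10))   # (10, True)  -> display "more than 10 results"
print(capped_count(conn, cap=100))  # (50, False) -> display the exact count
```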

Handling Edge Cases

Empty Results: Return empty arrays with pagination metadata showing no results exist. Don’t return errors for legitimate queries that happen to have no results.

Invalid Cursors: If a cursor is invalid or expired, return a 400 Bad Request with a clear error message. Allow clients to restart pagination from the beginning.

Changed Sorting: If sort parameters change between pages, results will be inconsistent. Consider including sort parameters in cursors to detect this and return errors if clients change sorting mid-pagination.
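The invalid-cursor and changed-sorting checks can share one validation step. The sketch below assumes a base64-encoded JSON cursor that carries a sort field; the names InvalidCursor, make_cursor and parse_cursor are hypothetical, and mapping the exception to a 400 response is left to whatever web framework is in use.

```python
import base64
import json


class InvalidCursor(ValueError):
    """Signals a cursor the server cannot use; a handler would return HTTP 400."""


def make_cursor(payload: dict) -> str:
    raw = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")


def parse_cursor(cursor: str, requested_sort: str) -> dict:
    """Decode a cursor and confirm it was issued for the requested sort order."""
    try:
        padded = cursor + "=" * (-len(cursor) % 4)
        payload = json.loads(base64.urlsafe_b64decode(padded))
    except ValueError as exc:
        # base64, UTF-8 and JSON decoding errors all derive from ValueError.
        raise InvalidCursor("cursor is malformed") from exc
    if not isinstance(payload, dict) or "sort" not in payload:
        raise InvalidCursor("cursor is missing required fields")
    if payload["sort"] != requested_sort:
        raise InvalidCursor("sort order changed; restart pagination without a cursor")
    return payload


good = make_cursor({"sort": "created_at", "id": 100})
print(parse_cursor(good, "created_at"))  # {'sort': 'created_at', 'id': 100}
```

Keeping the error messages specific lets clients tell a corrupted cursor apart from a sort change and restart from the first page.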

Client Considerations

Cursor Storage: Clients must store cursors to navigate pagination. Don’t assume cursors are short or human-readable. Design for opaque, potentially long cursor strings.
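On the client side, cursor handling reduces to a loop that passes back whatever the server last returned. In this sketch the response field next_cursor is an assumed name, and fake_fetch_page stands in for a real HTTP call; its cursor happens to be a stringified index, which the client never inspects.

```python
from typing import Callable, Iterator, Optional


def iterate_all(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield every item across pages, treating the cursor as an opaque string."""
    cursor = None
    while True:
        response = fetch_page(cursor)
        yield from response["data"]
        cursor = response.get("next_cursor")
        if not cursor:  # a missing or empty cursor marks the last page
            break


# Stand-in for a real API call, serving five items two at a time.
ITEMS = [{"id": i} for i in range(1, 6)]


def fake_fetch_page(cursor, limit=2):
    start = int(cursor) if cursor else 0
    chunk = ITEMS[start:start + limit]
    end = start + len(chunk)
    return {"data": chunk, "next_cursor": str(end) if end < len(ITEMS) else None}


ids = [item["id"] for item in iterate_all(fake_fetch_page)]
print(ids)  # [1, 2, 3, 4, 5]
```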

Stateless Pagination: Servers shouldn’t maintain session state for pagination. All necessary information should be in the cursor or request parameters, allowing any server instance to handle any page request.

Default Limits: Provide sensible default page sizes (20-50 items typically) while allowing clients to override. Enforce maximum page sizes to prevent abuse—don’t allow clients to request 1,000,000 items per page.
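A sketch of limit handling on the server. The default of 20 comes from the range suggested above; the maximum of 100 and the choice to clamp oversized requests rather than reject them are assumptions for illustration.

```python
DEFAULT_PAGE_SIZE = 20   # within the 20-50 range suggested above
MAX_PAGE_SIZE = 100      # assumed ceiling for this sketch


def resolve_limit(raw):
    """Turn a raw query-string value into a safe page size."""
    if raw is None or raw == "":
        return DEFAULT_PAGE_SIZE
    try:
        value = int(raw)
    except (TypeError, ValueError):
        raise ValueError("limit must be an integer") from None
    if value < 1:
        raise ValueError("limit must be >= 1")
    return min(value, MAX_PAGE_SIZE)  # clamp oversized requests


print(resolve_limit(None))       # 20
print(resolve_limit("50"))       # 50
print(resolve_limit("1000000"))  # 100
```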

Choosing a Strategy

Use cursor-based pagination for large datasets, frequently changing data, or infinite scroll UIs. The consistent results and stable performance make it ideal for most modern APIs.

Use offset-based pagination for small to medium datasets where page numbers are valuable UI elements and performance at high offsets isn’t a concern.

Use keyset pagination when transparency is valuable and the dataset has good natural sorting keys that clients can understand and use.

Consider hybrid approaches: cursor-based for forward navigation with page numbers for user-facing display, where the “page number” is derived from position but actual navigation uses cursors.

Pagination is fundamental to API design for collections. The right strategy balances performance, user experience, implementation complexity, and data characteristics. For most modern APIs, cursor-based pagination provides the best balance, but understanding all of the approaches makes it possible to choose the one that fits a specific requirement.