SegmentDesign Struct Reference

Detailed Description

Overview

The Fennel segment library defines mechanisms for mapping from logical storage spaces to the physical storage interfaces defined by the device layer (as described in DeviceDesign). Here's a rough analogy to non-persistent memory:

The device layer is like persistent physical memory.
BlockIds are like physical memory addresses.
The segment layer is like persistent virtual memory.
A segment is like a single virtual address space.
PageIds are like virtual memory addresses. Unlike a BlockId, a PageId is only meaningful in the context of a particular segment.

However, the analogy is not exact, since the mapping from virtual to physical can be arbitrarily defined by each segment implementation, along with extra semantics such as versioning. Also, there may be multiple layers of segment virtualization rather than a direct logical-to-physical mapping.

The segment library also relies on the cache library (TBD: link to cache design) for all device access, and for imposing rules such as write ordering.

For an example of how the segment library can be applied to create page-based persistent data structures, see this HOWTO doc.

Segment Interface

The Segment interface provides the basic definition of a logical storage space. Some aspects of this interface are optional (meaning implementations are free to raise exceptions when an unsupported or irrelevant capability is requested). Derived classes augment the Segment interface with additional methods. The base public Segment interface methods can be categorized as follows:

Page allocation. Each segment has an associated set of allocated pages; each page of a segment has a different PageId. Segment::allocatePageId allocates a new page of storage and returns its PageId. Segment::getAllocationOrder returns information about the nature of the PageId's returned by allocatePageId, as defined by the Segment::AllocationOrder enumeration. Most common are linear PageId's, which are allocated consecutively starting from 0. Segment::deallocatePageRange frees allocated pages. Segment::isPageIdAllocated can be used to test the allocation status of particular PageId's.
Segment size. Segment::getAllocatedSizeInPages returns the cardinality of a segment's set of allocated pages. Segment::ensureAllocatedSize can be used to allocate a large number of pages at once.
Page size. Segments may reserve footer space on each page for their own bookkeeping purposes. Segment::getUsablePageSize allows a caller to determine how much of a page can safely be used. Segment::getFullPageSize returns the full page size including all footers (this is currently the same as the cache page size).
PageId/BlockId translation. Segment::translatePageId maps a logical PageId to a physical BlockId. Segment::translateBlockId maps a physical BlockId back to a logical PageId.
Checkpoint. The cached status of pages of a segment can be controlled with Segment::checkpoint.
Page successors. For some segments, there is a meaningful successorship function over the set of PageIds (e.g. for linear allocation segments, the successor is determined by interpreting the PageId as an integer). Segment::getPageSuccessor can be used to find the successor of a PageId. For segments where the ordering can be imposed (rather than being determined implicitly by allocation), Segment::setPageSuccessor can be used to change the successor relationship.
Segment derives from the MappedPageListener interface so that it can receive cache notifications for any of its pages whenever they are mapped into the cache. The default behavior is to ignore the notifications, but segment implementations can override this with arbitrary behavior.
Segment also implements the ClosableObject interface by calling checkpoint(CHECKPOINT_FLUSH_AND_UNMAP) when the segment is closed. This ensures that cache pages do not remain mapped to a segment which has already been closed.

SegmentFactory

Subsequent sections describe the available implementations of Segment. These classes cannot be instantiated directly; instead, SegmentFactory provides methods for this purpose, and for some related functionality:

Trace support is provided implicitly for all of the public Segment interface methods. SegmentFactory accomplishes this by wrapping Segment instances with TracingSegment (but only when tracing is turned on).
When SegmentFactory returns a shared_ptr to a new Segment, it ensures that the shared_ptr is constructed in such a way that the Segment's close method will be called automatically before destruction.
SegmentFactory manages DeviceId allocation for file devices associated with temporary segments, and ensures that the corresponding temporary files are deleted when a temporary segment is closed.

LinearDeviceSegment

LinearDeviceSegment maps a range of linear PageId's to contiguous blocks of a RandomAccessDevice. The diagram below shows the most general case:

In this example, the segment currently has five pages allocated (the light gray rectangles). The start of the segment does not have to coincide with the start of the device; this is controlled by the firstBlockId parameter. The white blocks are inaccessible via this segment; they could be unused, or allocated to other segments. The dark gray rectangles represent blocks which could be allocated in the future, up to the nPagesMax limit.

The device size only needs to be big enough to contain the allocated pages. If nPagesMax is beyond the end of the device, then the segment will attempt to expand the device when necessary for allocation of new pages, in chunks determined by the nPagesIncrement parameter.

Uses

LinearDeviceSegment can be used directly to store data structures with predictable allocation patterns (e.g. a fixed-size hash table), or as a basis for the more complex segments described later.

RandomAllocationSegment

RandomAllocationSegment implements random page allocation/deallocation in terms of an underlying linear segment (in the simplest case, directly on top of a LinearDeviceSegment). PageId 0 is reserved as a segment map. The rest of the segment is divided up into extents, each headed with an extent map page, as shown in the following diagram:

Here, each extent contains one extent map page followed by four data pages; real extents are much bigger. The entries in the PageId 0 segment map page contain the number of unallocated pages in each extent (light gray rectangles represent allocated blocks). Each extent map contains the allocation status of the corresponding data pages, together with successor PageId entries, forming singly-linked lists terminated with NULL_PAGE_ID). Storing the successor chain in the extent map allows for efficient prefetch even in the presence of fragmentation. In this example, the successor sequence is 2,3,4,5,8,14,10. The first extent is compact, but the last two extents are fragmented.

Not shown is an additional field which allows an owner ID to be recorded together with each page allocation in an extent map. This field is also interpreted as the actual allocation status (with a reserved owner ID representing an unallocated page).

TODO: This can be used for storage integrity verification and repair, as well as efficient wholesale deallocation of complex structures.

For very large segments, multiple segment map pages may be required. These are interspersed before each run of extents (rather than all at the beginning after PageId 0) in order to allow a segment to grow without bound. The last segment map page is marked to indicate that there are no more.

Size Calculations

The extent size and number of extents per segment map page are fixed based on:

page size
size of headers on map pages
size of entries on map pages

The exact formulas used are in the constructor (RandomAllocationSegment::RandomAllocationSegment). Here's an example calculation:

let page size be either 4096 bytes (4KB) or 16384 bytes (16KB)
assume no page footers required for underlying segment
typical segment map header size is 32 bytes
typical extent map header size is 8 bytes
typical segment map entry size is 4 bytes
typical extent map entry size is 16 bytes
number of pages mapped per extent is (4096 - 16)/8 = 510 for a 4KB page size, or 2046 for a 16KB page size
extent size is 510*4KB=2040KB (close to 2MB) for a 4KB page size, or 32MB for a 16KB page size
number of extents per segment map node is (4096 - 32)/4 = 1016 for a 4KB page size, or 4088 for a 16KB page size
number of pages mapped per segment map node is 1016*510=518160 for a 4KB page size, or 8364048 for a 16KB page size
data mapped per segment map node is 518160*4KB=2072640KB (close to 2GB) for a 4KB page size, or 133GB for a 16KB page size
a terabyte database would require over 500 segment map pages with a 4KB page size, but fewer than 10 with a 16KB page size

Growth

If the underlying segment supports it, a RandomAllocationSegment can grow whenever all of its initial allocation is exhausted. Growth is in increments of one extent at a time. When an entire segment map page is full (all extents allocated), a new segment map page with one empty extent is allocated.

Uses

RandomAllocationSegment is good for implementing dynamic data structures such as BTrees where the page allocation/deallocation pattern is unpredictable.

VersionedRandomAllocationSegment and SnapshotRandomAllocationSegment

VersionedRandomAllocationSegment and SnapshotRandomAllocationSegment are companion segments that implement snapshot-consistent, multi-versioned, random data access. VersionedRandomAllocationSegment provides the underlying multi-versioned, random data storage, while SnapshotRandomAllocationSegment maintains transaction private state, initiates the commit and rollback of changes in the underlying segment, and determines which snapshot pages to access for a specified transaction.

Data Storage

VersionedRandomAllocationSegment is derived from the same base class used to implement RandomAllocationSegment. Therefore, it uses the same page allocation/deallocation scheme. In fact, it uses the same segment map described earlier, and a similar extent map. The difference is the entries in a VersionedRandomAllocationSegment extent map include a commit sequence number, corresponding to when the page was committed, and a version chain that links a page to its snapshots. The first entry in the chain is the anchor page and corresponds to the original copy of the page. The page chained from that anchor page entry is the latest snapshot copy of the page. Chained from that is the next latest snapshot copy and so forth, until we chain back to the anchor page.

In order to support recovery and transaction rollbacks, changes made to the segment and extent maps of a VersionedRandomAllocationSegment are not written to the segment's pages until a transaction is committed. This is implemented by maintaining temporary copies of the pages containing these maps. As new pages are allocated, the allocations are reflected in these temporary pages. Only when a transaction commits are the changes corresponding to that transaction copied back to the permanent pages in the VersionedRandomAllocationSegment. In the case of rollback, any uncommitted changes corresponding to the transaction are removed from the temporary pages. All page allocations will go through these temporary pages. That way, all page allocations are still reserved to prevent other transactions from allocating the same page.

Maintaining Snapshots

SnapshotRandomAllocationSegment is responsible for maintaining and keeping track of page allocations, deallocations, and updates for a specific transaction. This is in contrast to VersionedRandomAllocationSegment which maintains page allocations and deallocations for all active transactions. Normally, you will have one SnapshotRandomAllocationSegment per transaction with more than one SnapshotRandomAllocationSegment associated with the same underlying companion VersionedRandomAllocationSegment. The SnapshotRandomAllocationSegment is created with an initial commit sequence number. This commit sequence number determines which pages should be read by the segment. In order to locate the appropriate snapshot page, it compares this commit sequence number against the commit sequence numbers stored in the extent map entries in the underlying VersionedRandomAllocationSegment. Pages are always accessed through their original anchor page. Therefore, the desired page is found by walking the page chain starting at the anchor until we find a page whose commit sequence number is less than or equal to the commit sequence number associated with the segment. We can walk the chain in this way because as noted above, the chain is constructed such that the pages that follow the anchor are arranged from newest to oldest.

A new snapshot page is created whenever a page is updated. With a SnapshotRandomAllocationSegment, the very first time a page is updated by the segment, it will create a new snapshot copy of the page, linking the new page into the page chain of the original anchor page in the VersionedRandomAllocationSegment. Subsequent updates made through the segment to that same page will not result in a new page allocation. Instead, the previous, newly allocated page will be used since the update is part of the same transaction.

Commits and rollbacks of changes stored in the temporary pages of the underlying VersionedRandomAllocationSegment are initiated through the SnapshotRandomAllocationSegment. When a commit is called on a SnapshotRandomAllocationSegment, it will call its underlying VersionedRandomAllocationSegment to make any page allocations and deallocations that it initiated permanent. When committing these changes, a new commit sequence number is passed down to VersionedRandomAllocationSegment so the extent page entries corresponding to these modifications will reflect the sequence number at the time of the commit, rather than at the start of the transaction.

The following example illustrates these concepts. Suppose the following:

Transaction 1 creates pages 1, 2 and 3.
Transaction 3 updates page 3, thereby creating snapshot page 4. It also creates a new page -- page 5.
Transaction 5 updates pages 4 and 5, creating snapshot pages 6 and 7, and creates page 8.

The extent map entries for these pages would be as noted in the diagram below. For simplicity, for each page entry, I only show the commit sequence number and the version chain pointer. Note that because pages 1, 2, and 8 have not been updated, their page entries chain back to themselves.

Using the above example:

A SnapshotRandomAllocationSegment with a commit sequence number of 2 would read pages 1, 2, and 3.
A SnapshotRandomAllocationSegment with commit sequence number 4 would read pages 1, 2, 4, and 5.
A SnapshotRandomAllocationSegment with commit sequence 6 would read pages 1, 2, 6, 7, and 8.

Associating Different Snapshots to the Same Object

If you want to tie more than one SnapshotRandomAllocationSegment (each corresponding to a different transaction over time) to the same object, one way of doing this is to make use of a DynamicDelegatingSegment. Unlike a DelegatingSegment, you can dynamically set the underlying delegate in a DynamicDelegatingSegment. In other words, as you switch to a new transaction, you can switch the delegate of a DynamicDelegatingSegment to the SnapshotRandomAllocationSegment corresponding to the new transaction.

Page Deallocation

One other noteworthy characteristic of VersionedRandomAllocationSegment is the manner in which page deallocations are handled. Requests to deallocate individual pages are still carried out by calling the deallocatePageRange method. However, for the VersionedRandomAllocationSegment, that method does not actually free the page, making it available for re-allocation. Instead, the page is marked as deallocation deferred. In order to free deallocation-deferred pages, VersionedRandomAllocationSegment::getOldPageIds must first be called to locate these pages. Then VersionedRandomAllocationSegment::deallocateOldPages must be called to do the actual frees. Normally, these two methods should be called in a loop until getOldPageIds can no longer find any old pages.

In addition to identifying pages marked for deallocation, VersionedRandomAllocationSegment::getOldPageIds also identifies pages in a page chain that are no longer being accessed by any active transactions. Some of the old Pages in a page chain can be deallocated provided there are at least three old pages in the page chain. Why three?

We can never deallocate the anchor page since there can be references to that anchor page id in btree entries and/or other page entries.
We cannot deallocate the newest, old page because it's still being referenced by active transactions.
Therefore, the newest page we can deallocate is the second newest, old page.

Using the example from earlier, if we consider the page chain with page 3 as its anchor and the oldest, active transaction id is 5, then no pages can be freed because transaction id 5 is still referencing page 4, and transaction id 6 and beyond are referencing page 6. On the other hand, if the oldest, active transaction id is 7, then we can free page 4. Again, we cannot free page 6 because transaction id 7 is still referencing it. If the page chain contained additional old pages following page 4 in the chain, those pages would also be freed.

CircularSegment

CircularSegment implements "infinite" page allocation in terms of a fixed-size underlying linear segment via wrap-around (so CircularSegment PageId x corresponds to PageId x modulo n in the underlying linear segment, where n is the fixed size). Allocation fails if old pages are not deallocated soon enough. The diagram below provides an example:

Uses

CircularSegment is good for implementing logs (but requires regular checkpointing in order to free log space).

VersionedSegment and WALSegment

VersionedSegment and WALSegment collaborate to provide a rudimentary page-based versioning and recovery scheme. (Currently the versioning is only used internally for recovery, and only two versions are maintained, but eventually online multi-versioning may be exposed for application use). The recovery guarantee made by VersionedSegment is that after a crash it can be restored to the exact physical state written at the time of a checkpoint prior to the crash. (This may be the last checkpoint or the penultimate checkpoint depending on the checkpoint type, as discussed later.)

VersionedSegment stores data pages in an underlying segment of any kind. It stores log pages in a separate WALSegment, which provides the bookkeeping needed to implement the write-ahead logging protocol. WALSegment requires its underlying segment to support ascending (but not necessarily linear) allocation with successor support. The diagram below shows the state of a VersionedSegment after several checkpoints:

Some allocated pages are clean, meaning they have not been modified since the last checkpoint, and hence have no corresponding pages in the WALSegment. Others are dirty and have corresponding log entries. The association between data and log pages is two-way; the data-to-log direction is stored in an in-memory table (dataToLogMap), but the log-to-data direction is stored persistently via the footer of each logged page; this footer also records the page version number and checksum for use during recovery. In addition, WALSegment maintains a dirtyPageSet automatically by implementing the cache's MappedPageListener interface. This allows VersionedSegment to determine whether the log page for a given dirty data page has been flushed yet; if not, VersionedSegment prevents the cache from flushing the data page by overriding MappedPageListener::canFlushPage. VersionedSegment also overrides MappedPageListener::notifyPageDirty to intercept each write to a page, giving it a chance to log the clean page data before the write takes place. Logging for a particular page is only required before the first write after each checkpoint, since recovery is not responsible for restoring intermediate states.

Checkpoints

Two kinds of checkpoints are supported for VersionedSegments. The simplest is a sharp checkpoint, which has the following sequence:

All dirty log pages are flushed.
All dirty data pages are flushed.
The log is truncated (all log pages are deallocated). This step is encapsulated in a separate method (VersionedSegment::deallocateCheckpointedLog) to allow VersionedSegment to participate in higher-level checkpoint processes.

The large amount of synchronous I/O required for a sharp checkpoint may result in poor throughput, so fuzzy checkpoints are also supported. When fuzzy checkpointing is used, log data spans two checkpoint intervals rather than just one. When a checkpoint is requested, the only data pages that are flushed are those that were also dirty at the time of the previous checkpoint; any others are left dirty in the hope that they will be flushed for other reasons, such as cache victimization, before the next checkpoint. The bookkeeping for this is encapsulated by FuzzyCheckpointSet. After a fuzzy checkpoint, the log cannot be truncated completely; instead, three log PageId pointers are maintained:

oldestLogPageId records the earliest log page allocated
lastCheckpointLogPageId records the latest log page allocated before the previous checkpoint
newestLogPageId records the latest log page allocated

During a fuzzy checkpoint, all pages earlier than lastCheckpointLogPageId are deallocated, and the log pointers are shifted so that oldestLogPageId advances to one page past lastCheckpointLogPageId, and lastCheckpointLogPageId advances to newestLogPageId.

TODO: a version/log history diagram is needed to help explain this

Recovery

Regardless of whether sharp or fuzzy checkpoints were used online, the recovery procedure is always the same:

The caller to VersionedSegment::recover must provide the PageId from which to start reading the log. This must be recorded while online by calling VersionedSegment::getRecoveryPageId during each checkpoint and storing the result in some other durable location. VersionedSegment also requires the caller to store and provide the latest segment version number, available from VersionedSegment::getVersionNumber.
When a log page is read, its footer is used to determine what to do with it.
If the checksum recorded in the footer does not match a recovery checksum computed from the page contents, the page is considered invalid and recovery terminates (successfully, under the assumption that the page represents the result of an incomplete write at the end of the log). TBD: onlineUuid.
If the page version is older than the latest version number, the page is skipped and recovery continues (TBD: why does this happen? I think fuzzy checkpointing).
If the previous tests pass, the page is assumed to be a good log page, and the contents are copied over to the corresponding data page.
Recovery advances to the next log page using the successor relationship. If NULL_PAGE_ID is encountered, recovery terminates successfully.

Uses

VersionedSegment can be used for UNDO recovery to restore an action-consistent checkpoint state. More complex transactional guarantees, such as REDO recovery to a last committed state, must be programmed at a higher level (TBD: link to txn design docs). WALSegment can be used independently as part of other logging schemes.

LinearViewSegment

LinearViewSegment allows a linear view to be imposed on top of a random allocation segment. The linear sequence is defined in terms of the successor function, so given the first page in a successor chain, the linear view is constructed by following the chain and building up an in-memory page table for random access, as in the following diagram:

Uses

LinearViewSegment is good for storing large dynamically allocated random-access arrays, such as LOB data, in an underlying RandomAllocationSegment.

ScratchSegment

ScratchSegment is like a ramdisk. Rather than storing data in a device, it allocates scratch pages from the cache. Page contents persist only until the scratch segment is closed, and remain implicitly locked into cache until then.

Scratch page BlockIds reference the cache's singleton instance of the RandomAccessNullDevice. However, these BlockId's are not mapped by the cache, so all access to ScratchSegment pages must go through a special CacheAccessor implementation.

Uses

ScratchSegment is useful for creating efficient temporary structures, since it eliminates the overhead associated with persistent page allocation. It can also be useful as a means of managing miscellaneous scratch memory required by a complex process such as a query execution.

Segment Layering

Several of the segment types defined above are built on top of another underlying segment. This segment layering can be of arbitrary depth; abstract base class DelegatingSegment makes it easy to create new segment implementations which use layering. The diagram below shows an example of how segments can be layered in interesting ways:

At the lowest level are two LinearDeviceSegments. The first accesses a data device, and the second accesses a log device. The CircularSegment imposes a particular page allocation pattern on the log, while the WALSegment adds the write-ahead logging support required by VersionedSegment. The VersionedSegment combines the data device with the log for physical recovery. On top of this, the pages of the VersionedSegment are interpreted as a RandomAllocationSegment, which is further subdivided by two LinearViewSegments.

Layering order can make a big difference. In this example, since the RandomAllocationSegment is above the VersionedSegment, all pages are versioned, including the segment and extent maps. This is good, since recovery needs to restore these to be consistent with the data page contents. However, in other contexts it might make sense to use versioning above a RandomAllocationSegment, for example to version one fixed-allocation object independently of everything else.

When footers are required by more than one layer, the footers are appended, with the lower layers towards the end of the page. Accordingly, the usable page size is cumulatively reduced. DelegatingSegment ensures that events such as checkpoints and page notifications are propagated through all layers (unless a layer consumes it without forwarding it on to those beneath it). Since cache pages can only be mapped to one segment at a time, they are mapped to the highest layer to guarantee that all layers see notifications.

Layering is designed to allow for abstraction. In the example above, it should be possible to eliminate the CircularSegment to create an ever-growing log instead, without the layers above knowing anything about it.

Page Access

So far, this document has focused on how various segment implementations map logical pages to physical storage. The segment layer also provides infrastructure useful for the implementation of complex structures stored in segments (e.g. BTrees):

SegmentAccessor is a convenience class which combines a Segment with a CacheAccessor, since these are usually needed together in order to access pages via the Cache.
SegPageLock is a scoped guard for locking pages and guaranteeing that they get unlocked even if an exception occurs, as well as hiding the details of PageId/BlockId translation and MappedPageListener. Code above the segment layer should almost always use this class rather than accessing the cache layer directly.
SegNodeLock and StoredNode provide a generic framework for prefixing the data of each page with a fixed-size, class-specific header, including a magic number for verifying that a newly accessed page is of the expected type.
SegPageIter allows a chain of page successors to be prefetched automatically as part of iteration.
SegOutputStream and SegInputStream allow segments supporting the successor relationship to be used for direct storage of sequential-access data such as transaction logs. SpillOutputStream provides a buffering optimization. CrcSegOutputStream and CrcSegInputStream compute checksums on stored stream data.

TODO

Some segment implementations which will be provided in the future include:

RollingLogSegment: for logs which are stored in multiple files, with new files being created as the log grows, and old files being deleted as the corresponding logs are deallocated by checkpoints.
ArchivedCircularSegment: a specialization of CircularSegment which uses a background thread to archive old blocks before they are reallocated.
CopyOnWriteSegment: a union which allows multiple read-write views over a shared underlying read-only segment.

It would be easy to define a CompoundSegment allowing multiple Segments to be combined, but it's probably a better idea to rely on operating system volume management support (with the exception of special cases that can benefit from database-level semantics, such as ping-pong writes for automatic log mirroring).

Definition at line 806 of file SegmentDesign.cpp.

The documentation for this struct was generated from the following file:

/home/pub/open/dev/fennel/segment/SegmentDesign.cpp

Generated on Mon Jun 22 04:00:45 2009 for Fennel by

1.5.1