Small. Fast. Reliable.
Choose any three.
SQLite File IO Specification
Table Of Contents
Javascript is required for some features of this document, including the table of contents, figure numbering and internal references (section numbers and hyper-links).

Overview

SQLite stores an entire database within a single file, the format of which is described in the SQLite Database File Format document ff_sqlitert_requirements. Each database file is stored within a file system, presumably provided by the host operating system. Instead of interfacing with the operating system directly, the host application is required to supply an adaptor component that implements the SQLite Virtual File System interface (described in capi_sqlitert_requirements). The adaptor component is responsible for translating the calls made by SQLite to the VFS interface into calls to the file-system interface provided by the operating system. This arrangement is depicted in figure figure_vfs_role.

Figure - Virtual File System (VFS) Adaptor

Although it would be easy to design a system that uses the VFS interface to read and update the content of a database file stored within a file-system, there are several complicated issues that need to be addressed by such a system:

  1. SQLite is required to implement atomic and durable transactions (the 'A' and 'D' from the ACID acronym), even if an application, operating system or power failure occurs midway through or shortly after updating a database file.

    To implement atomic transactions in the face of potential application, operating system or power failures, database writers write a copy of those portions of the database file that they are going to modify into a second file, the journal file, before writing to the database file. If a failure does occur while modifying the database file, SQLite can reconstruct the original database (before the modifications were attempted) based on the contents of the journal file.

  2. SQLite is required to implement isolated transactions (the 'I' from the ACID acronym).

    This is done by using the file locking facilities provided by the VFS adaptor to serialize writers (write transactions) and to prevent readers (read transactions) from accessing database files while writers are midway through updating them.

  3. For performance reasons, it is advantageous to minimize the quantity of data read and written to and from the file-system.

    As one might expect, the amount of data read from the database file is minimized by caching portions of the database file in main memory. Additionally, multiple updates to the database file that are part of the same write transaction may be cached in main memory and written to the file periodically, allowing for more efficient IO patterns and eliminating the redundant write operations that could take place if part of the database file is modified more than once within a single write transaction.

System requirement references for the above points.

This document describes in detail the way that SQLite uses the API provided by the VFS adaptor component to solve the problems and implement the strategies enumerated above. It also specifies the assumptions made about the properties of the system that the VFS adaptor provides access to. For example, specific assumptions about the extent of data corruption that may occur if a power failure occurs while a database file is being updated are presented in section fs_characteristics.

This document does not specify the details of the interface that must be implemented by the VFS adaptor component, that is left to capi_sqlitert_requirements.

Document Structure

Section vfs_assumptions of this document describes the various assumptions made about the system to which the VFS adaptor component provides access. The basic capabilities and functions required from the VFS implementation are presented along with the description of the VFS interface in capi_sqlitert_requirements. Section vfs_assumptions complements this by describing in more detail the assumptions made about VFS implementations on which the algorithms presented in this document depend. Some of these assumptions relate to performance issues, but most concern the expected state of the file-system following a failure that occurs midway through modifying a database file.

Section database_connections introduces the concept of a database connection, a combination of a file-handle and in-memory cache used to access a database file. It also describes the VFS operations required when a new database connection is created (opened), and when one is destroyed (closed).

Section reading_data describes the steps required to open a read transaction and read data from a database file.

Section writing_data describes the steps required to open a write transaction and write data to a database file.

Section rollback describes the way in which aborted write transactions may be rolled back (reverted), either as a result of an explicit user directive or because an application, operating system or power failure occurred while SQLite was midway through updating a database file.

Section page_cache_algorithms describes some of the algorithms used to determine exactly which portions of the database file are cached by a page cache, and the effect that they have on the quantity and nature of the required VFS operations.

Glossary

After this document is ready, make the vocabulary consistent and then add a glossary here.

VFS Adaptor Related Assumptions

This section documents those assumptions made about the system that the VFS adaptor provides access to. The assumptions noted in section fs_characteristics are particularly important. If these assumptions are not true, then a power or operating system failure may cause SQLite databases to become corrupted.

Performance Related Assumptions

SQLite uses the assumptions in this section to try to speed up reading from and writing to the database file.

It is assumed that writing a series of sequential blocks of data to a file in order is faster than writing the same blocks in an arbitrary order.

System Failure Related Assumptions

In the event of an operating system or power failure, the various combinations of file-system and storage hardware available provide varying levels of guarantee as to the integrity of the data written to the file system just before or during the failure. The exact combination of IO operations that SQLite is required to perform in order to safely modify a database file depends on the exact characteristics of the target platform.

This section describes the assumptions that SQLite makes about the content of a file-system following a power or system failure. In other words, it describes the extent of file and file-system corruption that such an event may cause.

SQLite queries an implementation for file-system characteristics using the xDeviceCharacteristics() and xSectorSize() methods of the database file file-handle. These two methods are only ever called on file-handles open on database files. They are not called for journal files, master-journal files or temporary database files.

The file-system sector size value determined by calling the xSectorSize() method is a power of 2 between 512 and 32768, inclusive (a reference to exactly how this value is determined belongs here). SQLite assumes that the underlying storage device stores data in blocks of sector-size bytes each, known as sectors. It is also assumed that each aligned block of sector-size bytes of each file is stored in a single device sector. If the file is not an exact multiple of sector-size bytes in size, then the final device sector is partially empty.

Normally, SQLite assumes that if a power failure occurs while updating any portion of a sector then the contents of the entire device sector are suspect following recovery. After writing to any part of a sector within a file, it is assumed that the modified sector contents are held in a volatile buffer somewhere within the system (main memory, disk cache etc.). SQLite does not assume that the updated data has reached the persistent storage media until after it has successfully synced the corresponding file by invoking the VFS xSync() method. Syncing a file causes all modifications to the file up until that point to be committed to persistent storage.

Based on the above, SQLite is designed around a model of the file-system whereby any sector of a file written to is considered to be in a transient state until after the file has been successfully synced. Should a power or system failure occur while a sector is in a transient state, it is impossible to predict its contents following recovery. It may be written correctly, not written at all, overwritten with random data, or any combination thereof.

For example, if the sector-size of a given file-system is 2048 bytes, and SQLite opens a file and writes a 1024 byte block of data to offset 3072 of the file, then according to the model the second sector of the file is in the transient state. If a power failure or operating system crash occurs before or during the next call to xSync() on the file handle, then following system recovery SQLite assumes that all file data between byte offsets 2048 and 4095, inclusive, is invalid. It also assumes that the first sector of the file, containing the data from byte offsets 0 to 2047 inclusive, is valid, since that sector was not in a transient state when the crash occurred.

Assuming that any and all sectors in the transient state may be corrupted following a power or system failure is a very pessimistic approach. Some modern systems provide more sophisticated guarantees than this. SQLite allows the VFS implementation to specify at runtime that the current platform supports zero or more of the following properties:

  1. The atomic-write property, for blocks of one or more specific sizes. An aligned write of such a block is either applied in its entirety or does not modify the file at all.
  2. The safe-append property. When data is appended to the end of a file, either the data is correctly appended or the file size remains unchanged; the file is never extended but populated with incorrect data.
  3. The sequential-write property. Operations are applied to persistent media in the order in which they are performed, as if each were synced before the next was attempted.

Failure Related Assumption Details

This section describes how the assumptions presented in the parent section apply to the individual API functions and operations provided by the VFS to SQLite for the purposes of modifying the contents of the file-system.

SQLite manipulates the contents of the file-system using a combination of the following four types of operation:

  1. Create file operations, which add a new, empty file to the file-system.
  2. Write file operations, which modify the contents of an existing file and may increase its size.
  3. Truncate file operations, which reduce the size of an existing file.
  4. Delete file operations, which remove an existing file from the file-system.

Additionally, all VFS implementations are required to provide the sync file operation, accessed via the xSync() method of the sqlite3_file object, used to flush create, write and truncate operations on a file to the persistent storage medium.

The formalized assumptions in this section refer to system failure events. In this context, this should be interpreted as any failure that causes the system to stop operating. For example a power failure or operating system crash.

SQLite does not assume that a create file operation has actually modified the file-system records within persistent storage until after the file has been successfully synced.

If a system failure occurs during or after a "create file" operation, but before the created file has been synced, then SQLite assumes that it is possible that the created file may not exist following system recovery.

Of course, it is also possible that it does exist following system recovery.

If a "create file" operation is executed by SQLite, and then the created file synced, then SQLite assumes that the file-system modifications corresponding to the "create file" operation have been committed to persistent media. It is assumed that if a system failure occurs any time after the file has been successfully synced, then the file is guaranteed to appear in the file-system following system recovery.

A delete file operation (invoked by a call to the VFS xDelete() method) is assumed to be an atomic and durable operation.

If a system failure occurs at any time after a "delete file" operation (call to the VFS xDelete() method) returns successfully, it is assumed that the file-system will not contain the deleted file following system recovery.

If a system failure occurs during a "delete file" operation, it is assumed that following system recovery the file-system will either contain the file being deleted in the state it was in before the operation was attempted, or not contain the file at all. It is assumed that it is not possible for the file to have become corrupted purely as a result of a failure occurring during a "delete file" operation.

The effects of a truncate file operation are not assumed to be made persistent until after the corresponding file has been synced.

If a system failure occurs during or after a "truncate file" operation, but before the truncated file has been synced, then SQLite assumes that the size of the truncated file is either as large or larger than the size that it was to be truncated to.

If a system failure occurs during or after a "truncate file" operation, but before the truncated file has been synced, then it is assumed that the contents of the file up to the size that the file was to be truncated to are not corrupted.

The above two assumptions may be interpreted to mean that if a system failure occurs after file truncation but before the truncated file is synced, the contents of the file following the point at which it was to be truncated may not be trusted. They may contain the original file data, or may contain garbage.

If a "truncate file" operation is executed by SQLite, and then the truncated file synced, then SQLite assumes that the file-system modifications corresponding to the "truncate file" operation have been committed to persistent media. It is assumed that if a system failure occurs any time after the file has been successfully synced, then the effects of the file truncation are guaranteed to appear in the file system following recovery.

A write file operation modifies the contents of an existing file within the file-system. It may also increase the size of the file. The effects of a write file operation are not assumed to be made persistent until after the corresponding file has been synced.

If a system failure occurs during or after a "write file" operation, but before the corresponding file has been synced, then it is assumed that the content of all sectors spanned by the write file operation are untrustworthy following system recovery. This includes regions of the sectors that were not actually modified by the write file operation.

If a system failure occurs on a system that supports the atomic-write property for blocks of size N bytes following an aligned write of N bytes to a file, but before the file has been successfully synced, then it is assumed following recovery that either all sectors spanned by the write operation were correctly updated, or that none of the sectors were modified at all.

If a system failure occurs on a system that supports the safe-append property following a write operation that appends data to the end of a file without modifying any of the existing file content, but before the file has been successfully synced, then it is assumed following recovery that either the data was correctly appended to the file, or that the file size remains unchanged. It is assumed that it is impossible for the file to be extended but populated with incorrect data.

Following a system recovery, if a device sector is deemed to be untrustworthy as defined by A21008 and neither A21011 nor A21012 apply to the range of bytes written, then no assumption can be made about the content of the sector following recovery. It is assumed that it is possible for such a sector to be written correctly, not written at all, populated with garbage data, or any combination thereof.

Fix the requirement below. The idea is to say that extending a file cannot cause the file size to become corrupted and thereby cause the whole file to be lost.

If a system failure occurs during or after a "write file" operation that causes the file to grow, but before the corresponding file has been synced, then it is assumed that the size of the file following recovery is at least as large as it was before the "write file" operation was attempted.

If a system supports the sequential-write property, then further assumptions may be made with respect to the state of the file-system following recovery from a system failure. Specifically, it is assumed that create, truncate, delete and write file operations are applied to the persistent representation in the same order as they are performed by SQLite. Furthermore, it is assumed that the file-system waits until one operation is safely written to the persistent media before the next is attempted, just as if the relevant file were synced following each operation.

If a system failure occurs on a system that supports the sequential-write property, then it is assumed that all operations completed before the last time any file was synced have been successfully committed to persistent media.

If a system failure occurs on a system that supports the sequential-write property, then it is assumed that the set of possible states that the file-system may be in following recovery is the same as if each of the write operations performed since the most recent time a file was synced was itself followed by a sync file operation, and that the system failure may have occurred during any of the write or sync file operations.

Database Connections

Within this document, the term database connection has a slightly different meaning from that which one might assume. The handles returned by the sqlite3_open() and sqlite3_open16() APIs (reference) are referred to as database handles. A database connection is a connection to a single database file using a single file-handle, which is held open for the lifetime of the connection. Using the "ATTACH" syntax, multiple database connections may be accessed via a single database handle. Or, using SQLite's shared-cache mode feature, multiple database handles may access a single database connection.

Figure - Relationship between Database Connections and Database Handles.

As well as a file-handle open on the database file, each database connection has a page cache associated with it. The page cache is used to cache data read from the database file to reduce the amount of data that must be read from the file-handle. It is also used to accumulate data written to the database file so that write operations can be batched for greater efficiency. Figure figure_db_connection illustrates a system containing two database connections, each to a separate database file. The leftmost of the two depicted database connections is shared between two database handles. The connection illustrated towards the right of the diagram is used by a single database handle.

It may at first seem odd to mention the page cache, primarily an implementation detail, in this document. However, it is necessary to acknowledge and describe the page cache in order to provide a more complete explanation of the nature and quantity of IO performed by SQLite. Further description of the page cache is provided in section page_cache_descripton.

The Page Cache

The content of an SQLite database file is formatted as a set of fixed-size pages (see ff_sqlitert_requirements for a complete description of the format used). The page size used for a particular database is stored as part of the database file header at a well-known offset within the first 100 bytes of the file.

As one might imagine, the page cache caches data read from the database file on a page basis. Whenever data is read from the database file to satisfy user queries, it is loaded in units of a page at a time (see section reading_data for further details). After being read, page content is stored by the page cache in main memory. The next time the page data is required, it may be read from the page cache instead of from the database file.

Data is also cached within the page cache before it is written to the database file. Usually, when a user issues a command that modifies the content of the database file, only the cached version of the page within the connection's page cache is modified. When the containing write transaction is committed, the contents of all modified pages within the page cache are copied into the database file.

Some kind of reference to the 'page cache algorithms' section.

Opening a New Connection

This section describes the VFS operations that take place when a new database connection is created.

Opening a new database connection is a two-step process:

  1. A file-handle is opened on the database file.
  2. If step 1 was successful, an attempt is made to read the database file header from the database file using the new file-handle.

In step 2 of the procedure above, the database file is not locked before it is read from. This is the only exception to the locking rules described in section reading_data.

The reason for attempting to read the database file header is to determine the page-size used by the database file. Because it is not possible to be certain as to the page-size without holding at least a shared lock on the database file (because some other database connection might have changed it since the database file header was read), the value read from the database file header is known as the expected page size.

When a new database connection is required, SQLite shall attempt to open a file-handle on the database file. If the attempt fails, then no new database connection is created and an error returned.

When a new database connection is required, after opening the new file-handle, SQLite shall attempt to read the first 100 bytes of the database file. If the attempt fails for any reason other than that the opened file is less than 100 bytes in size, then the file-handle is closed, no new database connection is created and an error is returned instead.

If the database file header is successfully read from a newly opened database file, the connection's expected page-size shall be set to the value stored in the page-size field of the database header.

If the database file header cannot be read from a newly opened database file (because the file is less than 100 bytes in size), the connection's expected page-size shall be set to the compile time value of the SQLITE_DEFAULT_PAGESIZE option.

Closing a Connection

This section describes the VFS operations that take place when an existing database connection is closed (destroyed).

Closing a database connection is a simple matter. The open VFS file-handle is closed and in-memory page cache related resources are released.

When a database connection is closed, SQLite shall close the associated file handle at the VFS level.

Reading Data

In order to return data from the database to the user, for example as the results of a SELECT query, SQLite must at some point read data from the database file. Usually, data is read from the database file in aligned blocks of page-size bytes. The exception is when the database file header fields are being inspected, before the page-size used by the database can be known.

With two exceptions, a database connection must have an open transaction (either a read-only transaction or a read/write transaction) on the database file before data may be read from the database connection. In this case, data "read from the database connection" includes data that is read from the database file and data that is already present in the page cache. Without an open transaction on the database file, the contents of the page cache may not be trusted.

The two exceptions are:

  1. The read of the database file header that takes place when a new database connection is opened (the read operation required by H21007).
  2. Those read operations made while in the process of opening a read-only transaction, before the transaction is fully opened.

Once a transaction has been opened, reading data from a database connection is a simple operation. Using the xRead() method of the file-handle open on the database file, the required database file pages are read one at a time. SQLite never reads partial pages and always uses a single call to xRead() for each required page. After reading the data for a database page, SQLite adds it to the connection's page cache so that it does not have to be read if required again. Refer to section page_cache_algorithms for a description of how this affects the IO performed by SQLite.

Except for the read operation required by H21007 and those reads made as part of opening a read-only transaction, SQLite shall only read data from a database connection while the database connection has an open read-only or read/write transaction.

In the above requirement, reading data from a database connection includes retrieving data from the connection's page cache.

Aside from those read operations described by H21007 and H21XXX, SQLite shall read data from the database in aligned blocks of page-size bytes, where page-size is the database page size used by the database file.

Opening a Read-Only Transaction

Before data may be read from a database connection, a read-only transaction must be successfully opened (this is true even if the connection will eventually write to the database, as a read/write transaction may only be opened by upgrading from a read-only transaction). This section describes the procedure for opening a read-only transaction.

The key element of a read-only transaction is that the file-handle open on the database file obtains and holds a shared-lock on the database file. Because a connection requires an exclusive-lock before it may actually modify the contents of the database file, and by definition while one connection is holding a shared-lock no other connection may hold an exclusive-lock, holding a shared-lock guarantees that no other process may modify the database file while the read-only transaction remains open.

Obtaining the shared lock itself on the database file is quite simple: SQLite just calls the xLock() method of the database file handle. Some of the other processes that take place as part of opening the read-only transaction are quite complex. The list of steps SQLite is required to take to open a read-only transaction, in the order in which they must occur, is as follows:

  1. A shared-lock is obtained on the database file.
  2. The connection checks if a hot journal file exists in the file-system. If one does, then it is rolled back before continuing.
  3. The connection checks if the data in the page cache may still be trusted. If not, all page cache data is discarded.
  4. If the file-size is not zero bytes and the page cache does not contain valid data for the first page of the database, then the data for the first page must be read from the database.

Of course, an error may occur while attempting any of the 4 steps enumerated above. If this happens, then the shared-lock is released (if it was obtained) and an error returned to the user. Step 2 of the procedure above is described in more detail in section hot_journal_detection. Section cache_validation describes the process identified by step 3 above. Further detail on step 4 may be found in section read_page_one.

When required to open a read-only transaction using a database connection, SQLite shall first attempt to obtain a shared-lock on the file-handle open on the database file.

If, while opening a read-only transaction, SQLite fails to obtain the shared-lock on the database file, then the process is abandoned, no transaction is opened and an error returned to the user.

The most common reason an attempt to obtain a shared-lock may fail is that some other connection is holding an exclusive or pending lock. However it may also fail because some other error (e.g. IO, comms related) occurs within the call to the xLock() method.

While opening a read-only transaction, after successfully obtaining a shared lock on the database file, SQLite shall attempt to detect and roll back a hot journal file associated with the same database file.

If, while opening a read-only transaction, SQLite encounters an error while attempting to detect or roll back a hot journal file, then the shared-lock on the database file is released, no transaction is opened and an error returned to the user.

Section hot_journal_detection contains a description of and requirements governing the detection of a hot-journal file referred to in the above requirements.

Assuming no errors have occurred, then after attempting to detect and roll back a hot journal file, if the connection's page cache is not empty, then SQLite shall validate the contents of the page cache by testing the file change counter. This procedure is known as cache validation.

If the contents of the page cache are found to be invalid by the check prescribed by F20040, SQLite shall discard the cache contents before continuing.

Hot Journal Detection

This section describes the procedure that SQLite uses to detect a hot journal file. If a hot journal file is detected, this indicates that at some point the process of writing a transaction to the database was interrupted and a recovery operation (hot journal rollback) needs to take place.

The procedure used to detect a hot-journal file is quite complex. The following steps take place:

  1. Using the VFS xAccess() method, SQLite queries the file-system to see if the journal file associated with the database exists. If it does not, then there is no hot-journal file.
  2. By invoking the xCheckReservedLock() method of the file-handle opened on the database file, SQLite checks if some other connection holds a reserved lock or greater. If some other connection does hold a reserved lock, this indicates that the other connection is midway through a read/write transaction (see section writing_data). In this case the journal file is not a hot-journal and must not be rolled back.
  3. Using the xFileSize() method of the file-handle opened on the database file, SQLite checks if the database file is 0 bytes in size. If it is, the journal file is not considered to be a hot journal file. Instead of rolling back the journal file, in this case it is deleted from the file-system by calling the VFS xDelete() method. Technically, there is a race condition here. This step should be moved to after the exclusive lock is held.
  4. An attempt is made to upgrade to an exclusive lock on the database file. If the attempt fails, then all locks, including the recently obtained shared lock are dropped. The attempt to open a read-only transaction has failed. This occurs when some other connection is also attempting to open a read-only transaction and the attempt to gain the exclusive lock fails because the other connection is also holding a shared lock. It is left to the other connection to roll back the hot journal.
    It is important that the file-handle lock is upgraded directly from shared to exclusive in this case, instead of first upgrading to reserved or pending locks as is required when obtaining an exclusive lock to write to the database file (section writing_data). If SQLite were to first upgrade to a reserved or pending lock in this scenario, then a second process also trying to open a read-transaction on the database file might detect the reserved lock in step 2 of this process, conclude that there was no hot journal, and commence reading data from the database file.
  5. The xAccess() method is invoked again to detect if the journal file is still in the file system. If it is, then it is a hot-journal file and SQLite tries to roll it back (see section rollback).

The following requirements describe step 1 of the above procedure in more detail.

When required to attempt to detect a hot-journal file, SQLite shall first use the xAccess() method of the VFS layer to check if a journal file exists in the file-system.

When required to attempt to detect a hot-journal file, if the call to xAccess() required by H21014 indicates that a journal file does not exist, then the attempt to detect a hot-journal file is finished. A hot-journal file was not detected.

The following requirements describe step 2 of the above procedure in more detail.

When required to attempt to detect a hot-journal file, if the call to xAccess() required by H21014 indicates that a journal file is present, then the xCheckReservedLock() method of the database file file-handle is invoked to determine whether or not some other process is holding a reserved or greater lock on the database file.

If the call to xCheckReservedLock() required by H21016 indicates that some other database connection is holding a reserved or greater lock on the database file, then SQLite shall conclude that the journal file is not a hot-journal file. In this case the attempt to detect a hot-journal file is finished. A hot-journal file was not detected.

Finish this section.

Cache Validation

When a database connection opens a read transaction, the associated page cache may already contain data. However, if another process has modified the database file since the cached pages were loaded it is possible that the cached data is invalid.

SQLite determines whether the contents of a page cache are valid using the file change counter, a field in the database file header. The file change counter is a 4-byte big-endian integer field stored starting at byte offset 24 of the database file header. Before the conclusion of a read/write transaction that modifies the contents of the database file in any way (see section writing_data), the value stored in the file change counter is incremented. When a database connection unlocks the database file, it stores the current value of the file change counter. Later, while opening a new read-only transaction, SQLite checks the value of the file change counter stored in the database file. If the value has not changed since the database file was unlocked, then the contents of the page cache can be trusted. If the value has changed, then the page cache cannot be trusted and all data is discarded.

When a file-handle open on a database file is unlocked, if the page cache belonging to the associated database connection is not empty, SQLite shall store the value of the file change counter internally.

When required to perform cache validation as part of opening a read transaction, SQLite shall read a 16-byte block starting at byte offset 24 of the database file using the xRead() method of the database connection's file-handle.

Why a 16 byte block? Why not 4? (something to do with encrypted databases).

While performing cache validation, after loading the 16-byte block as required by H21019, SQLite shall compare the 32-bit big-endian integer stored in the first 4 bytes of the block to the most recently stored value of the file change counter (see H21018). If the values are not the same, then SQLite shall conclude that the contents of the cache are invalid.
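The comparison described by these requirements can be sketched as follows. The function names are illustrative, not part of SQLite's API; only the layout of the change counter (a 32-bit big-endian integer in the first 4 bytes of the block read from offset 24) is taken from the requirements above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The first 4 bytes of the 16-byte block read from byte offset 24 of the
   database file hold the file change counter, big-endian. */
uint32_t read_change_counter(const uint8_t block[16]) {
  return ((uint32_t)block[0] << 24) | ((uint32_t)block[1] << 16) |
         ((uint32_t)block[2] << 8)  |  (uint32_t)block[3];
}

/* Cache contents are valid only if the counter matches the value stored
   internally when the database file was last unlocked. */
bool cache_is_valid(uint32_t stored_counter, const uint8_t block[16]) {
  return read_change_counter(block) == stored_counter;
}
```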

Requirement H21005 (section open_read_only_trans) specifies the action SQLite is required to take upon determining that the cache contents are invalid.

Page 1 and the Expected Page Size

As the last step in opening a read transaction on a database file that is more than 0 bytes in size, SQLite is required to load data for page 1 of the database into the page cache, if it is not already there. This is slightly more complicated than it seems, as the database page-size is not known at this point.

Even though the database page-size cannot be known for sure, SQLite is usually able to guess correctly by assuming it to be equal to the connection's expected page size. The expected page size is the value of the page-size field read from the database file header while opening the database connection (see section open_new_connection), or the page-size of the database file stored when the most recent read transaction was concluded.

During the conclusion of a read transaction, before unlocking the database file, SQLite shall set the connection's expected page size to the current database page-size.

As part of opening a new read transaction, immediately after performing cache validation, if there is no data for database page 1 in the page cache, SQLite shall read N bytes from the start of the database file using the xRead() method of the connection's file handle, where N is the connection's current expected page size value.

If page 1 data is read as required by H21023, and the value of the page-size field that appears in the database file header (which consumes the first 100 bytes of the read block) is not the same as the connection's current expected page size, then the expected page size shall be set to this value, the database file unlocked, and the entire procedure to open a read transaction repeated.

If page 1 data is read as required by H21023, and the value of the page-size field that appears in the database file header (which consumes the first 100 bytes of the read block) is the same as the connection's current expected page size, then the block of data read shall be added to the connection's page cache as page 1.
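The expected-page-size check can be sketched as below. This sketch assumes, per the database file format, that the page-size field is a 2-byte big-endian integer at byte offset 16 of the 100-byte header at the start of page 1; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* The page-size field: 2-byte big-endian integer at byte offset 16 of the
   database file header (which occupies the first 100 bytes of page 1). */
int header_page_size(const uint8_t *page1) {
  return (page1[16] << 8) | page1[17];
}

/* Returns 1 if the guessed (expected) page size was correct, so the read
   block can be cached as page 1. Otherwise updates *expected to the real
   page size; the caller must unlock the database file and repeat the
   procedure to open a read transaction. */
int page1_read_ok(const uint8_t *block, int *expected) {
  int actual = header_page_size(block);
  if (actual == *expected) return 1;
  *expected = actual;
  return 0;
}
```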

Ending a Read-only Transaction

To end a read-only transaction, SQLite simply relinquishes the shared lock on the file-handle open on the database file. No other action is required.

When required to end a read-only transaction, SQLite shall relinquish the shared lock held on the database file by calling the xUnlock() method of the file-handle.

See also requirements H21018 and H21021 above.

Writing Data

Safely writing data to a database file is also a complex procedure. The database file must be updated in such a way that if a power failure, operating system crash or application fault occurs while SQLite is midway through writing to the database file the database contents are still accessible and correct after system recovery.

Logically, an SQLite database file is modified using write transactions. Each write transaction may contain any number of modifications to the database file's content or size. From the point of view of an external observer (a second database connection) an entire write transaction is applied to the database file atomically. If a failure of some sort occurs while SQLite is midway through applying a write transaction to a database file, then it must appear from the point of view of the next database connection that reads data from the database file that the aborted transaction was not applied.

SQLite accomplishes these goals using two techniques:

  1. The page cache belonging to the database connection is used to buffer writes before they are written to the database file. Often, all changes for an entire write transaction are accumulated within the page cache. In this case no write operations are performed on the database file until the user commits the transaction.

  2. Even if an application or system failure does not occur while a write transaction is in progress, a rollback operation to restore the database file to the state that it was in before the transaction started may be required. This may occur if the user explicitly requests transaction rollback (i.e. by issuing a "ROLLBACK" command), or automatically, as a result of encountering an SQL constraint (see sql_sqlitert_requirements). For this reason, the original page content is stored in the journal file before the page is even modified within the page cache.

Before modifying or adding any in-memory page cache pages in preparation for writing to the database file, the database connection shall open a write transaction on the database file.

Before modifying the page cache image of a database page that existed and was not a free-list leaf page when the current write transaction began, SQLite shall ensure that the original page content has been written to the journal file (journalled).

If the sector size is larger than the page-size, coresident pages must also be journalled.

The process of journalling a database page is described in detail in section journalling_a_page.

Eventually, the content of pages modified by a transaction must be copied from the page cache and into the actual database file. This may occur for either of the following two reasons:

  1. Because the write transaction is being committed (section committing_a_transaction), or
  2. To free up memory if the number of modified pages grows too large (see section page_cache_algorithms).

In both cases, the region of the journal file containing the original data for the pages being modified within the database file must be flushed through to the persistent media before the database file may be written to. This is to ensure that the original data is recoverable in the event of a system failure. This process is known as syncing the journal file and is described in section syncing_journal_file.

A write transaction may be terminated in one of two ways. It may be committed, meaning that the changes involved in the transaction are written to the database file, or rolled back, meaning no changes are applied. Committing a transaction is described in section committing_a_transaction. Transaction rollback is described in section rollback.

Figure figure_write_transaction depicts an overview of an entire write transaction. It is intended to be illustrative only; many operations are omitted.

Figure - Progression of a Write Transaction

Journal File Format

Journal Header Format

A journal header is sector-size bytes in size, where sector-size is the value returned by the xSectorSize method of the file handle opened on the database file.

Figure - Journal Header Format

Journal Record Format

Figure - Journal Record Format

Master Journal Pointer

Write Transactions

Beginning a Write Transaction

Before any database pages may be modified within the page cache, the database connection must open a write transaction. Opening a write transaction requires that the database connection obtains a reserved lock (or greater) on the database file. Because obtaining a reserved lock on a database file guarantees that no other database connection may hold or obtain a reserved lock or greater, it follows that no other database connection may have an open write transaction.

A reserved lock on the database file may be thought of as an exclusive lock on the journal file. No database connection may read from or write to a journal file without a reserved or greater lock on the corresponding database file.

Before opening a write transaction, a database connection must have an open read transaction, opened via the procedure described in section open_read_only_trans. This ensures that there is no hot-journal file that needs to be rolled back and that the content of the page cache, if any, can be trusted.

Once a read transaction has been opened, upgrading to a write transaction is a two step process, as follows:

  1. A reserved lock is obtained on the database file.
  2. The journal file is opened (and created if necessary) using the VFS xOpen method, and a journal file header is written to the start of it using a single call to the file handle's xWrite method.
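The two steps above can be sketched in C as follows. The struct and its members are hypothetical stand-ins for the xLock, xOpen and xWrite operations involved, not SQLite's actual VFS interface; the demo stubs make the sketch self-contained.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the VFS operations used here. */
typedef struct {
  bool (*xLockReserved)(void);                  /* obtain RESERVED lock     */
  bool (*xOpenJournal)(const char *path);       /* open/create journal file */
  bool (*xWriteHeader)(const void *buf, int n); /* single xWrite call       */
} demo_vfs;

/* Steps 1 and 2 of upgrading a read transaction to a write transaction. */
bool begin_write_transaction(demo_vfs *v, const char *jrnl,
                             int sector_size, unsigned char *hdr) {
  if (!v->xLockReserved())  return false;  /* step 1: RESERVED lock        */
  if (!v->xOpenJournal(jrnl)) return false;/* step 2a: open journal file   */
  return v->xWriteHeader(hdr, sector_size);/* step 2b: one write of
                                              sector-size bytes            */
}

/* Demonstration stubs. */
static bool demo_ok(void)                        { return true; }
static bool demo_open(const char *p)             { (void)p; return true; }
static bool demo_write(const void *b, int n)     { (void)b; return n == 512; }
```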

Requirements describing step 1 of the above procedure in detail:

When required to open a write transaction on the database, SQLite shall first open a read transaction, if the database connection in question has not already opened one.

When required to open a write transaction on the database, after ensuring a read transaction has already been opened, SQLite shall obtain a reserved lock on the database file by calling the xLock method of the file-handle open on the database file.

Requirements describing step 2 of the above procedure in detail:

When required to open a write transaction on the database, after obtaining a reserved lock on the database file, SQLite shall open a read/write file-handle on the corresponding journal file.

When required to open a write transaction on the database, after opening a file-handle on the journal file, SQLite shall write a journal header into the first sector-size bytes of the journal file, using a single call to the xWrite method of the recently opened file-handle.

Requirements describing the journal header written to the journal file:

The first 8 bytes of the journal header required to be written by H21038 shall be the journal magic number: 0xd9 0xd5 0x05 0xf9 0x20 0xa1 0x63 0xd7.

Requirements describing the details of opening a write transaction.

Requirement for error handling?

Journalling a Database Page

Before modifying a database page within the page cache, the page must be journalled. Journalling a page is the process of copying that page's original data into the journal file so that it can be recovered if the write transaction is rolled back.

A page is journalled by adding a journal record to the journal file. The format of a journal record is described in section journal_record_format.

When required to journal a database page, SQLite shall first append the page number of the page being journalled to the journal file, formatted as a 4-byte big-endian unsigned integer, using a single call to the xWrite method of the file-handle opened on the journal file.

When required to journal a database page, if the attempt to append the page number to the journal file is successful, then the current page data (page-size bytes) shall be appended to the journal file, using a single call to the xWrite method of the file-handle opened on the journal file.

When required to journal a database page, if the attempt to append the current page data to the journal file is successful, then SQLite shall append a 4-byte big-endian integer checksum value to the journal file, using a single call to the xWrite method of the file-handle opened on the journal file.

The checksum value written to the journal file immediately after the page data (requirement H21029), is a function of both the page data and the checksum initializer field stored in the journal header (see section journal_header_format). Specifically, it is the sum of the checksum initializer and the value of every 200th byte of page data interpreted as an 8-bit unsigned integer, starting with the (page-size % 200)'th byte of page data. For example, if the page-size is 1024 bytes, then a checksum is calculated by adding the values of the bytes at offsets 23, 223, 423, 623, 823 and 1023 (the last byte of the page) together with the value of the checksum initializer.

The checksum value written to the journal file by the write required by H21029 shall be equal to the sum of the checksum initializer field stored in the journal header (H21XXX) and every 200th byte of the page data, beginning with the (page-size % 200)th byte.

The '%' character is used in the previous two paragraphs to represent the modulo operator, just as it is in programming languages such as C, Java and Javascript.
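The checksum arithmetic described above can be sketched as follows. This is a simplified sketch of the calculation as specified here, with the "(page-size % 200)'th byte" read as 1-based (so the first byte summed is at 0-based offset (page-size % 200) - 1, matching the worked example of offsets 23, 223, ..., 1023 for a 1024-byte page); the function name is illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Journal record checksum: the sum of the checksum initializer from the
   journal header and every 200th byte of page data, starting with the
   (page_size % 200)'th byte (1-based). Assumes page_size is a valid
   SQLite page size (a power of two >= 512), so (page_size % 200) != 0. */
uint32_t journal_checksum(uint32_t init, const uint8_t *data, int page_size) {
  uint32_t sum = init;
  for (int i = (page_size % 200) - 1; i < page_size; i += 200) {
    sum += data[i];  /* offsets 23, 223, 423, ... for a 1024-byte page */
  }
  return sum;
}
```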

Syncing the Journal File

Even after the original data of a database page has been written into the journal file using calls to the journal file file-handle xWrite method (section journalling_a_page), it is still not safe to write to the page within the database file. This is because in the event of a system failure the data written to the journal file may still be corrupted (see section fs_characteristics). Before the page can be updated within the database itself, the following procedure takes place:

  1. The xSync method of the file-handle opened on the journal file is called. This operation ensures that all journal records in the journal file have been written to persistent storage, and that they will not become corrupted as a result of a subsequent system failure.
  2. The journal record count field (see section journal_header_format) of the most recently written journal header in the journal file is updated to contain the number of journal records added to the journal file since the header was written.
  3. The xSync method is called again, to ensure that the update to the journal record count has been committed to persistent storage.

If all three of the steps enumerated above are executed successfully, then it is safe to modify the content of the journalled database pages within the database file itself. The combination of the three steps above is referred to as syncing the journal file.
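The three steps can be sketched in C as below. The struct members are hypothetical stand-ins for the journal file-handle's xSync and xWrite methods; the offset of the record-count field within the journal header and the big-endian encoding of that field are assumptions taken from the journal header description, and the demo stubs make the sketch self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical file-handle methods for the journal file. */
typedef struct {
  bool (*xSync)(void);
  bool (*xWrite)(const void *buf, int n, long offset);
} demo_file;

/* Syncing the journal file: (1) sync the journal records to persistent
   storage, (2) update the record-count field of the most recently written
   journal header, (3) sync again so the count update is also durable. */
bool sync_journal(demo_file *f, long count_field_offset, uint32_t nrec) {
  uint8_t buf[4] = {                              /* record count, big-endian */
    (uint8_t)(nrec >> 24), (uint8_t)(nrec >> 16),
    (uint8_t)(nrec >> 8),  (uint8_t)nrec
  };
  if (!f->xSync()) return false;                             /* step 1 */
  if (!f->xWrite(buf, 4, count_field_offset)) return false;  /* step 2 */
  return f->xSync();                                         /* step 3 */
}

/* Demonstration stubs. */
static bool demo_sync(void) { return true; }
static bool demo_write4(const void *b, int n, long off) {
  (void)b; (void)off; return n == 4;
}
```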

Upgrading to an Exclusive Lock

Before the content of a page modified within the page cache may be written to the database file, an exclusive lock must be held on the database file. The purpose of this lock is to prevent another connection from reading from the database file while the first connection is midway through writing to it. Whether the reason for writing to the database file is because a transaction is being committed, or to free up space within the page cache, upgrading to an exclusive lock always occurs immediately after syncing the journal file.

Committing a Transaction

Committing a write transaction is the final step in updating the database file. The steps involved are:

  1. Update the change counter.
  2. Sync the journal file.
  3. Obtain an exclusive lock on the database file.
  4. Write the modified pages out to the database file.
  5. Sync the database file.
  6. Delete the journal file.
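The commit sequence above can be sketched as a chain of operations that stops at the first failure. The struct and its members are hypothetical, one per commit step; the demo stub makes the sketch self-contained.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical operations, one per commit step. */
typedef struct {
  bool (*update_change_counter)(void);
  bool (*sync_journal)(void);
  bool (*lock_exclusive)(void);
  bool (*write_database)(void);
  bool (*sync_database)(void);
  bool (*delete_journal)(void);
} commit_ops;

/* The six commit steps, in order. Short-circuit evaluation stops the
   sequence at the first failing step; the transaction only becomes
   irrevocable once the journal file has been deleted (step 6). */
bool commit_transaction(commit_ops *c) {
  return c->update_change_counter()  /* step 1 */
      && c->sync_journal()           /* step 2 */
      && c->lock_exclusive()         /* step 3 */
      && c->write_database()         /* step 4 */
      && c->sync_database()          /* step 5 */
      && c->delete_journal();        /* step 6 */
}

/* Demonstration stub. */
static bool demo_step(void) { return true; }
```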

Writing out the Page Cache

When a modification is made to the database, the change is first applied in-memory, to pages stored in the page cache. The process of copying the modified pages from the page cache to the actual database file in the file system is known as writing out the page cache. There are two circumstances in which this may occur:

  1. Because the write transaction is being committed (section committing_a_transaction), or
  2. To free up memory if the number of modified pages grows too large (see section page_cache_algorithms).

Before any data can be written into the database file, it must be locked with an exclusive lock if it is not already. This is to prevent any other database connection from reading the database after a subset of the modifications that have been or will be made by a write transaction have been written into the database file.

Journal header operations?

Unless a pending or exclusive lock has already been obtained, when SQLite is required to write out a page cache, it shall first upgrade the lock on the database file to a pending lock using a call to the xLock method of the file-handle open on the database file.

Unless one has already been obtained, when SQLite is required to write out a page cache, after successfully obtaining a pending lock it shall upgrade the lock on the database file to an exclusive lock using a call to the xLock method of the file-handle open on the database file.

If obtaining the lock fails?

When SQLite is required to write out a page cache, if the required exclusive lock is already held or successfully obtained, SQLite shall copy the contents of all pages that have been modified within the page cache to the database file, using a single write of page-size bytes for each.

When the modified contents of a page cache are copied into the database file, as required by H21033, the write operations shall occur in page-number order, from lowest to highest.

The above requirement to write data to the database file in the order in which it occurs in the file is included to improve performance. On many systems, sorting the regions of the file to be written before writing to them allows the storage hardware to operate more efficiently.
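A minimal sketch of establishing that ordering, assuming the dirty pages are tracked by page number: sort the page numbers ascending before issuing the per-page writes. The function names are illustrative only.

```c
#include <assert.h>
#include <stdlib.h>

/* Comparator for qsort: ascending page numbers. */
static int cmp_pgno(const void *a, const void *b) {
  unsigned x = *(const unsigned *)a, y = *(const unsigned *)b;
  return (x > y) - (x < y);
}

/* Sort the dirty page numbers so that the per-page writes issued to the
   database file occur in ascending page-number order, from lowest to
   highest, as the requirement above specifies. */
void sort_dirty_pages(unsigned *pgno, int n) {
  qsort(pgno, n, sizeof(unsigned), cmp_pgno);
}
```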

Statement Transactions

Multi-File Transactions

Rollback

Hot Journal Rollback

Transaction Rollback

Statement Rollback

Page Cache Algorithms

References

[1] C API Requirements Document.
[2] SQL Requirements Document.
[3] File Format Requirements Document.

This page last modified 2008/08/07 11:33:33 UTC