NON-BLOCKING SCHEMA CHANGE

For CMU's yet-to-be-named database


Tools

C++

SQL

Course and Teammates

Description

For this project, we added schema change functionality to CMU's yet-to-be-named database, a research DBMS developed by CMU's database group. We implemented schema changes in a non-blocking, lazy manner and supported an arbitrary number of schema versions on each table. We also added support for the ALTER TABLE SQL command to the DBMS. Our benchmarks show that concurrent schema changes slow down normal SQL operations on a table by only a negligible amount.

15-721 Advanced Database Systems

Chen Xu, Ling Zhang

Duration

Apr 2020

4 weeks

Links

Github

Project showcase page (see the Schema Changes project)

B+ Tree

Before our self-selected final project on schema change, in the previous course project we implemented and optimized a B+ tree from scratch. It was somewhat more involved than a textbook B+ tree: we had to support duplicate keys, provide forward and reverse iterators that work correctly alongside concurrent mutators, and beat the performance of Terrier's Bw-Tree implementation.

The main performance bottleneck is lock contention, especially at the root of the tree. We used the lock-crabbing technique to reduce it: a thread can release the latch on a parent node once its child node is considered safe. A node is safe if it is not full on insertion, or more than half full on deletion, meaning it will not split or merge when updated. Since nodes are safe most of the time, we used optimistic crabbing to further reduce contention at the top levels of the tree: we always take read latches on the way down to the leaf, and only if the leaf turns out to be unsafe do we start over and take write latches.

We also had to handle locking issues caused by iterators. To avoid deadlock between a thread updating the tree and a thread scanning the leaves with an iterator, a thread that cannot acquire a latch waits briefly using exponential backoff, then aborts and retries. One subtle issue comes from the left and right sibling pointers that a B+ tree maintains for its leaves: updating a key in a leaf may require deleting the leaf afterwards and fixing its siblings' pointers. If we first updated the leaf and merged the necessary parent nodes, and only then failed to acquire the latches on the sibling nodes, there would be no way to abort and roll back. Therefore, when updating an unsafe leaf, we acquire the latches on both the leaf and its siblings up front, and abort and retry if any of them is unavailable.
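Roughly, the optimistic descent looks like the following sketch. The node type, latch type, and helper names here are ours for illustration only and do not match Terrier's actual B+ tree classes.

#include <shared_mutex>
#include <vector>

// Hypothetical node type for illustration; the real B+ tree classes differ.
struct Node {
  std::shared_mutex latch;
  bool is_leaf = false;
  bool safe_for_insert = true;  // stands in for "not full, so it will not split"
  std::vector<Node *> children;
  // Placeholder routing; a real node picks the child whose key range covers key.
  Node *ChildFor(int /*key*/) { return children.front(); }
};

// Optimistic crabbing: descend with read latches on inner nodes, releasing the
// parent as soon as the child latch is held, and write-latch only the leaf.
// If the leaf turns out to be unsafe (it may split), give up so the caller can
// restart the descent with pessimistic write-latch crabbing.
Node *OptimisticFindLeafForInsert(Node *root, int key) {
  Node *node = root;
  if (node->is_leaf) node->latch.lock();
  else node->latch.lock_shared();
  while (!node->is_leaf) {
    Node *child = node->ChildFor(key);
    if (child->is_leaf) child->latch.lock();  // write-latch the leaf itself
    else child->latch.lock_shared();          // read latches on inner nodes
    node->latch.unlock_shared();              // release parent once the child is latched
    node = child;
  }
  if (node->safe_for_insert) return node;  // safe: insert under the held write latch
  node->latch.unlock();
  return nullptr;  // unsafe leaf: restart, taking write latches all the way down
}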

Background

CMU's yet-to-be-named database (currently named Terrier) is a DBMS written in C++ under development by CMU's database group, led by Professor Andy Pavlo. Our final project for 15-721 Advanced Database Systems was to pick a piece of functionality to add to Terrier. Our group chose concurrent schema change.

Each table in a DBMS has a schema, which specifies the columns in the table and the constraints on each column (its type, its default value, and whether it is nullable). Sometimes we need to change the schema of a table, for example by adding or dropping a column or updating column constraints. Many DBMSs (e.g. PostgreSQL, SQLite, and RocksDB) block the table while they perform a schema change. This leaves the table unavailable for a considerable duration, which is undesirable for systems requiring high availability.

A schema change is invoked by an ALTER TABLE call. To support concurrent schema changes, we must allow multiple schema versions of a table to coexist in the database: during a schema change, operations (such as SCAN) preceding the ALTER TABLE should work on the older version of the table, while operations succeeding the ALTER TABLE should work on the newer version. In fact, to handle consecutive schema changes, we must support an arbitrary number of schema versions for each table.

Our design maintains a separate underlying physical table (called a datatable) for each schema version. On the layer above the datatables, Terrier uses a SqlTable layer to handle requests for each table in the DBMS, so we modified SqlTable to manage multiple datatables. Our key changes are in the SqlTable layer, though we also had to modify the whole pipeline (parser, binder, optimizer, planner, and execution layers) to add ALTER TABLE support to the DBMS.

SqlTable layer

Each SqlTable contains a list of datatables, one per schema version, and each datatable stores its corresponding version number. When the SqlTable receives a request, such as updating a key, the transaction first uses the B+ tree index on the SqlTable to find the storage location (called a tupleslot) of the key. We identify the datatable storing the tupleslot from the top 44 bits of the tupleslot, and read the schema version from that datatable. We call the version of the datatable storing the tupleslot the storage_version.
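As a rough sketch of how the storage_version is recovered from a tupleslot, assuming 1 MB-aligned blocks whose headers point back to their owning datatable (the type names below are illustrative, not Terrier's exact API):

#include <cstdint>

// Illustrative types only; Terrier's real TupleSlot, RawBlock, and DataTable
// classes differ in their exact layout and API.
struct DataTable;
struct RawBlock {
  DataTable *owner;  // back-pointer to the datatable that owns this block
};
struct DataTable {
  uint32_t layout_version;  // schema version stored with the datatable
};

// Assume blocks are 1 MB-aligned, so the low 20 bits of a tupleslot encode the
// offset within the block and the top 44 bits encode the block's address.
constexpr uint64_t kBlockOffsetBits = 20;
constexpr uint64_t kBlockMask = ~((uint64_t{1} << kBlockOffsetBits) - 1);

uint32_t StorageVersionOf(uint64_t tuple_slot) {
  auto *block = reinterpret_cast<RawBlock *>(tuple_slot & kBlockMask);
  return block->owner->layout_version;  // the storage_version of this tuple
}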

 

Each request to the SqlTable also comes with an argument, intended_version, which is the schema version of the SqlTable that should be seen by the current transaction. If storage_version and intended_version match, we just perform the operation on the corresponding datatable.

The tricky case is when storage_version < intended_version. This occurs when a transaction operating under a newer schema version accesses a tuple that has not yet been migrated out of the datatable of an older schema version. Different SqlTable operations deal with this discrepancy differently, as described below:

SqlTable::insert: There is no storage_version in this case. We directly insert into the datatable corresponding to intended_version.

SqlTable::delete: Always logically delete the tuple from the datatable it is currently stored in.

SqlTable::update: Perform tuple migration by logically deleting the tuple from the datatable corresponding to storage_version, then transforming the tuple to the new schema and inserting it into the datatable corresponding to intended_version (see the sketch after this list). We also return the new tupleslot to the execution layer, which updates the B+ tree index for the SqlTable. Tuple transformation is done by consulting the maps from logical to physical column ids of the two datatables.

SqlTable::select: We choose not to migrate on reads, so in this case we just read the tuple from the datatable corresponding to storage_version.

SqlTable::scan: Note that a tuple might physically exist in multiple datatables under a single SqlTable. However, for each transaction, only one copy of each tuple is visible across all the datatables (we guarantee this by always logically deleting a tuple before migrating it). Scan therefore works by scanning all the datatables fully, starting from the datatable with the largest version and going backwards, so that any tuples migrated during the scan are scanned only once.
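The sketch below illustrates the migrate-on-update path from the list above. It is a simplified model under our own naming, not Terrier's actual SqlTable code: real tuples are ProjectedRows in block-based storage, and the transformation consults the logical-to-physical column maps of the two datatables.

#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Illustrative stand-ins: here a tuple is a map from logical column name to
// value, and a "datatable" is a map from tupleslot to tuple.
using Tuple = std::map<std::string, std::string>;

struct VersionedTable {
  uint32_t version;                  // schema version of this datatable
  std::vector<std::string> columns;  // logical columns of this schema version
  std::map<uint64_t, Tuple> tuples;  // tupleslot -> tuple
  uint64_t next_slot = 0;

  uint64_t Insert(Tuple t) { tuples[next_slot] = std::move(t); return next_slot++; }
  void LogicalDelete(uint64_t slot) { tuples.erase(slot); }  // real code records an MVCC delete
};

// Update with lazy migration: if the tuple still lives in an older datatable,
// transform it to the intended schema, logically delete it there, and insert
// it into the newer datatable. The returned (possibly new) tupleslot lets the
// caller update the B+ tree index.
uint64_t Update(VersionedTable &storage, VersionedTable &intended,
                uint64_t slot, const Tuple &delta) {
  Tuple tuple = storage.tuples.at(slot);
  for (const auto &kv : delta) tuple[kv.first] = kv.second;
  if (storage.version == intended.version) {
    storage.tuples[slot] = tuple;  // same version: update in place
    return slot;                   // slot unchanged, no index maintenance needed
  }
  // Tuple transformation: keep only the columns of the intended schema
  // (newly added columns would be filled with their defaults, omitted here).
  Tuple migrated;
  for (const auto &col : intended.columns) {
    auto it = tuple.find(col);
    if (it != tuple.end()) migrated[col] = it->second;
  }
  storage.LogicalDelete(slot);
  return intended.Insert(std::move(migrated));  // new slot for the index update
}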

Catalog layer

We implemented a DDL executor for the ALTER TABLE command that updates the catalog when a table is altered. Schema information for each table (a pointer to the schema) is stored in a column of the catalog tables, so after a new schema is generated by evaluating the ALTER TABLE command, we update the schema pointer column for the table being altered. We also added a version column alongside the schema pointer column for each table. Since the catalog tables are transactional, MVCC disallows write-write conflicts on them. As a result, concurrent conflicting schema updates are caught here: one of the updating transactions aborts when trying to modify the catalog table.

Each transaction that accesses the SqlTable needs to know which layout version it should be looking at. We leverage the transactional catalog tables to resolve the correct visible layout version for an in-flight transaction: the transaction reads the catalog tables to retrieve the most recent visible layout version and uses it to access the SqlTable.
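Conceptually, the catalog interaction looks like the following sketch. The struct and function names are ours for illustration; the real system expresses this as updates to transactional catalog tables rather than a C++ struct.

#include <cstdint>

struct Schema;  // opaque here

struct TableCatalogEntry {
  const Schema *schema = nullptr;  // pointer to the table's current schema
  uint32_t layout_version = 0;     // layout version transactions should use
};

// Reader side: an in-flight transaction looks at its MVCC-visible snapshot of
// the catalog row to learn the intended_version it passes to the SqlTable.
uint32_t IntendedVersionFor(const TableCatalogEntry &visible_row) {
  return visible_row.layout_version;
}

// Writer side (ALTER TABLE): install the new schema pointer and bump the
// version in the same catalog row. Because the catalog tables are MVCC
// transactional, two concurrent ALTER TABLEs on the same table cause a
// write-write conflict on this row, and one of them aborts; this is where
// conflicting concurrent schema changes are caught.
void InstallNewSchema(TableCatalogEntry &row, const Schema *new_schema) {
  row.schema = new_schema;
  row.layout_version += 1;
}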

[Figure: database structure]

ALTER TABLE command

ALTER TABLE commands are parsed, bound, and optimized just like other commands to generate a plan node tree. We had to read and understand Terrier's whole pipeline, then add ALTER TABLE functionality throughout it, including the parser, binder, optimizer, planner, and execution layers. For now, we only support schema updates that do not require scanning the underlying SqlTable (such as adding or dropping columns). Such commands only require updating the catalog tables and the SqlTable's metadata, so a DDL executor is implemented to execute the generated plan.
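To make the metadata-only nature of these commands concrete, here is a sketch of what evaluating ALTER TABLE ... ADD COLUMN or DROP COLUMN amounts to under our assumptions. The Schema and Column types below are simplified stand-ins, not Terrier's real classes.

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Simplified schema model; the real Schema/Column carry types, OIDs, default
// values, and nullability metadata that are omitted here.
struct Column {
  std::string name;
  std::string type;
  bool nullable = true;
};
struct Schema {
  std::vector<Column> columns;
};

// Metadata-only ALTER TABLE: build the new schema from the old one. In the
// real system the DDL executor then writes the new schema pointer and bumped
// layout version into the catalog and registers a new datatable in the
// SqlTable; no existing tuples are touched.
Schema AddColumn(const Schema &old_schema, Column new_column) {
  Schema new_schema = old_schema;
  new_schema.columns.push_back(std::move(new_column));
  return new_schema;
}

Schema DropColumn(const Schema &old_schema, const std::string &name) {
  Schema new_schema = old_schema;
  auto &cols = new_schema.columns;
  cols.erase(std::remove_if(cols.begin(), cols.end(),
                            [&](const Column &c) { return c.name == name; }),
             cols.end());
  return new_schema;
}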

Testing and Benchmarks

We wrote single-threaded and multi-threaded unit tests at all layers, plus libraries to aid testing. We also wrote an update_schema_benchmark that performs consecutive schema updates along with concurrent normal SqlTable operations, both to test the correctness of our changes to the SqlTable layer and to measure throughput under concurrent schema changes (for different workloads: read-write-delete, insert-heavy, etc.). Our benchmarking results show that adding schema versions to the SqlTable layer has a negligible effect on the runtime of SqlTable queries, and that lazy schema changes done this way have little impact on the speed of concurrent normal transactions.

Design Decisions

1. How to track the list of datatables within SqlTable?

Within SqlTable, we can either use an ordered concurrent map from schema version number to datatable, or use a vector of datatables (index i of the vector corresponds to the datatable with version number i, and versions always start from 0).

We decided to use a fixed-size vector, where the datatable with version i is stored at the i-th position. The vector's size is a constant MAX_NUM_VERSIONS, the maximum allowed number of versions for a single table. A fixed-size vector has low synchronization overhead: there is no need to lock the vector, since different versions access different indices. We also need no special case for tables with a single version, because accessing an index in the vector is as fast as reading a pointer.

One potential issue with a fixed-size vector is dealing with tables that accumulate more than MAX_NUM_VERSIONS versions. This should be extremely rare if we set MAX_NUM_VERSIONS reasonably high, since schema changes generally do not happen very often on a table. If we choose to support garbage collection of stale versions, we can recycle the vector slots holding stale versions and use the vector in a cyclic fashion.
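A minimal sketch of this container, with the class name and the MAX_NUM_VERSIONS value chosen only for illustration:

#include <array>
#include <atomic>
#include <cstdint>

struct DataTable;  // one physical table per schema version (opaque here)

// Illustrative value; the real MAX_NUM_VERSIONS is a tunable constant.
constexpr uint32_t MAX_NUM_VERSIONS = 64;

// Fixed-size table of versions: slot i holds the datatable whose schema
// version is i. Readers of different versions touch different slots, so the
// container itself needs no latch; each slot is published atomically when a
// schema change creates its datatable. The modulo allows slots of GC'd stale
// versions to be recycled cyclically, as discussed above.
class VersionVector {
 public:
  DataTable *Get(uint32_t version) const {
    return slots_[version % MAX_NUM_VERSIONS].load(std::memory_order_acquire);
  }
  void Install(uint32_t version, DataTable *table) {
    slots_[version % MAX_NUM_VERSIONS].store(table, std::memory_order_release);
  }

 private:
  std::array<std::atomic<DataTable *>, MAX_NUM_VERSIONS> slots_{};
};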

2. Should we do migration on reads?

Currently, we only migrate tuples on SqlTable::update, when the tuple is stored in a datatable older than the intended version. An alternative policy is to also migrate on reads: when a read (SqlTable::select or SqlTable::scan) encounters a tuple stored in an outdated datatable, we migrate it to the datatable of the intended version as part of the read. The benefit is that if we read a tuple from an outdated datatable many times, transforming it to the newest schema on every read can be costly, whereas with migration on reads we do the tuple transformation only once. However, migration on reads potentially turns every read into a write, which would need to take write locks and cause lock contention. We decided on lazy migration (no migration on reads), which works well with the garbage collection (GC) policy we chose below.

3. When to perform background migration and GC of old versions?

When we are sure that no transaction will ever again access the schema version represented by an old datatable (inferred from the low watermark of transaction timestamps), we can safely GC that outdated datatable. However, since we do lazy migration, if most tuples in a datatable are not updated after a schema change, they remain in the outdated datatable and continue to be read from it. In that case, GC'ing the outdated datatable can be quite expensive, since we must first migrate all of its tuples to newer datatables. We reason that doing background migration conservatively suits read-heavy workloads, while doing it aggressively suits update-heavy workloads interleaved with schema changes.

Thus we decided to use a threshold to decide when to do background migration: once the number of tuples remaining in an outdated datatable drops below the threshold, we start a background thread that migrates its tuples and GCs the datatable when it is done (sketched below).
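A sketch of the trigger we have in mind; the names, the version-level watermark simplification, and the threshold value are our own illustration, not part of Terrier.

#include <cstdint>

// Remaining-tuple count below which background migration + GC is triggered.
constexpr uint64_t kMigrationThreshold = 10000;

// An outdated datatable becomes a candidate once (a) its schema version can no
// longer be seen by any active transaction (simplified here to comparing
// against a low-watermark version derived from transaction timestamps) and
// (b) few enough tuples remain that migrating the stragglers is cheap.
bool ShouldMigrateAndGc(uint32_t table_version, uint32_t low_watermark_version,
                        uint64_t remaining_tuples) {
  if (table_version >= low_watermark_version) return false;  // still potentially visible
  return remaining_tuples < kMigrationThreshold;             // cheap enough to migrate + GC
}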