Concurrency in RustDb

Introduction

This blog post discusses the way RustDb ( a high-performance database written entirely in Rust ) handles multiple concurrent queries.

Before talking about concurrency, it may help to state plainly what a database is. A database is essentially a substantial store of information, facts, or knowledge. Something like a large library. It could be holding books, weather data, astronomical data, historical data, business data.

The main function of a database is to answer questions, or queries from clients. A database can be large, and queries may need to access a substantial portion of the data. On the other hand, new information may be acquired and need to be added to the database.

Concurrency

The question arises, what should happen if multiple clients want to query and update the database. What should happen if a long query is in progress, but new information needs to be added to the database?

Possible solution – delay the update

The update is delayed until the long query has finished.

This may not be satisfactory. As a database can be large, it could be minutes before an update could be processed, and if new queries are continually arriving, an update could be deferred indefinitely.

Possible solution – locking

As the long query progresses, objects in the database ( tables, rows, indexes ) are locked. This is I believe the widely adopted approach, for example Microsoft SQL server uses locking. However, this is still not very satisfactory, if we are lucky the update can still happen, but if not we still have the long wait. Or perhaps the long query can be cancelled and retried later (there are also potential deadlock situations). If updates happen say every 10 seconds, and the long query takes several minutes, then the long query may have to be retried many times, and could end up taking hours instead of minutes.

Possible solution – use a copy of the database

The long query is performed on a copy of the database. This is essentially the solution adopted by RustDb. It sounds expensive and impractical : what if the database is many gigabytes of data, taking a copy of the entire database could take a long time, and in the mean-time no updates can happen. This leads use to the solution adopted by RustDb.

Possible solution – use a virtual copy of the database

The long query is performed on a “virtual copy” of the database. The database is divided into logical pages, and when an update is to be performed, before storing an updated page, the software takes a copy of the old page.

When a page is fetched, we specify not just the page number, but also which “virtual copy” of the database is required. When an update concludes, or when a read-only query finishes, it may be possible to remove some of the copied pages.

A large update will prevent other updates from running. However, due to the incremental nature of information, it should be possible to divide updates into shorter chunks. In such cases, queries may need to take into account incomplete data. For example, a report that summarises sales data should not include data for the current sales period until all data for that period has been inserted into the database, or alternatively note it as incomplete.

Implementation

Keeping track of the old copies is straight-forward in Rust. The data structure is shown below:

struct PageData {
    /// Current data for the page.
    current: Option<Data>,
    /// Historic data for the page.
    history: BTreeMap<u64, Data>,
}

Here the current data for the page is an option, because the value may be stored in backing storage ( Data is simply defined as Arc<Vec<u8>> ).

The page history is defined as a map from the database time to the data. The main complication is deciding when the history can be trimmed, due to there being no possibility that a reader could access the historic page. The central data structure ( slightly simplified ) is this :

pub struct Stash {
    /// Write time - number of writes.
    time: u64,
    /// Page number -> page info.
    pages: HashMap<u64, PageInfoPtr>,
    /// Time -> reader count. Number of readers for given time.
    rdrs: BTreeMap<u64, usize>,
    /// Time -> set of page numbers. Page copies held for given time.
    vers: BTreeMap<u64, HashSet<u64>>,  
}

Here “time” is the number of write transactions that have completed. “pages” is used to lookup the page information, “rdrs” counts the number of outstanding readers for a given time, and “vers” has for each time a set of pages copied at that time.

When a read or write transaction completes, “history” pages may be freed. The first step is to establish the range of times for which there is no longer any reader. Having done this, “vers” is used to check each potentially affected page to check whether a copy can be freed. The Rust code is shown below:

    fn trim(&mut self, time: u64) {
        let (s, r) = (self.start(time), self.retain(time));
        if s != r {
            let mut empty = Vec::<u64>::new();
            for (t, pl) in self.vers.range_mut(s..r) {
                pl.retain(|pnum| {
                    let p = self.pages.get(pnum).unwrap();
                    let mut p = p.d.lock().unwrap();
                    p.trim(*t, s)
                });
                if pl.is_empty() {
                    empty.push(*t);
                }
            }
            for t in empty {
                self.vers.remove(&t);
            }
        }
    }

Here the function parameter “time” is either the time for which the reader count has just gone to zero, or the time of the write transaction which has just finished. The slight complication is keeping track of the empty sets in vers, and subsequently removing them, using the “empty” list.

The complete code can be found here:
https://docs.rs/rustdb/2.14.1/src/rustdb/pstore.rs.html

Note: after implementing this, and writing the post, I discovered this article on wikipedia, which pretty much describes the same thing: https://en.wikipedia.org/wiki/Multiversion_concurrency_control

[ This is the first of a series of articles I am planning on writing about RustDb ]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s