[Schevo-devel] concurrency and copy-on-write snapshots

Matthew Scott mscott at springfieldtech.com
Thu May 11 16:52:48 EDT 2006


I've been thinking a lot about Schevo's "concurrency story" over the  
last couple of days, and it has been on the project's "back burner"  
list for many a month now.

I appreciate anyone who takes the time to skim/read this and remark  
on any pitfalls or suggestions I haven't thought of.

Overview
========

We can improve Schevo's flexibility with regards to thread  
concurrency if we adopt a copy-on-write "snapshot" approach, with as- 
yet-unknown speed decreases as a side-effect.  The term frequently  
used for this is MVCC (multi-version concurrency control).

The approach could be made to be compatible with the existing Schevo  
database structures, and would be optional upon opening a database  
file.  A database opened and modified in snapshot mode could be  
closed, and re-opened in "single" mode to regain performance lost  
when using "snapshot" mode.

Timeline
========

No timeline has been set for this, although it will probably be  
implemented in some fashion within the next few months.

Status quo
==========

Right now, Schevo has support for "MROW" (multiple-reader, one- 
writer) locking on one database structure.  This supports the  
following type of timeline:

          thread1          thread2          thread3         thread4
        -----------      -----------      -----------     -----------
         _________
   1     read lock                                        __________
   2         |            _________                       read lock
   3         |            read lock                           |
             |                |            _________          |
   4         |                |            read lock       ...v...
   5     ____v____            |                |          write lock
                              |                |           request
   6     read lock        ____v____            |              .
         request                               |              .
             .                             ____v____      __________
   7         .                                            write lock
             .                                                |
   8         .            read lock                           |
   9         .            request          read lock          |
             .                .            request            |
             .                .                .              |
             .                .                .              |
             .                .                .              |
             .                .                .              |
             .                .                .              |
         ....v....        ....v....        ....v....      ____v_____
   10    read lock        read lock        read lock
             |                |                |
             v                v                v

1: read lock obtained by thread1.

2: read lock obtained by thread4.

3: read lock obtained by thread2.

4: read lock obtained by thread3.

5: read lock released by thread1.
5: write lock requested by thread4, must wait for read locks to be  
released.

6: read lock requested by thread1, must wait for write lock to be  
released.
6: read lock released by thread2.

7: read lock released by thread3.
7: write lock obtained by thread4.

8, 9: read lock requested by thread2 and thread3, must wait for write  
lock to be released.

10: write lock released by thread4.
10: read locks obtained by thread1, thread2, thread3.

Problems
--------

This type of locking is undesirable in a few ways, including the  
following:

* They are detrimental to concurrency if proper read/write locks are  
being used.

* Database consistency is out of the question if you decide to ignore  
convention and read from the database while executing a transaction.


Snapshots
=========

What would be a more ideal timeline would be the following:


        thread1          thread2          thread3         thread4
      -----------      -----------      -----------     -----------
   1   ________A
   2     read                                            ________A
   3       |            ________A                        read/write
           |              read                               |
   4       |                |            ________A           |
           |                |              read              |
   5   ____v____            |                |               |
                            |                |               |
   6                    ____v____            |               |
                                             |               |
   7                                     ____v____           |
                                                             |
   8   ________A                                             |
   9     read           read/write                           |
           |            requested                            |
           |                .                                |
   10      |            ____v____B                       ____v___B
   11      |                |            ________B
           |                |              read
           |                |                |
   12  ____v____            |                |
   13                   ___xxx___B           |
                                             v

1: thread1 creates read snapshot with database state A

2: thread4 creates read/write snapshot with state A

3: thread2 creates read snapshot with state A

4: thread3 creates read snapshot with state A

5: thread1 destroys read snapshot

6: thread2 destroys read snapshot

7: thread3 destroys read snapshot

8: thread1 creates read snapshot with state A

9: thread2 requests read/write, must wait for thread4 to finish read/ 
write

10: thread4 finishes read/write successfully, database state now B
10: thread2 creates read/write snapshot with state B

11: thread3 creates read snapshot with state B

12: thread1 destroys read snapshot

13: thread2 transaction fails, database state still B

Advantages
----------

The advantages of snapshots are corollaries to the disadvantages of  
MROW locking:

* They make read and write threads asynchronous to eachother.

* Database consistency is guaranteed within a snapshot.

Disadvantages
-------------

* Only one writer is allowed at once, although there may be  
techniques that can piggy-back on the snapshot system to allow  
multiple concurrent writers.

* Write performance degrades when snapshot mode is used.

* Database growth speed between pack operations increases.


Implementation
==============

What I propose is that the Schevo database engine be altered to work  
natively in a "snapshot" based mode, using copy-on-write techniques.

Single thread
-------------

Let us take the simplest case of a single-threaded app that executes  
a transaction.  Here's what would happen in a non-snapshot mode  
(which, incidentally, could still be left in as an option when high- 
performance single-threaded operation is desired):

- Changes are written directly to the database structure in the top- 
level SCHEVO key of the Durus file.

- Upon success, Durus commit is called.

- Upon failure, Durus rollback is called.

Here's what would happen in snapshot mode:

- A reference to the current consistent database state (I'll refer to  
this as s0) is created in the Durus file.  I'll refer to it as s1.

- Changes are made directly to s1.  When each change occurs, any  
structures those changes are contained in are created anew within s1,  
so that s0 contains no references to them.  All unchanged information  
is still in the form of references to data in s0.

- Upon success, the s0 state is changed to refer to s1, the s1  
snapshot is removed, and Durus commit is called.  The old s0 no  
longer exists at this point.

- Upon failure, the top-level SCHEVO key is still s0, the s1 snapshot  
is removed, and Durus commit is called.

Multiple threads
----------------

The implementation gets more interesting when we mix read snapshots  
and read/write snapshots.  I'll use the second timeline above as a  
template for describing how this would generally work.

When the database starts out, there is only one snapshot, s0-1.

1: thread1 creates read snapshot with database state A
    - Database creates s1, referring to s0-1.

2: thread4 creates read/write snapshot with state A
    - thread4 obtains database write lock.
    - Database creates s2, referring to s0-1.
    - As changes are made, structures in s2 are converted from  
references to structures in s0 to distincly separate structures,  
referred to only in s2.

3: thread2 creates read snapshot with state A
    - Database creates s3, referring to s0-1.

4: thread3 creates read snapshot with state A
    - Database creates s4, referring to s0-1.

5: thread1 destroys read snapshot
    - Database destroys s1.

6: thread2 destroys read snapshot
    - Database destroys s3.

7: thread3 destroys read snapshot
    - Database destroys s4.

8: thread1 creates read snapshot with state A
    - Database creates s5, referring to s0-1.

9: thread2 requests read/write, must wait for thread4 to finish read/ 
write
    - thread2 waits for database write lock.

10: thread4 finishes read/write, database state now B
10: thread2 creates read/write snapshot with state B
     - thread4 releases database write lock.
       - Database makes s0 key refer to s2. (referred to as s0-2)
       - Database destroys s2.
     - thread2 receives write lock.
       - Database creates s6, referring to s0-2.

11: thread3 creates read snapshot with state B
     - Database creates s7, referring to s0-2.

12: thread1 destroys read snapshot
     - Database destroys s5.

13: thread2 transaction fails, database state still B
     - Database destroys s6
     - No change is made to s0 value, still refers to s0-2.

API
---

The proposed API for this would be the following, idealized with the  
upcoming 'with' keyword in Python 2.5::

     with db.reader() as db:
         # ... read operations go here ...
         db.execute(...)  # fails because we're not allowed to write

     with db.writer() as db:
         # ... read operations, etc. ...
         db.execute(...)  # works fine

To support Python 2.4 syntax, it would look like this::

     db = db.reader()
     try:
         # ... read operations go here ...
     finally:
         db = db.original()

     db = db.writer()
     try:
         # ... read operations, etc. ...
         db.execute(...)
     finally:
         db = db.original()

The inner value of 'db' is not the database itself.  Rather, it is a  
context manager wrapper around the database that pins the operations  
to the correct snapshot.

The outer value of 'db' is not necessarily the database itself  
either, although in many cases it is.  Where that wouldn't occur is  
if you nested context managers::

     with db.reader() as db:
         # ... do something ...
         # ... call something that ultimately does this: ...
         with db.reader() as db:
             # ... do stuff ...

In this case, let's assume that the outer 'db' is the database  
itself.  So, the first inner 'db' is a read snapshot context  
manager.  The second inner 'db' is that same context manager, and a  
reference counter is used to keep track of nesting.

Writers nested in readers is not allowed.  This code would raise an  
exception::

     with db.reader() as db:
         with db.writer() as db:     # <-- raises exception
             # ...

Readers nested in writers are allowed, and the context manager would  
maintain a reference counter much as in the first nested example  
above.  It should also maintain that you cannot execute transactions  
within the inner reader context manager, even though it is actually  
the same as the outer writer context manager.  This would be legal  
code::

     with db.writer() as db:
         # ... write stuff
         # ... call another function that does the following:
         with db.reader() as db:
             # ... read stuff
         # ... write stuff

This would not be allowed::

     with db.writer() as db:
         # ... write stuff
         # ... call another function that does the following:
         with db.reader() as db:
             # ... read stuff
             # ... write stuff <-- error

Details, details
----------------

So, given two snapshots, s0 and s1, that both refer to the same top- 
level Schevo database structure, how do we make a change to s1 such  
that everything that s0 refers to remains unmodified?

That is where I'll stop this, since when we answer that question, we  
might as well write the code to implement this idea.

It might get tricky, but I think we could pull it off with some  
effort.  Here's a rough draft of what it might take to update the  
field values of an entity, assuming that 'values' is already in  
{field-id: stored-value} form::

     def update_field_values(root, s0_id, s1_id, extent_id, oid,  
values):
         s0 = root[s0_id]
         s1 = root[s1_id]
         # Copy top-level structure if same.
         if s0 is s1:
             s1 = root[s1_id] = s1.copy()
         # Copy extents structure if same.
         s0_extents = s0['extents']
         s1_extents = s1['extents']
         if s0_extents is s1_extents:
             s1_extents = s1['extents'] = s1_extents.copy()
         # Copy extent structure if same.
         s0_extent = s0_extents[extent_id]
         s1_extent = s1_extents[extent_id]
         if s0_extent is s1_extent:
             s1_extent = s1_extents[extent_id] = s1_extent.copy()
         # Copy entities structure if same.
         s0_entities = s0_extent['entities']
         s1_entities = s1_extent['entities']
         if s0_entities is s1_entities:
             s1_entities = s1_extent['entities'] = copy_btree 
(s1_entities)
         # Copy entity structure if same.
         s0_entity = s0_entities.get(oid, None)   # (may not exist in  
s0)
         s1_entity = s1_entities[oid]
         if s0_entity is s1_entity:
             s1_entity = s1_entities[oid] = s1_entity.copy()
         # Insert new field values structure.
         s1_entity['fields'] = PersistentDict(values)

     def copy_btree(tree):
         # XXX: This could be expensive.  Maybe a copy-on-write
         # Btree variation could be written that has a built-in copy()
         # method.
         tree2 = BTree()
         for key, value in tree.iteritems():
             tree2[key] = value
         return tree2


Devil's advocate
================

Of course, we could take on a different challenge, and start using a  
backend other than Durus that already includes snapshotting  
capability.  For instance, if I recall correctly, sqlite would be  
useful for this (divmod's axiom uses it), and if data is structured  
right, might even boost Schevo's performance.

However, I personally feel this would detract from one of the primary  
goals of Schevo, which is to provide a first-class database engine in  
pure Python (the C extension in Durus is optional, after all).  It  
would also likely take more time to implement than staying with  
Durus, and would provide a migration challenge.

Another option would be to see if ZODB could help us with this, since  
it has MVCC.  According to http://www.zope.org/Products/ZODB3.6  
though, C extensions are a requirement, not an option.  Additionally,  
this would require a migration process for already-deployed databases.

Other "devil's advocate" ideas are welcome and appreciated in this  
discussion.


The bottom of the message
=========================

Thank you for reading this far. :-)


--
Matthew Scott
mscott at springfieldtech.com






More information about the Schevo-devel mailing list