[Schevo-devel] Re: concurrency and copy-on-write snapshots

Ilias Lazaridis ilias at lazaridis.com
Tue May 23 17:14:45 EDT 2006


Matthew Scott wrote:
> I've been thinking a lot about Schevo's "concurrency story" over the 
> last couple of days, and it has been on the project's "back burner" list 
> for many a month now.
> 
> I appreciate anyone who takes the time to skim/read this and remark on 
> any pitfalls or suggestions I haven't thought of.

I've just overflown this and have some general remarks:

This problem should be demonstrated by standard diagramms (e.g. UML), to 
allow a review from experts.

The stated problem should be a general one, and most possibly there is 
some documentation available.

> Overview
> ========
> 
> We can improve Schevo's flexibility with regards to thread concurrency 
> if we adopt a copy-on-write "snapshot" approach, with as-yet-unknown 
> speed decreases as a side-effect.  The term frequently used for this is 
> MVCC (multi-version concurrency control).
> 
> The approach could be made to be compatible with the existing Schevo 
> database structures, and would be optional upon opening a database 
> file.  A database opened and modified in snapshot mode could be closed, 
> and re-opened in "single" mode to regain performance lost when using 
> "snapshot" mode.
> 
> Timeline
> ========
> 
> No timeline has been set for this, although it will probably be 
> implemented in some fashion within the next few months.
> 
> Status quo
> ==========
> 
> Right now, Schevo has support for "MROW" (multiple-reader, one-writer) 
> locking on one database structure.  This supports the following type of 
> timeline:
> 
>          thread1          thread2          thread3         thread4
>        -----------      -----------      -----------     -----------
>         _________
>   1     read lock                                        __________
>   2         |            _________                       read lock
>   3         |            read lock                           |
>             |                |            _________          |
>   4         |                |            read lock       ...v...
>   5     ____v____            |                |          write lock
>                              |                |           request
>   6     read lock        ____v____            |              .
>         request                               |              .
>             .                             ____v____      __________
>   7         .                                            write lock
>             .                                                |
>   8         .            read lock                           |
>   9         .            request          read lock          |
>             .                .            request            |
>             .                .                .              |
>             .                .                .              |
>             .                .                .              |
>             .                .                .              |
>             .                .                .              |
>         ....v....        ....v....        ....v....      ____v_____
>   10    read lock        read lock        read lock
>             |                |                |
>             v                v                v
> 
> 1: read lock obtained by thread1.
> 
> 2: read lock obtained by thread4.
> 
> 3: read lock obtained by thread2.
> 
> 4: read lock obtained by thread3.
> 
> 5: read lock released by thread1.
> 5: write lock requested by thread4, must wait for read locks to be 
> released.
> 
> 6: read lock requested by thread1, must wait for write lock to be released.
> 6: read lock released by thread2.
> 
> 7: read lock released by thread3.
> 7: write lock obtained by thread4.
> 
> 8, 9: read lock requested by thread2 and thread3, must wait for write 
> lock to be released.
> 
> 10: write lock released by thread4.
> 10: read locks obtained by thread1, thread2, thread3.
> 
> Problems
> --------
> 
> This type of locking is undesirable in a few ways, including the following:
> 
> * They are detrimental to concurrency if proper read/write locks are 
> being used.
> 
> * Database consistency is out of the question if you decide to ignore 
> convention and read from the database while executing a transaction.
> 
> 
> Snapshots
> =========
> 
> What would be a more ideal timeline would be the following:
> 
> 
>        thread1          thread2          thread3         thread4
>      -----------      -----------      -----------     -----------
>   1   ________A
>   2     read                                            ________A
>   3       |            ________A                        read/write
>           |              read                               |
>   4       |                |            ________A           |
>           |                |              read              |
>   5   ____v____            |                |               |
>                            |                |               |
>   6                    ____v____            |               |
>                                             |               |
>   7                                     ____v____           |
>                                                             |
>   8   ________A                                             |
>   9     read           read/write                           |
>           |            requested                            |
>           |                .                                |
>   10      |            ____v____B                       ____v___B
>   11      |                |            ________B
>           |                |              read
>           |                |                |
>   12  ____v____            |                |
>   13                   ___xxx___B           |
>                                             v
> 
> 1: thread1 creates read snapshot with database state A
> 
> 2: thread4 creates read/write snapshot with state A
> 
> 3: thread2 creates read snapshot with state A
> 
> 4: thread3 creates read snapshot with state A
> 
> 5: thread1 destroys read snapshot
> 
> 6: thread2 destroys read snapshot
> 
> 7: thread3 destroys read snapshot
> 
> 8: thread1 creates read snapshot with state A
> 
> 9: thread2 requests read/write, must wait for thread4 to finish read/write
> 
> 10: thread4 finishes read/write successfully, database state now B
> 10: thread2 creates read/write snapshot with state B
> 
> 11: thread3 creates read snapshot with state B
> 
> 12: thread1 destroys read snapshot
> 
> 13: thread2 transaction fails, database state still B
> 
> Advantages
> ----------
> 
> The advantages of snapshots are corollaries to the disadvantages of MROW 
> locking:
> 
> * They make read and write threads asynchronous to eachother.
> 
> * Database consistency is guaranteed within a snapshot.
> 
> Disadvantages
> -------------
> 
> * Only one writer is allowed at once, although there may be techniques 
> that can piggy-back on the snapshot system to allow multiple concurrent 
> writers.
> 
> * Write performance degrades when snapshot mode is used.
> 
> * Database growth speed between pack operations increases.
> 
> 
> Implementation
> ==============
> 
> What I propose is that the Schevo database engine be altered to work 
> natively in a "snapshot" based mode, using copy-on-write techniques.
> 
> Single thread
> -------------
> 
> Let us take the simplest case of a single-threaded app that executes a 
> transaction.  Here's what would happen in a non-snapshot mode (which, 
> incidentally, could still be left in as an option when high-performance 
> single-threaded operation is desired):
> 
> - Changes are written directly to the database structure in the 
> top-level SCHEVO key of the Durus file.
> 
> - Upon success, Durus commit is called.
> 
> - Upon failure, Durus rollback is called.
> 
> Here's what would happen in snapshot mode:
> 
> - A reference to the current consistent database state (I'll refer to 
> this as s0) is created in the Durus file.  I'll refer to it as s1.
> 
> - Changes are made directly to s1.  When each change occurs, any 
> structures those changes are contained in are created anew within s1, so 
> that s0 contains no references to them.  All unchanged information is 
> still in the form of references to data in s0.
> 
> - Upon success, the s0 state is changed to refer to s1, the s1 snapshot 
> is removed, and Durus commit is called.  The old s0 no longer exists at 
> this point.
> 
> - Upon failure, the top-level SCHEVO key is still s0, the s1 snapshot is 
> removed, and Durus commit is called.
> 
> Multiple threads
> ----------------
> 
> The implementation gets more interesting when we mix read snapshots and 
> read/write snapshots.  I'll use the second timeline above as a template 
> for describing how this would generally work.
> 
> When the database starts out, there is only one snapshot, s0-1.
> 
> 1: thread1 creates read snapshot with database state A
>    - Database creates s1, referring to s0-1.
> 
> 2: thread4 creates read/write snapshot with state A
>    - thread4 obtains database write lock.
>    - Database creates s2, referring to s0-1.
>    - As changes are made, structures in s2 are converted from references 
> to structures in s0 to distincly separate structures, referred to only 
> in s2.
> 
> 3: thread2 creates read snapshot with state A
>    - Database creates s3, referring to s0-1.
> 
> 4: thread3 creates read snapshot with state A
>    - Database creates s4, referring to s0-1.
> 
> 5: thread1 destroys read snapshot
>    - Database destroys s1.
> 
> 6: thread2 destroys read snapshot
>    - Database destroys s3.
> 
> 7: thread3 destroys read snapshot
>    - Database destroys s4.
> 
> 8: thread1 creates read snapshot with state A
>    - Database creates s5, referring to s0-1.
> 
> 9: thread2 requests read/write, must wait for thread4 to finish read/write
>    - thread2 waits for database write lock.
> 
> 10: thread4 finishes read/write, database state now B
> 10: thread2 creates read/write snapshot with state B
>     - thread4 releases database write lock.
>       - Database makes s0 key refer to s2. (referred to as s0-2)
>       - Database destroys s2.
>     - thread2 receives write lock.
>       - Database creates s6, referring to s0-2.
> 
> 11: thread3 creates read snapshot with state B
>     - Database creates s7, referring to s0-2.
> 
> 12: thread1 destroys read snapshot
>     - Database destroys s5.
> 
> 13: thread2 transaction fails, database state still B
>     - Database destroys s6
>     - No change is made to s0 value, still refers to s0-2.
> 
> API
> ---
> 
> The proposed API for this would be the following, idealized with the 
> upcoming 'with' keyword in Python 2.5::
> 
>     with db.reader() as db:
>         # ... read operations go here ...
>         db.execute(...)  # fails because we're not allowed to write
> 
>     with db.writer() as db:
>         # ... read operations, etc. ...
>         db.execute(...)  # works fine
> 
> To support Python 2.4 syntax, it would look like this::
> 
>     db = db.reader()
>     try:
>         # ... read operations go here ...
>     finally:
>         db = db.original()
> 
>     db = db.writer()
>     try:
>         # ... read operations, etc. ...
>         db.execute(...)
>     finally:
>         db = db.original()
> 
> The inner value of 'db' is not the database itself.  Rather, it is a 
> context manager wrapper around the database that pins the operations to 
> the correct snapshot.
> 
> The outer value of 'db' is not necessarily the database itself either, 
> although in many cases it is.  Where that wouldn't occur is if you 
> nested context managers::
> 
>     with db.reader() as db:
>         # ... do something ...
>         # ... call something that ultimately does this: ...
>         with db.reader() as db:
>             # ... do stuff ...
> 
> In this case, let's assume that the outer 'db' is the database itself.  
> So, the first inner 'db' is a read snapshot context manager.  The second 
> inner 'db' is that same context manager, and a reference counter is used 
> to keep track of nesting.
> 
> Writers nested in readers is not allowed.  This code would raise an 
> exception::
> 
>     with db.reader() as db:
>         with db.writer() as db:     # <-- raises exception
>             # ...
> 
> Readers nested in writers are allowed, and the context manager would 
> maintain a reference counter much as in the first nested example above.  
> It should also maintain that you cannot execute transactions within the 
> inner reader context manager, even though it is actually the same as the 
> outer writer context manager.  This would be legal code::
> 
>     with db.writer() as db:
>         # ... write stuff
>         # ... call another function that does the following:
>         with db.reader() as db:
>             # ... read stuff
>         # ... write stuff
> 
> This would not be allowed::
> 
>     with db.writer() as db:
>         # ... write stuff
>         # ... call another function that does the following:
>         with db.reader() as db:
>             # ... read stuff
>             # ... write stuff <-- error
> 
> Details, details
> ----------------
> 
> So, given two snapshots, s0 and s1, that both refer to the same 
> top-level Schevo database structure, how do we make a change to s1 such 
> that everything that s0 refers to remains unmodified?
> 
> That is where I'll stop this, since when we answer that question, we 
> might as well write the code to implement this idea.
> 
> It might get tricky, but I think we could pull it off with some effort.  
> Here's a rough draft of what it might take to update the field values of 
> an entity, assuming that 'values' is already in {field-id: stored-value} 
> form::
> 
>     def update_field_values(root, s0_id, s1_id, extent_id, oid, values):
>         s0 = root[s0_id]
>         s1 = root[s1_id]
>         # Copy top-level structure if same.
>         if s0 is s1:
>             s1 = root[s1_id] = s1.copy()
>         # Copy extents structure if same.
>         s0_extents = s0['extents']
>         s1_extents = s1['extents']
>         if s0_extents is s1_extents:
>             s1_extents = s1['extents'] = s1_extents.copy()
>         # Copy extent structure if same.
>         s0_extent = s0_extents[extent_id]
>         s1_extent = s1_extents[extent_id]
>         if s0_extent is s1_extent:
>             s1_extent = s1_extents[extent_id] = s1_extent.copy()
>         # Copy entities structure if same.
>         s0_entities = s0_extent['entities']
>         s1_entities = s1_extent['entities']
>         if s0_entities is s1_entities:
>             s1_entities = s1_extent['entities'] = copy_btree(s1_entities)
>         # Copy entity structure if same.
>         s0_entity = s0_entities.get(oid, None)   # (may not exist in s0)
>         s1_entity = s1_entities[oid]
>         if s0_entity is s1_entity:
>             s1_entity = s1_entities[oid] = s1_entity.copy()
>         # Insert new field values structure.
>         s1_entity['fields'] = PersistentDict(values)
> 
>     def copy_btree(tree):
>         # XXX: This could be expensive.  Maybe a copy-on-write
>         # Btree variation could be written that has a built-in copy()
>         # method.
>         tree2 = BTree()
>         for key, value in tree.iteritems():
>             tree2[key] = value
>         return tree2
> 
> 
> Devil's advocate
> ================
> 
> Of course, we could take on a different challenge, and start using a 
> backend other than Durus that already includes snapshotting capability.  
> For instance, if I recall correctly, sqlite would be useful for this 
> (divmod's axiom uses it), and if data is structured right, might even 
> boost Schevo's performance.
> 
> However, I personally feel this would detract from one of the primary 
> goals of Schevo, which is to provide a first-class database engine in 
> pure Python (the C extension in Durus is optional, after all).  It would 
> also likely take more time to implement than staying with Durus, and 
> would provide a migration challenge.
> 
> Another option would be to see if ZODB could help us with this, since it 
> has MVCC.  According to http://www.zope.org/Products/ZODB3.6 though, C 
> extensions are a requirement, not an option.  Additionally, this would 
> require a migration process for already-deployed databases.
> 
> Other "devil's advocate" ideas are welcome and appreciated in this 
> discussion.
> 
> 
> The bottom of the message
> =========================
> 
> Thank you for reading this far. :-)
> 
> 
> -- 
> Matthew Scott
> mscott at springfieldtech.com
> 
> 
> 


-- 
http://lazaridis.com


More information about the Schevo-devel mailing list