[Schevo-devel] Re: concurrency and copy-on-write snapshots
Ilias Lazaridis
ilias at lazaridis.com
Tue May 23 17:14:45 EDT 2006
Matthew Scott wrote:
> I've been thinking a lot about Schevo's "concurrency story" over the
> last couple of days, and it has been on the project's "back burner" list
> for many a month now.
>
> I appreciate anyone who takes the time to skim/read this and remark on
> any pitfalls or suggestions I haven't thought of.
I've just overflown this and have some general remarks:
This problem should be demonstrated by standard diagramms (e.g. UML), to
allow a review from experts.
The stated problem should be a general one, and most possibly there is
some documentation available.
> Overview
> ========
>
> We can improve Schevo's flexibility with regards to thread concurrency
> if we adopt a copy-on-write "snapshot" approach, with as-yet-unknown
> speed decreases as a side-effect. The term frequently used for this is
> MVCC (multi-version concurrency control).
>
> The approach could be made to be compatible with the existing Schevo
> database structures, and would be optional upon opening a database
> file. A database opened and modified in snapshot mode could be closed,
> and re-opened in "single" mode to regain performance lost when using
> "snapshot" mode.
>
> Timeline
> ========
>
> No timeline has been set for this, although it will probably be
> implemented in some fashion within the next few months.
>
> Status quo
> ==========
>
> Right now, Schevo has support for "MROW" (multiple-reader, one-writer)
> locking on one database structure. This supports the following type of
> timeline:
>
> thread1 thread2 thread3 thread4
> ----------- ----------- ----------- -----------
> _________
> 1 read lock __________
> 2 | _________ read lock
> 3 | read lock |
> | | _________ |
> 4 | | read lock ...v...
> 5 ____v____ | | write lock
> | | request
> 6 read lock ____v____ | .
> request | .
> . ____v____ __________
> 7 . write lock
> . |
> 8 . read lock |
> 9 . request read lock |
> . . request |
> . . . |
> . . . |
> . . . |
> . . . |
> . . . |
> ....v.... ....v.... ....v.... ____v_____
> 10 read lock read lock read lock
> | | |
> v v v
>
> 1: read lock obtained by thread1.
>
> 2: read lock obtained by thread4.
>
> 3: read lock obtained by thread2.
>
> 4: read lock obtained by thread3.
>
> 5: read lock released by thread1.
> 5: write lock requested by thread4, must wait for read locks to be
> released.
>
> 6: read lock requested by thread1, must wait for write lock to be released.
> 6: read lock released by thread2.
>
> 7: read lock released by thread3.
> 7: write lock obtained by thread4.
>
> 8, 9: read lock requested by thread2 and thread3, must wait for write
> lock to be released.
>
> 10: write lock released by thread4.
> 10: read locks obtained by thread1, thread2, thread3.
>
> Problems
> --------
>
> This type of locking is undesirable in a few ways, including the following:
>
> * They are detrimental to concurrency if proper read/write locks are
> being used.
>
> * Database consistency is out of the question if you decide to ignore
> convention and read from the database while executing a transaction.
>
>
> Snapshots
> =========
>
> What would be a more ideal timeline would be the following:
>
>
> thread1 thread2 thread3 thread4
> ----------- ----------- ----------- -----------
> 1 ________A
> 2 read ________A
> 3 | ________A read/write
> | read |
> 4 | | ________A |
> | | read |
> 5 ____v____ | | |
> | | |
> 6 ____v____ | |
> | |
> 7 ____v____ |
> |
> 8 ________A |
> 9 read read/write |
> | requested |
> | . |
> 10 | ____v____B ____v___B
> 11 | | ________B
> | | read
> | | |
> 12 ____v____ | |
> 13 ___xxx___B |
> v
>
> 1: thread1 creates read snapshot with database state A
>
> 2: thread4 creates read/write snapshot with state A
>
> 3: thread2 creates read snapshot with state A
>
> 4: thread3 creates read snapshot with state A
>
> 5: thread1 destroys read snapshot
>
> 6: thread2 destroys read snapshot
>
> 7: thread3 destroys read snapshot
>
> 8: thread1 creates read snapshot with state A
>
> 9: thread2 requests read/write, must wait for thread4 to finish read/write
>
> 10: thread4 finishes read/write successfully, database state now B
> 10: thread2 creates read/write snapshot with state B
>
> 11: thread3 creates read snapshot with state B
>
> 12: thread1 destroys read snapshot
>
> 13: thread2 transaction fails, database state still B
>
> Advantages
> ----------
>
> The advantages of snapshots are corollaries to the disadvantages of MROW
> locking:
>
> * They make read and write threads asynchronous to eachother.
>
> * Database consistency is guaranteed within a snapshot.
>
> Disadvantages
> -------------
>
> * Only one writer is allowed at once, although there may be techniques
> that can piggy-back on the snapshot system to allow multiple concurrent
> writers.
>
> * Write performance degrades when snapshot mode is used.
>
> * Database growth speed between pack operations increases.
>
>
> Implementation
> ==============
>
> What I propose is that the Schevo database engine be altered to work
> natively in a "snapshot" based mode, using copy-on-write techniques.
>
> Single thread
> -------------
>
> Let us take the simplest case of a single-threaded app that executes a
> transaction. Here's what would happen in a non-snapshot mode (which,
> incidentally, could still be left in as an option when high-performance
> single-threaded operation is desired):
>
> - Changes are written directly to the database structure in the
> top-level SCHEVO key of the Durus file.
>
> - Upon success, Durus commit is called.
>
> - Upon failure, Durus rollback is called.
>
> Here's what would happen in snapshot mode:
>
> - A reference to the current consistent database state (I'll refer to
> this as s0) is created in the Durus file. I'll refer to it as s1.
>
> - Changes are made directly to s1. When each change occurs, any
> structures those changes are contained in are created anew within s1, so
> that s0 contains no references to them. All unchanged information is
> still in the form of references to data in s0.
>
> - Upon success, the s0 state is changed to refer to s1, the s1 snapshot
> is removed, and Durus commit is called. The old s0 no longer exists at
> this point.
>
> - Upon failure, the top-level SCHEVO key is still s0, the s1 snapshot is
> removed, and Durus commit is called.
>
> Multiple threads
> ----------------
>
> The implementation gets more interesting when we mix read snapshots and
> read/write snapshots. I'll use the second timeline above as a template
> for describing how this would generally work.
>
> When the database starts out, there is only one snapshot, s0-1.
>
> 1: thread1 creates read snapshot with database state A
> - Database creates s1, referring to s0-1.
>
> 2: thread4 creates read/write snapshot with state A
> - thread4 obtains database write lock.
> - Database creates s2, referring to s0-1.
> - As changes are made, structures in s2 are converted from references
> to structures in s0 to distincly separate structures, referred to only
> in s2.
>
> 3: thread2 creates read snapshot with state A
> - Database creates s3, referring to s0-1.
>
> 4: thread3 creates read snapshot with state A
> - Database creates s4, referring to s0-1.
>
> 5: thread1 destroys read snapshot
> - Database destroys s1.
>
> 6: thread2 destroys read snapshot
> - Database destroys s3.
>
> 7: thread3 destroys read snapshot
> - Database destroys s4.
>
> 8: thread1 creates read snapshot with state A
> - Database creates s5, referring to s0-1.
>
> 9: thread2 requests read/write, must wait for thread4 to finish read/write
> - thread2 waits for database write lock.
>
> 10: thread4 finishes read/write, database state now B
> 10: thread2 creates read/write snapshot with state B
> - thread4 releases database write lock.
> - Database makes s0 key refer to s2. (referred to as s0-2)
> - Database destroys s2.
> - thread2 receives write lock.
> - Database creates s6, referring to s0-2.
>
> 11: thread3 creates read snapshot with state B
> - Database creates s7, referring to s0-2.
>
> 12: thread1 destroys read snapshot
> - Database destroys s5.
>
> 13: thread2 transaction fails, database state still B
> - Database destroys s6
> - No change is made to s0 value, still refers to s0-2.
>
> API
> ---
>
> The proposed API for this would be the following, idealized with the
> upcoming 'with' keyword in Python 2.5::
>
> with db.reader() as db:
> # ... read operations go here ...
> db.execute(...) # fails because we're not allowed to write
>
> with db.writer() as db:
> # ... read operations, etc. ...
> db.execute(...) # works fine
>
> To support Python 2.4 syntax, it would look like this::
>
> db = db.reader()
> try:
> # ... read operations go here ...
> finally:
> db = db.original()
>
> db = db.writer()
> try:
> # ... read operations, etc. ...
> db.execute(...)
> finally:
> db = db.original()
>
> The inner value of 'db' is not the database itself. Rather, it is a
> context manager wrapper around the database that pins the operations to
> the correct snapshot.
>
> The outer value of 'db' is not necessarily the database itself either,
> although in many cases it is. Where that wouldn't occur is if you
> nested context managers::
>
> with db.reader() as db:
> # ... do something ...
> # ... call something that ultimately does this: ...
> with db.reader() as db:
> # ... do stuff ...
>
> In this case, let's assume that the outer 'db' is the database itself.
> So, the first inner 'db' is a read snapshot context manager. The second
> inner 'db' is that same context manager, and a reference counter is used
> to keep track of nesting.
>
> Writers nested in readers is not allowed. This code would raise an
> exception::
>
> with db.reader() as db:
> with db.writer() as db: # <-- raises exception
> # ...
>
> Readers nested in writers are allowed, and the context manager would
> maintain a reference counter much as in the first nested example above.
> It should also maintain that you cannot execute transactions within the
> inner reader context manager, even though it is actually the same as the
> outer writer context manager. This would be legal code::
>
> with db.writer() as db:
> # ... write stuff
> # ... call another function that does the following:
> with db.reader() as db:
> # ... read stuff
> # ... write stuff
>
> This would not be allowed::
>
> with db.writer() as db:
> # ... write stuff
> # ... call another function that does the following:
> with db.reader() as db:
> # ... read stuff
> # ... write stuff <-- error
>
> Details, details
> ----------------
>
> So, given two snapshots, s0 and s1, that both refer to the same
> top-level Schevo database structure, how do we make a change to s1 such
> that everything that s0 refers to remains unmodified?
>
> That is where I'll stop this, since when we answer that question, we
> might as well write the code to implement this idea.
>
> It might get tricky, but I think we could pull it off with some effort.
> Here's a rough draft of what it might take to update the field values of
> an entity, assuming that 'values' is already in {field-id: stored-value}
> form::
>
> def update_field_values(root, s0_id, s1_id, extent_id, oid, values):
> s0 = root[s0_id]
> s1 = root[s1_id]
> # Copy top-level structure if same.
> if s0 is s1:
> s1 = root[s1_id] = s1.copy()
> # Copy extents structure if same.
> s0_extents = s0['extents']
> s1_extents = s1['extents']
> if s0_extents is s1_extents:
> s1_extents = s1['extents'] = s1_extents.copy()
> # Copy extent structure if same.
> s0_extent = s0_extents[extent_id]
> s1_extent = s1_extents[extent_id]
> if s0_extent is s1_extent:
> s1_extent = s1_extents[extent_id] = s1_extent.copy()
> # Copy entities structure if same.
> s0_entities = s0_extent['entities']
> s1_entities = s1_extent['entities']
> if s0_entities is s1_entities:
> s1_entities = s1_extent['entities'] = copy_btree(s1_entities)
> # Copy entity structure if same.
> s0_entity = s0_entities.get(oid, None) # (may not exist in s0)
> s1_entity = s1_entities[oid]
> if s0_entity is s1_entity:
> s1_entity = s1_entities[oid] = s1_entity.copy()
> # Insert new field values structure.
> s1_entity['fields'] = PersistentDict(values)
>
> def copy_btree(tree):
> # XXX: This could be expensive. Maybe a copy-on-write
> # Btree variation could be written that has a built-in copy()
> # method.
> tree2 = BTree()
> for key, value in tree.iteritems():
> tree2[key] = value
> return tree2
>
>
> Devil's advocate
> ================
>
> Of course, we could take on a different challenge, and start using a
> backend other than Durus that already includes snapshotting capability.
> For instance, if I recall correctly, sqlite would be useful for this
> (divmod's axiom uses it), and if data is structured right, might even
> boost Schevo's performance.
>
> However, I personally feel this would detract from one of the primary
> goals of Schevo, which is to provide a first-class database engine in
> pure Python (the C extension in Durus is optional, after all). It would
> also likely take more time to implement than staying with Durus, and
> would provide a migration challenge.
>
> Another option would be to see if ZODB could help us with this, since it
> has MVCC. According to http://www.zope.org/Products/ZODB3.6 though, C
> extensions are a requirement, not an option. Additionally, this would
> require a migration process for already-deployed databases.
>
> Other "devil's advocate" ideas are welcome and appreciated in this
> discussion.
>
>
> The bottom of the message
> =========================
>
> Thank you for reading this far. :-)
>
>
> --
> Matthew Scott
> mscott at springfieldtech.com
>
>
>
--
http://lazaridis.com
More information about the Schevo-devel
mailing list