[Schevo-devel] concurrency and copy-on-write snapshots
Matthew Scott
mscott at springfieldtech.com
Thu May 11 16:52:48 EDT 2006
I've been thinking a lot about Schevo's "concurrency story" over the
last couple of days, and it has been on the project's "back burner"
list for many a month now.
I appreciate anyone who takes the time to skim/read this and remark
on any pitfalls or suggestions I haven't thought of.
Overview
========
We can improve Schevo's flexibility with regards to thread
concurrency if we adopt a copy-on-write "snapshot" approach, with as-
yet-unknown speed decreases as a side-effect. The term frequently
used for this is MVCC (multi-version concurrency control).
The approach could be made to be compatible with the existing Schevo
database structures, and would be optional upon opening a database
file. A database opened and modified in snapshot mode could be
closed, and re-opened in "single" mode to regain performance lost
when using "snapshot" mode.
Timeline
========
No timeline has been set for this, although it will probably be
implemented in some fashion within the next few months.
Status quo
==========
Right now, Schevo has support for "MROW" (multiple-reader, one-
writer) locking on one database structure. This supports the
following type of timeline:
thread1 thread2 thread3 thread4
----------- ----------- ----------- -----------
_________
1 read lock __________
2 | _________ read lock
3 | read lock |
| | _________ |
4 | | read lock ...v...
5 ____v____ | | write lock
| | request
6 read lock ____v____ | .
request | .
. ____v____ __________
7 . write lock
. |
8 . read lock |
9 . request read lock |
. . request |
. . . |
. . . |
. . . |
. . . |
. . . |
....v.... ....v.... ....v.... ____v_____
10 read lock read lock read lock
| | |
v v v
1: read lock obtained by thread1.
2: read lock obtained by thread4.
3: read lock obtained by thread2.
4: read lock obtained by thread3.
5: read lock released by thread1.
5: write lock requested by thread4, must wait for read locks to be
released.
6: read lock requested by thread1, must wait for write lock to be
released.
6: read lock released by thread2.
7: read lock released by thread3.
7: write lock obtained by thread4.
8, 9: read lock requested by thread2 and thread3, must wait for write
lock to be released.
10: write lock released by thread4.
10: read locks obtained by thread1, thread2, thread3.
Problems
--------
This type of locking is undesirable in a few ways, including the
following:
* They are detrimental to concurrency if proper read/write locks are
being used.
* Database consistency is out of the question if you decide to ignore
convention and read from the database while executing a transaction.
Snapshots
=========
What would be a more ideal timeline would be the following:
thread1 thread2 thread3 thread4
----------- ----------- ----------- -----------
1 ________A
2 read ________A
3 | ________A read/write
| read |
4 | | ________A |
| | read |
5 ____v____ | | |
| | |
6 ____v____ | |
| |
7 ____v____ |
|
8 ________A |
9 read read/write |
| requested |
| . |
10 | ____v____B ____v___B
11 | | ________B
| | read
| | |
12 ____v____ | |
13 ___xxx___B |
v
1: thread1 creates read snapshot with database state A
2: thread4 creates read/write snapshot with state A
3: thread2 creates read snapshot with state A
4: thread3 creates read snapshot with state A
5: thread1 destroys read snapshot
6: thread2 destroys read snapshot
7: thread3 destroys read snapshot
8: thread1 creates read snapshot with state A
9: thread2 requests read/write, must wait for thread4 to finish read/
write
10: thread4 finishes read/write successfully, database state now B
10: thread2 creates read/write snapshot with state B
11: thread3 creates read snapshot with state B
12: thread1 destroys read snapshot
13: thread2 transaction fails, database state still B
Advantages
----------
The advantages of snapshots are corollaries to the disadvantages of
MROW locking:
* They make read and write threads asynchronous to eachother.
* Database consistency is guaranteed within a snapshot.
Disadvantages
-------------
* Only one writer is allowed at once, although there may be
techniques that can piggy-back on the snapshot system to allow
multiple concurrent writers.
* Write performance degrades when snapshot mode is used.
* Database growth speed between pack operations increases.
Implementation
==============
What I propose is that the Schevo database engine be altered to work
natively in a "snapshot" based mode, using copy-on-write techniques.
Single thread
-------------
Let us take the simplest case of a single-threaded app that executes
a transaction. Here's what would happen in a non-snapshot mode
(which, incidentally, could still be left in as an option when high-
performance single-threaded operation is desired):
- Changes are written directly to the database structure in the top-
level SCHEVO key of the Durus file.
- Upon success, Durus commit is called.
- Upon failure, Durus rollback is called.
Here's what would happen in snapshot mode:
- A reference to the current consistent database state (I'll refer to
this as s0) is created in the Durus file. I'll refer to it as s1.
- Changes are made directly to s1. When each change occurs, any
structures those changes are contained in are created anew within s1,
so that s0 contains no references to them. All unchanged information
is still in the form of references to data in s0.
- Upon success, the s0 state is changed to refer to s1, the s1
snapshot is removed, and Durus commit is called. The old s0 no
longer exists at this point.
- Upon failure, the top-level SCHEVO key is still s0, the s1 snapshot
is removed, and Durus commit is called.
Multiple threads
----------------
The implementation gets more interesting when we mix read snapshots
and read/write snapshots. I'll use the second timeline above as a
template for describing how this would generally work.
When the database starts out, there is only one snapshot, s0-1.
1: thread1 creates read snapshot with database state A
- Database creates s1, referring to s0-1.
2: thread4 creates read/write snapshot with state A
- thread4 obtains database write lock.
- Database creates s2, referring to s0-1.
- As changes are made, structures in s2 are converted from
references to structures in s0 to distincly separate structures,
referred to only in s2.
3: thread2 creates read snapshot with state A
- Database creates s3, referring to s0-1.
4: thread3 creates read snapshot with state A
- Database creates s4, referring to s0-1.
5: thread1 destroys read snapshot
- Database destroys s1.
6: thread2 destroys read snapshot
- Database destroys s3.
7: thread3 destroys read snapshot
- Database destroys s4.
8: thread1 creates read snapshot with state A
- Database creates s5, referring to s0-1.
9: thread2 requests read/write, must wait for thread4 to finish read/
write
- thread2 waits for database write lock.
10: thread4 finishes read/write, database state now B
10: thread2 creates read/write snapshot with state B
- thread4 releases database write lock.
- Database makes s0 key refer to s2. (referred to as s0-2)
- Database destroys s2.
- thread2 receives write lock.
- Database creates s6, referring to s0-2.
11: thread3 creates read snapshot with state B
- Database creates s7, referring to s0-2.
12: thread1 destroys read snapshot
- Database destroys s5.
13: thread2 transaction fails, database state still B
- Database destroys s6
- No change is made to s0 value, still refers to s0-2.
API
---
The proposed API for this would be the following, idealized with the
upcoming 'with' keyword in Python 2.5::
with db.reader() as db:
# ... read operations go here ...
db.execute(...) # fails because we're not allowed to write
with db.writer() as db:
# ... read operations, etc. ...
db.execute(...) # works fine
To support Python 2.4 syntax, it would look like this::
db = db.reader()
try:
# ... read operations go here ...
finally:
db = db.original()
db = db.writer()
try:
# ... read operations, etc. ...
db.execute(...)
finally:
db = db.original()
The inner value of 'db' is not the database itself. Rather, it is a
context manager wrapper around the database that pins the operations
to the correct snapshot.
The outer value of 'db' is not necessarily the database itself
either, although in many cases it is. Where that wouldn't occur is
if you nested context managers::
with db.reader() as db:
# ... do something ...
# ... call something that ultimately does this: ...
with db.reader() as db:
# ... do stuff ...
In this case, let's assume that the outer 'db' is the database
itself. So, the first inner 'db' is a read snapshot context
manager. The second inner 'db' is that same context manager, and a
reference counter is used to keep track of nesting.
Writers nested in readers is not allowed. This code would raise an
exception::
with db.reader() as db:
with db.writer() as db: # <-- raises exception
# ...
Readers nested in writers are allowed, and the context manager would
maintain a reference counter much as in the first nested example
above. It should also maintain that you cannot execute transactions
within the inner reader context manager, even though it is actually
the same as the outer writer context manager. This would be legal
code::
with db.writer() as db:
# ... write stuff
# ... call another function that does the following:
with db.reader() as db:
# ... read stuff
# ... write stuff
This would not be allowed::
with db.writer() as db:
# ... write stuff
# ... call another function that does the following:
with db.reader() as db:
# ... read stuff
# ... write stuff <-- error
Details, details
----------------
So, given two snapshots, s0 and s1, that both refer to the same top-
level Schevo database structure, how do we make a change to s1 such
that everything that s0 refers to remains unmodified?
That is where I'll stop this, since when we answer that question, we
might as well write the code to implement this idea.
It might get tricky, but I think we could pull it off with some
effort. Here's a rough draft of what it might take to update the
field values of an entity, assuming that 'values' is already in
{field-id: stored-value} form::
def update_field_values(root, s0_id, s1_id, extent_id, oid,
values):
s0 = root[s0_id]
s1 = root[s1_id]
# Copy top-level structure if same.
if s0 is s1:
s1 = root[s1_id] = s1.copy()
# Copy extents structure if same.
s0_extents = s0['extents']
s1_extents = s1['extents']
if s0_extents is s1_extents:
s1_extents = s1['extents'] = s1_extents.copy()
# Copy extent structure if same.
s0_extent = s0_extents[extent_id]
s1_extent = s1_extents[extent_id]
if s0_extent is s1_extent:
s1_extent = s1_extents[extent_id] = s1_extent.copy()
# Copy entities structure if same.
s0_entities = s0_extent['entities']
s1_entities = s1_extent['entities']
if s0_entities is s1_entities:
s1_entities = s1_extent['entities'] = copy_btree
(s1_entities)
# Copy entity structure if same.
s0_entity = s0_entities.get(oid, None) # (may not exist in
s0)
s1_entity = s1_entities[oid]
if s0_entity is s1_entity:
s1_entity = s1_entities[oid] = s1_entity.copy()
# Insert new field values structure.
s1_entity['fields'] = PersistentDict(values)
def copy_btree(tree):
# XXX: This could be expensive. Maybe a copy-on-write
# Btree variation could be written that has a built-in copy()
# method.
tree2 = BTree()
for key, value in tree.iteritems():
tree2[key] = value
return tree2
Devil's advocate
================
Of course, we could take on a different challenge, and start using a
backend other than Durus that already includes snapshotting
capability. For instance, if I recall correctly, sqlite would be
useful for this (divmod's axiom uses it), and if data is structured
right, might even boost Schevo's performance.
However, I personally feel this would detract from one of the primary
goals of Schevo, which is to provide a first-class database engine in
pure Python (the C extension in Durus is optional, after all). It
would also likely take more time to implement than staying with
Durus, and would provide a migration challenge.
Another option would be to see if ZODB could help us with this, since
it has MVCC. According to http://www.zope.org/Products/ZODB3.6
though, C extensions are a requirement, not an option. Additionally,
this would require a migration process for already-deployed databases.
Other "devil's advocate" ideas are welcome and appreciated in this
discussion.
The bottom of the message
=========================
Thank you for reading this far. :-)
--
Matthew Scott
mscott at springfieldtech.com
More information about the Schevo-devel
mailing list