6.830 2009 Lecture 10: Transactions from Sam Madden's 2008 notes Today's topics:  transactions  serializability transaction  programmer groups a series of DB operations  DB guarantees them ACID properties  Atomicity -- multiple operations are all-or-nothing w.r.t. crashes  Consistency -- maintain data invariants  Isolation -- concurrent xactions don't interfere or see partial results  Durability -- completed xactions survive a crash Transactions  one of the most successful big ideas in computer science  third big idea so far in 6.830 (rel alg data model, query processing/opt) today's focus is Isolation; later Atomicity / crash recovery how to cope with concurrency?  i'm reading and writing, you're reading and writing same data,  how are we going to write correct code? example  t = A  t = t + 1  A = t  looks correct!  but maybe not if other updates to A are interleaved!  suppose our increment executes right before another increment's write    our increment will be lost why isolation helps  lets me write and reason about serial code, as if no concurrency  it is possible but very difficult to write correct code w/o isolation transaction is the unit of isolation  programmer marks start and end of transaction  DB system executes w/ isolation example:  BEGIN TRANSACTION    SELECT b1 ...    SELECT b2 ...    UPDATE b1 ...    UPDATE b2 ...  COMMIT (or ABORT) abort: un-do all changes, as if never started other concurrent transactions don't see either update until commit and no updates from other xactions allowed, e.g. between sel and upd  that update would be "lost" transaction sees its own updates if crash before commit, after restart as if nothing happened! if crash during commit, all or nothing if crash after commit, after restart will see all updates (durable) how can DB implement isolation? Simplest idea: serial execution  DB system runs the transactions one at a time    waits for each to finish before starting next    even if two submitted at the same time -- forces one to wait  "serial schedule" of operations why not implement isolation using serial execution?  it would be fine if no transaction ever waited for anything  but what if an xaction waits for disk read? or waits for user input?    then CPU or entire system is idle    could be running other transactions  so we'd like to actually run xactions concurrently! how can DB know it is correct to run two transactions at the same time?  DB needs a definition of "correct" definition: "serializable schedule"  interleaved schedule of ops from multiple xactions  some final DB state results  the schedule is serializable if the final DB state is identical    to result from some serial (one-at-a-time) execution    of the transactions  a formalization of isolation let's look at an example  BEGIN  t1 = A       RA  A = t1 * 2   WA  t2 = B       RB  B = t1 + t2  WB (note using A here as well as B)  COMMIT going to abstract a bit, just Rx, Wx  Rx reads current value of x from DB  Wx updates x in DB    the actual value written could depend on any preceding R Schedule 1: is this interleaving OK?  [don't draw the =s initially]  T1  T2  RA=1  WA=2      RA=2      WA=4  RB=10  WB=11      RB=11      WB=13  [final result: A=4 B=13] what final A/B values would this interleaving yield?  let's say the initial state is A=1 B=10  is it the same as T1 then T2? T2 than T1?  [both serial executions are the same:   T1 yields A=2 B=11   T2 yields A=4 B=13 --- the only legal answer!] Schedule 2: how about this interleaving?  T1  T2  RA=1      RA=1      WA=2  WA=2  RB=10  WB=11      RB=11      WB=12  [final result: A=2 B=12] what final A/B values would this interleaving yield?  is it the same as T1 then T2? T2 than T1? we want a DB transaction scheduler that gets as much concurrency as possible  i.e. schedules operations as early as possible  but produces only serializable schedules how can a scheduler decide whether a schedule is serializable?  in above examples, we executed interleaved and serial and compared results  that's not very practical could DB scheduler look at the two transactions' code and predict  whether schedule S is equivalent to some serial order S'?  not in general: same as deciding if two functions have same output  undecidable so we must settle for for a conservative estimate of serializable  s.t. XXX-serializable implies serializable  but not XXX-serializable doesn't neccessarily imply not serializable  and we want it to be cheap to decide View Serializability  defn: schedule S is view serializable if there's a serial order S' s.t.  1. Ti does first read of A in S1, also in S'  2. Ti reads A as written by Tj in S1, also in S'  3. final write of every object done by same xaction in S and S'  intuition:    if T1 in S views same inputs as T1 in S', will write same values    and if order of writes same in S and S', final values will be the same  unlike serializability, *view* serializability is defined in terms of    order of reads and writes, *not* on the computations and results  so it's more tractable to write an algorithm to check it Is Schedule 1 above view serializable?  yes, equivalent to S' = T1 T2    T1 in S sees same values as if it ran first    T2 in S sees same values as if it ran second    T2 write A and B second in both S and S'