How’s my complexion?
If the efficiency of mirroring enterprise structure with database structure is
the greatest benefit of a distributed database system, the greatest disadvantage
is the resulting complexity. Networked users everywhere can and do make changes
to their local data, and sometimes, one hopes unwittingly, to remote data. (If
the system is well-built, users don’t even know that some of their data actually
originates from another city rather than under their own desks.) Especially along
shared boundaries, two users may attempt to update the same record at the same
time, both moving the same street segment, for instance, but in different
directions. What happens then? Will one user’s change be ignored or overwritten
by the other? If the system is distributed, but the network is down, and each
user changes her own copy of the same street segment, will the two databases
end up inconsistent with each other? This multiuser example of potential data
corruption is only one of several threats to a distributed organization’s data.
In fact, threats to data integrity, whether deliberate or accidental, are largely
responsible for the creation of databases in the first place: databases are
designed to protect data.
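To make the threat concrete, here is a minimal sketch of that lost update in Python with SQLite. The segments table is invented for illustration, and a single connection stands in for the two users; a real distributed system would involve two sessions on two machines.

```python
import sqlite3

# One street segment at position x = 100.0 (all names are hypothetical).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE segments (id INTEGER PRIMARY KEY, x REAL)")
db.execute("INSERT INTO segments VALUES (1, 100.0)")
db.commit()

# Both users read the segment's current position...
x_a = db.execute("SELECT x FROM segments WHERE id = 1").fetchone()[0]
x_b = db.execute("SELECT x FROM segments WHERE id = 1").fetchone()[0]

# ...then user A moves it east and user B moves it west. B's later
# write silently overwrites A's.
db.execute("UPDATE segments SET x = ? WHERE id = 1", (x_a + 10,))
db.execute("UPDATE segments SET x = ? WHERE id = 1", (x_b - 10,))
db.commit()

print(db.execute("SELECT x FROM segments WHERE id = 1").fetchone()[0])
# prints 90.0: A's eastward move has vanished without a trace
```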
Database theory: a quickie. Databases protect data with integrity, recovery, and
concurrency mechanisms that prevent users from entering bad data, losing
data during a system failure, or corrupting existing data. New database users
learn about integrity during their first database design effort. Because different
users have different opinions regarding data entry, databases respond with
integrity rules to guarantee that, for instance, "Avenue" will always be
abbreviated as "Ave" or that streams will only flow downhill. Recovery and
concurrency are less obvious data protection mechanisms. All disk heads must
eventually crash; databases respond with recovery procedures to minimize data
loss. Multiple database users touch the same data at the same time; databases
respond with concurrency control mechanisms to keep one request from interfering
with another, avoiding lost updates (as in the street-segment example above) or inconsistent
analyses.
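The integrity half is the easiest to picture. Here is a sketch in Python with SQLite, where a check constraint plays the part of the "Ave" rule; the table, its columns, and the list of approved abbreviations are all assumptions made for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE streets (
        name TEXT NOT NULL,
        -- the integrity rule: only the agreed abbreviations are legal
        type TEXT NOT NULL CHECK (type IN ('Ave', 'St', 'Blvd', 'Rd'))
    )
""")

db.execute("INSERT INTO streets VALUES ('Fifth', 'Ave')")         # accepted
try:
    db.execute("INSERT INTO streets VALUES ('Sixth', 'Avenue')")  # rejected
except sqlite3.IntegrityError as err:
    print("integrity rule fired:", err)
```

Bad data never reaches the table; the rule is enforced by the database itself rather than by the goodwill of each user.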
To deliver their data-protection guarantee, particularly regarding recovery and
concurrency, databases apply the same rules to every request (or transaction) by
every user. Specifically, databases take an all-or-nothing approach called
atomicity to each user’s request. If a database is able to complete all parts of
a user’s request successfully, it still is not finished--it must then perform an
operation called a commit. Like a fighter pilot painting a star on his plane’s
nose for each enemy shot down, the database logs each successfully completed
transaction as committed. If any part of a request fails, however,
the database reverts or "rolls back" to the state preceding the request, undoing
every part of the transaction. Even if the system crashes in the middle of a
transaction, then, during the recovery process, the absence of a commit signals
the need to roll back the disrupted transaction. Atomic transactions guarantee
that, for example, your request to delete all alleyways will either succeed
completely or fail completely; it will never delete just some alleyways, get
disrupted, and then, unbeknownst to you, leave the job unfinished.
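Here is what that guarantee might look like in Python with SQLite, managing the transaction by hand. The segments table and the simulated crash are, of course, invented for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:", isolation_level=None)  # transactions managed by hand
db.execute("CREATE TABLE segments (id INTEGER PRIMARY KEY, kind TEXT)")
db.executemany("INSERT INTO segments (kind) VALUES (?)",
               [("alley",), ("alley",), ("street",)])

def purge_alleys(db, crash=False):
    """Delete every alleyway atomically: commit on success, roll back otherwise."""
    try:
        db.execute("BEGIN")
        db.execute("DELETE FROM segments WHERE kind = 'alley'")
        if crash:
            raise RuntimeError("simulated crash mid-transaction")
        db.execute("COMMIT")    # the star painted on the nose: transaction logged
    except RuntimeError:
        db.execute("ROLLBACK")  # no commit recorded, so undo every part

purge_alleys(db, crash=True)
print(db.execute("SELECT COUNT(*) FROM segments WHERE kind = 'alley'").fetchone()[0])
# prints 2: the disrupted delete left no trace

purge_alleys(db)
print(db.execute("SELECT COUNT(*) FROM segments WHERE kind = 'alley'").fetchone()[0])
# prints 0: the retried request committed and finished the job
```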
Hand in hand with atomicity is database locking, a mechanism that prevents
simultaneous transactions on the same data from interfering with each other. The
moment one user begins to edit a street segment, the database locks that segment’s
table, or a subset of the table records, or even the single record itself, until
after the edit has been committed. While the lock holds, no other user can edit, or, in
some implementations, even see the segment. (While this strategy solves concurrency
problems, it raises new ones, such as deadlocks, that lie beyond the scope of this
discussion. The key concept, though, is that locks prevent conflicts and protect
data concurrency.)
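A last sketch, again in Python with SQLite, shows the lockout in action. SQLite locks at the coarsest grain of all, the whole database file, rather than a table or a record, but the effect is the one described above; the two connections and all names are invented.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "streets.db")
alice = sqlite3.connect(path, isolation_level=None)
bob = sqlite3.connect(path, isolation_level=None, timeout=0.1)

alice.execute("CREATE TABLE segments (id INTEGER PRIMARY KEY, x REAL)")
alice.execute("INSERT INTO segments VALUES (1, 100.0)")

alice.execute("BEGIN IMMEDIATE")  # Alice takes the write lock as her edit begins
alice.execute("UPDATE segments SET x = 110.0 WHERE id = 1")

try:
    bob.execute("UPDATE segments SET x = 90.0 WHERE id = 1")  # Bob must wait
except sqlite3.OperationalError as err:
    print("Bob is locked out:", err)  # 'database is locked'

alice.execute("COMMIT")  # the lock is released along with the commit
bob.execute("UPDATE segments SET x = 90.0 WHERE id = 1")  # now Bob may edit
```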