Friday, June 20, 2014

Conceptual vs. logical modeling, part II

Integrity

In the previous post, a somewhat detailed inspection was made of possible approaches for specifying, in a modeling language, some database structure, highlighting differences in informational value between those modeling languages, as well as pointing out some pieces of relevant information the specificiation of which is typically left completely unsupported by any of them.

But a full formal spec of a database is not only about its structure, it is also about any additional rules that apply to the constituent components of that structure.  This observation holds, regardless of whether the database is a relational one (and its constituent components are what TTM calls "relation variables", "tables" in SQL) or a graph-based or hierarchical one (and its constituent components are nodes and edges).  I'll be speaking of relvars (TTM abbreviation for relation variables) in what follows, but keep in mind that the same should apply as well, mutatis mutandis, to hierarchical and "graph-ical" models.

While the aspect of 'structure' can reasonably well be modeled in "graphical" languages (such as the various ER dialects and UML), that is much less the case with the aspect of the integrity constraints between the components of that structure.  How come ?

The essential reason is that the nature of an integrity rule/constraint can really be just anything at all.  Its "structure" is constrained only by the fact that it must be expressed exclusively in terms of the relvars that make up the database structure.  At the logical level, where all the formal details of the relvars have been fully specced out in some given language, this is achieved "easily" enough using some language based on/inspired by mathematics.  Just spell out the predicate that makes a violation a violation.  But how to devise a language that supports expressing "anything at all" at the conceptual level ?  The answer is you can't.  The only thing you can do is try to taxonomize the set of all possible constraints in certain well-defined "classes" that might indeed be epxressible.  That is (sort of) exactly what has happened in database modeling land (*).  From ER modeling over IDEFIX to Halpin ORM : the set of all possible constraints is subsetted according to certain chosen criteria, and for each "identifiable" subset, a notation is devised to facilitate documenting constraints belonging to that subset.  A modeling language such as ER leaves it at that, Halpin's "Big Brown Book" explicitly adds a (fourteenth, I believe) category "others", the leftovers that still aren't expressible using the modeling language's symbols that are available.

Anyway.  The fact alone that a powerful modeling approach such as ORM still has this category "leftovers" in the constraints realm, should already suffice to show that _complete_ and fully formal specifications are in fact inachievable without a mathematical language.  For the typical categories of constraints other than those "leftovers", however, the belief seems to be fairly widespread, and firm, that current mainstream notations suffice to document all the stuff we need to know/convey about our databases.  That belief is not entirely warranted, imo, and in this post I'll be illustrating a couple of issues relating to the (common) class of uniqueness constraints.  A subsequent post will do the same for referential integrity/foreign key constraints.

(*) the sad byproduct of this state of affairs is that if one uses the word "constraint" in a database discussion, SQL practitioners will often think you must be talking either of a UNIQUE constraint or a foreign key, overlooking the fact that "there are more types of constraint between heaven and earth than are expressible in ER, and supported by SQL".



Uniqueness

The example case from the previous post is not very well suited to illustrate issues with documenting uniqueness rules.  None of the entities in that example are directly suitable for illustrating the points I want to make, so for the sake of this discussion, I'm somewhat forced to resort to a totally different example, which is likely to look ridiculous in the eyes of business modelers, but I won't mind that for the time being.

Let's say we want to model the operation of numeric addition - if you really can't bear the thought, imagine you are Euclid or Pythagoras, that arithmetic has not been invented yet and you are in the process of doing exactly that, using the latest database design technology (and with apologies upfront for my ascii modeling) :

+---------------+
! ADDITION      !
+---------------+
! N1     number !
! N2     number !
! SUM    number !
+---------------+

(Yes, and the table is indeed like

+----+----+-----+
! N1 ! N2 ! SUM !
+----+----+-----+
!  1 !  1 !   2 !
!  1 !  2 !   3 !
!  ...          !
!  2 !  1 !   3 !
!  2 !  2 !   4 !
!  ...          !
+---------------+

!!!!!!!!)



I'm pretty sure if I'd ask you what the key is here, you'd reply with "N1 and N2 combined, of course".  You get 33% from me for that answer.  There are three keys here : {N1 N2}, {N1 SUM}, and {N2 SUM}.  Well, granted of course that the matter is also important of which ones you ACTUALLY WANT ENFORCED (that's what you're referring to if you wanted to argue that these latter two "do not identify an addition").  If we wanted to "enforce" the obvious consistencies/equivalences between expressions of addition and expressions of subtraction, we would indeed need to model and enforce all three (hehe.  settled that.).

Now how would you document all three of them in your model of "annotated rectangles" ?  You're in trouble !  (In fact, I think it is precisely _because_ of this notational problem to document multiple composite keys inside one single rectangle, that the notion of "primary" key (as distinct from "secondary / ternary / auxiliary / ..." key ????) has become so widespread as it has, and furthermore that the practice of ID-ifying really just everything has become as popular and widespread as it has.  I leave it for you to ponder whether that's a case of putting the cart before the horse or not - or one of redefining/retrofitting the problem such as to suit the most desirable solution.)



Anyway.  Supposing we do want to enforce all three keys.  How can we document this in our drawing ?  Observe in particular that each key consists of >1 attribute and each attribute effectively participates in >1 key.  The only way I can imagine to convey all the key information in our rectangle is like this :

+---------------+
! ADDITION      !
+---------------+-------+
! N1     number ! K1,K2 !
! N2     number ! K1,K3 !
! SUM    number ! K2,K3 !
+---------------+-------+

Very much like the approach of putting a P in front of the attributes, but it takes attentive and careful deciphering to read the drawing and capture the keys information correctly !  (And the sad byproduct of omitting the full keys information, e.g. for readability sake, in diagrams such as these, is indeed that typically not all keys are properly identified, let alone enforced.)

In fact, the most readable way to convey all of this information about uniqueness rules, seems to be exactly by just using syntax very similar to declarative DDL, or the declarative portion of a D language :

UNIQUE {N1,N2} , UNIQUE {N1,SUM} , UNIQUE {N2,SUM} or
UNIQUE { {N1,N2} {N1,SUM} {N2,SUM} }

And here again we are seemingly headed toward a similar conclusion : if you want to be precise _AND_ complete in what you are stating about the nature of the database that you are documenting, then of necessity you MUST resort to a language that has a much higher expressiveness than the ones you typically have available when modeling at a "higher" level of abstraction.

Incidentally, in notations such as Data Vault (a "hub" to represent the entity and separate rectangles connected to the "hub" for each attribute), the problem of documenting keys is even worse.  The only graphical way(s) I can imagine to document the existence of some meaningful "grouping" of attributes, such as their belonging to the same key, will invariably make the diagram equally unreadable because of extraneous line bloat.  Whether you try to do it by surrounding them with dotted lines or so, or by creating a new symbol for documenting the existence of a key (three extra symbols for the ADDITION Data vault) and connecting each attribute with them as appropriate (six extra connections on the diagram), it's always going to turn your beautiful neatly organized DV diagram into more of a spider web.  Fortunately, Data Vault diagrams are typically used only in DW contexts, not to document keys in the operational source systems they're concerned with, but it still goes to show that whatever conceptual notation you use, it only goes as far as it goes and _something_ will always be "missing" from it.



Uniqueness bis

Managing temporal data is somewhat of a long-standing problem in database land.  A thorough analysis of the _nature_ of the problem (and what is needed to address it) can be found in "Temporal Data & the Relational Model", pages 1-857, so I'm not going to re-iterate all of that here, but one particular problem dealt with is "temporal uniqueness".  (Aside : if you haven't yet read the book but are interested to do so, don't go order or search it now.  Updated and revised edition is to appear within a couple of months.)

Say you have

+-----------------+
! ASSET_VALUATION !
+-----------------+
! ASSET_ID     ID !
! FROM       date !
! TO         date !
! VALUE    number !
+-----------------+

or

+-----------------+
! MARRIED_TO      !
+-----------------+
! PERSON1_ID   ID !
! PERSON2_ID   ID !
! FROM       date !
! TO         date !
+-----------------+

and you want to enforce a constraint "no single date giving >1 distinct values for same ASSET_ID", or "no one married to >1 other person on same date".

The "traditional" interpretation of what a key is, will not allow you to express this.  No "traditional", equality-based, key will ever prevent overlaps between various FROM-TO combinations for the same person/asset/...  So perhaps you might be inclined to conclude "that is not a real key".  Interestingly, The "Temporal Data" book proposes to rearrange matters a bit so that expressing the constraint does become possible, and indeed in the form of "specifying a key" at that :

+-------------------+
! ASSET_VALUATION   !
+-------------------+
! ASSET_ID       ID !
! DURING date_range !
! VALUE      number !
+-------------------+

or

+-------------------+
! MARRIED_TO        !
+-------------------+
! PERSON1_ID     ID !
! PERSON2_ID     ID !
! DURING date_range !
+-------------------+

... WHEN UNPACKED ON (DURING) THEN KEY {ASSET_ID DURING} ...
... WHEN UNPACKED ON (DURING) THEN KEY {PERSON1_ID DURING} KEY {PERSON2_ID DURING} ...


The graphical languages such as ER that we typically use for information modeling, still let us down, somewhat, in the case we'd want to specify that level of detail.  In addition to the composite nature of the key, we'd also need to express the semantics of the "WHEN UNPACKED ON" part : namely that the values for the DURING attribute must be interpreted in a "for each individual date value covered by the range" kind of way.  The closest we could come to denoting that might be something like this :

+-------------------+
! MARRIED_TO        !
+-------------------+-----------+
! PERSON1_ID     ID ! K1        !
! PERSON2_ID     ID ! K2        !
! DURING date_range ! K1_T,K2_T !
+-------------------+-----------+

Of course, the notational problem with denoting multiple keys to the relvar has not disappeared, nor has the possible participation of a single attribute in >1 key, just an extra little bit of codification has been added (suffixing _T to a key name on the lines for the attributes where it applies) to denote the extra bit of semantics covered.  It will be clear that while such tricks are indeed possible, and potentially helpful in denoting modeled solutions to a problem that is indeed very common, once again such solutions can only go as far as they do, and taking things further will ultimately result only in making the models we draw ultimately unreadable.  Specifically in the context of temporal data management and temporal keys, observe for example that it is not necessarily the case that all range-valued attributes will _always_ have the _T "interpretation" for _all_ the keys in which they participate :

+----------------------------------+
! NOT_BEST_OF_EXAMPLES_BUT_ANYWAY  !
+----------------------------------+------+
! PRESIDENTIAL_TERM     year_range ! K1   !
! DURING                date_range ! K1_T !
! PRESIDENT_NAME               ... !      !
+----------------------------------+------+

( think   70-74 : 70-73 : NIXON   &&   70-74 : 74-74 : FORD )

Once again the conclusion seems warranted that extending expressiveness/notational support beyond current common practices, will quickly result in making the models more unreadable and thus less informative, rather than more informative.



Uniqueness ter

Another variation on the theme of uniqueness rules, is the problem of enforcing uniqueness on only a (proper) subset of all occurrences of an entity type.  Say you have

+----------------------------------+
! CAR_LICENSE_PLATE                !
+----------------------------------+------+
! CAR_CHASSIS_ID               ... ! K1   !
! TAX_LICENSE_PLATE            ... ! K2   !
! CAR_STILL_IN_ACTIVE_USE     bool !      !
+----------------------------------+------+

and for the purpose of re-using license plate numbers, you want to enforce license plate uniqueness (key K2) only for those cars that are still in active use (there are admittedly better solutions to this problem than the one modeled here, which I sometimes call the ultra-poor man's historical database, but it does serve to illustrate my point).

The aspect of the problem that makes the "key" "not documentable", is precisely the subsetting rule, i.e. the fact that the "key" is not to be enforced on the whole CAR_LICENSE_PLATE entity, but just on the subset of it that, once the database is implemented in SQL, could be found by issuing SELECT ... FROM CAR_LICENSE_PLATE WHERE CAR_STILL_IN_ACTIVE_USE == true;

If we absolutely wanted to be able to document the existence of this key, using the available means of adding a "Kn" annotation in the rectangles, we'd have to add a separate rectangle for the subsetted CAR_LICENSE_PLATE entity, and then we'd have shifted the problem to documenting the definitional connect/dependency of this new rectangle with/on the "original" one, the "full" entity.  That is, we've transformed the problem into one of conceptually documenting "view definitions" (and that very idea is probably seriously questionable in itself already because including the "definitional connect" smacks quite a bit of conflating conceptual/logical).  Once again, our modeling language will let us go only as far as it goes.



(still to be continued)

No comments:

Post a Comment