A dangerous book by Eduardo Bellani

I recently commented that (Kleppmann 2017) is a dangerous book because of a subtle error in how it defines data models. The burden is on me to clarify this point, and for that I’ll use Hayek’s critical methodological maxim:

We must first explain how an economy can possibly work right before we can meaningfully ask what might go wrong

What is a data model?

Here are 3 definitions, in increasing level of detail:

A data model is an abstract, self-contained, logical definition of the objects, operators, and so forth, that together constitute the abstract machine with which users interact. The objects allow us to model the structure of data. The operators allow us to model its behavior. (Date 2003)

  1. a collection of data structure types (the building blocks of any database that conforms to the model);
  2. a collection of operators or inferencing rules, which can be applied to any valid instances of the data types listed in (1), to retrieve or derive data from any parts of those structures in any combinations desired;
  3. a collection of general integrity rules, which implicitly or explicitly define the set of consistent database states or changes of state or both – these rules may sometimes be expressed as insert-update-delete rules.

(Codd 1980)
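
As a rough illustration of Codd’s three components, the Python sketch below pairs one structure type, one operator, and two integrity rules. The names `TinyDatabase`, `unique_names`, and `non_negative_age` are hypothetical and not taken from any of the cited works; it is a toy under those assumptions, not a description of how any DBMS is built.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Tuple

# Hypothetical structure type for illustration: a (name, age) pair.
Row = Tuple[str, int]
Rule = Callable[[FrozenSet[Row]], bool]


def unique_names(rows: FrozenSet[Row]) -> bool:
    """Integrity rule: no two rows share a name."""
    return len({name for name, _ in rows}) == len(rows)


def non_negative_age(rows: FrozenSet[Row]) -> bool:
    """Integrity rule: every age is zero or greater."""
    return all(age >= 0 for _, age in rows)


@dataclass
class TinyDatabase:
    # Component 1: a data structure type (a set of rows).
    rows: FrozenSet[Row] = frozenset()
    # Component 3: integrity rules that define the consistent states.
    rules: Tuple[Rule, ...] = ()

    def insert(self, row: Row) -> None:
        """Accept the new state only if every integrity rule holds."""
        candidate = self.rows | {row}
        if not all(rule(candidate) for rule in self.rules):
            raise ValueError(f"integrity rule violated by {row!r}")
        self.rows = candidate

    def select(self, predicate: Callable[[Row], bool]) -> FrozenSet[Row]:
        """Component 2: an operator that derives data from the structure."""
        return frozenset(row for row in self.rows if predicate(row))
```

Note that the rules are checked against the candidate state as a whole, which mirrors the idea that integrity rules define the set of consistent database states rather than properties of individual rows.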

In particular, the Relational Data Model

  1. An open-ended collection of scalar types, including type BOOLEAN in particular
  2. A type generator and an intended interpretation for relations of types generated thereby
  3. Facilities for defining variables of such generated relation types
  4. An assignment operator for assigning values to such variables
  5. A complete (but otherwise open-ended) collection of generic operators for deriving values from other values

(Date 2015)
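
To make the five components a little more concrete, here is a minimal Python sketch. It assumes Python’s built-in types stand in for the scalar types, a hypothetical `Relation` class stands in for the relation type generator, and ordinary Python names and `=` stand in for relation variables and assignment. Only two operators (`restrict` and `project`) are shown, so this is nowhere near the complete collection the definition requires.

```python
from typing import Callable, Dict, Iterable, Mapping


class Relation:
    """A relation value: a heading plus a set of tuples (hypothetical sketch)."""

    def __init__(self, heading: Mapping[str, type], rows: Iterable[Mapping[str, object]]):
        # 1 and 2. The heading maps attribute names to scalar types; building
        # a Relation from a heading plays the role of the type generator.
        self.heading: Dict[str, type] = dict(heading)
        body = set()
        for row in rows:
            if set(row) != set(self.heading):
                raise TypeError(f"attributes of {row!r} do not match the heading")
            for attr, value in row.items():
                if not isinstance(value, self.heading[attr]):
                    raise TypeError(f"{attr}={value!r} is not a {self.heading[attr].__name__}")
            body.add(frozenset(row.items()))
        # A set of tuples: no duplicates, no ordering.
        self.body = frozenset(body)

    def __eq__(self, other: object) -> bool:
        return (
            isinstance(other, Relation)
            and self.heading == other.heading
            and self.body == other.body
        )

    # 5. Two generic operators that derive relation values from relation values.
    def restrict(self, predicate: Callable[[Dict[str, object]], bool]) -> "Relation":
        return Relation(self.heading, [dict(t) for t in self.body if predicate(dict(t))])

    def project(self, *attrs: str) -> "Relation":
        heading = {a: self.heading[a] for a in attrs}
        return Relation(heading, [{a: dict(t)[a] for a in attrs} for t in self.body])


# 3 and 4. A relation variable and assignment: a plain Python name and `=`.
employees = Relation(
    {"name": str, "dept": str},
    [
        {"name": "ann", "dept": "ops"},
        {"name": "bob", "dept": "ops"},
        {"name": "cy", "dept": "dev"},
    ],
)
departments = employees.project("dept")  # duplicates collapse: two tuples remain
```

Nothing in the sketch says how the tuples are stored or indexed, which is the point: the model is defined purely in terms of types, variables, assignment, and operators.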

Unfortunately, in our industry, the term almost exclusively means a model of which information is relevant to particular business cases. Such models used to be called Conceptual Schemas, and they are the first step of the classic data model progression (Steel 1975):

Conceptual schema -> Logical schema -> Physical schema [1]

What are those? I can’t do better than (Pascal 2016)

Think of a conceptual model as the territory, the logical model as its symbolic representation on the map and the map print and medium (paper, plastic, screen) as the physical model.

How about the Data Model, how does it fit in this metaphor?

The data model is the map legend that provides the mapping symbols and their correspondence to the elements of the territory (e.g., cities, highways, forests and so on) they symbolize on the map.

What is wrong with the book’s definition?

(Kleppmann 2017) does not provide an explicit definition. The closest the book comes is this paragraph:

Most applications are built by layering one data model on top of another. For each layer, the key question is: how is it represented in terms of the next-lower layer?

My reading of this, given the rest of the book’s chapter on Data Models, is that for the author any particular implementation of a higher abstraction in terms of a lower one counts as a Data Model. So the author refers to all four models (conceptual, logical, physical, and the data model itself), and to any concrete instance of them, by the same term.

Why does this matter?

I hope the consequences of such confusion are clear to the reader. If not, consider the advice of (Pascal 2016):

Referring to all four as data models, or using the terms interchangeably blurs the important differences, reflecting common confusion of levels of representation, namely

with costly consequences.

A single example from the book should suffice, I think:

There are several driving forces behind the adoption of NoSQL databases, including:

Here, the author is confusing a Data Model (the relational data model) with physical concerns (scalability and throughput), which might lead to wrong (and very costly) technology and business decisions.
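
One way to see the distinction is the following Python sketch, in which the same logical relation is backed by two different physical layouts. The classes `RowStore` and `ColumnStore`, the function `large_orders`, and the sample data are all hypothetical, chosen only for illustration. The query is written against the logical relation and never mentions the storage layout, so swapping layouts for performance reasons leaves it untouched.

```python
from typing import FrozenSet, List, Tuple

# Hypothetical logical structure: a relation of (customer, amount) tuples.
Row = Tuple[str, int]


class RowStore:
    """Physical layout A: tuples kept together, one after another."""

    def __init__(self, rows: List[Row]):
        self._rows = list(rows)

    def relation(self) -> FrozenSet[Row]:
        return frozenset(self._rows)


class ColumnStore:
    """Physical layout B: one list per attribute."""

    def __init__(self, customers: List[str], amounts: List[int]):
        if len(customers) != len(amounts):
            raise ValueError("columns must have the same length")
        self._customers = list(customers)
        self._amounts = list(amounts)

    def relation(self) -> FrozenSet[Row]:
        return frozenset(zip(self._customers, self._amounts))


def large_orders(relation: FrozenSet[Row], threshold: int) -> FrozenSet[Row]:
    """A logical-level query: it sees only the relation, never the layout."""
    return frozenset(row for row in relation if row[1] >= threshold)


row_store = RowStore([("ann", 120), ("bob", 40), ("cy", 300)])
column_store = ColumnStore(["ann", "bob", "cy"], [120, 40, 300])

# Both layouts present the same relation, so the query result is identical.
same_answer = large_orders(row_store.relation(), 100) == large_orders(
    column_store.relation(), 100
)
```

Scalability and throughput are properties of choices like `RowStore` versus `ColumnStore`, not of the relation they both present.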

References

Codd, E. F. 1980. “Data Models in Database Management.” SIGPLAN Not. 16 (1): 112–14. https://doi.org/10.1145/960124.806891.
Date, C.J. 2003. An Introduction to Database Systems. 8th ed. USA: Addison-Wesley Longman Publishing Co., Inc.
Date, Chris. 2015. SQL and Relational Theory: How to Write Accurate SQL Code. Paperback. O’Reilly Media.
Kleppmann, Martin. 2017. Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. Paperback. O’Reilly Media.
Pascal, Fabian. 2016. The DBDebunk Guide to Misconceptions About Data Fundamentals. Database Debunkings.
Steel, Thomas B. Jr., ed. 1975. “Interim Report: ANSI/X3/SPARC Study Group on Data Base Management Systems 75-02-08.” Bulletin of ACM SIGMOD 7 (2): 1–140. http://portal.acm.org/toc.cfm?id=984332.

  1. Schemas are synonymous with models in this context.