There’s a part of me that’s always striving for conceptual perfection in software, and that part is consistently let down.
For example, let’s say I need to store an enormous amount of data with the standard ACID guarantees. The natural choice would be to store the data using a SQL server. SQL allows you to express a wide range of data validity constraints, from as simple as guaranteeing a value is not null to as complex as using a trigger to verify that the data meets some complicated inter-table validation scheme. Furthermore, SQL will guarantee that invalid data will never get into the database — the attempt to insert invalid data will fail. However, as wide-ranging as SQL’s data validity constraints are, they are (and likely always will be) incomplete. For example, on PostgreSQL, I cannot express the constraint that a particular column must contain XML which validates against a specified schema.
The problem seems to be that data validity is an arbitrarily complex concept. For example, even if a database were to support the above-mentioned XML schema validation, what if I were to say that the XML is actually XHTML and all links within the document must be valid? What if ensuring data validity requires me to to pull, calculate, and correlate data from many different, arbitrary sources? It seems that conceptually the only way to allow arbitrarily complex data validity requirements would be to allow the execution of arbitrary code. However, that would have serious implications for the SQL server, not the least of which are performance and reliability.
The only real alternative seems to be to create an out-of-band, external data validation method. Since losing the assumption that the data in your SQL server is always valid often makes writing consumers of the data much more difficult, it seems we should focus on making sure the bad data never gets in there in the first place. Typically this implies a few things: forcing all database input to go through the complex data validation rules, ensuring all intra-database modifications never create invalid data, and often creating a method for detecting and correcting the inevitable bad data that gets into the database.
The theoretical side of me then says: “Wait a minute, why do we have two places where we are checking for data validity: the data input side and the database constraints (primary keys, null checks, etc.)? As the input-side external data validation rules are written in code, then can express any arbitrary validation requirements, so why shouldn’t we check for null there and get rid of the database constraints?” Conceptually this seems like it would work, but practically it gives me the shivers, in that I trust myself to notice and correctly write a ‘NOT NULL’ constraint on a database column more than I trust myself to correctly write the code to do it in all cases.
Considering these practical concerns, it seems more sensible to have a ‘layered’ data validation approach, where each layer chooses some level of futher data validity it wishes to ensure. It is redundant, but not fatal, for a subsequent layer to recheck a validity constraint a preceding layer ensured.
It seems the idea behind ‘layering’ data validation spreads beyond just databases. One could conceptualize the checks the compiler makes in a statically-typed language such as type safety, syntax, etc. as a first layer of data validation, in contrast to a dynamically-typed, runtime-based language. However, as any fan of Python can tell you, relaxing these data validation constraints can actually be an advantage in some cases. The same is true for databases: validity checks have rapidly increasing costs and diminishing returns. Therefore, as much as I’d like to have perfect, input-side database validity checks that always guarantee the data in the database is valid, perhaps the goal of perfection is not worth the cost.
Recent Comments