Validity

Programming No Comments »

There’s a part of me that’s always striving for conceptual perfection in software, and that part is consistently let down.

For example, let’s say I need to store an enormous amount of data with the standard ACID guarantees. The natural choice would be to store the data using a SQL server. SQL allows you to express a wide range of data validity constraints, from as simple as guaranteeing a value is not null to as complex as using a trigger to verify that the data meets some complicated inter-table validation scheme. Furthermore, SQL will guarantee that invalid data will never get into the database — the attempt to insert invalid data will fail. However, as wide-ranging as SQL’s data validity constraints are, they are (and likely always will be) incomplete. For example, on PostgreSQL, I cannot express the constraint that a particular column must contain XML which validates against a specified schema.

The problem seems to be that data validity is an arbitrarily complex concept. For example, even if a database were to support the above-mentioned XML schema validation, what if I were to say that the XML is actually XHTML and all links within the document must be valid? What if ensuring data validity requires me to to pull, calculate, and correlate data from many different, arbitrary sources? It seems that conceptually the only way to allow arbitrarily complex data validity requirements would be to allow the execution of arbitrary code. However, that would have serious implications for the SQL server, not the least of which are performance and reliability.

The only real alternative seems to be to create an out-of-band, external data validation method. Since losing the assumption that the data in your SQL server is always valid often makes writing consumers of the data much more difficult, it seems we should focus on making sure the bad data never gets in there in the first place. Typically this implies a few things: forcing all database input to go through the complex data validation rules, ensuring all intra-database modifications never create invalid data, and often creating a method for detecting and correcting the inevitable bad data that gets into the database.

The theoretical side of me then says: “Wait a minute, why do we have two places where we are checking for data validity: the data input side and the database constraints (primary keys, null checks, etc.)? As the input-side external data validation rules are written in code, then can express any arbitrary validation requirements, so why shouldn’t we check for null there and get rid of the database constraints?” Conceptually this seems like it would work, but practically it gives me the shivers, in that I trust myself to notice and correctly write a ‘NOT NULL’ constraint on a database column more than I trust myself to correctly write the code to do it in all cases.

Considering these practical concerns, it seems more sensible to have a ‘layered’ data validation approach, where each layer chooses some level of futher data validity it wishes to ensure. It is redundant, but not fatal, for a subsequent layer to recheck a validity constraint a preceding layer ensured.

It seems the idea behind ‘layering’ data validation spreads beyond just databases. One could conceptualize the checks the compiler makes in a statically-typed language such as type safety, syntax, etc. as a first layer of data validation, in contrast to a dynamically-typed, runtime-based language. However, as any fan of Python can tell you, relaxing these data validation constraints can actually be an advantage in some cases. The same is true for databases: validity checks have rapidly increasing costs and diminishing returns. Therefore, as much as I’d like to have perfect, input-side database validity checks that always guarantee the data in the database is valid, perhaps the goal of perfection is not worth the cost.

Raymond Chen’s The Old New Thing

Programming No Comments »

Raymond Chen, a developer for Microsoft, has one of the most informative and interesting Windows-related technical blogs out there called The Old New Thing. His post “A twenty-foot-long computer” describes an amusing story of how the Windows team tested Plug and Play using a 20-foot-long chain of PCI bridge cards, but I saw an even better anecdote in the post’s comments by Adrian Oney:

During the Windows 2000 project, we did something very similar. To test Plug and Play and to test the Driver Verifier, we created the “USB Cart of Death”. We started with a 2 level cart similar to what you’d see in a library.

About ten 8port hubs were wired together, and then every port was filled with some different type of USB device. A USB steering wheel adorned the back of the cart, and a USB radio provided the antenna. Two cameras were one the front. All power went to a USB UPS. The entire cart, completely mobile, came down to two cables (power, USB). The final USB cable was plugged into a USB PCMCIA card.

We’d plug in the card into a laptop, watch the OS start up the 50 or so devices on it, and then (before or after it finished) we’d uncerimoneously yank the PCMCIA card. If a blue screen occured or the Driver Verifier detected a bug, the appropriate developer would be asked to look at the machine. In the meantime, the card would be wheeled to the next laptop, in hopes of finding a different bug.

WP Theme & Icons by N.Design Studio
Entries RSS Comments RSS Log in