Home
Catalog Quick search
Catalog Advanced Search
HomeArchivesDatabasesE-JournalsLibrary CatalogReady ReferenceServicesSite Search
Library Services

Medical Informatics I: Principles of Database Design

Transcript Part V: Questions and Answers:

Question: Is a parent/child relationship always a one-to-many relationship?

Answer: No. One could have that instance, for example, where a governor governs one state and a state has only one governor. You could have a one-to-one relationship between records in a system that would be considered parent/child, but it is unusual. Because when that occurs, most commonly you just embed it in the unit record. You might have reasons to do that. Almost always the notion of parent/child relationships is the allowance for the possibility that there will be a one-to-many. Even though in any given instance, for example, there might be just one author of an article.

Question: Do you think that the relationships parent/child, in a unique way for database creation, is different from say parent/child relationships in a thesaurus? So from a point of view for instance, of an author...the author could be the main table in another type of record. So you are saying any record that is linked to whatever...is defined as the main table? ...is the child of that parent?

Answer: So the question is, "Are all parent/child relationships to main records? No actually you can have....certainly wherever you have a one-to-many relationship, that is generally called a parent/child relationship. So, for example, the relationships between a main MeSH heading and multiple subheadings would be considered a parent/child from the perspective of the main MeSH heading. And not from the perspective necessarily of the main record object of this data system. So you can have these parent/child relationships nested. They don't always mean that the child has this child relationship to the core main record.

Question: But it is a logical relationship? It is not based on the actual content of the information. In other words "author" is in this situation, a "child" record but the main record in the concept of the author do not have any semantic relationships to each other.

Answer: Actually they do actually have a semantic relationship, and in this case actually the semantic relationship between the main record and the author record is actually embedded in the name of the table, because the name of the table is actually a role. And if you think about it...and in other relational systems you might depict that as a role. And in fact let's think about Medline in the absolutely correct world. And that is where every author has an author unique identifier, right? We don't have any of this name ambiguity stuff. Right? The first time you ever published something you get an identifier you get to use every time you publish future things. And in that model of Medline, you would actually have a person record, which is you as a human being with your unique identifier as a human being. And then the author table would actually be a linking table that had a role of author with respect to a Medline unique identifier citation.

And that way we would have this separation of items in this model, the role is actually the title of the table. But that is not the correct real world because these represent people that have other things they do in the world. So if we were trying to model the real world...you know it is a very interesting aspect that you raise...that this one-to-many relationship does hold systemically between the definitions of the object. Even if there were no instances of Bob Greenes and no instances of author, that one-to-many relationship is trying to model something that holds in the abstract, the semantics of the relationship between a situation and it's authors in this case. So it is trying to model the objects, not the instances of the objects.

Question: At what point do you decide how to use the (inaudible) A single line in your main. Now okay so then later if I wanted to find all of the JAMA articles I could presumably filter my query looking for JAMA. Why not make a sub-record for that? There is balance there.

Answer: Yeah, the balance is in the envisioned uses of the data. The question is, "Well, why make any of these singly occurring characteristics of a citation record? Why not put them all in separate tables so that the logical record is...everything is linked by the unique identifier?

That gives you, for example, this easy search ability, where you just choose the table you want to search and you have simpler query language. The tradeoff is the IO complexity, that is, how many different tables do you have to open and read to get one logical record? I mentioned we had 65 in GenBank and that was too much. In order to get one display of one GenBank record to read the corresponding rows in 65 tables and their associated indexes, which meant you had to open 130 files to get one record...simply made the system unusable in the terms of its performance.

So what happens is, you tend to use these fully normalized relational systems for building and editing and maintaining the data, and then you collapse it into what are called "denormalized forms" for purposes of delivering the product.

And that raises the issue of a very important thing...of how you would, in fact, optimize the performance for these very complicated records, yet still allow yourself one-to-many relationships in a data model, because...for example is anybody here from Columbia? So, some of the Columbia faculty discovered that in their world they had definitions for almost 10,000 fields, if you will, and many of them could occur once, or not at all, and they had this kind of hopelessly large data dictionary. And they were faced with, "How would we possibly create a clinical information system that have predefined all of the fields that might possibly occur in all of our clinical data?"

One could say, "Is there a more efficient--as opposed to decomposing this relational Medline model--is there a more efficient one-to-many relationship that I could establish, that would allow me to put all these multiply occurring values in one table? Got any ideas?

The answer is, you can do it, and I will show you. And it is a particular form of data modeling called "entity attribute value" or EAV data modeling. And its power is its flexibility of representation and it turns out EAV modeling is exactly what Columbia chose.

So I am going to create another version of this same table here. Let me define the parts I want to export here. Or maybe what I am going to do is just create a second page of this. That way you will get it all in your handout. This doesn't want me to do that.

Alright in that case, let's us just create a page break here, and we will title this one...rather than simply a relational model...we are going to call this an "entity"--this is, by the way, the last thing we are going to teach in this session--or so called EAV relational model for Medline. And it is a very simple adaptation of this parent/child relationship that allows one to have the same main record, that is singly occurring, must occur attributes happen only once.

So there is the main Medline record, but there is a single relationship with a "child" table that is... some people call...a canonical form, a canonical table. And the canonical table, in fact let me pick up the arrow here, so we have this one-to-many relationship. I will copy those and move down to page two, Paul Harvey would say. And instead of having these individually named tables we would have a table that is something like EAV data. And each record in this table has got an attribute name and an attribute value.

So for example, the attribute name could be [inaudible] and you can see that you can essentially model each of the elements of the sub-record within a "child" table by not only modeling just the data itself, that is, not only having an "instance" of the data, but essentially converting the field name into another column in the table.

And this allows us to reuse a general purpose table structure, to have one column that is the pointer to the main record, a second column which is the field names, and the third column which is the feed value. Now you can detect already that when you get embedded one-to-many relationships where you have the nested one-to-many-to-many, this model gets in trouble because now it begins to start having trouble establishing the twople where you bind the sub-record...the child of the child...But for singly occurring, that is for a single parent/child relationship, where you have a one-to-many relationship, this is a remarkably powerful adaptation of a relational data modeling that takes advantage of the fact that current relational systems are very, very good at reading extremely large tables that have simple structure.

So things like Oracle and Sybase and these other forms can buffer thousands of records at a single disc read, especially if the things have been physically sorted in a way that allows this kind of EAV data modeling, although it is inherently inefficient. Because remember...instead of accessing just a single record for an author, now we are having to access each of the field names for that author table. So you might have as many as 5 or 10 or 20 times as many rows that you have to read in order to get the same logical record.

But the tradeoff is that you can get a remarkable flexibility, because now in order to define a new field in your database, you don't have to change the structure at all. All you have to do is create a new value that goes in the attribute name column, so it gives the ability to have data-driven dictionaries that essentially provide the same kind of functionality that in the old days...you would have to modify the table structure to create a new field. Now, all you do is modify the data dictionary and you have a legal value for a new variable that you want to store on the database.

So this EAV relational model, although it causes this explosion where now you have many, many, many records that make up a logical record. It takes advantage of the growing power and sophistication of database technologies and very, very fast disc, so that it looks, in terms of performance, not much different than the classical unit record that was all packed together. And you could read from front to back in a single disc read.

If you look at the major clinical information systems, because they have such a complex data dictionary--you have so many different values, potential variable names they need to store--increasingly you will see these implemented with this, so-called EAV, or "entity attribute value" relational modeling. So let me ask Ed, are your current clinical systems EAV modeled?

Ed: They are usually mixtures, but not really.

So in the best world, again, the EAV models help you when you do not have embedded substructure. But as soon as you get embedded substructure, now you have to define separate tables. So this doesn't solve the MeSH subhead problem for you, but it could solve the problems related to this direct relationship...parent/child relationships between the core records.

And we are using this, in fact, for microarray data, where we essentially...every gene expression value is the intersection of one row and one column. We can use this to modelrow/column intersections--out of spreadsheets and create databases that have literally millions of gene expression values from hundreds of different experiments, where the issue is flexibility of representation. And we never know, in a sense, what the variable names are going to be depending upon the experiments. So this is a trade-off of representational power and flexibility compared to efficiency and pre-structured record.

So those cover the general principles of how relational systems get built, and I hope, give you a sense of...when you are confronting the task of a new system design..and, "How am I going to lay out my structure in a way that it is searchable, maintainable and stands the test of time?"... that these principles of making tables that look like the real world objects that they represent, pays back in the effort that it takes to do it.

The bottom line then, in database design, is that I hope you believe somewhat more now that the database model is the most critical aspect of design and function. That those models, really, in their best situations, should reflect the real world objects and those relationships of those real world objects, to be durable. And that the correct data model subserves and outlasts the applications including many that are not anticipated, and the future systems applications developers will thank you, not knowing it was you who did it correctly, for creating a model that looks like the world they have to deal with. And with that, I would be happy to answer any questions.

Question: inaudible

Answer: The difference...the EAV model looks like XML. In some ways that is true, although XML has this companion where every unit record that has tagged data has a document type description...a DTD...which is more or less the data dictionary for each of the tags that can occur legally inside of it. So, the XML model is a little looser and more text-based than the EAV, which is a structured database table where every record has exactly the same unit structure.

Question:...which was similar to what you did using XML and basically the idea here is to create a document for every document that is used in health care, from imaging, from radiology interpretations, to batteries of laboratory testing and everything else. My suspicion is those ultimately will become stored the way the data is will be stored using the XML structure.

Answer: So you are going to hear more about HL7. It is the worldwide most rapidly growing standard for packaging data so it can be mailed in electronic envelopes, messages between systems. It was a messaging standard and the latest version HL7--version 3 is built on XML--and the tour de force article describing it was written by Bob Dolan from Kaiser Permanente is going to be published in the next couple of months in JAMA. So that this introduction to XML data modeling in health care is going to grow in importance. XML, however...I think it is important to recognize that the principle focus of XML will tend to remain. I think the exchanging of data between this similar system that is being able to transport it in a way that does not lose content. Because for the reasons I mentioned, of complexity and the tagged verbosity, it isn't always the case that you want to store. In fact it is seldom the case you actually want to store data as XML locally. You may want to transform it into something much more compact and efficient but when you send it to others, so other systems can understand what the content of your messages is. XML and HL7 are becoming the tour de force world standard for doing this.

Okay, so you got through 5 and we are right on time, and we will go into a different mode for this evening's workshop. This evening is all hands-on. You get to play and type and point and click and we will be editing a class Web page so even if you are Webmasters already, and know all about this stuff, we will ask you to at least stop in and create your own page for which we took your picture this afternoon, and I think we will have a lot of fun. So we will plan to see you back at 7 o'clock and hopefully you have some time to get some fresh air and exercise and eat yet another time.

< previous section   |   next section >


More: Medical Informatics I: Principles of Database Design

LIBRARY INFO | STAFF DIRECTORY | SITE INDEX | MBL HOME | WHOI HOME | SEARCH | PRIVACY
copyright © 2006 by The MBLWHOI Library