We have been using the terms “data” and “information” throughout this book, but do we know what they really are? This chapter lays the foundation for understanding how data combines with predicates to form information, which is the topic of the next chapter. Predicates generalize information. Meaning is given by information, and meaning is the focus of semantics.
Information
When we attempt to define the words “information” and “data” in ways that are more precise and yet compatible with natural language, we encounter problems right away. Consider these definitions from Merriam-Webster.
information : FACTS, DATA
fact : a piece of information presented as having objective reality
Here we have a circularity: Information consists of facts, and a fact is a piece of information! Let’s see if the word “data” can help us escape the circle.
data : information in numerical form that can be digitally transmitted or processed
So data is a kind of information: it’s numerical information. That’s fine, but we still don’t know what information is!
Of course, I have deliberately selected these definitions from several possibilities Merriam-Webster gives for each of these words, in order to show that, to some extent at least, a good definition of “information” is hard to find. A study of the alternative definitions of these words begins to widen the circle to include the word “knowledge”, among others, but there is no strong definition of “information” in this dictionary. Our task, then, is to develop a definition of “information” that is precise, consistent, and useful as a building block.
It is at least indisputable that the word “information” is a mass noun. It is much like the word “water”: when we use it, we aren’t referring to any particular quantity of water, and we certainly aren’t counting water molecules; we just mean water en masse.
The human race did not need to know about water molecules before we could benefit from or harness the power of water, but our understanding of the physical world took a great leap forward when we learned of the existence of water molecules, and in fact this understanding helped to usher in the modern era where we have much greater control over the physical world. If we are to understand information in a deep way, and truly gain control over it, it is essential that we understand information at the molecular level, so to speak. We already get value out of information, but by understanding information at the molecular level, we will enable even greater insights and accomplishments. We need to answer the question, What is the fundamental piece of information?
For the answer, I will look to the field of mathematical logic, specifically propositional logic and first-order predicate logic, for the terms proposition and predicate. We gain a tremendous advantage by linking the definition of information to the field of logic, because we can harness all of the proven techniques of formal logic systems to assist in information analysis and processing. Propositional logic and predicate logic are what link the fields of data and semantics.
Merriam-Webster defines the word proposition as follows:
proposition 2 a : an expression in language or signs of something that can be believed, doubted, or denied or is either true or false
For example, the statement, “It is raining outside right now,” is a proposition, because at this very moment the statement is either true or false—or at least one may argue about whether it is true or false. (Perhaps it is only drizzling.)
A proposition is the most fundamental piece of information. A collection of propositions constitutes information.
Let’s see what this means in practice. Here is a series of propositions, which I believe most of us intuitively consider to be, collectively, information.
(In fact, each of these propositions is a compound proposition, because each asserts more than one claimed truth. We will save consideration of decomposing propositions—moving from the molecular level to the atomic level—until a later time.)
Is Information Always True?
In natural language we sometimes speak of “false information”. The above definition of information, as a collection of propositions, allows for the possibility that information is false, since a proposition may be true or false. Thus, our definition of information enables us to use the word in this natural-language way. In contrast, the word “fact” carries with it the notion of truth. When a supposed statement of fact turns out to be false, we don’t call it a “false fact”; instead, we say that it is not a fact.
Merriam-Webster gives this definition of the word:
fact 5 : a piece of information presented as having objective reality; in fact : in truth
From this definition, we can say that a fact is a proposition (a piece of information) that is true.
Those familiar with fact-based modeling, which was reviewed in chapter 7, will recognize that the facts of fact-based modeling are propositions.
From Information to Data
Here is a proposition in the context of employment:
Employee #952 works in Department 4567 and earns a salary of $5000 per month.
We would expect the Human Resources department of a corporation to make many similar propositions. Here are a few more examples of such propositions:
Employee #956 works in Department 4567 and earns a salary of $4000 per month.
Employee #891 works in Department 4566 and earns a salary of $5000 per month.
Clearly these propositions, being of the same form, repeat a lot of text. By separating the unique parts of each proposition from the common parts, we can simplify the expression of this information. We will take the common parts of these propositions and place variables where the unique parts fit, as follows.
Employee #EmpId works in Department DeptNr and earns a salary of $SalaryMnthUsdAm per month.
We call a statement of this form a logical predicate, or predicate for short. (This definition of “predicate” is the one used by logicians, and is different from its use in the field of semantics and in ordinary English. We will look at both of those other meanings later on in this chapter.) We will take the unique parts of these propositions and arrange them in a table such that it is obvious how they are to be substituted into the predicate, in place of the variables, to “re-constitute” the original propositions.
The values in Table 14-1 are called data, which is a plural word. Each value individually is a datum.
Table 14-1
| EmpId | SalaryMnthUsdAm | DeptNr |
|-------|-----------------|--------|
| 952   | 5000            | 4567   |
| 956   | 4000            | 4567   |
| 891   | 5000            | 4566   |
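The separation of predicate from data can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names PREDICATE, ROWS, and reconstitute are invented here, and Python's string formatting stands in for the logician's notion of binding a variable.

```python
# The predicate: the common text of the propositions, with named variables.
# (PREDICATE, ROWS and reconstitute are illustrative names, not from the text.)
PREDICATE = (
    "Employee #{EmpId} works in Department {DeptNr} "
    "and earns a salary of ${SalaryMnthUsdAm} per month."
)

# The data from Table 14-1: one set of values per original proposition.
ROWS = [
    {"EmpId": 952, "SalaryMnthUsdAm": 5000, "DeptNr": 4567},
    {"EmpId": 956, "SalaryMnthUsdAm": 4000, "DeptNr": 4567},
    {"EmpId": 891, "SalaryMnthUsdAm": 5000, "DeptNr": 4566},
]


def reconstitute(predicate: str, row: dict) -> str:
    """Bind each variable in the predicate to its datum, yielding a proposition."""
    return predicate.format(**row)


propositions = [reconstitute(PREDICATE, row) for row in ROWS]
for proposition in propositions:
    print(proposition)
```

Each printed line is one of the original employment propositions, rebuilt from a single predicate and one row of data.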
Let’s examine Merriam-Webster’s definition:
datum 1 a : something given or admitted especially as a basis for reasoning or inference
In this context, a datum is given to a predicate as a value for one of its variables. Observe that a datum may have some intrinsic meaning, but its full meaning in context is known only after it is substituted into its related predicate. For example, the number 5000 could indicate many different things, including a monthly salary, the number of people attending a concert, or the cost of a computer. We wouldn’t know which unless we knew the predicate with which it is associated.
It is also important to understand that a value is not a datum unless it is intended to be bound to a predicate’s variable. For instance, the value 39 is just a number. If, however, the value 39 is associated with some predicate variable (for example, 39 is the x in “I just bought x apples”; or 39 is the y in “I am y years old”), then in that context 39 is a datum. This is important, because it reveals that the terms datum and data connote roles that values play. There’s a lot of over-use of the word “data” to describe values and/or symbols regardless of whether or not those values play the role of being bound to some predicate’s variable. We’ll talk a lot more about roles that data play in chapter 15.
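A small Python sketch may make the idea of a role concrete. It reuses the two example predicates from the paragraph above; the variable names value, apples, and age are assumptions made for the illustration.

```python
value = 39  # on its own, just a number

# The two example predicates, each with a single variable.
apples = "I just bought {x} apples."
age = "I am {y} years old."

# Only when the value is bound to a predicate's variable does it play
# the role of a datum, and the same value can play that role in either.
apple_proposition = apples.format(x=value)
age_proposition = age.format(y=value)

print(apple_proposition)
print(age_proposition)
```

The value 39 is identical in both cases; what differs is the predicate it is bound to, and therefore what it means.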
Data en Masse
We tend to deal with data en masse. That’s because the value of data processing lies in the capability of computers to process large quantities of data. This is the reason we see the singular form of the word, datum, so seldom. It is also the reason that the word “data” has come to be treated as a mass noun, like “water” and “information”: we treat it not as a plural noun (“the data are . . .”), but as a singular noun (“the data is . . .”). In this perfectly legitimate usage, we ignore that any quantity of data is composed of many elemental particles, in the same way that we ignore that any quantity of water is composed of many molecules. We need to accept both the singular and plural usages of the word “data”, reserving the plural usage for more technical contexts where we are paying attention to the fact that data is composed of multiple atoms, each of which is a datum.
In order to deal with data en masse, we separate data from information. That is, we reduce a number of propositions of the same form to a single predicate and a set of data per proposition. We then store the data in a database management system, which is a computer system specifically designed to manage large quantities of data. In order to recover the original information, we must retrieve the data from the database system and marry it to its associated predicate. This latter operation is rarely done in an automated fashion. That is, it is usually done by humans, outside any computer system. For instance, a worker in a human resources department might bring up an employee’s record, and see, on a screen, values labeled Employee ID, Salary, and Department. Because the values are appropriately labeled on the screen, the human computer user understands the implicit predicate and re-constitutes the original proposition in his mind (“Employee #956 works in Department 4567 and earns a salary of $4000 per month.”). Rarely does any so-called information system represent this whole proposition. (Rarely does any so-called information system even represent the predicate, but that is a topic for a later section.)
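As a rough sketch of what automating that last step could look like, the following Python stores the data in an in-memory SQLite database and then marries one retrieved row to its predicate. The table name employee and the constant PREDICATE are assumptions for the illustration; the point is only that the predicate has to be kept somewhere outside the table of values.

```python
import sqlite3

# An in-memory SQLite database stands in for the database management system.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets each row be read by column name
conn.execute(
    "CREATE TABLE employee (EmpId INTEGER, SalaryMnthUsdAm INTEGER, DeptNr INTEGER)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [(952, 5000, 4567), (956, 4000, 4567), (891, 5000, 4566)],
)

# The predicate is held outside the database, as it usually is in practice.
PREDICATE = (
    "Employee #{EmpId} works in Department {DeptNr} "
    "and earns a salary of ${SalaryMnthUsdAm} per month."
)

# Retrieve one row of data and bind it to the predicate's variables.
row = conn.execute("SELECT * FROM employee WHERE EmpId = ?", (956,)).fetchone()
proposition = PREDICATE.format(**dict(row))
print(proposition)
```

Here the program, rather than the human reader, re-constitutes the proposition about Employee #956.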
Variable Names
To make things easy for ourselves, we humans typically try to choose variable names that remind us of what the variables stand for. In the example above, the variable names EmpId, DeptNr, and SalaryMnthUsdAm remind us that these variables stand for employee ID number, department number, and monthly salary, respectively. But the computer attaches no such meaning to variable names; in fact, it attaches no meaning to them at all. As far as the computer is concerned, the predicate could be
Employee #X works in Department Y and earns a salary of $Z per month.
and everything would be just fine.
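This can be checked with a short Python sketch. The names descriptive and terse are invented for the illustration, and the dollar sign from the original predicate is kept in both versions.

```python
# The same predicate written twice: once with descriptive variable names,
# once with the arbitrary names X, Y and Z.
descriptive = (
    "Employee #{EmpId} works in Department {DeptNr} "
    "and earns a salary of ${SalaryMnthUsdAm} per month."
)
terse = "Employee #{X} works in Department {Y} and earns a salary of ${Z} per month."

# Binding the same data to either version yields the same proposition.
from_descriptive = descriptive.format(EmpId=952, DeptNr=4567, SalaryMnthUsdAm=5000)
from_terse = terse.format(X=952, Y=4567, Z=5000)

print(from_descriptive == from_terse)  # prints True
```

The variable names matter to the human who must supply the right value for each variable, not to the machine that performs the substitution.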
Summary
Keep in mind that
A proposition is the fundamental piece of information; to put it another way, one or more propositions = information.
A predicate + data (as values for the predicate’s variables) = a proposition.
I like to say that data is dehydrated information; just add predicates.
Information and Data as Colloquialisms
What have been presented above are very precise and technical definitions for the words “information” and “data” that are quite different from many common uses of the words. These other uses remain legitimate. I believe you will see that most of these other uses relate in an approximate way to the tight definitions given above, usually by sharing some core concept. Let’s look at some of those other meanings.
Information en Masse
The word “information” is sometimes used to refer to insights gained by analyzing some quantity of data. For example, retailers often analyze how well a certain product is selling and correlate this to, for instance, the price of the item and the geographical region in which it is sold.
Given “information” as defined above, the results of analyses are certainly a kind of information. However, if we use the term “information” solely to mean the results of analyses, we lose the more fundamental capability to reason about information as a collection of propositions. Therefore we will keep the definition of “information” a tight one. We will use the term “analytics” or “insight” for the data or information obtained by analyzing data.
It’s Just Data
The term “data” is sometimes used as a pejorative term, to imply that there is insufficient meaning or value in some data or information, and that a context must be supplied for the data, or further analysis of the so-called data is required. It is sometimes said, “This is just data; we need information.”
That which is referred to as “just data” might really be data in the strict sense, in which case a context is definitely needed (more precisely, a predicate) in order to understand what the data indicates. By the definition above, data is separate from the context (a predicate) which gives it meaning.
If that which is referred to as “just data” is in fact information in the strict sense—a set of propositions—then the complaint is saying either that not enough supporting information has been supplied in order for the information to be useful, or that analysis of the information (usually a large quantity of information) is required in order to extract valuable insights from it.
Putting It All Together
Consider this progression from data to information to analytics.
just a number: 39
probably data: 39 degrees
probably data: 39 degrees Celsius
information: The patient’s temperature is 39 degrees Celsius.
information: The outdoor temperature is 39 degrees Celsius.
analytics (a kind of information): The average high temperature in Tucson, Arizona in the month of July for the last twenty years has been <data: list of monthly average temperatures>
insight (a kind of information): It sure is hot in Tucson in July!
The pure number “39”, without any other context, cannot be assumed to be data: it’s just a number. When the pure number “39” is combined with units of measure—degrees (of something unspecified) or degrees Celsius—we can begin to suspect that the values are data, because it would be unlikely that temperatures would be presented apart from some context. But without further information, such as a predicate for which the values were suited, even these measurement values are just values.
When a proposition is made—that is, when an assertion is made that can be agreed or disagreed with, or believed, doubted, or denied—then we have information. Before that line is crossed, the values presented do not form any kind of proposition, and therefore there is no information, nor can we confidently assert that the values are data.
Analytics are information derived from other information or data, and insights are information derived from analytics.
“Unstructured Data” and “Semi-Structured Data”
Although corporations store vast quantities of data in databases, a great deal of storage is occupied by artifacts designed primarily for human consumption, such as text documents, graphical presentations, audio and video recordings, etc. These kinds of artifacts have come to be known by the unusual term unstructured data. This term is relative to the term structured data, so before we can understand why some things are called unstructured data, we must understand what structured data is.
Table 14-1 above provides a structure for data. It is clear from the context surrounding the table that the EmpId column should only contain employee IDs, the DeptNr column should only contain department numbers, and the SalaryMnthUsdAm column should only contain monthly salaries. To a limited extent, database systems can enforce these requirements by preventing the insertion of data that violates certain rules. For example, given a second table of department numbers, a database system can ensure that the DeptNr column of Table 14-1 only contains values found in the table of department numbers.
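The department-number rule can be sketched with SQLite from Python. This is a minimal illustration, not a recommended schema: the table names department and employee, and the department number 9999, are assumptions. Note that SQLite enforces foreign keys only after the PRAGMA shown below is issued.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when asked

# The second table: the list of valid department numbers.
conn.execute("CREATE TABLE department (DeptNr INTEGER PRIMARY KEY)")
conn.execute(
    """
    CREATE TABLE employee (
        EmpId INTEGER PRIMARY KEY,
        SalaryMnthUsdAm INTEGER,
        DeptNr INTEGER REFERENCES department (DeptNr)
    )
    """
)
conn.executemany("INSERT INTO department VALUES (?)", [(4566,), (4567,)])
conn.execute("INSERT INTO employee VALUES (952, 5000, 4567)")  # accepted

try:
    # Department 9999 is hypothetical and absent from the department table.
    conn.execute("INSERT INTO employee VALUES (999, 5000, 9999)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)  # prints True
```

The database refuses the second insertion because its DeptNr value does not appear in the table of department numbers.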
One of the chief advantages of database systems is that they can be used to impose structure on data. A structure organizes data so that it is easier to understand, easier to process efficiently, and easier to verify that it is correct. A typical database has hundreds of tables, each of which imposes structure on the data it contains. That’s a lot of structure!
In contrast, the software that is used to create and maintain text documents, audio recordings, etc., does not impose any structure on those artifacts other than the structure necessary to ensure that they are in fact text documents, audio recordings, etc. This is why they are called unstructured. So-called unstructured data, then, needs to be considered at two levels. At the lower level, there is a structure. Text is represented by data that is interpreted as text; audio is represented by data that is interpreted as sound; video is represented by data that is interpreted as a sequence of pictures plus audio, etc. There is structure expected of and imposed on the data at this lower level. At the higher level, no additional structure is imposed (in general). Whether the text forms sentences, the audio is meaningful, or the images are interesting, is not something that software guarantees. At this higher level, the “data” is unstructured; or, to state it more precisely, the lower-level data, which is structured, expresses things which may or may not have structure.
Despite the lack of guarantees, generally text documents contain complete sentences, audio files contain meaningful sounds, etc.; in other words, unstructured data contains information. It is also common for unstructured data to express what is, strictly speaking, data. For instance, page numbers on a multi-page text document are, in fact, data, because the numbers generally appear in a structured manner so that they can be recognized as page numbers and not part of the text. There is an implicit predicate around a page number that says, The page on which this page number appears is the nth page in this document.
Additionally, unstructured data can contain what is potentially data. For example, consider residential mortgage agreements. Such agreements are often 40 pages long, and contain many statements (which are propositions) that give information about the responsibilities of mortgagor and mortgagee, the term of the loan, the schedule of payments, etc. If one had a pile of, say, 100 such mortgage agreements, all issued by the same lender in a month’s time, one would discover that the bulk of the 40 pages were identical. If one separated out the customer-specific variable parts of the agreements from the non-varying parts, and put placeholder variables where the customer-specific variable parts should be inserted, one would have customer-specific data plus (mostly) non-customer-specific predicates: we would have given structure to the otherwise unstructured data.
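The separation described above can be sketched in Python on a toy scale. Everything in this example is hypothetical: the one-sentence "agreements", the borrower names, the amounts, and the variable names Name, PrincipalUsdAm, and TermYears are invented, and a real agreement would need far more careful handling than a single regular expression.

```python
import re

# Hypothetical one-sentence "agreements"; real ones run to many pages.
agreements = [
    "Borrower Ann Lee agrees to repay 250000 USD over 30 years.",
    "Borrower Raj Patel agrees to repay 180000 USD over 15 years.",
]

# The non-varying text becomes the predicate, with placeholder variables
# where the customer-specific parts belong.
PREDICATE = "Borrower {Name} agrees to repay {PrincipalUsdAm} USD over {TermYears} years."

# A hand-written pattern mirroring the predicate, one named group per variable.
pattern = re.compile(
    r"Borrower (?P<Name>.+) agrees to repay (?P<PrincipalUsdAm>\d+) USD "
    r"over (?P<TermYears>\d+) years\."
)

# Extract the customer-specific data: one row per agreement.
rows = [pattern.fullmatch(text).groupdict() for text in agreements]
for row in rows:
    print(row)

# Binding the rows to the predicate again recovers the original text.
rebuilt = [PREDICATE.format(**row) for row in rows]
```

The extracted rows could now be stored in a table, with the predicate kept once rather than repeated in every document.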
There are, of course, artifacts that would not easily lend themselves to a separation of data from other content; for example, video and audio recordings containing little repetition. Referring to these as “unstructured data” makes the term a bit of a misnomer. “Unstructured information” would be a better moniker.
“Semi-structured data” refers to information stored in a way that separates data from other content, but not in a system such as a database system that enforces a stricter pre-defined structure on the data. A spreadsheet is an example of semi-structured data. So is an XML document, where the XML markup has added some structure to an otherwise unstructured string of text.
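A short Python sketch shows what "supports but does not enforce" can mean for XML. The fragment below is hypothetical, including the Nickname element, and it is parsed without any schema, so only well-formedness is checked.

```python
import xml.etree.ElementTree as ET

# A hypothetical XML fragment. The markup names each data item, but without
# a schema nothing requires that every record carry the same items.
doc = """
<employees>
  <employee><EmpId>952</EmpId><DeptNr>4567</DeptNr></employee>
  <employee><EmpId>956</EmpId><Nickname>Sam</Nickname></employee>
</employees>
"""

root = ET.fromstring(doc)

# Each record becomes a mapping from element name to text value.
records = [{child.tag: child.text for child in emp} for emp in root.findall("employee")]
for record in records:
    print(record)
```

Both records parse without complaint, even though the second lacks a department number and carries an item the first does not have.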
Data Object
Whether or not something is a datum depends on the use to which the entity is put. As the example above showed, the number 39 is just a number unless it is known that it is intended to be substituted for a variable in a predicate. Strictly speaking, then, there is no special “data object”. An object is a data object only if its states represent values intended for use in a variable that is part of a predicate.
One may construct objects for dealing with data in general, but then such objects will likely deal, not with individual objects representing individual values, but rather with more complex objects representing logical records, tables, and other data structures. Such objects are then indeed “data objects”, but modeling them generically and distinctly from non-data objects probably has no value unless one is designing a database management system. A value is data only if it plays the role of data. In our next chapter we will focus on roles played by data.
Chapter Glossary
proposition : an expression in language or signs of something that can be believed, doubted, or denied or is either true or false (Merriam-Webster)
information : a collection of propositions
fact : a proposition that is true or believed to be true
predicate : short for logical predicate
logical predicate : a statement containing variables which, when the variables are bound, yields a proposition
datum : that which is intended to be given to a predicate as a value for one of its variables
data : plural of datum
analytics : information derived from other information or data
insight : information derived from analytics
structured data : collections of data items stored in a database that imposes a strict structure on that data
unstructured data : data representing text, audio, video, or other artifacts, with no structure imposed on what they represent
semi-structured data : collections of data items stored in a way that supports but does not enforce a structure