Sizing up data

All this while we've been talking about the operations and have pretty much ignored the data that programs are typically associated with. In fact the SSI section had a few variables and I glossed over them on purpose because it was relatively easy to deal with operations by themselves. Now that we have a handle on the operations, however, its time to go back and give data its due.

What is Data?

First of all, what is data? To go back to the visualizations, data is what flows through the blocks and exercises them. It is the liquid in the Pipes analogy and the marbles in the marble run analogy. It makes the wheels turn and the flip-flop flip (or flop). It is what is used by the code. As a definition, therefore:

Data: Pieces of information.

Some data is input, some is generated by the program itself while it executes and some is output. Static languages pre-allocate some data for efficiency and allocate the rest while running; dynamic languages allocate most of their data during execution. Some data is static, some dynamic; some are scalar, others not. All these facets are ways of classifying data, but the core definition remains true in all: data is passive and acted upon by code.

The basic structure of data

Data comes in a myriad of variety: numbers, letters, amounts, text, percentages, addresses, report cards, census results and so forth. Data is routinely converted from one variety to another, interpreted in different ways, sometimes refers to other data and is also self-referential at times. The exact definition of what separates one "chunk" of data (to use a highly technical term) from another is somewhat subjective. However, in any particular sphere of application, there is usually a finite set of "smallest possible" chunks of data. For example, the smallest possible data chunks when talking about students' grades are their names, their grade and the year/semester that they got that grade. Their names could probably be broken further into first and last names, but that's about it. So after reduction we end up with the following as the smallest possible kinds of data chunks:

first name, last name, date of grade, grade

The "kind of chunk" as I've informally named it, is essentially one piece of Metadata - data about data; and once we start talking about data, it is impossible to avoid talking about metadata - its type or kind, the limits of values and so forth. As definitions, therefore:

Data: the smallest pieces of different kinds of information.

Metadata: Data about the data, eg, its kind or type.

Data can be grouped to form interesting collections:

List: An ordered collection of data of the same kind. Examples are an array of increasing numbers or a string of characters that form a name.
Set: An unordered collection of data of the same kind. An example is a collection of poll data.
Record: A collection of data of different kinds that forms a cohesive whole. An example is an address: it has a street address, typically a house or apartment number, a zip and so forth.
Tree: A hierarchy of data. An example would be a family tree.
Graph: A set of data that is related to each other in some way; depicted as nodes and relations between nodes. A typical example is a telephone network or a flight map.
... and so forth...

More interestingly, each of these collections - or Compound Data as I'll call them - can also contain other Compound data as their items.

Compound Data: Data that contains other data within itself - simple or compound.

With Compound Data, infinite kinds of data structures can be created using the basic data types.

Two additional kinds of data that we will naturally gravitate towards are References and Links. A reference is a a way to locate a piece of data - a unique identifier to it. A link connects two pieces of data; that is, it is a piece of data that stores a reference to another piece of data. The referred piece could be atomic or compound. In languages that support it, they're typically called pointers or references (although languages like C++ have both and they have different semantics - we will stick to the english meaning, therefore).

Reference: A datum that locates another piece of data - simple or compound. It is a unique identifier for that piece of data.

Link: A connection between two pieces of data. Typically it manifests as a reference to the second piece of data stored in the first piece.

To summarize:

Data are the smallest possible pieces of information
Metadata is data about data
Once we have data, we can start forming collections of them.
Two special kinds of data are references and links. References are unique identifiers to specific pieces of data and links point one piece of data to another by storing references.

Ok, so how do we go about sizing data? It seems intuitive enough to say that the sizes of collections are related to the size of their contents and we can calculate the size of all the data used using this intuition. But how do we size the smallest pieces of data?

Since data has infinite variety, sizing it is the quintessential "apples to oranges" task: for example, how do you compare an SSN to an address, to pick a random juxtaposition? In general, we cannot; and natural language reflects that: if a bag contains 3 apples and 2 oranges and we're asked how many things there are in the bag, the only thing we can say is that there are 3 apples and 2 oranges. In formulas now:

size(bag with 3 apples & 2 oranges) = size(3 apples) and size(2 oranges)             --(D1)

We could say that there are 5 fruits, but that's by extension of language (that is, we have invented the term "fruit" as we find it useful) and also because in the context of the question asked, the generalization is appropriate enough. So:

size(bag with 3 apples & 2 oranges) = size(3 fruit and 2 fruit)
                                    = size(3 fruit) + size(2 fruit)
                                    = 3. size(1 fruit) + 2. size(1 fruit)
                                    = 5 . size(1 fruit)
                                                                                     --(D2)

There's quite a few interesting bits here:

Unless things are of the same kind, the collective size can only be expressed as a full statement of the sizes of the individual collections, for example "3 apples and 2 oranges"
If we're able to find a common term for different types, the collective size can be expressed as a sum of the sizes of the individual collections, for example "5 fruit".
We still havent figured out the size of 1 fruit (just as we handnt for apples or oranges), but if we did know it, we can safely say that the size of the bag's contents is 5 times that size.
We're igoring the fact that the bag itself could have some size of its own for now because we're focusing on the contents.

Now, we could continue abstracting the types of data and call them "things" (like the question above does) and that would allow us to respond with either "fruits" or "things", not to mention being able to count things other than fruit - if they were added to the bag. So if a ball is added to the bag, we could say:

size(bag with 3 apples, 2 oranges and 1 ball) = size(3 things and 2 things and 1 thing)
                                              = 3 . size(1 thing) + 2 . size(1 thing) + 1 . size(1 thing)
                                              = 6 . size(1 thing)
                                                                                     --(D3)

If we kept going on, We could end up with something like "there are x billion atoms in the bag"! While accurate, this statement doesnt help us much in understanding exactly what we're talking about; which brings to light the following:

There is a spectrum of names that we can ascribe to data items that goes from the very specific to the very generic. The right balance depends on the domain and the specific application.
the scale of abstraction required to find a common term for measurement might be quite large, to the extent that individual data items may not be easily identified with that unit of measure.

These two points speak to the fact that excess abstraction clouds understanding. There is another impact, however, of making things generic: if the unit of measure (like thing above) is not a fixed number, the measurement is worthless: thing, for example could be anything - really small or really big - and that doesnt build confidence in it as a unit of measure.

With that caveat, (D3) can be rewritten in a generic form as:

size(collection of data) = size(collection of type 1) and size(collection of type2)  and size(collection of type3)
                           where type1 = apples, type2=oranges and type3=ball

                         = size(collection of base type1) 
                           and size(collection of base type1)
                           and size(collection of base type1)

                           where base type1 = thing

                         = size(collection of base type1) + size(collection of base type1) + size(collection of base type1)
                         = n1 x size(base type1) + n2 . size(base type1) + n3 . size(base type1)
                         = (n1 + n2 + n3) . size(base type1)
                         = n . size(base type1)  where n = n1+n2+n3
                                                                                     --(D4)

That is, the size of any data can be expressed as a combination of:

A count of the number of items of a particular "type"
The size of a unit of that type

Or more generically:

If a collection of data has n distinct types - T1, T2,....Tn; each in N1, N2, ... Nn amounts,
and B is a base type such that T1, T2, ... can also be called B's.

size(collection) = size(collection of T1) and size(collection of T2) ... and size(collection of Tn)
                 = N1 x size(T1) and N2 x size(T2) and ... and Nn x size(Tn)
                 = N1 x size( B)   + N2 x size( B)   + ...   + Nn x size( B)
                    ... because T1, T2, ... Tn are also B's.
                 = (N1 + N2 + ... + Nn) . size(B)
                                                                                     --(D5)

From the discussion above, its clear that there are potentially many candidates for B. In fact we can consider a series of base types B1,B2,...Bm of them such that:

... size(B1) > size(B2) > ... > size(Bm) ...
                                                                                     --(D6)

Note:The ellipses at the beginning and end denote that there could be more on either side of this continuum.

A particular choice might be most appropriate for a particular application, but is one value is likely to be universally appropriate? Our challenge is finding a primordial type that all data can fit under, while still being meaningful and useful. How can we arrive at a unit of measure that addresses both concerns?

Measure using Base and derived types

One option is to choose a base type and express every other type in terms of it. This presumes that such a type can be found, of course; but if we did, like so:

Pick an appropriate k such that Bk is a useful base type for the application at hand. Then:
size(1 Bk) = 1

Then if size(T1) > size(Bk) and size(T2) > size(Bk)
size(1 T1) = C1 x 1 = C1
size(1 T2) = C2 x 1 = C2
   where C1 an C2 are conversion factors for each type
                                                                                     --(D7)

... a collection of data that has n1 data items of type T1 and n2 data items of type T2 can then be expressed as:

size(collection) = size(n1 items of type T1 and n2 items of type T2)
                 = size(n1 items of (C1 items of type Bk) 
                    and n2 items of (C2 items of type Bk))
Now they're of the same type, so we can add them up
                 = size(n1 items of (C1 items of type Bk) 
                      + n2 items of (C2 items of type Bk))
                 = n1 . size(C1 items of type Bk) + n2 . size(C2 items of type Bk)
                 = n1.C1.size(Bk) + n2.C2.size(Bk)
                 = (n1C1 + n2C2) . 1
                 = n1C1 + n2C2
                                                                                     --(D8)

Or more generically,

size(collection) = sum(ni Ci)                                                        --(D9)

There are two challenges with this option:

Arriving at the conversion factors might takes us closer to the implementation model than the domain model.
The measures still have to be proportional, ie an increase in size in the domain should result in a proportional increase in the measurement. If not, the unit of measure is not a high fidelity one.

Measure using an Atomic base type

The other option comes from examining equation (D5) again:

size(collection of data) = (N1 + N2 + ... + Nn) . size(B)
                                                                                     --(D5)

... and realizing that if size(B) is small enough, the value of size(collection of data) is controlled by (N1 + N2 + ... + Nn)- the number of items there are in the collection. So if we could find a measure small enough that the number of data items matters more than the size of each atomic item, the size could be expressed as a count of the items alone.

Another way to think of this is to look at (D6) and say that it actually has an end, like so:

... size(B1) > size(B2) > ... > size(Bm) ... > size(Bomega)
where size(Bomega) = 1
                                                                                     --(D10)

Superficially, this seems similar to the "base type + derived" option and indeed we could consider N1 from (D5) and n1C1 from (D8) be the same and it WOULD be the same. However, there's a subtle difference. With this option, we're saying that even though data is of different size, the

STOPPED HERE FEB4 AM. WAS GOING TO EXPOUND ON HOW SIZE CAN BE GOT FROM STRUCTURE, NOT SIZE OF ATOMS AND CONTRAST WITH CONVERSION FACTOR OPTION ABOVE.

Applying this same logic to

STOPPED HERE FEB3 PM. PROOFED TILL HERE AND FIXED FORMULA REFS TOO. BELOW NEEDS CLEANING.

NEED TO RATIONALIZE THE ARGUMENT OVER THE 2-3 SECTIONS ABOVE. BRING OUT SIZE OF DIFF TYPES WHEN REDUCED TO THEIR SMALLEST FORMS IS A FUNCTION OF THE STRUCTURE AND LESS THAT OF THE SMALLEST FORMS, AND EACH OF THESE FORMS CAN BE CONSIDERED INFINTESIMAL ENOUGH THAT WE DEEM THEM TO BE OF THE SAME UNIFORM SIZE.

Between these two ends of the option spectrum, is there a happy middle where data can be just data? Let's consider our options.

Current notions of data size

First off, we have to consider the incumbent - the Byte (or its 1/8th equivalent - the Bit). This unit is machine-independent and in tune with the stored-program concept that all computers are examples of. The level of homogenity that the byte brings to data is huge, however - everything is essentially a binary number. All other kinds of data - text, pictures, attribute collections are reduced to a single stream of bits; subject to interpretation by an intelligent observer. In the hands of the right interpreter (program or human), binary digits are processed correctly and meaning emerges; in the hands of another, gibberish does. This "data is in the eye of the beholder" nature(sic) is even exploited in some languages to store 2 pieces of information in the same memory - for example, in C unions.

More important, however, is the issue of "proportional scale": as implemented, the byte ends up diverging from our natural view of data in subtle ways. For example: because computers are finite machines they have a maximum number that can be directly represented in them. This is usually a power of 2 and the power is called the word length of the computer. This has two implications:

All numbers below 2^word_length can be readily represented and manipulated, while numbers above this limit cannot and need special handling.
All numbers are represented as multiples of the word length; ie, smaller numbers still take at least the word length's worth of space and larger numbers will take additional space so as to round up to the next highest multiple of the word length.

The result of this is that there is a gap between the natural notion of size and how it is represented in the computer. For example, common sense says that a list with 100 numbers should have a size of 100 times the size of one number. Or in pseudo-formula:

1 number = 1 number unit, let's say.
Thus, size(list(100 numbers)) = 100 number units
                                                                                     --(D9)

Assuming that these numbers were in the range +/- [(2^32)-1], that list of 100 numbers would take up a different size based on the word length of the actual machine when implemented. For example, here are three computers, A, B & C:

Computer A's word length = 4 bytes => size(list) = 100 number units = 100 words = 400 bytes
Computer B's word length = 2 bytes => size(list) = 100 number units = 200 words = 400 bytes
Computer C's word length = 8 bytes => size(list) = 100 number units = 100 words = 800 bytes
                                                                                     --(D10)

A few points to note here:

The "number units" measure retained its size regardless of changes in implementation machine.
In contrast, Word length didnt remain constant or even linear: increasing the word length in C didnt increase the total word size for the list in the same way as decreasing the word length did. Instead of 3 computers we could consider these to be better versions of the same compter to imagine the impact of the machine changing from "underneath our application", so to speak.
Similarly, bytes as a measure of data size didnt correlate well either; despite halving the word length between A & B, the size in bytes remained the same; but that didnt hold when moving to C.
- This is, of course, a direct consequence of using word boundaries, but it does underline the disadvantage with using bytes as a unit of measure to express the natural size of data.

So the Byte leans a little too much towards the computer and not our mental models. Are there other ways of measuring size? At least two come to mind:

Big Omega: which is a well-known way to measure data size used in the analysis of algorithms' complexity. This measures the "order of magnitude of the maximum possible size of a data structure when the input contains n items". It produces an imprecise, but sufficiently accurate and useful estimate of the size of a data structure; so it is used routinely to make design and performance decisions by practitioners.
Bit: Not to be confused with Binary Digit, this is a concept that is otherwise unrelated, but similar in sprit to Big Omega that is used in Personal Privacy circles. In this context, a 'bit' is a chunk of information - an SSN or postal address or some such real-world attribute of a person which can be used to uniquely identify that person. The more number of bits about a person that are available in public, the easier it is to triangulate that person's location, preferences, family, etc and gain access to his/her bank accounts and such protected assets.

These two ideas of data seem to be more higher-level than the byte. They do more things: the second measures rate of change of data size well and the third allows for rich expression of complex data and their types. More importantly, they draw their power from the source of the data rather than its implementation in a computer system.They do lack in accuracy, though. They are useful in the niches that they are used in, but may not be able to give a number for size like the byte does.

So, which one should we pick? In keeping with our goal to arrive at natural measures, our data size measure should:

Use the structure of the data AS DATA,
Be language agnostic,
Be measurable,
Be extensible to define other properties of data with,
And yet be easy for humans to understand and use to understand a lot of information in one go.

What we need is something akin to the "number units" from (D9) above. That unit of measure represents data as data, is language agnostic, measurable, extensible and easy to understand. It has two parts to it: "number" and "unit". The latter allows data to be counted accurately in terms of pieces of information while the former provides the qualification that the data is of a particular type. Depending on the context, either form might be required to help with understanding the system; so we could use both:

100 numbers = 100 units of data = 100 number units of data

To use this convention on the apples and organges example,

3 apples and 2 oranges = 3 apple units and 2 orange units = 5 fruit units = 5 data units

There's a fine distinction between plain units and "typed" units above. Depending on the context, we might care about the type or not. Care must be taken such that all measurements are done in the same set of units. When 2 different types need to be compared or combined, they must be converted into a common unit (eg, to compare or combine apples and oranges, convert them to fruits).

The base of all such units is the "data unit", which homogenizes every primitive type to a single base unit where their sizes are all 1. This unit is useful to understand the gross size of a collection of data as a count of the "things" in the collection regardless of their actual size/volume. In formulas, therefore:

size(1 data unit) = 1
                                                                                     --(D13)

Should we care that the data unit homogenizes all data? With code, we gave all the primorial operations the same numeric size because we couldnt discern their internal workings any better, but with data there could be multiple basic data types. So should we give weight to the fact that meaning actually emerges above the "data unit" level, while that level is merely a count of the number of atomic data units?

It depends. Sometimes it might be useful to say "The system is 5K data units" vs "The system is 2K lists of numbers and 3K users' info", for example. Some other times it might be the opposite that's useful. We shall adopt both and see where we get. The only condition is that both sides of equations have the same units.

It would be remiss to avoid comparing this bare-bones data unit to the Byte. We rejected the byte as our universal measure because it homogenized too much; but it actually stops short of saying "all data is 1 byte regardless of type", not to mention providing a clear conversion factor between different types. So why should we favor this data unit over the byte? The answer is that the data unit models natural data better than the byte. It also allows the level of abstraction to be varied depending on the context of the application.

Sizing up well known primitives

Let's try sizing up some well known and routine data types using this new, yet unnamed size measure.

Numbers

Since numbers are what computers do best, let's start with them. Immediately we have to be aware of the different kinds of numbers: integers, fractions or rational numbers, real numbers, complex numbers and so forth. What can we say about their size when treated as pure data? Not much. We could possibly enumerate how many of them we might need for a particular application with some domain analysis; and possibly - with some more effort - set bounds on the range of values that these numbers might take in that domain.

Regardless, at a human (natural?) level, these are all numbers. We could choose, therefore to equate each such number to one data unit; we could also then qualify it with the "type", so we could say, for example:

An list with 100 numbers has a size of 100 number units
or in formula:

1 number = 1 number unit = 1 data unit
Thus, size(list(100 numbers)) = 100 number units = 100 data units
                                                                                     --(D14)

We could apply the same sort of logic for other kinds of numbers; for example:

1 real number = 1 real number unit = 1 number unit = 1 data unit
1 complex number = 1 complex number unit = 1 number unit = 1 data unit
                                                                                     --(O)

.. and so forth. We could even define "number" to include everything from integers to complex numbers and call that a single number unit. Again, the advantage of this "intentional blindness" to the implementation details is that it allows us to think in terms of the domain and not the machine.

Text

Text as we humans use it, is the name for a string of letters or characters. Quite naturally, that can be seen as a compound data item - an ordered list of letters. So we could define, for example:

size(1 letter or character) = 1 text unit = 1 data unit

Regular text is a list or array of letters. 
Thus size(text) = size(list of letters) = N letter units = N data units
                                                                                     --(P)

As mentioned above, computers handle text by encoding text into a lookup table and storing the indexes as stand-ins for the real textual values. A common encoding is ASCII, which uses 1 byte per character; another is UTF-8 which also takes one byte per character but allows for expression of more human languages than ASCII. Regardless of the encoding used, text is essentially numbers within the computer and have the same inherent disadvantage of varying with the physical machine's word length and the encoding used.

The machine-agnostic "letter unit" measure does not have this disadvantage.

Logical (True/False) values

True/False values are pretty straightforward to think of in terms of a data unit - each one is one logical data unit. Comparisons between this logical data unit and bytes follow similar lines as the discussion for numbers:

Computer A's word length = 4 bytes => size(boolean value) = 1 bool unit = 4 bytes
Computer B's word length = 2 bytes => size(boolean value) = 1 bool unit = 2 bytes
Computer C's word length = 8 bytes => size(boolean value) = 1 bool unit = 8 bytes
                                                                                     --(Q)

There are, of course compound data types like bit vectors which store the extra bytes to store additional boolean values and use boolean algebra to update or read this information. If such a data structure were required in the domain of the application, they would still be counted as multiples of a primitive boolean unit instead of fractions of a byte. That is,

Computer A's word length = 4 bytes => size(list of 100 bools) = 100 bool units = 4 words (13 up to 16) = 13 bytes (12.5 rounded up) 
Computer B's word length = 2 bytes => size(list of 100 bools) = 100 bool units = 7 words (13 up to 14) = 13 bytes (12.5 rounded up) 
Computer C's word length = 8 bytes => size(list of 100 bools) = 100 bool units = 2 words (13 up to 16) = 13 bytes (12.5 rounded up) 
                                                                                     --(R)

In formula, therefore:

size(logical data) = 1 bool unit = 1 data unit                                       --(S)

Enumerations

Enumerations have some natural parallels like days of week, months in a year and so forth, but they are really a programmer's convenience. In terms of natural units, they would just be a list of names. Thus, there is no single "Enumeration type", only specific instances of the pattern like a "days of the week unit" that is allowed the values "Mon, Tue, Wed, Thu, Fri, Sat, Sun".

size(enumerated value) = 1 data unit                                                 --(T)

References and Links

References do not have a direct natural complement - the natural way to refer to something exclusively is to enumerate all the attributes that make it unique. This is, of course, cumbersome as the reference becomes almost as large as the thing itself, if not larger. Natural language appreciates compaction as much as computer representation does, however, so we HAVE come up with ways to refer to things and people without having to describe their every difference from other things and people, two examples being SSNs and Driver's Licence numbers. These are essentially real-world references; they do not actually describe the person, but once assigned, they can be used to refer to (and more importantly retrive information about) a person exclusively.

When natural references are available, creating units for them is straightforward:

size(1 ref)  = 1 ref unit = 1 data unit
size(1 link) = 1 ref unit = 1 data unit
                                                                                     --(U)

That is, the size of a link to something is the same as the size of the reference itself, simply because the link is another reference!

When natural references are not available, however, it is possible to create artificial ones: a particular piece of data could be refered to by its position in the overall collection of data - "the 5th integer", "the 13th player" and so forth; or a particular piece of data could be referred to by how its attributes were recorded in some system of record - "person found in ledger 13, folio 345, line 56" or "psalm 3:16". Whether each such component of the reference should be accounted for separately is a domain concern, but the key points are that they can be considered as being separate from the thing that they describe and that they have some finite, enumerable size and therefore can be measured.

There is a strong similarity between this latter scheme from the real world and how references and links are represented in computers. References and links in computers are essentially memory addresses. In plain english, they are positional counters from the first addressible memory location. Once assigned (at least for the lifetime of the running program), they are uniquely addressible and allow access to retrieve information about a specific object.

The key question, therefore, is not "How do we ascribe a size to references and links?", but rather: "Will there be links to a particular object and should those links be references?" If the answers to the questions are "Yes and Yes", then we should use (I) above to count their sizes. Where artifical references (or memory addresses) are used, we can call those out as a separate "memory reference/link unit".

STOPPED HERE JAN 27 PM. WAS PROOFING FOR CONFUSING TEXT AFTER CHANGE TO EVERYTHING IS DATA.

Sizing up Compound data

By definition, compound data is a collection of primitive data items that together form a cohesive whole. As such the size of compound data is the aggregate of sizes of the primitive data items contained in it. If there are data items that are not specific to any individual child item, but pertains to the aggregate itself, these are still (from a size perspective) considered to be contained in that that aggregate as well. Thus:

size(compound data) = size(compound data attributes) + size(contained itmes)         --(V)

Let's use this formula on some of the often-used/well-known compound data types.

Sizing up Lists

From a natural standpoint, lists are pretty straightforward. They are collections (usually ordered) of things of the same kind. So the default sizing for an list would be:

size(list) = sum(size(list item))                                                    --(D9)

Their representation in computers, however, vary depending on the implementation chosen: if the total number of items and the size of each item is small, contiguous memory could be allocated to the list; if the sizes are huge, references could be stored instead of the item itself; if the items are to be accessed in random order vs sequential order, the implmentation might be a contiguous array of memory vs linked lists; and so forth. How do we reconcile all these differences so that the natural size that we're evolving into actually makes the most sense?

The answer lies in (D9). Regardless of the implementation, the core of size of a list MUST be a sum of the sizes of the items in it. Where the items are not stored as-is, their contribution to the array's size is essentially the size of the references stored in it, not the size of the item itself because that size should be counted elsewhere.

When the implementation uses additional storage outside of each item, for example to store the size of the list in a "length" attribute or store a head pointer, we should account for it using (J):

size(list) = size(list attributes) + sum(size(list item))                            --(D10)

While this is a simple enough equation, each term needs some discussion on how to apply them.

Discussion: list attributes

However, we should apply this overhead ONLY when the domain itself has such list-level attributes, not when the implementation alone does. For example, a list of students in a class and the class strength are prime candidates for representation as a list. The "class strength" attribute is an independently recognizable attribute in the domain. An application written to manage schools SHOULD represent this attribute directly. A display of this list might require pagination and each such subset could be considered a list on its own. Its length is most probably controlled by a configuration value, but should NOT be represented in the sizing because it is an attribute created due to the implementation.

All of this parallels a similar discussion on code size where the "level of abstraction" has a bearing on how we count size. When sizing data, therefore, we should determine the domain up-front and all sizes should be stated as relative to it. To go back to the previous example, therefore:

If our domain is an application that manages schools, the list of students in a class would have a list attribute called "class strength" and be sized with this addition, but each page of the display of this list would be sized as in (I).
If our domain is a library of array management functions that the application to manage schools is one client of, we could expose two types of arrays: one with a length attribute and one without. Both would use (D9) with list attributes = 0 units for the latter. Note that in this case we might just be using bytes as our unit of measure, especially if the library is very specific to a processor or technology stack.

Of course, this is fertile ground for leaning either way - too abstract into the domain end of the spectrum and you can artificially reduce the size; to specific into the implementation domain and you can make the size so specific that it is useless as a generic measure. This issue, however, is inherent with using relative bases. This theory is intended to be used mostly at the level of #1 above and less at #2 because the latter case is not very "natural" by definition. However, it does not preclude application of the theory at that level. The best course would be to reach an agreement (by stating the meanings of each type) amongst the audience of discourse on the level of abstraction.

Discussion: List items

What can we say about the sum(size(list item)) term? The first thing that stands out is that each list item has a unit attributable to it (eg, a list of numbers would be of type "number unit" and so forth), but the list itself doesnt have a unit of its own.

What it does, however, is provide a container to group data that might otherwise not be. The one type-agnostic measure of size that a list has, therefore, is the count of items in it. Indeed, (D10) can be rewritten as:

size(list) = size(list attributes) + size(list item) x N
             where N is the count of list items                                      --(D11)

Applied to a list of 10 grades that are of type "real number unit" and no metadata, therefore,

size(list of grades) = 0 + size(real number unit) x 10 list slot units
                     = 10 real number-list units

A list, therefore is long AND wide - it has two dimensions. Since the data items may be of different types, the squared unit is represented as a hyphenated combination of both the type of the list item and the fact that its in a list.

What would happen if there were list attributes, however? These attributes are part of the list, but not part of the collection of items in the list - they could be of a type completely different from the contained data, for instance. Thus, a list essentially becomes a record containing list attributes and list items. That is,

size(list) = size(record(list attrs, list items))                                    --(D12)

(D11) and (D12) can be reconciled with (D10) after we size up records.

Sizing up records or aggregates

Records are aggregates of data that form a cohesive whole. The individual bits of data - the attributes of the whole - are typically of different types. The size of such a datum can only be expressed as an enumeration of the different kinds of data units.

size(record) = sum(size(attributes)) expressed as n1 T1 units, n2 T2 units and so forth
               where there are n1 number of units of type T1 and so forth
                                                                                     --(D13)

Extrapolation: List as a record

Now we can go back and reconcile (D12) with (D10):

Using (D12), size(list) = size(record(list attrs, list items))
Using (D13),             = size(list attrs) and size(list items)
Using (D11),           = size(list attrs) and N item type-list units

Extrapolation: Program as a record

On a grand-enough scale, a program can be considered a record of code, data and metadata. Thus,

size(program) = size(code) + size(data) + size(metadata)
              = C turings  + D data units + M metadata units                         --(D14)

Of course, program could be replaced by application, system or set of applications with the same result. Note that this is the "Stored Program Size" that we started off this digression to discover.

There are still some pieces to tie down, however, so lets proceed on to other common compound data types.

Sizing up Tables

Tabular data in the natural world are used when items in a list have multiple attributes and comparing/contrasting them together is useful. For example, a table comparing features of two brands of a fridge is useful when in the market for one. The defining characteristics of a table are:

Each row represents one "thing" in the domain, naturally represented as a record of its attributes.
Each column represents the attributes of interest for each thing.

The second major use of a table is to lookup information based on attributes. A table of school grades could be browsed to find a particular student's grade if we know his/her name or student ID, for example.

The third major use of tables is to derive new or summary information - combine the information from multiple source tables to form a new one.

These uses of real-world tables are represented in multiple implementation data structures - multi-dimensional arrays, Hashmaps or associate arrays and relational database tables (which are themselves a collection of some complex implementation data structures like BTrees) to give a few examples.

The choice of the actual data structure used in the implementation is usually dependent on how the data is used. For example, the table with student grades is most likely to be queried to find a particular student's grade, so access to the grade given a student ID is optimized most and that governs how the data in the logical table is stored. Duplication, artificial references and links are all valid additions to the implementation data structure to optimize the access path.

For purposes of size, however, how we access the data is immaterial. The data (as understood in its domain, not the implementations') is still rows of records. Therefore we can say with some surety that:

size(table) = sum(size(record))                                                      --(O)

Sizing up Hierarchical Data

Hierarchical data is any collection of data where items are related to each other with a strict parent-child link. Family trees and Cabinet+Folder+files are typical examples. The size of such data can be readily computed as:

size(hierarchy) = size(root) + size(root's children)
size(root)      = size(compound)
size(children)  = size(list(child items))
                                                                                     --(P)

Hierarchical data is usually implemented as linked lists (in imperative, per-object environments) or foreign keys (in declarative, per-collection environments). The same points of applying "hierarchy-level attributes" mentioned above for lists apply to hierarchies as well.

Sizing up Graphs

Graphs focus on items/objects/entities and the relationships between them. A map of all possible flights in a country is a graph, as is a network of relationships between friends. The key concepts in a graph, therefore, are nodes and edges. The size of a graph can readily be calculated as:

size(graph) = sum(size(node)) + sum(size(edge))
size(node)  = size(compound)
size(edge)  = size(reference to end node)
                                                                                     --(Q)

Graphs are typically implemented in one of three ways:

As linked lists, when traversal is the primary application
As adjacency lists
As nxn matrices

These implementations have their own size characteristics, obviously. TODO: FIGURE OUT IF THERE ARE IMPLICATIONS OF THIS SO A BETTER MEASURE IS ADVISED.

Compound items contained within other compound data

TODO: REWRITE THIS AFTER FIXING OTHER SECTIONS TO MAKE THEM 2D

What if the items in the list are themselves compound? In all likelihood, a list of students will have more than one attribute of the student; so the student list item will actually be a record of all such interesting attributes. Using (D10):

size(student list) = 0 + sum(size(student record))

Assuming we have a first name, last name and grade as the record attributes,

size(student record) = size(first name) + size(last name) + size(grade)
                     =  1 name unit     + 1 name unit     + 1 real number unit
                     =  2 name units + 1 real number unit.

Thus:

size(student list) = sum(size(student record))
                   = size(student record) x number of students in list
                   = [ 2 name units + 1 real number unit] x N,  where N is the number of students in the list

Dynamic Size

Like code, data too can vary over time - program execution changes sizes of data structures as do real-world events. At a particular aggregate level, this could be seen as the change in size of that aggregate - its dynamic size.

For most uses of this theory, we'd be interested in the maximum possible dynamic size that some data could achieve. That is,

dyn_size(data) = max(size(data)) across all instants of time
                                                                                     --(R)

Note: There is a difference between static and dynamic data and static and dynamic sizes of data. A program will have both static and dynamic data: the former being data allocated at startup and the latter being data allocated during execution. Both these types of data, however, will dynamic sizes - their sizes can change over time. For example, an array might be allocated a size of 100 at startup, but only 5 are used with the rest expected to be filled during execution.

(Q) is an interesting equation because this is exactly the definition of Big Omega. That theory, of course, takes the next logical step and says that the maximum size is proportional to some attribute of the data, for eg, the number of items in it, usually referred to as N. Thus:

max(size(data)) is proportional to N
                                                                                     --(S)

This means all of the theory surrounding algorithms can be readily used to come up with dynamic size of data contained in programs, only with the unit being the new (unnamed) data units.

Code as Data and vice versa

Part of the reason we embarked on this detour of data size is to size up code when it has data-like properties; especially when its stored in a passive form. So, how would we size up code as data? Well, in its most general depiction, code can be considered a graph. Thus:

size(code) = size(graph of instructions)
                                                                                     --(T)

This does not, of course, account for dynamic code itself, so an equation similar to (Q) could be written for dynamic size of code:

dyn_size(code) = max(size(code)) across all instants of time
                                                                                     --(U)

TODO: DATA AS CODE: READ SICP AGAIN AND SEE IF THERE'S A HAPPY CONVERGENCE OF IDEAS. THIS MIGHT ALSO BRING THE TURING AND CHURCH VERSIONS TOGETHER WITH THIS THEORY.

Name for the data unit

Options for name: Chris Date or Edgar Frank "Ted" Codd. since date is also a datatype, maybe not Date; and Codd seems a little odd. I'm leaning towards Ted. STOPPED HERE JAN 13 AM