Cutting through the bullshit.

Thursday 13 June 2013

What's 'metadata'?

[Note: a few additions, 2013 06 16]

When I first read about the NSA collecting 'metadata' about telephone calls, I thought it was an unconventional use of the term. But on reflection, if you regard the actual substance of a phone call as the data of interest, then the details pertaining to the call are, in fact, metadata.

But this is deceptive. The 'metadata' the NSA collect are crucial to understanding the data they pertain to – the actual content of phone calls and messages that they are purportedly not capturing. A message stating 'The bomb goes off in 48 hours' would be useless if you didn't know when it was sent, among other things. Under the circumstances, they are treating the 'metadata' as data and analysing it as such.

At another level of analysis, those metadata are themselves data and are only meaningful when viewed in light of their own set of metadata. That second set, the description of what each field in a record means, is what I think of as metadata.

The record of a telephone call is probably represented as a string of characters. The metadata would specify what those characters mean. First of all, they would tell you that the unit of enumeration – what each record describes – is a 'Phone call', which would be defined to include or exclude SMSs, MMSs, video calls, etc. If it included more than one type of call, there would be a 'Call type' field with a code indicating what kind of call it was. The first 20 characters of the record, then, might be a sequential call record identifier. The metadata would specify that the field is named 'Call record id' and comprises 20 numeric characters. There might then be a code for the date of the call; the metadata would tell you that it is an eight-character numeric field and link it to a particular time zone. Similarly, the metadata for 'Time of call' would say that the field is four numeric characters long, state the applicable time zone, and define its content as the time the call is answered (or perhaps when dialling was completed). The next three characters might represent the country code of the originating phone. That field would be called, for example, 'Originating country code', be three numeric characters long and be associated with a table of valid country codes (a classification), linking each with the relevant country; etc. The 'classification' associated with the fields for actual phone numbers would effectively be a reverse telephone directory. Other fields could include codes for the phone towers, satellites and nodes the call passes through, with times, details of the receiving number, and so forth.
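
To make that concrete, here is a rough sketch in Python of how such metadata might drive the reading of a record. Everything in it is hypothetical: the field order, the FieldSpec and parse_record names and the sample record are my own illustrations of the layout described above, not anything the NSA or a phone company is known to use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """Metadata for one field: its name, length and allowed characters."""
    name: str
    length: int
    numeric: bool = True


# Hypothetical layout following the description in the text.
LAYOUT = [
    FieldSpec("Call record id", 20),
    FieldSpec("Date of call", 8),
    FieldSpec("Time of call", 4),
    FieldSpec("Originating country code", 3),
]


def parse_record(record, layout=LAYOUT):
    """Split a fixed-width record into named fields using the metadata."""
    expected = sum(spec.length for spec in layout)
    if len(record) != expected:
        raise ValueError(f"expected {expected} characters, got {len(record)}")
    fields = {}
    pos = 0
    for spec in layout:
        raw = record[pos:pos + spec.length]
        # Enforce the character type the metadata stipulates.
        if spec.numeric and not (raw.isascii() and raw.isdigit()):
            raise ValueError(f"{spec.name!r} must be numeric, got {raw!r}")
        fields[spec.name] = raw
        pos += spec.length
    return fields


# An invented record: id 42, 13 June 2013, 09:15, country code 001.
record = "00000000000000000042" + "20130613" + "0915" + "001"
parsed = parse_record(record)
```

Without LAYOUT, the record is just 35 digits. The time zone, the definition of 'Time of call' and the table of valid country codes would be further metadata that this sketch leaves out.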

The metadata pertaining to survey data would include definitions of the concepts purportedly collected, the wording and sequence of the questions asked to elicit data supporting those concepts, how such data are classified, a definition of the units being measured, the time the data apply to, the sampling methodology, and so forth. So a record in a labour force survey would include data like 'Dwelling id', 'Household number', 'Person number', 'State' or other geographic indicators, 'Sex of person', 'Age of person in years', 'Marital status of person', 'Labour force status of person', 'Hours actually worked by person during reference period', 'Hours usually worked by person', 'Status in employment of person in main job', 'Occupation of person in main job', 'Industry of person in main job', additional comparable fields for second, third, fourth... jobs, 'Duration of unemployment of person', etc. The metadata for records like that would name and define each field, stipulate its length and the type of characters allowed, and associate it with any relevant classification. The classification could be quite simple. A classification of 'Sex of person' might look like this:
Sex
0    Undetermined
1    Female
2    Male
3    Other
while classifications of 'Occupation' and 'Industry' fill large tomes. Sometimes, of course, a number is just a number, 'Age in years', for example. But even these can be classified by grouping them into ranges. For the record, it's never a good idea to collect age in ranges, as that can make it impossible to compare the data with other data collected or output in different ranges. So if you are interested in the population aged 18-22 years, for example, and ask respondents whether they are in that range, you could never compare your data with data collected in standard age ranges: 15-19, 20-24... If you're developing a survey, I urge you to collect age last birthday in single years.
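
The point about single years can be shown with a small Python sketch. The ages and the count_in_ranges helper are invented for illustration.

```python
def count_in_ranges(ages, ranges):
    """Count how many ages fall in each inclusive (low, high) range."""
    counts = {r: 0 for r in ranges}
    for age in ages:
        for low, high in ranges:
            if low <= age <= high:
                counts[(low, high)] += 1
                break
    return counts


# Invented sample: age last birthday, collected in single years.
ages = [17, 18, 19, 20, 21, 22, 23, 24]

# The same data can be output in standard ranges...
standard = count_in_ranges(ages, [(15, 19), (20, 24)])
# ...or in the range you happen to be interested in.
custom = count_in_ranges(ages, [(18, 22)])
```

Had the survey only asked 'Are you aged 18-22?', the custom count would be all you had, and there would be no way to derive the standard ranges from it.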

To digress, it's worth pointing out that there's more than one way to define a concept. For most purposes – when defining a word, for example – it's probably best to identify the core of the concept and acknowledge that speakers will vary in how far from that core something can be before they'll call it something else. When defining statistical concepts, however, it's crucial to demarcate the outline of the concept so that there is little or no possibility of ambiguity. Each unit either fits into a category or not. I hasten to add that statisticians are not always as careful as they need to be about defining terms and that standard definitions may not accurately reflect the concept as it is actually collected.

Similarly, not all classifications are the same. In a statistical classification, categories have to be defined to be mutually exclusive so that every unit in the population to be enumerated fits into one and only one category. And the classification must be exhaustive, so there's a category to accommodate every unit to be classified, even if the category is just 'Other'. Again, not all statistical classifications are entirely 'fit for purpose' or even internally coherent. And even when they are, they are often applied in contexts they were not designed for. So a classification of industries, for instance, that may (or may not) be perfectly serviceable in classifying the commodities each industry produces can obscure similarities and differences among industries when applied to workplace safety. 'Agriculture, forestry and fishing' captures the industries that produce food and timber, but the hazards encountered in abalone diving are quite different to those in the dairy cattle farming industry, which in turn might be similar to those in the beef cattle farming industry.
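
The two requirements, mutually exclusive and exhaustive, can be checked mechanically. The following Python sketch is only an illustration: the check_classification function and both sets of age groups are made up, with each category expressed as a rule that says whether a unit belongs to it.

```python
def check_classification(units, categories):
    """Return (gaps, overlaps) for a classification applied to units.

    gaps: units that fit no category (the classification is not exhaustive).
    overlaps: (unit, codes) pairs for units that fit more than one category
    (the categories are not mutually exclusive).
    """
    gaps, overlaps = [], []
    for unit in units:
        matches = [code for code, belongs in categories.items() if belongs(unit)]
        if not matches:
            gaps.append(unit)
        elif len(matches) > 1:
            overlaps.append((unit, matches))
    return gaps, overlaps


units = list(range(10, 30))  # invented population: ages 10 to 29

# Sound: every age fits exactly one category, thanks to 'Other'.
sound = {
    "15-19": lambda age: 15 <= age <= 19,
    "20-24": lambda age: 20 <= age <= 24,
    "Other": lambda age: age < 15 or age > 24,
}

# Flawed: age 20 fits two categories and there is no 'Other'.
flawed = {
    "15-20": lambda age: 15 <= age <= 20,
    "20-24": lambda age: 20 <= age <= 24,
}

sound_result = check_classification(units, sound)
flawed_gaps, flawed_overlaps = check_classification(units, flawed)
```

A check like this can only tell you whether the categories partition the population. It can't tell you whether the classification is fit for the purpose it is being applied to.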

Finally, a generalisation like 'The mean number of completed calls originating from phones within the 50 states is 7.8 per day' is not really metadata. It's just another way of presenting the data. The applicable metadata would include definitions of 'Completed call' (including or excluding SMS, etc.) and 'Within the 50 states' (including or excluding diverted calls, global roaming, etc.).
