Tuesday, July 16, 2013

On the topic of data: Why do we collect data?


Data don't make sense; we will have to resort to statistics


Recently, on one of the social networks I belong to, a discussion started about statistics and the usage of the term data. The best definition I can provide is that data is a recorded observation of an event, such as how much a person spends on a new car or the score one earned on a test. Once you have a number of such data points, you have a data set, and it is common to refer to the data set as just data. Although the terms data, data point, and data set differ in meaning, we can and often do classify them all as data.

The more pressing question, and the logical continuation of this discussion, is why, how, when, and where we collect data. Giving my opinion on all of this will of course take more than one blog post, but I will do my best to provide insight into these areas across a few posts.

This post will attempt to provide some insight into why we collect data. The history of data collection began at about the same time that people began counting. People counted the number of items necessary to trade for another item or set of items, how many times the sun rose to determine a season, or the quantity of food necessary to live through the winter. After we became competent in counting, we began to attach meaning to these numbers. For instance, a farmer must know how many seeds to plant in order to produce a crop that yields enough to live on. The first areas to use data and the interpretation of data were agriculture, astronomy, and politics. In the last 100 years, people began collecting data in an attempt to describe behavior.

Simply put, the most prevalent reason to collect data about a human is to predict behavior. Taking it one step further, we use statistics either to describe the data we collect or to make inferences (judgments) about a group of people. Descriptive statistics are used when one wishes only to describe the group of people from which the data were collected. If we offer 3 levels of web domain services, we might use data to conclude that 80% of our customers prefer the second tier of service, making it our most popular. It would be accurate to claim that 4 out of 5 customers prefer service level 2. Whether we disclose the definition of the term “customers” within that statement becomes a matter of ethics. Additionally, we might be curious about the total number of customers recorded when this claim was made. How often do you actually consider the defining characteristic of the people who were measured? In the above case, that would be the “customer”: a person who purchased one of our domain hosting packages. These statements are descriptive; they describe what has already occurred, in an attempt to influence the behavior of those looking to make a purchase.
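
To make the descriptive claim concrete, here is a minimal Python sketch. The purchase counts are hypothetical (invented purely for illustration, not real customer data), and the function name tier_shares is my own. It reports each tier's share together with the total number of customers behind the claim.

```python
from collections import Counter

# Hypothetical records: one entry per customer, holding the tier purchased.
# These counts are invented for illustration only.
purchases = [1] * 12 + [2] * 80 + [3] * 8


def tier_shares(records):
    """Return the fraction of customers that bought each tier."""
    counts = Counter(records)
    total = len(records)
    return {tier: count / total for tier, count in sorted(counts.items())}


shares = tier_shares(purchases)
most_popular = max(shares, key=shares.get)
total_customers = len(purchases)

# Report the share of each tier alongside the size of the group measured.
for tier, share in shares.items():
    print(f"Tier {tier}: {share:.0%} of {total_customers} customers")
print(f"Most popular tier: {most_popular}")
```

Reporting the total alongside the percentage matters: "4 out of 5 customers" means something quite different when there are 5 customers than when there are 5,000.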

Data can also be interpreted so that inferences can be made about human behavior. Inferential statistics allow us to measure data from a small but representative group of people and make decisions about how the entire population will behave. We are constantly bombarded by what should be quality statistical inferences, and we rarely challenge them. I often hear folks say that a statistician can make any claim they desire using statistics. While it may seem true, I prefer to believe that most of us who are experienced and capable in statistics only make statements that are strongly supported by the data.
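
One way to see how strongly a sample supports a claim about the whole population is to attach an interval to the estimate. The sketch below is only an illustration under stated assumptions: the sample is a simple random (representative) sample, the sample counts are hypothetical, the function name proportion_interval is my own, and the interval uses the common normal approximation with z = 1.96 for roughly 95% confidence. Other interval methods exist and may be more appropriate for small samples.

```python
import math


def proportion_interval(successes, n, z=1.96):
    """Estimate a population proportion from a sample.

    Returns (estimate, lower, upper) using the normal approximation.
    Assumes a simple random sample; z = 1.96 gives roughly 95% confidence.
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    p = successes / n
    standard_error = math.sqrt(p * (1 - p) / n)
    margin = z * standard_error
    # A proportion cannot fall outside [0, 1], so clip the bounds.
    return p, max(0.0, p - margin), min(1.0, p + margin)


# Hypothetical samples: the same 80% preference observed in groups of different sizes.
for sample_size in (10, 100, 1000):
    preferred = int(0.8 * sample_size)
    estimate, lower, upper = proportion_interval(preferred, sample_size)
    print(f"n={sample_size}: estimate {estimate:.2f}, "
          f"interval {lower:.2f} to {upper:.2f}")
```

The estimate is the same in each case, but the interval narrows as the sample grows, which is one reason the size of the measured group is worth asking about before accepting an inference.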

In summary, we use the word data to describe the measurements that we make on an individual or a group of individuals. Applying statistics to the data allows decisions to be made. These decisions are either descriptive or inferential and pertain to one or more aspects of human behavior.

My next blog post on statistics will address the basic descriptive statistics and how we might use them accurately.