Recently, on one of the social networks I belong to, a discussion started about statistics and the use of the term data. The best definition I can provide is that data is a recorded observation of an event, such as how much a person spends on a new car or the score someone earned on a test. Once you have a number of such data points, you have a data set, and it is common to refer to the data set as just data. Although the terms data, data point, and data set mean different things, we can and often do call them all data.
The more pressing question, and the logical continuation of
this discussion, is to explore why, how, when, and where we collect
data. Of course, it would take more than one blog post to give my
opinion on all of this, but I will do my best to provide insight into these areas across a few
blog posts.
This post will attempt to provide some insight into why
we collect data. The history of data collection began about the same
time that people began counting. People counted the number of items necessary
to trade for another item or set of items, how many times the sun rose to
determine a season, or the quantity of food necessary to live during the
winter. Once we became competent at counting, we began to attach meaning to
these numbers. For instance, a farmer must know how many seeds to plant in
order to produce a crop that yields enough to provide properly. The first areas to use data and the
interpretation of data were agriculture, astronomy, and politics. In the last
100 years, people began collecting data in an attempt to describe behavior.
Simply put, the most
prevalent reason to collect data about a human is to predict behavior.
Taking it one step further, we use statistics to either describe the data we
collect or to make inferences (judgments) about a group of people. Descriptive statistics are used when one
wishes only to describe the entire group of people from which the data were
provided. If we have 3 levels of web domain services to offer, we might use
data to conclude that 80% of our customers prefer the second tier of service, thereby
making it our most popular. It would be accurate to claim that 4 out of 5 customers prefer service level 2. It
becomes a matter of ethics whether we disclose the definition of the term “customers” within that statement.
Additionally, we might be curious about the total number of customers that were
recorded when this claim was made. How often do you actually consider the defining
characteristic of the people who were measured? In the above case, that would
be the “customer”. A customer is a
person who purchased one of our domain hosting packages. These statements are descriptive; they describe what has
already occurred in an attempt to influence the behavior of those looking to
make a purchase.
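To make the service tier example concrete, here is a minimal Python sketch of how such a descriptive claim might be computed. The purchase records and the function name tier_shares are invented for illustration and are not real customer data. The sketch reports the number of customers alongside the percentages, since "4 out of 5" means little without it.

```python
from collections import Counter

# Hypothetical purchase records: one service tier (1, 2, or 3) per customer.
# These ten values are invented purely for illustration.
purchases = [2, 2, 1, 2, 2, 3, 2, 2, 2, 2]


def tier_shares(records):
    """Return the fraction of customers at each tier, keyed by tier number."""
    counts = Counter(records)
    total = len(records)
    return {tier: count / total for tier, count in sorted(counts.items())}


shares = tier_shares(purchases)

# The most popular tier is the one with the largest share.
most_popular = max(shares, key=shares.get)

print(f"customers recorded: {len(purchases)}")
for tier, share in shares.items():
    print(f"tier {tier}: {share:.0%}")
print(f"most popular tier: {most_popular}")
```

With only ten customers recorded, the 80% figure is accurate but easy to overstate, which is exactly why the definition and the count of "customers" matter.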
Data can also be interpreted in such a way that inferences
can be made in terms of describing human behavior. Inferential statistics allow us to measure data from a small but
representative group of people and make decisions about how the entire
population will behave. We are constantly bombarded by what should be quality
statistical inferences, and we rarely challenge them. I often hear folks say
that a statistician can make any claim they desire using statistics. While that
may seem true, I prefer to believe that most of us who are experienced and
capable in the area of statistics only make statements that are strongly
supported by the data.
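As a rough illustration of inference from a sample, the sketch below estimates a population share from a survey and attaches a margin of error. The method is one common choice, a normal-approximation (Wald) confidence interval, which I am picking for the example and have not discussed in this post. The survey numbers and the function name proportion_interval are assumptions made up for the sketch.

```python
import math


def proportion_interval(successes, n, z=1.96):
    """Normal-approximation (Wald) interval for a population proportion.

    z=1.96 corresponds to roughly 95% confidence. The approximation is
    only reasonable for fairly large, representative samples.
    """
    p_hat = successes / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    # Clamp to [0, 1] because a proportion cannot fall outside that range.
    return p_hat, max(0.0, p_hat - margin), min(1.0, p_hat + margin)


# Hypothetical sample: 100 customers surveyed, 80 of whom prefer tier 2.
p_hat, low, high = proportion_interval(80, 100)
print(f"sample share: {p_hat:.0%}")
print(f"approximate 95% interval: {low:.1%} to {high:.1%}")
```

A statement supported by this sample would be that roughly 72% to 88% of all customers prefer tier 2, not that exactly 80% do, and even that holds only if the sample is representative.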
In summary, we use the word data to describe the
measurements that we make on an individual or a group of individuals. Applying
statistics to the data allows decisions to be made. These decisions are either
descriptive or inferential and pertain to one or more aspects of
human behavior.
My next blog post on statistics will address basic descriptive statistics and how we might use them accurately.