## Monday, July 22, 2013

### Adequately Describing Sales Data Using Simple Statistics

Continuing on the topic of data, I would like to discuss the widely used descriptive statistics: mean, median and mode. Collectively, these are known as measures of central tendency. We use measures of central tendency to gain some understanding of the distribution of the collected data. In practice, whether or not it is sound, these measures are often used to make decisions about inventory, to judge the success of a business, and even to establish the need for additional legislation.

### Using the Average

The arithmetic average, referred to as the mean, is a single value meant to represent the set of values from which it is calculated. It is not without faults but is widely used.

For example, if you owned a car dealership, you would probably record the sales price for every car sold. You might also be interested in the average sales price of the vehicles sold at your dealership. To find it, you would add up the sales total and divide it by the number of vehicles sold to calculate the average (also referred to as the mean) amount spent by each customer.
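The calculation described above can be sketched in a few lines of Python. The five sale prices are hypothetical values I made up for illustration; they are not the dealership's actual records.

```python
from statistics import mean

# Hypothetical sale prices in dollars (illustration only, not real dealership data).
sale_prices = [17_000, 18_000, 19_000, 21_000, 25_000]

total_sales = sum(sale_prices)       # add up the sales total
vehicles_sold = len(sale_prices)     # count the vehicles sold
average_price = total_sales / vehicles_sold

print(f"Total sales:   ${total_sales:,}")
print(f"Vehicles sold: {vehicles_sold}")
print(f"Average price: ${average_price:,.2f}")

# The standard library's statistics.mean performs the same calculation.
library_average = mean(sale_prices)
```

With these made-up numbers the total is 100,000 dollars across 5 vehicles, so the average is 20,000 dollars, a price that none of the five customers actually paid.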

An important issue to consider is that the mean is affected by extreme values. Additionally, the average may not tell a business owner what the best selling or most popular product is among consumers.

Figure 1. The mean is significantly influenced by extreme values. The mode is the value that appears the most in a set of numbers. The median is the midpoint of the data set.

### Extreme Values Influence The Mean

To illustrate the influence of extreme values on the average, I present two scenarios. In scenario one (Table 1), notice that the cars ranged in price from \$16,100 to \$278,295, with the number sold listed at each price. Multiplying the price of a vehicle by the number sold at that price yields the total sale amount for that model. Add up the total number of cars sold (133), and do the same for the total sale amounts (\$3,076,655). To calculate the mean, divide \$3,076,655 by 133 for a mean of \$23,132.82.

Just by removing the one vehicle that sold for \$278,295, we see a fairly substantial change in the average sale price of vehicles. The new average sale price is \$21,199.77, almost \$2,000 less than the first average.
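To see how strongly one sale can move the mean, here is a small sketch that follows the same steps: multiply each price by the number sold, add up the totals, and divide by the number of cars. The price and count pairs are hypothetical and are not the figures from Table 1; `mean_price` is a helper name I chose for this example.

```python
# Hypothetical (price, units sold) pairs; NOT the figures from Table 1.
sales = [
    (17_000, 4),
    (19_000, 10),
    (21_000, 5),
    (250_000, 1),  # a single extreme sale
]

def mean_price(rows):
    """Mean sale price from a frequency table: sum(price * count) / sum(count)."""
    total_revenue = sum(price * count for price, count in rows)
    total_units = sum(count for _, count in rows)
    return total_revenue / total_units

with_extreme = mean_price(sales)          # all 20 sales
without_extreme = mean_price(sales[:-1])  # drop the one extreme sale

print(f"Mean with the extreme sale:    ${with_extreme:,.2f}")
print(f"Mean without the extreme sale: ${without_extreme:,.2f}")
```

In this made-up example, one sale out of twenty is enough to push the mean above the price paid by the other nineteen customers.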

Unfortunately, neither average is equal to an actual sales price. Both the \$23,132 and the \$21,040 averages fall between the \$20,940 and \$27,670 price points. If we use the average (mean), we might recommend that we increase our inventory with vehicles priced between \$20,940 and \$27,670. It might be reasonable to suggest that the dealership increase the stock of vehicles at or near the average price; however, this recommendation may not yield an increase in actual sales. We should use other descriptive statistics to determine the most frequent sales price.

### The Most Frequent Sales Price (the mode)

The statistical term mode, the most frequently observed data point, may provide us with a bit more insight. The mode is most often used when analyzing data that is not “numerical” but “categorical.” An example would be when you wish to determine which exterior color of vehicle sold the best. In the case of color, you cannot add up values and compute a mean. So counting the number of vehicles sold in each color, and then reporting the color with the highest count, would be one way to use the mode.
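Counting categories is easy to sketch with Python's `collections.Counter`. The list of colors below is hypothetical and only serves to show the counting step.

```python
from collections import Counter

# Hypothetical exterior colors of vehicles sold (illustration only).
colors_sold = ["silver", "black", "white", "silver", "red", "silver", "black"]

# Count how many vehicles were sold in each color.
color_counts = Counter(colors_sold)

# most_common(1) returns a list holding the single (value, count) pair
# with the highest count, which is the mode.
mode_color, mode_count = color_counts.most_common(1)[0]

print(f"Best-selling color: {mode_color} ({mode_count} sold)")
```

The same approach works for sale prices: replace the colors with prices and the mode is the most frequently paid price.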

Using the previous scenarios, from Tables 1 and 2, we can use the mode to help us make a recommendation regarding the restocking of vehicles based on the sales price. Notice that, in both scenarios, the vehicle priced at \$18,995 was the most frequently sold vehicle. The vehicle at this price accounted for 21 of the 133 total sales. The mode is \$4,137 and \$2,045 lower than the averages of scenarios 1 and 2, respectively.

At this point, we have calculated the mean and the mode using the given data from both scenarios. We have a clearer understanding of the purchases made at our dealership, but we have one more descriptive statistic that we might want to explore.

### Using the Median, the Mid-Point, Percentiles

A percentile is not the same as a percentage. A percentile tells you the percentage of values that fall at or below a given value. You have likely seen a test score reported in terms of percentiles; the results may have read something like “Your score: 87th percentile.” This means that your score was equal to or greater than 87 percent of the scores on the same test. A percentage, by contrast, is what you calculate by taking the number of questions answered correctly, dividing it by the total number of questions, and multiplying by 100.
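The difference between the two can be sketched as follows. The test length, the list of scores, and my own score are all hypothetical, and the percentile is computed with the "at or below" definition used above.

```python
# Hypothetical test: 40 questions and the scores of ten test takers.
total_questions = 40
all_scores = [22, 25, 27, 28, 30, 31, 33, 34, 36, 38]
my_score = 34

# Percentage: share of the questions I answered correctly.
percentage = 100 * my_score / total_questions

# Percentile: share of all scores that are at or below my score.
scores_at_or_below = sum(1 for score in all_scores if score <= my_score)
percentile = 100 * scores_at_or_below / len(all_scores)

print(f"Percentage correct: {percentage:.0f}%")
print(f"Percentile:         {percentile:.0f}")
```

Here the same score gives 85 percent correct but sits at the 80th percentile, because the percentile depends on how everyone else scored.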

The median is the midpoint of the data set. The midpoint is also known as the 50th percentile: the price at or below which 50% of the sales fall.

To find the data point that represents the median:
• When you have an odd number of data points (in our case, 133), add 1 to the number of data points and divide the sum by 2. So: 133 + 1 = 134, and 134 / 2 = 67. The median is the 67th data point.
• When you have an even number of data points, calculate the usual average of the two data points found in the middle.

List the vehicles sold by price, starting at the lowest price, and count the number of cars sold until you reach 67. The sale price of the 67th car sold is \$18,995.
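The counting procedure above can be sketched in Python for data stored as price and count pairs. The table below is hypothetical (it is not Table 1), and `median_price` is a helper name I chose for this example.

```python
# Hypothetical (price, units sold) pairs; NOT the figures from Table 1.
sales = [(17_000, 4), (19_000, 10), (21_000, 5)]

def median_price(rows):
    """Median of a frequency table of (price, count) pairs."""
    rows = sorted(rows)  # list by price, lowest first
    n = sum(count for _, count in rows)
    if n % 2 == 1:
        positions = [(n + 1) // 2]        # odd: the single middle position
    else:
        positions = [n // 2, n // 2 + 1]  # even: the two middle positions
    middle_prices = []
    for position in positions:
        cars_counted = 0
        for price, count in rows:
            cars_counted += count         # count cars until we reach the position
            if cars_counted >= position:
                middle_prices.append(price)
                break
    return sum(middle_prices) / len(middle_prices)

print(f"Median price: ${median_price(sales):,.0f}")
```

With 19 cars in this made-up table, the middle position is (19 + 1) / 2 = 10, and the 10th car when listed by price sold for 19,000 dollars. For 133 cars the same formula gives position 67.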

Sometimes this value is easy to determine and sometimes it is not. When it is difficult to determine, it is common practice to report other percentiles. Using the above data sets (Table 1 and Table 2), we find that 61% of the 133 cars sold were priced at or below \$18,995. The 76th percentile for both scenarios is \$20,940, and the 46th percentile is \$18,500. Recall that the averages for scenarios 1 and 2 were \$23,132.82 and \$21,040.49. Even the 76th percentile is not as high as either average.
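Percentiles like these come from a running total: for each price, count the cars sold at or below it and divide by the total number of cars. Here is a minimal sketch using a hypothetical table (not Table 1); `percent_at_or_below` is a name I made up for the helper.

```python
# Hypothetical (price, units sold) pairs; NOT the figures from Table 1.
sales = [(17_000, 4), (19_000, 10), (21_000, 5), (250_000, 1)]

def percent_at_or_below(rows):
    """Map each price to the percentage of cars sold at or below that price."""
    total_units = sum(count for _, count in rows)
    cars_counted = 0
    result = {}
    for price, count in sorted(rows):  # lowest price first
        cars_counted += count
        result[price] = 100 * cars_counted / total_units
    return result

for price, percent in percent_at_or_below(sales).items():
    print(f"${price:>9,}: {percent:.0f}% of cars sold at or below this price")
```

In this made-up table, 95% of the cars sold for 21,000 dollars or less, even though the mean of the same table is 30,650 dollars.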

### What About The Entire Picture? The Range

Sometimes you might want to report the range of the observed data. In this case, we might want to know the range in sales price. This information might be useful to highlight the large disparity in what customers spend, or even to justify having a few high-priced vehicles in stock. To calculate the range, subtract the lowest sales price from the highest sales price:
• Using the data from Table 1: \$278,295 - \$16,100 = \$262,195

This range, \$262,195, could be a bit misleading. If no further information is provided, the actual details behind this broad range in sales price are unknown.
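The range calculation is a single subtraction. This sketch uses the highest and lowest sale prices quoted above from Table 1; the variable names are my own.

```python
# Highest and lowest sale prices as quoted from Table 1.
highest_price = 278_295
lowest_price = 16_100

# Range: highest value minus lowest value.
price_range = highest_price - lowest_price

print(f"Range in sales price: ${price_range:,}")
```

Because the range uses only the two most extreme values, it says nothing about where the sales in between are concentrated.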

To provide the best picture of the data, the range should be reported alongside the measures of central tendency. Reporting more than one of these measures generally gives a better description of the data than any single one.

### So What Should The Recommendation Be?

Based on the presented data, we should quickly realize that at least ¾ of the customers purchased vehicles priced at or under \$20,940. The sales of a few higher-priced vehicles pulled the mean upward, so the mean may not be useful when determining which cars to restock. If the salesperson’s commission on a new vehicle is independent of the vehicle sales price, then it makes the most sense to stock the inventory with the cars that are most likely to sell. Using the mode and percentile data, we find that cars priced at \$18,995, as well as those priced at or under \$20,940, are the most popular among the customers.

Personally, I would stock only cars priced at or under \$20,940 and offer the higher-priced vehicles by order only. I would take the inventory slots that had gone to higher-priced vehicles and distribute them equally across the cars ordered at the \$18,500, \$18,995 and \$20,940 price points.

### Summary

Often we can be misled by using only one statistic to describe the data. In this example, we found that the mean could be characterized as being slightly inflated. The mode was several thousand dollars below the mean, and more than ¾ of the vehicles sold were priced below the average. I find it interesting that most people are primarily concerned with the average.
• The mean is affected by extreme values, but it is often regarded as the most representative measure because its calculation includes every value.
• The mode is the most frequently occurring data point and the only measure of central tendency that can be applied to data that is not numerical. In this case, we used the mode to determine the most frequent sales price.
• Also known as the 50th percentile, the median is the point at or below which 50 percent of the observed data points fall.
There are several other instances that I can think of where I would like to have more than one measure of central tendency reported. How about you?

## Tuesday, July 16, 2013

### On the topic of data: Why do we collect data?

Recently, on one of the social networks I am a member of, a discussion started about statistics and usage of the term data. The best definition I can provide is that data is a recorded observation of an event, such as how much a person spends on a new car or the score one earned on a test. Once you have a number of such data points, you have a data set, and it is common to refer to the data set as just data. Although the terms data, data point, and data set differ in meaning, we can and often do classify them all as data.

The more pressing question, and the logical continuation of this discussion, is to explore why, how, when, and where we collect data. This of course would take more than one blog post to cover, but I will do my best to provide insight into these areas across a few blog posts.

This post will attempt to provide some insight into why we collect data. The history of data collection began about the same time that people began counting. People counted the number of items necessary to trade for another item or set of items, how many times the sun rose to determine a season, or the quantity of food necessary to live through the winter. After we became competent in counting, we began to attach meaning to these numbers. For instance, a farmer must know how many seeds to plant in order to produce a crop that yields enough to provide properly. The first areas to use data and the interpretation of data were agriculture, astronomy and politics. In the last 100 years, people began collecting data in an attempt to describe behavior.

Simply put, the most prevalent reason to collect data about people is to predict behavior. Taking it one step further, we use statistics either to describe the data we collect or to make inferences (judgments) about a group of people. Descriptive statistics are used when one wishes only to describe the group of people from which the data were collected. If we offer 3 levels of web domain services, we might use data to conclude that 80% of our customers prefer the second tier of service, making it our most popular. It would be accurate to claim that 4 out of 5 customers prefer service level 2. Whether we disclose the definition of the term “customers” within that statement becomes a matter of ethics. Additionally, we might be curious about the total number of customers recorded when this claim was made. How often do you actually consider the defining characteristic of the people who were measured? In the above case, that would be the “customer”: a person who purchased one of our domain hosting packages. These statements are descriptive; they describe what has already occurred in an attempt to influence the behavior of those looking to make a purchase.

Data can also be interpreted in such a way that inferences can be made about human behavior. Inferential statistics allow us to measure data from a small but representative group of people and make decisions about how the entire population will behave. We are constantly bombarded by what should be quality statistical inferences, and we rarely challenge them. I often hear folks say that a statistician can make any claim they desire using statistics. While it may seem true, I prefer to believe that most of us who are experienced and capable in the area of statistics only make statements that are strongly supported by the data.

In summary, we use the word data to describe the measurements that we make on an individual or a group of individuals. Applying statistics to the data allows decisions to be made. These decisions are either descriptive or inferential and pertain to an aspect(s) of human behavior.

My next blog post on statistics will address the basic descriptive statistics and how we might use them accurately.