Sometimes your data has outliers. Trimming and Winsorizing are two ways to mitigate the effect of extreme values on your analysis.
"Trimming" data excludes the outlier values from your analysis. "Winsorizing" retains the responses in your basis, but caps numeric outliers so they fall at the edge of the main distribution.
A common request is to trim or Winsorize data to the 5th and 95th percentiles. In practice, however, survey data is often highly asymmetric, so clipping only the high end may be reasonable.
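As a rough illustration of percentile-based clipping outside Protobi, the sketch below computes the 5th and 95th percentile cut points and then clips only the high end. The function names and the response values are made up for this example, and the "inclusive" interpolation method is just one of several percentile conventions.

```python
from statistics import quantiles

def percentile_bounds(values):
    """Return the (5th, 95th) percentile cut points of values."""
    # n=20 splits the data into 5% steps; the first and last cut points
    # are the 5th and 95th percentiles.
    cuts = quantiles(values, n=20, method="inclusive")
    return cuts[0], cuts[-1]

def winsorize(values, low=None, high=None):
    """Cap values at the given bounds; None leaves that side unbounded."""
    capped = []
    for v in values:
        if low is not None:
            v = max(v, low)
        if high is not None:
            v = min(v, high)
        capped.append(v)
    return capped

def trim(values, low=None, high=None):
    """Drop values outside the given bounds; None leaves that side unbounded."""
    return [
        v for v in values
        if (low is None or v >= low) and (high is None or v <= high)
    ]

# Hypothetical right-skewed responses (not the survey data in this tutorial).
responses = [5, 8, 10, 12, 15, 20, 25, 30, 30, 35,
             40, 45, 50, 60, 70, 80, 90, 150, 300, 500]
low, high = percentile_bounds(responses)

# Asymmetric clipping: act on the high end only and leave the low end alone.
capped_high_only = winsorize(responses, high=high)
trimmed_high_only = trim(responses, high=high)
```

Passing only `high` keeps every low-end response untouched, which matches the asymmetric case described above.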
Example
In the example below, most physicians report under 100 patients per month. But a few (4%) report much higher numbers.
The screener termination criteria already bound the responses to be at least 5. We might clip answers above 100, as shown by the gold line.
Winsorizing
We can cap those answers to within a defined range by setting the "ceiling" and "floor" attributes. Press the circle edit icon, choose "Edit JSON..." and set "ceiling": 100.
The data is now bounded to the range [5,100]. The outlying values are not dropped, but are now counted as if they were equal to 100 and thus fall in the range "81 to 100" which has increased from 8% to 12%. The N size is still 100, but the mean is a bit lower now.
Note that the median did not change at all. In all but the most extreme cases, the median is robust to outliers and unaffected by Winsorizing because the extreme values stay on their side of the median.
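The effect on the summary statistics can be reproduced with a small sketch. The patient counts below are invented for illustration, and the capping is done with plain Python rather than Protobi's "ceiling" attribute.

```python
from statistics import mean, median

# Hypothetical monthly patient counts (not the survey data in this tutorial).
patients = [10, 20, 25, 30, 30, 40, 60, 250, 400]
CEILING = 100

# Winsorize the high end: values above the ceiling are counted as the ceiling.
capped = [min(v, CEILING) for v in patients]

print("N:     ", len(patients), "->", len(capped))        # basis is unchanged
print("mean:  ", round(mean(patients), 1), "->", round(mean(capped), 1))
print("median:", median(patients), "->", median(capped))  # stays on the same value
```

The capped values remain above the median, so the middle of the sorted list does not move, while the mean is pulled down.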
Trimming
Another approach is to ignore responses outside the main range. To do this we can set a filter which includes only responses that fall within the range (5, 100].
Here the basis is lower, N=96, reflecting that the outliers are excluded from the distribution. The mean is a little lower still. The median happens to stay at 30, but trimming may change the median if more values are removed from one end than the other. In Protobi, select "Edit JSON..." from the context menu and enter the following:
"filter": { "S8v2": { "$gt": 5, "$lte": 100 } }
This is MongoDB query syntax: "$gt" means "greater than" and "$lte" means "less than or equal to". The filter includes only responses where data column S8v2 is greater than 5 and less than or equal to 100.
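To make the filter semantics concrete, here is a minimal Python sketch of how such a query could be evaluated against rows of data. It is not Protobi's or MongoDB's implementation; the `matches` helper and the sample rows are hypothetical, and only the four basic comparison operators are handled. Multiple operators on the same field must all hold.

```python
import operator

# Comparison operators supported by this sketch.
OPS = {
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}

def matches(row, query):
    """Return True if row satisfies every condition on every field in query."""
    for field, conditions in query.items():
        value = row.get(field)
        if value is None:
            return False  # a missing answer never satisfies a comparison
        for op_name, bound in conditions.items():
            if not OPS[op_name](value, bound):
                return False
    return True

query = {"S8v2": {"$gt": 5, "$lte": 100}}

# Hypothetical respondent rows.
rows = [{"S8v2": 5}, {"S8v2": 6}, {"S8v2": 100}, {"S8v2": 101}, {"S8v2": 300}, {}]
kept = [row for row in rows if matches(row, query)]
```

With these rows, only the responses of 6 and 100 pass: 5 fails the strict "$gt" bound, and 101 and 300 fail "$lte".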
Recode outliers
Sometimes responses are entered honestly but in error. For instance, a respondent may have written they purchased their Tesla in the year "2081".
We might prefer to believe they meant to write "2018" rather than time traveled from the future to complete the survey. In this case one could recode "2081" to "2018" using the "Recode..." dialog in Protobi.
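Outside Protobi, the same kind of recode can be sketched as a simple lookup. The list of years is invented for illustration, and the mapping encodes the assumption that "2081" was a transposed "2018".

```python
# Hypothetical purchase years; 2081 is assumed to be a mistyped 2018.
purchase_years = [2017, 2018, 2081, 2019]

# Map each known-bad value to its corrected value; anything else passes through.
recode_map = {2081: 2018}
recoded = [recode_map.get(year, year) for year in purchase_years]
```

An explicit mapping keeps the correction visible and reversible, unlike silently editing the source data.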
Retain outliers and use a log scale
Just because numbers are atypical doesn't mean they are unreasonable. Here, it's possible a few physicians really do treat many more patients with this condition than most doctors do.
Many phenomena yield "long-tail" distributions where a few outliers legitimately exist. For instance in economics most people have modest wealth but a few have very high net worth, and to exclude them from analysis would be misleading.
Many "long-tail" distributions look approximately normal when the logarithm of the values is plotted. In Protobi you can set "Round By..." to log, which chooses small bin ranges for smaller numbers and larger bin ranges for larger numbers.
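The idea of log-scale binning can be sketched as follows. This is only an illustration using power-of-ten bins; the exact bin edges Protobi chooses for "Round By..." log are not described here, and the `log_bin` helper and sample values are hypothetical.

```python
from collections import Counter

def log_bin(value):
    """Return the (low, high) power-of-ten bin with low <= value < high."""
    if value < 1:
        raise ValueError("this sketch expects values of at least 1")
    low = 1
    # Integer arithmetic avoids floating-point rounding at exact powers of ten.
    while low * 10 <= value:
        low *= 10
    return (low, low * 10)

# Hypothetical long-tail responses.
values = [5, 8, 12, 30, 30, 45, 99, 100, 250, 1200]
counts = Counter(log_bin(v) for v in values)

for (low, high), n in sorted(counts.items()):
    print(f"{low} to <{high}: {n}")
```

Each bin is ten times wider than the one before it, so small values are resolved finely while the rare large values still fit on the same chart.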
Summary
This tutorial demonstrated three approaches to handling extreme values: trimming, Winsorizing, and retaining them but plotting on a logarithmic scale.
Trimming makes a lot of sense when you simply don't believe the answers, e.g., a traveler who says he makes 999 commercial flights per year.
Retaining the data makes sense when there legitimately may be high values, e.g., a few business travelers may actually take 100+ flights per year. A log scale may be useful.
Winsorizing makes sense when we want to retain the high-value responses but not take them too literally, such as when weighting physicians by self-reported patient volumes.
See related
To remove certain respondents completely from the project view, see Filter or remove respondents.