What You Need to Know About Analyzing Compensation Survey Data to Ensure Data Integrity
In the current tight US job market, paying employees fairly relative to their job responsibilities and local markets is becoming extremely important for both hiring and retention. Compensation professionals often refer to the process of determining that fair pay as market pricing.
How does an organization find out what a fair market price is for a certain job, given attributes such as industry and physical location? This data is often available from compensation surveys conducted by consulting firms or other third parties, in which participating organizations report their actual compensation data matched to a common set of jobs. This data is then analyzed, cleaned (to be sure that the reported data does match the job), and reported as statistics such as averages and percentiles.
The data cleaning process (part of what we call data integrity analysis) is the most important element of assuring high data quality. Although there are many sources of errors in data submitted by organizations (data entry errors, mismatched jobs, partial-year data, etc.), these errors nearly always manifest as out-of-range data (suspect data), which is detectable if proper care and methods are used. Suspect data should be carefully examined on a case-by-case basis to determine whether it is correct and whether it should be kept in the dataset for analysis.
Suspect data can be classified in two ways: anomalous and outlier. Both types have out-of-range characteristics, but they differ in their underlying causes.
Consider Larry Page, the CEO of Alphabet (Google’s parent) with revenues exceeding $25B, reporting an annual salary of $1 in the compensation survey. Or Warren Buffett, reporting an annual salary of $100,000. As outlandish as these numbers may sound, they are correct. These are anomalies, and in a dataset they would appear as extreme outliers. We call such data anomalous. Anomalous data is correct, but it is not representative of the market and should be excluded, just as extreme outlier data should be excluded for the same reason.
It would appear that a public company CEO making a $1 annual salary is as rare as a unicorn, and not worthy of attention. That in itself is not true, but anomalous data is a lot more prevalent in data reported by private companies. The owners of such companies have a lot more freedom in setting artificial compensation levels (or compensation structures such as salary, variable pay, and other compensation) for a wide variety of reasons, with tax planning being the most prominent. The 2018 tax law makes such cases even more likely by creating new tax rates for direct compensation and distribution-based compensation.
The other type of outlier data is often based on data errors. But irrespective of the origin, and whether the data is correct or not, it is important not to use reported data if it is not reflective of the market. So outlier data, whether anomalous or error-driven, should be treated the same way in data integrity analysis. Such analysis should follow two steps: identification of outliers, and resolution to fix any erroneous data. Care should be taken that every data issue is identified and fully resolved, and any data with an unresolved issue should be excluded from the dataset.
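The identification step can be sketched in a few lines of Python. The article does not name a specific detection method, so this example assumes one common choice, Tukey's interquartile-range fences; the function name `flag_suspect`, the multiplier `k=1.5`, and the sample salaries are all hypothetical and chosen only for illustration.

```python
from statistics import quantiles

def flag_suspect(salaries, k=1.5):
    """Return the positions of values outside Tukey's fences.

    Assumption: the IQR rule stands in for whatever method a survey
    provider actually uses; k=1.5 is the conventional default.
    """
    q1, _, q3 = quantiles(salaries, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, s in enumerate(salaries) if s < low or s > high]

# Hypothetical submissions for one job: one likely error (160,000)
# and one anomalous but possibly correct value (a $1 salary).
reported = [78_000, 81_000, 79_500, 160_000, 82_000, 1, 80_500, 77_000]
suspect = flag_suspect(reported)
print(suspect)  # positions of the values to review case by case
```

A rule like this only flags values. Whether a flagged value is corrected, kept, or excluded is still decided through the case-by-case review described above.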
It is important to understand that even rare data errors, if not resolved, can cause serious inaccuracies in the reported data and cast doubt on it. Let us say that the reported average salary for a job is $80,000 across the 20 companies in the dataset, but one outlier datapoint included in that average is $160,000 (the correct number is $82,000). Using the correct data, the average would have been $76,100, implying an error of about 5%, which is significant. However, most organizations look for the data cuts that match their industry, size, or location. Let us say that a particular cut containing the erroneous datapoint includes 6 companies and shows an average of $84,000. The correct value in this case would have been $71,000, an error of over 18%, which is not acceptable. Such a large error is likely to creep into every cut that includes that company’s data, with serious consequences for data quality.
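The arithmetic in this example can be checked with a short Python sketch. The helper names `corrected_average` and `pct_error` are hypothetical; the figures are the ones given above.

```python
def corrected_average(reported_avg, n, wrong_value, right_value):
    """Recompute an average after replacing one erroneous datapoint."""
    total = reported_avg * n
    return (total - wrong_value + right_value) / n

def pct_error(reported_avg, true_avg):
    """Error of the reported average relative to the corrected one, in percent."""
    return (reported_avg - true_avg) / true_avg * 100

# Full dataset: 20 companies, reported average $80,000.
full = corrected_average(80_000, 20, 160_000, 82_000)
# Data cut: 6 companies, reported average $84,000.
cut = corrected_average(84_000, 6, 160_000, 82_000)

print(full, round(pct_error(80_000, full), 1))  # 76100.0 5.1
print(cut, round(pct_error(84_000, cut), 1))    # 71000.0 18.3
```

The same $78,000 overstatement is spread over fewer companies in the cut, which is why the error grows from about 5% to over 18%.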
In summary, data quality depends on assiduous identification and resolution of every source of incorrect data, anomalous or otherwise.
Does your organization need help in aligning your compensation and benefits strategies with employee expectations? Learn how PeriscopeIQ’s compensation and benefits survey platform ensures your compensation data is accurate and compliant.