Comparing Data Enrichment to Data Cleansing in Three Ways

Published by Claire Ponsaran

Data cleansing may be an essential step in the data enrichment process, but it doesn’t define what data enrichment is or why it matters in data management. The two processes resemble each other in some ways, but their differences cannot be ignored. To better understand them, let’s take a closer look at how they differ and where they overlap.

#1 Data cleansing doesn’t add new information to the data set; it cleans up “dirty data” and prepares it for data enrichment and processing.

A study by DMC Software revealed that more than 50 percent of businesses stored their data in unsecured systems. Of that number, 11 percent used email to store their data and 5 percent still kept their records on paper. As a result, whatever value their data has will gradually fade over time. Without a data cleansing strategy, their corporate data eventually becomes dirty or coarse.

Dirty data is full of errors and duplicates, while coarse data has many missing values or incomplete information. Dirty data needs corrections, and its duplicates must be found and deleted. Coarse data, on the other hand, needs follow-ups and additional research. In effect, data cleansing takes care of dirty data, while data enrichment adds more details to what’s already found in coarse data.
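To make the distinction concrete, here is a minimal Python sketch that separates the two problems in a small contact list. The records, the field names, and the `triage` helper are all hypothetical and exist only for illustration; exact-match duplicate detection is a simplifying assumption, since real duplicates are rarely identical.

```python
# Hypothetical contact records; field names are assumptions for illustration.
FIELDS = ("name", "email", "phone")

records = [
    {"name": "Ana Cruz", "email": "ana@example.com", "phone": "555-0101"},
    {"name": "Ana Cruz", "email": "ana@example.com", "phone": "555-0101"},  # exact duplicate
    {"name": "Ben Diaz", "email": "", "phone": "555-0102"},                 # missing email
    {"name": "Cara Lim", "email": "cara@example.com", "phone": None},       # missing phone
]


def triage(rows):
    """Split rows into unique rows, exact duplicates, and incomplete rows."""
    seen = set()
    unique, duplicates = [], []
    for row in rows:
        key = tuple(row.get(field) for field in FIELDS)
        if key in seen:
            duplicates.append(row)   # dirty data: a cleansing task
        else:
            seen.add(key)
            unique.append(row)
    # Coarse data: rows with an empty or missing field are enrichment candidates.
    incomplete = [r for r in unique if any(not r.get(field) for field in FIELDS)]
    return unique, duplicates, incomplete


unique, duplicates, incomplete = triage(records)
print(len(unique), len(duplicates), len(incomplete))
```

In this sketch the duplicate row is a cleansing problem, while the two rows with gaps are the ones that would go on to follow-up research and enrichment.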

#2 Data cleansing techniques involve a mix of human intelligence and machine learning, just as data enrichment does.

Data cleansing involves more than deleting duplicates and correcting erroneous or obsolete data. It also involves changing values so that they follow a uniform pattern or code, which gives you a more accurate count of the key fields in your database.

For example, people who filled out a questionnaire answered “USA” when asked about the country they’re living in, but the form requires that they answer a two-letter country code. Cleaning this up means changing their answers to “US”, so their answers can be counted together.
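This kind of standardization can be sketched in a few lines of Python. The lookup table and the `normalize_country` function below are hypothetical; a real project would need a far larger list of variants, and anything the table doesn’t cover is left for a person to review.

```python
# Hypothetical lookup table mapping free-text answers to two-letter codes.
COUNTRY_CODES = {
    "us": "US",
    "usa": "US",
    "u.s.a.": "US",
    "united states": "US",
}


def normalize_country(answer):
    """Return the two-letter code for a known variant, or None for human review."""
    return COUNTRY_CODES.get(answer.strip().lower())


answers = ["USA", "us", " United States ", "U.S.A.", "Amerika"]
cleaned = [normalize_country(a) for a in answers]
print(cleaned)  # ['US', 'US', 'US', 'US', None]
```

Returning `None` for unknown answers, rather than guessing, keeps the unclear cases visible so that someone can decide what they mean.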

What if the respondents misspelled their answers? Let’s say respondents are asked about smartphone brands. Out of 100, 50 answered “iPhone”, 20 said “Samsung”, 10 said “Apple”, 5 said “Huawei”, and 5 said “Hawei”. If you’re using a spreadsheet to count them, these answers will be counted as five separate brands. On its own, the software doesn’t recognize that “iPhone” is a brand owned by Apple or that “Hawei” is just a misspelling of “Huawei”. But humans do.
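The effect on the tally is easy to reproduce. The sketch below uses the answer counts listed above; the `CANONICAL` mapping is a hypothetical stand-in for the corrections a human reviewer would supply.

```python
from collections import Counter

# Survey answers, using the counts given in the example above.
answers = (
    ["iPhone"] * 50
    + ["Samsung"] * 20
    + ["Apple"] * 10
    + ["Huawei"] * 5
    + ["Hawei"] * 5
)

# A naive count treats every distinct spelling as its own brand.
raw = Counter(answers)

# Hypothetical human-curated corrections: product name to brand, typo to brand.
CANONICAL = {"iPhone": "Apple", "Hawei": "Huawei"}

# Apply the corrections, leaving every other answer unchanged.
cleaned = Counter(CANONICAL.get(answer, answer) for answer in answers)

print(len(raw))       # 5 distinct "brands" before cleaning
print(dict(cleaned))  # {'Apple': 60, 'Samsung': 20, 'Huawei': 10}
```

The code only applies the mapping. Deciding that “iPhone” belongs under Apple is still the human’s call.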

To reduce the time it takes to complete the job, human agents use software to search, filter, and sort the data when they’re cleaning it up. But the decision to delete obsolete and duplicate data always rests in human hands. It’s the same with data enrichment. Human agents will need to use software to search, sort, and classify data, but the job of extracting information and matching the values to the right data entry will always be assigned to a human worker.
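One way to picture this division of labor is a script that only suggests corrections and leaves every decision to a person. The sketch below uses `difflib.get_close_matches` from the Python standard library; the `KNOWN_BRANDS` list, the `suggest_corrections` helper, and the 0.8 similarity cutoff are assumptions chosen for illustration.

```python
import difflib

# Hypothetical reference list of accepted brand names.
KNOWN_BRANDS = ["Apple", "Samsung", "Huawei"]


def suggest_corrections(values, known=KNOWN_BRANDS, cutoff=0.8):
    """Map each unrecognized value to its closest known brand, or None.

    Nothing is changed automatically; the result is a review queue.
    """
    suggestions = {}
    for value in set(values):
        if value in known:
            continue  # already a valid entry, nothing to review
        match = difflib.get_close_matches(value, known, n=1, cutoff=cutoff)
        suggestions[value] = match[0] if match else None
    return suggestions


review_queue = suggest_corrections(["iPhone", "Samsung", "Apple", "Huawei", "Hawei"])
for value, suggestion in sorted(review_queue.items()):
    print(f"{value!r}: suggested {suggestion!r}, awaiting human decision")
```

Under these assumptions the script flags “Hawei” as a likely misspelling of “Huawei”, but it has no suggestion for “iPhone”, because string similarity says nothing about which company owns a product. That gap is where the human reviewer comes in.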

#3 You can’t clean up your company’s database once and just forget about it.

Data must be checked and cleaned up on a regular schedule. In comparison, data enrichment can happen either in real time or through a one-time batch update. Because the cleanup never really ends, data cleansing will always require humans and machines to work together.

Data entry workers and data scientists will continue to play an important role in data management. Artificially intelligent technology will continue to advance, but it will take another millennium before we can create A.I. that can perfectly mimic the way the human mind functions.

Thinking about outsourcing a large volume of work for your data enrichment or data cleansing project? Fill out the form on our Contact page and our consultant will get in touch with you shortly.