My Journey into Data Science: I suppose… data cleaning hasn’t caught up with data complexity

Most enterprise datasets are not clean by design. They are assembled.

Pulled from multiple systems, shaped by operational constraints, and often optimized for storage or reporting rather than modelling. Missing values are not exceptions in these environments. They are expected. Yet the way they are handled has remained largely unchanged.

I suppose… the limitation is not in awareness, but in approach.

Traditional imputation methods operate at the column level. Mean, median, forward fill, and rule-based substitution all assume that each feature can be corrected independently. This works in controlled datasets, but breaks down when relationships between variables are strong. In most real-world data, missingness is conditional, not random: whether a value is missing often depends on what sits in the other columns.
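To make that concrete, here is a minimal sketch using NumPy. The data is made up for illustration: a hypothetical `spend` column that is roughly ten times `tenure`, with `spend` missing only for long-tenure customers. It shows how a column-level mean fill misses by a wide margin when the gaps are conditional.

```python
import numpy as np

# Hypothetical toy data: spend is exactly 10x tenure, and spend is missing
# only for long-tenure customers (missingness depends on another column).
tenure = np.array([1.0, 2.0, 3.0, 4.0, 8.0, 9.0, 10.0])
spend_true = 10.0 * tenure
spend = spend_true.copy()
spend[tenure >= 8] = np.nan

# Column-level fix: replace every gap with the mean of the observed values.
missing = np.isnan(spend)
col_mean = np.nanmean(spend)
spend_mean_filled = np.where(missing, col_mean, spend)

# Average absolute error on the filled cells only.
mean_fill_error = np.abs(spend_mean_filled[missing] - spend_true[missing]).mean()
print(col_mean, mean_fill_error)  # 25.0 65.0
```

The observed rows are all short-tenure customers, so the mean is pulled low and every filled value lands far from the truth. Nothing in the column itself signals the problem.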

Let’s go back to basics and look at how machine learning approaches handle this differently.

K-Nearest Neighbors imputation is usually the easiest step up from traditional methods. Instead of filling in a missing value using a global average, it looks for rows in the dataset that are most similar to the one with missing data. Think of it as finding lookalike records. If a value is missing for a customer, the algorithm finds other customers who behave similarly based on the available fields, and uses their values to fill the gap. This works well when your data has clear groupings or patterns. But as the number of columns increases, it becomes harder to define what "similar" really means, and because every incomplete row has to be compared against the rest of the dataset, the method can slow down significantly.
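The lookalike idea can be sketched in a few lines of NumPy. This is a simplified illustration, not a production implementation: `knn_impute` is a hypothetical helper, the data is invented, and the features are left unscaled, which you would normally want to fix before computing distances. In practice, scikit-learn's `KNNImputer` covers the same ground.

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill each NaN with the mean of that column over the k most similar rows."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    for i in range(X.shape[0]):
        for j in np.where(np.isnan(X[i]))[0]:
            dists = []
            for r in range(X.shape[0]):
                # Donors must be other rows that have column j observed.
                if r == i or np.isnan(X[r, j]):
                    continue
                # Compare only on fields observed in both rows.
                shared = ~np.isnan(X[i]) & ~np.isnan(X[r])
                if shared.any():
                    d = np.sqrt(np.mean((X[i, shared] - X[r, shared]) ** 2))
                    dists.append((d, r))
            dists.sort()  # nearest donors first
            nearest = [r for _, r in dists[:k]]
            if nearest:
                filled[i, j] = X[nearest, j].mean()
    return filled

# Hypothetical customers in two clear groups; the third row has a gap.
X_demo = np.array([
    [1.0, 10.0, 100.0],
    [1.1, 11.0, 110.0],
    [0.9,  9.0, np.nan],
    [5.0, 50.0, 500.0],
    [5.2, 52.0, 520.0],
])
knn_filled = knn_impute(X_demo, k=2)
print(knn_filled[2, 2])  # 105.0
```

The gap is filled from the two rows in the same group (100 and 110), not from the column mean of 307.5 that a global average would have used.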

Iterative imputation, often referred to as MICE (Multiple Imputation by Chained Equations), takes a more structured approach. Instead of treating each column separately, it tries to understand how columns relate to each other. It fills in missing values step by step. For example, if one column has missing values, it uses all the other columns to predict them. Then it moves to the next column and repeats the process. This cycle runs multiple times until the values stop changing much. The strength of this method is that it captures relationships across the dataset, not just similarities between rows. In many enterprise datasets where features are connected, this leads to more realistic results. The downside is that it is more complex to run and takes longer to compute.
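Here is a rough sketch of that cycle in NumPy, under a few assumptions of mine: plain linear regression as the per-column model, a fixed number of rounds instead of a convergence check, and a single deterministic pass rather than the multiple randomized imputations that full MICE produces. `iterative_impute` is a hypothetical helper and the data is invented. scikit-learn offers an `IterativeImputer` for real use, currently behind its experimental import.

```python
import numpy as np

def iterative_impute(X, n_rounds=10):
    """Chained linear-regression imputation (single, deterministic imputation)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Start from column means so every cell has a value.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_rounds):
        for j in range(X.shape[1]):
            if not missing[:, j].any():
                continue
            # Predictors: an intercept plus the current values of all other columns.
            others = np.delete(filled, j, axis=1)
            A = np.column_stack([np.ones(len(others)), others])
            obs = ~missing[:, j]
            # Fit on rows where column j was observed, predict where it was not.
            coef, *_ = np.linalg.lstsq(A[obs], filled[obs, j], rcond=None)
            filled[missing[:, j], j] = A[missing[:, j]] @ coef
    return filled

# Hypothetical data where y = 2x + 1, with one gap in each column.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X_chain = np.column_stack([x, 2 * x + 1])
X_chain[1, 0] = np.nan   # true value 2
X_chain[4, 1] = np.nan   # true value 11
chained_filled = iterative_impute(X_chain)
print(chained_filled[1, 0], chained_filled[4, 1])
```

Each round starts from slightly better guesses than the last, which is why the values settle after a few cycles instead of staying at the initial column means.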

Deep learning approaches, such as autoencoders, work in a very different way. Instead of explicitly comparing rows or building step-by-step predictions, they try to learn the overall structure of the dataset. The model compresses the data into a smaller representation and then learns how to rebuild it. During this process, it also learns how to fill in missing values based on patterns it has seen across the entire dataset. This makes it powerful for complex data where relationships are not obvious. However, it is harder to explain how the values are being filled, and it requires more effort to train and maintain properly.
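As a very small sketch of the compress-and-rebuild idea, here is a one-hidden-layer autoencoder written directly in NumPy. Everything in it is an assumption for illustration: the helper name `autoencoder_impute`, the layer size, the learning rate, the synthetic data, and the choice to score the reconstruction only on observed cells. A real project would more likely use a deep learning framework and a larger network.

```python
import numpy as np

def autoencoder_impute(X, hidden=2, lr=0.05, epochs=2000, seed=0):
    """Tiny autoencoder trained on observed cells; returns (imputed, losses)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    mask = (~missing).astype(float)
    mean, std = np.nanmean(X, axis=0), np.nanstd(X, axis=0)
    std[std == 0] = 1.0
    # Standardize; missing inputs start at 0, which is the column mean.
    Xin = np.where(missing, 0.0, (X - mean) / std)

    rng = np.random.default_rng(seed)
    n_cols = X.shape[1]
    W1 = rng.normal(scale=0.1, size=(n_cols, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(scale=0.1, size=(hidden, n_cols))
    b2 = np.zeros(n_cols)

    losses = []
    for _ in range(epochs):
        H = np.tanh(Xin @ W1 + b1)        # encode: compress each row
        out = H @ W2 + b2                 # decode: rebuild the row
        err = (out - Xin) * mask          # only observed cells count
        losses.append(float((err ** 2).sum() / mask.sum()))
        # Backpropagate the masked mean squared error.
        d_out = 2.0 * err / mask.sum()
        d_hidden = (d_out @ W2.T) * (1.0 - H ** 2)
        W2 -= lr * (H.T @ d_out)
        b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * (Xin.T @ d_hidden)
        b1 -= lr * d_hidden.sum(axis=0)

    # Rebuild every row, undo the scaling, and keep observed values as they were.
    recon = (np.tanh(Xin @ W1 + b1) @ W2 + b2) * std + mean
    return np.where(missing, recon, X), losses

# Hypothetical data: three columns driven by one hidden factor, ~10% of cells removed.
rng = np.random.default_rng(0)
z = rng.normal(size=200)
X_full = np.column_stack([z,
                          2 * z + 0.1 * rng.normal(size=200),
                          -z + 0.1 * rng.normal(size=200)])
X_demo = X_full.copy()
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

imputed, losses = autoencoder_impute(X_demo)
print(losses[0], losses[-1])
```

Notice that there is no rule anywhere saying which column predicts which. The filled values come out of the learned weights, which is exactly why this approach is harder to explain than the previous two.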

In most enterprise environments today, adoption of these methods is still limited. Deterministic pipelines dominate because they are predictable and easy to audit. Machine learning approaches introduce variability, which creates hesitation. But the value is not in replacing everything at once. It is in applying these methods where traditional logic starts to break.

How are you currently handling messy datasets in your environment, and where do you see these approaches realistically fitting in?

Thursday, March 26, 2026
