My Journey into Data Science

KumarsMLJourney is where I reflect on data, predictive systems, and how analytics actually works in real-world environments. With 15 years in tech operations and 10 years in data science, spanning descriptive through prescriptive analytics, my focus has shifted from building models to understanding how they hold up in production. Most challenges aren’t technical; they sit between data, process, and people. This blog explores those gaps and the realities behind making data truly work.

Sunday, March 29, 2026

I suppose... what makes data easy to explain can make it harder to learn from

What helps us explain the past does not always help us learn from it.

A dataset prepared for reporting and a dataset prepared for modelling are built with very different intentions, even if they appear similar on the surface.

A reporting dataset is designed to describe what has already happened. The priority is consistency, aggregation, and interpretability. Data is grouped, summarized, and aligned to business definitions. Daily totals, weekly averages, pre-calculated KPIs, and standardized dimensions make it easier to consume and explain. Missing values are handled in ways that keep outputs stable, often by replacing nulls with zeros or carrying values forward so dashboards remain intact.

A modelling dataset is designed differently. The objective is not to summarize, but to preserve signal. Variation, relationships, and structure matter more than simplicity. Instead of smoothing the data, it needs to retain enough detail for patterns to be learned.

When a reporting dataset is used for modelling, the differences start to show.

Aggregation reduces variance. Transaction-level detail becomes compressed, smoothing out outliers and weakening relationships between variables. Patterns that exist at a finer level become less visible once they are averaged.
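
To make the point concrete, here is a minimal sketch using only the Python standard library. The transaction amounts are invented for illustration; the only thing it tries to show is how much spread disappears once records are rolled up into daily averages.

```python
from statistics import mean, pvariance

# Hypothetical transaction amounts for four days (illustrative values only).
transactions = {
    "day_1": [10, 12, 250, 11],
    "day_2": [9, 14, 13, 12],
    "day_3": [11, 10, 300, 15],
    "day_4": [12, 13, 11, 10],
}

# Transaction-level view: every record keeps its own value.
all_amounts = [amt for day in transactions.values() for amt in day]

# Reporting view: one average per day.
daily_averages = [mean(day) for day in transactions.values()]

transaction_variance = pvariance(all_amounts)
daily_variance = pvariance(daily_averages)

print(f"variance at transaction level: {transaction_variance:.1f}")
print(f"variance of daily averages:    {daily_variance:.1f}")
print(f"largest transaction: {max(all_amounts)}, "
      f"largest daily average: {max(daily_averages)}")
```

The two large transactions are still in the data, but after averaging they no longer look like outliers. A model trained on the daily view never gets to see them as individual events.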

Handling of missing values also takes on a different meaning. Replacing nulls with zeros or carrying forward values may stabilize reporting, but it changes the underlying signal. A missing value and a true zero are treated the same, even though they represent different conditions.
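
A small sketch of that difference, again with made-up numbers. It assumes a list of daily sales where None means "no reading was captured" and 0 means "the store was open and sold nothing".

```python
from statistics import mean

# Hypothetical daily sales; None = no reading, 0 = a genuine zero.
raw_sales = [120, None, 0, 115, None, 130]

# Reporting-style fill: every gap becomes 0 so the dashboard has no holes.
zero_filled = [0 if v is None else v for v in raw_sales]

# Modelling-style alternative: keep the gap visible as its own flag.
was_missing = [v is None for v in raw_sales]
observed = [v for v in raw_sales if v is not None]

print("mean after zero fill:   ", mean(zero_filled))
print("mean of observed values:", mean(observed))
print("zeros after fill:", zero_filled.count(0),
      "| true zeros:", raw_sales.count(0))
```

After the fill, one real zero has turned into three, and the average has dropped with it. Keeping a flag such as was_missing is one simple way to let a model tell the two conditions apart.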

Time alignment introduces another layer of complexity. Reporting datasets are structured around business periods, while modelling often depends on precise prediction points. This can create situations where information from the future is unintentionally included in features.
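
Here is a minimal sketch of how that can happen. The orders and the prediction date are hypothetical; the idea is simply to compare a feature built on the business month with one built only on what was known at the prediction point.

```python
from datetime import date

# Hypothetical order records: (order_date, amount).
orders = [
    (date(2026, 3, 2), 100),
    (date(2026, 3, 9), 150),
    (date(2026, 3, 16), 90),
    (date(2026, 3, 27), 400),
]

# Assumed point in time at which the model has to make its prediction.
prediction_date = date(2026, 3, 20)

# Reporting-style feature: total for the whole business month of March.
month_total = sum(amt for d, amt in orders if (d.year, d.month) == (2026, 3))

# Point-in-time feature: only orders placed strictly before the prediction date.
known_total = sum(amt for d, amt in orders if d < prediction_date)

# Whatever is left over was not knowable when the prediction was made.
leaked_amount = month_total - known_total
print(month_total, known_total, leaked_amount)
```

The monthly total looks like a perfectly reasonable feature in a report. Used for a prediction made on March 20, it quietly includes an order from March 27.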

There is also a structural difference. Reporting datasets are typically wide, clean, and designed for readability. Modelling datasets tend to be more granular and may require additional transformations before they can be used effectively.

The challenge is that reporting datasets feel complete. They look clean, consistent, and trustworthy because they have already been processed to remove inconsistencies.

But that same processing often removes the variation and relationships that models depend on.

I suppose...

A lot of what we consider “ready” data depends on the context in which it is used. Data that is ready for reporting is optimized for clarity and stability. Data that is ready for modelling is optimized for learning and pattern detection.

When one is used in place of the other, the limitations are not always obvious at first. The dataset appears clean and well-structured, but some of the underlying signal may already have been simplified or altered.

This is where the distinction becomes important. Reporting benefits from aggregation and standardization. Modelling benefits from preserving detail and relationships. Both approaches are valid, but they serve different purposes.

Because what helps us explain the past does not always help us learn from it.

Thursday, March 26, 2026

I suppose… data cleaning hasn’t caught up with data complexity

Most enterprise datasets are not clean by design. They are assembled.

Pulled from multiple systems, shaped by operational constraints, and often optimized for storage or reporting rather than modelling. Missing values are not exceptions in these environments. They are expected. Yet the way they are handled has remained largely unchanged.

I suppose... the limitation is not in awareness, but in approach.

Traditional imputation methods operate at the column level. Mean, median, forward fill, and rule-based substitution all assume that each feature can be corrected independently. This works in controlled datasets, but breaks down when relationships between variables are strong. In most real-world data, missingness is conditional, not random: whether a value is missing often depends on what the other fields look like.

Let’s go back to basics and look at how machine learning approaches handle this differently.

K-Nearest Neighbors imputation is usually the easiest step up from traditional methods. Instead of filling in a missing value using a global average, it looks for rows in the dataset that are most similar to the one with missing data. Think of it as finding lookalike records. If a value is missing for a customer, the algorithm finds other customers who behave similarly based on the available fields, and uses their values to fill the gap. This works well when your data has clear groupings or patterns. But as the number of columns increases, it becomes harder to define what similar really means, and the method can slow down significantly.
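
The lookalike idea can be sketched in a few lines of plain Python. This is a simplified illustration, not a production implementation: the customer records are invented, the function name knn_impute is my own, and the features are left unscaled, which a real pipeline would normally address before measuring distance.

```python
import math

def knn_impute(rows, k=2):
    """Fill None values with the mean of that column across the k most similar rows."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for col, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for j, other in enumerate(rows):
                if j == i or other[col] is None:
                    continue
                # Compare only on fields that both rows actually have.
                shared = [c for c in range(len(row))
                          if c != col and row[c] is not None and other[c] is not None]
                if not shared:
                    continue
                dist = math.sqrt(
                    sum((row[c] - other[c]) ** 2 for c in shared) / len(shared)
                )
                candidates.append((dist, other[col]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                filled[i][col] = sum(nearest) / len(nearest)
    return filled

# Hypothetical customers: (visits per month, average basket, monthly spend).
customers = [
    (2, 20.0, 40.0),
    (3, 22.0, 66.0),
    (20, 5.0, 100.0),
    (21, 6.0, None),   # spend is missing for a frequent, small-basket customer
    (19, 5.5, 104.5),
]

imputed = knn_impute(customers, k=2)
global_mean = sum(r[2] for r in customers if r[2] is not None) / 4
print("KNN fill:", imputed[3][2], "| global mean fill:", global_mean)
```

The missing spend is filled from the two other frequent shoppers rather than from the whole customer base, which is the whole point of the method. Libraries such as scikit-learn ship a ready-made KNNImputer, so in practice you would rarely write this by hand.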

Iterative imputation, often referred to as MICE (Multiple Imputation by Chained Equations), takes a more structured approach. Instead of treating each column separately, it tries to understand how columns relate to each other. It fills in missing values step by step. For example, if one column has missing values, it uses all the other columns to predict them. Then it moves to the next column and repeats the process. This cycle runs multiple times until the values stop changing much. The strength of this method is that it captures relationships across the dataset, not just similarities between rows. In many enterprise datasets where features are connected, this leads to more realistic results. The downside is that it is more complex to run and takes longer to compute.
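
A stripped-down sketch of that cycle is below. It is deliberately much simpler than full MICE: only two columns, a plain least-squares line as the predictor, and a single deterministic fill rather than multiple imputed datasets. The function names and the data are my own assumptions for illustration.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def refit_column(target, predictor, missing):
    """Re-predict the missing entries of target from predictor; return the largest change."""
    slope, intercept = fit_line(
        [p for p, m in zip(predictor, missing) if not m],
        [t for t, m in zip(target, missing) if not m],
    )
    largest = 0.0
    for i, m in enumerate(missing):
        if m:
            new_value = slope * predictor[i] + intercept
            largest = max(largest, abs(new_value - target[i]))
            target[i] = new_value
    return largest

def iterative_impute(xs, ys, max_rounds=20, tol=1e-6):
    x_missing = [v is None for v in xs]
    y_missing = [v is None for v in ys]
    # Start from column means, the guess a traditional method would stop at.
    x_mean = mean(v for v in xs if v is not None)
    y_mean = mean(v for v in ys if v is not None)
    x = [x_mean if m else v for v, m in zip(xs, x_missing)]
    y = [y_mean if m else v for v, m in zip(ys, y_missing)]
    for _ in range(max_rounds):
        change = max(refit_column(x, y, x_missing),
                     refit_column(y, x, y_missing))
        if change < tol:   # values have stopped moving
            break
    return x, y

# Hypothetical columns where the second is roughly twice the first.
units = [1, 2, None, 4, 5, 6]
revenue = [2, 4, 6, None, 10, 12]
filled_units, filled_revenue = iterative_impute(units, revenue)
print(round(filled_units[2], 3), round(filled_revenue[3], 3))
```

The first pass starts from column means, which ignore the relationship between the two columns. Each later pass uses the other column's latest values, so the fills drift toward what the relationship implies rather than what the average suggests.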

Deep learning approaches, such as autoencoders, work in a very different way. Instead of explicitly comparing rows or building step-by-step predictions, they try to learn the overall structure of the dataset. The model compresses the data into a smaller representation and then learns how to rebuild it. During this process, it also learns how to fill in missing values based on patterns it has seen across the entire dataset. This makes it powerful for complex data where relationships are not obvious. However, it is harder to explain how the values are being filled, and it requires more effort to train and maintain properly.

In most enterprise environments today, adoption is still limited. Deterministic pipelines dominate because they are predictable and easy to audit. Machine learning approaches introduce variability, which creates hesitation. But the value is not in replacing everything at once. It is in applying these methods where traditional logic starts to break.

How are you currently handling messy datasets in your environment, and where do you see these approaches realistically fitting in?

I suppose… we are building faster than we can stabilize

There’s a noticeable shift happening in how ideas move through organizations.

Prototypes are no longer slow. They are immediate.

Functional leaders and teams can now build working solutions in hours. Dashboards, workflows, and even AI-driven tools can be put together quickly enough to validate an idea almost instantly.

On the surface, this looks like acceleration.

More ideas are tested.
More concepts are demonstrated.
More things appear to be working.

But the moment something works, even partially, it starts to carry expectation.
What begins as a quick experiment quickly turns into something people want to scale, integrate, and rely on. The transition from “this is interesting” to “can we deploy this?” happens much faster than before.

That is where the tension starts to show.

Because while creation has become fast, production has not changed at the same pace. Deployment still requires structure. Security, scalability, ownership, and integration do not move at prototype speed.

So a gap begins to form.

Not between idea and execution, but between what works and what is actually ready.

As more of these solutions emerge, another pattern starts to develop. Things that are not fully productionized begin to get used in small, practical ways. A script here, a workflow there, something that supports a decision because it is “good enough for now.”

Individually, these feel harmless.
But over time, they begin to shape how work gets done.

I suppose...

What we are starting to see is less about failure and more about how the system is evolving. When prototypes become easier to build, more of them naturally move closer to real usage, even if they were not originally designed for it.

Over time, this creates a layer of solutions that sit somewhere between experiment and production. They work well enough to be useful, but they may not have the structure, ownership, or stability that production systems typically require.

This also changes how work flows into engineering. Instead of building from clearly defined requirements, there is often a need to reinterpret what was created, understand the assumptions behind it, and reshape it for a more stable environment.

At the same time, expectations begin to shift. When something can be built quickly, it is easy to assume it can also be deployed quickly. But the requirements for something to work once and something to work consistently are quite different.

None of this is necessarily a problem on its own. It is more a reflection of how much faster the front end of innovation has become compared to the systems that support it.

The question is less about whether this is right or wrong, and more about how well the organization can absorb this new pace without creating unintended dependencies or instability over time.

#VibeCoding #MVP #Prototypes

I suppose… models don’t fail, processes do

Forecasting conversations often begin with a very familiar question.

How accurate is the forecast?

Accuracy is important. But in operational environments, accuracy alone rarely determines whether a forecast is usable. Production teams care just as much about stability, predictability, and clarity around when the number becomes final.

In volatile environments, this tension becomes obvious. Forecasts can be regenerated frequently as new data arrives. Each refresh may capture the latest trend, seasonality, or anomaly. From a modelling perspective this is valuable. From an operational perspective it can create confusion. If the number changes every week, planning becomes difficult.

That is why many organizations end up building layers of manual adjustment around their forecasts. Spreadsheets become the buffer between analytical output and operational reality. They allow teams to smooth volatility, introduce judgement, and stabilize numbers before they reach production planning.

Excel becomes less a tool and more a negotiation space.

I suppose...

A lot of what we describe as forecasting complexity is not really a modelling problem. It is a coordination problem between analytics and operations.
Models can generate forecasts continuously, but operations still require a clear moment when the forecast becomes the official planning number.
Without that boundary, every refresh introduces uncertainty.
That is why some of the most important forecasting capabilities are not statistical at all. They are governance mechanisms. Reforecasting schedules, forecast locks, override tracking, and transparent comparison between automated output and current planning numbers all help create stability around the model.
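
To illustrate what a lock and an override log might look like in practice, here is a minimal sketch. The class name PlanningForecast and its methods are hypothetical, not taken from any particular planning tool, and the numbers are invented.

```python
from dataclasses import dataclass, field

@dataclass
class PlanningForecast:
    """Keeps the model output and the official planning number as separate values."""
    model_value: float
    planning_value: float = field(init=False)
    locked: bool = False
    overrides: list = field(default_factory=list)

    def __post_init__(self):
        # Until someone decides otherwise, the plan follows the model.
        self.planning_value = self.model_value

    def refresh(self, new_model_value):
        # The model may refresh at any time; the plan only moves while unlocked.
        self.model_value = new_model_value
        if not self.locked:
            self.planning_value = new_model_value

    def lock(self):
        self.locked = True

    def override(self, value, reason, author):
        # Every manual change is recorded instead of living in a side spreadsheet.
        self.overrides.append({"from": self.planning_value, "to": value,
                               "reason": reason, "author": author})
        self.planning_value = value

    def gap(self):
        # Transparent comparison between automated output and the current plan.
        return self.model_value - self.planning_value

forecast = PlanningForecast(model_value=1000.0)
forecast.refresh(1050.0)      # plan follows the model while unlocked
forecast.lock()               # this is now the official planning number
forecast.refresh(1200.0)      # model keeps adapting, plan stays put
forecast.override(1100.0, reason="known promotion", author="planner")
print(forecast.planning_value, forecast.model_value, forecast.gap())
```

Nothing here is statistical. The model value keeps moving with new data, the planning number moves only when someone decides it should, and each of those decisions leaves a trace.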

When those controls exist, automation becomes easier to trust. The model can adapt to new data, while the planning process remains stable.
Without them, teams compensate with manual spreadsheets, local adjustments, and side calculations that gradually fragment the process.
The interesting thing is that spreadsheets are rarely the root problem. They are often the symptom of a forecasting process that has not yet defined where analytical output ends and operational commitment begins.
Automation does not remove human judgement. It simply changes where judgement sits.

Instead of editing the forecast itself, the focus shifts to deciding when the forecast should refresh, when it should lock, and how deviations should be monitored over time.

In other words, the conversation moves from manipulating numbers to governing the system that produces them.
The real question is not whether a model can generate a forecast automatically.

It is whether the organization has designed a forecasting process that can absorb automation without losing stability.
Before asking whether automation is ready, perhaps we should first ask whether the forecasting process itself is designed to support it.