My Journey into Data Science

KumarsMLJourney is where I reflect on data, predictive systems, and how analytics actually works in real-world environments. With 15 years in tech operations and 10 years in data science, spanning descriptive through prescriptive analytics, my focus has shifted from building models to understanding how they hold up in production. Most challenges aren’t technical; they sit between data, process, and people. This blog explores those gaps and the realities behind making data truly work.

Thursday, May 7, 2026

I suppose... Data is no longer just information

Did you hear about the Goblin Effect?

There was a period recently where language models started doing something slightly unusual.

In situations involving system errors, bugs, or abstract issues, responses began to include references to “goblins” or “gremlins.” Not occasionally or contextually, but with enough consistency to be noticed. The descriptions were still coherent, often even helpful, but the framing felt misplaced.

No one had explicitly trained the system to describe errors this way.

There was no dataset defining “goblins” as a standard abstraction for system behavior. And yet, the pattern appeared, repeated, and persisted across interactions.

At first glance, it is tempting to explain this using a familiar idea. Something must have gone wrong with the data. But that explanation does not quite hold.

The underlying information remained valid. The model was not hallucinating in the traditional sense. Instead, it was expressing correct concepts through a pattern that had become disproportionately prominent.

This suggests something else is happening.

These systems are not only learning from structured data. They are also shaped by feedback, interaction patterns, and ranking signals. Responses are selected and reinforced based on loosely defined objectives such as clarity, usefulness, or engagement.

Over time, certain ways of expressing ideas begin to appear more frequently, not because they are more accurate, but because they align more closely with what the system is implicitly encouraged to produce. A metaphor that resonates slightly better can be selected more often, and that preference, when repeated at scale, begins to influence behavior.

What starts as a minor tendency can become a visible pattern.
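To make the compounding concrete, here is a toy sketch. It is not how any real model is trained; the function name `reinforce`, the 1% starting share, and the 5% rating advantage are all assumptions chosen purely for illustration. Each round, one phrasing is selected in proportion to its current share multiplied by a slightly higher reward.

```python
def reinforce(share, advantage, rounds):
    """Share of responses using one phrasing after repeated selection.

    Toy model (an assumption, not a real training procedure): each round the
    phrasing is picked in proportion to its current share times (1 + advantage),
    while the alternative phrasing has a relative reward of 1.
    """
    for _ in range(rounds):
        weighted = share * (1.0 + advantage)
        share = weighted / (weighted + (1.0 - share))
    return share


# Hypothetical numbers: the phrasing starts in 1% of responses and is
# rated only 5% better than the alternative.
for rounds in (0, 50, 100, 200):
    print(rounds, round(reinforce(0.01, 0.05, rounds), 3))
```

Under these assumptions, the phrasing goes from rare to the majority of responses within about a hundred rounds, without any single round looking like a meaningful change.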

The goblin example is relatively harmless. But it highlights a broader shift in how we think about data and models.

I suppose...

What we are seeing here is not a failure of data in the traditional sense, but a change in what data represents. It is no longer limited to facts or structured inputs. It also includes signals about preference, tone, and usefulness, introduced through feedback loops.

When these signals are present, the system does not simply optimize for correctness. It optimizes for what is rewarded. This means behavior can emerge not directly from the data, but from how outputs are selected and reinforced.

The result is not necessarily error, but drift. Outputs remain valid, but the way they are expressed can shift in ways that were not explicitly intended.

The challenge is no longer just understanding what data goes into a system.

It is understanding how behavior is shaped over time.

Because the system is not just learning what to say. It is learning how to behave.

Curious how this is being observed in your environment. When patterns like this emerge, how do you distinguish between useful behavior and unintended drift?

Thursday, April 9, 2026

I suppose... Building everything is not the same as scaling it

Recently, I decided to build something end to end.

Not a prototype. Not a POC. An actual working system.

For most of my career, this kind of work was distributed. Different people owned different parts. Data moved through layers. Assumptions were discussed, sometimes challenged, sometimes left undocumented, but at least visible across the system.

That structure was not perfect.

Pipelines had hidden logic.

Interpretations carried bias.

Decisions were not always fully documented.

But knowledge was spread out.

This time, I built it alone.

With AI and cloud-based environments, it is now possible to move across the entire stack. Data ingestion, transformation, modelling, validation, and deployment can sit within one workflow, driven by a single person.

At first, it feels efficient.

Fewer dependencies.

Faster iteration.

Less coordination.

But the experience is different.

Not because the system does not work. It does.

But because the way it works becomes harder to separate from the person building it.

The logic is still there.

The assumptions are still there.

The decisions are still being made.

But they are compressed.

What used to be spread across roles now sits closer together. Some of it is captured in code. Some of it in prompts. Some of it in choices that are made along the way and not always revisited.

This is not entirely new.

There have always been systems that only one person fully understood. The difference now is how much can be built that way, and how quickly that complexity can accumulate.

A dataset is no longer just a dataset. It reflects a sequence of decisions about joins, filters, assumptions, and edge cases. A model is shaped not only by data, but by how it was iterated, tested, and adjusted.

The system works, but understanding how it works requires retracing those steps.

I suppose...

What changes here is not just who builds the system, but where the complexity sits. In distributed environments, complexity is spread across people and processes. In end-to-end workflows, it becomes more concentrated.

This does not automatically make the system worse. In some cases, it makes it faster and more coherent. For smaller or self-contained use cases, that concentration may not matter at all.

But as the system becomes more connected to other processes, the visibility of that complexity starts to matter more. Not because others cannot work on it, but because the path from input to output is less obvious without reconstructing the decisions behind it.

This is where the question begins to shift.

Not whether one person can build everything.

But what needs to be visible for something to be continued by someone else.

Because building end to end is now straightforward.

What is less clear is how much of that thinking needs to be externalized for the system to exist beyond the person who built it.

Tuesday, April 7, 2026

I suppose... Variation often gets mistaken for impact

In periods of disruption, data tends to move.

Volumes shift, patterns break, and numbers spike or drop.

When that happens, explanations arrive quickly.

A recent example is the Middle East crisis. In the weeks that followed, multiple metrics across industries showed noticeable changes. Contact volumes increased. Booking patterns shifted. Cancellations moved in ways that were not seen in the weeks before.

The immediate conclusion was consistent.

This is driven by the crisis.

In many cases, that was true. Events of that scale do create real impact. They influence behavior, disrupt flows, and introduce uncertainty into systems that were previously stable.

But something else tends to happen at the same time.

The presence of a strong external event creates a dominant narrative. Once that narrative is established, it begins to absorb variation.

Not all of it, but enough.

Spikes that may have occurred anyway start to get explained through the same lens. Seasonal patterns, ongoing trends, operational changes, and even random fluctuation begin to take on a common explanation.

Different teams may interpret the same movement in different ways, but the dominant narrative often remains unchanged.

This is where the distinction becomes important. Not between right and wrong, but between bias and noise.

Bias is directional. If the crisis consistently shifts behavior in one direction, that effect can be observed and measured over time.

Noise is different. It is the variation that exists regardless of the event. Short-term spikes, fluctuations, and inconsistencies that do not follow a clear pattern, but still demand explanation.

The difficulty is that both can appear at the same time.

A real shift may be happening. But so is unrelated variation.
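One rough way to tell the two apart is to ask whether the typical level has moved, and to measure that move against the variation that was already there before the event. The sketch below is only an illustration: the daily volumes are made up, and `shift_in_noise_units` is a hypothetical helper, not a standard statistic.

```python
from statistics import median, stdev


def shift_in_noise_units(before, after):
    """Change in the typical level, measured in pre-event standard deviations."""
    return (median(after) - median(before)) / stdev(before)


# Hypothetical daily contact volumes, one week before and one week after an event.
before = [100, 104, 97, 101, 99, 103, 96]
spike_only = [102, 98, 130, 100, 97, 101, 99]    # one loud day, level unchanged
sustained = [118, 121, 117, 123, 119, 120, 122]  # the whole level has moved

print(round(shift_in_noise_units(before, spike_only), 2))
print(round(shift_in_noise_units(before, sustained), 2))
```

The median is used so that a single loud day does not drag the typical level with it. The spike of 130 still deserves a look, but in this made-up week it has not changed the baseline, while the sustained series sits several standard deviations above it.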

I suppose...

What we are seeing in these moments is not just the impact of the event, but how interpretation adapts around it. When a strong narrative is present, it becomes easier to explain changes through that narrative than to separate what is actually driven by it and what is not.

This does not make the explanation incorrect. It makes it incomplete.

Over time, this can influence how systems are understood. Short-term variation may be treated as structural change. Temporary movement may be interpreted as a new baseline. Decisions may begin to anchor on patterns that do not persist.

The challenge is not in recognizing that an event has impact.

It is in understanding how much of what we are observing truly belongs to it.

Because not every movement during a disruption is caused by the disruption.

And not every explanation reflects the full picture.

Curious how this is approached in your environment. When patterns shift during major events, how much effort goes into separating signal from variation?

Sunday, March 29, 2026

I suppose... what makes data easy to explain can make it harder to learn from

What helps us explain the past does not always help us learn from it.

A dataset prepared for reporting and a dataset prepared for modelling are built with very different intentions, even if they appear similar on the surface.

A reporting dataset is designed to describe what has already happened. The priority is consistency, aggregation, and interpretability. Data is grouped, summarized, and aligned to business definitions. Daily totals, weekly averages, pre-calculated KPIs, and standardized dimensions make it easier to consume and explain. Missing values are handled in ways that keep outputs stable, often by replacing nulls with zeros or carrying values forward so dashboards remain intact.

A modelling dataset is designed differently. The objective is not to summarize, but to preserve signal. Variation, relationships, and structure matter more than simplicity. Instead of smoothing the data, it needs to retain enough detail for patterns to be learned.

When a reporting dataset is used for modelling, the differences start to show.

Aggregation reduces variance. Transaction-level detail becomes compressed, smoothing out outliers and weakening relationships between variables. Patterns that exist at a finer level become less visible once they are averaged.
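The variance point can be seen with simulated data. This is a minimal sketch that assumes 100 days of 50 transactions each, drawn from an arbitrary skewed distribution; it illustrates only the loss of variance and the smoothing of outliers, not the effect on relationships between variables.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical transaction amounts: 100 days x 50 transactions per day.
amounts = rng.gamma(shape=2.0, scale=50.0, size=(100, 50))

# Reporting-style view: one average per day.
daily_mean = amounts.mean(axis=1)

transaction_var = amounts.var()
daily_var = daily_mean.var()

print("variance at transaction level:", round(float(transaction_var), 1))
print("variance of daily averages:   ", round(float(daily_var), 1))
print("largest transaction:", round(float(amounts.max()), 1))
print("largest daily average:", round(float(daily_mean.max()), 1))
```

With independent transactions, averaging 50 of them leaves only a small fraction of the original variance, and the largest single transaction no longer appears anywhere in the daily series.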

Handling of missing values also takes on a different meaning. Replacing nulls with zeros or carrying forward values may stabilize reporting, but it changes the underlying signal. A missing value and a true zero are treated the same, even though they represent different conditions.
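A small sketch of the difference, using made-up daily order counts. The names `daily_orders` and `was_missing` are illustrative only.

```python
import numpy as np

# Hypothetical daily order counts: two days were never recorded, one day truly had zero orders.
daily_orders = np.array([12.0, np.nan, 0.0, 8.0, np.nan, 10.0])

# Keep a record of which values were missing before anything is filled.
was_missing = np.isnan(daily_orders)

# Reporting-style fix: replace nulls with zeros so totals and charts stay intact.
zero_filled = np.where(was_missing, 0.0, daily_orders)

print("mean after zero fill:", zero_filled.mean())
print("mean of observed days:", np.nanmean(daily_orders))
print("zeros after fill:", int((zero_filled == 0).sum()))
print("true zeros:", int((daily_orders == 0).sum()))
```

After the fill, three days look identical even though only one of them was a real zero. Keeping `was_missing` as its own column is one way to let a model see the difference.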

Time alignment introduces another layer of complexity. Reporting datasets are structured around business periods, while modelling often depends on precise prediction points. This can create situations where information from the future is unintentionally included in features.
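As an illustration, assume a feature that counts a customer's contacts in a business week, and a prediction that has to be made midweek. The timestamps and the helper `contacts_known_at` are hypothetical.

```python
from datetime import datetime


def contacts_known_at(contacts, cutoff):
    """Count only the contacts that had already happened at the prediction point."""
    return sum(1 for t in contacts if t < cutoff)


# Hypothetical contact timestamps for one customer within one business week.
contacts = [
    datetime(2026, 3, 23, 9, 0),
    datetime(2026, 3, 24, 14, 0),
    datetime(2026, 3, 26, 11, 0),
    datetime(2026, 3, 27, 16, 0),
]
prediction_time = datetime(2026, 3, 25, 0, 0)

weekly_total = len(contacts)  # reporting view: the whole week, including days not yet lived
safe_feature = contacts_known_at(contacts, prediction_time)

print("weekly total:", weekly_total)
print("known at prediction time:", safe_feature)
```

The weekly total is correct for a report, but as a feature it includes two contacts that had not happened yet when the prediction was supposed to be made.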

There is also a structural difference. Reporting datasets are typically wide, clean, and designed for readability. Modelling datasets tend to be more granular and may require additional transformations before they can be used effectively.

The challenge is that reporting datasets feel complete. They look clean, consistent, and trustworthy because they have already been processed to remove inconsistencies.

But that same processing often removes the variation and relationships that models depend on.

I suppose...

A lot of what we consider “ready” data depends on the context in which it is used. Data that is ready for reporting is optimized for clarity and stability. Data that is ready for modelling is optimized for learning and pattern detection.

When one is used in place of the other, the limitations are not always obvious at first. The dataset appears clean and well-structured, but some of the underlying signal may already have been simplified or altered.

This is where the distinction becomes important. Reporting benefits from aggregation and standardization. Modelling benefits from preserving detail and relationships. Both approaches are valid, but they serve different purposes.

Because what helps us explain the past does not always help us learn from it.

Thursday, March 26, 2026

I suppose… data cleaning hasn’t caught up with data complexity

Most enterprise datasets are not clean by design. They are assembled.

Pulled from multiple systems, shaped by operational constraints, and often optimized for storage or reporting rather than modelling. Missing values are not exceptions in these environments. They are expected. Yet the way they are handled has remained largely unchanged.

I suppose... the limitation is not in awareness, but in approach.

Traditional imputation methods operate at the column level. Mean, median, forward fill, or rule-based substitution all assume that each feature can be corrected independently. This works in controlled datasets, but breaks down when relationships between variables are strong. In most real-world data, missingness is conditional, not random.

Let’s go back to basics and look at how machine learning approaches handle this differently.

K-Nearest Neighbors imputation is usually the easiest step up from traditional methods. Instead of filling in a missing value using a global average, it looks for rows in the dataset that are most similar to the one with missing data. Think of it as finding lookalike records. If a value is missing for a customer, the algorithm finds other customers who behave similarly based on the available fields, and uses their values to fill the gap. This works well when your data has clear groupings or patterns. But as the number of columns increases, it becomes harder to define what similar really means, and the method can slow down significantly.
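A minimal sketch of the lookalike idea, using scikit-learn's `KNNImputer` next to a plain column mean from `SimpleImputer`. The six customers and their two fields are invented for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical customers: [monthly_spend, monthly_orders], in two clear groups.
X = np.array([
    [20.0, 2.0],
    [22.0, 2.0],
    [21.0, np.nan],   # low-spend customer with a missing order count
    [200.0, 20.0],
    [210.0, 21.0],
    [205.0, 19.0],
])

# Column-level fill: the global average of the orders column.
mean_filled = SimpleImputer(strategy="mean").fit_transform(X)

# Row-level fill: average of the two most similar customers.
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)

print("mean fill:", mean_filled[2, 1])
print("KNN fill: ", knn_filled[2, 1])
```

The column mean lands between the two groups and describes neither of them. The KNN fill borrows from the two low-spend customers, which is closer to what this row most likely looked like.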

Iterative imputation, often referred to as MICE, takes a more structured approach. Instead of treating each column separately, it tries to understand how columns relate to each other. It fills in missing values step by step. For example, if one column is missing, it uses all the other columns to predict it. Then it moves to the next column and repeats the process. This cycle runs multiple times until the values stop changing much. The strength of this method is that it captures relationships across the dataset, not just similarities between rows. In many enterprise datasets where features are connected, this leads to more realistic results. The downside is that it is more complex to run and takes longer to compute.
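scikit-learn ships a MICE-inspired implementation as `IterativeImputer`, which is still marked experimental and needs an explicit enabling import. The sketch below assumes simulated data in which the second column is roughly three times the first; only the second column has gaps, so this shows the "predict one column from the others" step rather than a full multi-column cycle.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (required to unlock the import below)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Hypothetical data: the second column is about 3x the first, plus a little noise.
x = rng.uniform(0, 10, size=200)
X = np.column_stack([x, 3.0 * x + rng.normal(0, 0.1, size=200)])

# Hide the second column for the first 20 rows.
X_missing = X.copy()
X_missing[:20, 1] = np.nan

imputed = IterativeImputer(max_iter=10, random_state=0).fit_transform(X_missing)

# Compare the filled values with the ones that were hidden.
max_error = np.abs(imputed[:20, 1] - X[:20, 1]).max()
print("largest imputation error:", round(float(max_error), 3))
```

Because the imputer learns the relationship between the two columns, the filled values should track the hidden ones closely, which a single column average could not do for data spread across this range.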

Deep learning approaches, such as autoencoders, work in a very different way. Instead of explicitly comparing rows or building step-by-step predictions, they try to learn the overall structure of the dataset. The model compresses the data into a smaller representation and then learns how to rebuild it. During this process, it also learns how to fill in missing values based on patterns it has seen across the entire dataset. This makes it powerful for complex data where relationships are not obvious. However, it is harder to explain how the values are being filled, and it requires more effort to train and maintain properly.
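A very small sketch of the idea, assuming scikit-learn's `MLPRegressor` as a stand-in for a proper deep learning framework. The data is simulated, the network has a two-unit bottleneck, and the training setup (blanking random inputs and asking for the original back) is one common way to do this, not the only one. Treat it as an illustration of the mechanics rather than a recommended configuration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Hypothetical data: three features driven by one hidden factor.
z = rng.normal(size=(500, 1))
X = z @ np.array([[1.0, 2.0, -1.0]]) + rng.normal(scale=0.1, size=(500, 3))
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize, so 0 is each column's mean

# Hide one value in each of the first 50 rows; train on the remaining complete rows.
hidden = np.zeros(X.shape, dtype=bool)
hidden[np.arange(50), rng.integers(0, 3, size=50)] = True
X_missing = X.copy()
X_missing[hidden] = np.nan
train = X[50:]

# Denoising setup: blank out random entries in the input, ask for the original back.
drop = rng.random(train.shape) < 0.3
corrupted = np.where(drop, 0.0, train)

autoencoder = MLPRegressor(
    hidden_layer_sizes=(2,),  # the compressed representation
    activation="tanh",
    solver="lbfgs",
    max_iter=2000,
    random_state=0,
)
autoencoder.fit(corrupted, train)

# Start gaps at the column mean (0), rebuild the rows, keep observed values untouched.
start = np.where(np.isnan(X_missing), 0.0, X_missing)
reconstructed = autoencoder.predict(start)
imputed = np.where(np.isnan(X_missing), reconstructed, X_missing)

print("remaining gaps:", int(np.isnan(imputed).sum()))
```

Only the missing cells are replaced by the network's reconstruction. How good those values are depends heavily on the data and the training choices, which is part of why this approach takes more effort to maintain and explain.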

In most enterprise environments today, adoption is still limited. Deterministic pipelines dominate because they are predictable and easy to audit. Machine learning approaches introduce variability, which creates hesitation. But the value is not in replacing everything at once. It is in applying these methods where traditional logic starts to break.

How are you currently handling messy datasets in your environment, and where do you see these approaches realistically fitting in?

I suppose… we are building faster than we can stabilize

There’s a noticeable shift happening in how ideas move through organizations.

Prototypes are no longer slow. They are immediate.

Functional leaders and teams can now build working solutions in hours. Dashboards, workflows, and even AI-driven tools can be put together quickly enough to validate an idea almost instantly.

On the surface, this looks like acceleration.

More ideas are tested.
More concepts are demonstrated.
More things appear to be working.

But the moment something works, even partially, it starts to carry expectation.
What begins as a quick experiment quickly turns into something people want to scale, integrate, and rely on. The transition from “this is interesting” to “can we deploy this?” happens much faster than before.

That is where the tension starts to show.

Because while creation has become fast, production has not changed at the same pace. Deployment still requires structure. Security, scalability, ownership, and integration do not move at prototype speed.

So a gap begins to form.

Not between idea and execution, but between what works and what is actually ready.

As more of these solutions emerge, another pattern starts to develop. Things that are not fully productionized begin to get used in small, practical ways. A script here, a workflow there, something that supports a decision because it is “good enough for now.”

Individually, these feel harmless.
But over time, they begin to shape how work gets done.

I suppose...

What we are starting to see is less about failure and more about how the system is evolving. When prototypes become easier to build, more of them naturally move closer to real usage, even if they were not originally designed for it.

Over time, this creates a layer of solutions that sit somewhere between experiment and production. They work well enough to be useful, but they may not have the structure, ownership, or stability that production systems typically require.

This also changes how work flows into engineering. Instead of building from clearly defined requirements, there is often a need to reinterpret what was created, understand the assumptions behind it, and reshape it for a more stable environment.

At the same time, expectations begin to shift. When something can be built quickly, it is easy to assume it can also be deployed quickly. But the requirements for something to work once and something to work consistently are quite different.

None of this is necessarily a problem on its own. It is more a reflection of how much faster the front end of innovation has become compared to the systems that support it.

The question is less about whether this is right or wrong, and more about how well the organization can absorb this new pace without creating unintended dependencies or instability over time.

#VibeCoding #MVP #Prototypes