Modern Enterprise Data Modeling


I have been involved in modeling data for over 30 years, creating a variety of data models (3NF, dimensional, ensemble (anchor, data-vault), graphs, etc.) mainly for analytical systems. However, many of these have also gradually become outdated or obsolete. Sometimes it feels like the work of the unfortunate Sisyphus who persistently rolls his boulder up the hill, only to realise at some point that it was in vain again.

For a very long time, I was convinced that it must be possible to centrally model a common and complete view of business matters for a company. After all, long-time business people who have been involved in the modeling process know what's going on in the company, right? Well, the smaller the company, the closer I got to that goal. But to be completely honest, in the end each model remained just an approximation – a static view that tried to reflect the constantly changing reality.

But even if it is quite laborious to create such a model, we absolutely cannot be successful without it. The modern data-driven enterprise is based on the core idea of deriving value from data. However, the fact is that data has no value in and of itself. We need to use, combine, integrate and apply data in different business contexts to ultimately derive value from it.

But without a real understanding of the information and business context in our data, quite frankly, we can't derive anything from it. We can try to create a data structure up front (schema on write), or we can deduce it while we read (schema on read). Either way, only with a clear structure and detailed descriptions of the things and the known relationships between these things are we able to apply intelligence and deduce what is still unknown. Without structure, we'll sink into the unstructured data lake's data swamp.

From Data to Information

Given an unstructured text on any topic, we won't understand the content until we have read the text completely. While we read, we create a mental model of the information buried in the text. For longer texts, we will most likely also take notes that persist part of that mental model on paper.

Something similar happens when ML models read text. Let's take Large Language Models (LLMs), because interest in them is currently exploding. LLMs do not understand much until they have scanned the text in full – ignoring for a moment that these models don't really understand anything anyway. While these LLMs scan the text, they derive a model that takes words (or tokens), relates them to the words before and after the word in question, creates embeddings and persists the result as vectors in a database – similar to our reading and taking notes on paper, actually.

What's the Problem with Traditional Data Modeling?

Without going further into the details of LLMs and the algorithms they use to derive models from unstructured text, let's explore the bottom-up nature of this process.

We could theoretically store all the information generated by business applications as plain natural language text. These texts could then be sent through our LLMs to derive a model from this business data. This would be a rather lengthy and expensive process, but it could actually also be beneficial. Unfortunately, the model produced by this process is not a meaningful semantic model for us humans, one that directly helps us understand the data. But the model persisted in a vector database, combined with a smart client like ChatGPT, would likely allow you to run queries that uncover unknown things about your data.

At the same time, most companies have enterprise data models that, at best, represent an often outdated high-level view of the business. A view that was typically crafted top-down with significant human labor. Unfortunately, these models are most often neither complete nor fully consistent with the other, more detailed models built for the business applications in the organization. The model may even be physically implemented as an enterprise data warehouse – or as many such data warehouses, to reflect reality.

But I dare say that there is not a single company today that has a complete and up-to-date data model of the information contained in all its applications and databases – and by that I explicitly mean the entire operational and analytical planes. And I'm even more convinced that no company owns an enterprise data warehouse with an implemented version of this model that is fully populated with data from all applications and constantly kept up-to-date.

Even data models designed for a single application are typically a top-down partial view of business objects and relationships relevant for the application at a specific point in time. But the reality, of course, is constantly changing with every new product that is defined, with every new application (or even microservice) rolled out, and generally with continuous amendments to the business processes of the corporation.

It's impossible to capture the ever-changing way a company does business in a static top-down data model. Even if you could theoretically capture all business objects and relations correctly at a specific point in time, the model would quickly become outdated.

Transitional Modeling to the Rescue

If this labor-intensive top-down modeling does not fully meet our requirements, why not add a bottom-up modeling approach to it? Let's take a fresh look at data modeling by following an approach that Lars Rönnbäck calls transitional modeling in his intriguing paper on Modeling Conflicting, Unreliable, and Varying Information.

Comparable to the way an LLM derives a model from natural language text (lightly structured by words and sentences), we can derive our data model from data atoms, or Posits as Lars Rönnbäck calls them. Data atoms are like standardized tidbits of information about the business in the organization. We will explore them in more detail later in this article.

The difference from natural language text is the minimal but rigid schema by which data atoms are defined and combined, as opposed to the unstructured way words are combined into sentences. With this minimal schema, we gain simplified processing while maintaining the flexibility of natural language text to express highly complex facts and multifaceted relationships. In this way, we can derive the model from the data itself, similar to how an LLM derives its model from the words in a natural language text, but more efficiently and simply.

Converting the content of a database into data atoms is actually a fairly straightforward process and can be implemented much more efficiently than converting the same content into meaningful natural language text. And each IT business domain can easily add its business context information as additional data atoms to what was automatically derived from databases. This overall stream of data atoms contains all the necessary information to derive an up-to-date data model at any time. Any change in the domain model can be easily published by feeding additional data atoms into this stream.
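
To make this a bit more concrete, here is a minimal sketch of such a conversion, written against the DataAtom and Appearance case classes introduced in the Data Atoms section below. The function itself, the table and column names, and the "thing"/"class"/"official" conventions are purely illustrative and simply mirror the example used later in this article.

// Hedged sketch: turn one row of a relational table into data atoms.
// Uses the Appearance/DataAtom case classes defined in the Data Atoms section below;
// table name, column names and role conventions are illustrative only.
def rowToAtoms(
    table: String,                 // source table, mapped to an ontology class, e.g. "Customer"
    key: String,                   // primary key value of the row
    columns: Map[String, String],  // column name -> column value
    loadTime: java.time.LocalDateTime
): Seq[DataAtom[String, String, String, java.time.LocalDateTime]] =
  val classAtom = DataAtom(
    Set(Appearance(key, "thing"), Appearance(s"Ontology.$table", "class")),
    "official", loadTime)
  val attributeAtoms = columns.toSeq.map { (column, value) =>
    DataAtom(Set(Appearance(key, column)), value, loadTime)
  }
  classAtom +: attributeAtoms

// e.g. rowToAtoms("Customer", "4711",
//   Map("first name" -> "John", "last name" -> "Wayne"), java.time.LocalDateTime.now())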

A stream of data atoms is the ideal input for what we have termed the event-driven data architecture: a system design in which processing is triggered by specific events or changes in state (here, the data atoms), enabling real-time interactions between applications in the enterprise.

With each rollout of a new application feature (which will most likely also change the corporate data model), the change can be published as data atoms to the event-driven architecture. Practically, this means that every other application in the enterprise can react directly to the changed model and update visualizations or dependent downstream processing accordingly. Overall, this allows your systems to evolve dynamically and provides very flexible ways to represent information.
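
As a minimal sketch of this idea (and nothing more than a sketch), a tiny in-memory publish/subscribe mechanism could look as follows. It assumes the DataAtom and Appearance case classes defined in the Data Atoms section below; the AtomBus object and its handlers are made up for illustration, and a real implementation would of course sit on an event broker.

// Hedged sketch of the event-driven idea: downstream applications register handlers
// that are invoked for every data atom published to the stream. In a real setup the
// atoms would travel over an event broker; this AtomBus is illustrative only.
object AtomBus:
  private var handlers: List[DataAtom[?, ?, ?, ?] => Unit] = Nil

  // register a callback that is invoked for every published data atom
  def subscribe(handler: DataAtom[?, ?, ?, ?] => Unit): Unit =
    handlers = handler :: handlers

  // publish a new data atom and notify all subscribed applications
  def publish(atom: DataAtom[?, ?, ?, ?]): Unit =
    handlers.foreach(_(atom))

// A reporting application could, for instance, refresh its views whenever the model
// changes, i.e. when an atom arrives whose appearances all play the "class" role:
// AtomBus.subscribe { atom =>
//   if atom.appearances.forall(_.role == "class") then println(s"model changed: $atom")
// }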

Any other data model can be derived from the basic model of a time-ordered series of data atoms.

We can use this fact effectively to create enterprise data models that remain up-to-date at all times. And even more, any other database model optimized for a specific application can efficiently be derived from the enterprise model. We can develop applications that automatically transform the stream of data atoms into, for instance, a data vault, a graph model or a performance-optimized star schema for a relational database to be used in a data warehouse.
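
As one hedged illustration of such a derivation, the following sketch projects a very simple graph model (nodes and edges) out of an atom stream. It again assumes the DataAtom and Appearance classes defined in the Data Atoms section below, and the role names mirror the later example; a real transformation into a data vault or star schema would of course be more involved.

// Hedged sketch: project a simple graph model out of a stream of data atoms.
// Atoms with a "thing"/"class" pair become nodes, atoms that tie two identities
// together become edges. Uses the DataAtom/Appearance classes defined below.
def toGraph(
    atoms: Seq[DataAtom[?, ?, ?, ?]]
): (Set[(Any, Any)], Set[(Any, Any, Any)]) =
  // nodes: every atom that classifies a "thing" becomes a (thing id, class id) pair
  val nodes: Set[(Any, Any)] = atoms
    .filter(a => a.appearances.exists(_.role == "thing")
              && a.appearances.exists(_.role == "class"))
    .map(a => (a.appearances.find(_.role == "thing").get.id: Any,
               a.appearances.find(_.role == "class").get.id: Any))
    .toSet
  // edges: atoms relating two identities become (id, id, label) triples;
  // the direction is illustrative only, since appearances form an unordered set
  val edges: Set[(Any, Any, Any)] = atoms
    .filter(a => a.appearances.size == 2 && !a.appearances.exists(_.role == "class"))
    .map { a =>
      val pair = a.appearances.toSeq
      (pair(0).id: Any, pair(1).id: Any, a.value: Any)
    }
    .toSet
  (nodes, edges)

// e.g. toGraph(overallStream) yields nodes such as ("4711", "Ontology.Customer")
// and an edge connecting "4711" and "0815" labeled "officially registered"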

And keep in mind that these data atoms not only encode the semantics needed to derive models from the data, but also the complete business content. An immutable stream of data atoms is therefore the optimal model for a lossless recording of any information that has been generated in the enterprise over time. This is quite similar to how a relational database uses its commit log, for instance, to implement ACID guarantees and to derive all kinds of relations from the transactions committed to the database. But here we apply the principle at the enterprise level to integrate all applications, instead of being locked into one single database management system.
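
A sketch of this commit-log style replay could look like the following: filter the immutable stream up to a chosen point in time and fold it into the latest value per (identity, role). Again, this assumes the DataAtom and Appearance classes defined below; the asOf function and its flat key/value view are simplifications made up for illustration.

// Hedged sketch of a commit-log style replay: reconstruct the latest value per
// (identity, role) as of a given point in time from the immutable stream.
def asOf(
    atoms: Seq[DataAtom[?, ?, ?, ?]],
    pointInTime: java.time.LocalDateTime
): Map[(Any, Any), Any] =
  atoms
    .filter(a => !a.timestamp.asInstanceOf[java.time.LocalDateTime].isAfter(pointInTime))
    .sortBy(_.timestamp.asInstanceOf[java.time.LocalDateTime])
    .foldLeft(Map.empty[(Any, Any), Any]) { (state, atom) =>
      atom.appearances.foldLeft(state) { (acc, app) =>
        acc.updated((app.id, app.role), atom.value)
      }
    }

// e.g. replaying the example stream below as of one year from now would still show
// car 0815 registered with customer 4711, but already repainted in blue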

Data Atoms

Sounds too good to be true? So let's explore these data atoms. Forming complex organisms from simpler atoms and molecules is another smart invention of Mother Nature. We can apply the same principle to form our models bottom-up from data atoms.

So let's stir the primordial soup a little and define data atoms that can be combined to express highly complex facts. Lars Rönnbäck has created a colorful description of the structure that forms our data atoms. A little Scala code can perhaps describe the structure even more concisely and allows you to play with the concept:

// === THE BASIC STRUCTURE ===

// An Appearance can be identified with any type I
// and can have a role of any type R
case class Appearance[I,R](id: I, role: R):
    override def toString(): String = s"($id,$role)"

// A DataAtom is a set of Appearances and a value (of type V)
// as of one particular point in time (type T)
// override toString to mimic Lars Rönnbäck's compact way to write Posits
case class DataAtom[I, R, V, T](appearances: Set[Appearance[I,R]], value: V, timestamp: T):
    private var id: java.util.UUID = scala.compiletime.uninitialized
    override def toString(): String = s"[${appearances.toString.replace("Set(","{").replace("))",")}")}, $value, $timestamp]"
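    // lazily assign a UUID so that other data atoms (e.g. assertions) can reference this atom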
    def infoId(): String =
      if (id == null) then id = java.util.UUID.randomUUID()
      id.toString()

// === Create a simple model ===

// The classes defined in the ontology
val customerType = DataAtom(
  Set(Appearance("Ontology.Customer", "class")),
  "Customer", java.time.LocalDateTime.MIN)

val carType = DataAtom(
  Set(Appearance("Ontology.Car", "class")),
  "Car", java.time.LocalDateTime.MIN)

// Define the customer John Wayne with id 4711
val customer4711 = DataAtom( 
  Set(Appearance("4711", "thing"), Appearance("Ontology.Customer", "class")),
  "official", java.time.LocalDateTime.now())
val customer4711LastName = DataAtom( 
  Set(Appearance("4711", "last name")), "Wayne", java.time.LocalDateTime.now())
val customer4711FirstName = DataAtom( 
  Set(Appearance("4711", "first name")), "John", java.time.LocalDateTime.now())

// Define the car with red color and VIN 0815 to be owned by customer 4711
val car0815 = DataAtom(
  Set(Appearance("0815", "thing"), Appearance("Ontology.Car", "class")),
  "official", java.time.LocalDateTime.now())
val car0815VIN = DataAtom( 
  Set(Appearance("0815", "vehicle identification number")),
  "0815", java.time.LocalDateTime.now())
val car0815Red = DataAtom(
  Set(Appearance("0815", "color")), 
  "red", java.time.LocalDateTime.now())
val carOwned = DataAtom( 
  Set(Appearance("4711", "owns"), Appearance("0815", "registered with")),
  "officially registered", java.time.LocalDateTime.now())

// create a stream of DataAtoms
val stream = Seq.empty[DataAtom[?,?,?,?]] :+ 
  customerType :+ carType :+ 
  customer4711 :+ 
  customer4711LastName :+ 
  customer4711FirstName :+ 
  car0815 :+ 
  car0815VIN :+ 
  car0815Red :+ 
  carOwned

////////////////////////////////////////////////////////////////////////////////////////
// Change the ontology
// Define a new vehicle type for bicycles, cars and other vehicles
val vehicleType = DataAtom(
  Set(Appearance("Ontology.Vehicle", "class")),
  "Vehicle", java.time.LocalDateTime.now().plusYears(3))

// The car will be repainted in blue one year later
val colorChange = DataAtom(
  Set(Appearance("0815", "color")), 
  "blue", java.time.LocalDateTime.now().plusYears(1))

// The customer Herold Ford appears with id 4712 two years later
val customer4712 = DataAtom(
  Set(Appearance("4712", "thing"), Appearance("Ontology.Customer", "class")),
  "official", java.time.LocalDateTime.now().plusYears(2))

// John Wayne sells his car to Herold Ford
val carOwnedNow = DataAtom(
  Set(Appearance("4712", "owns"), Appearance("0815", "registered with")),
  "officially registered", java.time.LocalDateTime.now().plusYears(2))

// The fact that customers can own bicycles gets tracked
val customer4711withBicycle = DataAtom(
  Set(Appearance("4711", "owns bicycle")),
  true, java.time.LocalDateTime.now().plusYears(2))

// John Wayne asserts, one year after he sold his car, that he is 80% confident
// he sold it to Herold Ford the year before
val assertion = DataAtom(
  Set(Appearance(carOwnedNow.infoId(), "ATOM"),
      Appearance("4711", "determines confidence")),
  0.8, java.time.LocalDateTime.now().plusYears(3))

// the car is redefined as a vehicle with a new attribute: type
val carNowVehicle = DataAtom(
  Set(Appearance("0815", "thing"), Appearance("Ontology.Vehicle", "class")),
  "official", java.time.LocalDateTime.now().plusYears(3))

val carVehicleType = DataAtom(
  Set(Appearance("0815", "type")),
  "Car", java.time.LocalDateTime.now().plusYears(3))

// create another stream of DataAtoms
val newStream = Seq.empty[DataAtom[?,?,?,?]] :+
  vehicleType :+ 
  colorChange :+
  customer4712 :+ carOwnedNow :+
  customer4711withBicycle :+
  assertion :+
  carNowVehicle :+ carVehicleType

val overallStream = stream ++ newStream

// As one tiny example, let's query the stream to
// get all the entities in the model as a
// set of (entity-name, valid-from) tuples
val entities = overallStream
  .filter(_.appearances.forall {
    case Appearance(_, "class") => true
    case _                      => false
  })
  .sortBy(_.timestamp.asInstanceOf[java.time.LocalDateTime])
  .map(a => (a.value, a.timestamp))
  .toSet
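
In the same spirit, and equally just a sketch, a second query can trace how a single attribute varies over time, for example the color of car 0815 (first "red", then "blue" after the repaint):

// A second tiny query: the color history of car 0815 as (value, timestamp) pairs
val colorHistory = overallStream
  .filter(_.appearances == Set(Appearance("0815", "color")))
  .sortBy(_.timestamp.asInstanceOf[java.time.LocalDateTime])
  .map(a => (a.value, a.timestamp))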

These small examples hopefully demonstrate the potential of this approach. It allows your corporate models to stay up-to-date, complete and fully aligned with all the detailed models exposed by your applications. Apply this at the enterprise level to encode information created in your business applications that needs to be shared at large scale. Information encoded into data atoms can form the backbone of modern data modeling. At the same time, this approach can also be used for what I call the universal data supply – an enterprise data architecture that aims to deliver on the data-driven value proposition by enabling lossless data sharing across all applications in the enterprise.

An Ontology to Guide Your Data Atoms

As is so often the case, the principle of doing one thing without neglecting the other applies here as well. Even if our top-down modeling approach cannot succeed alone, it can significantly streamline our transitional modeling.

Illustrative example of a top-down ontology to integrate distributed, bottom-up developed domain models – Image by author

I have written in more detail about ontologies and how to create one in part 3 of my three-part series on data mesh. But as a very brief summary of the core idea: an ontology is about the big picture, a high-level conceptual understanding of the business that provides orientation for the application-specific models developed in the business domains.

Modern data modeling approaches the problem from the top down with the ontology and from the bottom up by creating simple data atoms that are linked to the ontology. This enables the construction of full-fledged enterprise data models that can stay current and complete.
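
As a small and admittedly simplified sketch of this linkage, we can check the bottom-up atom stream against the top-down ontology: collect all classes the ontology defines and flag every atom that references a class the ontology does not know about. The function below reuses the DataAtom and Appearance classes and the overallStream from the example above; its name and the "class" convention are simply the ones used in this article.

// Hedged sketch: validate bottom-up data atoms against the top-down ontology.
def unknownClassReferences(
    atoms: Seq[DataAtom[?, ?, ?, ?]]
): Seq[DataAtom[?, ?, ?, ?]] =
  // the ontology part of the stream: atoms whose appearances all play the "class" role
  val knownClasses: Set[Any] = atoms
    .filter(_.appearances.forall(_.role == "class"))
    .flatMap(_.appearances.map[Any](_.id))
    .toSet
  // flag every atom that references a class the ontology does not (yet) define
  atoms.filter(_.appearances.exists(app =>
    app.role == "class" && !knownClasses.contains(app.id)))

// e.g. unknownClassReferences(overallStream) is empty for the example above, because
// Ontology.Customer, Ontology.Car and Ontology.Vehicle are all defined as class atoms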

This does not mean that modeling has practically become an automated process. We can still have conflicting and varying information that needs to be harmonised and agreed upon. It's just that we now enable everyone in the enterprise to actively participate. We effectively distribute the data modeling process, and this sparks an open and fruitful debate about the enterprise data model.

The process is no longer driven by a theoretical top-down view of the business at one specific point in time. The combination of an ontology and transitional modeling provides the framework for real collaboration across the enterprise. Discussions and different points of view are completely transparent at all times. This at least gives us the tools to ultimately reach consensus on an enterprise-wide data model. With the right governance processes guiding this collaborative modeling exercise, we have a good chance of taking data modeling to the next level of evolution.


If you find this information useful, please consider clapping. I would be more than happy to receive your feedback, opinions and questions.

