Unlocking the Power of Big Data: The Fascinating World of Graph Learning
Large companies generate and collect vast amounts of data: an estimated 90% of it was created in recent years, yet 73% of it remains unused [1]. As you may know, however, data is a goldmine for companies working with Big Data.
Deep learning is constantly evolving, and today, the challenge is to adapt these new solutions to specific goals to stand out and enhance long-term competitiveness.
My previous manager had the intuition that these two trends could come together to facilitate access and requests, and above all to stop wasting time and money.
Why is this data left unused?
Accessing it simply takes too long: rights verification and, above all, content checks are necessary before granting access to users.
Is there a solution to automatically document new data?
If you're not familiar with large enterprises, no problem – I wasn't either. An interesting concept in such environments is the use of Big Data, particularly HDFS (Hadoop Distributed File System), a cluster designed to consolidate all of the company's data. Within this vast pool of data, you can find structured data, and within that structured data, Hive columns are referenced. Some of these columns are used to create additional tables and likely serve as sources for various datasets. Companies keep track of the relationships between tables through lineage.
These columns also have various characteristics (domain, type, name, date, owner…). The goal of the project was to document the data known as physical data with business data.
Distinguishing between physical and business data:
To put it simply, physical data is a column name in a table, and business data is the usage of that column.
For example: a table named Friends contains the columns (character, salary, address). Our physical data are character, salary, and address. Our business data are, for example:
- For "Character" -> Name of the Character
- For "Salary" -> Amount of the salary
- For "Address" -> Location of the person
This business data would help in accessing data because you would directly have the information you need. You would know that this is the dataset you want for your project and that the information you're looking for is in this table. You would just have to ask for access and find what you need, without losing time and money.
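As a minimal sketch of the idea (the table contents and labels here are purely illustrative):

```python
# Hypothetical "Friends" table: physical data are the column names.
friends_table = {
    "character": ["Ross", "Rachel"],
    "salary": [45000, 52000],
    "address": ["New York", "New York"],
}

# Business data: the human-readable meaning attached to each physical column.
business_labels = {
    "character": "Name of the Character",
    "salary": "Amount of the salary",
    "address": "Location of the person",
}

# Documenting the data = pairing each physical column with its business label.
for column in friends_table:
    print(f"{column} -> {business_labels[column]}")
```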
"During my final internship, I, along with my team of interns, implemented a Big Data / Graph Learning solution to document this data.
The idea was to create a graph to structure our data and, in the end, predict business data based on features. In other words, starting from the data stored in the company's environment, document each dataset by associating a use with it, so as to reduce search costs in the future and become more data-driven.
We had 830 labels to classify and not that many rows. Fortunately, this is where the power of graph learning comes into play. I'll let you read on… "
Article Objectives: This article aims to provide an understanding of Big Data concepts, Graph Learning, the algorithm used, and the results. It also covers deployment considerations and how to successfully develop a model.
To help you understand my journey, the outline of this article contains:
- Data Acquisition: Sourcing the Essential Data for Graph Creation
- Graph-based Modeling with GSage
- Effective Deployment Strategies
Data Acquisition
As I mentioned earlier, data is often stored in Hive columns. If you didn't already know, this data is stored in large containers. We extract, transform, and load it through a process known as ETL.
What type of data did I need?
- Physical data and their characteristics (domain, name, data type).
- Lineage (the relationships between physical data, if they have undergone common transformations).
- A mapping of some physical data to business data, to then "let" the algorithm perform on its own.
1. Characteristics/features are obtained directly when we store the data; they are mandatory as soon as data is stored. For example (this depends on your case):
For the features, based on empirical experience, we decided to use a feature hasher on three columns.
Feature Hasher: a technique used in machine learning to convert high-dimensional categorical data, such as text or categorical variables, into a lower-dimensional numerical representation, reducing memory and computational requirements while preserving meaningful information.
You could also choose the One-Hot Encoding technique if your data has similar patterns, but if you want to deliver your model, my advice would be to use the Feature Hasher.
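A short sketch of how hashing three categorical columns into a fixed-size vector might look with scikit-learn's `FeatureHasher` (the column metadata below is invented for illustration; the real feature names and hash size depend on your data):

```python
from sklearn.feature_extraction import FeatureHasher

# Hypothetical metadata for two physical columns: domain, name, and data type.
columns = [
    {"domain": "finance", "name": "salary", "type": "int"},
    {"domain": "hr", "name": "character", "type": "string"},
]

# Hash the three categorical features into a fixed-size numerical vector.
# String values are hashed as "feature=value" pairs.
hasher = FeatureHasher(n_features=16, input_type="dict")
features = hasher.transform(columns)

print(features.shape)  # one 16-dimensional vector per column
```

Unlike one-hot encoding, the output dimension stays fixed even when unseen categories appear at inference time, which is what makes it convenient for a deployed model.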
2. Lineage is a bit more complex but not impossible to understand. Lineage is like a history of the physical data: we have a rough idea of what transformations have been applied and where the data is stored elsewhere.
Picture this big data environment and all of its datasets. In some projects, we take data from a table and apply a transformation through a job (Spark).
We gather this information for all the physical data we have in order to create connections in our graph, or at least one type of connection.
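Conceptually, lineage records can be turned into graph edges; here is a minimal sketch using NetworkX (the column paths and the choice of an undirected graph are illustrative assumptions):

```python
import networkx as nx

# Hypothetical lineage records: (source column, derived column) pairs
# produced by transformation jobs (e.g. Spark).
lineage = [
    ("raw.friends.salary", "clean.friends.salary"),
    ("clean.friends.salary", "report.payroll.amount"),
]

# Each physical column becomes a node; each lineage link becomes an edge.
graph = nx.Graph()
graph.add_edges_from(lineage)

print(graph.number_of_nodes(), graph.number_of_edges())  # 3 2
```

The node features (from the feature hasher) and these edges together form the input a graph learning model consumes.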
3. The mapping is the foundation that adds value to our project. It's where we associate our business data with our physical data. This provides the algorithm with verified information so that it can classify new incoming data in the end. This mapping had to be done by someone who understands the company's processes and has the skills to recognize difficult patterns without needing to ask.
ML advice, from my own experience :
Quoting Andrew Ng: in classical machine learning, there's something called the algorithm lifecycle. We often obsess over the algorithm, making it complicated, instead of just using a good old Linear Regression (I've tried; it doesn't work). This lifecycle includes all the stages of preprocessing, modeling, and monitoring… but most importantly, it includes data focusing.
This is a mistake we often make: we take the data for granted and start doing data analysis, drawing conclusions from the dataset without questioning its relevance. Don't forget data focusing, my friends; it can boost your performance or even lead to a change of project.