Advanced Computing in the Age of AI | Wednesday, February 28, 2024

Looking for AI Success? It’s All About the Data 

As more industries integrate artificial intelligence (AI) into their operations, more organizations are scrutinizing their AI system design workflows, including the roles of modeling and data. Within these workflows, organizations are finding that starting with good data plays the largest role in producing accurate insights.

That is because when the data is fed into a model, it shapes how the model analyzes, learns, and arrives at its decisions. If that model is forced to analyze substandard data, its insights will be substandard. Conversely, if the model is fed the most accurate and useful data available, its insights will be useful.

So how does one ensure the data fed into an AI model is the best available?

Here are three tips that show how to use a data-centric approach to improve the effectiveness of AI-based systems.

Tip 1: What to Do When There is Not Enough Data

Many organizations implementing AI systems start with the question, “Do I have enough data to build a successful model?”

One scenario where this question commonly arises is the development of a predictive maintenance application for detecting critical failures. Such failures are often destructive, costly, and rare, so gathering enough failure data to accurately train an AI model capable of detecting real-world equipment failures can be difficult.

Fortunately, several data simulation techniques can generate accurate, realistic input data for such training.

The first method is to generate realistic synthetic data using a digital twin. Realistic digital twins can be made using methods such as model-based design, where all the components of a large physical system, such as an autonomous car or wind turbine, are combined into a single model. This allows the system to be simulated in multiple scenarios.

Traditionally, model-based design is used to design and simulate systems that do not include AI. When used with AI, however, it solves several challenges. It allows engineers to run thousands of simulations covering the cases a system will operate in. That generated data can then be used for training an AI model.

In addition, the trained model can then be included back in the original system model, which enables engineers to test how it performs in various simulated operating scenarios. For example, engineers can create a simulation model of a pump that generates both faulty and healthy data for building a predictive maintenance application. The data created can then be used to train a multi-class classifier to detect different combinations of faults.
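The pump workflow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's method: the `simulate_pump` function is a hypothetical stand-in for a real system-level simulation model, the three fault classes and their effects on pressure are invented for the example, and a simple nearest-centroid rule stands in for whatever multi-class classifier a team would actually train.

```python
import math
import random
import statistics

# Hypothetical fault classes; a real pump model would define its own.
FAULTS = ("healthy", "leak", "blocked_inlet")

def simulate_pump(fault, n_samples=200, rng=random):
    """Toy stand-in for a pump simulation: returns a pressure trace."""
    # Assumed effects: a leak lowers mean pressure, a blocked inlet
    # increases the pulsation amplitude.
    mean = 5.0 if fault == "leak" else 7.0
    amplitude = 1.5 if fault == "blocked_inlet" else 0.5
    return [
        mean + amplitude * math.sin(2 * math.pi * t / 20) + rng.gauss(0, 0.1)
        for t in range(n_samples)
    ]

def features(trace):
    """Summarize a trace as (mean pressure, pressure standard deviation)."""
    return (statistics.fmean(trace), statistics.pstdev(trace))

def train_centroids(rng, runs_per_class=50):
    """Run many simulations per class and average their feature vectors."""
    centroids = {}
    for fault in FAULTS:
        feats = [features(simulate_pump(fault, rng=rng)) for _ in range(runs_per_class)]
        centroids[fault] = tuple(statistics.fmean(col) for col in zip(*feats))
    return centroids

def classify(trace, centroids):
    """Assign the class whose centroid is closest in feature space."""
    f = features(trace)
    return min(centroids, key=lambda name: math.dist(f, centroids[name]))

rng = random.Random(0)
centroids = train_centroids(rng)
for fault in FAULTS:
    print(fault, "->", classify(simulate_pump(fault, rng=rng), centroids))
```

The point of the sketch is the shape of the workflow: the simulation supplies labeled failure data that would be rare and costly to collect from real equipment, and the classifier is trained entirely on that synthetic data.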

Another type of digital twin is one where an AI model is created based on historical input and output data. For example, an industrial tools and equipment manufacturer is using the digital twin approach to obtain the data it needs, and is then using that data to build predictive maintenance models that will ultimately be deployed to hundreds of thousands of air compressors manufactured and operated in its global manufacturing plants.

Synthetic data can also be generated using deep learning. If you don’t have access to a system-level model, techniques such as generative adversarial networks (GANs) can be used. GANs have shown that, given a relatively small amount of input data, they can produce synthetic data that resembles the real data they were trained on.

Tip 2: What to Do If You Do Not Have the Right Data

A common frustration encountered with AI model design is the realization that more data does not automatically improve a model’s performance. Rather than asking if a new AI system has enough data to produce accurate insights, engineers should ask whether the system has the right data to produce accurate insights. If the data a model learns from is not accurate, the outputs it generates will not be accurate either.

In our example of building an AI model for predictive maintenance, an engineer will commonly need to sift through data from hundreds or thousands of sensors on the system being analyzed.

Finding the features needed to train an accurate model can be time-consuming. Fortunately, there are ways to automate or semi-automate the feature engineering process.

Automated feature engineering uses higher-level functions that allow engineers to train a model with accuracy in line with the results they would achieve if the process were done manually. Semi-automated feature engineering using app-based workflows can enable engineers to explore, extract, and rank features from the data they have available.
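To make the "extract and rank" step concrete, here is a minimal Python sketch. It is an illustration only: the three candidate features, the synthetic healthy and faulty signals, and the choice of a Fisher-style score (squared gap between class means divided by the summed class variances) are all assumptions made for the example, not a description of any particular tool.

```python
import random
import statistics

def extract_features(signal):
    """Candidate features computed from one sensor signal."""
    return {
        "mean": statistics.fmean(signal),
        "std": statistics.pstdev(signal),
        "peak_to_peak": max(signal) - min(signal),
    }

def fisher_score(healthy_vals, faulty_vals):
    """Class separation: squared gap between means over summed variances."""
    gap = statistics.fmean(healthy_vals) - statistics.fmean(faulty_vals)
    spread = statistics.pvariance(healthy_vals) + statistics.pvariance(faulty_vals)
    return gap ** 2 / spread if spread > 0 else float("inf")

def rank_features(healthy_signals, faulty_signals):
    """Return (feature name, score) pairs, most discriminative first."""
    h = [extract_features(s) for s in healthy_signals]
    f = [extract_features(s) for s in faulty_signals]
    scores = {
        name: fisher_score([row[name] for row in h], [row[name] for row in f])
        for name in h[0]
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Assumed synthetic data: the fault doubles the noise level but leaves
# the mean unchanged, so "mean" should carry little information.
rng = random.Random(1)
healthy = [[rng.gauss(0.0, 1.0) for _ in range(500)] for _ in range(30)]
faulty = [[rng.gauss(0.0, 2.0) for _ in range(500)] for _ in range(30)]

ranking = rank_features(healthy, faulty)
for name, score in ranking:
    print(f"{name:<13} {score:.2f}")
```

In this setup the spread-based features rank well above the mean, which is the kind of signal an engineer would use to decide which features to keep when hundreds of sensor channels are available.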

Tip 3: What Data Can Help Interpret a Model?

AI models rarely offer surprises when made with accurate, well-prepared data. There are plenty of tools available to help ensure models are processing data correctly.

There are visualization techniques that help engineers see why a model is making certain decisions. Apps and tools that enable engineers to explore the predictions of an image classification network use several popular deep learning visualization techniques, including occlusion sensitivity, Grad-CAM, and gradient attribution.
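Of these techniques, occlusion sensitivity is the simplest to illustrate: mask one region of the input at a time and record how much the model's score drops. The Python sketch below shows the idea under stated assumptions. The `toy_score` function is a hypothetical stand-in for a trained image classifier, and the 4x4 image, patch size, and fill value are chosen only to keep the example small.

```python
def occlusion_sensitivity(image, score_fn, patch=2, fill=0.0):
    """Return a map of score drops observed when each patch is masked."""
    rows, cols = len(image), len(image[0])
    baseline = score_fn(image)
    heatmap = [[0.0] * cols for _ in range(rows)]
    for r0 in range(0, rows, patch):
        for c0 in range(0, cols, patch):
            # Copy the image and blank out one patch.
            occluded = [row[:] for row in image]
            cells = [
                (r, c)
                for r in range(r0, min(r0 + patch, rows))
                for c in range(c0, min(c0 + patch, cols))
            ]
            for r, c in cells:
                occluded[r][c] = fill
            # A large drop means the model relied on this region.
            drop = baseline - score_fn(occluded)
            for r, c in cells:
                heatmap[r][c] = drop
    return heatmap

def toy_score(image):
    """Hypothetical classifier score: depends only on the top-left 2x2 corner."""
    return sum(image[r][c] for r in range(2) for c in range(2)) / 4.0

image = [[1.0] * 4 for _ in range(4)]
heatmap = occlusion_sensitivity(image, toy_score)
for row in heatmap:
    print(row)
```

Because the toy score looks only at the top-left corner, masking that corner removes the whole score while masking any other patch changes nothing, so the heatmap highlights exactly the region the "model" depends on.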

Engineers can also test a model’s performance using experimentation. For example, an engineer may want to tune a model’s hyperparameters, compare the results of using different input data sets, or test different deep network architectures when creating a model. The most useful tools keep the results of all these experiments in one place so that engineers can make an informed choice about which model performs best and why.
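The idea of keeping every experiment's result in one place can be sketched as a small sweep that records each combination of data set and hyperparameter in a single sortable table. Everything here is an assumption made for illustration: the synthetic one-dimensional data, the threshold classifier standing in for a real model, and the single `threshold` hyperparameter.

```python
import itertools
import random

def make_validation_set(rng, noise, n=200):
    """Synthetic 1-D data: class True is centered at 2.0, class False at 0.0."""
    data = []
    for _ in range(n):
        label = rng.random() < 0.5
        data.append((rng.gauss(2.0 if label else 0.0, noise), label))
    return data

def accuracy(data, threshold):
    """Fraction of samples for which 'value > threshold' matches the label."""
    return sum((value > threshold) == label for value, label in data) / len(data)

def run_experiments(datasets, thresholds):
    """Evaluate every (data set, threshold) pair and sort by accuracy."""
    results = [
        {"dataset": name, "threshold": thr, "accuracy": accuracy(data, thr)}
        for (name, data), thr in itertools.product(datasets.items(), thresholds)
    ]
    return sorted(results, key=lambda row: row["accuracy"], reverse=True)

rng = random.Random(42)
datasets = {
    "clean": make_validation_set(rng, noise=0.5),
    "noisy": make_validation_set(rng, noise=1.0),
}
results = run_experiments(datasets, thresholds=[0.0, 1.0, 2.0])
for row in results:
    print(f"{row['dataset']:<6} threshold={row['threshold']:.1f} accuracy={row['accuracy']:.3f}")
```

With all runs in one sorted list, it is easy to see both which setting wins and how sensitive the result is to the input data, which is the comparison the paragraph above describes.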

If an organization is looking for AI success, a data-centric approach will lead to better outcomes.

With the right input data, engineers will be able to build better models whose behavior is easier to interpret. That will lead to a better explanation of the benefits a model can bring to the business problem being addressed.

About the Author

David Willingham is a principal product manager at MathWorks, responsible for MATLAB’s Deep Learning Toolbox. David joined the company in 2006 and has amassed more than 15 years of applied engineering experience supporting a variety of application areas in Artificial Intelligence, including Deep Learning, Machine Learning, Reinforcement Learning, Predictive Maintenance, Statistics, and Big Data & Cloud Computing. David has worked with clients in the Finance, Mining, Medical, Aerospace, Automotive, Industrial Automation & Energy industries and has published papers on Predictive Maintenance at AUSIMM Industry conferences for Mining, Mill Operators in 2016, and Iron Ore 2017. David received an honors degree in Electrical & Computer Systems Engineering from Monash University in Australia in 2003.