Each week we find a new topic for our readers to learn about in our AI Education column.
Today on AI Education, we’re going to talk about data. Data has been referred to as the new oil, and we should probably understand why. We’ll begin by going back a few eons to my days as a young adult working my way through college—which was after a lot of the sophisticated computing technology we use for AI was invented, but well before it was deployed in most work settings.
I worked in the emergency department of a hospital registering patients and providing financial counseling. My job was to sit at a front desk—or to go to the bedside of patients—and record names, addresses, Social Security numbers, employment information, contact and guarantor information, and insurance and billing information. It also entailed organizing charts, entering doctors’ orders into the ancient computer systems, handling all non-emergency communications and keeping an eye on the front lobby for disturbances. When I started, we were doing all this work with clipboards and old black-and-green screen UNIX systems with a light pen interface.
We had all that information being recorded, but few ways to use it to generate inferences about how healthy our city was or how well we were doing our jobs. After I recorded the data, there were few ways to link health information to billing or geographic data. By the time I left to start my journalism career, everything had changed, and part of the reason was the advent of big data.
The Rise of Big Data
Large academic medical centers like the hospital I worked in weren’t the only places where huge amounts of data were already being collected three or four decades ago. Governments, schools, prisons, universities—they were all recording geographic data, census data, tax and income data—but there were few ways to link all of this information together to create a narrative about the people and places it described. Nevertheless, painstaking work by researchers and data scientists helped put some of this information to use, informing academic research and the decisions of institutions like governments.
Then the world became linked together by faster internet connections, and better hardware made it possible to store far more data. About 20 years ago, the rise of online social networks and streaming services created a glut of new data points, this time not about the people in our colleges, prisons or hospitals, but about the regular folk who were signing on.
As data became bigger and more complex over time, we needed to find new ways to store and process it—and this problem, in part, is responsible for the rise of GPU-based data centers, which, we now know, have enabled an entire revolution in artificial intelligence. Cloud computing further helped distribute the computing and storage work geographically.
Yeah, but What Is Big Data?
Let’s first try to understand traditional data in computing, which is limited in nature, usually structured and neatly stored—think tables of numbers that are clearly labeled, where the relationships between all of the numbers are well-defined and understood at the outset. Traditional data analytics can take these data sets and derive some inferences from them—perhaps even make a prediction—with long-used statistical and business intelligence methods.
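To make that concrete, here is a minimal sketch of what traditional analytics looks like when the data is small, labeled and structured. The table and its column names are hypothetical, and pandas stands in here for whatever spreadsheet or business intelligence tool an organization might actually use.

```python
# A tiny, well-labeled table: every column has a clear meaning up front.
import pandas as pd

visits = pd.DataFrame({
    "department": ["ER", "ER", "Cardiology", "ER", "Cardiology"],
    "wait_minutes": [42, 35, 18, 50, 22],
    "billed_usd": [1200, 900, 2500, 1600, 2100],
})

# Because the structure is known in advance, classic aggregation answers
# questions directly: average wait time and total billing per department.
summary = visits.groupby("department").agg(
    avg_wait=("wait_minutes", "mean"),
    total_billed=("billed_usd", "sum"),
)
print(summary)
```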
Big data, on the other hand, is everything else—it’s large datasets that are so complex that our old-fashioned data-processing capabilities can’t understand them and therefore are unable to derive any insights from them. Big data can encompass structured or unstructured data—it might be anything from a decade’s worth of patient insurance information to all of the posts made on TikTok over the past 20 minutes. Businesses are already collecting huge amounts of data. Interactions with customers and clients may generate multiple data points. Purchases, page views, social media impressions, account openings, closings and transfers and more all produce information that can be recorded, stored and analyzed repeatedly to make inferences over time.
In order to understand what’s happening in big data, we must resort to methods like mathematical analysis, machine learning or data mining. Until our computers gained the ability to store and analyze big data, it was essentially useless, and few organizations conceived of its potential future value. Luckily, large institutions like universities and hospitals have a long history of collecting and clinging to seemingly useless and worthless things.
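As a rough illustration of what “data mining” can mean in practice, the sketch below clusters a handful of made-up social media posts using scikit-learn. A real big data pipeline would run something like this over millions of records on distributed hardware, not four strings in memory.

```python
# Group unstructured text into rough topics without any labels up front.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

posts = [
    "long wait times in the emergency room tonight",
    "loving the new streaming series everyone is talking about",
    "ER staff were wonderful despite the crowded lobby",
    "which shows are worth binge watching this weekend?",
]

# Convert free text into numeric vectors, then cluster similar posts together.
vectors = TfidfVectorizer(stop_words="english").fit_transform(posts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

for label, post in zip(labels, posts):
    print(label, post)
```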
Where Does AI Come In?
Artificial intelligence has a two-way relationship with big data. AI relies on big data for training, testing and evaluation—big data is what trains our modern, generative artificial intelligence models, which require massive amounts of information to learn and refine their functionality. On the other hand, AI is also necessary to analyze and use big data. AI data analytics uses machine learning and AI algorithms to quickly derive insights from massive amounts of data.
AI finds trends and relationships within big, unstructured data sets that are too difficult for people to handle—offering entities the power to move from raw data to an informed decision quickly.
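As a simplified illustration of that raw-data-to-decision step, the sketch below flags unusually busy hours in a hypothetical stream of emergency room arrivals. It uses a plain statistical threshold rather than a trained model, which is where real AI analytics would take over.

```python
# Hypothetical hourly counts of ER arrivals over one week (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
arrivals = rng.poisson(lam=12, size=24 * 7)

# A simple decision rule: flag hours more than two standard deviations
# above the average as candidates for extra staffing.
mean, std = arrivals.mean(), arrivals.std()
busy_hours = np.where(arrivals > mean + 2 * std)[0]
print("Hours that may need extra staff:", busy_hours.tolist())
```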
What Are the 3, 5 or 6 Vs of Big Data?
Big data’s definition is changing as our experience with technology deepens. There were originally three “Vs” that defined big data: Volume, Velocity and Variety.
Volume is the “big” in big data—when there’s more information, it’s more difficult to store, and it takes more time to process. The data is often unstructured, and it’s often impossible to know within an unstructured data set which information might have value, and which might not.
Velocity is the rate at which a system receives data, which often depends on what is being done with the data—streaming data, for example, usually moves faster than data being written to disk. Data velocity is always accelerating as connectivity speeds increase and the amount of data flowing into a system grows.
Variety refers to the different formats of data that can be encompassed within a big dataset: not just different file types, but also text, images, videos, spreadsheets, database files, social media posts and more.
Oracle has added two more “Vs” to the mix: Veracity, or how trustworthy and accurate your data is, and Value, the ability of an entity to discover value within data. Our data is only as good as our ability to trust it, and our ability to find opportunities within it.
Google has added a sixth “V,” Variability: “The meaning of collected data is constantly changing, which can lead to inconsistency over time. These shifts include not only changes in context and interpretation but also data collection methods based on the information that companies want to capture and analyze.”