Each week we find a new topic for our readers to learn about in our AI Education column.
If we really want to understand what today’s artificial intelligence does, we need to dig down to its foundations.
AI is really a combination of data and instructions, and this week we’re talking about data. More specifically, we’re covering topics in data organization and governance. Welcome to another AI Education, where this time around we’re asking: what is a data lake?
Before you fall asleep on us, let’s be clear—this is indeed a little bit like talking about a cake and starting out explaining where flour, sugar and butter come from. But, to us, cake is all the more delicious and marvelous for knowing about the cows and sugar cane and wheat that made it possible.
Similarly, AI is all the more useful and fascinating when we understand how and why it works—it also gives us insight into how AI might evolve in the near future, and where opportunities for AI deployment (and investment) lie. So we’re going to talk data.
What is data, though? Typically, in computer science applications, data is the stuff that computers record, process, transform and transmit—the figures (usually represented by numerals or symbols) upon which computations are performed, and the results of those computations. That’s the data. From an AI perspective, data has extra dimensions—it’s not just quantities or characters, but also the facts they represent and the relationships between them.
Data can be raw, collected and stored in the format it arrives in, or transformed and processed, treated with a purpose in mind. Similarly, data can be structured and optimized for use (think of a spreadsheet with logically defined rows and columns) or left in an unstructured or semi-structured state (think of a random blob of information).
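To make that distinction concrete, here’s a minimal Python sketch. The records and field names are our own invented examples, not any standard schema.

```python
import json

# Structured data: every record follows the same schema,
# like rows in a spreadsheet with defined columns.
structured_rows = [
    {"client_id": 101, "date": "2025-01-15", "amount": 250.00},
    {"client_id": 102, "date": "2025-01-16", "amount": 180.50},
]

# Semi-structured / unstructured data: a "random blob" with no
# fixed schema -- free text, nested fields, whatever arrived.
unstructured_blob = {
    "note": "Client called about rebalancing; follow up next week.",
    "attachments": ["call_recording.mp3"],
    "metadata": {"source": "crm-export", "tagged": False},
}

# Both serialize and store just fine; only the first is ready
# to drop into a spreadsheet or a database table.
print(json.dumps(structured_rows, indent=2))
print(json.dumps(unstructured_blob, indent=2))
```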
What Is a Data Lake?
Some systems to store and process data are built specifically for structured, processed data—but a data lake is not. By definition, a data lake is designed to hold data in any format and in any state of structure.
We look at our own computer for good analogies. Our file system is pretty well organized—documents are sorted into their own folders, downloads are kept in a folder that is regularly sorted through, our desktop is even relatively clean (for an editor). However, our bookmarks folder, where we drag interesting items we find as we search the web, is a cluttered mess. Pages, applications, media are all bookmarked there and left pretty much unsorted and unlabeled. Our bookmarks folder is like a data lake.
Or our Google Drive—we have our invoices for writing and editing work nicely sorted as they are sent over time. Everything else seemingly has been dragged and dropped at random, leaving us with bits of text and audio and video that we have to use search functions to sort through. Our invoices are like structured data; everything else is unstructured; and it all has a place in a data lake. Our Google Drive, too, is like a data lake.
Data lakes are intended to collect data from multiple sources into a single repository. Because the data is often stored in raw formats, data lakes are usually capable of storing tremendous volumes of information—terabytes or petabytes of stuff.
That’s Not All There Is To Data Lakes, Is It?
Data lakes allow us to store data without transforming or processing it—any needed transformations are deferred until a later time. As it turns out, data lakes are incredibly important for today’s AI because they allow artificial intelligence systems to quickly collect and store massive amounts of information before processing and using it. Data lakes, then, can consolidate the data collected from physical AI systems or an AI-of-Things in a single place, whether on premises or remote, in accordance with an entity’s compliance needs.
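As a rough illustration of that store-first, transform-later pattern, here’s a hedged Python sketch. The directory layout and function are hypothetical, with a local folder standing in for the object storage (think Amazon S3 or Azure Data Lake Storage) that real data lakes are typically built on.

```python
import pathlib
from datetime import datetime, timezone

# Hypothetical local stand-in for an object store (S3, ADLS, GCS).
LAKE_ROOT = pathlib.Path("data_lake")

def ingest_raw(source: str, payload: bytes, extension: str) -> pathlib.Path:
    """Land a payload in the lake exactly as received: no parsing,
    no schema enforcement. Transformation is deferred to read time."""
    now = datetime.now(timezone.utc)
    # Partition by source and arrival date so later jobs can find it.
    target_dir = LAKE_ROOT / "raw" / source / now.strftime("%Y/%m/%d")
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{now.strftime('%H%M%S%f')}.{extension}"
    target.write_bytes(payload)
    return target

# Anything goes in: structured CSV, free-form JSON, even audio bytes.
ingest_raw("crm", b"client_id,amount\n101,250.00\n", "csv")
ingest_raw("sensors", b'{"device": "thermostat-7", "temp_c": 21.4}', "json")
```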
So if a highly regulated institution wants to use AI to collect information on client or patient activity, it might use a data lake to store that information. In fact, almost any place where large amounts of data are being collected in real time will use a data lake to ingest that data. Data lakes also can provide data that eventually moves through repositories of processed and structured data, like data warehouses (we’ll write more on those in the future). So they’re also like a way station for data an entity ingests that will be used for AI/ML applications down the road.
Not only that, but most applications of big data require a data lake for storage and scale. Machine learning models use the information in data lakes to learn how to make predictions and recommendations.
The downside to data lakes is that the data they store is not as easy to use and deploy. While it is possible for data analysts to query the data in a data lake, there are fewer safeguards around the validity and reliability of that data. And as data gets bigger—bigger files, and more of them—data lakes themselves can struggle with scale and complexity.
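Here’s a hedged sketch of what querying a lake can look like in practice, continuing the hypothetical layout from our ingestion example above. The schema is applied only at read time, and because nothing validated the data on the way in, malformed records are something the query has to survive.

```python
import csv
import io
import pathlib

LAKE_ROOT = pathlib.Path("data_lake")

# Schema-on-read: we decide what the columns mean only now, at query time.
def total_invoiced(source: str) -> float:
    total = 0.0
    for path in (LAKE_ROOT / "raw" / source).rglob("*.csv"):
        reader = csv.DictReader(io.StringIO(path.read_text()))
        for row in reader:
            try:
                total += float(row["amount"])
            except (KeyError, ValueError):
                # No schema was enforced at write time, so malformed
                # rows are a fact of life; we skip them here.
                continue
    return total

print(f"Total invoiced: {total_invoiced('crm'):.2f}")
```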
What Is a Data Swamp?
Every few months, we clear things out. The desktop is cleared of files, the downloads folder is sorted, we prune at least some of our bookmarks and random Google Drive files—we might even get through some of our inbox (for once). In other words, we don’t just let our personal “data lake”-like repositories sit there completely unmanaged—because if we did, they would soon become unmanageable.
Managing a data lake requires some level of infrastructure and processing over time. If data is left to pile up unprocessed and uncataloged, it becomes a data swamp. A data swamp is a disorganized or under-governed data repository (most often a degraded data lake) whose defining feature is that it contains information of some value that cannot be accessed or understood because it is lost in a huge mire of other unstructured, unlabeled, unprocessed data.
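What does even minimal governance look like? Here’s a toy Python sketch of a data catalog; the fields and file paths are our own illustrative choices. Every file landed in the lake gets a record of where it came from, when, and what it contains, which is exactly the metadata a data swamp lacks.

```python
import json
import pathlib
from datetime import datetime, timezone

# A simple append-only catalog file; real lakes use dedicated
# catalog services, but the idea is the same.
CATALOG = pathlib.Path("data_lake/catalog.jsonl")

def register(path: str, source: str, description: str, owner: str) -> None:
    """Append a catalog entry for a newly landed file. Without records
    like these, a lake quietly degrades into a swamp."""
    entry = {
        "path": path,
        "source": source,
        "description": description,
        "owner": owner,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    CATALOG.parent.mkdir(parents=True, exist_ok=True)
    with CATALOG.open("a") as f:
        f.write(json.dumps(entry) + "\n")

register(
    path="raw/crm/2025/01/15/093000123456.csv",
    source="crm",
    description="Daily invoice export, one row per client invoice",
    owner="ops@example.com",
)
```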
So a data lake, properly managed, is not really a laissez-faire information dump where ideas and inspiration go to die of neglect (like our bookmarks folder). What’s important is that the data is ingested, stored for a time, and then eventually transformed, moved, used, or deleted, in a logical process. Today’s data lakes are really a combination of massive amounts of storage with the ability to process some or all of the stored data.