Each week we find a new topic for our readers to learn about in our AI Education column.
Welcome again, friends. Discussing foundational artificial intelligence topics, as we often do in AI Education, can quickly go off the theoretical rails, so every so often we try to return to a more tangible, real-world subject. This week, that subject is Databricks, a cloud-based data engineering and data intelligence powerhouse that seems to be everywhere these days.
Put succinctly, San Francisco-based Databricks has become not just part of the AI infrastructure available to businesses and institutions, but also a provider of AI services in its own right.
We should start, however, with a foundational topic or two to help elucidate what Databricks is really about. We’re going to start with data engineering, a practice that predates the rise of artificial intelligence—and computers themselves—yet has also been made indispensable by the emergence of AI and machine learning. Data engineering describes collecting, storing, protecting, arranging and analyzing data at scale. Over the past two decades, as tech companies have collected and manipulated massive amounts of information, the profession of data engineer has emerged.
Data intelligence, on the other hand, applies artificial intelligence to an organization’s data not just to manage and process it, but also to generate insights into how that data can be used and how the organization works, and, down the line, to build new AI solutions for that business or institution. While data analytics uses statistical methods and machine learning to understand past events and make predictions, data intelligence is more interested in understanding where data comes from and the interrelations within sets of data.
How Did We Get Here?
Databricks seems to be everywhere these days. The company raises money like crazy (including a $10 billion Series J haul in December of last year), and it doesn’t take much searching to find mentions of it in artificial intelligence news. The story that brought me to the topic this week, though, was the extension of a strategic partnership between the company and Microsoft, which offers Databricks capabilities via Microsoft’s Azure cloud. The two companies announced a multi-year agreement to continue integrating their solutions. This comes after a big announcement earlier this year that saw Databricks partnering with AI developer Anthropic.
To better understand what Databricks software helps people do, we need to understand the problem the world has with data right now, which largely is this: We started collecting and storing data long before we really understood the importance of storing and processing that data, or the immense capabilities that would be required to store and process that data. As a result, we have data all over the place, both in technological and geographical terms—data that is unstructured, poorly labeled, and not properly linked to other related data. Our personal and business computers, by themselves, were in most cases not up to the task of storing or sorting through that data—but ideas like data analytics, which necessitated some order in our vast stores of information, were already coming over the horizon.
Thus software developers seized upon ideas founded in early computing—distributed computing and distributed storage—as a solution to the world’s data problem. While one computer by itself might not be up to the task, a group of computers linked together could tackle all that data. But efficiently figuring out what part of which task to send to which computer, and where and how to move different chunks of data around, required a superhuman level of ability. Only a computer could do that kind of work, and new software would be required to tell that computer what to do.
That’s the problem the team behind Apache Spark solved with their software.
So why are we talking about Apache Spark? Because many of the developers of Spark, a free and open-source software framework, moved on to found Databricks, where they (hopefully) are making good money for their efforts. Spark efficiently distributes data engineering and processing tasks among computers and oversees the work automatically, like a digital project manager. Its key innovation was storing information in a computer’s running memory instead of on a hard disk, which allowed it to run significantly faster than other data engineering solutions. Spark also offers access to machine learning tools.
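To make the core idea concrete, here is a toy sketch, in plain Python rather than Spark itself, of the split-map-combine pattern that Spark automates across a cluster. Local threads stand in for the machines of a cluster, and the function and parameter names are illustrative inventions, not part of any Spark API.

```python
# Toy illustration of distributed data processing (not Spark itself):
# split a dataset into chunks, have "workers" process each chunk in
# parallel, then merge the partial results. Spark does this same
# pattern across many machines, keeping data in memory as it goes.
from concurrent.futures import ThreadPoolExecutor
from collections import Counter

def count_words(chunk):
    # The "map" step: each worker counts words in its slice of the data.
    return Counter(word for line in chunk for word in line.split())

def distributed_word_count(lines, workers=4):
    # Split the dataset into roughly equal chunks, one per worker.
    size = max(1, len(lines) // workers)
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # The "reduce" step: merge each worker's partial counts.
        for partial in pool.map(count_words, chunks):
            total += partial
    return total

data = ["spark keeps data in memory", "spark distributes work", "data at scale"]
print(distributed_word_count(data)["spark"])  # 2
```

The hard part Spark handles for you is everything this sketch glosses over: deciding which machine gets which chunk, recovering when a worker fails, and moving intermediate results around efficiently.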
What Does Databricks Do?
Databricks was launched initially as a cloud-based complement to Apache Spark—basically an interface that sat on top of Spark and further streamlined the work of setting up a distributed computing cluster to complete a task. Subsequently, it launched a new innovation in data—the data lakehouse.
To understand a data lakehouse, we need to understand the two concepts it combines: the data warehouse and the data lake. A data warehouse is a centralized location where a business or institution can consolidate its data, usually in a processed and organized state, for further analysis from which it can derive insights. A data lake is just a collection of data—usually very large—in its raw, unprocessed format. The data lakehouse, then, combines the data lake’s ability to handle raw, unprocessed data with some of the quality controls and capabilities of a data warehouse.
Working with all that data has led Databricks to AI. In 2023, Databricks released Dolly, its open-source large language model. Last year, Databricks released Mosaic, a suite of AI tools that includes a platform giving enterprises the ability to build and train their own AI models.
Databricks isn’t totally unique—there are other companies that provide many of the same services, often in similar ways. Snowflake, for example, offers cloud-based data warehouse capabilities as its primary business. Google’s BigQuery offers similarly efficient, real-time data analytics services. Amazon’s Redshift offers data querying and analysis across both data warehouses and data lakes, delivered through Amazon Web Services’ cloud. While Databricks is often singled out for its unified offering and scalability, these competitors, and others, may offer similar services at a lower cost and with greater ease of use.