You know your data is valuable. Somewhere in there are sales insights that could springboard your business into a period of growth, patient records that hold the answer to preventing a medication side effect, or financial records that will unveil wasteful spending. You know your data is worth keeping so that you can make use of it, but that’s just the problem. Where is it being stored? Do you know which data storage solutions can help you use data to grow your business?
If you don’t store it correctly, you won’t be able to access it correctly either, and then it’ll just collect proverbial data dust. But data storage comes in many forms, and navigating your options can be a dizzying exercise. What’s the difference between a relational database and a data warehouse? What is a data lake, and why would you ever want one? Let’s go over some of these terms, and hopefully it will become clear which approach to data storage is best suited to your organization’s needs.
We’ll start with an easy one. A database stores information from one data source about one part of your organization. A common example might be a database of all the online purchases made through your web store. It’s specific enough in its scope that running queries and generating reports from it is a quick and straightforward process.
A relational database connects multiple tables of data (you can think of each one as a small database in its own right) through their overlapping contents. “Relations” between them are drawn by identifying and coding these overlaps; for example, a customer in your database of web purchases might also appear in your database of email newsletter recipients. While the purchases database has credit card information that the newsletter database doesn’t, both probably have the User ID and the email address of the customer. It is only by looking at the two together, with their relations highlighted, that you can tell whether a newsletter intended to prompt a purchase has worked.
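To make the newsletter example concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The table names, column names, and sample customers are all hypothetical, invented purely for illustration; the point is only that a shared User ID lets one query look at both sets of records at once.

```python
import sqlite3

def newsletter_purchasers(conn):
    """Return (user_id, email, total_spent) for newsletter recipients who also made a purchase."""
    query = """
        SELECT n.user_id, n.email, SUM(p.amount) AS total_spent
        FROM newsletter_recipients AS n
        JOIN purchases AS p ON p.user_id = n.user_id
        GROUP BY n.user_id, n.email
        ORDER BY n.user_id
    """
    return conn.execute(query).fetchall()

# Hypothetical sample data, held in memory only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE purchases (user_id INTEGER, email TEXT, card_last4 TEXT, amount REAL);
    CREATE TABLE newsletter_recipients (user_id INTEGER, email TEXT);
    INSERT INTO purchases VALUES
        (1, 'ana@example.com', '4242', 30.0),
        (1, 'ana@example.com', '4242', 12.5),
        (3, 'cy@example.com', '1111', 8.0);
    INSERT INTO newsletter_recipients VALUES
        (1, 'ana@example.com'),
        (2, 'bo@example.com');
""")

rows = newsletter_purchasers(conn)
print(rows)
```

Only the customer who appears in both tables comes back from the query. The recipient who never bought anything, and the buyer who never received the newsletter, are left out.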
It’s important to note that while relational databases are common, they’re not the only way to interconnect multiple databases. The method that relational databases employ requires that the data be expressed with rows and columns that can be connected, but not all data takes that form. A database of social media posts, for example, would be far too unstructured to fit into a spreadsheet format.
A data warehouse is similar to a relational database in the sense that multiple databases are present and have mapped out relationships to each other, but it isn’t tied exclusively to the row/column format. Also, it’s big. Really big. Data warehouses typically store data from across every department of an entire organization, from its entire history. Usually (but not always—more on that when we get to Data Lakes), the data goes through some sort of cleanup before it is stored in its permanent home in the data warehouse; optimizing the data often means transforming it to a format that allows it to “fit” into its designated space in the warehouse.
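As a rough illustration of that cleanup step, the sketch below normalizes one raw sales record into a tidy shape before it would be loaded. The field names and the target format are assumptions made up for this example, not a standard warehouse schema.

```python
from datetime import datetime

def clean_sale(raw):
    """Normalize one raw sales record into a hypothetical warehouse format."""
    return {
        # Trim stray whitespace and lowercase the email so duplicates match up.
        "customer_email": raw["Email"].strip().lower(),
        # Convert a US-style date string into ISO format (YYYY-MM-DD).
        "sale_date": datetime.strptime(raw["Date"], "%m/%d/%Y").date().isoformat(),
        # Strip currency formatting and store the amount as a number.
        "amount_usd": round(float(raw["Total"].replace("$", "").replace(",", "")), 2),
    }

raw_sale = {
    "Email": "  Ana@Example.com ",
    "Date": "03/07/2024",
    "Total": "$1,250.50",
    "Notes": "called twice",
}
cleaned = clean_sale(raw_sale)
print(cleaned)
```

Notice that the free-text Notes field has no place in the target format, so it is simply dropped. That is the kind of trade-off a warehouse design has to make deliberately.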
Data warehouses take a lot of time, planning, and manpower to build. The volume of data involved is so massive and so diverse in scope that designing and maintaining a data warehouse is a full-time job. Actually, it’s many full-time jobs for a team. However, while the analytics may be slow, they’re so powerful that data warehousing is rapidly becoming a business standard. One option is to outsource it, and invest in Data Warehousing as a Service (DWaaS).
If analytics from data warehouses are slow, that can cause problems. Your sales team needs access to leads while they’re hot. Your power company needs to know about a major fault right away. A data mart allows you to leverage all the power of a data warehouse with all the speed of a warehouse’s smaller cousins. Essentially, before or after data is integrated into a data warehouse, it is sorted into its own little “room” in the warehouse that has been set up to serve a particular branch of your organization. Marketing departments are common users of their own data marts.
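One simple way to picture a data mart is as a narrow, pre-filtered slice of the warehouse that one department queries directly. The sketch below uses a SQLite view as a stand-in; the table, the view name, and the sample rows are hypothetical, and a real data mart is often a separately stored copy rather than a view.

```python
import sqlite3

# A tiny in-memory stand-in for a warehouse holding events from several departments.
warehouse = sqlite3.connect(":memory:")
warehouse.executescript("""
    CREATE TABLE events (department TEXT, event_type TEXT, customer_email TEXT, amount_usd REAL);
    INSERT INTO events VALUES
        ('marketing', 'newsletter_click', 'ana@example.com', NULL),
        ('finance', 'invoice_paid', 'bo@example.com', 900.0),
        ('marketing', 'ad_conversion', 'cy@example.com', 45.0);
    CREATE VIEW marketing_mart AS
        SELECT event_type, customer_email, amount_usd
        FROM events
        WHERE department = 'marketing';
""")

# The marketing team queries only its own slice, not the whole warehouse.
mart_rows = warehouse.execute(
    "SELECT event_type, customer_email FROM marketing_mart ORDER BY event_type"
).fetchall()
print(mart_rows)
```

The finance record never shows up in the marketing slice, so the team works with a smaller, more focused set of data.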
A data lake is exactly what it sounds like: a giant, disorganized pool of raw data that takes up a ton of space while offering very little information that is immediately helpful.
Why on earth would anyone want that?
Data lakes have two advantages for very specific situations. One, data lakes are useful for storing unaltered historical data. Remember earlier, how data usually has to go through cleanup to fit into a data warehouse? Sometimes, information gets lost in that process that can be painful to recover (if you can recover it at all). Usually, data warehouses are carefully designed so that when data is cleaned up and transformed, only irrelevant information falls away; but sometimes, every single data point is just too important to let go of. A data lake allows you to keep all of the information your sources gather, in exactly the form in which they gathered it. Extracting that data from the lake and transforming it into something comprehensible and actionable can happen later; if you absolutely must preserve the data exactly as it was collected, your system needs a data lake to store it.
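The “store now, transform later” idea can be sketched in a few lines. Here a plain Python list stands in for the lake, and the sensor payloads and field names are hypothetical; a real lake would typically live in file or object storage.

```python
import json

lake = []  # stand-in for raw storage; every payload is kept exactly as received

def ingest(raw_payload):
    """Store the payload untouched, with no cleanup and no schema."""
    lake.append(raw_payload)

def sensors_above(threshold):
    """Interpret the raw payloads only at the moment a question is asked."""
    results = []
    for payload in lake:
        record = json.loads(payload)
        # Records without a temperature reading are skipped rather than rejected.
        if record.get("temp_c", float("-inf")) > threshold:
            results.append(record["sensor"])
    return results

ingest('{"sensor": "A1", "temp_c": 71.5, "firmware": "2.0.1"}')
ingest('{"sensor": "B7", "temp_c": 64.0, "note": "recalibrated"}')

hot = sensors_above(70)
print(hot)
```

Extra fields such as the firmware version stay in the lake even though today’s question ignores them, so they are still there if a later question needs them.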
The second use for data lakes is artificial intelligence and machine learning. AIs function best when they have extremely large quantities of simple data points from which to learn and make predictions. Having a huge pile of raw data to unleash a machine onto is one of the surest ways to get the most out of your AI/ML investment. This particular branch of data analytics is making data lakes increasingly popular, in spite of their other inconveniences.
Always ask for advice! At PVM, we regularly work with large data sets for our clients across many of these different data storage solutions. We are very comfortable with the many different data storage options that exist, and this isn’t all of them! We do not offer data warehousing or storage within our own organization, but we have software engineers who are experts in this space and can work to create the right option for you. We want to make sure that your data works for you, not against you. If you’re looking for a data solution to your problem, let us know and we’ll be happy to provide guidance on the best path forward.