What is a data pipeline? What is the difference between a data pipeline and a data stack? How do those things differ from a data model? These are the concepts you'll need to know in order to build your organization's data integration systems.
Since data pipelines are so aptly named, it might be helpful to build on the plumbing analogy that is already on hand. In order for water to be accessible in your house for its myriad uses, it needs to go through a series of processes after being pulled from the water cycle. Similarly, there are many steps that data must go through between the moment it is initially recorded and the moment it is presented as actionable information. Let's use this analogy to help define our terms.
Data integration is the big-umbrella phrase that encompasses the entire field of technology, processes, and skill sets involved in bringing data from many disparate sources together so that it can be stored, analyzed, and used as a whole, such as in a relational database. Data pipelines, data warehouses, data stacks, and data transformation are all subcategories of the total process of data integration.
When a house is being built, a plumber doesn't just show up and start putting pipes into the walls at random. They have to ascertain where the water is coming from, where it's going, and what it's going to be used for, and then carefully plan out how to meet those needs effectively and efficiently. Moreover, recording that information in the form of a blueprint is extremely important for the long-term life of the house, as there will undoubtedly be renovations eventually. A data model serves the same purpose: it's a blueprint showing the movement and storage of data, as well as how various data sets are interconnected. In the age of big data, with so many different types and so many different applications, this step of the process is absolutely critical. Maintaining a detailed data model will keep data organized and aid in planning changes to your process... not to mention the role it plays in your cybersecurity.
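To make the blueprint idea concrete, here's a minimal sketch of a data model expressed as database schema definitions, using the banking scenario discussed later. The table and column names are illustrative assumptions, not a prescribed design; the point is that the model records what data exists and how the data sets reference one another.

```python
import sqlite3

# A hypothetical two-table model: accounts, and the transactions that
# belong to them. The foreign key is the "blueprint" showing how the
# two data sets are interconnected.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (
        account_id   INTEGER PRIMARY KEY,
        holder_name  TEXT NOT NULL
    );
    CREATE TABLE transactions (
        txn_id       INTEGER PRIMARY KEY,
        account_id   INTEGER NOT NULL REFERENCES accounts(account_id),
        txn_type     TEXT NOT NULL,   -- e.g. 'withdrawal', 'deposit'
        amount       REAL NOT NULL,
        recorded_at  TEXT NOT NULL    -- ISO-8601 timestamp
    );
""")
```

Even a sketch this small pays off during "renovations": anyone adding a new data source can see at a glance what already exists and what it links to.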
A data source is the origin of data. Like a river, an aquifer, or rainfall, it is where the water—data—begins its journey. Some examples of data sources include GPS logging, social media posts, and the Internet of Things.
We can’t always rely on rainstorms and rivers to provide enough water for an entire modernized community; that’s why we have to store it in reservoirs and water towers. Similarly, data must be stored between when it is collected and when it is put to use. The two most popular ways to store large quantities of data are data warehouses and data lakes. You can read more about the best uses of different data storage types in our blog on the subject.
As the name implies, a data pipeline is the series of movements and steps that one particular type of data goes through while being integrated. A single droplet of water might enter your water heater, then be pumped through your house's pipes toward your bathtub when it's time for a bath. The journey that individual droplet takes can be charted and considered a single pipeline. Similarly, an individual data point—say, a cash withdrawal from a checking account—is logged by its source, an ATM. Then it is pulled from the ATM into a database where all transactions at that particular bank are stored. From there, that withdrawal could be pulled out to be used for a variety of purposes, whether a simple banking statement or something more involved, such as an investigation into criminal activity. The pathways between where data points are collected, where they are stored, and where they are analyzed are called data pipelines.
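The ATM example above can be sketched as a toy pipeline, with each stage (source, storage, use) as its own function. The function names and record fields here are illustrative assumptions, not any particular bank's system:

```python
from datetime import datetime, timezone

def extract_from_atm():
    """Source: an ATM logs a single withdrawal event."""
    return {
        "account_id": 1042,
        "txn_type": "withdrawal",
        "amount": 60.00,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }

def load(store, event):
    """Storage: append the event to the bank's transaction store."""
    store.append(event)

def build_statement(store, account_id):
    """Use: pull stored events back out for a banking statement."""
    return [e for e in store if e["account_id"] == account_id]

# Run the pipeline end to end for one data point.
store = []
load(store, extract_from_atm())
statement = build_statement(store, 1042)
```

In a real system each stage would be a separate tool or service, but the shape is the same: a charted path from collection, through storage, to analysis.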
The form that water comes in from the moment it first arrives on your property is usually not the form it's going to take when you plan to use it. Sometimes it needs to be boiled so you can cook with it; sometimes it needs to be mixed with soap so the dishwasher can work; sometimes it needs to be filtered so you can drink it, and frozen so you can enjoy your drink better. Raw data is neither easy on the eyes nor easy to make useful. It needs to be transformed, usually into a structured format like a spreadsheet, so that people (and AIs!) can make sense of it. A plain list of a Facebook post's likes, who liked it, the date and time they liked it, and whether they commented or shared, would not be easy to interpret. But putting that information onto a spreadsheet makes patterns more noticeable, and automated analytics in the form of graphs and rankings possible.
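Here's a minimal sketch of that kind of transformation: a raw list of engagement events (with assumed field names, standing in for the Facebook example) flattened into a structured per-day summary, the sort of table a spreadsheet or chart can use directly.

```python
from collections import Counter

# Raw, hard-to-read event log (illustrative sample data).
raw_events = [
    {"user": "ana", "action": "like",    "date": "2024-05-01"},
    {"user": "ben", "action": "like",    "date": "2024-05-01"},
    {"user": "ana", "action": "comment", "date": "2024-05-02"},
    {"user": "cho", "action": "share",   "date": "2024-05-02"},
]

# Transform: count each action type per day, producing tidy rows.
summary = Counter((e["date"], e["action"]) for e in raw_events)
rows = [
    {"date": d, "action": a, "count": n}
    for (d, a), n in sorted(summary.items())
]
```

The raw log and the summary contain the same information, but only the transformed version makes the pattern (two likes on May 1, engagement spreading on May 2) visible at a glance.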
Pipes, tanks, and water alone are not a finished utility; plumbers and software engineers alike need specialized tools to put all the pieces together (not to mention fix things when they break). Data extraction and loading technologies are necessary to transport the data from its source to a warehouse. The warehouse itself is a tool in the stack, as are the technologies that query its contents. There are technologies for transforming the data, and for analyzing it as well. There are one-stop-shop brands that offer a full stack. In short, a data stack is a tool belt: without one, you have no plumbing at all.
With technology evolving so quickly, it’s more important than ever to have the right tools to do the job. That’s why PVM is partnered with cutting-edge data stack technologies such as Palantir and Vertica, and most recently, Amazon Web Services. Read more about our partnerships on our Partnerships page, or contact us for more information about how we can handle your big data needs.