Today, big data drives how companies operate, and software engineers are in high demand to handle it all. With the dominance of social media, the rise of the Internet of Things, and the government’s reliance on data collection for defense and public health, few corners of modern life go unrecorded as data points for someone’s analytics.
But this wealth of new data comes in a massive variety of formats, and with so many different types to store and analyze, how do you organize it to get the most out of it? The first step is to recognize whether the data is structured or unstructured. This post explains what these terms mean so that you can equip yourself with the knowledge you need to gather actionable intelligence from your data.
Structured data is typically quantitative. It is highly organized and fits a predefined format, such as a spreadsheet with clearly defined rows and columns. Before it is stored, structured data usually goes through a process called Extract, Transform, Load (ETL): it is extracted from its source, transformed so that each data point matches the way the destination expresses information, and then loaded into the database (usually a relational database). This approach is called “schema-on-write,” because the structure is applied when the data is written to storage.
Some common examples of structured data include inventory control systems, contact information lists, ATM records, and online sales data. Any electronic action that captures the same few simple alphanumeric data points each time it occurs produces structured data.
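To make Extract, Transform, Load a little more concrete, here is a minimal sketch in Python using the built-in sqlite3 module. The sales records, field names, and table layout are all hypothetical and chosen only for illustration; a real pipeline would look different.

```python
import sqlite3

# Extract: hypothetical raw records as they might arrive from a sales source.
raw_sales = [
    {"sku": "A-100", "qty": "2", "price": "$19.99", "note": "gift wrap"},
    {"sku": "B-200", "qty": "1", "price": "$5.00", "note": ""},
]

def transform(record):
    """Reshape one raw record into the (sku, quantity, unit_price_cents) row the table expects."""
    cents = round(float(record["price"].lstrip("$")) * 100)
    # The free-text "note" field has no column in the table, so it is dropped here.
    return (record["sku"], int(record["qty"]), cents)

conn = sqlite3.connect(":memory:")
# Schema-on-write: the structure is fixed before any data is stored.
conn.execute(
    "CREATE TABLE sales ("
    "sku TEXT NOT NULL, quantity INTEGER NOT NULL, unit_price_cents INTEGER NOT NULL)"
)

rows = [transform(r) for r in raw_sales]                      # Transform
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)  # Load
conn.commit()

loaded = conn.execute(
    "SELECT sku, quantity, unit_price_cents FROM sales ORDER BY sku"
).fetchall()
print(loaded)
```

Every row that reaches the table already has the same three typed fields, which is what makes the data easy to query later.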
Structured Data: The Good
Easy SQL Queries: Because the data is organized so neatly in a relational database, Structured Query Language (SQL) can be used to perform basic analytics quickly and easily.
Intuitive: End-users are usually able to glean at least some new and interesting information from the data just by looking at it. Being so organized makes it comprehensible.
Artificial Intelligence: Machine learning models can be taught the patterns inherent in a highly organized data set, or can learn those patterns quickly on their own.
Time-tested: Structured data has been around a long time, meaning that software, hardware, and expertise in the labor pool are all easy to come by.
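As a rough illustration of how little work basic analytics takes once data is structured, the sketch below totals units and revenue per product with a single SQL query. The table and its contents are made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sku TEXT, quantity INTEGER, unit_price_cents INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("A-100", 2, 1999), ("B-200", 1, 500), ("A-100", 1, 1999)],
)

# One declarative query answers "what sold, and how much did it bring in?"
query = """
    SELECT sku,
           SUM(quantity) AS units,
           SUM(quantity * unit_price_cents) AS revenue_cents
    FROM sales
    GROUP BY sku
    ORDER BY revenue_cents DESC
"""
summary = conn.execute(query).fetchall()

for sku, units, revenue_cents in summary:
    print(f"{sku}: {units} units, ${revenue_cents / 100:.2f}")
```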
Structured Data: The Challenge
Inflexible: Because of the rigid structure of relational databases, introducing a new data type can be labor-intensive. It requires that you adapt the database in preparation for the introduction of the new data.
Data loss: During the “T” in “Extract, Transform, Load,” extraneous information might be dropped from your data set so that it can fit neatly into its place in the database. Usually this isn’t a problem, and it helps save a lot on storage, but if you later discover that the information wasn’t extraneous after all, it might be unrecoverable.
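The inflexibility described above can be sketched with a small example. Here a hypothetical contacts table rejects a record that carries a new field until the table itself is altered. The table, column names, and record are assumptions made for illustration, and SQLite stands in for whatever relational database you actually use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (name TEXT, email TEXT)")

# A new kind of data point shows up: a social media handle.
new_record = ("Ada", "ada@example.com", "@ada_l")
insert_sql = "INSERT INTO contacts (name, email, social_handle) VALUES (?, ?, ?)"

try:
    conn.execute(insert_sql, new_record)
    rejected = False
except sqlite3.OperationalError:
    # The table has no social_handle column yet, so the insert fails.
    rejected = True

# The schema has to be adapted before the new data can be stored.
conn.execute("ALTER TABLE contacts ADD COLUMN social_handle TEXT")
conn.execute(insert_sql, new_record)

stored = conn.execute("SELECT name, email, social_handle FROM contacts").fetchall()
print(rejected, stored)
```

On a single small table this is a one-line change. On a production database with many dependent reports and applications, the same change usually takes far more planning.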
Unstructured data is typically qualitative. It stays in its native format, such as a .jpg, mostly because it has no intrinsic organizational traits the way a spreadsheet does. It is entered, unprocessed, into a data lake. The extract and transform work still happens; it just happens later, during the analytics process (an ordering often called Extract, Load, Transform, or ELT). This approach is called “schema-on-read,” because the structure is applied only when the user is making use of the data, rather than when it is stored.
Some common examples of unstructured data include emails, social media posts, chats, slide decks, pictures, audio recordings, Internet of Things sensor data, etc. Social media posts and product reviews are such a common source of unstructured data that analyzing them is sometimes casually referred to as “opinion mining” by marketing departments.
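Schema-on-read can be sketched in a few lines of Python. In this simplified example a plain dictionary stands in for a data lake, and the review text, object names, and star-rating pattern are all hypothetical. The point is only that the raw text is stored untouched and a structure is imposed at the moment someone reads it.

```python
import re

# Hypothetical "data lake": raw items stored exactly as received, keyed by object name.
data_lake = {
    "reviews/001.txt": "Loved the blender. 5 stars! Arrived in two days.",
    "reviews/002.txt": "Stopped working after a week... 1 star.",
    "reviews/003.txt": "Decent value for the price.",
}

STAR_PATTERN = re.compile(r"(\d)\s*stars?", re.IGNORECASE)

def read_with_schema(name, text):
    """Apply a structure to one raw review at read time; the stored text is not modified."""
    match = STAR_PATTERN.search(text)
    return {
        "source": name,
        "stars": int(match.group(1)) if match else None,  # None when no rating is mentioned
        "word_count": len(text.split()),
    }

records = [read_with_schema(name, text) for name, text in sorted(data_lake.items())]
for record in records:
    print(record)
```

If a different question comes up next month, a different reader function can be run over the same untouched raw text.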
Unstructured Data: The Good
There’s a ton of it: At time of writing, commonly cited industry estimates put roughly 80% of enterprise data in the unstructured category, and that share is still growing.
Data integrity: Storing it in a data lake and only transforming it later means that every aspect of the data is preserved.
Cheap but flexible storage: Data lakes are generally much more affordable than relational databases and can hold a near-infinite variety of data types.
It’s new: Because unstructured data is the relative newcomer, most businesses are not making as much use of it as they could. Putting yours to work might give you an edge on your competition.
Unstructured Data: The Challenge
Analytics are hard: You’re going to need all those savings from storage to dedicate to one or more data scientists with a high level of expertise, as well as the tools they need to get the job done.
It’s slow: Actionable intelligence can’t be gleaned right away from unstructured data. Transforming it into a comprehensible data set takes time and skilled labor.
Surprise! There is a third type of data. Semi-structured data typically lacks a fixed schema and/or doesn’t fit into a database format, but it has some organizational or hierarchical properties.
Most of the time, the term semi-structured data refers to the metadata attached to a piece of unstructured data. For example, the body of an email is unstructured, but taken together with the sender, recipient, date and time, and subject line, the message can be processed as semi-structured data. Email providers use this kind of data to help sort emails into spam folders automatically.
Other examples include tweets organized by hashtag, or videos and photos with camera settings, GPS data, date & time, and file types.
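The email example can be sketched with Python’s standard email package. The message below is invented for illustration. The headers come out as tidy, predictable fields, while the body remains free-form text.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# A made-up message; the blank line separates the headers from the body.
raw_email = """\
From: alice@example.com
To: bob@example.com
Subject: Q3 inventory numbers
Date: Mon, 02 Oct 2023 09:15:00 +0000

Hi Bob, the warehouse count came in higher than we expected. Can we talk?
"""

msg = message_from_string(raw_email)

# The metadata: a handful of well-defined fields.
metadata = {
    "sender": msg["From"],
    "recipient": msg["To"],
    "subject": msg["Subject"],
    "sent_at": parsedate_to_datetime(msg["Date"]).isoformat(),
}

# The body: unstructured text that needs further analysis to interpret.
body = msg.get_payload()

print(metadata)
print(body)
```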
Semi-Structured Data: The Good
Variety of formats: Like unstructured data, semi-structured data does not (necessarily) need to fit into a relational database and therefore can come in virtually any format.
Flexibility: The metadata of the semi-structured data can be expressed separately as structured data, and therefore reap all the associated benefits.
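As a rough sketch of that flexibility, the example below takes hypothetical photo metadata, where not every record carries the same fields, and loads the common fields into a small relational table that can then be queried like any other structured data. The file names, field names, and table are assumptions for illustration.

```python
import sqlite3

# Hypothetical photo metadata; note that the records do not all carry the same keys.
photos = [
    {"file": "IMG_001.jpg", "taken": "2023-06-01", "gps": (38.9, -77.0), "iso": 200},
    {"file": "IMG_002.jpg", "taken": "2023-06-01"},
    {"file": "IMG_003.jpg", "taken": "2023-06-03", "iso": 800},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE photo_meta (file TEXT PRIMARY KEY, taken TEXT, iso INTEGER)")

# Missing fields become NULL; the image files themselves stay wherever they are stored.
conn.executemany(
    "INSERT INTO photo_meta VALUES (?, ?, ?)",
    [(p["file"], p["taken"], p.get("iso")) for p in photos],
)

per_day = conn.execute(
    "SELECT taken, COUNT(*) FROM photo_meta GROUP BY taken ORDER BY taken"
).fetchall()
print(per_day)
```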
Semi-Structured Data: The Challenge
Slow to analyze: Because it is less organized than structured data, semi-structured data shares many of the slow, labor-intensive analytics problems that are typical of unstructured data.
Simple analytics can miss nuance: Just because someone tweets with a specific hashtag doesn’t mean they’re saying something positive about it. Since semi-structured data exists in the gray area between qualitative and quantitative, there’s some human element to it that needs to be accounted for when trying to understand it.
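The hashtag problem is easy to reproduce. In this small sketch the tweets and the #AcmePhone tag are invented. A simple count reports three mentions, but it says nothing about whether each one is praise, a complaint, or a neutral question.

```python
import re
from collections import Counter

# Invented tweets that share one hashtag but carry very different sentiment.
tweets = [
    "Just upgraded, loving it so far #AcmePhone",
    "Third crash today. Thanks for nothing #AcmePhone",
    "Anyone else waiting on their #AcmePhone order?",
]

HASHTAG = re.compile(r"#(\w+)")

# Count hashtags case-insensitively across all tweets.
counts = Counter(tag.lower() for tweet in tweets for tag in HASHTAG.findall(tweet))
print(counts)
```

Telling those three tweets apart takes analysis of the text itself, which is where the human element comes in.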
PVM Case Study
Working with our financial services client, PVM ingests structured data about hundreds of millions of consumer transactions each day to help merchants make decisions about how best to manage their businesses and to help government agencies assess the health of the economy. With such massive volumes of data involved, the structured nature of the data is an indispensable boon to the missions of those end-users.
Getting your data organized properly can be an enormous undertaking, and it’s natural to feel intimidated by the processes involved. Whatever your data organization needs, PVM can help! Contact us today to discuss our offerings.