Julius Černiauskas is the CEO at Oxylabs, a leading proxy networks and data gathering solutions provider.
Big data is often differentiated by the four V’s: velocity, veracity, volume and variety. Researchers assign various measures of importance to each of the metrics, sometimes treating them equally, sometimes separating one out of the pack.
We will do the latter today. Velocity has improved so dramatically since the term "big data" was coined that real-time acquisition has become possible. In other words, velocity is nearing its maximum capacity, which I think indicates not only a quantitative change but a qualitative one as well.
Various Iterations Of Big Data
For some time, big data was treated as a buzzword without much meaning. Such a view might have been influenced by the inherent complexity of the phenomenon, as big data is composed of four distinct pieces, each of which can appear in different combinations.
As such, there may seem to be many "big data" companies, as some businesses might have focused on volume, others on variety and still others on veracity or velocity. Much like the ancient theory of humorism, different combinations of the four V's might have led to different processes and results, all of which were grouped under the umbrella of big data.
There’s an important caveat, though. Building up one aspect of the four V’s means forgoing another. There’s always an opportunity cost associated with processes, and the same goes for big data. If a company focuses on the variety of data, for example, then volume or velocity might suffer.
We get to see a lot of that in practice with web scraping (i.e., automated public online data collection). At the moment, there’s no one-size-fits-all web scraping solution, as minor adjustments need to be made according to the website in question. While there have been some promising machine learning and artificial intelligence advancements, we’re not there yet.
Tinkering with web scraping applications nets us a larger variety of data. However, every minute spent working on that is a minute not spent working on something else. Additionally, it’s unlikely a specific application would be running while it’s being worked on, meaning we lose out on efficiency for that one as well.
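To make that per-site tinkering concrete, here is a minimal sketch in Python, assuming the requests and BeautifulSoup libraries. The URLs and CSS selectors are hypothetical placeholders, not any particular production setup; the point is simply that every target site needs its own hand-maintained extraction rules.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical mapping of target sites to the CSS selectors that currently work for them.
SITE_SELECTORS = {
    "https://shop-a.example.com/catalog": "div.product-card h2.title",
    "https://shop-b.example.com/items": "li.item > span.name",
}

def scrape_titles(url: str, selector: str) -> list[str]:
    """Fetch a page and extract the text matched by a site-specific selector."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    return [node.get_text(strip=True) for node in soup.select(selector)]

if __name__ == "__main__":
    for site, selector in SITE_SELECTORS.items():
        # Each site needs its own selector; when a layout changes, only this map is updated.
        print(site, scrape_titles(site, selector))
```

When one of those sites changes its layout, someone has to notice, update the selector and redeploy, which is exactly the maintenance cost described above.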
Yet, velocity and veracity are somewhat different from volume and variety. The former two are not dependent upon third parties, at least in the same sense as the other two.
Infinite Volume And Variety
While there have been calculations on the number of petabytes of content produced online every day, we might as well treat the total volume of big data as infinite. Much of what constitutes big data comes from other sources such as sensor data, GPS signals and even photographs.
As such, the production of data happens around the clock, and it keeps growing exponentially. These days, even data collection applications leave behind various data points and cause some of them to change over time (such as the layouts of websites). So, there’s a constant production and acceleration of data.
In other words, data is infinite in volume as it exceeds the possibilities of any current iteration of collection and analysis methods. Volume will likely continue to outpace our capabilities for the foreseeable future, if not forever.
Variety is much the same. While new data types aren’t invented often, at least on a large scale, there’s always the possibility of going more granular with variety. We can treat all text-based data as the same, but most would agree that there’s some difference between a long-form article and a single comment. While both are of the same variety, they may exert different real-world effects.
After all, variety wouldn’t be much of a category otherwise, as we would be able to separate every piece of data into either structured or unstructured and be done with it. There’s tons of granularity involved, and new types will be invented along the way.
Finite Velocity And Veracity
On the other hand, velocity and veracity are finite and independent from third parties. The flow of data has reached its peak—there are plenty of ways to acquire real-time data. From company-provided APIs, such as the Twitter API, to web scraping solutions, all of these have enabled real-time data acquisition.
Even the latter case, where the data is acquired without direct access to a company’s internal sources (rather, through external public sources), has reached real-time capabilities. As such, velocity, in the sense of the flow of data from the source to the destination, has reached its peak.
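As an illustration of what near-real-time acquisition from an external public source can look like, here is a rough polling sketch in Python. The endpoint URL and the response shape are hypothetical assumptions for the sake of the example, not a real API.

```python
import time
import requests

FEED_URL = "https://api.example.com/public-feed"  # hypothetical public endpoint
POLL_INTERVAL_SECONDS = 5                         # how close to "real time" we poll

seen_ids = set()

while True:
    response = requests.get(FEED_URL, timeout=10)
    response.raise_for_status()
    # Assumed response shape: {"items": [{"id": ...}, ...]}
    for record in response.json().get("items", []):
        if record["id"] not in seen_ids:
            seen_ids.add(record["id"])
            print("new record acquired:", record["id"])
    time.sleep(POLL_INTERVAL_SECONDS)
```

Whether the data arrives through a provider’s API or a scraping pipeline, the shape of the problem is the same: shorten the gap between the moment data appears at the source and the moment it lands at the destination.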
While we will certainly see many optimizations along the way that reduce the costs of real-time acquisition, growth in velocity is somewhat limited. Even if a new data type that necessitates new acquisition methods appears, real time is the ceiling for velocity.
Veracity follows the same trend. As it’s defined by the accuracy of data, there is a limit to how truthful it can be. Things get a little more complicated than with velocity, as verifying and measuring veracity is closer to a theoretical undertaking. While the limit of veracity exists somewhere, it’s unlikely that it can be maximized in practice.
Conclusion
While theory allows us to separate phenomena into smaller bits without any cost, practical application requires us to pick sides. Businesses, for example, can’t focus on every V at once, which causes some to progress faster than others.
Understanding that big data involves several distinct pieces, however, lets us better divide our focus. Providing absolute guidelines for businesses is impossible as there are so many different needs to be matched.
I believe a good starting point is to prioritize veracity over velocity (accurate insights matter more than the mere possibility of insights) and volume over variety (analyzing different types of data requires new and costly methods, pipelines and expertise).
An important part of any business is efficiency, and focusing on these aspects reduces the likelihood of being led astray. In veracity over velocity, we spend our resources on a smaller scale but ensure that what we collect can be turned into actions that will more reliably deliver value.
In volume over variety, we take advantage of the fact that large-scale data can reveal new and more reliable insights, as we are less likely to run into sampling and variance issues. Additionally, variety will nearly always require finding new data sources, each of which entails its own costs, whether maintenance, analysis hours or direct financial outlays.
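As a toy illustration of the sampling point, using synthetic numbers rather than real data, the spread of sample means shrinks roughly with the square root of the sample size, which is why sheer volume tends to make estimates more reliable:

```python
import random
import statistics

random.seed(42)
# Synthetic "full" dataset standing in for a large collection of observations.
population = [random.gauss(100, 25) for _ in range(100_000)]

for n in (100, 1_000, 10_000):
    # Draw repeated samples of size n and see how much their means wobble.
    sample_means = [statistics.fmean(random.sample(population, n)) for _ in range(200)]
    print(f"sample size {n:>6}: spread of sample means ~ {statistics.stdev(sample_means):.2f}")
```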