Tag Archives: Big Data
What do all marketers have in common? Marketing guru Seth Godin famously said that all marketers are storytellers. Stories, not features and benefits, sell.
Anyone who buys a slightly more expensive brand of laundry detergent because it’s “better” proves this. Godin wrote that if someone buys shoes because he or she wants to be associated with a brand that is “cool,” that brand successfully told its story to the right market.
A story has heroes we identify with. It has a conflict, which the heroes try to overcome. A good story’s DNA is an ordinary person in unusual circumstances. When is the last time you had an unusual result from your marketing campaigns? Perhaps a pay-per-click ad does poorly in your A/B testing. Or, there’s a high bounce rate from your latest email campaign.
Many marketers aren’t data scientists. But savvy marketers know they have to deal with big data, since it has become a hot topic central to many businesses. Marketers simply want to do their jobs better — and big data should be seen as an opportunity, not a hindrance.
When you have big data that could unlock great insight into your business, look beyond complexity and start with your strength as a marketer: Storytelling.
To get you started, I took the needs of marketers and applied them to these “who, what, why and how” principles from a recent article in the Harvard Business Review by the author of Big Data at Work, Tom Davenport:
Who is your hero? He or she is likely your prospective or existing customer.
What problem did the hero have? This is the action of the story. Here’s a real-life example from the Harvard Business Review article: Your hero visits your website, and adds items to the shopping cart. However, when you look at your analytics dashboard, you notice he or she never finishes the transaction.
Why do you care about the hero’s problem? Identifying with the hero is important for a story’s audience. It creates tension, and gives you and other stakeholders the incentive you need to dig into your data for a resolution.
How do you resolve the problem? Now you see what big data can do — it solves marketing problems and gives you better results. In the abandoned shopping cart example, the company found that people in Ireland were not checking out. The resolution came from the discovery that the check-out process asked for a postal code. Some areas of Ireland have no postal codes, so visitors would give up.
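To make the abandoned-cart investigation concrete, here is a minimal sketch of how a marketer or analyst might break checkout abandonment down by country. The session records, field layout, and function name are hypothetical and only illustrate the idea; they are not taken from the Harvard Business Review example.

```python
from collections import defaultdict

# Hypothetical session records: (country, added_to_cart, completed_checkout).
SESSIONS = [
    ("US", True, True),
    ("US", True, True),
    ("US", True, False),
    ("IE", True, False),
    ("IE", True, False),
    ("IE", True, False),
    ("DE", True, True),
    ("DE", False, False),
]

def abandonment_by_country(sessions):
    """Return {country: share of carts that never reached checkout}."""
    carts = defaultdict(int)
    abandoned = defaultdict(int)
    for country, added, completed in sessions:
        if not added:
            continue  # no cart was created, so there is nothing to abandon
        carts[country] += 1
        if not completed:
            abandoned[country] += 1
    return {country: abandoned[country] / carts[country] for country in carts}

rates = abandonment_by_country(SESSIONS)
# List the worst-performing countries first.
for country, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{country}: {rate:.0%} of carts abandoned")
```

A country whose rate stands far above the rest, as Ireland does in this made-up sample, is the cue to walk through the checkout form as a visitor from that country would.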
Remember it’s possible that the data itself is the problem. If you have bad contact data, you can’t reach your customers. Find the source of your bad data, and then you can return to your marketing efforts with confidence.
While big data may sound complicated or messy, if you have a storytelling path like this to take, you can find the motivation you need to uncover the powerful information required to better engage with your audience.
Engaging your audience starts with having accurate, validated information about your audience. Marketers can use data to fuel their campaigns and make better decisions on strategy and planning. Learn more about data quality management in this white paper.
Time to Celebrate! Informatica is Once Again Positioned as a Leader in Gartner’s Magic Quadrant for Data Quality Tools!
It’s holiday season once again at Informatica and this one feels particularly special because we just received an early present from Gartner: Informatica has just been positioned as a leader in Gartner’s Magic Quadrant for Data Quality Tools report for 2014! Click here to download the full report.
And as it turns out, this is a gift that keeps on giving. For eight years in a row, Informatica has been ranked as a leader in Gartner’s Magic Quadrant for Data Quality Tools. In fact, for the past two years running, Informatica has been positioned highest and best for ability to execute and completeness of vision, the two dimensions Gartner measures in their report. These results once again validate our operational excellence as well as our prescience with our data quality product offerings. Yes folks, some days it’s hard to be humble.
Consistency and leadership are becoming hallmarks for Informatica in these and other analyst reports, and it’s hardly an accident. Those milestones are the result of our deep understanding of the market, continued innovation in product design, seamless execution on sales and marketing, and relentless dedication to customer success. Our customer loyalty has never been stronger with those essential elements in place. However, while celebrating our achievements, we are equally excited about the success our customers have achieved using our data quality products.
Managing and producing quality data is indispensable in today’s data-centric world. Gaining access to clean, trusted information should be one of a company’s most important tasks, and has previously been shown to be directly linked to growth and continued innovation.
We are truly living in a digital world – a world revolving around the Internet, gadgets and apps – all of which generate data, and lots of it. Should your organization take advantage of its increasing masses of data? You bet. But remember: only clean, trusted data has real value. Informatica’s mission is to help you excel by turning your data into valuable information assets that you can put to good use.
To see for yourself what the industry leading data quality tool can do, click here.
And from all of our team at Informatica, Happy holidays to you and yours.
It takes a village to build mainstream big data solutions. We often get so caught up in Hadoop use cases and customer successes that sometimes we don’t talk enough about the innovative partner technologies and integrations that enable our customers to put the enterprise data hub at the core of their data architecture and innovate with confidence. Cloudera and Informatica have been working together to integrate our products to enable new levels of productivity and lower deployment and production risk.
Going from Hadoop to an enterprise data hub, means a number of things. It means that you recognize the business value of capturing and leveraging all your data for exploration and analytics. It means you’re ready to make the move from Hadoop pilot project to production. And it means your data is important enough that it’s worth securing and making data pipelines visible. It’s the visibility layer, and in particular, the unique integration between Cloudera Navigator and Informatica that I want to focus on in this post.
The era of big data has ushered in increased regulations in a number of industries (banking, retail, healthcare, energy), most of which deal with how data is managed throughout its lifecycle. Cloudera Navigator is the only native end-to-end solution for governance in Hadoop. It provides visibility for analysts to explore data in Hadoop, and enables administrators and managers to maintain a full audit history for HDFS, HBase, Hive, Impala, Spark and Sentry, and then run reports on data access for auditing and compliance. The integration of Informatica Metadata Manager in the Big Data Edition and Cloudera Navigator extends this level of visibility and governance beyond the enterprise data hub.
Today, only Informatica and Cloudera provide end-to-end data lineage from source systems through Hadoop, and into BI/analytic and data warehouse systems. And you can view it from a single pane within Informatica.
This is important because Hadoop, and the enterprise data hub in particular, doesn’t function in a silo. It’s an integrated part of a larger enterprise-wide data management architecture. The better the insight into where data originated, where it traveled, who had access to it and what they did with it, the greater our ability to report and audit. No other combination of technologies provides this level of audit granularity.
But more than that, the visibility Cloudera and Informatica provide gives our joint customers the ability to confidently stand up an enterprise data hub as part of their production enterprise infrastructure, because they can verify the integrity of the data that undergirds their analytics. I encourage you to check out a demo of the Informatica-Cloudera Navigator integration at this link: http://infa.media/1uBpPbT
You can also check out a demo and learn a little more about Cloudera Navigator and the Informatica integration in the recorded TechTalk hosted by Informatica at this link:
Data warehousing systems remain the de facto standard for high performance reporting and business intelligence, and there is no sign that will change soon. But Hadoop now offers an opportunity to lower costs by moving infrequently used data and data preparation workloads off the data warehouse, and to process entirely new sources of data coming from the explosion of industrial and personal devices. This is motivating interest in new concepts like the “data lake” as adjunct environments to traditional data warehousing systems.
Now, let’s be real. Between the evolutionary opportunity of preparing data more cost effectively and the revolutionary opportunity of analyzing new sources of data, the latter just sounds cooler. This revolutionary opportunity is what has spurred the growth of new roles like data scientists and new tools for self-service visualization. In the revolutionary world of pervasive analytics, data scientists have the ability to use Hadoop as a low cost and transient sandbox for data. Data scientists can perform exploratory data analysis by quickly dumping data from a variety of sources into a schema-on-read platform and by iterating dumps as new data comes in. SQL-on-Hadoop technologies like Cloudera Impala, Hortonworks Stinger, Apache Drill, and Pivotal HAWQ enable agile and iterative SQL-like queries on datasets, while new analysis tools like Tableau enable self-service visualization. We are merely in the early phases of the revolutionary opportunity of big data.
But while the revolutionary opportunity is exciting, there’s an equally compelling opportunity for enterprises to modernize their existing data environment. Enterprises cannot rely on an iterative dump methodology for managing operational data pipelines. Unmanaged “data swamps” are simply impractical for business operations. For an operational data pipeline, the Hadoop environment must be a clean, consistent, and compliant system of record for serving analytical systems. Loading enterprise data into Hadoop instead of a relational data warehouse does not eliminate the need to prepare it.
Now I have a secret to share with you: nearly every enterprise adopting Hadoop today to modernize their data environment has processes, standards, tools, and people dedicated to data profiling, data cleansing, data refinement, data enrichment, and data validation. In the world of enterprise big data, schemas and metadata still matter.
I’ll share some examples with you. I attended a customer panel at Strata + Hadoop World in October. One of the participants was the analytics program lead at a large software company whose team was responsible for data preparation. He described how they ingest data from heterogeneous data sources by mandating a standardized schema for everything that lands in the Hadoop data lake. Once the data lands, his team profiles, cleans, refines, enriches, and validates the data so that business analysts have access to high quality information. Another data executive described how inbound data teams are required to convert data into Avro before storing the data in the data lake. (Avro is an emerging data format alongside other new formats like ORC, Parquet, and JSON). One data engineer from one of the largest consumer internet companies in the world described the schema review committee that had been set up to govern changes to their data schemas. The final participant was an enterprise architect from one of the world’s largest telecom providers who described how their data schema was critical for maintaining compliance with privacy requirements since data had to be masked before it could be made available to analysts.
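The practices these panelists described (a mandated schema on ingest, plus masking before analysts get access) can be sketched in a few lines. This is an illustration under stated assumptions, not any panelist's actual pipeline: the schema, the field names, and the choice of a truncated SHA-256 digest as the masking method are all hypothetical, and a real deployment would more likely rely on a schema format such as Avro and a dedicated masking tool.

```python
import hashlib

# Hypothetical standardized schema: field name -> required Python type.
SCHEMA = {"customer_id": str, "event": str, "amount": float}
# Hypothetical set of fields that must be masked before analysts see them.
MASKED_FIELDS = {"customer_id"}

def validate(record, schema=SCHEMA):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for field, expected in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"wrong type for {field}")
    for field in record:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems

def mask(record, masked_fields=MASKED_FIELDS):
    """Return a copy with sensitive values replaced by a truncated digest.

    Plain hashing is only a stand-in here; it is not a complete
    privacy control on its own.
    """
    out = dict(record)
    for field in masked_fields:
        if field in out:
            digest = hashlib.sha256(str(out[field]).encode("utf-8")).hexdigest()
            out[field] = digest[:12]
    return out

good = {"customer_id": "C-1001", "event": "purchase", "amount": 19.99}
bad = {"customer_id": "C-1002", "event": "purchase"}

print(validate(good))  # conforming record: no problems reported
print(validate(bad))   # reports the missing amount field
print(mask(good))      # customer_id replaced, other fields untouched
```

Rejecting or quarantining records at the point where `validate` reports problems is what keeps the lake from turning into the "data swamp" described earlier.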
Let me be clear – these companies are not just bringing in CRM and ERP data into Hadoop. These organizations are ingesting patient sensor data, log files, event data, clickstream data, and in every case, data preparation was the first task at hand.
I recently talked to a large financial services customer who proposed a unique architecture for their Hadoop deployment. They wanted to empower line-of-business users to be creative in discovering revolutionary opportunities while also evolving their existing data environment. They decided to allow lines of business to set up sandbox data lakes on local Hadoop clusters for use by small teams of data scientists. Then, once a subset of data was profiled, cleansed, refined, enriched, and validated, it would be loaded into a larger Hadoop cluster functioning as an enterprise information lake. Unlike the sandbox data lakes, the enterprise information lake was clean, consistent, and compliant. Data stewards of the enterprise information lake could govern metadata and ensure data lineage tracking from source systems to sandbox to enterprise information lakes to destination systems. Enterprise information lakes balance the quality of a data warehouse with the cost-effective scalability of Hadoop.
Building enterprise information lakes out of data lakes is simple and fast with tools that can port data pipeline mappings from traditional architectures to Hadoop. With visual development interfaces and native execution on Hadoop, enterprises can accelerate their adoption of Hadoop for operational data pipelines.
No one described the opportunity of enterprise information lakes better at Strata + Hadoop World than a data executive from a large healthcare provider who said, “While big data is exciting, equally exciting is complete data…we are data rich and information poor today.” Schemas and metadata matter more than ever, and with the help of leading data integration and preparation tools like Informatica, enterprises have a path to unleashing information riches. To learn more, check out this Big Data Workbook.
Insurance companies serve as a fantastic example of big data technology use since data is such a pervasive asset in the business. From a cost savings and risk mitigation standpoint, insurance companies use data to assess the probable maximum loss of catastrophic events as well as detect the potential for fraudulent claims. From a revenue growth standpoint, insurance companies use data to intelligently price new insurance offerings and deploy cross-sell offers to customers to maximize their lifetime value.
New data sources are enabling insurance companies to mitigate risk and grow revenues even more effectively. Location-based data from mobile devices, along with sensors inside insured properties, is being used to proactively detect exposure to catastrophic events and deploy preventive maintenance. For example, automobile insurance providers are increasingly offering usage-based driving programs, whereby insured individuals install a mobile sensor inside their car to relay the quality of their driving back to their insurance provider in exchange for lower premiums. Even healthcare insurance providers are starting to analyze the data collected by wearable fitness bands and smart watches to monitor insured individuals and inform them of personalized ways to be healthier. Devices can also be deployed in the environments that trigger adverse events, such as sensors that monitor earthquake and weather patterns, to help mitigate the costs of potential events. Claims are increasingly submitted with supporting information in a variety of formats like text files, spreadsheets, and PDFs that can be mined for insights as well. And with the growth of online insurance sales, web log and clickstream data is more important than ever to help drive online revenue.
Beyond the benefits of using new data sources to assess risk and grow revenues, big data technologies are enabling insurance companies to fundamentally rethink the basis of their analytical architecture. In the past, probable maximum loss modeling could only be performed on statistically aggregated datasets. But with big data technologies, insurance companies have the opportunity to analyze data at the level of an insured individual or a unique insurance claim. This increased depth of analysis has the potential to radically improve the quality and accuracy of risk models and market predictions.
Informatica is helping insurance companies accelerate the benefits of big data technologies. With multiple styles of ingestion available, Informatica enables insurance companies to leverage nearly any source of data. Informatica Big Data Edition provides comprehensive data transformations for ETL and data quality, so that insurance companies can profile, parse, integrate, cleanse, and refine data using a simple user-friendly visual development environment. With built-in data lineage tracking and support for data masking, Informatica helps insurance companies ensure regulatory compliance across all data.
To try out the Big Data Edition, download a free trial today in the Informatica Marketplace and get started with big data today!
Service and support are a critical part of this engagement strategy. Retail and consumer goods companies recognize the importance of support to the overall customer relationship. Consequently, these companies have integrated their before- and after-purchase support into their multi-channel and omni-channel marketing strategies. While retail and consumer products companies have led the way in making support an integral part of ongoing customer engagement, B2B companies have begun to do the same. Enterprise IT companies, which are primarily B2B companies, have been expanding their service and support capabilities to create more engagement between their customers and themselves. Service offerings have expanded to include mobile tools, analytics-driven self-help, and support over social media and other digital channels. The goal of these investments has been to make interactions more productive for the customer, to strengthen relationships through positive engagement, and to gather data that drives improvements in both the product and the service.
A great example of an enterprise software company that understands the value of customer engagement through support is Informatica. Known primarily for their data integration products, Informatica has been quickly expanding their portfolio of data management and data access products over the past few years. This growth in their product portfolio has introduced many new types of customers to Informatica and created more complex customer relationships. For example, the new Springbok product is aimed at making data accessible to the business user, a new type of interaction for Informatica. Informatica has responded with a collection of new service enhancements that augment and extend existing service channels and capabilities.
What these moves say to me is that Informatica has made a commitment to deeper engagement with customers. For example, Informatica has expanded the avenues through which customers can get support. By adding social media and mobile capabilities, they are creating additional points of presence that address customer issues when and where customers are. Informatica provides support on the customers’ terms instead of requiring customers to do what is convenient for Informatica. Ultimately, Informatica is creating more value by making it easier for customers to interact with them. The best support is that which solves the problem quickest with the least amount of effort. Intuitive knowledge base systems, online support, sourcing answers from peers, and other tools that help find solutions immediately are more valued than traditional phone support. This is the philosophy that drives the new self-help portal, predictive escalation, and product adoption services.
Informatica is also shifting the support focus from products to business outcomes. They manage problems holistically and are not simply trying to create product band-aids. This shows a recognition that technical problems with data are actually business problems that have broad effects on a customer’s business. Contrast this with the traditional approach to support, which focuses on fixing a technical issue but doesn’t necessarily address the wider organizational effects of those problems.
More than anything, these changes are preparation for a very different support landscape. With the launch of the Springbok data analytics tool, Informatica’s support organization is clearly positioning itself to help business analysts and similar semi-technical end-users. The expectations of these end-users have been set by consumer applications. They expect more automation and more online resources that help them to use and derive value from their software and are less enamored with fixing technical problems.
In the past, technical support was mostly charged with solving immediate technical issues. That’s still important since the products have to work first to be useful. Now, however, support organizations have an expanded mission to be part of the overall customer experience and to enhance overall engagement. The latest enhancements to the Informatica support portfolio reflect this mission and prepare them for the next generation of non-IT Informatica customers.
A couple of comments on the importance of integration platforms like Informatica in an EDW/Hadoop environment.
- Hadoop does mean you can do some quick and inexpensive exploratory analysis with little or no ETL. The issue is that it will not perform at the level you need to take it to production. As the webinar points out, applying some structure to the data with columnar files (not RDBMS) will dramatically speed up query performance.
- The other thing that makes an integration platform more important than ever is the explosion of data complexity. As Dr. Kimball put it:
“Integration is even more important these days because you are looking at all sorts of data sources coming in from all sorts of directions.”
To perform interesting analyses, you are going to have to be able to join data with different formats and different semantic meaning. And that is going to require integration tools.
- Thirdly, if you are going to put this data into production, you will want to incorporate data cleansing, metadata management, and possibly formal data governance to ensure that your data is trustworthy, auditable, and has business context. There is no point in serving up bad data quickly and inexpensively. The result will be poor business decisions and flawed analyses.
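As a small illustration of the second point above, joining data that arrives in different formats and with different naming conventions, here is a minimal sketch. The two inputs, their field names, and the ID normalization rule are hypothetical; an integration platform handles this kind of reconciliation at far larger scale and with managed metadata.

```python
import csv
import io
import json

# Hypothetical inputs: the same customers as described by two systems.
CRM_CSV = "cust_id,name\n001,Ada\n002,Grace\n"
WEB_JSON = '[{"customerId": 1, "pageViews": 12}, {"customerId": 3, "pageViews": 4}]'

def normalize_id(value):
    """Map '001' (CSV text) and 1 (JSON number) to the same integer key."""
    return int(str(value).strip())

def join_sources(crm_csv, web_json):
    """Inner-join CRM rows and web events on the normalized customer ID."""
    crm = {
        normalize_id(row["cust_id"]): row["name"]
        for row in csv.DictReader(io.StringIO(crm_csv))
    }
    joined = []
    for event in json.loads(web_json):
        key = normalize_id(event["customerId"])
        if key in crm:  # keep only customers known to both sources
            joined.append(
                {"customer_id": key, "name": crm[key], "page_views": event["pageViews"]}
            )
    return joined

result = join_sources(CRM_CSV, WEB_JSON)
print(result)
```

Even in this toy case the join only works once the two systems agree on what a customer ID is, which is the semantic part of the integration problem.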
For Data Warehouse Architects
The challenge is to deliver actionable content from the exploding amount of data available. You will need to be constantly scanning for new sources of data and looking for ways to quickly and efficiently deliver that to the point of analysis.
For Enterprise Architects
The challenge with adding Big Data to Your EDW Architecture is to define and drive a coherent enterprise data architecture across your organization that standardizes people, processes, and tools to deliver clean and secure data in the most efficient way possible. It will also be important to automate as much as possible to offload routine tasks from the IT staff. The key to that automation will be the effective use of metadata across the entire environment to not only understand the data itself, but how it is used, by whom, and for what business purpose. Once you have done that, then it will become possible to build intelligence into the environment.
For more on Informatica’s vision for an Intelligent Data Platform and how this fits into your enterprise data architecture, see Think “Data First” to Drive Business Value.
I ended my previous blog wondering if awareness of Data Gravity should change our behavior. While Data Gravity adds Value to Big Data, I find that the application of that Value is underexplained.
Exponential growth of data has naturally led us to want to categorize it into facts, relationships, entities, etc. This sounds very elementary. While this happens so quickly in our subconscious minds as humans, it takes significant effort to teach this to a machine.
A friend tweeted this to me last week: I paddled out today, now I look like a lobster. Since this tweet, Twitter has inundated my friend and me with promotions from Red Lobster. It is because the machine deconstructed the tweet: paddled <PROPULSION>, today <TIME>, like <PREFERENCE> and lobster <CRUSTACEANS>. While putting these together, the machine decided that the keyword was lobster. You and I both know that my friend was not talking about lobsters.
You may think that this may be just a funny edge case. You can confuse any computer system if you try hard enough, right? Unfortunately, this isn’t an edge case. The 140-character limit has not just changed people’s tweets, it has changed how people talk on the web. More and more information is communicated in smaller and smaller amounts of language, and this trend is only going to continue.
When will the machine understand that “I look like a lobster” means I am sunburned?
I believe the reason there are not hundreds of companies exploiting machine-learning techniques to generate a truly semantic web is the lack of weighted edges in publicly available ontologies. Keep reading, it will all make sense in about 5 sentences. Lobster and Sunscreen are 7 hops away from each other in DBpedia, far too many to draw any correlation between the two. For that matter, any article in Wikipedia is connected to any other article within about 14 hops, and that’s the extreme. Completely unrelated concepts are often just a few hops from each other.
But by analyzing massive amounts of both written and spoken English text from articles, books, social media, and television, it is possible for a machine to automatically draw a correlation and create a weighted edge between the Lobsters and Sunscreen nodes that effectively short circuits the 7 hops necessary. Many organizations are dumping massive amounts of facts without weights into our repositories of total human knowledge because they are naïvely attempting to categorize everything without realizing that the repositories of human knowledge need to mimic how humans use knowledge.
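The "short circuit" described above can be sketched with a weighted shortest-path search. Everything in this example is an assumption made for illustration: the intermediate node names are placeholders rather than real DBpedia entries, the edge costs are invented, and Dijkstra's algorithm is simply one convenient way to measure distance in a weighted graph.

```python
import heapq

def shortest_distance(graph, start, goal):
    """Dijkstra's algorithm over {node: {neighbor: cost}}; lower cost means closer."""
    best = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == goal:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry, a shorter route was already found
        for neighbor, cost in graph.get(node, {}).items():
            candidate = dist + cost
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return float("inf")

def add_edge(graph, a, b, cost):
    """Add an undirected edge with the given cost."""
    graph.setdefault(a, {})[b] = cost
    graph.setdefault(b, {})[a] = cost

# Hypothetical 7-hop chain of unweighted (cost 1.0) ontology links.
chain = ["lobster", "n1", "n2", "n3", "n4", "n5", "n6", "sunscreen"]
graph = {}
for a, b in zip(chain, chain[1:]):
    add_edge(graph, a, b, 1.0)

before = shortest_distance(graph, "lobster", "sunscreen")

# A learned association from text analysis becomes a cheap direct edge.
add_edge(graph, "lobster", "sunscreen", 0.5)
after = shortest_distance(graph, "lobster", "sunscreen")

print(before, after)
```

With only the unit-cost chain the two concepts sit seven steps apart; once the learned edge is added, the search takes the direct route and the distance collapses.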
For example – if you hear the name Babe Ruth, what is the first thing that pops to mind? Roman Catholics from Maryland born in the 1800s or Famous Baseball Player?
If you look in Wikipedia today, he is listed under 28 categories, each of them with the same level of attachment: 1895 births | 1948 deaths | American League All-Stars | American League batting champions | American League ERA champions | American League home run champions | American League RBI champions | American people of German descent | American Roman Catholics | Babe Ruth | Baltimore Orioles (IL) players | Baseball players from Maryland | Boston Braves players | Boston Red Sox players | Brooklyn Dodgers coaches | Burials at Gate of Heaven Cemetery | Cancer deaths in New York | Deaths from esophageal cancer | Major League Baseball first base coaches | Major League Baseball left fielders | Major League Baseball pitchers | Major League Baseball players with retired numbers | Major League Baseball right fielders | National Baseball Hall of Fame inductees | New York Yankees players | Providence Grays (minor league) players | Sportspeople from Baltimore, Maryland | Vaudeville performers.
Now imagine how confused a machine would get when the distance of unweighted edges between nodes is used as a scoring mechanism for relevancy.
If I were to design an algorithm that uses weighted edges (on a scale of 1-5, with 5 being the highest), the same search would yield a much more obvious result.
1895 births | 1948 deaths | American League All-Stars | American League batting champions | American League ERA champions | American League home run champions | American League RBI champions | American people of German descent | American Roman Catholics | Babe Ruth | Baltimore Orioles (IL) players | Baseball players from Maryland | Boston Braves players | Boston Red Sox players | Brooklyn Dodgers coaches | Burials at Gate of Heaven Cemetery | Cancer deaths in New York | Deaths from esophageal cancer | Major League Baseball first base coaches | Major League Baseball left fielders | Major League Baseball pitchers | Major League Baseball players with retired numbers | Major League Baseball right fielders | National Baseball Hall of Fame inductees | New York Yankees players | Providence Grays (minor league) players | Sportspeople from Baltimore, Maryland | Vaudeville performers.
Now the machine starts to think more like a human. The above example forces us to ask about the relevancy, a.k.a. the Value, of the response. This is where I think Data Gravity becomes relevant.
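The weighting idea can be sketched as follows. The weights here are invented purely for illustration (they are not Wikipedia data and not the weights from the original post), and only a handful of the 28 categories are included to keep the example short.

```python
# Weights on the 1-5 scale (5 = strongest association). All values are
# hypothetical and chosen only to demonstrate the ranking step.
CATEGORY_WEIGHTS = {
    "New York Yankees players": 5,
    "National Baseball Hall of Fame inductees": 5,
    "American League home run champions": 4,
    "Boston Red Sox players": 4,
    "Major League Baseball pitchers": 3,
    "Sportspeople from Baltimore, Maryland": 2,
    "American Roman Catholics": 1,
    "1895 births": 1,
}

def top_categories(weights, minimum=4):
    """Return (category, weight) pairs at or above `minimum`, strongest first."""
    selected = [(name, weight) for name, weight in weights.items() if weight >= minimum]
    # Sort by descending weight, then alphabetically for a stable order.
    return sorted(selected, key=lambda item: (-item[1], item[0]))

ranked = top_categories(CATEGORY_WEIGHTS)
for name, weight in ranked:
    print(weight, name)
```

With weights in place, "famous baseball player" categories surface ahead of facts like birth year or religion, which is closer to how a person would answer the Babe Ruth question.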
You can contact me on twitter @bigdatabeat with your comments.
As retailers move from looking in the rear view mirror (what happened) to the road ahead (what will happen), they have turned to Big Data and Analytics for answers. While Big Data holds great promise for retailers, many are skeptical. Retailers are already drinking from the data fire hose, whether it’s transaction data recording every product sold to every customer across all channels, research data covering detailed consumer profiles, or web log and social data. The questions retailers are asking: Will the investment drive more revenue, increase customer loyalty and create a more rewarding customer experience? Will I gain deeper insight into customer transactions and interactions across the organization? Can we use existing resources and infrastructure?
The answer is yes. Big Data presents the opportunity to better analyze everything from customer shopping behaviors at each stage of the purchase journey, to inventory planning, to delivering relevant and personalized offers. Analyzing how shoppers found your products, how long they spent browsing product pages and which products they added to their basket provides greater insight into the decision process they went through before purchase, and helps retailers quickly identify cross-sell and up-sell opportunities in real time. In addition, combining transaction data with what your customers are saying on social channels (ratings, likes, dislikes, what’s trending, etc.) can feed into the decisions you make on placing the right product in the right store at the right price, and ultimately deliver highly personalized and contextual offers to customers.
Data-Driven Decisions: Getting Value from Big Data
Turning Big Data into actionable insight is not just about dumping data into a “Data Lake”, pointing an analytics tool at it and saying job done! Retailers need to take a number of steps to profit from Big Data and Analytics.
- Firstly, you need to gather data from all available sources in batch or real time, from internal and external systems, and from an ever-increasing number of devices (beacons, mobile devices). Once you have gathered the data, it needs to be connected, validated and cleansed, and a governance process put in place, before integrating with analytic tools and systems.
- Secondly, put clean and trusted data in the hands of data scientists who can distill the relevant from the irrelevant and formulate commercial insights that the business can act on and profit from.
- Lastly, plan and organize for success. IT and business need to align behind the same agenda, regularly reviewing business priorities and adjusting as needed. Maximize existing scarce IT resources by leveraging existing technologies and Cloud platforms, and by forming alliances with third-party vendors to fill skills gaps. Secure quick wins for your Big Data initiatives; maybe start with integrating historical transaction data with real-time purchase data to make personalized offers at the point of sale. Look outside your organization to other industries like retail banking or telecommunications and learn from their successes and failures.
With the right approach, Big Data will deliver the return on investment for retailers.
If you’ve wondered why so many companies are eager to control data storage, the answer can be summed up in a simple term: data gravity. Ultimately, where data is determines where the money is. Services and applications are nothing without it.
Dave McCrory introduced his idea of Data Gravity with a blog post back in 2010. The core idea was – and is – interesting. More recently, Data Gravity featured in this year’s EMC World keynote. But, beyond the observation that large or valuable agglomerations of data exert a pull that tends to see them grow in size or value, what is a recognition of Data Gravity actually good for?
As a concept, Data Gravity seems closely associated with current enthusiasm for Big Data. In addition, like Big Data, the term’s real-world connotations can be unhelpful almost as often as they are helpful. Big Data exhibits at least three characteristics, which are Volume, Velocity, and Variety. Various other V’s, including Value, are mentioned from time to time, but with less consistency. Yet, Big Data’s name says it’s all about size. The speed with which data must be ingested, processed, or excreted is less important. The complexity and diversity of the data doesn’t matter either.
On its own, the size of a data set is unimportant. Coping with lots of data certainly raises some not-insignificant technical challenges, but the community is actually doing a good job of coming up with technically impressive solutions. The interesting aspect of a huge data set isn’t its size, but the very different modes of working that become possible when you begin to unpick the complex interrelationships between data elements.
Sometimes, Big Data is the vehicle by which enough data is gathered about enough aspects of enough things from enough places for those interrelationships to become observable against the background noise. Other times, Big Data is the background noise, and any hope of insight is drowned beneath the unending stream of petabytes.
To a degree, Data Gravity falls into the same trap. More gravity must be good, right? And more mass leads to more gravity. Mass must be connected to volume, in some vague way that was explained when I was 11, and which involves STP. Therefore, bigger data sets have more gravity. This means that bigger data sets are better data sets. That assertion is clearly nonsense, but luckily, it’s not actually what McCrory is suggesting. His arguments are more nuanced than that, and potentially far more useful.
Instinctively, I like that the equation attempts to move attention away from ‘the application’ toward the pools of data that support many, many applications at once. The data is where the potential lies. Applications are merely the means to unlock that potential in various ways. So maybe notions of Potential Energy from elsewhere in Physics need to figure here.
But I’m wary of the emphasis given to real numbers that are simply the underlying technology’s vital statistics: network latency, bandwidth, request sizes, numbers of requests, and the rest. I realize that these are the measurable things that we have, but feel that more abstract notions of value need to figure just as prominently.
So I’m left reaffirming my original impression that Data Gravity is “interesting”. It’s also intriguing, and I keep feeling that it should be insightful. I’m just not — yet — sure exactly how. Is a resource with a Data Gravity of 6 twice as good as a resource with a Data Gravity of 3? Does a data set with a Data Gravity of 15 require three times as much investment/infrastructure/love as a data set scoring a humble 5? It’s unlikely to be that simple, but I do look forward to seeing what happens as McCrory begins to work with the parts of our industry that can lend empirical credibility to his initial dabbling in mathematics.
If real numbers show the equations to stand up, all we then need to do is work out what the numbers mean. Should an awareness of Data Gravity change our behavior, should it validate what gut feel led us to do already, or is it just another ‘interesting’ and ultimately self-evident number that doesn’t take us anywhere?
I don’t know, but I will continue to explore. You can contact me on twitter @bigdatabeat