Category Archives: Data Transformation
Over and over, when talking with people who are starting to learn Data Science, there’s a frustration that comes up: “I don’t know which programming language to start with.”
Moreover, it’s not just programming languages; it’s also software systems like Tableau, SPSS, etc. There is an ever-widening range of tools and programming languages and it’s difficult to know which one to select.
I get it. When I started focusing heavily on data science a few years ago, I reviewed all of the popular programming languages at the time: Python, R, SAS, D3, not to mention a few that, in hindsight, really aren’t that great for analytics, like Perl, Bash, and Java. I once read a suggestion to use arcane tools like UNIX’s AWK and SED.
There are so many suggestions, so much material, so many options; it becomes difficult to know what to learn first. There’s a mountain of content, and it’s difficult to know where to find the “gold nuggets”: the things to learn that will bring you the highest return on your time investment.
That’s the crux of the problem. The fact is – time is limited. Learning a new programming language is a large investment in your time, so you need to be strategic about which one you select. To be clear, some languages will yield a very high return on your investment. Other languages are purely auxiliary tools that you might use only a few times per year.
Let me make this easy for you: learn R first. Here’s why:
R is becoming the “lingua franca” of data science
R is becoming the lingua franca for data science. That’s not to say that it’s the only language, or that it’s the best tool for every job. It is, however, the most widely used and it is rising in popularity.
As I’ve noted before, O’Reilly Media conducted a survey in 2014 to understand the tools that data scientists are currently using. They found that R is the most popular programming language (if you exclude SQL as a “proper” programming language).
Looking more broadly, there are other rankings that look at programming language popularity in general. For example, Redmonk measures programming language popularity by examining discussion (on Stack Overflow) and usage (on GitHub). In their latest rankings, R placed 13th, the highest of any statistical programming language. Redmonk also noted that R has been rising in popularity over time.
A similar ranking by TIOBE, which ranks programming languages by the number of search engine searches, indicates a strong year-over-year rise for R.
Keep in mind that the Redmonk and TIOBE rankings are for all programming languages. When you look at these, R is now ranking among the most popular and most commonly used overall.
It’s often said that 80% of the work in data science is data manipulation. More often than not, you’ll need to spend significant amounts of your time “wrangling” your data; putting it into the shape you want. R has some of the best data management tools you’ll find.
The dplyr package in R makes data manipulation easy. It is the tool I wish I had years ago. When you “chain” the basic dplyr verbs together, you can dramatically simplify your data manipulation workflow.
ggplot2 is one of the best data visualization tools around, as of 2015. What’s great about ggplot2 is that as you learn the syntax, you also learn how to think about data visualization.
I’ve said numerous times that there is a deep structure to all statistical visualizations. There is a highly structured framework for thinking about and creating all data visualizations. ggplot2 is based on that framework. By learning ggplot2, you will learn how to think about visualizing data.
Finally, there’s machine learning. While I think most beginning data science students should wait to learn machine learning (it is much more important to learn data exploration first), machine learning is an important skill. When data exploration stops yielding insight, you need stronger tools.
When you’re ready to start using (and learning) machine learning, R has some of the best tools and resources.
One of the best, most referenced introductory texts on machine learning, An Introduction to Statistical Learning, teaches machine learning using the R programming language. Additionally, the Stanford Statistical Learning course uses this textbook, and teaches machine learning in R.
Summary: Learn R, and focus your efforts
Once you start to learn R, don’t get “shiny new object” syndrome.
You’re likely to see demonstrations of new techniques and tools. Just look at some of the dazzling data visualizations that people are creating.
Seeing other people create great work (and finding out that they’re using a different tool) might lead you to try something else. Trust me on this: you need to focus. Don’t get “shiny new object” syndrome. You need to be able to devote a few months (or longer) to really diving into one tool.
And as I noted above, you really want to build up your competence in skills across the data science workflow. You need to have solid skills at least in data visualization and data manipulation. You need to be able to do some serious data exploration in R before you start moving on.
Spending 100 hours on R will yield vastly better returns than spending 10 hours on 10 different tools. In the end, your time ROI will be higher by concentrating your efforts. Don’t get distracted by the “latest, sexy new thing.”
The thing that resonates today, in the odd context of big data, is that we may all need to look in the mirror, hold a thumb drive full of information in our hands, and concede once and for all: It’s not the data… it’s us.
Many organizations have a hard time making something useful from the ever-expanding universe of big-data, but the problem doesn’t lie with the data: It’s a people problem.
The contention is that big-data is falling short of the hype because people are:
- too unwilling to create cultures that value standardized, efficient, and repeatable information, and
- too complex to be reduced to “thin data” created from digital traces.
Evan Stubbs describes poor data quality as the data analyst’s single greatest problem.
About the only satisfying thing about having bad data is the schadenfreude that goes along with it. There’s cold solace in knowing that regardless of how poor your data is, everyone else’s is equally as bad. The thing is poor quality data doesn’t just appear from the ether. It’s created. Leave the dirty dishes for long enough and you’ll end up with cockroaches and cholera. Ignore data quality and eventually you’ll have black holes of untrustworthy information. Here’s the hard truth: we’re the reason bad data exists.
I will tell you that most data teams make “large efforts” to scrub their data. Those “infrequent” big cleanups, however, only treat the symptom, not the cause, and ultimately lead to inefficiency, cost, and even more frustration.
It’s intuitive and natural to think that data quality is a technological problem. It’s not; it’s a cultural problem. The real answer is that you need to create a culture that values standardized, efficient, and repeatable information.
If you do that, then you’ll be able to create data that is re-usable, efficient, and high quality. Rather than trying to manage a shanty of half-baked source tables, effective teams put the effort into designing, maintaining, and documenting their data. Instead of being a one-off activity, it becomes part of business as usual, something that’s simply part of daily life.
However, even if that data is the best it can possibly be, is it even capable of delivering on the big-data promise of greater insights about things like the habits, needs, and desires of customers?
Despite the enormous growth of data and the success of a few companies like Amazon and Netflix, “the reality is that deeper insights for most organizations remain elusive,” write Mikkel Rasmussen and Christian Madsbjerg in a Bloomberg Businessweek blog post that argues “big-data gets people wrong.”
Big-data delivers thin data. In the social sciences, we distinguish between two types of human behavior data. The first – thin data – is from digital traces: He wears a size 8, has blue eyes, and drinks pinot noir. The second – rich data – delivers an understanding of how people actually experience the world: He could smell the grass after the rain, he looked at her in that special way, and the new running shoes made him look faster. Big-data focuses solely on correlation, paying no attention to causality. What good is thin “information” when there is no insight into what your consumers actually think and feel?
Accenture reported only 20 percent of the companies it profiled had found a proven causal link between “what they measure and the outcomes they are intending to drive.”
Now, I contend that the keys to transforming big-data into strategic value are critical thinking skills.
Where do we get such skills? People, it seems, are both the problem and the solution. Are we failing on two fronts: failing to create the right data-driven cultures, and failing to interpret the data we collect?
Not so long ago, Google created a website to estimate how many people had influenza. It did this by tracking “flu-related search queries” and the “location of the query,” and feeding both into an estimation algorithm. According to the website, at the flu season’s peak in January, nearly 11 percent of the United States population may have had influenza, which would mean that nearly 44 million of us had the flu or flu-like symptoms. In its weekly report, the Centers for Disease Control and Prevention put the figure at 5.6 percent, which means that fewer than 23 million of us actually went to the doctor’s office to be tested for flu or to get a flu shot.
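As a quick arithmetic check on those figures, here is a minimal Python sketch. It uses only the four numbers quoted above; the variable and function names are mine, and the population base it derives is only what the quoted figures imply, not a number stated by either source.

```python
# Figures quoted in the paragraph above; nothing else is assumed.
google_share = 0.11     # Google's peak estimate of the population with flu
google_people = 44e6    # headcount the text attaches to that share
cdc_share = 0.056       # share reported in the CDC weekly report
cdc_people = 23e6       # headcount the text attaches to that share


def implied_base(people, share):
    """Population base implied by a headcount and its percentage."""
    return people / share


# How far apart are the two estimates?
overestimate_ratio = google_share / cdc_share

# Which population base does each pair of figures imply?
google_base = implied_base(google_people, google_share)
cdc_base = implied_base(cdc_people, cdc_share)

print(f"Google's estimate is {overestimate_ratio:.2f}x the CDC figure")
print(f"Implied base (Google figures): {google_base / 1e6:.0f} million")
print(f"Implied base (CDC figures):    {cdc_base / 1e6:.0f} million")
```

The ratio is the number that matters for the argument: the search-based estimate comes out at roughly double the figure based on doctor visits.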
Now, imagine if I were a drug manufacturer. There is a theory about what went wrong. The problems may be due to widespread media coverage of this year’s flu season. Then add social media, which helped news of the flu spread quicker than the virus itself. In other words, the algorithm is looking only at the numbers, not at the context of the search results.
In today’s digitally connected world, data is everywhere: in our phones, search queries, friendships, dating profiles, cars, food, and reading habits. Almost everything we touch is part of a larger data set. The people and companies that interpret the data may fail to apply background and outside conditions to the numbers they capture.
Now, while we build our big data repositories, we have to spend some time explaining how we collected the data and in what context.
Every two years, the typical company doubles the amount of data it stores. However, this data is inherently “dumb.” Acquiring more of it only seems to compound its lack of intellect.
When revitalizing your business, I won’t ask to look at your data – not even a little bit. Instead, we look at the process of how you use the data. What I want to know is this:
How much of your day-to-day operations are driven by your data?
The Case for Smart Data
I recently learned that 7-Eleven Japan has pushed decision-making down to the store level – in fact, to the level of clerks. Store clerks decide what goes on the shelves in their individual 7-Eleven stores. These clerks push incredible inventory turns. Some 70% of the products on the shelves are new to stores each year. As a result, this chain has been the most profitable Japanese retailer for 30 years running.
Instead of just reading the data and making wild guesses about why something works and why something doesn’t, these clerks acquired the skill of looking at the quantitative and the qualitative and connecting the dots. Data told them what people were talking about, how it related to their product, and how much weight it carried. You can achieve this as well. To do so, you must introduce a culture that emphasizes discipline around processes. A disciplined process culture uses:
- A template approach to data with common processes, reuse of components, and a single face presented to customers
- Employees who consistently follow standard procedures
If you cannot develop such company-wide consistency, you will not gain the benefits of ERP or CRM systems.
Make data available to the masses. Like at 7-Eleven Japan, don’t centralize the data decision-making process. Instead, push it out to the ranks. By putting these cultures and practices into play, businesses can use data to run smarter.
That second question is a killer because most people — no matter if they’re in marketing, sales or manufacturing — rely on incomplete, inaccurate or just plain wrong information. Regardless of industry, we’ve been fixated on historic transactions because that’s what our systems are designed to provide us.
“Moneyball: The Art of Winning an Unfair Game” gives a great example of what I mean. The book (not the movie) describes Billy Beane hiring MBAs to map out the factors that would win a baseball game. They discovered something completely unexpected: that getting more batters on base would tire out pitchers. It didn’t matter if batters had multi-base hits, and it didn’t even matter if they walked. What mattered was forcing pitchers to throw ball after ball as they faced an unrelenting string of batters. Beane stopped looking at RBIs, ERAs and even home runs, and started hiring batters who consistently reached first base. To me, the book illustrates that the most useful knowledge isn’t always what we’ve been programmed to depend on or what is delivered to us via one app or another.
For years, people across industries have turned to ERP, CRM and web analytics systems to forecast sales and acquire new customers. By their nature, such systems are transactional, forcing us to rely on history as the best predictor of the future. Sure it might be helpful for retailers to identify last year’s biggest customers, but that doesn’t tell them whose blogs, posts or Tweets influenced additional sales. Isn’t it time for all businesses, regardless of industry, to adopt a different point of view — one that we at Informatica call “Data-First”? Instead of relying solely on transactions, a data-first POV shines a light on interactions. It’s like having a high knowledge IQ about relationships and connections that matter.
A data-first POV changes everything. With it, companies can unleash the killer app, the killer sales organization and the killer marketing campaign. Imagine, for example, if a sales person meeting a new customer knew that person’s concerns, interests and business connections ahead of time? Couldn’t that knowledge — gleaned from Tweets, blogs, LinkedIn connections, online posts and transactional data — provide a window into the problems the prospect wants to solve?
That’s the premise of two startups I know about, and it illustrates how a data-first POV can fuel innovation for developers and their customers. Today, we’re awash in data-fueled things that are somehow attached to the Internet. Our cars, phones, thermostats and even our wristbands are generating and gleaning data in new and exciting ways. That’s knowledge begging to be put to good use. The winners will be the ones who figure out that knowledge truly is power, and wield that power to their advantage.
Configuring your Oracle environment for using PowerExchange CDC can be challenging, but there are some best practices you can follow that will greatly simplify the process. There are two major factors to consider when approaching this: latency requirements for your data and the ability to restart your environment.
Data Latency Requirements
The first factor that will affect the latency of your data is the location of your PowerExchange CDC installation. From a best practice perspective, it is optimal to install the PowerExchange Listener on the source database server, as this eliminates the need to pass data across the network and will provide the smallest amount of latency from source to target.
The volume of data that PowerExchange CDC has to process can also have a significant impact on performance. There are several items in addition to the changed data that can affect performance, including, but not limited to, Oracle catalog dumps, Oracle workload monitor customizations, and other non-Oracle tools that use the redo logs. You should conduct a review of all the processes that access Oracle redo logs, and make an effort to minimize them in terms of both volume and frequency. For example, you could monitor the redo log switches and the creation of archived log files to see how busy the source database is. Knowing the size of your production archive logs and how often they are being created will provide the information necessary to properly configure PowerExchange CDC.
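To make the log-switch monitoring idea concrete, here is a minimal Python sketch that buckets switch times by hour and turns them into a rough redo volume. The timestamps and the 512 MB log size are made up for illustration, and the function names are mine. In practice the switch times might come from Oracle's V$LOG_HISTORY view (its FIRST_TIME column), which is an assumption about your setup rather than anything PowerExchange requires.

```python
from collections import Counter
from datetime import datetime

# Hypothetical log-switch timestamps (illustrative values only).
switch_times = [
    "2014-03-01 09:05", "2014-03-01 09:40", "2014-03-01 09:55",
    "2014-03-01 10:20", "2014-03-01 13:10", "2014-03-01 13:15",
    "2014-03-01 13:45", "2014-03-01 13:58",
]


def switches_per_hour(timestamps):
    """Count redo log switches in each clock hour."""
    hours = Counter()
    for ts in timestamps:
        t = datetime.strptime(ts, "%Y-%m-%d %H:%M")
        hours[t.replace(minute=0)] += 1
    return dict(sorted(hours.items()))


def estimated_redo_mb(per_hour, log_size_mb):
    """Rough upper bound on redo volume per hour: switches x log size."""
    return {hour: count * log_size_mb for hour, count in per_hour.items()}


per_hour = switches_per_hour(switch_times)
busiest = max(per_hour, key=per_hour.get)
volume = estimated_redo_mb(per_hour, log_size_mb=512)  # assumed log size

for hour, count in per_hour.items():
    print(f"{hour:%Y-%m-%d %H:00}  switches={count}  ~{volume[hour]} MB")
print(f"Busiest hour starts at {busiest:%H:00}")
```

A switch can happen before a log is full, so the volume figure is best read as an upper bound. The busiest hours are where latency and restart time deserve the closest look.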
Environment Restart Ability
When certain changes are made to the source database environment, the PowerExchange CDC process will need to be stopped and restarted. The amount of time this restart takes should be considered whenever this needs to occur. PowerExchange CDC must be restarted when any of the following changes occur:
– A change is made to the schema or a table that is part of the CDC process
– An existing Capture Registration is changed
– A change is made to the PowerExchange configuration files
– An Oracle patch is applied
– An Operating System patch or upgrade is applied
– A PowerExchange version upgrade or service pack is applied
If you are using CDC with LogMiner, a copy of the Oracle catalog must be placed on the archive log for CDC to function properly. The frequency of these copies is site-specific and will have an impact on the amount of time it will take to restart the CDC process.
Once your PowerExchange CDC process is in production, any changes to the environment must have extensive impact analysis performed to ensure the integrity of the data and the transactions remains intact upon restart. Understanding the configurable parameters in the PowerExchange configuration files that will assist restart performance is of the utmost importance.
Even with the challenges presented when configuring PowerExchange CDC for Oracle, there are trusted and proven methods that can significantly improve your ability to complete this process and have real-time or near-real-time access to your data. At SSG, we’re committed to always utilizing best practice methodology with our PowerExchange Baseline Deployments. In addition, we provide in-depth knowledge transfer to set end users up with a solid foundation for optimizing PowerExchange functionality.
Visit the Informatica Marketplace to learn more about SSG’s Baseline Deployment offerings.
Ah yes, the Old Mainframe. It just won’t go away. Which means there is still valuable data sitting in it. And that leads to a question that I have been asked about repeatedly in the past few weeks, about why an organization should use a tool like Informatica PowerExchange to extract data from a mainframe when you can also do it with a script that extracts the data as a flat file.
So below, thanks to Phil Line, Informatica’s Product Manager for Mainframe connectivity, are the top ten reasons to use PowerExchange over hand coding a flat file extraction.
1) Data will be “fresh” as of the time the data is needed – not already old based on when the extraction was run.
2) Any data extracted directly from files will be as the file held it; any additional processes that need to run in order to extract or transfer data to LUW could potentially alter the original formats.
3) The consuming application can get the data when it needs it; there wouldn’t be any scheduling issues between creating the extract file and then being able to use it.
4) There is less work to do if PowerExchange reads the data directly from the mainframe: data type processing and potential code page issues are all handled by PowerExchange.
5) Unlike any files created with ftp type processes, where problems could cut short the expected data transfer, PowerExchange/PowerCenter provide log messages so as to ensure that all data has been processed.
6) The consumer can select only the data that the consuming application needs; filtering can reduce the amount of data being transferred as well as any potential security exposure.
7) Any access to mainframe-based data can be secured according to the security tools in place on the mainframe; PowerExchange is fully compliant with the RACF, ACF2 & Top-Secret security products.
8) Using Informatica’s PowerExchange, along with Informatica consuming tools (PowerCenter, Mercury etc.) provides a much simpler and cleaner architecture. The simpler the architecture the easier it is to find problems as well as audit the processes that are touching the data.
9) PowerExchange can generally help avoid the normal bottlenecks associated with getting data off the mainframe: programmers are not needed to create the extract processes, new schedules don’t need to be created to ensure that the extracts run, and any necessary changes can be controlled by the business group consuming the data.
10) It helps control mainframe data extraction processes that are still being run even though no one uses the generated data, because the system that originally requested the data has become obsolete.
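To illustrate reason 4, here is a minimal Python sketch of the data type and code page work that a hand-coded flat file extraction has to do for itself. The record layout (a 5-byte EBCDIC name followed by a 3-byte packed-decimal amount with two implied decimal places) is hypothetical, code page cp037 is assumed, and the function names are mine. It shows the problem a script has to solve and says nothing about how PowerExchange works internally.

```python
from decimal import Decimal


def unpack_comp3(raw: bytes, scale: int = 0) -> Decimal:
    """Decode an IBM packed-decimal (COMP-3) field.

    Each byte carries two 4-bit digits, except the last byte, whose
    low nibble is the sign (0xD negative; 0xC or 0xF positive).
    """
    nibbles = []
    for byte in raw:
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)
    sign = nibbles.pop()
    value = 0
    for digit in nibbles:
        value = value * 10 + digit
    if sign == 0x0D:
        value = -value
    return Decimal(value).scaleb(-scale)


def parse_record(record: bytes) -> dict:
    """Parse one record of the hypothetical 8-byte layout."""
    name = record[0:5].decode("cp037").rstrip()   # EBCDIC text -> str
    amount = unpack_comp3(record[5:8], scale=2)   # packed decimal -> Decimal
    return {"name": name, "amount": amount}


# "HELLO" in EBCDIC (cp037), then +123.45 as packed decimal.
sample = bytes([0xC8, 0xC5, 0xD3, 0xD3, 0xD6, 0x12, 0x34, 0x5C])
parsed = parse_record(sample)
print(parsed)
```

A plain ASCII-mode file transfer would translate the packed-decimal bytes as if they were text and corrupt them, which is the kind of format damage reason 2 warns about.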
This creative thinking to solve a problem came from a request to build a soldier knife from the Swiss Army. In the end, the solution was all about getting the right tool for the right job in the right place. In many cases soldiers didn’t need industrial-strength tools; all they really needed was a compact and lightweight tool to get the job at hand done quickly.
Putting this into perspective with today’s world of Data Integration, using enterprise-class data integration tools for the smaller data integration project is overkill and typically out of reach for the smaller organization. However, these smaller data integration projects are just as important as those larger enterprise projects, and they are often the innovation behind a new way of business thinking. The traditional hand-coding approach to addressing the smaller data integration project is not scalable, not repeatable, and prone to human error. What’s needed is a compact, flexible and powerful off-the-shelf tool.
Thankfully, over a century after the world embraced the Swiss Army Knife, someone at Informatica was paying attention to revolutionary ideas. If you’ve not yet heard the news about the Informatica platform, a version called PowerCenter Express has been released, and it is free of charge. You can use it to handle an assortment of what I’d characterize as high complexity / low volume data integration challenges and experience a subset of the Informatica platform for yourself. I’d emphasize that PowerCenter Express doesn’t replace the need for Informatica’s enterprise-grade products, but it is ideal for rapid prototyping, profiling data, and developing quick proofs of concept.
PowerCenter Express provides a glimpse of the evolving Informatica platform by integrating four Informatica products into a single, compact tool. There are no database dependencies and the product installs in just under 10 minutes. Much to my own surprise, I use PowerCenter Express quite often going about the various aspects of my job with Informatica. I have it installed on my laptop so it travels with me wherever I go. It starts up quickly so it’s ideal for getting a little work done on an airplane.
For example, recently I wanted to explore building some rules for an upcoming proof of concept on a plane ride home so I could claw back some personal time for my weekend. I used PowerCenter Express to profile some data and create a mapping. And this mapping wasn’t something I needed to throw away and recreate in an enterprise version after my flight landed. Vibe, Informatica’s build once / run anywhere metadata-driven architecture, allows me to export a mapping I create in PowerCenter Express to one of the enterprise versions of Informatica’s products such as PowerCenter, DataQuality or Informatica Cloud.
As I alluded to earlier in this article, because it is a free offering, I honestly didn’t expect too much from PowerCenter Express when I first started exploring it. However, due to my own positive experiences, I now like to think of PowerCenter Express as the Swiss Army Knife of Data Integration.
To start claiming back some of your personal time, get started with the free version of PowerCenter Express, found on the Informatica Marketplace at: https://community.informatica.com/solutions/pcexpress
By now, the business benefits of effectively leveraging big data have become well known. Enhanced analytical capabilities, greater understanding of customers, and the ability to predict trends before they happen are just some of the advantages. But big data doesn’t just appear and present itself. It needs to be made tangible to the business. All too often, executives are intimidated by the concept of big data, thinking the only way to work with it is to have an advanced degree in statistics.
There are ways to make big data more than an abstract concept that can only be loved by data scientists. Four of these ways were recently covered in a report by David Stodder, director of business intelligence research for TDWI, as part of TDWI’s special report on What Works in Big Data.
The time is ripe for experimentation with real-time, interactive analytics technologies, Stodder says. The next major step in the movement toward big data is enabling real-time or near-real-time delivery of information. Real-time delivery has been a challenge with BI data for years, with limited success, Stodder says. The good news is that the Hadoop framework, originally built for batch processing, now includes interactive querying and streaming applications, he reports. This opens the way for real-time processing of big data.
Design for self-service
Interest in self-service access to analytical data continues to grow. “Increasing users’ self-reliance and reducing their dependence on IT are broadly shared goals,” Stodder says. “Nontechnical users—those not well versed in writing queries or navigating data schemas—are requesting to do more on their own.” There is an impressive array of self-service tools and platforms now appearing on the market. “Many tools automate steps for underlying data access and integration, enabling users to do more source selection and transformation on their own, including for data from Hadoop files,” he says. “In addition, new tools are hitting the market that put greater emphasis on exploratory analytics over traditional BI reporting; these are aimed at the needs of users who want to access raw big data files, perform ad-hoc requests routinely, and invoke transformations after data extraction and loading (that is, ELT) rather than before.”
Nothing gets a point across faster than having data points visually displayed – decision-makers can draw inferences within seconds. “Data visualization has been an important component of BI and analytics for a long time, but it takes on added significance in the era of big data,” Stodder says. “As expressions of meaning, visualizations are becoming a critical way for users to collaborate on data; users can share visualizations linked to text annotations as well as other types of content, such as pictures, audio files, and maps to put together comprehensive, shared views.”
Unify views of data
Users are working with many different data types these days, and are looking to bring this information into a single view – “rather than having to move from one interface to another to view data in disparate silos,” says Stodder. Unstructured data – graphics and video files – can also provide a fuller context to reports, he adds.
The interesting thing is that many of the upstarts do not even intend to take on the market leader in the segment. Christensen cites the classic example of Digital Equipment Corporation in the 1980s, which was unable to make the transition from large, expensive enterprise systems to smaller, PC-based equipment. The PC upstarts in this case did not take on Digital directly – rather they addressed unmet needs in another part of the market.
Christensen wrote and published The Innovator’s Dilemma more than 17 years ago, but his message keeps reverberating across the business world. Jill Lepore questioned some of the thinking that has evolved around disruptive innovation in a recent New Yorker article. “Disruptive innovation is a theory about why businesses fail. It’s not more than that. It doesn’t explain change. It’s not a law of nature,” she writes. Christensen responded with a rebuttal to Lepore’s thesis, noting that “disruption doesn’t happen overnight,” and that “[Disruptive innovation] is not a theory about survivability.”
There is something Lepore points out that both she and Christensen can agree on: “disruption” is being oversold and misinterpreted on a wide scale these days. Every new product that rolls out is now branded as “disruptive.” As stated above, the true essence of disruption is creating new markets where the leaders would not tread.
Data itself can potentially be a source of disruption, as data analytics and information emerge as strategic business assets. While the ability to provide data analysis at real-time speeds, or to make new insights possible, isn’t disruption in the Christensen sense, we are seeing the rise of new business models built around data and information that could bring new leaders to the forefront. Data analytics can either play a role in supporting this movement, or data itself may be the new product or service disrupting existing markets.
We’ve already been seeing this disruption taking place within the publishing industry, for example – companies or sites providing real-time or near real-time services such as financial updates, weather forecasts and classified advertising have displaced traditional newspapers and other media as information sources.
Employing data analytics as a tool for insights never before available within an industry sector also may be part of disruptive innovation. Tesla Motors, for example, is disruptive to the automotive industry because it manufactures entirely electric cars. But the formula for its success is its use of massive amounts of data from its array of in-vehicle devices to assure quality and efficiency.
Likewise, data-driven disruption may be occurring in places that may have been difficult to innovate. For example, it’s long been speculated that some of the digital giants, particularly Google, are poised to enter the long-staid insurance industry. If this were to happen, Google would not enter as a typical insurance company with a new web-based spin. Rather, the company would be employing new techniques of data gathering, insight and analysis to offer an entirely new model to consumers – one based on data. As Christopher Hernaes recently related in TechCrunch, Google’s ability to collect and mine data on homes, businesses and autos gives it a unique value proposition in the industry’s value chain.
We’re in an era in which Christensen’s mode of disruptive innovation has become a way of life. Increasingly, it appears that enterprises that are adept at recognizing and acting upon the strategic potential of data may be joining the ranks of the disruptors.