Category Archives: Data Integration
When it comes to cloud-based data analytics, a recent study by Ventana Research (as found in Loraine Lawson’s recent blog post) provides a few interesting data points. The study reveals that 40 percent of respondents cited lowered costs as a top benefit; improved efficiency was a close second at 39 percent; and better communication and knowledge sharing also ranked highly, at 34 percent.
Ventana Research also found that organizations cite a unique and more complex reason to avoid cloud analytics and BI. Legacy integration work can be a major hindrance, particularly when BI tools are already integrated with other applications. In other words, it’s the same old story:
The ability to deal with existing legacy systems when moving to concepts such as big data or cloud-based analytics is critical to the success of any enterprise data analytics strategy. However, most enterprises don’t focus on data integration as much as they should, and hope that they can solve the problems using ad-hoc approaches.
You can’t make sense of data that you can’t see.
These approaches rarely work as well as they should, if at all. Thus, any investment made in data analytics technology is often diminished because the BI tools or applications that leverage analytics can’t see all of the relevant data. As a result, the available data tells only part of the story, those who leverage data analytics stop trusting the information, and that means failure.
What’s frustrating to me about this issue is that the problem is easily solved. Those in the enterprise charged with standing up data analytics should put a plan in place to integrate new and legacy systems. As part of that plan, there should be a common understanding of core business concepts and entities, such as a customer, a sale, or an inventory item, and all of the data related to these concepts and entities must be visible to the data analytics engines and tools. This requires both a data integration strategy and the technology to support it.
As enterprises embark on a new day of more advanced and valuable data analytics technology, largely built upon the cloud and big data, the data integration strategy should be systemic. This means mapping a path for the data from the source legacy systems to the views that the data analytics systems should include. What’s more, this data should be available in real operational time, because data analytics loses value as the data becomes older and out-of-date. We operate in a real-time world now.
So, the work ahead requires planning to occur at both the conceptual and physical levels to define how data analytics will work for your enterprise. This includes what you need to see, when you need to see it, and then mapping a path for the data back to the business-critical and, typically, legacy systems. Data integration should be first and foremost when planning the strategy, technology, and deployments.
Amazon Redshift, one of the fast-rising stars in the AWS ecosystem, has taken the data warehousing world by storm ever since it was introduced almost two years ago. Amazon Redshift operates completely in the cloud and allows you to provision nodes on demand. This model allows you to overcome many of the pains associated with traditional data warehousing techniques, such as provisioning extra server hardware, sizing and preparing databases for loading, or extensive SQL scripting.
However, when loading data into Redshift, you may find it challenging to do so in a timely manner. You may have to spend a tremendous amount of time writing and optimizing SQL to speed up those loads, which undercuts the value proposition of using Redshift in the first place.
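For context on the hand-coding that a tool like Informatica Cloud replaces: the usual do-it-yourself fast path is Redshift’s COPY command, which bulk-loads flat files from S3 in parallel across the cluster’s nodes. Below is a minimal sketch of assembling such a statement; the table, bucket, and credential values are hypothetical placeholders, not real resources.

```python
# Sketch: building a Redshift COPY statement, the typical hand-coded way to
# bulk-load pipe-delimited, gzipped files from S3. All names and credentials
# below are hypothetical placeholders.

def build_copy_statement(table, s3_path, access_key, secret_key):
    """Return a COPY statement that bulk-loads S3 files into a Redshift table."""
    return (
        f"COPY {table} FROM '{s3_path}' "
        f"CREDENTIALS 'aws_access_key_id={access_key};aws_secret_access_key={secret_key}' "
        "DELIMITER '|' GZIP;"
    )

sql = build_copy_statement(
    table="public.orders",
    s3_path="s3://example-bucket/orders/",
    access_key="AKIAEXAMPLEKEY",
    secret_key="exampleSecretKey",
)
print(sql)
```

COPY distributes the load across the cluster, which is why it is dramatically faster than row-by-row INSERT statements; the catch, as noted above, is that someone has to write and tune this SQL by hand for every source.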
Informatica Cloud helps you load this data quickly into Redshift in just a few minutes. To start using Informatica Cloud, you’ll need to establish connections from Redshift and your other data source first. Here are a few easy steps to help you get started with establishing connections from a relational database such as MySQL as well as Redshift into Informatica Cloud:
- Log in to your Informatica Cloud account, go to Configure -> Connections, click “New”, and select “MySQL” for “Type”
- Select your Secure Agent and fill in the rest of the database details:
- Test your connection and then click ‘OK’ to save and exit
- Now, log in to your AWS account and go to the Redshift service page
- Go to your cluster configuration page and make a note of the cluster and cluster database properties: Number of Nodes, Endpoint, Port, Database Name, JDBC URL. You also will need:
- The Redshift database user name and password (which are different from your AWS account credentials)
- AWS account Access Key
- AWS account Secret Key
- Exit the AWS console.
- Now, back in your Informatica Cloud account, go to Configure -> Connections and click “New”.
- Select “AWS Redshift (Informatica)” for “Type” and fill in the rest of the details from the information you have from above
- Test the connection and then click ‘OK’ to save and exit
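As a cross-check on the cluster configuration step above: most of the fields the Redshift connection form asks for (host, port, database name) can be read straight off the JDBC URL shown on the cluster page. Here is a small parsing sketch, using a hypothetical cluster endpoint rather than a real one.

```python
# Sketch: pull the host, port, and database fields that the connection form
# needs out of a Redshift JDBC URL. The endpoint below is hypothetical.

def parse_redshift_jdbc_url(jdbc_url):
    """Split jdbc:redshift://<endpoint>:<port>/<database> into its parts."""
    prefix = "jdbc:redshift://"
    if not jdbc_url.startswith(prefix):
        raise ValueError("not a Redshift JDBC URL")
    hostport, _, database = jdbc_url[len(prefix):].partition("/")
    host, _, port = hostport.rpartition(":")
    return {"host": host, "port": int(port), "database": database}

url = "jdbc:redshift://examplecluster.abc123xyz789.us-east-1.redshift.amazonaws.com:5439/dev"
print(parse_redshift_jdbc_url(url))
```

This is just a convenience for transcribing the values; the user name, password, and AWS keys still come from the other items in the list above.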
As you can see, establishing connections is extremely easy and can be done in less than five minutes. To learn how customers such as UBM used Informatica Cloud to deliver next-generation customer insights with Amazon Redshift, please join us on September 16 for a webinar where we’ll have product experts from Amazon and UBM explaining how your company can benefit from cloud data warehousing for petabyte-scale analytics using Amazon Redshift.
Come and get it. For developers hungry to get their hands on Informatica on Hadoop, a downloadable free trial of Informatica Big Data Edition was launched today on the Informatica Marketplace. See for yourself the power of the killer app on Hadoop from the leader in data integration and quality.
Thanks to the generous help of our partners, the Informatica Big Data team has preinstalled the Big Data Edition inside the sandbox VMs of the two leading Hadoop distributions. This empowers Hadoop and Informatica developers to easily try the codeless, GUI driven Big Data Edition to build and execute ETL and data integration pipelines natively on Hadoop for Big Data analytics.
Informatica Big Data Edition is the most complete and powerful suite for Hadoop data pipelines and can increase productivity up to 5 times. Developers can leverage hundreds of out-of-the-box Informatica pre-built transforms and connectors for structured and unstructured data processing on Hadoop. With the Informatica Vibe Virtual Data Machine running directly on each node of the Hadoop cluster, the Big Data Edition can profile, parse, transform and cleanse data at any scale to prepare data for data science, business intelligence and operational analytics.
The Informatica Big Data Edition Trial Sandbox VMs will have a 60-day trial version of the Big Data Edition preinstalled inside a 1-node Hadoop cluster. The trials include sample data and mappings, as well as getting-started documentation and videos. It is possible to try your own data with the trials, but processing is limited to the 1-node Hadoop cluster and the machine you have it running on. Any mappings you develop in the trial can be easily moved to a production Hadoop cluster running the Big Data Edition. The Informatica Big Data Edition also supports the MapR and Pivotal Hadoop distributions; however, the trial is currently only available for Cloudera and Hortonworks.
Accelerate your ability to bring Hadoop from the sandbox into production by leveraging Informatica’s Big Data Edition. Informatica’s visual development approach means that more than one hundred thousand existing Informatica developers are now Hadoop developers without having to learn Hadoop or new hand coding techniques and languages. Informatica can help organizations easily integrate Hadoop into their enterprise data infrastructure and bring the PowerCenter data pipeline mappings running on traditional servers onto Hadoop clusters with minimal modification. Informatica Big Data Edition reduces the risk of Hadoop projects and increases agility by enabling more of your organization to interact with the data in your Hadoop cluster.
To get the Informatica Big Data Edition Trial Sandbox VMs and more information, please visit the Informatica Marketplace.
The way I see it, the biggest impact of the Apple Watch will come from how it will finally make data fashionable. For starters, the three Apple Watch models and interchangeable bands will actually make it hip to wear a watch again. But I think the ramifications of this genuinely good-looking watch go well beyond skin deep. The Cupertino company has engineered its watch and its mobile software to recognize related data and seamlessly share it across relevant apps. And those capabilities allow it to, for instance, monitor our fitness and health, show us where we parked the car, open the door to our hotel room and control our entertainment centers.
Think what this could mean for any company with a Data-First point of view. I like to say that a data-first POV changes everything. With it, companies can unleash the killer app, killer marketing campaign and killer sales organization. The Apple Watch finally gives people a reason to have that killer app with them at all times, wherever they are and whatever they’re doing. Looked at a different way, it could unleash a new culture of Data-Only consumers: People who rely on being told what they need to know, in the right context.
But while Apple may be the first to push this Data-First POV in unexpected ways, history has shown they won’t be the last. It’s time for every company to tap into the newest fashion accessory, and make data their first priority.
This got me thinking: What is the biggest bottleneck in the delivery of business value today? I know I look at things from a data perspective, but data is the biggest bottleneck. Consider this prediction from Gartner:
“Gartner predicts organizations will spend one-third more on app integration in 2016 than they did in 2013. What’s more, by 2018, more than half the cost of implementing new large systems will be spent on integration.”
When we talk about application integration, we’re talking about moving data, synchronizing data, cleansing data, transforming data, and testing data. The question for architects and senior management is this: Do you have the Data Foundation for Execution you need to drive the business results you require to compete? The answer, unfortunately, for most companies is no.
All too often data management is an add-on to larger application-based projects. The result is unconnected and non-interoperable islands of data across the organization. That simply is not going to work in the coming competitive environment. Here are a couple of quick examples:
- Many companies are looking to compete on their use of analytics. That requires collecting, managing, and analyzing data from multiple internal and external sources.
- Many companies are focusing on a better customer experience to drive their business. This again requires data from many internal sources, plus social, mobile and location-based data to be effective.
When I talk to architects about the business risks of not having a shared data architecture, and common tools and practices for enterprise data management, they “get” the problem. So why aren’t they addressing it? The issue is that they are funded only for the project they are working on, and are dealing with very demanding timeframe requirements. They have no funding or mandate to solve the larger enterprise data management problem, which grows more complex and brittle with each new unconnected project or initiative added to the pile.
Studies such as “The Data Directive” by The Economist show that organizations that actively manage their data are more successful. But, if that is the desired future state, how do you get there?
Changing an organization to look at data as the fuel that drives strategy takes hard work and leadership. It also takes a strong enterprise data architecture vision and strategy. For fresh thinking on the subject of building a data foundation for execution, see “Think Data-First to Drive Business Value” from Informatica.
* By the way, Informatica is proud to announce that we are now a sponsor of the MIT Center for Information Systems Research.
Last time I talked about how benchmark data can be used in IT and business use cases to illustrate the financial value of data management technologies. This time, let’s look at additional use cases, and at how to philosophically interpret the findings.
So here are some additional areas of investigation for justifying a data-quality-based data management initiative:
- Compliance or other audit data and report preparation and rebuttal (FTE cost as above)
- Excess insurance premiums on incorrect asset or party information
- Excess tax payments due to incorrect asset configuration or location
- Excess travel or idle time between jobs due to incorrect location information
- Excess equipment downtime (not revenue generating) or MTTR due to incorrect asset profile or misaligned reference data not triggering timely repairs
- Service costs or revenues split incorrectly due to incorrect equipment location or ownership data
- Party relationship data not tied together creating duplicate contacts or less relevant offers and lower response rates
- Lower than industry average cross-sell conversion ratio due to inability to match and link departmental customer records and underlying transactions and expose them to all POS channels
- Lower than industry average customer retention rate due to lack of full client transactional profile across channels or product lines to improve service experience or apply discounts
- Low annual supplier discounts due to incorrect or missing alternate product data or aggregated channel purchase data
I could go on forever, but allow me to touch on a sensitive topic – fines. Fines, or performance penalties by private or government entities, only make sense to bake into your analysis if they happen repeatedly in fairly predictable intervals and are “relatively” small per incidence. They should be treated like M&A activity. Nobody will buy into cost savings in the gazillions if a transaction only happens once every ten years. That’s like building a business case for a lottery win or a life insurance payout with a sample size of a family. Sure, if it happens you just made the case but will it happen…soon?
Use benchmarks and ranges wisely, but don’t over-think the exercise either, or it will become paralysis by analysis. If you want to make it super-scientific, hire an expensive consulting firm for a three-month, $250,000-to-$500,000 engagement and have every staffer spend a few days with them, away from their day jobs, to make you feel 10% better about the numbers. Was that worth half a million dollars just in third-party cost? You be the judge.
In the end, you are trying to determine, and position, whether a technology will fix a $50,000, $5 million or $50 million problem. You are also trying to gauge where the key areas of improvement are in terms of value, and to correlate the associated cost (higher value normally equals higher cost due to higher complexity) and risk. After all, who wants to stand before a budget committee, prophesy massive savings in one area, and then fail because it would have been smarter to start with a simpler, quicker win to build upon?
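One lightweight way to frame that $50,000-versus-$50 million question is to compute a low/high sensitivity band rather than a single point estimate; the band itself is the hedge. Here is a sketch with purely hypothetical figures, not benchmarks.

```python
# Sketch: a low/high sensitivity band for annual savings from a data fix.
# Every number below is a hypothetical illustration, not a benchmark.

def savings_band(incidents_per_year, cost_per_incident, reduction_low, reduction_high):
    """Return the pessimistic and optimistic annual savings estimates."""
    baseline = incidents_per_year * cost_per_incident  # current annual cost
    return baseline * reduction_low, baseline * reduction_high

low, high = savings_band(
    incidents_per_year=400,   # e.g. duplicate mailings per year
    cost_per_incident=250,    # fully loaded cost of each incident, in dollars
    reduction_low=0.30,       # pessimistic data-quality improvement
    reduction_high=0.60,      # optimistic data-quality improvement
)
print(round(low), round(high))  # 30000 60000
```

Presenting the band, and the assumptions behind each input, is usually more credible in front of a budget committee than one suspiciously precise number.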
The secret sauce to avoiding this consulting expense and risk is natural curiosity, willingness to do the legwork of finding industry benchmark data, knowing what goes into those benchmarks (process versus data improvement capabilities) to avoid inappropriate extrapolation, and using sensitivity analysis to hedge your bets. Moreover, trust an (internal?) expert to indicate wider implications and trade-offs. Most importantly, you have to be a communicator willing to talk to many folks on the business side, with criminal-interrogation qualities not unlike those in your run-of-the-mill crime show. Some folks just don’t want to talk, often because they have ulterior motives (protecting their legacy investment or process) or are hiding skeletons in the closet (recent bad performance). In this case, find more amenable people to quiz, or pry the information out of these tough nuts if you can.
Lastly, if you find ROI numbers that appear astronomical at first, remember that leverage is a key factor. If a technical capability touches one application (a credit risk scoring engine), one process (quotation), one type of transaction (talent management self-service), or a limited set of people (procurement), the ROI will be lower than for a technology touching several of each. If your business model drives thousands of high-value (thousands-of-dollars) transactions, rather than ten twenty-million-dollar ones or twenty million one-dollar ones, your ROI will be higher. After all, consider this: retail e-mail marketing campaigns average an ROI of 578% (softwareprojects.com), and that is with really bad data. Imagine what improved data could do just on that front.
I found massive differences between what improved asset data can deliver in a petrochemical or utility company versus product data in a fashion retailer or customer (loyalty) data in a hospitality chain. The assertion of cum hoc ergo propter hoc (treating correlation as causation) is a key assumption in how technology delivers financial value. As long as the business folks agree, or can fence in the relationship, you are on the right path.
What are your best and worst experiences justifying someone giving you money to invest? Share that story.
This blog post initially appeared on Exterro and is reblogged here with their consent.
As data volumes increase and become more complex, having an integrated e-discovery environment where systems and data sources are automatically synching information and exchanging data with e-discovery applications has become critical for organizations. This is true of unstructured and semi-structured data sources, such as email servers and content management systems, as well as structured data sources, like databases and data archives. The topic of systems integration will be discussed on Exterro’s E-Discovery Masters series webcast, “Optimizing E-Discovery in a Multi-Vendor Environment.” The webcast is CLE-accredited and will air on Thursday, September 4 at 1pm ET/10am PT. Learn more and register here.
I recently interviewed Jim FitzGerald, Sr. Director for Exterro, and Josh Alpern, VP ILM Domain Experts for Informatica, about the important and often overlooked role structured data plays during the course of e-discovery.
Q: E-Discovery demands are often discussed in the context of unstructured data, like email. What are some of the complications that arise when a matter involves structured data?
Jim: A lot of e-discovery practitioners are comfortable with unstructured data sources like email, file shares, or documents in SharePoint, but freeze up when they have to deal with structured data. They are unfamiliar with the technology and terminology of databases, extracts, report generation, and archives. They’re unsure about the best ways to preserve or collect from these sources. If the application is an old one, this fear often gets translated into a mandate to keep everything just as it is, which means mothballed applications that just sit there in case data might be needed down the road. Beyond the costs, there’s also the issue that IT staff turnover makes it increasingly hard to generate the reports Legal and Compliance need from these old systems.
Josh: Until now, e-discovery has largely been applied to unstructured data and email for two main reasons: 1) a large portion of relevant data resides in these types of data stores, and 2) these are the data formats that everyone is most familiar with and can relate to most easily. We all use email, and we all use files and documents. So it’s easy for people to look at an email or a document and understand that everything is self-contained in that one “thing.” But structured data is different, although not necessarily any less relevant. For example, someone might understand conceptually what a “purchase order” is, but not realize that in a financial application a purchase order consists of data that is spread across 50 different database tables. Unlike with an email or a PDF document, there might not be an easy way to simply produce a purchase order, in this example, for legal discovery without understanding how those 50 database tables are related to each other. Furthermore, to use email as a comparison, everyone understands what an email “thread” is. It’s easy to ask for all the emails in a single thread, and usually it’s relatively easy to identify all of those emails: they all have the same subject line. But in structured data the situation can be much more complicated. If someone asks to see every financial document related to a single purchase order, you would have to understand all of the connections between the many database tables that comprise all of those related documents and how they relate back to the requested purchase order. Solutions that are focused on email and unstructured data have no means to do this.
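Josh’s purchase-order example can be shown in miniature: even with a toy two-table schema, “the purchase order” only exists as the result of a join, which is exactly the knowledge that email- and document-centric tools lack. The schema and names below are hypothetical; a real ERP system spreads the same record across far more tables.

```python
# Sketch: a "purchase order" spread across related tables, reassembled by a
# join. Table and column names are hypothetical illustrations.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE po_header (po_id INTEGER PRIMARY KEY, vendor TEXT);
    CREATE TABLE po_line (line_id INTEGER, po_id INTEGER, item TEXT, qty INTEGER);
    INSERT INTO po_header VALUES (1001, 'Acme Supply');
    INSERT INTO po_line VALUES (1, 1001, 'widget', 50), (2, 1001, 'bracket', 20);
""")

# Producing the purchase order for discovery requires knowing this join;
# no single row, on its own, is "the document".
rows = con.execute("""
    SELECT h.po_id, h.vendor, l.item, l.qty
    FROM po_header h JOIN po_line l ON l.po_id = h.po_id
    WHERE h.po_id = 1001
    ORDER BY l.line_id
""").fetchall()
print(rows)
```

Scale that join knowledge up to 50 interrelated tables and it becomes clear why purpose-built structured-data tooling is needed.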
Q: What types of matters tend to implicate structured data and are they becoming more or less common?
Jim: The ones I hear about most often are product liability cases where they need to look back at warranty claims or drug trial data, employment disputes around pay history and practices, or financial cases where they need to look at pricing or trading patterns.
Josh: The ones that Jim mentioned are certainly prevalent. But in addition, I would add that all kinds of financial data are now governed by retention policies largely because of the same concerns that arise from potential legal situations: at some point, someone may ask for it. Anything related to consumer packaged goods, vehicle parts (planes, boats, cars, trucks, etc.) as well as industrial and durable goods, which tend to have very long lifecycles, are increasingly subject to these types of inquiries.
Q: Simply accessing legacy data to determine its relevance to a matter can present significant challenges. What are some methods by which organizations can streamline the process?
Jim: If you are keeping around mothballed applications and databases purely for reporting purposes, these are prime targets to migrate to a structured data archive. Cost savings from licenses, CPU, and storage can run to 65% per year, with the added benefit that it’s much easier to enforce a retention policy on this data, roll it off when it expires, and compliance reporting is easier to do with modern tools.
Josh: One huge challenge that comes from these legacy applications stems from the fact that there are typically a lot of them. That means that when a discovery request arises, someone – or more likely multiple people – have to go to each one of those applications one by one to search for and retrieve relevant data. Not only is that time consuming and cumbersome, but it also assumes that there are people with the skill sets and application knowledge necessary to interact with all of those different applications. In any given company, that might not be a problem *today*, shortly after the applications have been decommissioned, because all the people that used the applications when they were live are still around. But will that still be the case 5, 7, 10 or 20 years from now? Probably not. Retiring all of these legacy applications into a “platform neutral” format is a much more sustainable, not to mention cost effective, approach.
Q: How can e-discovery preservation and collection technologies be leveraged to help organizations identify and “lock down” structured data?
Jim: Integrating e-discovery (legal holds and collections) with your structured data archive can make it a lot easier to coordinate preservation and collection activities across the two systems. This reduces the chances of stranded holds (data under preservation that could have been released) and reduces the ambiguity about what needs to happen to the data to support the needs of legal and compliance teams.
Josh: Just as there are solutions for “locking down” unstructured and semi-structured (email) data, there are solutions for locking down structured data. The first and perhaps most important step is recognizing that the solutions for unstructured and semi-structured data are simply incapable of handling structured data. Without something that is purpose built for structured data, your discovery preservation and collection process is going to ignore this entire category of data. The good news is that some of the solutions that are purpose built for structured data have built in integrations to the leading e-discovery platforms.
You can hear more from Informatica’s Josh Alpern and Exterro’s Jim FitzGerald by attending Exterro’s CLE-accredited webcast, “Optimizing E-Discovery in a Multi-Vendor Environment,” airing on Thursday, September 4. Learn more and register here.
Get connected. Be connected. Make connections. Find connections. The Internet of Things (IoT) is all about connecting people, processes, data and, as the name suggests, things. The recent social media frenzy surrounding the ALS Ice Bucket Challenge has certainly reminded everyone of the power of social media, the Internet and a willingness to answer a challenge. Fueled by personal and professional connections, the craze has transformed fund raising for at least one charity. Similarly, IoT may potentially be transformational to the business of the public sector, should government step up to the challenge.
Government is struggling with the concept and reality of how IoT really relates to the business of government, and perhaps rightfully so. For commercial enterprises, IoT is far more tangible and simply more fun. Gaming, televisions, watches, Google Glass, smartphones and tablets are all about delivering over-the-top, new and exciting consumer experiences. Industry is delivering transformational innovations, which are connecting people to places, data and other people at a record pace.
It’s time to accept the challenge. Government agencies need to keep pace with their commercial counterparts and harness the power of the Internet of Things. The end game is not to deliver new, faster, smaller, cooler electronics; the end game is to create solutions that let devices connecting to the Internet interact and share data, regardless of their location, manufacturer or format and make or find connections that may have been previously undetectable. For some, this concept is as foreign or scary as pouring ice water over their heads. For others, the new opportunity to transform policy, service delivery, leadership, legislation and regulation is fueling a transformation in government. And it starts with one connection.
One way to start could be linking previously siloed systems together or creating a golden record of all citizen interactions through a Master Data Management (MDM) initiative. It could start with a big data and analytics project to determine and mitigate risk factors in education or linking sensor data across multiple networks to increase intelligence about potential hacking or breaches. Agencies could stop waste, fraud and abuse before it happens by linking critical payment, procurement and geospatial data together in real time.
This is the Internet of Things for government. This is the challenge. This is transformation.