April 2017, IDC #US41715317
TAXONOMY
IDC's Worldwide Data Integration and Integrity Software
Taxonomy, 2017
Stewart Bond
IDC'S WORLDWIDE DATA INTEGRATION AND INTEGRITY SOFTWARE TAXONOMY
FIGURE 1
Data Integration and Integrity Software Primary Segments
Source: IDC, 2017
DATA INTEGRATION AND INTEGRITY SOFTWARE TAXONOMY CHANGES FOR 2017
Three primary changes have occurred since the last published taxonomy for the data integration and
access (DIA) software functional market (see IDC's Worldwide Data Integration and Access Software
Taxonomy, 2015, IDC #255856, June 2015):

- Name change from data integration and access software to data integration and integrity software
- Consolidation of the general data quality and domain-based matching and cleansing market segments into one data quality software segment
- Addition of the self-service data preparation software segment
Table 1 summarizes these changes as they were made for 2016 and shows how they relate to the 2015
market definition.
TABLE 1
Summary of Data Integration and Integrity Software Functional Market Changes for 2016

2015 Market: Data integration and access software (within application development and deployment)
2016 Market: Data integration and integrity software
Comments: Renamed and expanded the scope of this market. Self-service data preparation is added as a submarket, and the overall market name changed to data integration and integrity to emphasize the growing importance of data quality.

2015 Market: General data quality segment and domain-based matching and cleansing segment
2016 Market: Data quality segment (within data integration and integrity software)
Comments: Market consolidation; software vendors with general data quality software are investing in domain-based matching and cleansing, and vice versa.

2015 Market: (none)
2016 Market: Self-service data preparation segment (within data integration and integrity software)
Comments: New segment describing data access and preparation tools suitable for business users.
Source: IDC, 2017
Data is core to digital transformation, and data without integrity will not be able to support digital
transformation initiatives. Data with integrity is data that is trusted, available, secure, and compliant
across the enterprise IT environment, regardless of whether data is persisted on-premises or in the
cloud. Data integration software vendors have the tools, intellectual property, and products that can
improve the integrity of data. Integrity has become the secondary focus of solutions in this market, and
for this reason, IDC changed the name of the software market from data integration and access to data
integration and integrity (DII).
Until 2016, IDC tracked data quality software spend within the data integration and access software
functional market as two subsegments: general data quality, and domain-based matching and
cleansing software. These two segments are consolidating as software vendors in the general data
quality segment add domain-based matching and cleansing software, and conversely, domain-based
matching and cleansing software vendors are adding general data quality software to their portfolio.
While in some cases the products continue to be separate, it is becoming increasingly difficult to track
spend as the lines are blurring. As the market consolidates, so must the segments that IDC is tracking.
A lot happened on the way to big data insights, including a need to put data preparation into the hands
of business users. The year 2015 saw a surge of activity in the self-service data preparation software
category, and IDC therefore started tracking spend in this category in 2016. In many cases, the purpose of
software in this segment is to prepare data for analysis, and because functionality is similar to extract,
transform, and load (ETL), IDC decided to record and track revenue within the DII software market
instead of the business analytics market.
TAXONOMY OVERVIEW
Data integration and integrity (DII) software enables the access, blending, movement, and integrity of
data among multiple data sources. The purpose of data integration and integrity software is to ensure
the consistency of information where there is a logical overlap of the information contents of two or
more discrete systems. Data integration is increasingly being used to capture, prepare, and curate
data for analytics. It is also the conduit through which new data types, structures, and content
transformation can occur in modern IT environments that are inclusive of relational, nonrelational, and
semistructured data repositories.
DII software employs a wide range of technologies, including, but not limited to, extract, transform, and
load; change data capture (CDC); federated access; format and semantic mediation; data quality and
profiling; and associated metadata management. Data access is increasingly becoming open with
standards-based APIs; however, there is still a segment of this market focused on data connectivity
software, which includes data connectors and connectivity drivers for legacy technologies and
orchestration of API calls.
DII software may be used in a wide variety of functions. The most common is that of building and
maintaining a data warehouse, but other uses include enterprise information integration, data
migration, database consolidation, master data management (MDM), reference data management,
metadata management, and database synchronization, to name a few. Data integration may be
deployed and executed as batch processes, typical for data warehouses, or in near-real-time modes
for data synchronization or dedicated operational data stores. A data migration can also be considered
a one-time data integration. Although most applications of DII software are centered on structured data
in databases, they may also include integrating data from disparate, distributed unstructured or
semistructured data sources, including flat files, XML files, JSON, legacy applications, big data
persistence technologies, and the proprietary data sources associated with commercial applications
from vendors such as SAP and Oracle. More recently, an interactive form of data integration has
emerged in the application of self-service data preparation.
The data integration and integrity software market includes the submarkets as illustrated in Figure 2
and discussed in the Definitions section that follows.
FIGURE 2
IDC Data Integration and Integrity Software Taxonomy Model
Source: IDC, 2017
DEFINITIONS
Bulk Data Movement Software
This software, commonly referred to as extract, transform, and load software, selectively draws data
from source databases, transforms it into a common format, merges it according to rules governing
possible collisions, and loads it into a target. A derivative of this process, extract, load, and transform
(ELT), can be considered synonymous with ETL for the purpose of market definition and analysis. ELT
is an alternative commonly used by database software vendors to leverage the inherent data
processing performance of database platforms. ETL and/or ELT software normally runs in batch but
may also be invoked dynamically by command. It could be characterized as providing the ability to
move many things (data records) in one process.
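To make the ETL flow concrete, the following is a minimal Python sketch using the standard sqlite3 module; the source and target databases, table names, and columns are hypothetical, and the merge rule (most recent update wins) stands in for the collision-handling rules a commercial product would make configurable.

```python
import sqlite3

def extract(conn, query):
    """Pull rows from a source database into a list of dictionaries."""
    cur = conn.execute(query)
    cols = [c[0] for c in cur.description]
    return [dict(zip(cols, row)) for row in cur.fetchall()]

def transform(rows):
    """Normalize each record into the common target format."""
    return [{
        "customer_id": str(r["customer_id"]).strip(),
        "name": r["name"].strip().title(),
        "updated_at": r["updated_at"],          # ISO-8601 timestamp assumed
    } for r in rows]

def merge(*row_sets):
    """Resolve collisions: keep the most recently updated record per key."""
    merged = {}
    for rows in row_sets:
        for r in rows:
            key = r["customer_id"]
            if key not in merged or r["updated_at"] > merged[key]["updated_at"]:
                merged[key] = r
    return list(merged.values())

def load(conn, rows):
    """Write the merged records into the target table."""
    conn.executemany(
        "INSERT OR REPLACE INTO dim_customer (customer_id, name, updated_at) "
        "VALUES (:customer_id, :name, :updated_at)", rows)
    conn.commit()

# Hypothetical source and target databases, for illustration only.
crm = sqlite3.connect("crm.db")
billing = sqlite3.connect("billing.db")
warehouse = sqlite3.connect("warehouse.db")
warehouse.execute("CREATE TABLE IF NOT EXISTS dim_customer "
                  "(customer_id TEXT PRIMARY KEY, name TEXT, updated_at TEXT)")

load(warehouse, merge(
    transform(extract(crm, "SELECT customer_id, name, updated_at FROM customers")),
    transform(extract(billing, "SELECT customer_id, name, updated_at FROM accounts"))))
```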
The following are representative companies in this submarket:
IBM (InfoSphere DataStage)
Informatica (PowerCenter)
Oracle (Oracle Data Integrator)
SAP (SAP Data Integrator, SAP Data Services, HANA Smart Data Integration)
SAS (SAS Data Management)
Talend (Open Studio for Data Integration)
Dynamic Data Movement Software
Vendors often refer to this functionality as providing "real time" data integration; however, it can only
ever be "near" real time because of the latency inherent in sensing and responding to a data change
event occurring dynamically inside a database. Data changes are captured either through a stored
procedure associated with a table or field trigger, or through a change data capture (CDC) facility that is
either log based or agent based.
These solutions actively move the data associated with the change among correspondent databases,
driven by metadata that defines interrelationships among the data managed by those databases. The
software is able to perform transformation and routing of the data, inserting or updating the target,
depending on the rules associated with the event that triggered the change. It normally either features
a runtime environment or operates by generating the program code that does the extracting,
transforming, and routing of the data and updating of the target. It could be characterized as software
that moves one thing (data record or change), many times.
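For illustration, the sketch below shows trigger-based change capture using SQLite through Python's standard sqlite3 module; the orders table and its columns are hypothetical. A trigger writes before and after images of each update into a change log that a downstream mover can read and route to targets. Log- or agent-based CDC products work against the database transaction log instead, which this simple sketch does not attempt to show.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT, amount REAL);

-- Change log holding before/after images for downstream propagation.
CREATE TABLE change_log (
    seq        INTEGER PRIMARY KEY AUTOINCREMENT,
    table_name TEXT,
    operation  TEXT,
    before_img TEXT,
    after_img  TEXT
);

-- Trigger that captures updates to the source table as they happen.
CREATE TRIGGER orders_update AFTER UPDATE ON orders
BEGIN
    INSERT INTO change_log (table_name, operation, before_img, after_img)
    VALUES ('orders', 'UPDATE',
            OLD.order_id || '|' || OLD.status || '|' || OLD.amount,
            NEW.order_id || '|' || NEW.status || '|' || NEW.amount);
END;
""")

conn.execute("INSERT INTO orders VALUES (1, 'NEW', 99.50)")
conn.execute("UPDATE orders SET status = 'SHIPPED' WHERE order_id = 1")

# A mediation process would poll the log and route each change to its targets.
for row in conn.execute("SELECT seq, operation, before_img, after_img FROM change_log"):
    print(row)
```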
The following are representative companies in this segment:
IBM (InfoSphere Data Replication)
Informatica (PowerCenter Real Time Edition)
Oracle (Oracle GoldenGate)
SAP (SAP Replication Server, SAP Data Services)
Data Quality Software
This submarket includes products used to identify errors or inconsistencies in data, normalize data
formats, infer update rules from changes in data, validate against data catalog and schema definitions,
and match data entries with known values. Data quality activities are normally associated with data
integration tasks such as match/merge and federated joins but may also be used to monitor the quality
of data in the database, either in near real time or on a scheduled basis.
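As a minimal illustration, the Python sketch below applies two of the operations described above, format normalization and duplicate matching, to hypothetical customer records; commercial data quality products layer rule engines, reference knowledge bases, monitoring, and profiling on top of this kind of logic.

```python
import re

def normalize(record):
    """Standardize formats so that equivalent values compare as equal."""
    return {
        "name": " ".join(record["name"].split()).title(),
        "email": record["email"].strip().lower(),
        "phone": re.sub(r"\D", "", record["phone"])[-10:],  # keep last 10 digits
    }

def match_key(record):
    """A simple matching rule: identical email and phone imply the same party."""
    return (record["email"], record["phone"])

records = [
    {"name": "jane  DOE", "email": "Jane.Doe@Example.com ", "phone": "(555) 010-2030"},
    {"name": "Jane Doe",  "email": "jane.doe@example.com",  "phone": "555-010-2030"},
]

seen, duplicates = {}, []
for rec in map(normalize, records):
    key = match_key(rec)
    if key in seen:
        duplicates.append((seen[key], rec))   # candidate pair for merge/survivorship
    else:
        seen[key] = rec

print(f"{len(duplicates)} duplicate pair(s) found")
```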
The data quality software submarket also includes purpose-built software that can interpret,
deduplicate, and correct data in specific domains such as customers, locations, and products. Vendors
with domain-specific capabilities typically offer products that manage mailing lists and feed data into
customer relationship management and marketing systems. The following are representative
companies in this submarket:
Experian Data Quality
IBM (InfoSphere Discovery, BigInsights BigQuality)
Informatica (Informatica Data Quality and Data-as-a-Service)
Melissa Data
Pitney Bowes (Code-1 Plus and Finalist)
SAP (SAP Data Services, SAP HANA Smart Data Quality)
SAS Data Quality
Talend (Talend Open Studio for Data Quality)
Syncsort (acquired Trillium Software)
Data Access Infrastructure
This software is used to establish connections between users or applications and data sources without
requiring a direct connection to the data source's API or hard-coded database interface. It includes
cross-platform connectivity software. It also includes ODBC and JDBC drivers and application and
database adapters. The following are representative companies in this submarket:
Micro Focus (Attachmate Databridge)
Information Builders (iWay Integration Solutions)
Progress Software (DataDirect Connect and DataDirect Cloud)
Rocket Software (Rocket Data)
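As a small illustration of what this segment provides, the sketch below uses the third-party pyodbc package to reach a data source through an ODBC driver rather than a vendor-specific API; the DSN name, credentials, and table are hypothetical and assume an ODBC driver and data source have already been configured.

```python
import pyodbc  # third-party package; an ODBC driver manager and driver must be installed

# Connect through a preconfigured ODBC data source name (DSN); the application
# does not need to know which database engine sits behind it.
conn = pyodbc.connect("DSN=legacy_sales;UID=report_user;PWD=secret")

cursor = conn.cursor()
cursor.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
for region, total in cursor.fetchall():
    print(region, total)

conn.close()
```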
Composite Data Framework
This market segment includes data federation and data virtualization software, enabling multiple
clients (usually applications, application components, or databases running on the same or different
servers in a network) to share data in a way that reduces latency. Usually involving a cache, this
software sometimes also provides transaction support, including maintaining a recovery log.
Data federation permits access to multiple databases as if they were one database. Most are read
only, but some provide update capabilities. Data virtualization products are similar but offer full schema
management coordinated with the source database schemas to create a complete database
environment that sits atop multiple disparate physical databases. Originally created to provide the data
services layer of a service-oriented architecture (SOA), data virtualization has been applied to multiple
use cases including data abstraction, rapid prototyping of data marts and reports, and providing
abilities for data distribution without replication.
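A composite data framework performs, at scale and with full schema management, something like the in-memory join sketched below in Python; the two sources (a SQLite database of orders and a CSV file of customer master data) and their columns are hypothetical, and the "virtual view" here is simply the blended result returned to the caller without being persisted.

```python
import csv
import sqlite3

def virtual_customer_orders(db_path="orders.db", csv_path="customers.csv"):
    """Query two disparate sources and join them in memory; nothing is persisted."""
    # Source 1: relational database of orders.
    with sqlite3.connect(db_path) as conn:
        orders = conn.execute(
            "SELECT customer_id, order_id, amount FROM orders").fetchall()

    # Source 2: flat file of customer master data.
    with open(csv_path, newline="") as fh:
        customers = {row["customer_id"]: row["name"] for row in csv.DictReader(fh)}

    # Join in memory and expose the blended result as the "virtual view".
    return [
        {"order_id": oid, "customer": customers.get(cid, "UNKNOWN"), "amount": amt}
        for cid, oid, amt in orders
    ]

for row in virtual_customer_orders():
    print(row)
```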
The following are representative companies in this submarket:
Cisco Data Virtualization Platform
Denodo
IBM (InfoSphere Federation Server)
Informatica Data Virtualization
Red Hat (JBoss Data Virtualization)
SAS (SAS Federation Server)
Master Data Definition and Control Software
The master data definition and control software market segment includes products that help
organizations define and maintain master data, which is of significance to the enterprise and multiple
systems. Master data is usually categorized by one of four domains: party (people, organizations),
product (including services), financial assets (accounts, investments), and location (physical property
assets).
Master data definition and control software manages metadata regarding entity relationships,
attributes, hierarchies, and processing rules to ensure the integrity of master data wherever it is
used. Key functionality includes master data modeling, import and export, and the definition and
configuration of rules for master data handling such as versioning, access, synchronization, and
reconciliation.
This market segment also includes reference data management software. Reference data is used to
categorize or classify other data and is typically limited to a finite set of allowable values aligned with a
value domain. Domains can be public such as time zones, countries and subdivisions, currencies,
units of measurement, or proprietary codes used within an organization. Domains can also be industry
specific such as medical codes, SWIFT BIC codes, and ACORD ICD codes. As with master data,
reference data is referenced in multiple places within business systems, and in some cases, can
deviate across disparate systems. Reference data management software could be considered a
subset of master data management (MDM) software and, in many cases, may share the same code
base as master data management software sold by the same vendor.
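As a small illustration of reference data in practice, the Python sketch below validates and standardizes records against a managed reference domain, using currency codes as a hypothetical example; a reference data management product would additionally version the domain, control access to it, and synchronize it across systems.

```python
# A managed reference domain: the finite set of allowable values, plus synonyms
# that map nonstandard entries back to the canonical code. Values are illustrative.
CURRENCY_DOMAIN = {"USD", "EUR", "GBP", "JPY", "CAD"}
CURRENCY_SYNONYMS = {"US DOLLAR": "USD", "EURO": "EUR", "STERLING": "GBP"}

def standardize_currency(value):
    """Map an incoming value to the canonical code, or flag it as invalid."""
    code = CURRENCY_SYNONYMS.get(value.strip().upper(), value.strip().upper())
    if code not in CURRENCY_DOMAIN:
        raise ValueError(f"'{value}' is not in the currency reference domain")
    return code

invoices = [{"invoice": 1, "currency": "usd"},
            {"invoice": 2, "currency": "Euro"},
            {"invoice": 3, "currency": "XXQ"}]

for inv in invoices:
    try:
        inv["currency"] = standardize_currency(inv["currency"])
    except ValueError as err:
        print("rejected:", inv, err)

print(invoices)
```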
Master data definition and control products can include operational orchestration facilities to coordinate
master data management processes across multiple information sources in either a batch, a service,
or an event-based context. Master data definition and control software provides capabilities to facilitate
single or multiple master data entity domain definition and processing and usually serves as the core
technical component to a broader master data management solution (a competitive market in the IDC
taxonomy).
Representative vendors and products include:
IBM (IBM InfoSphere Master Data Management)
Informatica (Multi-Domain MDM, PIM)
Oracle (Oracle Product Hub, Oracle Customer Hub, Oracle Site Hub, Oracle Higher Education
Constituent Hub, and Oracle Data Relationship Management)
SAP (SAP Master Data Governance)
TIBCO (Master Data Management Platform)
Metadata Management Software
Metadata is data about data. It is commonly associated with unstructured data, as a structure in
which to describe the content. However, it is increasingly being used in the structured data world, as
the data becomes more complex in its definition, usage, and distribution across an enterprise. At a
basic level, metadata includes definitions of the data, when, how, and by whom the data was created
and last modified. More advanced metadata adds context to the data, traces lineage of the data, cross-
references where and how the data is used, and improves interoperability of the data. Metadata has
become critical in highly regulated industries as a useful source of compliance information.
The metadata management submarket has grown out of the database management, data integration,
and data modeling markets. Metadata management solutions provide the functionality to define
metadata schema, automated and/or manual population of the metadata values associated with
structured data, and basic analytic capabilities for quick reporting. These solutions also offer
application program interfaces (APIs) for programmatic access to metadata for data integration and
preparation for analytics.
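To make this concrete, the following is a minimal sketch of the kind of record a metadata management solution might keep for a data set and expose through its APIs; the fields and example values are hypothetical, covering basic descriptive metadata plus simple lineage and usage cross-references.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DatasetMetadata:
    name: str
    definition: str                  # what the data set means in business terms
    created_by: str
    created_at: str                  # ISO-8601 timestamp
    last_modified_at: str
    upstream_sources: List[str] = field(default_factory=list)   # lineage
    used_by: List[str] = field(default_factory=list)            # usage cross-references

catalog = {}

def register(entry: DatasetMetadata):
    """Stand-in for the catalog API a DII or analytics tool would call programmatically."""
    catalog[entry.name] = entry

register(DatasetMetadata(
    name="dim_customer",
    definition="Conformed customer dimension used for reporting",
    created_by="etl_service",
    created_at="2017-01-15T02:00:00Z",
    last_modified_at="2017-03-01T02:00:00Z",
    upstream_sources=["crm.customers", "billing.accounts"],
    used_by=["sales_mart", "churn_model"],
))

print(catalog["dim_customer"].upstream_sources)   # trace lineage programmatically
```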
Representative vendors and products include:
ASG (Enterprise Data Intelligence)
IBM (InfoSphere Information Governance Catalog)
Informatica (Business Glossary, Enterprise Information Catalog, Live Data Map)
Oracle (Enterprise Metadata Management)
SAP (SAP Information Steward Metadata Management and Metapedia)
Self-Service Data Preparation Software
Self-service data preparation is an emerging segment in the DII software functional market. Demand is
coming from today's tech-savvy business users wanting access to their data in increasingly complex
environments, without IT getting in the way. IT has also been looking for technology that can put data
preparation capabilities into the hands of business users, where the requirements and desired
analytic outcomes are best understood.
These solutions are targeted at business analysts, but some offer a range of user interfaces by
persona. Some require deeper levels of data analytics knowledge, and others require deeper levels of
data integration knowledge. Functionality includes cataloging data sets within repositories,
interactive capabilities for cleansing and standardizing data sets, joining disparate data sets,
performing calculations, and varying degrees of analytics for validation prior to visualization.
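The sketch below shows, in Python with the pandas library, the kind of steps a self-service data preparation tool wraps in an interactive interface: profiling, cleansing and standardizing, joining disparate data sets, adding a calculation, and a quick validation; the file names, columns, and tax rate are hypothetical.

```python
import pandas as pd

# Two hypothetical data sets a business analyst wants to blend.
orders = pd.read_csv("orders.csv")        # order_id, customer_id, amount, order_date
regions = pd.read_excel("regions.xlsx")   # customer_id, region

# Profile: a quick look at completeness before preparing the data.
print(orders.isna().sum())

# Cleanse and standardize.
orders["order_date"] = pd.to_datetime(orders["order_date"], errors="coerce")
orders = orders.dropna(subset=["customer_id", "amount"])
orders["amount"] = orders["amount"].astype(float)

# Join the disparate data sets and add a simple calculation.
prepared = orders.merge(regions, on="customer_id", how="left")
prepared["amount_with_tax"] = prepared["amount"] * 1.13   # illustrative rate only

# Validate before handing the result to a visualization tool.
print(prepared.groupby("region")["amount_with_tax"].sum())
```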
This segment covers vendors that explicitly sell self-service data preparation components. Business
intelligence and analytics vendors that offer self-service data preparation capability within a suite,
as part of the analytics process they support, are excluded from this segment.
Representative vendors and products include:
Alteryx (Designer)
Datawatch (Monarch)
Paxata
SAP (Agile Data Preparation)
Talend (Data Preparation)
Trifacta (Wrangler)
UniFi
RELATED MARKETS
Table 2 lists the markets related to the data integration and integrity software market.
TABLE 2
Data Integration and Integrity Software Related Markets
Competitive market: Master data management (MDM) software
Relationship: The master data definition and control segment of the DII market, combined with portions of the data access, movement, quality, and metadata segments, makes up much of the MDM platform segment of the MDM competitive market.

Competitive market: Business analytics software
Relationship: A portion of the data integration and integrity software market is used to derive the business analytics market, specifically within the data warehouse generation segment.
Source: IDC, 2017
LEARN MORE
Appendix: Patterns of Use
A pattern of use is a commonly observed assembly of software components, meeting the needs of
commonly observed business and technical requirements, within specific organizational and technical
constraints. The DII software market segments are aligned with data integration patterns of use. Some
patterns leverage components from across the segments, and conversely, some patterns support
solutions in other segments and competitive markets (e.g., master data management).
Vendors could consider patterns in the packaging and licensing of software products. Vendors are in a
unique position to help identify new patterns that are emerging in the install base and respond with
new products, packaging, and licensing options to gain a competitive advantage.
The patterns discussed in the sections that follow are typical of many (but not all) data integration
solution implementations, applied to meet business requirements within technical and organizational
constraints. They are presented in a way that helps vendors work with customers to identify each
pattern and to understand the motivation for, and implications of, using it.
Scheduled Bulk Data Integration
Scheduled bulk data integration is used to move large sets of data from a source to a destination on a
frequent schedule. This pattern primarily leverages software in the bulk data movement segment of the
market, namely ETL or the derivative ELT. Figure 3 illustrates the two types of implementation within
this pattern.
FIGURE 3
Scheduled Bulk Data Integration Patterns
Source: IDC, 2017
The extract process happens on a scheduled basis, depending on the data latency requirements in the
target. This pattern is often used to populate data warehouses, data marts, or operational data stores
(ODSs) on a regularly scheduled basis. It can also be used as a one-time data migration.
This pattern is efficient for processing large data sets, as bulk data movement technology is built to
handle large volumes of data. Some data transformation requirements can only be achieved with
ETL/ELT, such as match/merge between multiple source data sets, complex de-normalization, and
summarizing and sorting of transactional data.
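To contrast ELT with the ETL sketch shown earlier, the fragment below (a minimal sketch using Python's sqlite3 module and hypothetical tables) loads raw transactional rows into the target database first and then pushes the summarizing and sorting down into the database engine as SQL.

```python
import sqlite3

warehouse = sqlite3.connect("warehouse.db")
warehouse.execute("CREATE TABLE IF NOT EXISTS stg_transactions "
                  "(txn_id TEXT, customer_id TEXT, amount REAL, txn_date TEXT)")

# Extract + Load: raw rows land in a staging table with no transformation.
raw_rows = [("t1", "c1", 10.0, "2017-03-01"),
            ("t2", "c1", 25.0, "2017-03-02"),
            ("t3", "c2", 40.0, "2017-03-02")]
warehouse.executemany("INSERT INTO stg_transactions VALUES (?, ?, ?, ?)", raw_rows)

# Transform: summarization and sorting are pushed down to the database engine.
warehouse.executescript("""
CREATE TABLE IF NOT EXISTS daily_customer_sales AS
SELECT customer_id, txn_date, SUM(amount) AS total_amount
FROM stg_transactions
GROUP BY customer_id, txn_date
ORDER BY customer_id, txn_date;
""")
warehouse.commit()
```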
Use of this pattern introduces latency between the source and target and is constrained by the
physical data stores and processing capabilities of the underlying infrastructure.
Although the bulk data movement market segment primarily serves this pattern, vendors could bundle
components from the data access segment into solutions for this pattern. Vendors could also consider
metadata management add-on components for traceability of data lineage through the transformation
process, and data quality components for cleansing the data.
Near-Real-Time Data Integration
Near-real-time data integration is used to move smaller sets of data more frequently. This pattern
primarily leverages components in the dynamic data movement market segment. It is most often used
when changes in source data need to be captured and distributed to other systems. Figure 4 illustrates
the pattern and the multiple methods used to capture changes.
FIGURE 4
Near-Real-Time Data Integration Pattern
Source: IDC, 2017
Changes in source data can be captured using a stored procedure or database trigger. The trigger
may populate a change log inside the source database or send the change to an external persistence service.
CDC is a technology that runs outside of the source database and monitors database log files for
changes. Once a change is recognized, "before" and "after" images of the data in scope are sent to a
persistence service. A micro-batch is another alternative for picking up changes. These run every few
minutes and pull only the data that has changed since the last time the micro-batch was executed. It
may pull data out of a change log or directly out of source tables. The persistence service could be a
message queue, file system, external data store, or other types of containers. Mediation services pick
up the changed data from the persistence service and perform the necessary transformation,
formatting, and distribution to the target system(s) that need a copy of the change.
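A minimal micro-batch sketch in Python follows, against a hypothetical source table with a last_updated column: each run pulls only the rows changed since the stored checkpoint and hands them to a persistence service, represented here by an in-memory list standing in for a message queue.

```python
import sqlite3

source = sqlite3.connect("source.db")   # hypothetical source database
queue = []                              # stand-in for a queue or external data store
checkpoint = "1970-01-01T00:00:00Z"     # normally persisted between runs

def run_micro_batch(last_checkpoint):
    """Pull only rows changed since the previous run and publish them."""
    rows = source.execute(
        "SELECT order_id, status, amount, last_updated "
        "FROM orders WHERE last_updated > ? ORDER BY last_updated",
        (last_checkpoint,)).fetchall()
    for row in rows:
        queue.append(row)               # mediation services consume from here
    # Advance the checkpoint to the newest change seen in this batch.
    return rows[-1][3] if rows else last_checkpoint

# A scheduler would invoke this every few minutes; one run is shown for illustration.
checkpoint = run_micro_batch(checkpoint)
print(len(queue), "change(s) queued; next checkpoint:", checkpoint)
```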
This pattern is efficient for processing data changes when latency between source and target needs to
be minimized. Change data capture is an efficient, noninvasive method for recognizing changes in
application databases when API-level access is not available, such as packaged applications with
proprietary APIs, or legacy systems that are costly to modify. Database triggers, change tables, or
other source data schema modifications are invasive and can nullify application vendor support
agreements.
Dynamic data movement is the primary market segment that serves this pattern. Vendors could also
bundle data access and metadata management components for solutions in this pattern. Vendors
could also bundle this pattern as the underlying middleware in master data management solutions,
distributing changes to master data from the systems of entry to systems of record and reference.
Real-Time Data Integration
Real-time data integration provides blending and transformation of data across multiple disparate
sources in a virtualized view. Also known as data virtualization, it could be considered data federation
without persistence. Figure 5 illustrates this pattern assuming two disparate data sources.
FIGURE 5
Real-Time Data Integration Pattern
Source: IDC, 2017
In this pattern, the composite data framework accesses the disparate data sources, queries each for the
data in scope, joins the results in memory, and provides a virtual view of the integrated data,
isolating the data access and integration logic from the consumer of the view. The virtual view is made
available to consumers via APIs.
This pattern is efficient when zero latency is required between the source data and consumers. It is
also useful in situations where flexibility and agility of reporting schemas are important. A virtual view
can be changed without any impact on the underlying data sources.
With real-time access comes impact. Querying against live systems can impact performance of the
source databases and in turn can impact the response time of the virtual views. Availability of sufficient
memory on the infrastructure running the composite data framework software, in addition to sufficient
network bandwidth, is important when considering this pattern. It is also important that all underlying
data security policies are replicated in the virtual views, to prevent unauthorized access.
Packaging of composite data framework, metadata management, and data access software
components is required for end-to-end solutions in this pattern. Vendors that have composite data
frameworks, but not data quality solutions, could encourage the application of this pattern to identify
quality inconsistencies across disparate data sources.
Data Federation
Data federation pulls data from multiple disparate sources, blends and transforms it into one data set,
and persists the results into another physical data store. It could be used to populate an operational
data store, reporting database, or data mart that needs a federated view of disparate data. Figure 6
illustrates the data federation pattern.
FIGURE 6
Data Federation Pattern
Source: IDC, 2017
Data federation can be used when latency is allowed between source systems and the federated
views. Latency can be reduced with near-real-time population of the target data store. It is also
appropriate when memory and network bandwidth constraints preclude data virtualization.
Federated data should be used in read-only mode for reporting and analysis purposes. Allowing write
access to the federated repository can cause a data governance issue as it creates an alternate
version of the source data, leading to inconsistencies.
The data federation use case leverages software in the composite data framework segment and can
use any combination of dynamic and bulk data movement components to move data from source to
target. Vendors could consider bundling these components together and adding metadata
management components for a comprehensive solution that will also provide context and lineage of
data in the federated data store.
Data in Motion Pattern
Data in motion represents the application of data integration technology against streaming data. Many
3rd Platform technologies constantly emit data, available to organizations that have an interest or need
for the data. Data is constantly in motion, coming from social networks, the Internet of Things, stock
market tickers, and so forth. Figure 7 illustrates an application of data integration technology to data in
motion.
FIGURE 7
Streaming Integration Pattern
Source: IDC, 2017
This is an emerging pattern and is applicable to those organizations that are working with, or have a
desire to leverage, streaming data. Any foray into the use of data streams needs to take into
consideration the organization's ability to capture and respond to business-relevant events
in the stream. Available system resources also need to be understood in the context of the velocity and
volume of the streaming data.
This pattern leverages several components in the data integration software market, and vendors are
encouraged to consider bundling alternatives to meet the requirements of this pattern, as applicable to
customer requirements. Data access software provides the capability to interface with social media,
Internet of Things, and other data stream sources. Mediation capabilities available in bulk and dynamic
data movement components assist with identifying and filtering data in the streams that are relevant to
the organization. Data quality components can be used to validate the data before it is allowed into
the organization's persistent storage.
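The sketch below strings these roles together in Python over a simulated stream: a data access layer yields events, a mediation step filters for the business-relevant ones, and a data quality check validates each event before it is written to the organization's persistent store (a list here, standing in for a database). The event fields and filter rules are hypothetical.

```python
import json

def event_stream():
    """Data access: stand-in for a social media, IoT, or market data feed."""
    raw = [
        '{"device": "sensor-1", "type": "temperature", "value": 21.5}',
        '{"device": "sensor-2", "type": "heartbeat"}',
        '{"device": "sensor-3", "type": "temperature", "value": "bad"}',
    ]
    for line in raw:
        yield json.loads(line)

def is_relevant(event):
    """Mediation: keep only the event types the organization cares about."""
    return event.get("type") == "temperature"

def is_valid(event):
    """Data quality: validate before allowing the event into persistent storage."""
    return isinstance(event.get("value"), (int, float))

persistent_store = []   # stand-in for the organization's database

for event in event_stream():
    if not is_relevant(event):
        continue
    if not is_valid(event):
        print("rejected:", event)
        continue
    persistent_store.append(event)

print("persisted:", persistent_store)
```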
Related Research
IDC's Worldwide Services Taxonomy, 2017 (IDC #US42356617, March 2017)
Market Analysis Perspective: Worldwide Data Integration and Integrity Software, 2016 (IDC
#US40799016, September 2016)
Worldwide Data Integration and Integrity Software Market Shares, 2015: Year of Big Data and
the Emergence of Self-Service (IDC #US40696216, June 2016)
Worldwide Data Integration and Integrity Software Forecast, 2016–2020 (IDC #US40696116,
June 2016)
Synopsis
This IDC study provides taxonomical definitions for the data integration and integrity software market.
The study segments the market into data access, movement, integration, and governance functional
categories. Market segments include bulk data movement, dynamic data movement, composite data
frameworks, data quality, master data definition and control, metadata management, and self-service
data preparation.
"Data is core to digital transformation, and data without integrity will not be able to support digital
transformation initiatives," says Stewart Bond, research director for Data Integration and Integrity
Software at IDC. "Data integration software provides organizations with the ability to catalog, move,
transform, match, cleanse, and prepare data, improving the level of trust and integrity of data,
improving the outcomes of digital transformation initiatives on the 3rd Platform."
About IDC
International Data Corporation (IDC) is the premier global provider of market intelligence, advisory
services, and events for the information technology, telecommunications and consumer technology
markets. IDC helps IT professionals, business executives, and the investment community make fact-
based decisions on technology purchases and business strategy. More than 1,100 IDC analysts
provide global, regional, and local expertise on technology and industry opportunities and trends in
over 110 countries worldwide. For 50 years, IDC has provided strategic insights to help our clients
achieve their key business objectives. IDC is a subsidiary of IDG, the world's leading technology
media, research, and events company.
Global Headquarters
5 Speen Street
Framingham, MA 01701
USA
508.872.8200
Twitter: @IDC
idc-community.com
www.idc.com
Copyright Notice
This IDC research document was published as part of an IDC continuous intelligence service, providing written
research, analyst interactions, telebriefings, and conferences. Visit www.idc.com to learn more about IDC
subscription and consulting services. To view a list of IDC offices worldwide, visit www.idc.com/offices. Please
contact the IDC Hotline at 800.343.4952, ext. 7988 (or +1.508.988.7988) or sales@idc.com for information on
applying the price of this document toward the purchase of an IDC service or for information on additional copies
or web rights.
Copyright 2017 IDC. Reproduction is forbidden unless authorized. All rights reserved.