The World of Data
A general introduction to data and
information processing
Jack G. Zheng
Fall 2022
http://zheng.kennesaw.edu/teaching/it3703
http://zheng.kennesaw.edu/teaching/it7123
IT 3703 Intro to Analytics and Technology
IT 7123 Business Intelligence
Overview
• Fundamental data-related concepts and terms
– What data is, and which data we are concerned with in business processing
– Data, information, knowledge
• Data and data technology in today’s world
– Data sources
– Value and uses
– Data characteristics
– Data challenges
– Data technology and capabilities
• Data knowledge areas
– Data management, data engineering, data analytics, and data science
– DAMA data knowledge areas
• Practical job roles and career paths
– Focus on data management, data engineering, data analytics, business intelligence, data science, and information/knowledge management
– Data-related jobs and careers, corresponding to knowledge areas
This lecture focuses on fundamental, high-level concepts and a philosophical view of data (instead of technical details), and on a general view of the industry and technology. It sets the context for our main topic of data analytics and technology.
Basic Data Terms and Concepts
• DIKW
• Data type
• Data format
• Data model
• Data structure
• Data processing
DIKW
• The DIKW hierarchy depicts relationships between data, information,
knowledge (and wisdom).
– Data: raw value elements or facts
– Information: the result of collecting and organizing data so that it provides context and meaning
– Knowledge: an understanding of information that provides insight, making the information useful and actionable
– Wisdom: an understanding of interactions and an integrated view, including implications and indirect results beyond a target domain
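The first three levels can be sketched with a few lines of Python. This is only an illustration under stated assumptions: the sales figures, the variable names, and the 10% threshold are hypothetical and do not come from the lecture.

```python
# Data: raw values with no context (hypothetical daily sales counts).
daily_sales = [120, 135, 128, 90, 85, 80]

# Information: organize the data so that it carries context and meaning,
# here by comparing the average of the first and the last three days.
first_half = daily_sales[:3]
second_half = daily_sales[3:]
avg_first = sum(first_half) / len(first_half)
avg_second = sum(second_half) / len(second_half)

# Knowledge: interpret the information so that it becomes actionable.
# The 10% threshold is an assumption made for this sketch.
change = (avg_second - avg_first) / avg_first
sales_dropped = change < -0.10

print(f"Average sales changed by {change:.0%}")
print("Investigate the drop" if sales_dropped else "No action needed")
```

Wisdom is deliberately left out: judging implications beyond the target domain is not something a short script captures.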
Extended readings on DIKW
• https://en.wikipedia.org/wiki/DIKW_pyramid
• https://www.i-scoop.eu/big-data-action-value-context/dikw-model/
• https://towardsdatascience.com/rootstrap-dikw-model-32cef9ae6dfb
• https://www.youtube.com/watch?v=K4i2FK52698
• https://www.youtube.com/watch?v=jSWC23mHXJM
Image from https://www.ontotext.com/knowledgehub/fundamentals/dikw-pyramid/
Core readings on DIKW
• https://www.ontotext.com/knowledgehub/fundamentals/dikw-pyramid/
• DIKW Pyramid https://www.youtube.com/watch?v=u9DoQ9gY4z4
• DIKW Pyramid with Sample Data https://www.youtube.com/watch?v=MFUyQsJyKgg
Data Related Terms
• Raw data
– Directly generated from an event, preserved in its original format
• Processed data
– Transformed or cleaned using any techniques for the purpose of
storage, analysis, or presentation
• Facts
– Data that represents reality or things that actually happened
• Simulated/forecast/estimated data
– Data generated from mathematical or statistical models
• Measures/metrics
– A piece of data used to measure an activity or a phenomenon
– Calculated from underlying factors/data; a kind of processed data
Data Type
• A data type is an attribute of data that tells a computer system how to interpret and process its value, and tells a human how the value can be used.
• Data types may include
– Simple (primitive) numeric types: int, decimal, etc.
– Qualitative (textual) and composite types: string, text, etc.
– Extended (app specific) data types: date/time, Boolean, money, geo, etc.
– Abstract data types: object, array, etc.
– Digital multimedia forms such as sound, image, and video
• The exact definition and application of data types depend on the system.
Extended reading: https://en.wikipedia.org/wiki/Data_type
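A small Python sketch may help show how the data type changes the way a system interprets and processes the same value. The values and variable names are made up for illustration, and other systems define their types differently.

```python
from datetime import date
from decimal import Decimal

raw = "2022"  # the same four characters, interpreted under different types

as_text = raw + "1"             # string: "+" concatenates
as_int = int(raw) + 1           # integer: "+" adds
as_date = date(int(raw), 1, 1)  # date: supports calendar operations

# A money-like value: binary floats accumulate rounding error,
# while Decimal keeps exact decimal digits.
money_float = 0.1 + 0.2
money_decimal = Decimal("0.1") + Decimal("0.2")

print(as_text, as_int, as_date.isoformat())
print(money_float == 0.3, money_decimal == Decimal("0.3"))
```

The last line is why many systems offer a dedicated decimal or money type instead of relying on floating-point numbers.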
Data Model
• A data model is a representation of data, together with a set of business concepts and rules.
– A data model conceptualizes data elements and standardizes how they relate to one another
– It answers questions like: What is this data about? How is the data grouped and associated? How are the elements related?
– Extended reading: https://en.wikipedia.org/wiki/Data_model
• Data models depict and enable an organization to
understand its data assets through core building blocks
such as entities, relationships, and attributes. These
represent the core concepts of the business such as
customer, product, employee, and more.
• Typical examples: flat model, relational model, network
model, dimensional model
Data models will be covered in more detail in IT 3703 module 3 and IT 7123 module 3.
Example: relational model
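As a minimal sketch of a relational model, the following uses Python's built-in sqlite3 module with an in-memory database. The customer, product, and purchase tables, their columns, and the sample rows are hypothetical; they only illustrate entities, attributes, and relationships expressed through keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Entities become tables, attributes become columns,
# and relationships are expressed with foreign keys.
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE product (
    product_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    price      REAL NOT NULL
);
CREATE TABLE purchase (
    purchase_id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    product_id  INTEGER NOT NULL REFERENCES product(product_id),
    quantity    INTEGER NOT NULL
);
INSERT INTO customer VALUES (1, 'Ada');
INSERT INTO product  VALUES (10, 'Notebook', 2.50);
INSERT INTO purchase VALUES (100, 1, 10, 4);
""")

# Following the relationships answers "who bought what, and for how much?"
rows = conn.execute("""
SELECT c.name, p.name, pu.quantity * p.price
FROM purchase pu
JOIN customer c ON c.customer_id = pu.customer_id
JOIN product  p ON p.product_id  = pu.product_id
""").fetchall()
print(rows)
```

The purchase table holds no customer or product names of its own; it only references the other two tables, which is how the model keeps each fact in one place.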
Data Format or File Format
• A data format defines how data and information are structured and recorded in a computer file, particularly a flat file.
• Flat files are machine readable, meaning that the data is formatted so that a computer program can read and process it automatically. Machine-readable data must have some structure, even if that structure is implicit or defined outside the file.
– http://opendatahandbook.org/glossary/en/terms/machine-readable/
• Three major data formats are used for flat files in today’s analytics
– Comma Separated Values (CSV) https://en.wikipedia.org/wiki/Comma-separated_values
– JavaScript Object Notation (JSON)
– eXtensible Markup Language (XML)
• They are constantly used in data download/export, transfer, and storage.
– Example: https://schoolgrades.georgia.gov/dataset
Data formats will be covered in more detail in IT 3703 module 5 and IT 7123 module 5.
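To make the three formats concrete, the sketch below writes the same two records as CSV, JSON, and XML using only the Python standard library. The records, field names, and XML element names are invented for illustration and are not taken from any real dataset.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical records used only for this sketch.
records = [
    {"school": "North High", "grade": "A"},
    {"school": "South High", "grade": "B"},
]

# CSV: one header row, then one line per record.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["school", "grade"], lineterminator="\n")
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()

# JSON: a list of objects with named fields.
json_text = json.dumps(records, indent=2)

# XML: nested elements; here the grade is stored as an attribute.
root = ET.Element("schools")
for record in records:
    item = ET.SubElement(root, "school", grade=record["grade"])
    item.text = record["school"]
xml_text = ET.tostring(root, encoding="unicode")

print(csv_text)
print(json_text)
print(xml_text)
```

The content is identical in all three outputs; only the way structure is expressed differs (position and delimiters in CSV, named keys in JSON, tags and attributes in XML).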
Data Structure
• In computer science, a data structure is a data organization, management, and storage format that enables efficient access and modification. More precisely, a data structure is a collection of data values, the relationships among them, and the functions or operations that can be applied to the data, i.e., it is an algebraic structure about data.
– https://en.wikipedia.org/wiki/Data_structure
• Data structure is a more technical and lower-level term. It emphasizes a structure or model designed for a computer system (rather than a human) to process efficiently.
• For example, arrays and dictionaries are data structures. The classes that implement your data model are data structures too: any representation of a specific data object has to take the form of a data structure.
– https://stackoverflow.com/questions/24228038/difference-between-datastructure-and-datamodel-with-example
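The following Python sketch shows the same product objects held in two different data structures. The Product class, its fields, and the sample values are hypothetical and chosen only to show how the structure determines which operations are efficient.

```python
from dataclasses import dataclass


@dataclass
class Product:
    """A class from a (hypothetical) data model; itself a data structure."""
    sku: str
    name: str
    price: float


# Array (Python list): ordered, accessed by position.
products = [
    Product("A1", "Pen", 1.2),
    Product("B2", "Notebook", 2.5),
    Product("C3", "Stapler", 7.0),
]


def find_in_list(sku):
    """Finding a product by SKU in a list means scanning the elements."""
    for product in products:
        if product.sku == sku:
            return product
    return None


# Dictionary: accessed by key, so lookup by SKU needs no scan.
products_by_sku = {product.sku: product for product in products}

print(products[0].name)
print(find_in_list("B2").price)
print(products_by_sku["C3"].name)
```

Both containers hold the same objects; the choice between them depends on whether the program mostly needs positional access or lookup by key.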
Structured, Semi-structured, and Unstructured
• Structured data
– Structured data is data whose elements are addressable for effective analysis. It has been organized into a formatted repository, typically a database. It covers all data that can be stored in a SQL database in tables with rows and columns. Such data has relational keys and can easily be mapped into pre-designed fields. Today it is the most commonly processed kind of data and the simplest to manage. Example: relational data.
• Semi-structured data
– Semi-structured data is information that does not reside in a relational database but has some organizational properties that make it easier to analyze. With some processing it can be stored in a relational database (although this can be very hard for some kinds of semi-structured data), but the semi-structured form exists to save space. Example: XML data.
• Unstructured data
– Unstructured data is data that is not organized in a predefined manner or does not have a predefined data model, so it is not a good fit for a mainstream relational database. Alternative platforms exist for storing and managing it. It is increasingly prevalent in IT systems and is used by organizations in a variety of business intelligence and analytics applications. Examples: Word, PDF, text, media logs.
• Key readings
– https://www.geeksforgeeks.org/difference-between-structured-semi-structured-and-unstructured-data/
– https://www.g2.com/articles/structured-vs-unstructured-data
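As a rough sketch of the "with some processing" step for semi-structured data, the code below flattens a small XML document into rows with fixed columns, the shape a relational table expects. The XML content, element names, and column list are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical semi-structured input: the second order has no <note>.
xml_text = """
<orders>
  <order id="1"><customer>Ada</customer><note>Gift wrap</note></order>
  <order id="2"><customer>Grace</customer></order>
</orders>
"""

# A relational table needs the same columns for every row,
# so missing elements have to become explicit empty values (None).
columns = ["id", "customer", "note"]
rows = []
for order in ET.fromstring(xml_text):
    rows.append({
        "id": order.get("id"),
        "customer": order.findtext("customer"),
        "note": order.findtext("note"),  # None when the element is absent
    })

for row in rows:
    print([row[column] for column in columns])
```

The tags give the data enough organization to be parsed automatically, but records are free to omit or add elements, which is what distinguishes this from fully structured data.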
Data Processing
• Data processing is, generally, "the collection and manipulation of items of data to produce meaningful information" (particularly using computing technologies and systems).
– https://en.wikipedia.org/wiki/Data_processing
• Data processing can be divided into two broad categories
– Transactional processing
– Analytical processing
Types of Data/Information Processing
Transactional Processing
• Focuses on data item processing (storage, insertion, modification, deletion), transmission, and even some non-analytical queries
• Examples
– Change product price.
– Increase customer credit limit.
– Import data from another source.
Analytical Processing
• Focuses on queries, calculation, reporting, analysis, and decision support
• Examples
– What are the top 10 most profitable products?
– Is there a significant increase in operational cost?
For a more detailed comparison of OLTP and OLAP:
https://techdifferences.com/difference-between-oltp-and-olap.html
https://www.ibm.com/cloud/blog/olap-vs-oltp
Notice the difference between these terms as general concepts vs. as particular technologies/systems.
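The two kinds of processing can be contrasted in a short sqlite3 sketch. The tables, columns, and numbers are hypothetical, and a single in-memory database stands in for what would often be separate transactional and analytical systems in practice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (
    product_id INTEGER PRIMARY KEY, name TEXT, price REAL, unit_cost REAL
);
CREATE TABLE sale (
    sale_id INTEGER PRIMARY KEY, product_id INTEGER, quantity INTEGER
);
INSERT INTO product VALUES
    (1, 'Pen', 1.50, 0.50), (2, 'Notebook', 3.00, 1.00), (3, 'Stapler', 8.00, 6.50);
INSERT INTO sale VALUES (1, 1, 100), (2, 2, 40), (3, 3, 10), (4, 2, 20);
""")

# Transactional processing: change one product's price.
# It touches a single row and is committed as one transaction.
with conn:
    conn.execute("UPDATE product SET price = ? WHERE product_id = ?", (3.50, 2))

# Analytical processing: which products are the most profitable?
# It reads many rows and aggregates them to support a decision.
top_products = conn.execute("""
SELECT p.name, SUM(s.quantity * (p.price - p.unit_cost)) AS profit
FROM sale s
JOIN product p ON p.product_id = s.product_id
GROUP BY p.product_id
ORDER BY profit DESC
LIMIT 10
""").fetchall()
print(top_products)
```

The update changes the current state of one data item, while the query changes nothing and summarizes many items, which mirrors the two example lists above.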
DIKW and Data Processing
• The DIKW model can be loosely related to
the levels of transactional processing and
analytical processing
(Figure labels: Transactional Processing, Analytical Processing)
For more extensive reading: http://en.wikipedia.org/wiki/DIKW_Pyramid
Different opinion: https://hbr.org/2010/02/data-is-to-info-as-info-is-not
Data in Today’s World
• Data is the new oil (of the digital economy)
– https://towardsdatascience.com/is-data-really-the-new-oil-in-the-21st-century-17d014811b88
– Considered one of the important resources, just like financial and human resources.
• Big data
– Four V’s: Volume (Scale), Variety, Velocity (Speed), Veracity
(Uncertainty)
• Data technology and capabilities advancement
– Computing power and storage capacity increases
– New processing techniques such as parallel and distributed
processing, AI, etc.
• Evolving user needs
– Analytical needs
– Communication needs
Big Data
• Big Data is high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery, and process optimization.
• Basic 4Vs (Gartner)
• “Big Data is not a system; it is simply a way to say that you have a lot of data.”
– https://www.linkedin.com/pulse/big-data-silver-bullet-tomas-kratky
• Volume (Scale)
– Data volume is increasing exponentially, not linearly. Even large amounts of small data can result in Big Data.
• Variety (Complexity)
– Various formats, types, and structures. Big data covers unstructured data and various data formats including text, blob, multimedia, etc.
– Information and knowledge management is the management of both structured data (15% of information) and unstructured data (85% of information), according to the Butler Group.
– 80 percent of business is conducted on unstructured information (Gartner Group).
• Velocity (Speed)
– Data is being generated fast and needs to be processed fast.
• Veracity (Uncertainty)
– Uncertainty due to inconsistency, incompleteness, latency, ambiguities, or approximations.
Figures from https://www.ksi.mff.cuni.cz/~svoboda/courses/192-MDK/lectures/MDK-Lecture-01-Introduction.pdf
Developments in Data Technology Capabilities
• Computing power and capacity (hardware)
– Computing power for data processing has increased: https://towardsdatascience.com/the-future-of-computation-for-machine-learning-and-data-science-fad7062bc27d
– Data storage capacity has increased: https://www.frontierinternet.com/gateway/data-storage-timeline/
– Data generation methods have proliferated, including automated data collection devices and sensors such as IoT devices.
• Advances in algorithms and techniques, such as parallel processing, AI, and machine learning.
• Self-service tools expand the user base.
• Cloud-based systems lower the cost of ownership.
https://cacm.acm.org/magazines/2011/8/114953-an-overview-of-business-intelligence-technology/fulltext
Evolving User Needs
• Data democratization
– Consumption of data reaches the general public
– https://www.techtarget.com/whatis/definition/data-democratization
• Prevalence and wide expectation of data visualizations
• Evolving analytical needs
– Need real-time and most recent data
– Business user driven, agile, instant
– Exploratory and interactive
Common Data Use Challenges
• Information overload
– Too much data and information with varied formats and structures
– Difficulty organizing data for effective access and retrieval
– Difficulty finding useful information (knowledge) in it
– Multiple copies of data exist, sometimes with conflicts
• Data everywhere
– Data in separate systems and different sources; internal and external
– Problem of spreadmarts http://en.wikipedia.org/wiki/Spreadmart
– Over 43 percent of organizations have more than six content stores (Forrester Research).
• Difficulty of access
– We may have the data but cannot access it (or it is difficult to get) because of technical or administrative issues.
• Don’t have the data
– The data is simply not available.
– Collecting the data may require additional processes and be costly.
• The organizational data problem: https://www.youtube.com/watch?v=y5-3Pjbk8Zk
Data Knowledge Areas
• The Data Management Association (DAMA) is a non-profit and vendor-independent association of business and technical professionals that is dedicated to the advancement of data resource management (DRM) and information resource management (IRM). https://www.dama.org
• DAMA publishes a guidebook called "The DAMA Guide to the Data Management Body of Knowledge" (DAMA-DMBOK). It defines 11 data management knowledge areas.
– Overview of DMBOK: https://www.dama-dk.org/onewebmedia/DAMA%20DMBOK2_PDF.pdf
11 Knowledge Areas by DAMA
1. Data Governance – planning, oversight, and control over management of data and the use of data and data-related resources. While we understand that governance covers ‘processes’, not ‘things’, the common term for Data Management Governance is Data Governance, and so we will use this term.
2. Data Architecture – the overall structure of data and data-related resources as an integral part of the enterprise architecture
3. Data Modeling & Design – analysis, design, building, testing, and maintenance (was Data Development in the DAMA-DMBOK 1st edition)
4. Data Storage & Operations – structured physical data assets storage deployment and management (was Data Operations in the DAMA-DMBOK 1st edition)
5. Data Security – ensuring privacy, confidentiality and appropriate access
6. Data Integration & Interoperability – acquisition, extraction, transformation, movement, delivery, replication, federation, virtualization and operational support (a Knowledge Area new in DMBOK2)
7. Documents & Content – storing, protecting, indexing, and enabling access to data found in unstructured sources (electronic files and physical records), and making this data available for integration and interoperability with structured (database) data.
8. Reference & Master Data – managing shared data to reduce redundancy and ensure better data quality through standardized definition and use of data values.
9. Data Warehousing & Business Intelligence – managing analytical data processing and enabling access to decision support data for reporting and analysis
10. Metadata – collecting, categorizing, maintaining, integrating, controlling, managing, and delivering metadata
11. Data Quality – defining, monitoring, maintaining data integrity, and improving data quality
Four Categories of Job Roles
• Management role
– Focusing on the management of data and information assets, setting policies and
processes; usually less technical
• Administration role
– Technical administration of data systems, including data storage (database, data
warehouse, etc.), BI systems, reporting system, and other analytics and application
systems
– Maintain and monitor the security, integrity, reliability, and performance of systems
– Usually system specific
• Development role
– Data engineering
• Designing data models, systems, architectures
• Build data pipelines and move data
– Application development
• Build reports, dashboards, and other applications that help analysts, managers, and customers.
• Consume data APIs and services
• Analysis role
– Focusing on analysis and reporting; building analytical models and presenting results
– Involving various degrees of math, statistics, and computing algorithms
Three Careers of Focus
• Data Analyst
– Data Query
– Business Domain Knowledge
– Programming knowledge
– Scripting & Statistical skills
– Reporting & data visualization
– SQL/database knowledge
– Spreadsheet knowledge
• Data Engineer/Developer*
– Data Warehousing & ETL
– Business intelligence, reporting
– Data Analytics
– In-depth knowledge of SQL/database
– Data architecture & pipelining
– Machine learning concept knowledge
– Scripting, reporting & data visualization
• Data Scientist
– Statistical & Analytical skills
– Data Mining
– Machine Learning & Deep learning principles
– In-depth programming knowledge (SAS/R/Python coding)
– Hadoop-based analytics
– Data optimization
– Decision making and soft skills
Table adapted from https://www.edureka.co/blog/data-analyst-vs-data-engineer-vs-data-scientist/
BSIT/MSIT focus
Data Engineer
• The role of the data engineer is mostly to ensure the quality and availability of data. This includes the following most important tasks
– Build and maintain data pipeline systems
– Clean and wrangle data into a usable state
– Design/build data storage systems, architectures, and
infrastructures
• Data engineer skills
– Programming
– Knowledge of tools and systems
– Data model, structure, format, architecture
– Relational and non-relational database design
– Data storage system design
– Data/information flow
– SQL, query execution and optimization
Reference reading: https://www.oreilly.com/content/data-engineering-a-quick-and-simple-definition/
Extended reading: https://www.altexsoft.com/blog/datascience/what-is-data-engineering-explaining-data-pipeline-data-warehouse-and-data-engineer-role/
Data Science
• Data Science is multidisciplinary
– Computer scientists
– Information technologists
– Statisticians/mathematicians
– Domain experts
• Data in Data Science
– Much the same as “data” in data analytics
• Science in Data Science
– Implying scientific methods
– More exploratory
Another view of jobs and careers
• https://searchdatamanagement.techtarget.com/feature/Data-management-roles-Data-architect-vs-data-engineer-others
Data developer: also includes BI analyst or BI developer.
Data Education at KSU
• BSIT - the new concentration on “data analytics and technology”
– https://www.edocr.com/v/0jmn189y/jgzheng/ksu-bsit-data-concentration-overview
• MSIT/BSIT - Graduate Certificate in Data Analytics and Intelligent Technology
– https://www.edocr.com/v/z1dwxbpy/jgzheng/ksu-msit-data-certificate
– https://msit.kennesaw.edu/future-students/program-requirements.php
• Other departments
– Ph.D. in Analytics and Data Science https://datascience.kennesaw.edu
– ACS 8310 Data Warehousing
– IS 8935 Business Intelligence - Traditional and Big Data Analytics
– Certificate in High Performance Cluster Computing http://ccse.kennesaw.edu/cs/programs/cert-hpcc.php
• For more information
– http://zheng.kennesaw.edu/advising
– Lecture notes on BI and Data Visualization https://www.edocr.com/user/jgzheng
Industry Certifications
• Certiport - Marketing Resource Library (filecamp.com)
Core Readings
• DIKW pyramid: https://www.ontotext.com/knowledgehub/fundamentals/dikw-pyramid/
• Short video lectures on DIKW
– DIKW Pyramid https://www.youtube.com/watch?v=u9DoQ9gY4z4
– DIKW Pyramid with Sample Data https://www.youtube.com/watch?v=MFUyQsJyKgg
• Overview of DMBOK V2 – 11 knowledge areas: https://www.dama-dk.org/onewebmedia/DAMA%20DMBOK2_PDF.pdf
• Data engineering: https://www.oreilly.com/content/data-engineering-a-quick-and-simple-definition/
• Data Analyst vs Data Engineer vs Data Scientist: Skills, Responsibilities, Salary https://www.edureka.co/blog/data-analyst-vs-data-engineer-vs-data-scientist/ - from some job and career perspectives.
Additional Good Resources
• https://techcrunch.com/2021/05/02/data-was-the-new-oil-until-the-oil-caught-fire/
• Data governance https://www.youtube.com/watch?v=sHPY8zIhy60&list=RDCMUCrR22MmDd5-cKP2jTVKpBcQ&index=13
• Data science
– Why Data Science Matters and How It Powers Business Value https://www.simplilearn.com/why-and-how-data-science-matters-to-business-article
– Programming Languages for Data Scientists https://towardsdatascience.com/programming-languages-for-data-scientists-afde2eaf5cc5
– Data Science Tutorial for Beginners https://www.guru99.com/data-science-tutorial.html
– https://www.dataversity.net/ten-myths-about-data-science/
– https://www.dataversity.net/data-science-trends-in-2020/
– https://www.mastersindatascience.org/careers/data-scientist/
– https://towardsdatascience.com/learn-the-art-of-data-science-programming-languages-of-the-decade-a2850830ab76
– https://www.simplilearn.com/tutorials/data-science-tutorial/what-is-data-science
– https://www.simplilearn.com/data-analyst-vs-data-scientist-article
• https://ischoolonline.berkeley.edu/data-science/what-is-data-science/
• More about jobs and careers
– https://www.discoverdatascience.org/career-information/
– Data engineer: https://www.altexsoft.com/blog/datascience/what-is-data-engineering-explaining-data-pipeline-data-warehouse-and-data-engineer-role/
– https://searchdatamanagement.techtarget.com/feature/Data-management-roles-Data-architect-vs-data-engineer-others
– https://dzone.com/articles/five-data-tasks-that-keep-data-engineers-awake-at
– Data analyst: https://www.investopedia.com/articles/professionals/121515/data-analyst-career-path-qualifications.asp
– https://blog.udacity.com/2014/12/data-analyst-vs-data-scientist-vs-data-engineer.html
A general introduction to data and
information processing
Jack G. Zheng
Fall 2022
http://zheng.kennesaw.edu/teaching/it3703
http://zheng.kennesaw.edu/teaching/it7123
IT 3703 Intro to Analytics and Technology
IT 7123 Business Intelligence
Overview
•
Fundamental data related concepts and terms
– What is data, what is the data as we concern in business processing
– Data, information, knowledge
•
The data and data technology in today’s world
– Data sources
– Value and uses
– Data characteristics
– Data challenges
– Data technology and capabilities
• Data knowledge areas
– Data management, data engineering, data analytics, and data science
– DAMA data knowledge areas
• Practical job roles and career paths
– Focus on data management, data engineering, data analytics, business intelligence, and data science,
information/knowledge management
– Data related jobs and careers, corresponding to knowledge areas
2
This lecture focus on fundamental, high-level concepts and philosophical view of
data (instead of technical details), and a general view of the industry and
technology. It sets a context for our main topic on data analytics and technology.
Basic Data Terms and Concepts
• DIKW
• Data type
• Data format
• Data model
• Data structure
• Data processing
3
DIKW
• The DIKW hierarchy depicts relationships between data, information,
knowledge (and wisdom).
– Data: raw value elements or facts
–
Information: the result of collecting and organizing data that provides context and meaning
– Knowledge: the concept of understanding information that provides insight to information,
thus useful and actionable
– Wisdom: the understanding of interactions and an integrated view, and the understanding
of implications and indirect results beyond a target domain.
4
Extended readings on DIKW
•
https://en.wikipedia.org/wiki/DIKW_pyramid
•
https://www.i-scoop.eu/big-data-action-value-
context/dikw-model/
•
https://towardsdatascience.com/rootstrap-
dikw-model-32cef9ae6dfb
•
https://www.youtube.com/watch?v=K4i2FK52
698
•
https://www.youtube.com/watch?v=jSWC23m
HXJM
Image from https://www.ontotext.com/knowledgehub/fundamentals/dikw-pyramid/
Core readings on DIKW
•
https://www.ontotext.com/knowledgehub/fundamentals/dikw-pyramid/
•
DIKW Pyramid https://www.youtube.com/watch?v=u9DoQ9gY4z4
•
DIKW Pyramid with Sample Data https://www.youtube.com/watch?v=MFUyQsJyKgg
Data Related Terms
• Raw data
– Directly generated from an event, preserved in its original format
• Processed data
– Transformed or cleaned using any techniques for the purpose of
storage, analysis, or presentation
• Facts
– Data represents reality or things that actually happened
• Simulated/forecast/estimated data
– Data generated based on mathematical or statistical models
• Measures/metrics
– A piece of data that is used for measuring an activity or a
phenomenon
– Calculated results from underlying factors/data; a kind of processed
data
5
Data Type
• Data type is an attribute of data that tells a computer
system how to interpret and process its value, and
how a human can use its value.
• Data types may include
– Simple (primitive) numeric types: int, decimal, etc.
– Qualitative (textual) and composite types: string, text,
etc.
– Extended (app specific) data types: date/time, Boolean,
money, geo, etc.
– Abstract data types: object, array, etc.
– It even can include more digital multimedia forms like
sound, image, and video.
• The exact definition and application of data types
depend on the system.
6
Extended reading: https://en.wikipedia.org/wiki/Data_type
Data Model
• Data model is about representation
of data, with a set of business
concepts and rules.
– A data model conceptualizes data elements and standardizes
how the data elements relate to one another
• Extended reading: https://en.wikipedia.org/wiki/Data_model
– Answers questions like: How is data grouped and associated?
What is this data about? How are they related?
• Data models depict and enable an organization to
understand its data assets through core building blocks
such as entities, relationships, and attributes. These
represent the core concepts of the business such as
customer, product, employee, and more.
• Typical examples: flat model, relational model, network
model, dimensional model
7
Data model will be covered with more details in
IT 3703 module 3 and IT 7123 module 3.
Example:
relational model
Data Format or File Format
• Data format defines the way how data and information is structured and
recorded in a computer file, particularly flat file.
• Flat files are machine readable, meaning data in flat files is formatted in
a way that it can be automatically read and processed by a computer
program. Machine-readable data must have some structures, even if
they are implicit and may be defined outside the file.
– http://opendatahandbook.org/glossary/en/terms/machine-readable/
• Three major types of data formats are used in flat files in today’s
analytics
– Comma Separated Values (CSV) https://en.wikipedia.org/wiki/Comma-
separated_values
– JavaScript Object Notation (JSON)
– eXtensible Markup Language (XML)
• They are constantly used in data download/export, transfer, and
storage.
– Example: https://schoolgrades.georgia.gov/dataset
8
Data format will be covered with more details
in IT 3703 module 5 and IT 7123 module 5.
Data Structure
•
In computer science, a data structure is a data organization,
management, and storage format that enables efficient access
and modification. More precisely, a data structure is a collection
of data values, the relationships among them, and the functions
or operations that can be applied to the data, i.e., it is an
algebraic structure about data.
– https://en.wikipedia.org/wiki/Data_structure
• A data structure is a more technical and lower-level term. It
emphasizes a structure or a model that targets computer system
(vs. human) for optimal processing.
• For example, an array is a data structure, as well as dictionaries.
Also, the classes that form your data model, are data
structures too, any representation of a specific data object has to
be in form of a data structure.
– https://stackoverflow.com/questions/24228038/difference-between-
datastructure-and-datamodel-with-example
9
Structured, Semi-structured, and Unstructured
• Structured data
– Structured data is data whose elements are addressable for effective analysis. It has been organized into
a formatted repository that is typically a database. It concerns all data which can be stored in database
SQL in a table with rows and columns. They have relational keys and can easily be mapped into pre-
designed fields. Today, those data are most processed in the development and simplest way to manage
information. Example: Relational data.
• Semi-Structured data
– Semi-structured data is information that does not reside in a relational database but that have some
organizational properties that make it easier to analyze. With some process, you can store them in the
relation database (it could be very hard for some kind of semi-structured data), but Semi-structured exist
to ease space. Example: XML data.
• Unstructured data
– Unstructured data is a data which is not organized in a predefined manner or does not have a predefined
data model, thus it is not a good fit for a mainstream relational database. So, for Unstructured data, there
are alternative platforms for storing and managing, it is increasingly prevalent in IT systems and is used
by organizations in a variety of business intelligence and analytics applications. Example: Word, PDF,
Text, Media logs.
• Key readings
– https://www.geeksforgeeks.org/difference-between-structured-semi-structured-and-unstructured-data/
– https://www.g2.com/articles/structured-vs-unstructured-data
10
Data Processing
• Data processing is, generally, "the collection
and manipulation of items of data to produce
meaningful information.” (particularly using
the computing technologies and systems).
– https://en.wikipedia.org/wiki/Data_processing
• Data processing can be considered as two
broad categories
– Transactional processing
– Analytical processing
11
Types of Data/Information Processing
Transactional Processing
• Focus on data item processing
(storage, insertion,
modification, deletion),
transmission, and even some
non-analytical query
Analytical Processing
• Focus on queries, calculation,
reporting, analysis, and
decision support
12
For a more detailed comparison of OLTP and OLAP:
https://techdifferences.com/difference-between-oltp-and-olap.html
https://www.ibm.com/cloud/blog/olap-vs-oltp
• Change product price.
•
Increase customer credit limit.
•
Import data from another source
• What are the top 10 most
profitable products?
•
Is there a significant increase of
operational cost?
Notice the difference between these
terms as general concepts vs. as
particular technologies/systems.
DIKW and Data Processing
• The DIKW model can be loosely related to
the levels of transactional processing and
analytical processing
13
Transactional
Processing
Analytical
Processing
For more extensive reading: http://en.wikipedia.org/wiki/DIKW_Pyramid
Different opinion: https://hbr.org/2010/02/data-is-to-info-as-info-is-not
Data in Today’s World
• Data is the new oil (of the digital economy)
– https://towardsdatascience.com/is-data-really-the-new-oil-in-the-
21st-century-17d014811b88
– Considered as one of important resources, just like financial
resource and human resource.
• Big data
– Four V’s: Volume (Scale), Variety, Velocity (Speed), Veracity
(Uncertainty)
• Data technology and capabilities advancement
– Computing power and storage capacity increases
– New processing techniques such as parallel and distributed
processing, AI, etc.
• Evolving user needs
– Analytical needs
– Communication needs
14
Big Data
• Big Data is high volume, high velocity, and/or high variety information assets
that require new forms of processing to enable enhanced decision making,
insight discovery and process optimization.
• Basic 4Vs (Gartner)
•
“Big Data is not a system; it is simply a way to say that you have a lot of data.
– https://www.linkedin.com/pulse/big-data-silver-bullet-tomas-kratky
15
• Volume (Scale)
– Data volume is increasing exponentially, not linearly. Even large amounts of small data can result in Big Data.
• Variety (Complexity)
– Various formats, types, and structures. Big data covers unstructured data and various data formats, including text, BLOB, multimedia, etc.
– Information and knowledge management is the management of both structured data (15% of information) and unstructured data (85% of information), according to the Butler Group.
– 80 percent of business is conducted on unstructured information (Gartner Group).
• Velocity (Speed)
– Data is being generated fast and needs to be processed fast.
• Veracity (Uncertainty)
– Uncertainty due to inconsistency, incompleteness, latency, ambiguities, or approximations.
Figures from https://www.ksi.mff.cuni.cz/~svoboda/courses/192-MDK/lectures/MDK-Lecture-01-Introduction.pdf
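As a small illustration of veracity problems, the sketch below checks a handful of records for incompleteness, inconsistency, and conflicting copies. The records, field names, and the two-letter state rule are hypothetical, chosen only to make the checks concrete.

```python
# Hypothetical customer records, invented for illustration only.
records = [
    {"id": 1, "email": "ann@example.com", "state": "GA"},
    {"id": 2, "email": None, "state": "Georgia"},
    {"id": 3, "email": "bob@example.com", "state": "GA"},
    {"id": 3, "email": "bob@example.org", "state": "GA"},
]


def incomplete_ids(rows):
    """Ids of rows with at least one missing (None or empty) value."""
    return [r["id"] for r in rows if any(v in (None, "") for v in r.values())]


def inconsistent_state_ids(rows):
    """Ids of rows whose state is not a two-letter uppercase code (assumed rule)."""
    return [
        r["id"]
        for r in rows
        if not (len(r["state"]) == 2 and r["state"].isupper())
    ]


def conflicting_ids(rows):
    """Ids that appear more than once with different contents."""
    first_seen = {}
    conflicts = set()
    for r in rows:
        if r["id"] in first_seen and first_seen[r["id"]] != r:
            conflicts.add(r["id"])
        first_seen.setdefault(r["id"], r)
    return sorted(conflicts)


report = {
    "incomplete": incomplete_ids(records),
    "inconsistent_state": inconsistent_state_ids(records),
    "conflicting_copies": conflicting_ids(records),
}
print(report)
```

Checks like these only flag suspicious records. Deciding which copy is correct still needs business rules or a trusted reference source.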
Developments in Data Technology Capabilities
• Computing power and capacity (hardware)
– Computing power for data processing has increased: https://towardsdatascience.com/the-future-of-computation-for-machine-learning-and-data-science-fad7062bc27d
– Data storage capacity has increased: https://www.frontierinternet.com/gateway/data-storage-timeline/
– Data generation methods have proliferated, including automated data collection devices and sensors such as IoT devices.
• Advances in algorithms and techniques, such as parallel processing, AI, and machine learning.
• Self-service tools expand the user base.
• Cloud-based systems lower the cost of ownership.
https://cacm.acm.org/magazines/2011/8/114953-an-overview-of-business-intelligence-technology/fulltext
Evolving User Needs
• Data democratization
– Consumption of data reaches the general public
– https://www.techtarget.com/whatis/definition/data-democratization
• Prevalence and wide expectation of data visualizations
• Evolving analytical needs
– Need real-time and most recent data
– Business user driven, agile, instant
– Exploratory and interactive
Common Data Use Challenges
• Information overload
– Too much data and information with varied formats and structures
– Difficulty organizing data for effective access and retrieval
– Difficulty finding useful information (knowledge) in it
– Multiple copies of data exist, sometimes with conflicts
• Data everywhere
– Data in separate systems and different sources; internal and external
– Problem of spreadmart http://en.wikipedia.org/wiki/Spreadmart
– Over 43 percent of organizations have more than six content stores. (Forrester Research).
• Difficulty of access
– We may have the data but cannot access it (or it is difficult to get) because of technical or administrative issues.
• Don’t have that data
– The data is simply not available.
– Collecting the data may require additional processes and can be costly.
• The organizational data problem: https://www.youtube.com/watch?v=y5-3Pjbk8Zk
Data Knowledge Areas
• The Data Management Association
(DAMA) is a non-profit and vendor-
independent association of business and
technical professionals that is dedicated
to the advancement of data resource
management (DRM) and information
resource management (IRM).
https://www.dama.org
• DAMA publishes a guidebook called
"The DAMA Guide to the Data
Management Body of Knowledge"
(DAMA-DMBOK). It defines 11 data
management knowledge areas.
– Overview of DMBOK: https://www.dama-dk.org/onewebmedia/DAMA%20DMBOK2_PDF.pdf
11 Knowledge Areas by DAMA
1. Data Governance – planning, oversight, and control over management of data and the use of data and data-related resources. While we understand that governance covers ‘processes’, not ‘things’, the common term for Data Management Governance is Data Governance, and so we will use this term.
2. Data Architecture – the overall structure of data and data-related resources as an integral part of the enterprise architecture
3. Data Modeling & Design – analysis, design, building, testing, and maintenance (was Data Development in the DAMA-DMBOK 1st edition)
4. Data Storage & Operations – structured physical data assets storage deployment and management (was Data Operations in the DAMA-DMBOK 1st edition)
5. Data Security – ensuring privacy, confidentiality and appropriate access
6. Data Integration & Interoperability – acquisition, extraction, transformation, movement, delivery, replication, federation, virtualization and operational support (a Knowledge Area new in DMBOK2)
7. Documents & Content – storing, protecting, indexing, and enabling access to data found in unstructured sources (electronic files and physical records), and making this data available for integration and interoperability with structured (database) data.
8. Reference & Master Data – managing shared data to reduce redundancy and ensure better data quality through standardized definition and use of data values.
9. Data Warehousing & Business Intelligence – managing analytical data processing and enabling access to decision support data for reporting and analysis
10. Metadata – collecting, categorizing, maintaining, integrating, controlling, managing, and delivering metadata
11. Data Quality – defining, monitoring, maintaining data integrity, and improving data quality
Four Categories of Job Roles
• Management role
– Focusing on the management of data and information assets, setting policies and
processes; usually less technical
• Administration role
– Technical administration of data systems, including data storage (database, data warehouse, etc.), BI systems, reporting systems, and other analytics and application systems
– Maintain and monitor the security, integrity, reliability, and performance of systems
– Usually system-specific
• Development role
– Data engineering
• Design data models, systems, and architectures
• Build data pipelines and move data
– Application development
• Build reports, dashboards, and other applications that help analysts, managers, and customers.
• Consume data APIs and services
• Analysis role
– Focusing on analysis and reporting; building analytical models and presenting results
– Involving various degrees of math, statistics, and computing algorithms
Three Careers of Focus
• Data Analyst
– Data query
– Business domain knowledge
– Programming knowledge
– Scripting & statistical skills
– Reporting & data visualization
– SQL/database knowledge
– Spreadsheet knowledge
• Data Engineer/Developer*
– Data warehousing & ETL
– Business intelligence, reporting
– Data analytics
– In-depth knowledge of SQL/database
– Data architecture & pipelining
– Machine learning concept knowledge
– Scripting, reporting & data visualization
• Data Scientist
– Statistical & analytical skills
– Data mining
– Machine learning & deep learning principles
– In-depth programming knowledge (SAS/R/Python coding)
– Hadoop-based analytics
– Data optimization
– Decision making and soft skills
Table adapted from https://www.edureka.co/blog/data-analyst-vs-data-engineer-vs-data-scientist/
BSIT/MSIT focus
Data Engineer
• The role of the data engineer is mostly to ensure the quality and availability of data. This includes the following most important tasks:
– Build and maintain data pipeline systems
– Clean and wrangle data into a usable state
– Design/build data storage systems, architectures, and
infrastructures
• Data engineer skills
– Programming
– Knowledge of tools and systems
– Data model, structure, format, architecture
– Relational and non-relational database design
– Data storage system design
– Data/information flow
– SQL, query execution and optimization
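The pipeline and data-cleaning tasks above can be sketched as a tiny extract, clean, and load flow using only the Python standard library. Everything specific here is an assumption for illustration: the raw CSV text, the payment table, and the cleaning rules (trim and title-case names, drop rows with a missing amount) are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw export, invented for illustration: stray whitespace,
# inconsistent capitalization, and one row with a missing amount.
RAW_CSV = """customer,amount
 alice ,10.50
BOB,
Carol,7
"""


def extract(text):
    """Read raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Normalize customer names and drop rows whose amount is missing."""
    cleaned = []
    for row in rows:
        amount = (row["amount"] or "").strip()
        if not amount:
            continue  # unusable row: no amount recorded
        cleaned.append((row["customer"].strip().title(), float(amount)))
    return cleaned


def load(rows, conn):
    """Write cleaned rows into a (hypothetical) payment table."""
    conn.execute("CREATE TABLE IF NOT EXISTS payment (customer TEXT, amount REAL)")
    with conn:
        conn.executemany("INSERT INTO payment VALUES (?, ?)", rows)


conn = sqlite3.connect(":memory:")
load(clean(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT customer, amount FROM payment ORDER BY customer"
).fetchall()
print(loaded)
```

Keeping extract, clean, and load as separate functions makes each step easy to test and replace. Production pipelines typically add scheduling, logging, and error handling on top of this shape.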
Reference reading: https://www.oreilly.com/content/data-engineering-a-quick-and-simple-definition/
Extended reading: https://www.altexsoft.com/blog/datascience/what-is-data-engineering-explaining-data-pipeline-data-warehouse-and-data-engineer-role/
Data Science
• Data Science is multidisciplinary
– Computer Scientists
– Information Technologists
– Statisticians/Mathematicians
– Domain Experts
• Data in Data Science
– Much the same as “data” in data analytics
• Science in Data Science
– Implying scientific methods
– More exploratory
Another view of jobs and careers
• https://searchdatamanagement.techtarget.com/feature/Data-management-roles-Data-architect-vs-data-engineer-others
Data developer: also includes BI analyst or BI developer.
Data Education at KSU
• BSIT - the new concentration on “data analytics and technology”
– https://www.edocr.com/v/0jmn189y/jgzheng/ksu-bsit-data-concentration-overview
• MSIT/BSIT - Graduate Certificate in Data Analytics and Intelligent Technology
– https://www.edocr.com/v/z1dwxbpy/jgzheng/ksu-msit-data-certificate
– https://msit.kennesaw.edu/future-students/program-requirements.php
• Other departments
– Ph.D. in Analytics and Data Science https://datascience.kennesaw.edu
– ACS 8310 Data Warehousing
– IS 8935 Business Intelligence - Traditional and Big Data Analytics
– Certificate in High Performance Cluster Computing http://ccse.kennesaw.edu/cs/programs/cert-hpcc.php
• For more information
– http://zheng.kennesaw.edu/advising
– Lecture notes on BI and Data Visualization https://www.edocr.com/user/jgzheng
Industry Certifications
• Certiport - Marketing Resource Library (filecamp.com)
Core Readings
• DIKW pyramid: https://www.ontotext.com/knowledgehub/fundamentals/dikw-pyramid/
• Short video lectures on DIKW
– DIKW Pyramid https://www.youtube.com/watch?v=u9DoQ9gY4z4
– DIKW Pyramid with Sample Data https://www.youtube.com/watch?v=MFUyQsJyKgg
• Overview of DMBOK V2 – 11 knowledge areas: https://www.dama-dk.org/onewebmedia/DAMA%20DMBOK2_PDF.pdf
• Data engineering: https://www.oreilly.com/content/data-engineering-a-quick-and-simple-definition/
• Data Analyst vs Data Engineer vs Data Scientist: Skills, Responsibilities, Salary https://www.edureka.co/blog/data-analyst-vs-data-engineer-vs-data-scientist/ (from job and career perspectives)
Additional Good Resources
• https://techcrunch.com/2021/05/02/data-was-the-new-oil-until-the-oil-caught-fire/
• Data governance https://www.youtube.com/watch?v=sHPY8zIhy60&list=RDCMUCrR22MmDd5-cKP2jTVKpBcQ&index=13
• Data science
– Why Data Science Matters and How It Powers Business Value https://www.simplilearn.com/why-and-how-data-science-matters-to-business-article
– Programming Languages for Data Scientists https://towardsdatascience.com/programming-languages-for-data-scientists-afde2eaf5cc5
– Data Science Tutorial for Beginners https://www.guru99.com/data-science-tutorial.html
– https://www.dataversity.net/ten-myths-about-data-science/
– https://www.dataversity.net/data-science-trends-in-2020/
– https://www.mastersindatascience.org/careers/data-scientist/
– https://towardsdatascience.com/learn-the-art-of-data-science-programming-languages-of-the-decade-a2850830ab76
– https://www.simplilearn.com/tutorials/data-science-tutorial/what-is-data-science
– https://www.simplilearn.com/data-analyst-vs-data-scientist-article
• https://ischoolonline.berkeley.edu/data-science/what-is-data-science/
• More about jobs and careers
– https://www.discoverdatascience.org/career-information/
– Data engineer: https://www.altexsoft.com/blog/datascience/what-is-data-engineering-explaining-data-pipeline-data-warehouse-and-data-engineer-role/
– https://searchdatamanagement.techtarget.com/feature/Data-management-roles-Data-architect-vs-data-engineer-others
– https://dzone.com/articles/five-data-tasks-that-keep-data-engineers-awake-at
– Data analyst: https://www.investopedia.com/articles/professionals/121515/data-analyst-career-path-qualifications.asp
– https://blog.udacity.com/2014/12/data-analyst-vs-data-scientist-vs-data-engineer.html