Data Science - Journey through Life Cycles. Part 1 (2022)

Introduction

Data Science and its steps and phases are widely discussed. This article explores the Data Science lifecycle and the steps in each phase, and it is a good starting point for beginners on their Data Science journey.

Data Science Lifecycle

By a simple definition, Data Science is a multi-disciplinary field that applies multiple processes to extract knowledge or useful output from input data. The output may be predictive or descriptive analysis, a report, Business Intelligence, etc. Like any other project, a Data Science project follows a well-defined lifecycle, and CRISP-DM and TDSP are two of the proven standards.

CRISP-DM: Cross-Industry Standard Process for Data Mining

TDSP: Team Data Science Process by Microsoft

Let's check the common phases of a Data Science lifecycle:

  1. Business Understanding
  2. Data Understanding
  3. Data Explore and Preparation
  4. Create and Evaluate Model
  5. Deploy Model and turn out effective output

Each phase has its own steps and rules to achieve the desired outcome. Iterating over each phase and its components makes the Data Science output more accurate. In this article, we discuss the first 3 phases of the Data Science lifecycle: Business Understanding, Data Understanding, and Data Explore and Preparation.

Business Understanding

Business Understanding is a key phase in any SDLC, but it is even more critical in a Data Science lifecycle. If we misunderstand the business, we end up with the wrong outcome, or with a good prediction that the customer does not accept. The main steps in this phase are:

Identify Stakeholder(s)

Stakeholders are Business Analysts or domain experts, and their responsibility is to answer the Data Scientist's business queries in any phase of the Data Science lifecycle.

Set Objective

Understand the business problem and identify whether it is suited to an analytical solution, in other words, whether Data Science can address the business problem. To achieve this, the Data Scientist frames the business objective by asking stakeholders relevant and sharp questions. Please find this blog for some relevant questions to be asked to stakeholders.

After this step, Data Scientist summarizes:

  1. The proper business requirement, defined as a question such as: how to increase business profit from 50 to 100, how to prevent customer churn, etc.

  2. The Data Science problem type identified from the business requirement. Some Data Science problem types are listed below.

Data Science Problem Type
  • Predictive Analysis - What will happen in the future?
  • Descriptive Analysis - What happened in the past and what is happening now?
  • Prescriptive Analysis - What should be done to enhance or prevent current or future happenings?

Identify and Define Target Variable

Each Data Science project works with either Supervised or Unsupervised learning data.

Supervised learning - In Supervised learning, the Data Scientist identifies the input features and the output (target) feature. There are 2 types of Supervised learning.

  1. Classification Problem

    1. Binary Classification - The identified target feature value is either 0 or 1. Examples - whether people survived or not, whether an email is spam or not, etc.

    2. Category (Multi-class) Classification - The identified target feature contains multiple discrete values. Example - how products are tagged by different categories.

  2. Regression Problem - The identified target feature value is of a continuous type. Example - expected mileage for different types of cars.

Unsupervised learning - Learns from data without being given the correct output feature and mainly focuses on grouping or clustering the data. Example - identify human behavior from video visuals.
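As a rough illustration of the problem types above, the following sketch guesses the problem type from the values of a target feature. The function name `infer_problem_type` and its rules are assumptions made for this example only. In a real project, the decision also depends on domain knowledge, for example when a target is stored as integer codes.

```python
from numbers import Number


def infer_problem_type(target_values):
    """Guess the learning problem from the target feature values.

    Hypothetical helper for illustration only, not a standard API.
    """
    # No target feature at all: only grouping/clustering is possible.
    if not target_values:
        return "unsupervised (no target feature)"

    distinct = set(target_values)

    # Exactly two distinct values, e.g. 0/1 or spam/not spam.
    if len(distinct) == 2:
        return "binary classification"

    # Numeric values with at least one fractional-capable (float) value
    # are treated as a continuous target.
    all_numeric = all(
        isinstance(v, Number) and not isinstance(v, bool) for v in distinct
    )
    if all_numeric and any(isinstance(v, float) for v in distinct):
        return "regression"

    # Everything else: several discrete labels.
    return "multi-class classification"


print(infer_problem_type([0, 1, 1, 0]))                # survived or not
print(infer_problem_type(["toys", "books", "food"]))   # product category
print(infer_problem_type([18.5, 24.1, 31.0]))          # car mileage
print(infer_problem_type([]))                          # no target given
```

The rules are deliberately simple. They only show how the shape of the target feature maps to the problem types listed above.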

In a Data Set, each column is a Feature and each row is an Observation.

Data Science Project Execution Plan

Use a project management tool like VSTS, JIRA, etc. to create a project execution plan and track each milestone and deliverable in the different stages of the Data Science lifecycle.

Data Understanding

The main steps in this phase are:

Collect Proper Data Set

The Data Scientist collects a proper data set that covers all business objectives and ensures that it has the input features required to answer all business questions. Data might be stored in a CSV file, a database, or other formats and storage media. We can either download the entire data set from its source or access it through data streaming using a secured API.
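A minimal sketch of this step, assuming the data arrives as a CSV file: it loads the rows and checks that the required input features are present. The column names, the sample data, and the `load_rows` helper are hypothetical and only serve the example. An in-memory string stands in for a real file.

```python
import csv
import io

# Hypothetical sample that stands in for a real CSV file.
SAMPLE_CSV = """customer_id,age,plan,churned
1,34,basic,0
2,51,premium,1
3,27,basic,0
"""

# Assumed features needed to answer the business question.
REQUIRED_FEATURES = {"age", "plan", "churned"}


def load_rows(csv_file):
    """Read all observations and verify the required features exist."""
    reader = csv.DictReader(csv_file)
    missing = REQUIRED_FEATURES - set(reader.fieldnames or [])
    if missing:
        raise ValueError(
            f"Data set is missing required features: {sorted(missing)}"
        )
    return list(reader)


rows = load_rows(io.StringIO(SAMPLE_CSV))
print(len(rows), "observations loaded")
```

With a real file, `open("path/to/file.csv", newline="")` would replace the `io.StringIO` object.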

Setup Environment for Data

Set up the data hosting environment after the Data Scientist collects the Data Set. This environment might be a local computer, the cloud, on-premises, etc. Examples of cloud environments are Azure Blob Storage and Azure SQL Database.

Also, data is sometimes in an unstructured format and has to be converted to a structured format before it is analyzed.

Some Azure environments are:

  • Azure Notebooks - Free subscription and supports multiple target environments and URL is - https://notebooks.azure.com/
  • Data Science Virtual Machine (DSVM) / Deep Learning Virtual Machine (IaaS solution) - Customized virtual machines on Azure with different tools preconfigured and preinstalled
  • Cloud-based Notebook VM
  • ML Studio - UI based tool from Microsoft - https://studio.azureml.net/
  • ML Service - PaaS solution

Setup Tools and Package

Set up tools and install packages for data processing. Some tools used in Data Science are Python, R, Azure Machine Learning, SQL, RapidMiner, TensorFlow, etc. Each tool has multiple packages to process, manipulate, and visualize data. pandas and NumPy are two of the main packages for Python.

Identify Feature Category

Each feature in a Data Set is broadly divided into either a Categorical type (for strings) or a Continuous type (for numbers). The Categorical type is further divided into Nominal, Ordinal, and Binary.

Feature Type - Description
  • Categorical (or Nominal) - One or more categories, but not quantitative values. Example - Color with values Red, Green, Yellow, etc., which have no numeric significance compared to each other.
  • Ordinal - One or more categories that can be ranked against each other. Example - Bug Priority with values Low, Medium, High, and Critical, which have a meaningful order.
  • Binary - The value of this feature is one of two values, encoded as 1 or 0. Example - Male or Female.
  • Continuous - A numeric value anywhere between negative infinity and positive infinity. Example - Age.

Also, there are some other feature types like - Interval, Image, Text, Audio, and Video.
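The feature categories can be sketched in code as follows. The `feature_category` helper and the sample data are assumptions for illustration. Note that the helper cannot tell Ordinal from Nominal: whether Low, Medium, High, and Critical have a meaningful order is domain knowledge that the values alone do not reveal.

```python
def feature_category(values):
    """Roughly categorize one feature (column) from its values.

    Hypothetical helper: ordinal features are reported as "categorical"
    because their order cannot be inferred from the values themselves.
    """
    present = [v for v in values if v is not None]
    if not present:
        return "unknown"

    # Exactly two distinct values, e.g. Male/Female or 1/0.
    if len(set(present)) == 2:
        return "binary"

    # Only numbers (booleans excluded) are treated as continuous.
    if all(isinstance(v, (int, float)) and not isinstance(v, bool)
           for v in present):
        return "continuous"

    return "categorical"


# Made-up data set: each key is a feature, each list holds its observations.
data_set = {
    "age": [34, 51, 27, 45],
    "gender": ["Male", "Female", "Female", "Male"],
    "color": ["Red", "Green", "Yellow", "Red"],
    "bug_priority": ["Low", "High", "Critical", "Medium"],
}

categories = {name: feature_category(col) for name, col in data_set.items()}
for name, category in categories.items():
    print(f"{name}: {category}")
```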

Data Explore and Preparation

The main steps in this phase are:

Data Explore

In this phase, the Data Scientist becomes familiar with each feature in the Data Set. This includes identifying the feature type (Categorical, Continuous, etc.), how the data is spread and distributed, whether there is any relationship between two features, etc.

The Data Scientist uses different visualization tools to explore the data. This step helps to fix missing values, extreme values (outliers), and noisy values in the Data Preparation phase, and to determine a proper scaling factor in the model creation step. Here, we mainly focus on two feature types: Categorical (assumed for all string types) and Continuous (for numeric types). Let's discuss different approaches to data exploration.

Data Exploring - Continuous Feature Type
  • Check Centrality Measure - The best single value that summarizes the measurements of a specific feature
  • Check Dispersion Measure - How the data of a specific feature is spread and distributed

Centrality Measure
  • Mean - Average value, but it is affected by extreme values.
  • Median - Central value of the sorted list. It is not affected by extreme values, so it is the better centrality measure when the data has extreme values.

Dispersion Measure
  • Range - Difference between the maximum and minimum values. If the difference is small, the data is closely packed; otherwise, it is widely spread.
  • Box-Whisker Plot - This chart shows how the data is spread using percentiles such as 25%, 50%, and 75%, together with the minimum and maximum. It also shows any low and high outlier values. Example: for an Age feature with 1000 values, if the 25th percentile is 32, then 25% of the values are below 32.
  • Variance - Represents how far the values are from the mean, measured as the average squared distance. A small variance means the data is closely packed; otherwise, it is widely spread.
  • Histogram - This graph shows how the data is spread across different bins, which helps to identify outliers and missing values. From a histogram, we can identify a normal distribution (skewness = 0), a positively skewed distribution, and a negatively skewed distribution.
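The centrality and dispersion measures for a Continuous feature can be computed with Python's standard `statistics` module, as in the sketch below. The `ages` list is made-up sample data with one extreme value, and the 1.5 x IQR fences are the common box-plot convention for flagging outliers, which is an assumption of this example rather than something the article prescribes.

```python
import statistics

# Made-up Age feature with one extreme value (95).
ages = [22, 25, 29, 31, 32, 35, 38, 41, 44, 95]

# Centrality measures.
mean_age = statistics.mean(ages)      # pulled upwards by the extreme value
median_age = statistics.median(ages)  # robust against the extreme value

# Dispersion measures.
value_range = max(ages) - min(ages)
variance = statistics.pvariance(ages)  # average squared distance from mean

# Quartiles (25%, 50%, 75%), the values a box-whisker plot is drawn from.
q1, q2, q3 = statistics.quantiles(ages, n=4)
iqr = q3 - q1

# Common box-plot convention: values beyond 1.5 * IQR from the quartiles
# are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [a for a in ages if a < lower_fence or a > upper_fence]

print("mean:", mean_age, "median:", median_age)
print("range:", value_range, "variance:", variance)
print("quartiles:", q1, q2, q3)
print("outliers:", outliers)
```

Here the mean is larger than the median, which is the typical sign of a positively skewed distribution caused by the extreme value.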

Data Exploring - Categorical Feature Type

The following options are used to explore Categorical features.

  • Total count for category
  • Unique count for category
  • Bar chart to show the status of each individual category
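The options above can be sketched with `collections.Counter` from the standard library. The `colors` list is made-up sample data, and the text output is only a stand-in for a real bar chart.

```python
from collections import Counter

# Made-up Categorical feature.
colors = ["Red", "Green", "Red", "Yellow", "Red", "Green"]

total_count = len(colors)    # total number of observations
counts = Counter(colors)     # occurrences of each category
unique_count = len(counts)   # number of distinct categories

print("total:", total_count, "unique:", unique_count)

# Text stand-in for a bar chart: one '#' per observation.
for category, count in counts.most_common():
    print(f"{category:<8}{'#' * count} ({count})")
```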

Also, Scatter Plots, Grouping, Cross Tabs, Pivot Tables, and other statistical functions can be used to explore both Continuous and Categorical data. Each Data Science language has plenty of frameworks and packages to practice the methods explained above. Scatter plots (to check the relation between two features) and line charts (for Time series data) are mainly used in advanced data analysis.

IDEAR and AMAR are custom-built utilities for data exploration, and they provide a clear data insight report for each feature based on its type.

Data Preparation

Based on Data Explore, some abnormal behaviors can be identified, such as missing data, extreme values (outliers), and noisy data. These may reduce the accuracy of the Data Science output, so it is recommended to fix them before creating a model.

  • Remove Unwanted Features - Remove features that have no impact on the data analysis, like Name, Roll Number, etc.
  • Fix missing values
    • Remove the rows (observations) that contain the missing value
    • Imputation
      • Replace the missing value with a plausible value such as the Mean or Median, or a value within the feature's Range
      • Replace the missing value with a dummy value
      • Replace it with the most frequent value, which is more applicable to Categorical features
    • Forward or backward fill, especially for Time series data
  • Fix Outliers (Extreme values) - Outliers can be identified with a Histogram, Box Plot, or Scatter plot
    • Remove the rows (observations) that contain the outlier
    • Imputation
      • Replace the outlier with a plausible value such as the Mean or Median, or a value within the feature's Range
      • Replace the outlier with a dummy value
      • Replace it with the most frequent value, which is more applicable to Categorical features
    • Binning - Create discrete categories from the continuous feature, which places each outlier value in one of the bins.
  • Fix Erroneous or Noisy Values - Use the same approach as for outliers
  • Text cleaning - Clean up characters such as Enter (newline) and Tab in text data.
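Some of the fixes listed above can be sketched with plain Python, as shown below. Missing values are represented by `None`, and all function names, bin boundaries, and sample values are assumptions for illustration. Packages such as pandas offer ready-made equivalents.

```python
import statistics


def impute_median(values):
    """Replace missing (None) values of a Continuous feature with the median."""
    fill = statistics.median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def impute_most_frequent(values):
    """Replace missing values of a Categorical feature with the most frequent value."""
    fill = statistics.mode(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def forward_fill(values):
    """Carry the last seen value forward, e.g. for Time series data.

    Leading missing values stay None because nothing precedes them.
    """
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def bin_age(age):
    """Binning with assumed boundaries: an extreme age lands in a normal bin."""
    if age < 18:
        return "child"
    if age < 65:
        return "adult"
    return "senior"


def clean_text(text):
    """Collapse newlines, tabs and repeated spaces into single spaces."""
    return " ".join(text.split())


print(impute_median([30, None, 40, 50]))
print(impute_most_frequent(["Red", None, "Green", "Red"]))
print(forward_fill([1.0, None, None, 2.5]))
print(bin_age(95))
print(clean_text("line one\n\tline two"))
```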

In some cases, missing or outlier values cannot be fixed, so the Data Scientist creates two models, one that excludes the missing data and outliers and one with all the data, and takes the average of both results as the output.

Feature Engineering

Feature Engineering is the process of transforming data into features that are easier to understand, in order to create a better predictive model. It is a crucial step in any Data Science journey and cannot be done without proper domain knowledge.

Examples: Derive new feature called IsAdult from the Age feature, like IsAdult = 1 if Age > 18 else 0.
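The IsAdult example can be written as a short sketch. The `people` records are made-up sample data, and the rule is taken as stated above, so an Age of exactly 18 gives 0.

```python
# Made-up observations with an Age feature.
people = [
    {"id": 1, "age": 17},
    {"id": 2, "age": 18},
    {"id": 3, "age": 42},
]

# Derive the new feature with the rule from the text:
# IsAdult = 1 if Age > 18 else 0 (so an Age of exactly 18 gives 0).
for person in people:
    person["is_adult"] = 1 if person["age"] > 18 else 0

for person in people:
    print(person)
```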

Categorical Feature Encoding

Most Machine Learning algorithms work on numeric data and not on Categorical data. So the Data Scientist needs to convert Categorical features to numeric features.

The following methods can be used to convert a categorical feature to a numeric feature, based on the feature type.

Binary Encoding - Use if the Category feature has only 2 values, like Male/Female. In this case, the Data Scientist creates a new feature like Is_Male with value 0 or 1 accordingly.

Label Encoding - Use if the Category type is Ordinal, i.e., the feature values have a clear intrinsic sort order. Example: Software Bug severity - Low, Medium, High, and Critical. Here, the Data Scientist assigns 1 to 4 for the Low to Critical values.

One Hot Encoding - Used for the Nominal Category type. Example - Color with values Red, Blue, and Green. Here, we have to create 3 features, color_red, color_blue, and color_green, and fill in the values appropriately.
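The three encodings can be sketched in plain Python as follows. The function names and default arguments are assumptions for illustration, while the feature values follow the examples in the text. Libraries such as pandas and scikit-learn provide their own encoders for real projects.

```python
# Assumed intrinsic order for the Ordinal feature from the text.
SEVERITY_ORDER = {"Low": 1, "Medium": 2, "High": 3, "Critical": 4}


def binary_encode(value, positive="Male"):
    """Binary Encoding: e.g. Is_Male is 1 for Male and 0 otherwise."""
    return 1 if value == positive else 0


def label_encode(value, order=SEVERITY_ORDER):
    """Label Encoding: map an Ordinal value to its rank (1 to 4)."""
    return order[value]


def one_hot_encode(value, categories=("Red", "Blue", "Green"), prefix="color"):
    """One Hot Encoding: one 0/1 feature per Nominal category."""
    return {f"{prefix}_{c.lower()}": int(value == c) for c in categories}


print(binary_encode("Female"))
print(label_encode("Critical"))
print(one_hot_encode("Blue"))
```

Label Encoding is reserved for Ordinal features because the assigned numbers imply an order. Applying it to a Nominal feature such as Color would suggest an order that does not exist, which is why One Hot Encoding is used there.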

So far, we have covered the first 3 phases of the Data Science lifecycle - Business Understanding, Data Understanding, and Data Explore and Preparation - and the data is now ready for model creation. I will discuss the other two phases in the next article, and I welcome your valuable feedback.

Points of Interest

Becoming familiar with Data Science lingo, such as Supervised learning, Feature Engineering, etc., is one of the most important aspects of the Data Science journey.

In this article, I have covered most of the definitions that belong to the first 3 phases, but I recommend reading about some other terms as well, such as Data Correlation, Data Wrangling, Data Factorization and Normalization, and advanced statistical functions.

Thank you for reading this article.

History

  • 29th April, 2019: First version
  • 26th November, 2019: Added Azure environments

Architect, Full Stack Developer, and Exploring Cloud and AI.

Article information

Author: Lidia Grady

Last Updated: 12/01/2022
