|Data Science Life Cycle|
This is the crucial initial stage of the Data Science Life Cycle. The team has to build a detailed understanding of the specific business domain by noting questions such as: How does the business process work? How many partner or sister organizations are connected to the business? What are the internal and external business problems? How is the customer relationship? Are customers satisfied with the business process? What are the business risks? After building this overall domain knowledge, the team has to figure out the business goals and objectives. The goal is to find the key variables and key metrics that can be used as target variables of the model, so that predicting the target variable's values determines the success of the project.
Another goal is to identify or generate relevant data sources from the domain. Prepare a data science problem statement by asking domain-related questions that map to an analysis technique.
For example:
- How much or how many? (regression)
- Which category? (classification)
- Which group? (clustering)
- Is this weird? (anomaly detection)
- Which option should be taken? (recommendation)
Which question applies depends on your data science project, and it also determines how you can achieve success. Form a team and assign specific roles to the members. Define the success metrics.
After determining and defining the problem statement, and sketching the process outline, it is time to put the gathered knowledge into the deliverables. The deliverables are the Charter Document, Data Sources, and Data Dictionary.
- Charter Document: This is the standard template for tracking the overall data science project within the team. Team members can record new discoveries and research findings, and add or remove requirements as the project specifications evolve. Members have to fill in the details iteratively by engaging with the stakeholders and customers too.
- Data Sources: Data Sources is a part of the Data Report. It describes how the raw data is generated, along with its origin and destination locations.
- Data Dictionary: The data dictionary is a document containing the data descriptions provided by the client (including entity-relationship diagrams and data types).
Identifying and Understanding Data Sources:
Data from sources can be in text, documents, emails, PDFs, logs, reports, and so on. Here, the team needs to adopt the ETL process (Extract, Transform, and Load).
1. Extraction of data:
2. Transformation of data:
- applying business rules and constraints to the data as per the business analytical requirements
- cleaning (e.g., mapping NULL to 0, or "Male" to "M" and "Female" to "F", etc.)
- filtering (e.g., selecting only certain columns to load)
- splitting a column into multiple columns and vice versa
- joining together data from multiple sources (e.g., lookup, merge)
- transposing rows and columns
- applying any kind of simple or complex data validation (e.g., if the first 3 columns in a row are empty then reject the row from processing)
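A few of these transformations can be sketched with pandas; the data and column names below are hypothetical, purely for illustration:

```python
import pandas as pd

# Hypothetical raw extract; column names are illustrative only.
raw = pd.DataFrame({
    "full_name": ["Ada Lovelace", "Alan Turing"],
    "gender": ["Female", "Male"],
    "sales": [None, 120.0],
})

df = raw.copy()

# Cleaning: map "Male"/"Female" to "M"/"F" and NULL sales to 0.
df["gender"] = df["gender"].map({"Male": "M", "Female": "F"})
df["sales"] = df["sales"].fillna(0)

# Splitting one column into multiple columns.
df[["first_name", "last_name"]] = df["full_name"].str.split(" ", expand=True)

# Filtering: keep only the columns needed for loading.
df = df[["first_name", "last_name", "gender", "sales"]]

# Validation: reject rows where every key column is empty.
df = df.dropna(how="all", subset=["first_name", "gender", "sales"])
```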
For example, in a supply chain business the data sources are OLTP systems for the various supply chain transactions and OLAP systems for geospatial tracking of shipments during transportation, plus customer details along with their geographical-area information. Through the ETL process, data from these sources is pulled into the organization's data warehouse for business analytics, and transformed as per business needs. Suppose the team wants to forecast sales for the customers of some geographical area: the acquired data then needs to be transformed by combining the key features (sales data, geographical area, and customer info as key variables) into one large, business-targeted dataset.
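The joining step described above can be sketched with a pandas merge; the table and column names are assumptions for illustration:

```python
import pandas as pd

# Hypothetical extracts from the warehouse; names are illustrative.
sales = pd.DataFrame({"customer_id": [1, 2], "sales": [250.0, 400.0]})
customers = pd.DataFrame({"customer_id": [1, 2], "region": ["North", "South"]})

# Join the key features into one analysis-ready dataset.
dataset = sales.merge(customers, on="customer_id", how="inner")

# Aggregate sales by geographical area for the forecast.
by_region = dataset.groupby("region")["sales"].sum()
```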
3. Load the data:
The goal here is to load the data from the organization's data warehouse (database) or from cloud storage into your analytical environment. The extracted data can be in formats such as .xls or .csv, and in order to import it into the analytics tools, the dataset should be converted into tool-specific data frames.
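A minimal sketch of this import step with pandas, using a small in-memory CSV in place of a real warehouse export:

```python
import io
import pandas as pd

# In practice you would call pd.read_csv("export_from_warehouse.csv");
# here a small in-memory CSV stands in for the exported file.
csv_text = "customer_id,region,sales\n1,North,250.0\n2,South,400.0\n"
df = pd.read_csv(io.StringIO(csv_text))
```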
Data Exploration:
The goal of data exploration is to perform descriptive statistics on qualitative and quantitative data: understand the datatypes and their relations by exploring all the features of the dataset, and understand the categorical and numerical features and their values.
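A quick exploration pass might look like this in pandas (the example data is made up):

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["North", "South", "North"],   # categorical feature
    "sales": [250.0, 400.0, 310.0],          # numerical feature
})

# Datatypes of each feature.
print(df.dtypes)

# Descriptive statistics for a numerical column.
stats = df["sales"].describe()

# Value counts for a categorical column.
counts = df["region"].value_counts()
```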
Data Preparation/Data Wrangling:
Here, the data frame needs to go through several preparation steps using the data cleaning process before moving to the modeling phase, i.e., training the model. The goal is to remove outlier values, identify null values in the categorical and numerical features, and impute those nulls with the mean or median (for numerical features) or the mode (for categorical features) of the corresponding column. If a feature is not of importance, we can drop the null values or the whole column.
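A minimal imputation sketch in pandas, with hypothetical features:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25.0, None, 31.0],        # numerical feature with a null
    "segment": ["A", "B", None],      # categorical feature with a null
})

# Impute numerical nulls with the column mean (median also works).
df["age"] = df["age"].fillna(df["age"].mean())

# Impute categorical nulls with the mode.
df["segment"] = df["segment"].fillna(df["segment"].mode()[0])
```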
After imputing null values and making the data frame consistent, it is time to divide it into two parts, i.e., train and test data frames (by the 80-20 rule). Before splitting, team members must select the target variable for prediction and the independent features that are of importance; features that are not can be dropped. This is part of feature engineering, which involves the inclusion, aggregation, and transformation of raw variables to create the features used in the analysis.
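The 80-20 split described above can be sketched with scikit-learn's `train_test_split` (feature names are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix X and target y.
X = pd.DataFrame({"sales": range(10), "visits": range(10, 20)})
y = pd.Series([0, 1] * 5)

# 80-20 split; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```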
Model Building:
Here, the team needs to build a model, based on a data science algorithm, that best expresses the solution to the defined business problem statement as a statistical model. Machine learning includes three main learning categories, i.e., supervised learning, unsupervised learning, and reinforcement learning. Define the target (dependent) variable and the independent variables, so that the target variable can be predicted from continuously updated independent variables.
The model is built by training it on the train dataset's observations (independent variables and the dependent variable). The model learns the patterns in the input observations and can then predict the target value for any other dataset having the same features. Check the accuracy of the model on the test data and compare it with the accuracy on the training data: the higher the accuracy percentage, the better the performance and the more effective the prediction. After the evaluation stage, select the model that gives the best accuracy on test/validation/new datasets.
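A minimal train-and-score sketch with scikit-learn, using the bundled Iris dataset as a stand-in for project data:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                 # learn from the train set
accuracy = model.score(X_test, y_test)      # evaluate on unseen test data
```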
Data Visualisation:
Data visualisation is a main part of a data science project: it uses graphical representations and insightful visuals to give a quick glance at what the data illustrates about the analysis. For data storytelling, the team members require good communication skills to present the visualization dashboards. Better interaction and communication around the visualizations earns quicker acknowledgment from the customers and stakeholders, and for managers or the CEO this interaction results in quicker business decisions.
Model Evaluation:
This stage evaluates the model through offline and online mechanisms. Evaluating the trained model on a separate, new dataset is known as data validation.
In the offline mechanism, we check whether the resulting output meets the described objectives. Offline evaluation is done through several iterative passes over each stage. Types of data validation include hold-out validation, cross-validation, and bootstrapping.
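As a sketch of cross-validation, scikit-learn's `cross_val_score` evaluates the model on five hold-out folds (Iris is used here as stand-in data):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: each fold serves once as the hold-out set.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
mean_score = scores.mean()
```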
In the online mechanism, the evaluation runs on live data that is continuously being generated. A/B testing is the online evaluation mechanism adopted by many big organizations; it helps answer questions like: which model performs better? The live data traffic gets split into A and B, i.e., Control and Experiment, where the A traffic is routed to the old model and the B traffic to the new model.
Check the key differences between the models' performances. Using statistical hypothesis testing (null vs. alternative hypothesis), the better model is selected as the effective business model.
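One common way to test such an A/B split is a chi-square test on the conversion counts; the traffic numbers below are made up for illustration:

```python
from scipy.stats import chi2_contingency

# Hypothetical conversion counts from live traffic:
# model A (control): 120 conversions out of 1000 visits,
# model B (experiment): 165 conversions out of 1000 visits.
table = [[120, 880], [165, 835]]

# Null hypothesis: both models convert at the same rate.
chi2, p_value, dof, expected = chi2_contingency(table)
better_model = "B" if p_value < 0.05 else "no significant difference"
```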
Deliverables in the form of model reporting: the team must write a report covering the overall model evaluation process, its requirements, and its output in detail.
Model Deployment:
Model deployment is the much-awaited stage for the data science project members. The goal here is to deploy the model with a data pipeline to a production or production-like environment for final user acceptance. Just building a model does not suffice; the team needs to deploy it in a user-friendly environment so it can be tested on the new data that users input. The model can be deployed as intelligent applications or web services, or via a model store, and made available to all customers so that they can use it and experience its performance.
The deployed model is exposed through user interfaces or APIs such as spreadsheets, websites, dashboards, back-end applications, and so on.
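As a minimal sketch of the model-store idea, a trained model can be serialised and later loaded by a serving application; pickle and scikit-learn are assumptions here, not a prescribed stack:

```python
import pickle

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Serialise the trained model so the serving layer can load it.
blob = pickle.dumps(model)

# In the production service, deserialise and answer prediction requests.
served_model = pickle.loads(blob)
prediction = served_model.predict(X[:1])
```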
The model's score and performance can then be tracked and monitored continuously, improving the learning process so the model can give more accurate solutions.
Here the deliverables are status dashboards that display system health and key metrics, a final modeling report with deployment details, and a final solution architecture document.
If the model's performance in the production environment is reviewed as satisfactory, the customer or stakeholders accept the data science project. The model application is then ready for future market opportunities and can gain the recognition needed to stand alongside competitors.