Business understanding: Determine business objectives; Assess situation; Determine data mining goals; Produce project plan

Data understanding: Collect initial data; Describe data; Explore data; Verify data quality

Data preparation: Select data; Clean data; Construct data; Integrate data; Format data

Modeling: Select modeling technique; Generate test design; Build model; Assess model

Evaluation: Evaluate results; Review process; Determine next steps

Deployment: Plan deployment; Plan monitoring & maintenance; Produce final report; Review project

Business understanding

In this phase, we need to understand the business problem and objectives. We assess the current situation in the business: what data is available, and what don't we know? We may want to develop a project plan for how we are going to address the business problem, which is especially important for a larger-scale project. How will we know whether we have solved the problem? We always need to define the success criteria. At this stage, these are expressed in business terms: dollars saved, customers retained, and so on. Finally, we need to work out which data mining problem is appropriate for solving the business problem. Is it a matter of understanding a dataset, which suggests unsupervised approaches such as data visualisation or clustering? Or is it a matter of making predictions, in which case we might look at regression or classification?

Data understanding

Now that we know what data mining problem we want to solve, it's time to focus on understanding the data. First, we must collect the initial data, which may be scattered over the organisation in different formats. We need to describe the data, producing a data dictionary so that all stakeholders have a common understanding of the meaning of different data attributes. Next, we need to explore the data. For this, you will use the methods we covered in the Data Analytics Foundations course. In particular, you would use data visualisation methods to explore the data, focusing on attributes one at a time, then in combinations. You might be interested in looking at the statistics describing attributes, looking at correlations between attributes, and using tools such as box plots, scatter plots and other graphs. Finally, we need to verify the data quality by doing a data quality audit. Are there missing values or systemic corruption of attributes?
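The data quality audit described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the records, the attribute names (`age`, `income`, `region`) and the helper functions `missing_counts` and `numeric_summary` are all hypothetical, and missing values are assumed to be stored as `None`.

```python
from statistics import mean, median

# Hypothetical records; None marks a missing value (an assumption).
records = [
    {"age": 34, "income": 52000.0, "region": "north"},
    {"age": None, "income": 61000.0, "region": "south"},
    {"age": 45, "income": None, "region": "north"},
    {"age": 29, "income": 48000.0, "region": None},
    {"age": 52, "income": 75000.0, "region": "south"},
]

def missing_counts(rows):
    """Count missing (None) values for each attribute."""
    counts = {}
    for row in rows:
        for name, value in row.items():
            counts[name] = counts.get(name, 0) + (value is None)
    return counts

def numeric_summary(rows, name):
    """Summarise one numeric attribute, ignoring missing values."""
    values = [row[name] for row in rows if row[name] is not None]
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
    }

audit = missing_counts(records)
age_summary = numeric_summary(records, "age")
print(audit)
print(age_summary)
```

In a real project the same checks would usually be run over every attribute and recorded alongside the data dictionary, so that the Data preparation phase starts from a known list of quality issues.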

Data preparation

At this stage, we have an understanding of the data and of its potential quality problems. In this phase, we fix those problems and prepare the data for the types of methods we identified in the Business Understanding phase. First, we select the data required to solve the business problem. The feature selection methods we looked at earlier in this course are applicable here. Next, we clean the data to address the identified data quality issues. For missing values, we may omit the affected records or attributes, or predict the values using predictive models. Next, we may construct new data attributes. This might be as simple as applying a function to one or more attributes, for example calculating the GST on a price or the Body Mass Index by combining weight and height attributes, or normalising or scaling attributes using the methods we discussed in the first course. Alternatively, feature construction might be more complex, involving some of the feature extraction methods discussed earlier in this course, such as Principal Component Analysis. Finally, we may integrate data from different sources to form a combined dataset, and we may need to format data, especially attributes such as dates and times.
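The simple kinds of attribute construction mentioned above might look like the following sketch. The function names are hypothetical, the GST rate is passed in as a parameter because it depends on the jurisdiction, and min-max scaling is used here only as one common example of scaling; the course may have covered other methods.

```python
def bmi(weight_kg, height_m):
    """Body Mass Index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2

def gst(price, rate):
    """Tax amount on a price; `rate` is a fraction such as 0.10 (assumed, varies by country)."""
    return price * rate

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(round(bmi(70, 1.75), 2))
print(gst(100.0, 0.10))
print(min_max_scale([10, 20, 30]))
```

Keeping each constructed attribute as a small named function makes it easier to apply exactly the same transformation later, when the model is deployed on new data.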

Modeling

Finally, we are up to the modelling phase. For this, you will use the methods that we have spent the most time on in this course, both unsupervised and supervised, as well as the approaches we have looked at for evaluating models. Given the business problem you identified in the Business Understanding phase, you will probably focus more on either unsupervised or supervised methods, although you may often use a combination of the two. For example, you might use clustering to explore a customer dataset, and then build a decision tree to describe a cluster. You will identify a set of modelling techniques that may be applicable to the problem, together with a metric for evaluating their quality. For example, for a classification problem, you might look at decision trees, random forests, SVMs and artificial neural networks. The chosen metric might be the F1 measure or the AUC.

Next, we generate a test design. How are we going to test the models and choose the best? One approach that is often used is to first choose a baseline model, one that is simple (e.g., decision trees or logistic regression) and that we do not expect to produce the highest quality model. Then we build successive models of increasing complexity or with different preprocessing, evaluating them as we go. We stop when we have the best model, however we define "best". The actual plan for doing this will be in your test design. We also need to construct the datasets themselves using the methods we discussed in the second course in this series, such as partitioning data into training, test and validation sets, bootstrap sampling or other approaches.

Next, we build models and assess them. In essence, we follow the plan in the test design. For a small project this may be done by hand, but more often these days it is automated by coding it into a Python pipeline or similar. By the end of this stage, we should have the highest quality model that we could build.
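A test design of the kind described above can be sketched as a small Python script: hold out a test set, score a baseline, then score a slightly more complex model with the same metric. Everything here is an illustrative assumption rather than the course's own pipeline. The data is synthetic, a majority-class predictor stands in for the baseline (it is simpler than the decision tree or logistic regression mentioned in the text), and the "more complex" model is a single-threshold rule, which is equivalent to a decision tree of depth one.

```python
import random

def train_test_split(rows, test_fraction, seed):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def f1_score(y_true, y_pred, positive=1):
    """F1 measure: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def fit_majority(train):
    """Baseline: always predict the most common training label."""
    labels = [label for _, label in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

def fit_stump(train):
    """Next model: pick the threshold on x with the best training F1."""
    y_true = [label for _, label in train]

    def train_f1(threshold):
        return f1_score(y_true, [int(x >= threshold) for x, _ in train])

    best = max(sorted(x for x, _ in train), key=train_f1)
    return lambda x: int(x >= best)

# Synthetic (x, label) rows, purely for illustration.
rows = [(x, int(x >= 5)) for x in range(10)]
train, test = train_test_split(rows, test_fraction=0.3, seed=0)
y_test = [label for _, label in test]

results = {}
for name, fit in [("majority baseline", fit_majority), ("threshold stump", fit_stump)]:
    model = fit(train)
    results[name] = f1_score(y_test, [model(x) for x, _ in test])
print(results)
```

The loop at the end is the part that corresponds to the test design: every candidate model is fitted on the same training set and scored on the same held-out set with the same metric, so the comparison between them is fair.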

Evaluation

In the Evaluation phase, we want to understand whether we have answered the business question. We evaluate the results of the modelling and see whether they address the business objectives. We review the process we have followed, looking for weaknesses and opportunities for improvement. Finally, we determine the next steps. Have we answered the business question? If so, we can move to deployment. If not, perhaps we need to change the business question based on what we have learned from our modelling.

Deployment

The final phase is deployment. Here we look at using our models in the organisation. Models are often called data products, which we will talk about in later modules. So we look at deploying the models. Do they fit into the software within the organisation to support decision making? Are they deployed as dashboards? Are reports generated from the models?

Unlike software engineering frameworks such as the waterfall model, the knowledge discovery process is an iterative development framework, so the development is never completely final. We can only produce processed datasets supported by models, and these can always be improved further as new training data becomes available. If the business problem or the world changes, our models may become stale. For example, in fraud detection, fraudsters change their approaches and targets over time, and models must be rebuilt.

Thus, discovered knowledge patterns are our starting point for further reasoning and investigation. We can combine the models, which act as an abstracted form of the training data, with the knowledge discovery process to form a higher-level understanding of the important concepts in the training data, which then acts as a basis for future action. This idea is called generalisation in machine learning. Models that minimise generalisation error should ideally help us recall, deduce, induce and intuit unknown features and unforeseen outcomes. These are the kinds of knowledge patterns we normally want to automate, evaluate and validate in learning algorithms.

The plan for monitoring and maintenance addresses all of these issues. We may also produce a final report describing the project, and review the project so that we can improve the process in the future.
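One simple way a monitoring plan could detect a stale model is to compare recent quality scores with the score measured at deployment. The sketch below is only an illustration: the function name `needs_rebuild`, the idea of using an average of recent scores, and the tolerance value are all assumptions, not something prescribed by the process.

```python
from statistics import mean

def needs_rebuild(baseline_score, recent_scores, tolerance):
    """Flag the model as stale when the average recent score drops
    more than `tolerance` below the score measured at deployment."""
    if not recent_scores:
        return False  # nothing observed yet, so no evidence of staleness
    return mean(recent_scores) < baseline_score - tolerance

# Hypothetical F1 scores recorded at deployment and in later periods.
print(needs_rebuild(0.80, [0.79, 0.78, 0.81], tolerance=0.05))
print(needs_rebuild(0.80, [0.70, 0.68, 0.72], tolerance=0.05))
```

In the fraud detection example, a check like this would be run each time newly labelled cases become available, and a positive result would trigger another pass through the earlier phases.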
