Business understanding

Determine business objectives

Assess situation

Determine data mining goals

Produce project plan

Data understanding

Collect initial data

Describe data

Explore data

Verify data quality

Data preparation

Select data

Clean data

Construct data

Integrate data

Format data

Modeling

Select modeling technique

Generate test design

Build model

Assess model

Evaluation

Evaluate results

Review process

Determine next steps

Deployment

Plan deployment

Plan monitoring & maintenance

Produce final report

Review project

Business understanding

In this phase, we need to understand the business problem and objectives. We need to assess the current situation in the business. What data is available? What don't we know?

We may want to develop a project plan for how we are going to address the business problem. This is especially important for a larger-scale project.

How will we know whether we have solved the problem? We always need to define the success criteria. At this stage, these are expressed in business terms: dollars saved, customers retained, and so on.

We also need to work out which data mining problem is appropriate for solving the business problem. Is it a matter of understanding a dataset, which suggests unsupervised approaches such as data visualisation or clustering? Or is it a matter of making predictions, in which case we might look at regression or classification?

Data understanding

Now that we know what data mining problem we want to solve, it's time to focus on understanding the data. First, we must collect the initial data, which may be scattered across the organisation in different formats. We need to describe the data, producing a data dictionary so that all stakeholders share a common understanding of what each data attribute means.

Next, we need to explore the data. For this, you will use the methods we covered in the Data Analytics Foundations course. In particular, you would use data visualisation methods to explore the data, focusing on attributes one at a time, then in combination. You might look at the statistics describing attributes and the correlations between attributes, using tools such as box plots, scatter plots and other graphs.

Finally, we need to verify the data quality by carrying out a data quality audit. Are there missing values or systemic corruption of attributes?
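
As a minimal sketch of the describe, explore and verify steps, the Python below uses only the standard library on a small made-up customer table. The records, the attribute names and the helpers `audit`, `describe` and `pearson` are illustrative assumptions, not part of the course material; in practice a data analysis library would usually do this work.

```python
import statistics

# Hypothetical records for illustration; None marks a missing value.
records = [
    {"age": 34, "income": 52000, "churned": 0},
    {"age": 45, "income": 61000, "churned": 0},
    {"age": 29, "income": None, "churned": 1},
    {"age": 52, "income": 75000, "churned": 0},
    {"age": None, "income": 48000, "churned": 1},
    {"age": 41, "income": 58000, "churned": 0},
]

def audit(rows):
    """Verify data quality: count missing values per attribute."""
    counts = {}
    for row in rows:
        for name, value in row.items():
            counts[name] = counts.get(name, 0) + (value is None)
    return counts

def describe(rows, name):
    """Describe one attribute, ignoring missing values."""
    values = [row[name] for row in rows if row[name] is not None]
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }

def pearson(rows, a, b):
    """Explore a pair of attributes: Pearson correlation over complete rows."""
    pairs = [(r[a], r[b]) for r in rows if r[a] is not None and r[b] is not None]
    xs, ys = zip(*pairs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

missing = audit(records)
age_summary = describe(records, "age")
age_income_corr = pearson(records, "age", "income")
print(missing)
print(age_summary)
print(round(age_income_corr, 3))
```

The missing-value counts feed directly into the cleaning decisions made in the next phase, while the summary statistics and correlations are the numeric counterparts of the box plots and scatter plots mentioned above.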

Data preparation

At this stage, we understand the data and its potential quality problems. In this phase, we fix those problems and prepare the data for the types of methods we identified in the Business Understanding phase.

First, we select the data required to solve the business problem. The feature selection methods we looked at earlier in this course are applicable here.

Next, we clean the data to address the identified data quality issues. For missing values, we may omit data or attributes, or predict the values using predictive models.

We may then construct new data attributes. This might be as simple as applying a function to one or more attributes, for example calculating the GST on a price or the Body Mass Index by combining weight and height attributes, or normalising or scaling attributes using the methods we discussed in the first course. Alternatively, feature construction might be more complex, involving some of the feature extraction methods discussed earlier in this course, such as Principal Component Analysis.

Finally, we may integrate data to form a combined dataset, and we may need to format data, especially attributes such as dates and times.
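
The clean and construct steps can be sketched in a few lines of standard-library Python. Everything here is an assumption made for illustration: the records, the attribute names, the 10% `GST_RATE`, and the choice of mean imputation and min-max scaling as the cleaning and scaling methods.

```python
import statistics

GST_RATE = 0.10  # Assumed rate for illustration only.

# Hypothetical records; None marks a missing weight.
people = [
    {"weight_kg": 70.0, "height_m": 1.75, "price": 100.0},
    {"weight_kg": None, "height_m": 1.60, "price": 250.0},
    {"weight_kg": 90.0, "height_m": 1.80, "price": 40.0},
]

def impute_mean(rows, name):
    """Clean: replace missing values of one attribute with its mean."""
    mean = statistics.mean(r[name] for r in rows if r[name] is not None)
    return [{**r, name: mean if r[name] is None else r[name]} for r in rows]

def construct(rows):
    """Construct: derive BMI and GST attributes from existing ones."""
    return [
        {
            **r,
            "bmi": r["weight_kg"] / r["height_m"] ** 2,
            "gst": r["price"] * GST_RATE,
        }
        for r in rows
    ]

def min_max_scale(rows, name):
    """Scale one attribute to the range [0, 1].

    Assumes the attribute takes at least two distinct values.
    """
    values = [r[name] for r in rows]
    low, high = min(values), max(values)
    return [{**r, name + "_scaled": (r[name] - low) / (high - low)} for r in rows]

prepared = min_max_scale(construct(impute_mean(people, "weight_kg")), "price")
for row in prepared:
    print(row)
```

Each helper returns new records rather than changing the originals, so the raw data stays available if a cleaning decision needs to be revisited.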

Modeling

Finally, we are up to the modelling phase. For this, you will use the methods that we have spent the most time on in this course, both unsupervised and supervised, as well as the approaches we have looked at for evaluating models.

Given the business problem you identified in the Business Understanding phase, you will probably focus more on either unsupervised or supervised methods, although you may often use a combination of the two. For example, you might use clustering to explore a customer dataset, then build a decision tree to describe a cluster.

You will identify a set of modelling techniques that may be applicable to the problem, together with a metric for evaluating their quality. For example, for a classification problem, you might look at decision trees, random forests, SVMs and artificial neural networks. The chosen metric might be the F1 measure or the AUC.

Next, we generate a test design. How are we going to test the models and choose the best? One common approach is to first choose a baseline model, one that is simple (e.g., a decision tree or logistic regression) and that we do not expect to produce the highest-quality model. Then we build successive models of increasing complexity or with different preprocessing, evaluating them as we go. We stop when we have the best model, however we define "best". The actual plan for doing this will be in your test design.

We also need to construct the data sets themselves, using the methods we discussed in the second course in this series, such as partitioning data into training, validation and test sets, bootstrap sampling or other approaches.

Next, we build models and assess them. In essence, we follow the plan in the test design. For a small project this may be done by hand, but more often these days it is automated by coding it into a Python pipeline or similar.

By this stage, we should have the highest-quality model that we could build.
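
A test design of this kind can be sketched as a small Python script. This is an illustration under stated assumptions rather than a recommended pipeline: the data is synthetic, the baseline is a majority-class predictor (simpler than the decision tree or logistic regression mentioned above), the single candidate is a one-split decision stump, and the helper names `split`, `f1_score`, `majority_baseline` and `fit_stump` are hypothetical.

```python
import random

def f1_score(actual, predicted, positive=1):
    """F1 measure: harmonic mean of precision and recall."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def split(rows, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle, then partition into training, validation and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

def majority_baseline(train):
    """Baseline: always predict the most common training label."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

def fit_stump(train):
    """Candidate: pick the threshold with the best training F1."""
    def train_f1(t):
        return f1_score([y for _, y in train], [int(x >= t) for x, _ in train])
    threshold = max((x for x, _ in train), key=train_f1)
    return lambda x: int(x >= threshold)

# Synthetic data: one feature x in [0, 1); the label is 1 when x >= 0.7.
data = [(i / 100, int(i >= 70)) for i in range(100)]
train, valid, test = split(data)

# Build each model on the training set, compare on the validation set.
candidates = {"baseline": majority_baseline(train), "stump": fit_stump(train)}
scores = {
    name: f1_score([y for _, y in valid], [model(x) for x, _ in valid])
    for name, model in candidates.items()
}
best = max(scores, key=scores.get)

# Report the chosen model once on the held-out test set.
test_f1 = f1_score([y for _, y in test], [candidates[best](x) for x, _ in test])
print(scores, best, test_f1)
```

Keeping the test set out of model selection is the design choice that matters here: the validation set decides which model is "best", and the test set is used once to estimate how that model performs on unseen data.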

Evaluation

In the Evaluation phase, we want to understand whether we have answered the business question. We evaluate the results of the modelling and see whether they address the business objectives. We review the process we have followed, looking for weaknesses and opportunities for improvement. Finally, we determine the next steps. Have we answered the business question? If so, we can move to deployment. If not, perhaps we need to change the business question based on what we learned from our modelling.

Deployment

The final phase is deployment. Here we look at using our models in the organisation. Models are often called data products, which we will talk about in later modules. So we look at deploying the models. Do they fit into the software within the organisation to support decision making? Are they deployed as dashboards? Are reports generated from the models?

Unlike software engineering frameworks such as the waterfall model, the knowledge discovery process is an iterative development framework, so development is never completely final. We can only produce processed datasets supported by models, which can always be improved further as new training data becomes available. If the business problem or the world changes, our models may become stale. For example, in fraud detection, fraudsters change their approaches and targets over time, and models must be rebuilt.

Thus discovered knowledge patterns are our starting point for further reasoning and investigation. We can combine the data models, acting as an abstracted form of the training data, with the knowledge discovery processes to form a higher-level understanding of the important concepts in the training data, which acts as a basis for future action. This idea is called generalization in machine learning. Learning models that minimize generalization error should ideally help us recollect, deduce, induce and intuit unknown features and unforeseen outcomes. These are the kinds of knowledge patterns we normally want to automate, evaluate and validate in learning algorithms.

The plan monitoring and maintenance task looks at all these issues.

We may also produce a final report describing the project, and review the project so that we can improve the process in the future.
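
One simple way to monitor a deployed model for staleness is to compare its recent quality against the quality measured at deployment. The sketch below is an assumption made for illustration, not a prescribed method: the `needs_rebuild` helper, the tolerance, the window size and the monthly F1 values are all hypothetical.

```python
def needs_rebuild(deployed_f1, recent_f1_scores, tolerance=0.05, window=3):
    """Flag a model as stale when its mean F1 over the last `window`
    monitoring periods is more than `tolerance` below the F1 measured
    at deployment."""
    recent = recent_f1_scores[-window:]
    if len(recent) < window:
        return False  # Not enough monitoring data yet.
    return sum(recent) / len(recent) < deployed_f1 - tolerance

# Hypothetical monthly F1 scores for a fraud model deployed with F1 = 0.82.
monthly_f1 = [0.81, 0.80, 0.79, 0.74, 0.70, 0.66]

stale_early = needs_rebuild(0.82, monthly_f1[:3])
stale_now = needs_rebuild(0.82, monthly_f1)
print(stale_early, stale_now)
```

Averaging over a window rather than reacting to a single period avoids triggering a rebuild on one noisy measurement; the right window and tolerance depend on the business problem.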
