Today, many large companies simply store their historical data and either never process it or do not know how to turn it into something useful. Not knowing where to start in extracting valuable information from that data may be one of the causes.
If you do not yet know what Data Mining is, you should first read “What is Data Mining?” before reading further.
Ever heard of the term CRISP-DM?
CRISP-DM stands for Cross-Industry Standard Process for Data Mining, first developed in 1996. CRISP-DM describes the data mining process as six stages.
The goal of this process is to find interesting and meaningful patterns in data. It draws on several disciplines, namely Statistics, Machine Learning, Artificial Intelligence, Pattern Recognition, and Data Mining.
One of the advantages of using this process is that it lays out the most common steps of a data mining project.
This process also involves managers and practitioners working together. Managers generally provide direction on the main project objectives, the availability of data, and the models to be used. Practitioners work within their own fields at each stage of the process, and can come from various disciplines such as mathematics, statistics, or informatics engineering.
The CRISP-DM stages are:
1. Business Understanding
Broadly speaking, this stage defines the project. It is the first stage in CRISP-DM and a vital part of the project, because it guides the work in every later stage. It requires knowledge of the business objectives, how to build or obtain the data, and how to match modeling goals to business goals so that the best models can be built.
2. Data Understanding
Broadly speaking, this stage examines the data in order to identify problems in it. It provides the analytical foundation for the project by summarizing the data and identifying potential problems. This stage must be done carefully and without rushing, for example in data visualization, where insights are sometimes very difficult to connect back to the data summaries. Any question left unanswered at this stage will interfere with the modeling stage. Data summaries can confirm whether the data is distributed as expected, or reveal unexpected irregularities that need to be addressed in the next stage, Data Preparation. Typical problems such as missing values, outliers, spiked distributions, and bimodal distributions must be identified and measured so that they can be corrected during Data Preparation.
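As a rough illustration of the kind of summary described above, here is a minimal Python sketch that counts missing values and flags outliers in a single numeric column. The `summarize` function, the sample `ages` column, and the 1.5 × IQR outlier rule are assumptions chosen for this example; they are not part of CRISP-DM itself.

```python
import statistics

def summarize(values):
    """Summarize one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    # Quartiles of the non-missing values (assumed outlier rule: 1.5 * IQR).
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "outliers": [v for v in present if v < low or v > high],
    }

# Hypothetical column with two missing values and one suspicious entry.
ages = [23, 25, None, 31, 29, 27, 24, 120, None, 26]
summary = summarize(ages)
print(summary)
```

A large gap between the mean and the median, as in this sample, is one hint that an outlier is distorting the distribution and should be looked at before modeling.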
3. Data Preparation
Broadly speaking, this stage fixes the problems found in the data and then creates derived variables.
This phase requires careful thought and considerable effort to ensure that the data is suitable for the algorithm being used.
Resolving the data problems once during Data Preparation does not mean the data can then be used unchanged through to the last stage. This stage is often revisited when problems are found while building the model, so the work is repeated until the data fits what the model needs.
Sampling can also be done here, and the data is generally divided into two parts: training data and testing data.
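The split into training and testing data can be sketched in a few lines of Python. The function name `train_test_split`, the 20% test fraction, and the fixed seed are assumptions made for this illustration; in practice the proportions depend on the project.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))  # stand-in for ten data records
train, test = train_test_split(rows)
print(len(train), len(test))
```

Shuffling before splitting avoids accidentally putting, for example, only the oldest records in the training set and only the newest in the testing set.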
4. Modeling
Broadly speaking, this stage builds a predictive or descriptive model.
At this stage we can finally use Statistics and Machine Learning to extract useful insights from the data and achieve the project goals.
Common modeling tasks include classification, scoring, ranking, clustering, finding relations, and characterization.
5. Evaluation
Broadly speaking, this stage assesses the model in order to report its expected effects.
Once we have a model, we must determine whether it suits our purpose. The questions below can help decide whether it does:
– Is it accurate enough for our needs? Does it generalize well?
– Does the model do better than an obvious guess? Better than whatever estimate you currently use?
– Do the results of the model (coefficients, clusters, rules) make sense in the context of the problem domain?
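One simple way to check whether a classification model beats an obvious guess is to compare its accuracy on the testing data with a baseline that always predicts the most common label. The sketch below assumes a yes/no classification task; the labels and the `model_predictions` list are made up for illustration and do not come from a real model.

```python
from collections import Counter

def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def majority_baseline(train_labels, n):
    """The 'obvious guess': always predict the most common training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return [most_common] * n

# Hypothetical labels and hypothetical model output.
train_labels = ["no", "no", "no", "yes", "no", "yes"]
test_labels = ["no", "yes", "no", "yes", "no"]
model_predictions = ["no", "yes", "no", "no", "no"]

baseline_acc = accuracy(majority_baseline(train_labels, len(test_labels)), test_labels)
model_acc = accuracy(model_predictions, test_labels)
beats_baseline = model_acc > baseline_acc
print(baseline_acc, model_acc, beats_baseline)
```

Measuring both numbers on the testing data, not the training data, also gives a first indication of whether the model generalizes.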
6. Deployment
Broadly speaking, this stage plans how the model will be used.
This is the most highly valued stage of the CRISP-DM process. Planning for Deployment starts during Business Understanding and must cover not only how to generate model scores, but also how to convert those scores into decisions and how to incorporate the decisions into operational systems.
Finally, the Deployment plan recognizes that no model is static. A model is built from data that represents a particular point in time, so the passage of time can change the characteristics of the data. The model must therefore be monitored and may need to be replaced with an improved model.
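Monitoring can start with something as simple as comparing recent data against the data the model was trained on. The following Python sketch flags a shift in the mean of one numeric input; the `has_drifted` function, the sample values, and the threshold of two standard deviations are assumptions for illustration only, and real monitoring would usually track more than one statistic.

```python
import statistics

def has_drifted(training_values, recent_values, threshold=2.0):
    """Flag drift when the recent mean is more than `threshold`
    training standard deviations away from the training mean."""
    mean = statistics.mean(training_values)
    stdev = statistics.stdev(training_values)
    shift = abs(statistics.mean(recent_values) - mean)
    return shift > threshold * stdev

# Hypothetical values of one input variable.
training = [100, 102, 98, 101, 99, 100]
stable = [101, 99, 100, 102]      # looks like the training data
shifted = [110, 112, 109, 111]    # characteristics have changed

print(has_drifted(training, stable))
print(has_drifted(training, shifted))
```

When a check like this fires, it is a signal to review the model and, if needed, go back through the earlier stages with fresher data.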