Data Mining is the process of exploring and analyzing large amounts of data, using automatic and semi-automatic systems, to find models (or patterns) in the data and use them for a specific purpose.
The purpose of Big Data is to extract information in a reasonable time and with limited resources. This extraction is performed with Data Mining techniques, which allow patterns and knowledge to be extracted from the data examined.
The term mining refers to extraction, just as in real mines, where what is extracted is coal rather than data.
A pattern is the result of this extraction: it indicates a structure, a model or, in general, a concise representation of the data. To be of interest, a pattern must be:
· Understandable, from a semantic and syntactic point of view, so that the user can interpret it;
· Potentially useful, so that the user can act on it;
· Valid on data with a certain degree of confidence;
· Previously unknown.
Data Mining Models
To obtain real benefits (whether commercial, scientific, or in any other field), it is essential to use the Data Mining techniques best suited to the purpose at hand.
These techniques can be divided into two families of models:
1) Descriptive: detects similarities or groupings in historical data to determine the reasons for success or failure, for example by grouping customers based on product preferences.
2) Predictive: goes further and aims to classify future events or estimate unknown outcomes. Predictive modeling helps, for example, to anticipate which customers are at risk of leaving and to forecast customer buying behavior.
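The difference between the two families can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the purchase records, the function names, and the naive rule "the next purchase falls in the customer's favorite category" are all hypothetical and stand in for real descriptive and predictive techniques.

```python
from collections import Counter, defaultdict

# Hypothetical purchase history: (customer, product category).
purchases = [
    ("ann", "books"), ("ann", "books"), ("ann", "music"),
    ("bob", "music"), ("bob", "music"),
    ("cat", "books"), ("cat", "games"), ("cat", "books"),
]

def favorite_category(purchases):
    """Summarize each customer's history by their most frequent category."""
    per_customer = defaultdict(Counter)
    for customer, category in purchases:
        per_customer[customer][category] += 1
    return {c: counts.most_common(1)[0][0] for c, counts in per_customer.items()}

def group_by_preference(favorites):
    """Descriptive: group customers that share the same product preference."""
    groups = defaultdict(list)
    for customer, category in favorites.items():
        groups[category].append(customer)
    return {category: sorted(names) for category, names in groups.items()}

def predict_next_category(customer, favorites, default="unknown"):
    """Predictive (naive): guess the next purchase from past behavior."""
    return favorites.get(customer, default)

favorites = favorite_category(purchases)
groups = group_by_preference(favorites)
print(groups)                                   # customers grouped by preference
print(predict_next_category("bob", favorites))  # guess for a known customer
```

The descriptive step only summarizes what already happened, while the predictive step makes a claim about an event that has not been observed yet.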
The choice between these models depends on the type of data to be analyzed and the type of pattern to be extracted from the data.
The data you want to analyze can take different forms: texts, numbers, maps, audio, video, e-mail, and so on.
As for the type of pattern, the main ones that can be extracted from the data are the following (this is not an exhaustive list):
· Clusters: group the elements of a set, according to their characteristics, into classes that are not assigned a priori.
· Classification models: derive a model that classifies data according to a set of classes assigned a priori. One type of classifier is the decision tree, which identifies, in order of importance, the causes that lead to the occurrence of an event.
· Association rules: determine the logical implication rules present in the database and thereby identify affinity groups between objects.
· Forecasting models: explain or understand the past (for example, why an aircraft stopped unexpectedly) or predict the future (for example, whether an earthquake will take place tomorrow).
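To make the cluster pattern more concrete, here is a minimal sketch of k-means on one-dimensional data, written in plain Python. k-means is one common clustering technique, chosen here only as an example; the data values and the starting centroids are hypothetical. Note that no class labels are supplied: the groups emerge from the values themselves.

```python
def kmeans_1d(values, initial_centroids, max_iterations=20):
    """Cluster numbers around centroids; the classes are not given in advance."""
    centroids = list(initial_centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iterations):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for value in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(value - centroids[i]))
            clusters[nearest].append(value)
        # Update step: move each centroid to the mean of its cluster.
        updated = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if updated == centroids:
            break  # assignments are stable, so stop early
        centroids = updated
    return centroids, clusters

# Hypothetical measurements that visibly fall into two groups.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = kmeans_1d(values, initial_centroids=[0.0, 5.0])
print(centroids)  # [2.0, 11.0]
print(clusters)   # [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
```

A classification model would differ in that the class of each training value is known beforehand and the model learns to reproduce it.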
The Process of Finding New Knowledge from Data
When we talk about Data Mining, it is worth keeping in mind that it is only one of the steps in the search for new knowledge from data (known as Knowledge Discovery in Databases, KDD).
In fact, Data Mining is complemented by other fundamental phases (following the representation of Fayyad, Piatetsky-Shapiro, and Smyth, 1996). The phases are:
1) Data selection: selecting the data set requires knowledge of the domain from which the data are taken. Removing unrelated data from the data set reduces the search space during the data mining phase, which shortens the analysis time.
2) Data preprocessing: this phase consists of cleaning the data, removing "noise" and other inconsistencies that could cause problems during analysis. It also covers exploring and preparing the data for the subsequent steps.
3) Data transformation: the data is converted into formats suitable for analysis with Data Mining techniques. In this phase, the variety of the data is reduced while preserving its quality: information is organized, converted from one type to another, and new "derived" attributes are defined.
4) Data Mining: Data Mining techniques (algorithms) are applied to analyze the data and discover interesting patterns or extract new knowledge from it.
5) Evaluation: the final step is the interpretation and documentation of the results achieved by the previous phases. From here it is possible to return to earlier steps to refine the acquired knowledge or to transform it according to the user's most pressing needs.
It is important to stress that Data Mining should not be considered a separate, autonomous activity: pre-processing and final evaluation are equally essential.
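The five phases above can be sketched as a chain of small functions. This is a toy illustration, not a reference implementation of KDD: the records, the attribute names, the cleaning rules, the age bands, and the `min_gap` threshold are all assumptions made up for the example.

```python
from statistics import mean

# Hypothetical raw records: (customer_id, age, monthly_spend, note).
raw = [
    (1, 34, 120.0, "ok"),
    (2, None, 80.0, "ok"),     # missing age
    (3, 51, -5.0, "refund"),   # negative spend, treated as inconsistent
    (4, 29, 200.0, "ok"),
    (5, 62, 40.0, "ok"),
]

def select(records):
    """1) Selection: keep only the attributes relevant to the question."""
    return [(cid, age, spend) for cid, age, spend, _note in records]

def preprocess(records):
    """2) Preprocessing: drop records with missing or inconsistent values."""
    return [r for r in records if r[1] is not None and r[2] >= 0]

def transform(records):
    """3) Transformation: replace age with a derived attribute, the age band."""
    return [(cid, "under_40" if age < 40 else "40_plus", spend)
            for cid, age, spend in records]

def mine(records):
    """4) Data Mining: a very simple pattern, average spend per age band."""
    bands = {}
    for _cid, band, spend in records:
        bands.setdefault(band, []).append(spend)
    return {band: mean(spends) for band, spends in bands.items()}

def evaluate(pattern, min_gap=50.0):
    """5) Evaluation: keep the pattern only if the bands differ enough."""
    values = list(pattern.values())
    return len(values) > 1 and max(values) - min(values) >= min_gap

pattern = mine(transform(preprocess(select(raw))))
interesting = evaluate(pattern)
print(pattern)      # {'under_40': 160.0, '40_plus': 40.0}
print(interesting)  # True
```

If the evaluation rejected the pattern, the natural next move would be to go back and change an earlier phase, for example the cleaning rules or the derived attribute.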
Data Mining Applications
Data Mining is used in the financial sector, in marketing, and in manufacturing. Some examples of its applications are:
· Automated learning: to identify, through neural networks, specific patterns whose elements have precise relationships between them.
· Market basket analysis: to identify the products bought together by a sufficiently large number of customers.
· Direct marketing: to reduce, for example, the cost of advertising by mail by identifying the set of customers who are most likely to buy a new telephony product.
· Fraud detection: to predict the fraudulent use of payment instruments (such as credit cards).
· Identification of customer dissatisfaction: to predict customers likely to switch to a competitor.
· Grouping of documents: to find subgroups of documents that are similar based on the most relevant terms that appear in them.
· Market segmentation: to divide customers into separate subsets to be used as targets for specific marketing activities.
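As an illustration of finding products bought together, the sketch below computes two standard association-rule measures: support (the share of baskets that contain a pair of products) and confidence (how often the consequent appears in baskets that contain the antecedent). The baskets and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, one set of products per purchase.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def pair_support(baskets):
    """Share of baskets in which each pair of products appears together."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives every pair a single canonical (alphabetical) order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: count / len(baskets) for pair, count in counts.items()}

def confidence(baskets, antecedent, consequent):
    """Of the baskets containing the antecedent, the share that also
    contain the consequent (the rule antecedent -> consequent)."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)

support = pair_support(baskets)
print(support[("bread", "milk")])              # 0.5
print(confidence(baskets, "butter", "bread"))  # 1.0
```

A requirement such as "a sufficiently large number of customers" then becomes a minimum support threshold, and the degree of confidence attached to a rule becomes a minimum confidence threshold.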
Benefits of Data Mining
Finally, the main advantages of using Data Mining methodologies are:
· No a priori hypotheses are required from the researcher performing the analysis.
· A large number of variables and observations can be processed.
· Optimized algorithms keep processing time to a minimum.
· The results can be simple to interpret.
· The results can be displayed clearly.
Dec 02, 2019