
Machine Learning Algorithm: C4.5 Algorithm

MACHINE LEARNING

Machine learning algorithms can be applied to Android applications. Machine learning offers various algorithms for various learning tasks (or problems, in other references); regression, classification, and clustering are the most common among them. Regression is the supervised learning task for modeling and predicting continuous, numeric variables (examples include predicting real-estate prices, stock price movements, or student test scores). Classification is the supervised learning task for modeling and predicting categorical variables (examples include predicting employee churn, email spam, financial fraud, or student letter grades). Clustering is an unsupervised learning task for finding natural groupings of observations (i.e., clusters) based on the inherent structure of the dataset (examples include customer segmentation, grouping similar items in e-commerce, and social network analysis).

These learning tasks are classified by learning style: supervised, unsupervised, semi-supervised, and reinforcement learning. In supervised learning, the input data is called training data and has a known label or result. Its opposite is unsupervised learning, where the input data is not labeled and does not have a known result. Semi-supervised learning takes a mixture of labeled and unlabeled input data; there is a desired prediction problem, but the model must learn the structures that organize the data as well as make predictions. Reinforcement learning aims at using observations gathered from interaction with the environment to take actions that maximize the reward or minimize the risk.

Some classification algorithms include logistic regression (regression by name, but used for classification), decision trees, random forests, and naive Bayes. Logistic regression takes some inputs and calculates the probability of some outcome; if the probability is greater than fifty percent (50%), the decision is true. A decision tree is a mechanical way to decide by dividing the inputs into smaller decisions. The tree is divided into decision nodes and leaves: the leaves are the decisions (yes or no), and the nodes are the factors (for example: windy, sunny, and others). In a random forest, the approach to classification is like that of the decision tree, except that the questions posed include some randomness; the goal is to push out bias and to group outcomes based upon the most likely positive responses. These collections of positive responses are called bags. Naive Bayes is used, among other cases, to classify email as spam or not. It is based on the concept of conditional probability: the chance of some outcome given some other outcome.

The goal of a clustering algorithm is to group data in a relevant way, which can be done with K-means and hierarchical clustering. The first algorithm, K-means, partitions the dataset into unique, homogeneous clusters whose observations are similar to each other but different from those of other clusters; the resulting clusters are mutually exclusive, i.e., non-overlapping. In this technique, "K" refers to the number of clusters into which the data is partitioned, and every cluster has a centroid. The name "K-means" derives from the fact that each cluster centroid is computed as the mean of the observations assigned to that cluster. In simple words, hierarchical clustering tries to create a sequence of nested clusters to explore deeper insights from the data.
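To make the K-means mechanics above concrete, here is a minimal sketch in plain Python. The two-dimensional points, the choice of K = 2, and the fixed iteration count are made-up illustrative values, not part of the study:

```python
import random

def kmeans(points, k, iterations=10):
    """Minimal K-means: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its assigned points."""
    centroids = random.sample(points, k)              # pick k initial centroids
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                              # assignment step
            nearest = min(range(k),
                          key=lambda i: (p[0] - centroids[i][0]) ** 2
                                        + (p[1] - centroids[i][1]) ** 2)
            clusters[nearest].append(p)
        for i, cluster in enumerate(clusters):        # update step
            if cluster:                               # skip empty clusters
                centroids[i] = (sum(p[0] for p in cluster) / len(cluster),
                                sum(p[1] for p in cluster) / len(cluster))
    return centroids, clusters

# Made-up 2-D data forming two loose groups
data = [(1, 1), (1.5, 2), (2, 1), (8, 8), (8.5, 9), (9, 8)]
centroids, clusters = kmeans(data, k=2)
print(centroids)  # one centroid near each group
```

Note how the update step is literally the "means" in K-means: each centroid is recomputed as the average of the observations currently assigned to it.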
Hierarchical clustering, for example, is popularly used to explore the standard plant taxonomy, which classifies plants by family, genus, species, and so on. For regression, least squares is a method for performing linear regression, the task of fitting a straight line through a set of points. There are multiple possible strategies for doing this; the "ordinary least squares" strategy draws a line, measures the vertical distance between each data point and the line, and adds these distances up, and the fitted line is the one for which this sum of distances is as small as possible.

In machine learning there is a "No Free Lunch" theorem: no single algorithm works best for every problem, especially in the case of supervised learning, due to factors like the size and structure of the dataset. Knowing this means that the algorithm used must be appropriate for the problem, which is where picking the right machine learning task comes in. Classifying contacts as priority contacts falls under the classification task, and since the researchers are using data from phone contacts with call and SMS exchanges with the user, that data can serve as training data, an application of supervised learning. Among the machine learning algorithms used to solve the classification task under supervised learning are:

• k-nearest neighbor: used in classification studies when there is little or no prior knowledge about the distribution of the data;
• logistic regression: used when the probabilities of the response variable must be modeled as a function of some explanatory variable, and when there is a need to predict the probability that a categorical dependent variable falls into one of the two categories of a binary response;
• naive Bayes: used in sentiment analysis, document categorization, and email spam filtering;
• random forests: collections of decision trees used in classification and regression;
• AdaBoost: an ensemble whose members are trained independently and whose predictions are combined in some way to make the overall prediction;
• neural networks: mimic the structure of biological neural networks and comprise hundreds of algorithms and variations for all manner of problem types;
• decision trees.

There are many decision tree based algorithms, including Classification and Regression Tree (CART), Iterative Dichotomiser 3 (ID3), C4.5 and C5.0 (different versions of a powerful approach), Chi-squared Automatic Interaction Detection (CHAID), Decision Stump, M5, and Conditional Decision Trees. The researchers will compare the splitting and pruning techniques, as well as the strengths and weaknesses, of the decision tree algorithms created by J. Ross Quinlan: ID3 and C4.5. Quinlan originally developed ID3 at the University of Sydney in 1975 and later described it in the journal Machine Learning, Volume 1, Number 1. ID3 is based on the Concept Learning System (CLS) algorithm [65]. The next algorithm, C4.5, is an evolution of ID3 presented by the same author; its decision tree grows using a depth-first strategy.
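Both of Quinlan's algorithms grow the tree greedily, choosing at each node the attribute whose test best separates the classes; ID3 measures this with information gain, its splitting criterion discussed below. Here is a minimal Python sketch of that criterion; the weather-style attributes, labels, and tiny dataset are made-up illustrative values, not data from the study:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attribute_index):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    total = len(labels)
    gain = entropy(labels)
    for value in set(row[attribute_index] for row in rows):
        subset = [lab for row, lab in zip(rows, labels)
                  if row[attribute_index] == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# Made-up dataset: (outlook, windy) -> play?
rows   = [("sunny", "yes"), ("sunny", "no"), ("rain", "yes"), ("rain", "no")]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, labels, 0))  # outlook separates the classes: 1.0
print(information_gain(rows, labels, 1))  # windy tells us nothing: 0.0
```

The attribute with the highest gain becomes the decision node, and the procedure recurses on each branch until the data at a node is fully classified.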
The basic characteristics of these two decision tree algorithms can be summarized as follows. The advantage of J. Ross Quinlan's ID3 is that it creates the tree by searching the whole dataset: it only needs to test enough attributes until all data are classified, and finding the leaf nodes enables the test data to be pruned, reducing the number of tests. It builds the fastest, and a short, tree, and it produces understandable prediction rules from the training data. Its extension, C4.5, can handle both continuous and discrete attributes, allows missing values under attributes, and goes back through the tree once it has been created to remove branches that do not help, replacing them with leaf nodes.

However, if a small sample is tested with ID3, the data may be over-fitted or over-classified; only one attribute at a time is tested for making a decision; and it handles neither numeric attributes nor missing values. C4.5, for its part, can construct empty branches, and building the tree with meaningful branch values is the most crucial step for rule generation in C4.5. Information gain, ID3's splitting criterion, is also biased towards attributes with a large number of values, which may lead to the selection of an attribute that is non-optimal for prediction (overfitting) and may fragment the data into many small sets (fragmentation); this makes C4.5's gain ratio more reliable than information gain.

Real-world applications of an algorithm should permit numeric attributes, allow missing values, be robust in the presence of noise, and be able to approximate arbitrary concept descriptions (at least in principle). C4.5 is the result of extending ID3 precisely because these conditions are the limitations of its predecessor. The training dataset that will be formed from the application contains numerical attributes; therefore, C4.5's handling of numerical attributes makes it suitable for generating the suggested priority contacts.
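To make the contrast concrete, here is a minimal sketch of the two C4.5 refinements mentioned above: the gain ratio, which divides the information gain by the entropy of the partition sizes to counter the bias toward many-valued attributes, and threshold-based splitting of a numeric attribute. The call-count data is made up for illustration and is not the researchers' dataset:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def gain_and_ratio(left, right):
    """Information gain of a binary split, and C4.5's gain ratio: the gain
    divided by the entropy of the partition sizes themselves."""
    labels = left + right
    total = len(labels)
    gain = entropy(labels) - (len(left) / total) * entropy(left) \
                           - (len(right) / total) * entropy(right)
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))
    return gain, (gain / split_info if split_info else 0.0)

def best_numeric_threshold(values, labels):
    """C4.5-style numeric handling: try the midpoints between consecutive
    sorted values and keep the threshold with the highest gain ratio."""
    pairs = sorted(zip(values, labels))
    best = (0.0, None)
    for (a, _), (b, _) in zip(pairs, pairs[1:]):
        if a == b:
            continue                      # no boundary between equal values
        t = (a + b) / 2
        left  = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        _, ratio = gain_and_ratio(left, right)
        if ratio > best[0]:
            best = (ratio, t)
    return best  # (gain ratio, threshold)

# Made-up numeric attribute: calls exchanged with a contact
calls  = [0, 1, 2, 10, 12, 15]
labels = ["low", "low", "low", "high", "high", "high"]
print(best_numeric_threshold(calls, labels))  # (1.0, 6.0): split at six calls
```

In a C4.5-style tree, a test such as "calls <= 6" would become an internal node, which is how the algorithm accommodates the numerical attributes that the application's training dataset will contain.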
