Using Decision Trees To Extract Decision Rules

The results indicated that improper overtaking and not using a seat belt were the most important factors associated with crash severity. CART is particularly appropriate for studying traffic accidents because it is a non-parametric technique that does not require a priori probabilistic knowledge about the phenomenon under study and considers conditional interactions among input data [9]. Moreover, the CART method allows decision rules of the “if-then” type to be extracted, and these rules can be used to discover behaviors that occur within a particular set of data.

The aim of this work is therefore to use the CART method to identify the main factors that affect traffic injury severity and to extract decision rules that could be used in future road safety campaigns. The paper is organized in four major sections. Section 2 presents an introduction to the main concepts of the CART method and Decision Rules, and the database used in the analysis. Section 3 presents the results and discussion. Finally, Section 4 presents the main conclusions of the study.

2. Materials and Methods

2.1. CART

A decision tree (DT) can be defined as a predictive model that can represent both classifiers and regression models (depending on the nature of the variable class). When the target variable is discrete, a classification tree is developed, whereas a regression tree is developed for a continuous target variable. The CART method is a particular type of DT that allows either type of tree to be developed. In this work a classification tree is developed because the target variable (injury severity) is discrete (slightly injured, SI; killed or seriously injured, KSI).

A DT is a simple structure formed by a finite number of “nodes” (each of which represents an attribute variable) connected by “branches” (each of which represents one of the states of that variable) and, finally, “terminal nodes” or “leaves”, which specify the expected value of the variable class or target variable. The principle behind tree growing is to recursively partition the data so as to maximize the “purity” of the target variable in the child nodes. DTs are built recursively, following a descending strategy.

The root node (which contains all of the data) is divided into two branches (because the CART model generates binary trees) on the basis of the independent variable (splitter) that creates the best homogeneity. Each branch connects to a child node, and the data in each child node are more homogeneous than those in the parent node above it. Each child node is then split recursively until all of them are pure (all the cases are of the same class) or their “purity” cannot be increased. This is how the tree’s terminal nodes are formed, which are obtained according to the answer values of the variable class.

Griselda López et al. / Procedia – Social and Behavioral Sciences 53 (2012) 106

There are different splitting criteria; in the CART system the most commonly applied one is the Gini index (GI), which can be defined for node c as:

$$\mathrm{Gini}(c) = 1 - \sum_{j=1}^{J} p^{2}(j \mid c)$$

with

$$p(j \mid c) = \frac{p(j, c)}{p(c)}, \qquad p(j, c) = \frac{\pi(j)\, N_j(c)}{N_j}, \qquad p(c) = \sum_{j=1}^{J} p(j, c)$$

where J is the number of classes of the target variable; π(j) is the prior probability for class j; p(j | c) is the conditional probability of a case being in class j provided that it is in node c; N_j(c) is the number of cases of class j in node c; and N_j is the number of cases of class j in the root node.

The GI measures the degree of purity of a node, so when the GI is equal to zero the node is pure (all the cases in the node have the same class). When a CART is developed, the aim is to achieve the maximum purity in the nodes, so the best split is the one that minimizes the GI. Following this procedure, the maximal tree, which overfits the data, is created. To decrease its complexity, the tree is pruned using a cost-complexity measure that weighs precision against complexity (number of nodes and processing speed), searching for the tree that obtains the lowest value of this measure.
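The splitting step described above can be illustrated with a minimal sketch. It assumes that the prior probabilities are estimated from the class frequencies in the data, so that the conditional probability of class j in a node reduces to the share of the node's cases that belong to class j. The function names, the `lighting` variable, and the toy records are hypothetical and are not taken from the study.

```python
from collections import Counter


def gini(labels):
    """Gini index of a node: 1 - sum_j p(j|node)^2.

    Assumption: p(j|node) is estimated by the class frequency in the node.
    """
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((k / n) ** 2 for k in counts.values())


def best_binary_split(rows, labels, feature):
    """Evaluate every binary split 'feature == v' versus 'feature != v'.

    Returns (value, weighted_gini) for the split whose child nodes have the
    lowest size-weighted Gini index, or None if no valid split exists.
    """
    n = len(labels)
    best = None
    for v in sorted({r[feature] for r in rows}):
        left = [y for r, y in zip(rows, labels) if r[feature] == v]
        right = [y for r, y in zip(rows, labels) if r[feature] != v]
        if not left or not right:
            continue  # a split must send cases to both children
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if best is None or score < best[1]:
            best = (v, score)
    return best


# Hypothetical toy records (not data from the study).
rows = [{"lighting": "daylight"}] * 3 + [{"lighting": "night"}] * 3
labels = ["SI", "SI", "SI", "KSI", "KSI", "SI"]

root_gini = gini(labels)
split = best_binary_split(rows, labels, "lighting")
print(round(root_gini, 4), split[0], round(split[1], 4))
```

In this toy case the split lowers the weighted Gini index of the children below that of the root, which is the criterion the text describes for choosing the best splitter. Pruning is not covered by this sketch.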

A detailed description of the CART method can be found in Breiman et al. [10]. Following De Oña et al. [11], the goodness of a classification method is evaluated by its accuracy. Accuracy is the percentage of cases correctly classified by the classifier, and it is defined by the following equation:

$$\text{accuracy} = \frac{t_{SI} + t_{KSI}}{t_{SI} + t_{KSI} + f_{SI} + f_{KSI}} \times 100$$

where t_SI is the number of true cases of SI (i.e. correctly classified as SI); t_KSI is the number of true cases of KSI (i.e. correctly classified as KSI); f_SI is the number of false cases of SI (i.e. incorrectly classified as SI); and f_KSI is the number of false cases of KSI (i.e. incorrectly classified as KSI).
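As a small worked illustration of the accuracy measure, the sketch below computes the percentage of correctly classified cases from the four counts of true and false SI and KSI cases. The function name and the counts in the example are hypothetical and are not results from the study.

```python
def accuracy(t_si, t_ksi, f_si, f_ksi):
    """Percentage of cases correctly classified.

    t_si, t_ksi: cases correctly classified as SI and KSI.
    f_si, f_ksi: cases incorrectly classified as SI and KSI.
    """
    total = t_si + t_ksi + f_si + f_ksi
    if total == 0:
        raise ValueError("at least one classified case is required")
    return 100.0 * (t_si + t_ksi) / total


# Hypothetical counts: 70 of 100 cases are classified correctly.
example = accuracy(t_si=50, t_ksi=20, f_si=20, f_ksi=10)
print(example)
```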

On the other hand, one of the most valuable outcomes provided by CART analysis is the importance value of the independent variables that intervene in the model, which shows the impact of such predictor variables on the model.

2.2. Decision Rules

Decision Rules (DRs) can be obtained from the DT’s structure. DRs are important because they can be used to extract potentially useful information from the data. The rules have the form of a logic conditional: if “A” then “B”, where “A” is the antecedent (a state or a set of states of one or several variables) and “B” is the consequent (one state of the variable class).

The conditional structure (IF) of a DR begins at the root node. Each variable that intervenes in a tree division makes an IF of the rule, which ends in child nodes with a value of THEN, associated with the resulting class of the child node (the state of the variable class that shows the highest number of cases in the terminal node). A priori, as many rules can be identified as there are terminal nodes in the tree. However, two parameters (population, Pop; class probability, P) were used in order to extract the important rules that could provide useful information for the implementation of road safety strategies in the future.
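The extraction of rules from a tree can be sketched as a walk from the root to each terminal node, collecting one IF condition per division and taking the majority class of the terminal node as the THEN. The sketch below is only an illustration under stated assumptions: the nested-dictionary tree format, the function name, the variable names, and the counts are hypothetical. Pop is taken as the percentage of all cases that fall in the terminal node, P as the percentage of the node's cases that belong to the resulting class, and the default thresholds are the 1% and 60% minimum values used in the paper.

```python
def extract_rules(node, total, min_pop=1.0, min_p=60.0, path=()):
    """Return the IF-THEN rules of a tree that meet the Pop and P thresholds.

    Assumed (hypothetical) tree format:
      internal node: {"branches": [(condition_text, child_node), ...]}
      terminal node: {"counts": {class_name: number_of_cases}}
    """
    if "counts" in node:  # terminal node: build one candidate rule
        n = sum(node["counts"].values())
        if n == 0:
            return []
        cls, k = max(node["counts"].items(), key=lambda kv: kv[1])
        pop = 100.0 * n / total  # share of all cases in this node
        p = 100.0 * k / n        # share of the node in the resulting class
        if pop >= min_pop and p >= min_p:
            return [{"if": list(path), "then": cls, "pop": pop, "p": p}]
        return []
    rules = []
    for condition, child in node["branches"]:
        rules += extract_rules(child, total, min_pop, min_p, path + (condition,))
    return rules


# Hypothetical tree with 200 cases in total (not a tree from the study).
tree = {"branches": [
    ("accident_type = collision", {"counts": {"SI": 30, "KSI": 70}}),
    ("accident_type != collision", {"branches": [
        ("lighting = daylight", {"counts": {"SI": 80, "KSI": 19}}),
        ("lighting != daylight", {"counts": {"SI": 1, "KSI": 0}}),
    ]}),
]}

rules = extract_rules(tree, total=200)
for r in rules:
    print("IF", " AND ".join(r["if"]), "THEN", r["then"],
          f"(Pop={r['pop']:.1f}%, P={r['p']:.1f}%)")
```

The tree has three terminal nodes, so three rules exist a priori, but the third one covers fewer than 1% of the cases and is discarded by the Pop threshold.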

The parameters used can be defined as follows: population (Pop) is the percentage of cases of a node in relation to the total number of cases analysed; class probability (P) is the percentage of cases in the node that belong to the resulting class. The minimum values used so that the selected rules are representative are Pop ≥ 1% and P ≥ 60%.

2.3. Data

In this work, traffic accident data for rural highways in the province of Granada (south of Spain) have been used.

These data were obtained from the Spanish General Traffic Accident Directorate (DGT). The study period is 5 years (2004-2008), and only accidents with one vehicle involved were used in this analysis. The total number of accident records used is 1,801. Considering that the main objective of this study is to identify the principal factors that affect the severity of traffic accidents, 17 explanatory variables were used, based on De Oña et al. [11], and the injury severity level was taken as the class variable, with two classes (SI or KSI).

The data included variables describing the conditions that contributed to the accident and injury severity (see Table 1): characteristics of the accidents (month, time, day type, number of injuries, number of occupants, accident type and cause); weather information (atmospheric factors and lighting); driver characteristics (age and gender); and road characteristics (pavement width, lane width, shoulder width, paved shoulder, road markings and sight distance).