
Use the adult data set from the book series web site for the following exercises. The target
variable is income, and the goal is to classify income based on the other variables.
2. Which variables are categorical, and which are continuous?
3. Using software, construct a table of the first 10 records of the data set, in order to get a feel for the data.
4. Investigate whether we have any correlated variables.
5. For each of the categorical variables, construct a bar chart of the variable, with an overlay of the target variable. Normalize if necessary. a. Discuss the relationship, if any, each of these variables has with the target variable. b. Which variables would you expect to make a significant appearance in any data mining classification model we work with?
6. For each pair of categorical variables, construct a cross tabulation. Discuss your salient results.
7. Report on whether anomalous fields exist in this data set, based on your EDA, which fields these are, and what we should do about them.
8. Report the mean, median, minimum, maximum, and standard deviation for each of the numerical variables.
9. Construct a histogram of each numerical variable, with an overlay of the target variable income. Normalize if necessary. a. Discuss the relationship, if any, each of these variables has with the target variable. b. Which variables would you expect to make a significant appearance in any data mining classification model we work with?
10. For each pair of numerical variables, construct a scatter plot of the variables. Discuss your salient results.
11. Based on your EDA so far, identify interesting sub-groups of records within the data set that would be worth further investigation.
12. Apply binning to one of the numerical variables. Do it in such a way as to maximize the effect of the classes thus created (following the suggestions in the text). Now do it in such a way as to minimize the effect of the classes so that the difference between the classes is diminished. Comment.
13. Refer to the previous exercise. Apply the other two binning methods (equal width, and equal number of records) to this same variable. Compare the results and discuss the differences. Which method do you prefer?
14. Summarize your salient EDA findings from the above exercises, just as if you were writing a report.
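
Several of the exercises above (the first 10 records, the summary statistics, the normalized overlay, and the equal-width versus equal-number-of-records binning) can be sketched with pandas. This is only a minimal illustration under stated assumptions, not the book's own method: the small DataFrame below is a made-up stand-in for the adult data set, and the column names age, education, and income are assumed for the example rather than taken from the real file.

```python
import pandas as pd

# Hypothetical stand-in for the adult data set. The column names and values
# are illustrative assumptions, not records from the real file.
df = pd.DataFrame({
    "age": [25, 38, 28, 44, 18, 34, 29, 63, 24, 55, 65, 36],
    "education": ["HS", "BS", "HS", "MS", "HS", "BS",
                  "HS", "MS", "BS", "HS", "BS", "MS"],
    "income": ["<=50K", ">50K", "<=50K", ">50K", "<=50K", "<=50K",
               "<=50K", ">50K", "<=50K", ">50K", ">50K", ">50K"],
})

# Exercise 3: look at the first 10 records.
first_ten = df.head(10)

# Exercise 8: mean, median, minimum, maximum, and standard deviation
# of a numerical variable.
summary = df["age"].agg(["mean", "median", "min", "max", "std"])

# Exercises 5 and 6: cross tabulation of a categorical variable against the
# target. normalize="index" makes each row sum to 1, which is the tabular
# counterpart of a normalized bar chart with a target overlay.
edu_by_income = pd.crosstab(df["education"], df["income"], normalize="index")

# Exercise 13: two binning methods applied to the same numerical variable.
# pd.cut splits the range into intervals of equal width;
# pd.qcut places (approximately) the same number of records in each bin.
equal_width = pd.cut(df["age"], bins=3)
equal_count = pd.qcut(df["age"], q=3)

# Count the records per bin, ordered by interval, to compare the methods.
width_counts = equal_width.value_counts().sort_index()
count_counts = equal_count.value_counts().sort_index()

print(first_ten)
print(summary)
print(edu_by_income)
print(width_counts)
print(count_counts)
```

With the real data, replacing the stand-in DataFrame with the loaded file and calling `edu_by_income.plot(kind="bar", stacked=True)` (which requires matplotlib) would give the normalized bar chart itself. Comparing `width_counts` with `count_counts` shows how equal-width bins can hold very different numbers of records, while the equal-count bins are balanced by construction.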