Let's continue by creating a test set, and then move on to data visualization to gain insight into correlations in the data.
1. Create a test set
Earlier, I discussed how to split a dataset into a training set and a test set using pure random sampling. We will start by writing a simple split_train_test() function, analyze the problems it has, and finally improve on them, so that model training is not affected by changes to the dataset.
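As a reminder, a minimal sketch of such a pure-random split function might look like this (the synthetic `data` frame is just for illustration):

```python
import numpy as np
import pandas as pd

def split_train_test(data, test_ratio):
    # Shuffle the row positions, then slice off the first test_ratio fraction
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

# The weakness: with no fixed seed, every run produces a different split,
# so over many runs the model effectively "sees" the whole dataset
data = pd.DataFrame({"a": range(100)})
train_set, test_set = split_train_test(data, 0.2)
print(len(train_set), len(test_set))  # 80 20
```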
If a company needs to survey 1,000 people on a few questions, it won't pick people purely at random; instead, it will try to make those 1,000 people representative of the whole population. For example, the US population is 51.3% women and 48.7% men, so a well-chosen sample of 1,000 people should contain 513 women and 487 men. This is stratified sampling: first divide the population into homogeneous subsets, each called a stratum, then draw the correct number of instances from each stratum.
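The allocation arithmetic in the survey example above can be sketched in a couple of lines (the stratum names and proportions are taken from the example):

```python
# Strata and their population proportions from the survey example
population_props = {"female": 0.513, "male": 0.487}
sample_size = 1000

# Stratified sampling allocates the sample in proportion to each stratum
allocation = {group: round(p * sample_size) for group, p in population_props.items()}
print(allocation)  # {'female': 513, 'male': 487}
```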
When predicting house prices in California, median income is a very important attribute. Looking at the median income histogram in Figure 1.1, we can easily see that most median incomes cluster between $15,000 and $60,000, so we can divide median income into five categories based on this range.
Note: each stratum should contain enough instances, otherwise its importance may be misestimated. This means we should not create too many strata, and each stratum should be large enough. That is why we look at the rough distribution of median income first, and only then decide how to stratify.
Figure 1.1 Median income histogram
```python
# pd.cut() is used here to bin median_income.
# Its main parameters, briefly:
#   x:      a one-dimensional array to bin
#   bins:   the bin edges, i.e. how the values are divided
#   labels: the labels for the resulting bins
#   right:  whether the right edge is included; defaults to True, so income
#           is divided into (0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0],
#           and (6.0, np.inf], five categories in total
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=(0.0, 1.5, 3.0, 4.5, 6.0, np.inf),
                               labels=(1, 2, 3, 4, 5))
housing["income_cat"].hist()  # shown in Figure 1.2
plt.show()
```
Figure 1.2 Income category histogram
Now that we have the categories, we can use Scikit-Learn's StratifiedShuffleSplit class to create the test set.
```python
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    start_train_set = housing.loc[train_index]
    start_test_set = housing.loc[test_index]

# Check the proportion of each of the 5 categories in the test set
print(start_test_set["income_cat"].value_counts() / len(start_test_set))

# Output
# 3    0.350533
# 2    0.318798
# 4    0.176357
# 5    0.114341
# 1    0.039971
# Name: income_cat, dtype: float64
```
Let's take a look at StratifiedShuffleSplit's parameters:
- n_splits: the number of train/test splits to generate from the full dataset.
- test_size: the proportion of the test set; alternatively, train_size can be set to specify the proportion of the training set.
- random_state: the random seed, the familiar 42.
And the parameters of split.split():
- X: the data that needs stratified sampling.
- y: the labels whose class proportions should be preserved in each split; as in the output above, label 3 keeps its ~0.35 share and label 2 its ~0.32 share.
So the StratifiedShuffleSplit process is: first split the full dataset into n (here, 1) train/test pairs, then stratify according to the configured requirements, and finally output the stratified training and test sets.
Now that we have stratified data, we can compare the category proportions in the complete dataset, the stratified test set, and a purely random test set, to see whether the result really matches our expectations.
Figure 1.3 Comparison of sampling bias between stratified sampling and pure random sampling
We can see that the category proportions in the stratified test set are almost identical to those in the complete dataset, while pure random sampling shows a noticeable bias.
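A comparison table like the one in Figure 1.3 can be computed as follows. This is a sketch: the synthetic `housing` frame here only stands in for the real dataset, so the exact error numbers will differ, but the pattern (tiny stratified error, larger random error) is the point.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

# Synthetic stand-in for the real housing data
rng = np.random.default_rng(42)
housing = pd.DataFrame({"median_income": rng.lognormal(mean=1.2, sigma=0.4, size=5000)})
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=(0.0, 1.5, 3.0, 4.5, 6.0, np.inf),
                               labels=(1, 2, 3, 4, 5))

def income_cat_proportions(data):
    return data["income_cat"].value_counts() / len(data)

# Stratified split
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_idx, test_idx in split.split(housing, housing["income_cat"]):
    strat_test_set = housing.loc[test_idx]

# Pure-random split for comparison
_, random_test_set = train_test_split(housing, test_size=0.2, random_state=42)

compare = pd.DataFrame({
    "Overall": income_cat_proportions(housing),
    "Stratified": income_cat_proportions(strat_test_set),
    "Random": income_cat_proportions(random_test_set),
}).sort_index()
compare["Strat. error %"] = 100 * compare["Stratified"] / compare["Overall"] - 100
compare["Rand. error %"] = 100 * compare["Random"] / compare["Overall"] - 100
print(compare)
```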
income_cat has helped us see that stratified sampling produces a better test set. Finally, don't forget to restore the data to its original state by dropping the column:
```python
for set_ in (start_train_set, start_test_set):
    set_.drop("income_cat", axis=1, inplace=True)
```
2. Data exploration and data visualization
Previously, we only skimmed through the data quickly; now we are going to explore the dataset in more depth. At this stage we should explore only the training set and not touch the test set yet. Working directly on the training set, as we did before, is risky, because it is easy to corrupt the dataset, so let's make a copy first.
```python
# housing is now a copy of the training set; copy() also has a deep parameter,
# and deep=True performs a deep copy
housing = start_train_set.copy()
```
2.1 Visualizing geographic data
The data includes geographic coordinates, so we can use longitude and latitude as the x and y axes to visualize it.
```python
housing.plot(kind="scatter", x="longitude", y="latitude")
```
Figure 2.1 Geographic scatter plot
But this only shows a rough outline, without much other information. We can set the alpha option (transparency) to 0.1, which makes the locations of high-density data points stand out.
```python
# the smaller alpha is, the more transparent each point becomes
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)
```
Figure 2.2 Geographic scatter plot highlighting high-density areas
That looks much better: you can clearly see where the high-density areas are. Having looked at density, let's now look at the distribution of house prices.
```python
# c:    the color sequence; here the median house price, so higher prices
#       appear redder and lower prices appear bluer
# s:    the size of each scatter point; here the district population, divided
#       by 100 because the raw numbers are large
# cmap: the colormap to use; here the predefined jet colormap
scatter = plt.scatter(x=housing["longitude"], y=housing["latitude"],
                      label="population",
                      c=housing["median_house_value"],
                      s=housing["population"] / 100,
                      cmap=plt.get_cmap("jet"))
plt.legend()  # display the label
plt.colorbar(scatter).set_label("median house value")  # show the colorbar with its label
```
Figure 2.3 California house prices
If you are not familiar with California's geography, look up a map of California. From Figure 2.3 it is easy to see that house prices along the coast are still very high, and that house prices are closely related to geographic location and population. Faced with this kind of problem, we might use a clustering algorithm to detect the main clusters first, and then add a new feature measuring the distance to each cluster center. However, we can also see that prices along the Northern California coast are not particularly high, so this simple rule has its shortcomings.
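The cluster-distance idea above could be sketched like this. This is a hypothetical feature-engineering step, not code from this walkthrough: the synthetic `coords` array stands in for housing[["longitude", "latitude"]].

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic geographic blobs standing in for real coordinates
rng = np.random.default_rng(0)
coords = np.vstack([rng.normal((-122.0, 38.0), 0.5, size=(100, 2)),
                    rng.normal((-118.0, 34.0), 0.5, size=(100, 2))])

# Detect the primary clusters...
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(coords)

# ...then add one distance-to-center feature per cluster
dist_features = kmeans.transform(coords)  # shape: (n_samples, n_clusters)
print(dist_features.shape)  # (200, 2)
```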
2.2 Looking for correlations
We can compute the standard correlation coefficient (also called Pearson's r) between every pair of attributes. The Pearson correlation coefficient measures the (linear) correlation between two variables X and Y. Its value lies between -1 and 1: the closer to 1, the stronger the positive correlation; the closer to -1, the stronger the negative correlation.
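A small sketch makes the definition concrete: the `pearson` helper below is a hand-rolled version of the formula (the variable names are just for illustration), applied to a perfectly linear relationship and to a nonlinear one.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x + 1      # perfect linear relation, so r = 1
z = (x - 3) ** 2   # nonlinear and symmetric around x = 3, so r = 0

def pearson(a, b):
    # Pearson's r: covariance divided by the product of standard deviations
    a, b = a - a.mean(), b - b.mean()
    return (a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum())

print(round(pearson(x, y), 6))  # 1.0
print(round(pearson(x, z), 6))  # 0.0
```

Note how the second result previews the caveat below: x and z are strongly related, yet their correlation coefficient is 0.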
Figure 2.4 Examples of Pearson correlation coefficients
Note that the correlation coefficient only measures linear correlation ("as x rises, y rises/falls"), so it may completely miss nonlinear relationships (for example, "as x approaches 0, y rises"). Look at the bottom row of Figure 2.4: those plots have a correlation coefficient of 0, yet the horizontal and vertical axes are clearly not independent. These are examples of nonlinear relationships. The second row shows coefficients of 1 or -1, which demonstrates that the correlation coefficient has nothing to do with the slope.
We can use the corr() method to easily compute the correlation between every pair of attributes in the training set.
```python
# compute the correlation matrix and print values in descending order
corr_matrix = housing.corr()
print(corr_matrix["median_house_value"].sort_values(ascending=False))

# Output
# median_house_value    1.000000
# median_income         0.687151
# total_rooms           0.135140
# housing_median_age    0.114146
# households            0.064590
# total_bedrooms        0.047781
# population           -0.026882
# longitude            -0.047466
# latitude             -0.142673
# Name: median_house_value, dtype: float64
```
From the output, we can see a strong positive correlation between median income and median house price, and a slight negative correlation with latitude: prices tend to drop as you go farther north.
pandas also offers a scatter_matrix function, which plots every numeric attribute against every other numeric attribute. With 11 numeric attributes that would be 121 plots, so I will only show a few of the attributes most correlated with the median house price.
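A sketch of the scatter_matrix call: the synthetic frame below only stands in for the real training set (so the plots won't match Figure 2.5), but the attribute selection mirrors the approach described above.

```python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless

# Synthetic stand-in for the training set
rng = np.random.default_rng(42)
income = rng.uniform(0.5, 10.0, size=500)
housing = pd.DataFrame({
    "median_income": income,
    "median_house_value": 50_000 * income + rng.normal(0, 30_000, 500),
    "total_rooms": rng.integers(100, 5000, 500),
    "housing_median_age": rng.integers(1, 52, 500),
})

# Only the most promising attributes: an n x n grid of pairwise scatter
# plots, with each attribute's histogram on the diagonal
attributes = ["median_house_value", "median_income",
              "total_rooms", "housing_median_age"]
axes = scatter_matrix(housing[attributes], figsize=(12, 8))
print(axes.shape)  # (4, 4)
```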
Figure 2.5 Scatter matrix of correlations between attributes
Figure 2.6 Median income versus median house price
The plot of median income against median house price shows the price-cap problem we discussed: there is a clear horizontal line where the median house price hits $500,000, and a few fainter ones, for example around $450,000 and $380,000. If we don't deal with this data, the algorithm may learn to reproduce these quirks. So in a later data clean-up step, we will need to remove the corresponding districts.
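As a sketch of that clean-up step (the toy frame below stands in for `housing`; the assumption, which matches this dataset, is that the ceiling shows up as the exact repeated value 500001):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `housing`: 95 ordinary districts plus
# 5 districts capped at the price ceiling
rng = np.random.default_rng(42)
values = rng.uniform(50_000, 450_000, size=95)
housing = pd.DataFrame({"median_house_value": np.concatenate([values, [500001] * 5])})

# Drop the capped districts so the model doesn't learn the artificial ceiling
capped = housing["median_house_value"] >= 500001
housing_clean = housing.loc[~capped]
print(len(housing), len(housing_clean))  # 100 95
```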
This article was written by [Don't teach dreams]; please include a link to the original when reposting. Thanks.