Let's continue by creating the test set, then visualize the data to gain insight into how the attributes relate to one another.

## 1. Create test set

Previously I described how to split the dataset into a training set and a test set by purely random sampling: we wrote a simple split_train_test() function, analyzed its problems, and then fixed them so that changes to the dataset do not affect model training.

Suppose a company wants to survey 1,000 people on a few questions. It would not pick them purely at random; it would try to make those 1,000 people representative of the whole population. For example, the US population is 51.3% women and 48.7% men, so a well-built sample of 1,000 should contain 513 women and 487 men. This is **stratified sampling**: the population is divided into homogeneous subsets called strata, and the right number of instances is drawn from each stratum.
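
The survey example can be sketched in a few lines of pandas. This is only an illustration: the `population` frame and its size of 10,000 are made up, and `groupby(...).sample(frac=...)` is one of several ways to draw the same fraction from every stratum.

```python
import pandas as pd

# Hypothetical population of 10,000 people: 51.3% women, 48.7% men.
population = pd.DataFrame({"sex": ["F"] * 5130 + ["M"] * 4870})

# Draw 10% from each stratum so the sample keeps the population's proportions.
sample = population.groupby("sex").sample(frac=0.1, random_state=42)

counts = sample["sex"].value_counts()
print(counts["F"], counts["M"])  # 513 487
```

Because every stratum is sampled at the same rate, the 1,000 sampled people contain exactly 513 women and 487 men, whatever the random seed.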

When predicting house prices in California, median income is a very important attribute. The median income histogram in Figure 1.1 shows that most values fall roughly between 1.5 and 6.0 (in tens of thousands of US dollars), so we can divide median income into five categories based on this range.

Note: each stratum needs enough instances, otherwise the estimate of that stratum's importance may be biased. So use only a few strata, and make each one large enough. This is why we look at the approximate distribution of median income first and only then decide how to stratify.

Figure 1.1 Median income histogram

```
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Use pd.cut() to bin median_income into categories.
# Key pd.cut() parameters:
#   x:      the one-dimensional array to bin
#   bins:   the bin edges, i.e. how to divide the values
#   labels: the label assigned to each bin
#   right:  whether each bin includes its right edge (default True), so income is
#           split into (0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0], (6.0, np.inf]
housing["income_cat"] = pd.cut(
    housing["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)
housing["income_cat"].hist()
# The result is shown in Figure 1.2
plt.show()
```

Figure 1.2 Income category histogram
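
To see how values on the bin edges are handled, here is a small sketch on made-up incomes (the `incomes` series is hypothetical; the bins and labels are the ones used above). With the default `right=True`, a value equal to an edge falls into the lower bin.

```python
import numpy as np
import pandas as pd

# Hypothetical median incomes, including values that sit exactly on bin edges.
incomes = pd.Series([0.5, 1.5, 1.6, 3.0, 4.4, 6.0, 6.1, 15.0])

income_cat = pd.cut(
    incomes,
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)
print(income_cat.tolist())  # [1, 1, 2, 2, 3, 4, 5, 5]
```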

With the categories in place, we can use Scikit-Learn's StratifiedShuffleSplit class to create the test set.

```
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
# split() yields positional indices; .loc matches them as long as
# housing keeps its default integer index
for train_index, test_index in split.split(housing, housing["income_cat"]):
    start_train_set = housing.loc[train_index]
    start_test_set = housing.loc[test_index]
# Check the proportion of instances in each of the 5 categories
print(start_test_set["income_cat"].value_counts() / len(start_test_set))
# Output
# 3    0.350533
# 2    0.318798
# 4    0.176357
# 5    0.114341
# 1    0.039971
# Name: income_cat, dtype: float64
```

A look at the parameters of StratifiedShuffleSplit:

- n_splits: the number of train/test split pairs to generate.
- test_size: the proportion of the data used for the test set. You can set train_size instead, which is the proportion used for the training set.
- random_state: the random seed; the familiar 42 again.

Parameters of split.split():

- X: the data to sample from.
- y: the labels to stratify by. In the output above, category 3 makes up about 0.35 of the test set and category 2 about 0.32.

In short, StratifiedShuffleSplit generates n_splits (1 in this example) pairs of training and test sets, and draws each pair so that the proportions of the categories in y are preserved in both sets.
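
A minimal sketch of what n_splits does, on a made-up dataset (the `data` frame and its 10% / 30% / 60% category shares are assumptions for illustration): with `n_splits=3` the splitter yields three different train/test pairs, and each test set keeps the same category shares.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical toy data: 1,000 rows with category shares of 10%, 30%, and 60%.
data = pd.DataFrame({
    "value": np.arange(1000),
    "category": [1] * 100 + [2] * 300 + [3] * 600,
})

split = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=42)
test_shares = []
for train_index, test_index in split.split(data, data["category"]):
    test_set = data.iloc[test_index]  # split() yields positional indices
    shares = test_set["category"].value_counts(normalize=True).sort_index()
    test_shares.append(shares.round(2).tolist())

print(test_shares)  # [[0.1, 0.3, 0.6], [0.1, 0.3, 0.6], [0.1, 0.3, 0.6]]
```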

Now that we have the stratified sample, we can compare the category proportions in the full dataset, the stratified test set, and a purely random test set to check whether the result matches our expectations.

Figure 1.3 Sampling bias comparison of stratified versus purely random sampling

The category proportions in the stratified test set are almost identical to those in the full dataset, whereas the purely random test set is noticeably skewed.
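
A comparison table like this can be built roughly as follows. This is a sketch on synthetic data, not the code behind Figure 1.3: the `data` frame, the category probabilities, and the `category_shares` helper are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

# Synthetic stand-in for the income categories (the probabilities are assumed).
rng = np.random.default_rng(42)
data = pd.DataFrame({
    "income_cat": rng.choice([1, 2, 3, 4, 5], size=10_000,
                             p=[0.04, 0.32, 0.35, 0.18, 0.11]),
})

def category_shares(df):
    """Share of each income category in df, ordered by category."""
    return df["income_cat"].value_counts(normalize=True).sort_index()

# Stratified test set
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_index, test_index = next(splitter.split(data, data["income_cat"]))
stratified_test = data.iloc[test_index]

# Purely random test set
_, random_test = train_test_split(data, test_size=0.2, random_state=42)

compare = pd.DataFrame({
    "overall": category_shares(data),
    "stratified": category_shares(stratified_test),
    "random": category_shares(random_test),
})
compare["stratified_error"] = compare["stratified"] - compare["overall"]
compare["random_error"] = compare["random"] - compare["overall"]
print(compare)
```

The stratified column can differ from the overall column only by rounding (at most one instance per category), while the random column drifts by whatever the draw happens to produce.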

The income_cat attribute helped us see that stratified sampling produces a more representative test set. Finally, don't forget to remove it so the data is back in its original state.

```
# Drop the helper column from both sets
for set_ in (start_train_set, start_test_set):
    set_.drop("income_cat", axis=1, inplace=True)
```

## 2. Data exploration and data visualization

Earlier we only took a quick look at the data; now we explore it in more depth. We explore only the training set and leave the test set untouched. Working directly on the training set, as we did before, is unsafe because it is easy to corrupt the data, so we make a copy first.

```
# housing is now a copy of the training set; copy() takes a deep parameter,
# and deep=True (the default) makes a deep copy
housing = start_train_set.copy()
```
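
A quick way to convince yourself that the copy protects the original, sketched on a made-up frame (`original` and `working_copy` are hypothetical names):

```python
import pandas as pd

original = pd.DataFrame({"median_income": [1.5, 3.0, 4.5]})
working_copy = original.copy()        # deep=True is the default
working_copy["median_income"] *= 10   # modify only the copy

print(original["median_income"].tolist())      # [1.5, 3.0, 4.5]
print(working_copy["median_income"].tolist())  # [15.0, 30.0, 45.0]
```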

### 2.1 Visualize geography

The data includes geographic coordinates, so we can plot longitude on the x-axis and latitude on the y-axis to visualize it.

```
housing.plot(kind="scatter", x="longitude", y='latitude')
```

Figure 2.1 Geographic scatter plot

This shows only a rough outline and no other information. Setting the alpha option (opacity) to 0.1 makes it much easier to spot where the data points are dense.

```
# The smaller alpha is, the more transparent each point, so dense areas appear darker
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)
```

Figure 2.2 Geographic scatter plot highlighting high-density areas

That looks much better: the high-density areas are clearly visible. Besides density, let's now look at how house prices are distributed.

```
# `c` is the color sequence; here it is the median house value, so the higher
# the price the redder the point, and the lower the price the bluer the point
# `s` is the marker size; here it is the district population, divided by 100
# because the raw numbers are large
# `cmap` is the predefined colormap; here we use jet directly
scatter = plt.scatter(x=housing["longitude"], y=housing["latitude"], label="Population",
                      c=housing["median_house_value"], s=housing["population"] / 100,
                      cmap=plt.get_cmap("jet"))
# Show the legend with the label
plt.legend()
# Show the colorbar and set its label
plt.colorbar(scatter).set_label("Median house value")
```

Figure 2.3 California house prices

Readers who are unfamiliar with California's geography may want to look at a map of the state. Figure 2.3 shows that house prices along the coast are very high, and that prices are closely related to location and population density. For a problem like this, we could use a **clustering algorithm** to detect the main clusters and then add new features that measure the distance to each cluster center. Note, however, that house prices on the coast of Northern California are not especially high, so proximity to the ocean is not a simple rule.
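
The clustering idea could look roughly like this. It is a sketch under several assumptions: the coordinates are synthetic, KMeans is just one possible clustering algorithm, the number of clusters is picked by hand, and plain Euclidean distance on longitude/latitude is only a rough stand-in for real geographic distance.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic (longitude, latitude) points around two made-up dense areas.
rng = np.random.default_rng(42)
area_a = rng.normal(loc=[-122.4, 37.8], scale=0.1, size=(100, 2))
area_b = rng.normal(loc=[-118.2, 34.1], scale=0.1, size=(100, 2))
coords = np.vstack([area_a, area_b])

# Detect the main clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
kmeans.fit(coords)

# transform() returns the distance from every point to every cluster center;
# each column can be added to the dataset as a new feature.
distances = kmeans.transform(coords)
print(distances.shape)  # (200, 2)
```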

### 2.2 Looking for relevance

We can compute the standard correlation coefficient (also called Pearson's $r$) between every pair of attributes. It measures the linear correlation between two variables X and Y and ranges from -1 to 1. **The closer it is to 1, the stronger the positive correlation.** **The closer it is to -1, the stronger the negative correlation.**
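
For reference, the usual definition of Pearson's $r$ for $n$ paired observations $(x_i, y_i)$ with sample means $\bar{x}$ and $\bar{y}$ is given below. These symbols are introduced here for illustration and do not come from the text above.

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^{2}} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^{2}}}
$$

The numerator is positive when x and y tend to sit on the same side of their means and negative when they sit on opposite sides; the denominator rescales the result into the range from -1 to 1.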

Figure 2.4 Examples of Pearson correlation coefficients

Note that the correlation coefficient only measures linear correlation (“if x goes up, then y generally goes up/down”), so it may completely miss nonlinear relationships (for example, “if x is close to 0, then y generally goes up”). In the bottom row of Figure 2.4, every correlation coefficient is 0 even though the two axes are clearly not independent: these are examples of nonlinear relationships. In the second row the coefficients are 1 or -1, which shows that the correlation coefficient has nothing to do with the slope.
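
Both points can be checked numerically. The sketch below uses made-up data: three straight lines with different slopes, and a parabola that depends entirely on x yet has no linear trend.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)  # symmetric around 0

# Straight lines: the slope changes, the correlation does not.
r_steep = np.corrcoef(x, 5.0 * x)[0, 1]
r_shallow = np.corrcoef(x, 0.1 * x)[0, 1]
r_negative = np.corrcoef(x, -2.0 * x)[0, 1]

# A parabola: y is fully determined by x, but the relationship is not linear.
r_parabola = np.corrcoef(x, x ** 2)[0, 1]

print(r_steep, r_shallow, r_negative, r_parabola)
```

The first two coefficients come out as 1 and the third as -1 (up to floating-point rounding) regardless of the slope, while the coefficient for the parabola is essentially 0.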

With the corr() method, it is easy to compute the correlation between every pair of attributes in the training set.

```
# Compute the correlation coefficients and print them in descending order
# numeric_only=True skips non-numeric columns (needed on recent pandas versions)
corr_matrix = housing.corr(numeric_only=True)
print(corr_matrix["median_house_value"].sort_values(ascending=False))
# Output
# median_house_value    1.000000
# median_income         0.687151
# total_rooms           0.135140
# housing_median_age    0.114146
# households            0.064590
# total_bedrooms        0.047781
# population           -0.026882
# longitude            -0.047466
# latitude             -0.142673
# Name: median_house_value, dtype: float64
```

The output shows a strong positive correlation between median income and median house value. There is also a slight negative correlation between latitude and median house value: prices tend to drop slightly as you go north.

pandas also provides the scatter_matrix function, which plots every numeric attribute against every other numeric attribute. With 11 attributes that would be $11^{2}=121$ plots, so I show only a few of the attributes that seem most related to the median house value.
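
A minimal sketch of a scatter_matrix call, using a synthetic frame rather than the real housing data (the `toy` frame and its three columns are assumptions; on the real data you would pass the selected columns of housing instead). The Agg backend line is only there so the sketch runs without a display and can be dropped in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic stand-in for three housing attributes.
rng = np.random.default_rng(42)
income = rng.uniform(0.5, 10.0, size=200)
toy = pd.DataFrame({
    "median_income": income,
    "median_house_value": 40_000 * income + rng.normal(0, 30_000, size=200),
    "housing_median_age": rng.integers(1, 52, size=200),
})

# One subplot per pair of columns; the diagonal shows each column's histogram.
axes = scatter_matrix(toy, figsize=(9, 6))
print(axes.shape)  # (3, 3)
```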

Figure 2.5 Scatter matrix of correlations between attributes

Figure 2.6 Median income versus median house value

The plot of median income against median house value shows the price cap we discussed earlier as a clear horizontal line at 500,000 dollars. There are also a few less obvious lines, for example around 450,000 and 380,000 dollars. If this data is left as is, the algorithm may learn to reproduce these quirks, so we will clean the data later by removing the corresponding districts.
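
One way the capped districts might be found and removed is sketched below. The `toy` frame is made up, and the sketch assumes the cap is simply the largest value in the column; on the real data you should check the actual cap value first.

```python
import pandas as pd

# Hypothetical districts; three of them sit exactly at the price cap.
toy = pd.DataFrame({
    "median_income": [2.0, 3.5, 8.0, 9.5, 12.0, 4.0],
    "median_house_value": [150_000, 220_000, 500_000, 500_000, 500_000, 310_000],
})

cap = toy["median_house_value"].max()               # assumed to be the cap
n_capped = (toy["median_house_value"] == cap).sum()
cleaned = toy[toy["median_house_value"] < cap]      # keep only uncapped districts

print(cap, n_capped, len(cleaned))  # 500000 3 3
```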

Copyright notice

This article was written by [Don't teach dreams]. Please include a link to the original when reposting. Thank you.

https://cdmana.com/2022/134/202205141312285042.html