Original poster | Posted on 2017-4-13 19:07:07
Exploring and understanding data
Exploring the structure of data
If you are fortunate, your source will provide a data dictionary, which is a document that describes the dataset's features. In our case, the used car data does not come with this documentation, so we'll need to create one on our own.
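A minimal sketch of how such a data dictionary might be started. The usedcars data frame below is a made-up stand-in (its columns and values are illustrative assumptions, not the original data), and data_dictionary is a hypothetical name. It lists each feature with its R class and leaves a description column to fill in by hand.

```r
# Hypothetical stand-in for the used car data; all values are made up.
usedcars <- data.frame(
  year = c(2011, 2011, 2010, 2009, 2012),
  price = c(21992, 20995, 19995, 17809, 17500),
  mileage = c(7413, 10926, 7351, 11613, 8367),
  color = c("Yellow", "Gray", "Silver", "Gray", "White"),
  stringsAsFactors = FALSE
)

# One row per feature: its name, its R class, and an empty description.
data_dictionary <- data.frame(
  feature = names(usedcars),
  class = unname(vapply(usedcars, function(col) class(col)[1], character(1))),
  description = "",
  stringsAsFactors = FALSE
)
print(data_dictionary)

# str() prints a compact outline of the same structure.
str(usedcars)
```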
The str() function displays the structure of R data structures such as data frames, vectors, or lists. It can be used to create the basic outline for our data dictionary (see Figure 1).
Exploring numeric variables
To investigate the numeric variables in the used car data, we will employ a common set of measurements to describe values, known as summary statistics. The summary() function displays several common summary statistics. Let's take a look at a single feature, year (see Figures 2 and 3).
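As a rough illustration, summary() can be called on one feature or on several at once. The data frame here is a made-up stand-in, so the numbers will not match the figures.

```r
# Hypothetical stand-in for the used car data; all values are made up.
usedcars <- data.frame(
  year = c(2011, 2011, 2010, 2009, 2012),
  price = c(21992, 20995, 19995, 17809, 17500),
  mileage = c(7413, 10926, 7351, 11613, 8367)
)

# Summary statistics for a single feature.
year_summary <- summary(usedcars$year)
print(year_summary)

# Summary statistics for several features at once.
print(summary(usedcars[c("price", "mileage")]))
```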
Measuring the central tendency – mean and median
The mean() and median() functions (see Figure 4).
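A small sketch of both functions on a made-up price vector (the values are assumptions for illustration only):

```r
# Made-up prices, for illustration only.
price <- c(21992, 20995, 19995, 17809, 17500)

price_mean <- mean(price)      # arithmetic average
price_median <- median(price)  # middle value of the sorted data

print(price_mean)
print(price_median)
```

The mean is pulled toward extreme values, while the median only depends on the middle of the sorted data, so comparing the two gives a quick hint about skew.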
Measuring spread – quartiles and the five-number summary
The five numbers are: minimum, first quartile (Q1), median, third quartile (Q3), and maximum.
The span between the minimum and maximum value is known as the range. In R, the range() function returns both the minimum and maximum value. Combining range() with the diff() difference function allows you to examine the range of data with a single line of code.
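For instance, on a made-up price vector (illustrative values, not the original data):

```r
# Made-up prices, for illustration only.
price <- c(21992, 20995, 19995, 17809, 17500)

price_range <- range(price)       # c(minimum, maximum)
price_span <- diff(range(price))  # maximum minus minimum, in one line

print(price_range)
print(price_span)
```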
Quartiles are a special case of a type of statistics called quantiles, which are numbers that divide data into equally sized quantities. In addition to quartiles, commonly used quantiles include tertiles (three parts), quintiles (five parts), deciles (10 parts), and percentiles (100 parts).
The middle 50 percent of data between the first and third quartiles is of particular interest because it in itself is a simple measure of spread. The difference between Q1 and Q3 is known as the Interquartile Range (IQR), and it can be calculated with the IQR() function.
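A quick check on made-up prices (illustrative values only) that IQR() matches Q3 minus Q1 as computed by quantile() with its default settings:

```r
# Made-up prices, for illustration only.
price <- c(21992, 20995, 19995, 17809, 17500)

price_iqr <- IQR(price)

# The same value computed by hand from the first and third quartiles.
quartiles <- quantile(price, probs = c(0.25, 0.75))
manual_iqr <- unname(quartiles[2] - quartiles[1])

print(price_iqr)
print(manual_iqr)
```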
The quantile() function provides a robust tool to identify quantiles for a set of values. By default, the quantile() function returns the five-number summary.
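A short sketch on made-up prices (illustrative values only). The probs argument selects which quantiles to return; the cut points chosen here are just examples.

```r
# Made-up prices, for illustration only.
price <- c(21992, 20995, 19995, 17809, 17500)

# Default: minimum, Q1, median, Q3, maximum.
five_num <- quantile(price)
print(five_num)

# Specific percentiles, here the 1st and 99th.
print(quantile(price, probs = c(0.01, 0.99)))

# Quintiles: cut points at every 20 percent.
quintiles <- quantile(price, probs = seq(from = 0, to = 1, by = 0.20))
print(quintiles)
```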
Visualizing numeric variables – boxplots (see Figures 5, 6, and 7)
From bottom to top, the horizontal lines of the box in each figure mark Q1, Q2 (the median), and Q3.
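A minimal boxplot sketch on made-up prices; the title and axis label are assumptions chosen for illustration. boxplot() also returns the statistics it draws, which makes it easy to confirm that the middle line is the median. Note that R draws the box edges at the hinges, which are close to, but not always identical to, Q1 and Q3 as returned by quantile().

```r
# Made-up prices, for illustration only.
price <- c(21992, 20995, 19995, 17809, 17500, 17495,
           17000, 16995, 16995, 14677, 12995, 6980)

# Draw the boxplot and keep the statistics it is based on.
bp <- boxplot(price, main = "Boxplot of Used Car Prices", ylab = "Price ($)")

# Rows: lower whisker, lower hinge, median, upper hinge, upper whisker.
print(bp$stats)

# Points drawn individually as outliers, if any.
print(bp$out)
```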
Visualizing numeric variables – histograms (see Figures 8, 9, and 10)
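A minimal histogram sketch on made-up prices; the title and axis label are assumptions chosen for illustration. hist() returns the bin boundaries and counts it used, which can be inspected alongside the plot.

```r
# Made-up prices, for illustration only.
price <- c(21992, 20995, 19995, 17809, 17500, 17495,
           17000, 16995, 16995, 14677, 12995, 6980)

# Draw the histogram and keep the binning it used.
h <- hist(price, main = "Histogram of Used Car Prices", xlab = "Price ($)")

print(h$breaks)  # bin boundaries chosen by hist()
print(h$counts)  # number of values falling in each bin
```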
Measuring spread – variance and standard deviation (see Figure 11)
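A short sketch on made-up prices (illustrative values only). R's var() and sd() use the sample formulas with an n - 1 denominator, which the manual calculation below reproduces.

```r
# Made-up prices, for illustration only.
price <- c(21992, 20995, 19995, 17809, 17500)

price_var <- var(price)  # sample variance
price_sd <- sd(price)    # sample standard deviation

# Variance computed by hand: squared deviations from the mean over n - 1.
manual_var <- sum((price - mean(price))^2) / (length(price) - 1)

print(price_var)
print(price_sd)
print(manual_var)
```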
Attachments: Figures 1 to 11 (images)