缺失值处理介绍.docx

  1. 1、本文档共26页,可阅读全部内容。
  2. 2、原创力文档(book118)网站文档一经付费(服务费),不意味着购买了该文档的版权,仅供个人/单位学习、研究之用,不得用于商业用途,未经授权,严禁复制、发行、汇编、翻译或者网络传播等,侵权必究。
  3. 3、本站所有内容均由合作方或网友上传,本站不对文档的完整性、权威性及其观点立场正确性做任何保证或承诺!文档内容仅供研究参考,付费前请自行鉴别。如您付费,意味着您自己接受本站规则且自行承担风险,本站不退款、不进行额外附加服务;查看《如何避免下载的几个坑》。如果您已付费下载过本站文档,您可以点击 这里二次下载
  4. 4、如文档侵犯商业秘密、侵犯著作权、侵犯人身权等,请点击“版权申诉”(推荐),也可以打举报电话:400-050-0827(电话支持时间:9:00-18:30)。
查看更多
缺失值1. is.na 确实值位置判断注意: 缺失值被认为是不可比较的,即便是与缺失值自身的比较。这意味着无法使用比较运算符来检测缺失值是否存在。例如,逻辑测试myvar == NA的结果永远不会为TRUE。作为替代,你只能使用处理缺失值的函数(如本节中所述的那些)来识别出R数据对象中的缺失值。2. na.omit() 删除不完整观测manyNAslibrary(DMwR)manyNAs(data, nORp = 0.2) Argumentsdata A data frame with the data set.nORp A number controlling when a row is considered to have too many NA values (defaults to 0.2, i.e. 20% of the columns). If no rows satisfy the constraint indicated by the user, a warning is generated. 按照比例判断缺失.3. knnImputation K近邻填补library(DMwR)knnImputation(data, k = 10, scale = T, meth = weighAvg, distData = NULL)12ArgumentsArgumentsdataA data frame with the data setkThe number of nearest neighbours to use (defaults to 10)scaleBoolean setting if the data should be scale before finding the nearest neighbours (defaults to T)methString indicating the method used to calculate the value to fill in each NA. Available values are ‘median’ or ‘weighAvg’ (the default).distDataOptionally you may sepecify here a data frame containing the data set that should be used to find the neighbours. This is usefull when filling in NA values on a test set, where you should use only information from the training set. This defaults to NULL, which means that the neighbours will be searched in dataDetailsThis function uses the k-nearest neighbours to fill in the unknown (NA) values in a data set. For each case with any NA value it will search for its k most similar cases and use the values of these cases to fill in the unknowns.If meth=’median’ the function will use either the median (in case of numeric variables) or the most frequent value (in case of factors), of the neighbours to fill in the NAs. If meth=’weighAvg’ the function will use a weighted average of the values of the neighbours. The weights are given by exp(-dist(k,x) where dist(k,x) is the euclidean distance between the case with NAs (x) and the neighbour k例子:#首先读入程序包并对数据进行清理 library(DMwR) data(algae) algae - algae[-manyNAs(algae), ] clean.algae - knnImputation(algae[,1:1

文档评论(0)

希望之星 + 关注
实名认证
内容提供者

该用户很懒,什么也没介绍

1亿VIP精品文档

相关文档