Stanford University - Big Data Mining - introduction1.ppt

  • About 6,040 characters
  • About 27 pages
  • Published 2016-02-25 from Jiangsu

CS345A: Data Mining on the Web

Course Introduction
  • Issues in Data Mining
  • Bonferroni’s Principle

Course Staff
  • Instructors: Anand Rajaraman, Jeff Ullman
  • Reach us as cs345a-win0809-staff @ .
  • More info on /class/cs345a.

Requirements
  • Homework (Gradiance and other): 20%
      • Go to /pearson
      • Enter class code 83769DC9.
      • If you took CS145 or CS245 in the past year, you should have free access; otherwise you will have to purchase access from Pearson Ed.
  • Project: 40%
  • Final Exam: 40%

Project
  • Software implementation related to course subject matter.
  • Should involve an original component or experiment.
  • More later about available data and computing resources.

Possible Projects
  • Many past projects have dealt with collaborative filtering (advice based on what similar people do).
      • E.g., Netflix Challenge.
  • Others have dealt with engineering solutions to “machine-learning” problems.

ML-Replacement Projects
  • ML generally requires a large “training set” of correctly classified data.
      • Example: classifying Web pages by topic.
      • Hard to find well-classified data.
      • Exception: Open Directory works for page topics, because work is collaborative and shared by many.
      • Other good exceptions?

ML-Replacement (2)
  • Many problems require thought rather than ML:
      • Tell important pages from unimportant (PageRank).
      • Tell real news from publicity (how?).
      • Distinguish positive from negative product reviews (how?).
      • Etc., etc.

Team Projects
  • Working in pairs OK, but …
      • No more than two per project.
      • We will expect more from a pair than from an individual.
      • The effort should be roughly evenly distributed.

What is Data Mining?
  • Discovery of useful, possibly unexpected, patterns in data.
  • Subsidiary issues:
      • Data cleaning: detection of bogus data.
          • E.g., age = 150.
          • Entity resolution.
      • Visualization: something better than megabyte files of output.

Cultures
  • Databases: concentrate on large-scale (non-main-memory) data.
  • AI (machine-learning): concentrate on complex methods, small data.
  • Statistics: concentrate on models.

Models vs. Analytic Proce
