- About 15,000 characters
- About 12 pages
- Published 2017-10-07 in Henan
Massive data written test questions
Massive data written questions.Txt
1. Given two files, A and B, each storing 5 billion URLs at 64 bytes per URL, and a memory limit of 4 GB, find the URLs common to files A and B.
Scenario 1: each file is about 5 billion × 64 bytes = 320 GB, far larger than the 4 GB memory limit, so it is impossible to load a file entirely into memory. Consider a divide-and-conquer approach.
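The size estimate can be checked with a few lines of arithmetic. This is only a sketch of the calculation; the constant names are illustrative, and the memory limit is assumed to mean 4 GiB (4 × 1024³ bytes).

```python
# Back-of-the-envelope check of the file size against the memory limit.
URL_COUNT = 5_000_000_000        # 5 billion URLs per file (from the problem)
URL_BYTES = 64                   # bytes per URL (from the problem)
MEMORY_LIMIT = 4 * 1024**3       # assumption: "4G" read as 4 GiB

file_bytes = URL_COUNT * URL_BYTES
print(file_bytes)                    # 320000000000 bytes, i.e. 320 GB
print(file_bytes > MEMORY_LIMIT)     # True: one file cannot fit in memory
```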
Traverse file A, compute a hash value for each URL (for example hash(URL) % 1000), and store the URL into one of 1000 small files according to the value obtained. Each small file is then about 300 MB.
Traverse file B in the same way as A, storing its URLs into another 1000 small files. After this processing, any URL that appears in both files must land in a corresponding pair of small files (the small file of A and the small file of B with the same hash value), and small files that do not correspond cannot share a URL. So we only need to find the common URLs within each of the 1000 pairs of small files.
To find the common URLs in a pair of small files, store the URLs of one small file in a hash_set. Then iterate over each URL of the other small file and check whether it is in the hash_set just built; if so, it is a common URL, and it is saved to a file.
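The three steps of Scenario 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a tuned implementation: the function names (`bucket_of`, `partition`, `common_urls`), the one-URL-per-line file layout, and the choice of MD5 as the hash are all assumptions made for the example. For brevity it returns the common URLs as a set, whereas the text saves them to a file.

```python
import hashlib
import os
import tempfile


def bucket_of(url: str, num_buckets: int) -> int:
    """Map a URL to a bucket index with a stable hash (same URL, same bucket)."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def partition(src_path: str, out_dir: str, prefix: str, num_buckets: int) -> list:
    """Split src_path (assumed: one URL per line) into num_buckets small files."""
    paths = [os.path.join(out_dir, f"{prefix}{i}.txt") for i in range(num_buckets)]
    outs = [open(p, "w", encoding="utf-8") for p in paths]
    try:
        with open(src_path, encoding="utf-8") as src:
            for line in src:
                url = line.strip()
                if url:
                    outs[bucket_of(url, num_buckets)].write(url + "\n")
    finally:
        for f in outs:
            f.close()
    return paths


def common_urls(path_a: str, path_b: str, num_buckets: int = 1000) -> set:
    """Find URLs present in both files, one pair of small files at a time."""
    common = set()
    with tempfile.TemporaryDirectory() as tmp:
        parts_a = partition(path_a, tmp, "a", num_buckets)
        parts_b = partition(path_b, tmp, "b", num_buckets)
        for pa, pb in zip(parts_a, parts_b):
            with open(pa, encoding="utf-8") as fa:
                seen = {line.strip() for line in fa}  # hash set for one small file
            with open(pb, encoding="utf-8") as fb:
                for line in fb:
                    url = line.strip()
                    if url in seen:
                        common.add(url)
    return common
```

Only one small file's hash set is held in memory at a time, which is the point of the partitioning. With 1000 buckets the sketch keeps 1000 output files open at once, so a real run may need a higher open-file limit or buffered writes per bucket.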
Scenario 2: if a certain error rate is acceptable, you can use a Bloom filter. 4 GB of memory can represent about 34 billion bits. Map the URLs of one file onto these 34 billion bits with the Bloom filter, then read the URLs of the other file one by one and check each against the Bloom filter; if the filter reports a match, the URL should be a common URL (note that there will be some error rate).
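A Bloom filter of this kind can be sketched as below. The class, the `probably_common` helper, the use of SHA-256 with double hashing, and the parameter values are assumptions made for illustration; the source does not specify them. At full scale the bit array would be 4 × 1024³ × 8 = 34,359,738,368 bits, about 6.9 bits for each of the 5 billion URLs.

```python
import hashlib


class BloomFilter:
    """Bit array plus k hash positions per item; may report false positives."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k positions from two halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(item))


def probably_common(urls_a, urls_b, num_bits=1 << 20, num_hashes=4):
    """Yield URLs of urls_b that the filter built from urls_a reports as present."""
    bloom = BloomFilter(num_bits, num_hashes)
    for url in urls_a:
        bloom.add(url)
    for url in urls_b:
        if url in bloom:
            yield url
```

The result never misses a true common URL, because a Bloom filter has no false negatives, but it may include some URLs that are not in the first file. That is the error rate the scenario accepts.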
2. There are 10 files of 1 GB each. Every line of every file stores a user query, and the queries in each file may repeat. Sort the queries by frequency.
Scenario 1:
Read the 10 files in sequence and write each query to one of 10 other files according to the result of hash(query) % 10. Each newly generated file is then about 1 GB (assuming the hash function is random).
Find a machine with about 2 GB
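The partitioning step can be sketched as follows. The source text is cut off after this point, so `count_bucket` is only an assumption about how the per-file frequencies might then be obtained, not the author's stated method. The function names, the one-query-per-line layout, and the use of MD5 in place of the unspecified hash function are likewise illustrative.

```python
import hashlib
import os
from collections import Counter


def bucket_of(query: str, num_buckets: int = 10) -> int:
    """Stable stand-in for hash(query) % 10; Python's built-in hash() of
    strings is randomized per process, so hashlib is used instead."""
    digest = hashlib.md5(query.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def partition_queries(input_paths, out_dir, num_buckets=10):
    """Read the input files in sequence and redistribute their queries."""
    out_paths = [os.path.join(out_dir, f"part{i}.txt") for i in range(num_buckets)]
    outs = [open(p, "w", encoding="utf-8") for p in out_paths]
    try:
        for path in input_paths:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    query = line.rstrip("\n")
                    if query:
                        outs[bucket_of(query, num_buckets)].write(query + "\n")
    finally:
        for f in outs:
            f.close()
    return out_paths


def count_bucket(path):
    """Assumed follow-up step: count one partition and sort it by frequency."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts[line.rstrip("\n")] += 1
    return counts.most_common()  # (query, count) pairs, most frequent first
```

Because identical queries always hash to the same partition, each partition can be counted independently of the others.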