海量数据笔试问题 (Massive Data Written-Test Questions)

Published 2017-10-07 in Henan

1. Given two files A and B, each storing 5 billion URLs, with each URL taking 64 bytes and a memory limit of 4 GB, how do you find the URLs common to A and B?

Scenario 1: Each file is roughly 5G × 64 bytes = 320 GB, far larger than the 4 GB memory limit, so neither file can be loaded into memory in full. Consider a divide-and-conquer approach:

- Traverse file A; for each URL, compute a hash value and write the URL into one of 1,000 small files according to that value. Each small file is then about 300 MB.
- Traverse file B in the same way, writing its URLs into another 1,000 small files.
- After this partitioning, any URLs common to both files must land in the pair of small files with the same index; small files with different indices cannot share a URL. The problem thus reduces to finding the common URLs within each of the 1,000 file pairs.
- For each pair, load the URLs of one small file into a hash_set, then iterate over the URLs of the other small file, checking each against the hash_set just built. Any URL that is found is a common URL; write it to an output file.

Scenario 2: If a certain error rate is acceptable, a Bloom filter can be used: 4 GB of memory can represent about 34 billion bits. Map the URLs of one file onto these 34 billion bits, then read the other file's URLs one by one and test each against the Bloom filter. Any URL that tests positive should be a common URL (note that there will be some false positives).

2. There are 10 files of 1 GB each; every line of every file stores a user query, and queries may repeat across files. You are required to sort the queries by frequency.

Scenario 1: Read the 10 files in sequence and write each query into one of 10 new files according to hash(query) % 10. Each newly generated file is then about 1 GB in size (assuming the hash function distributes queries uniformly). Then find a machine with about 2G
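The partition-then-intersect idea from Scenario 1 of question 1 can be sketched in miniature. This is not the original author's code; it uses 10 in-memory buckets in place of 1,000 on-disk files, and the input lists `urls_a` / `urls_b` are hypothetical stand-ins for the two large files:

```python
# Minimal sketch of the divide-and-conquer common-URL search:
# hash each URL to a bucket, so matching URLs from both inputs
# are guaranteed to land in buckets with the same index.
from hashlib import md5

NUM_BUCKETS = 10  # stands in for the 1,000 small files

def bucket_of(url: str) -> int:
    # Deterministic hash of the URL, reduced to a bucket index.
    return int(md5(url.encode()).hexdigest(), 16) % NUM_BUCKETS

def common_urls(urls_a, urls_b):
    # Partition the first input into per-bucket sets
    # (the in-memory equivalent of the 1,000 small files of A).
    buckets_a = [set() for _ in range(NUM_BUCKETS)]
    for url in urls_a:
        buckets_a[bucket_of(url)].add(url)
    # For the second input, only the matching bucket can
    # possibly contain each URL, so one set lookup suffices.
    common = set()
    for url in urls_b:
        if url in buckets_a[bucket_of(url)]:
            common.add(url)
    return common

print(common_urls(["a.com", "b.com", "c.com"], ["b.com", "d.com"]))  # → {'b.com'}
```

In the full-scale version, each per-bucket set becomes a hash_set built from one ~300 MB small file, which is the part that actually fits in memory.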
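Scenario 2 of question 1 relies on a Bloom filter. The toy class below is an illustration, not the author's implementation: it uses a 1,024-bit array and 3 hash functions for readability, whereas a real deployment would size the bit array from the 34-billion-bit budget and choose the hash count to hit a target false-positive rate:

```python
# Toy Bloom filter: membership test with no false negatives,
# but a small probability of false positives.
from hashlib import sha256

class BloomFilter:
    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)  # all bits start at 0

    def _positions(self, item: str):
        # Derive num_hashes independent positions by salting one hash.
        for i in range(self.num_hashes):
            h = sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.num_bits

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: str) -> bool:
        # False means definitely absent; True may be a false positive.
        return all((self.bits[p // 8] >> (p % 8)) & 1
                   for p in self._positions(item))

# Map the URLs of "one file" into the filter, then probe with
# URLs from "the other file" (tiny hypothetical inputs).
bf = BloomFilter()
for url in ["a.com", "b.com"]:
    bf.add(url)
print(bf.might_contain("a.com"))  # → True
```

The asymmetry is the whole point: a negative answer is certain, so only the (rare) positives need further checking or can simply be accepted with the stated error rate.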
