Web Toolbar by Wibiya

Pages

Tuesday, July 12, 2011

Given a huge file with 2GB size with one string per line, which sorting algorithm will you use to sort the file and why?

Solution #1: The catch here is "2GB" file. By saying "2GB", the question stresses that we cannot bring the entire data in memory at the same time. Therefore we will have to sort the data in chunks. This can be done by a class of sorting algorithms called external sorting. External sorting works as follows:


Suppose we have 200 MB of available memory,

1) Read 200 MB of data in main memory and sort it by some conventional method like quicksort in O(n log n).

2) Write the sorted data back to a temporary file on disk.

3) Repeat steps 1 and 2 until all the data is in sorted 200 MB chunks (2GB/200MB = 10 chunks). We will now merge these chunks into a single output file.

4) Read the first 15 MB of of each chunk (15*10 = 150 MB total) into input buffer in main memory and allocate remaining 50 MB for output buffer.

5) Perform a 10-way merge on the input buffer and store the result in the output buffer. If the output buffer is full, write it to the final sorted file on the disk and empty it. If any of the 10 input buffers get empty, fill it again with the next 20 MB of the associated 200 MB chunk until no more data from the chunk is available.

3 comments:

jiabul sk said...

thanks

Suchit Maindola said...

you are welcome Jiabul :)

Ankita said...

Can you explain the 4th and 5th step of merging more elaborately?I am not able to get it. That would be a great help. Thanks!!