{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"%load_ext autoreload\n",
"%autoreload 2\n",
"import io_backend_v3 as io"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note: this notebook requires `io_backend.py` and `compModel.js` to be in the same directory**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"External Merge Sort\n",
"------------------\n",
"\n",
"Sorting big data is easy!"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 522 ms, sys: 5.92 ms, total: 528 ms\n",
"Wall time: 533 ms\n"
]
}
],
"source": [
"import random\n",
"x = list(range(1000000)); random.shuffle(x)\n",
"%time x.sort()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Except that sorting a million integers isn't cool... you know what's cool? Sorting a billion! Try it!\n",
"\n",
"Just kidding! (Hint: if you took the above seriously, select \"Kernel >> Restart\" in the notebook menu bar and then select \"Clear all outputs & restart\" to save your computer)\n",
"\n",
"What actually happens when someone- or more pertinently, some DBMS- wants to sort data that is bigger than main memory (e.g. RAM)? For example, the rows of a very large table in a database?\n",
"\n",
"In this notebook we'll look at one classic algorithm for doing this efficently: **external merge sort**."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Step 1: External Merge\n",
"---------------------\n",
"\n",
"First, we'll look at a simple way to efficiently merge a file larger than memory. Suppose we have the following scenario:\n",
"* _Input:_ Two sorted lists of length $N$ and $M$\n",
"* _Output:_ One sorted list of length $N+M$\n",
"\n",
"We will see a way to do this that requires at least 3 buffer pages, and is $O(2*(M+N))$:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"def min_non_null_idx(a):\n",
" arr = filter(lambda x : x[1] is not None, enumerate(a))\n",
" return min(arr, key=lambda x : x[1])\n",
"\n",
"def external_merge(b, file_in_ids, erase=False):\n",
" P = b.page_size\n",
" if len(file_in_ids) + 1 > b.buffer_size:\n",
" raise Exception(\"Too many files for buffer size.\")\n",
"\n",
" # A FileIterator will iterate through elements of a file's pages\n",
" # reading in & releasing pages so as to take up one page in buffer\n",
" fis = [io.FileIterator(b, fid) for fid in file_in_ids]\n",
" elements = [fi.get_next() for fi in fis]\n",
"\n",
" # A FileWriter object will append elements to a file, creating\n",
" # & flushing pages so as to take up one page in buffer\n",
" fid_out = b.new_file()\n",
" fw = io.FileWriter(b, fid_out)\n",
" \n",
" # Successively choose the smallest of the B files' smallest elements\n",
" while any([e is not None for e in elements]):\n",
" \n",
" # Erase & then output the smallest element from the pages currently in buffer\n",
" min_i, min_e = min_non_null_idx(elements)\n",
" if erase:\n",
" fis[min_i].erase_current()\n",
" fw.append(min_e)\n",
" \n",
" # Get the next element from the file we just picked one from\n",
" elements[min_i] = fis[min_i].get_next()\n",
" fw.close()\n",
" return fid_out"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"def new_rand_file(b, l, sorted=False):\n",
" vals = random.sample(range(10*l), l)\n",
" if sorted:\n",
" vals.sort()\n",
" fid = b.new_file()\n",
" fw = io.FileWriter(b, fid)\n",
" for v in vals:\n",
" fw.append(v)\n",
" fw.close()\n",
" return fid"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" IO Counts | R | W |
\n",
" \n",
" To/from Buffer | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" To/from Disk | \n",
" 0 | \n",
" 0 | \n",
"
\n",
"
\n",
"
\n",
" TOOLTIP!
\n",
" TOOLTIP!
\n",
" \n",
" "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/javascript": [
"$.getScript(\"https://d3js.org/d3.v3.min.js\", function () {\n",
"$.getScript(\"https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js\", function () {\n",
"$.getScript(\"https://ajax.googleapis.com/ajax/libs/jqueryui/1.11.4/jquery-ui.min.js\", function () {\n",
"var DURATION = 1000\n",
"var chartNum = \"0-0\"\n",
"var numBufferPages = 3\n",
"var pageSize = 2\n",
"var numDiskPages = 5\n",
"var logStart = 21\n",
"// Load data from json files\n",
"var pagesLog;\n",
"var getPagesLog = function() {\n",
" d3.json(\"pagesLog.json\", function(error, json) {\n",
" if (error) return console.warn(error);\n",
" pagesLog = json;\n",
"\n",
" // Make the animation\n",
" makeComp();\n",
"\n",
" // Bind to window resize\n",
" //$(window).resize(function() {\n",
" // makeComp();\n",
" //});\n",
" });\n",
"}\n",
"getPagesLog();\n",
"\n",
"// Set some globals\n",
"var pageWidth = 50 + 20*pageSize; // NOTE: this is hard-coding the font-width!!!\n",
"var pageHeight = 30;\n",
"var pagePadding = 5;\n",
"var frameWidth = pageWidth + 2*pagePadding;\n",
"var frameHeight = pageHeight + 2*pagePadding;\n",
"var framePadding = 2*pagePadding;\n",
"\n",
"var bufferFrameFill = '#3366FF';\n",
"var diskFrameFill = '#B685FF';\n",
"var frameStroke = 'white'; // '#C1C0C0'\n",
"var diskBackground = '#ABACB0';\n",
"var pageColor = 'blue';\n",
"var pageHighlightColor = 'red';\n",
"\n",
"var maxDiskSize = 4; // MAX # of files that can fit in disk\n",
"var fileSize;\n",
"var height = framePadding + (maxDiskSize + 2) * (framePadding + frameHeight);\n",
"var width;\n",
"\n",
"// Create a row of frames\n",
"var appendFrameRowTo = function(e, x, y, len, name, label, fill, duration, callback) {\n",
" var frameRow = e.append('g')\n",
" .attr({\n",
" 'transform': 'translate(' + x + ',' + y + ')',\n",
" 'id': name + '-g'})\n",
" .style('opacity', 0);\n",
" var frameRange = new Array(len);\n",
" frameRow.append('text')\n",
" .text(label)\n",
" .attr({\n",
" 'text-anchor': 'middle',\n",
" 'dominant-baseline': 'middle', // NOTE: doesn't work in IE?\n",
" 'dx': 0.5*frameWidth,\n",
" 'dy': 0.5*frameHeight});\n",
" frameRow.selectAll('rect.' + name + '-frame')\n",
" .data(frameRange)\n",
" .enter()\n",
" .append('rect')\n",
" .attr({\n",
" 'id': function(d,i) { return name + '-frame-' + i; },\n",
" 'class': name + '-frame',\n",
" 'width': frameWidth,\n",
" 'height': frameHeight,\n",
" 'x': function(d,i) { return (i+1)*frameWidth; },\n",
" 'y': pagePadding,\n",
" 'fill': fill,\n",
" 'stroke': frameStroke});\n",
" if (duration != null) {\n",
" frameRow.transition()\n",
" .style('opacity', 1)\n",
" .duration(DURATION)\n",
" .each(\"end\", function() { if (callback != null) { callback(); } });\n",
" } else {\n",
" frameRow.style('opacity', 1);\n",
" if (callback != null) { callback(); }\n",
" }\n",
" return frameRow;\n",
"}\n",
"\n",
"var diskHeight = function(nFiles) {\n",
" return 2*framePadding + nFiles * (framePadding + frameHeight)\n",
"}\n",
"\n",
"// Create the chart\n",
"var bufferPages;\n",
"var files = [];\n",
"var svg;\n",
"var buffer;\n",
"var disk;\n",
"var makeComp = function() {\n",
" $(\"#chart-\"+chartNum).html('');\n",
" \n",
" // Make sure the tooltips are hidden\n",
" $(\".tooltip\").hide();\n",
"\n",
" // viewer size\n",
" width = $('.output_subarea').width() - 20;\n",
"\n",
" // Make the files the max possible size\n",
" // TODO: make files dynamically resizeable?\n",
" fileSize = Math.floor((width - framePadding) / (framePadding + frameWidth) - 2);\n",
"\n",
" // append svg\n",
" svg = d3.select(\"#chart-\"+chartNum)\n",
" .append('svg')\n",
" .attr({\n",
" 'width': width,\n",
" 'height': height})\n",
" .style('border', '1px solid black');\n",
"\n",
" // append buffer\n",
" buffer = appendFrameRowTo(svg, pagePadding, pagePadding, numBufferPages, 'buffer', 'BUFFER', bufferFrameFill, null, null);\n",
"\n",
" // append disk container\n",
" var diskY = 2*pagePadding + 2*frameHeight;\n",
" disk = svg.append('g')\n",
" .attr({\n",
" 'id': 'disk-g',\n",
" 'transform': 'translate(' + pagePadding + ',' + diskY + ')'});\n",
" disk.append('text')\n",
" .text('DISK')\n",
" .attr({\n",
" 'text-anchor': 'middle',\n",
" 'dominant-baseline': 'middle', // NOTE: doesn't work in IE?\n",
" 'dx': 0.5*frameWidth,\n",
" 'dy': 0.5*frameHeight});\n",
" disk.append('rect')\n",
" .attr({\n",
" 'id': chartNum + '-disk-box',\n",
" 'width': width - 2*frameWidth,\n",
" 'height': diskHeight(1), // Start w/ space for one file\n",
" 'x': frameWidth,\n",
" 'y': 0,\n",
" 'fill': diskBackground,\n",
" 'stroke': 'none'});\n",
"\n",
" // Animate the page transistions according to the log diff in pagesLog\n",
" var success = takeActions(0);\n",
"}\n",
"\n",
"var addFile = function(fileId, duration, callback) {\n",
" var n = files.length;\n",
" if (fileId >= maxDiskSize) {\n",
" return null;\n",
" }\n",
" if (n > 0 && fileId >= n) {\n",
" $(\"#\" + chartNum + \"-disk-box\").height(diskHeight(n+1));\n",
" }\n",
" var file = appendFrameRowTo(disk, 2*framePadding + frameWidth, framePadding + fileId * (framePadding + frameHeight), fileSize, 'file-' + fileId, 'File ' + fileId, diskFrameFill, duration, callback);\n",
" if (fileId < n) {\n",
" files[fileId] = file;\n",
" } else {\n",
" files.push(file);\n",
" }\n",
"}\n",
"\n",
"var removeFile = function(fileId) {\n",
" var file = files[fileId];\n",
" file.remove();\n",
" files[fileId] = null;\n",
" var n = files.length;\n",
" var nh = n;\n",
" if (n > 0) {\n",
" for (var j=n-1; j > 0; j--) {\n",
" if (files[j] != null) {\n",
" break;\n",
" } else {\n",
" nh = j;\n",
" }\n",
" }\n",
" $(\"#\" + chartNum + \"-disk-box\").height(diskHeight(nh));\n",
" }\n",
"}\n",
"\n",
"// For some reason in certain situations e.g. d3.select(\"#0-0-1\")\n",
"// d3.select is not working here?? Workaround...\n",
"var getElement = function(elementType, filterFn) {\n",
" var e = d3.select('#chart-'+chartNum)\n",
" .selectAll(elementType)[0]\n",
" .filter(filterFn)[0];\n",
" return e;\n",
"}\n",
"\n",
"var pageText = function(fid, pid, data) {\n",
" var s = fid + '/' + pid + ' : ';\n",
" for (var i=0; i < data.length; i++) {\n",
" s += (data[i] != null) ? data[i] : '_';\n",
" if (i < data.length - 1) {\n",
" s += ', ';\n",
" }\n",
" }\n",
" return s;\n",
"}\n",
"\n",
"var appendPageTo = function(e, a, x, y, opacity, color) {\n",
" var page = e.append('g')\n",
" .attr({\n",
" 'id': chartNum + '-' + a.file + '-' + a.page + '-' + a.newLocation,\n",
" 'transform': 'translate(' + x + ',' + y + ')',\n",
" 'class': 'page'})\n",
" .style('opacity', opacity);\n",
" page.append('rect')\n",
" .attr({\n",
" 'id': 'page-' + a.file + '-' + a.page + '-rect',\n",
" 'class': 'page',\n",
" 'width': pageWidth,\n",
" 'height': pageHeight,\n",
" 'fill': color,\n",
" 'stroke': '#444'});\n",
" page.append('text')\n",
" .attr({\n",
" 'dx': pagePadding,\n",
" 'dy': 0.7*pageHeight,\n",
" 'font-size': \"15\",\n",
" 'fill': 'white'})\n",
" .text(pageText(a.file, a.page, a.pageData));\n",
" return page;\n",
"}\n",
"\n",
"var getBufferPageAbsPos = function(bufferIndex) {\n",
" var bufferPos = d3.transform(buffer.attr(\"transform\")).translate;\n",
" var x = bufferPos[0] + pagePadding + (bufferIndex + 1) * frameWidth;\n",
" var y = bufferPos[1] + 2 * pagePadding;\n",
" return [x, y];\n",
"}\n",
"\n",
"var getFilePageAbsPos = function(file, page) {\n",
" var diskPos = d3.transform(disk.attr(\"transform\")).translate;\n",
" var filePos = d3.transform(files[file].attr(\"transform\")).translate;\n",
" var x = diskPos[0] + filePos[0] + pagePadding + (page + 1) * frameWidth;\n",
" var y = diskPos[1] + filePos[1] + 2 * pagePadding;\n",
" return [x, y];\n",
"}\n",
"\n",
"var placeRelativeTo = function(a, b, dTop, dLeft) {\n",
" var bOffset = b.offset();\n",
" if (bOffset != null) {\n",
" a.offset({'top' : bOffset.top + dTop, 'left' : bOffset.left + dLeft});\n",
" }\n",
"}\n",
"\n",
"var takeActions = function(step) {\n",
"\n",
" // Get the next action to execute\n",
" if (step < pagesLog.length) {\n",
" var a = pagesLog[step];\n",
" //console.warn(a.operation);\n",
" } else {\n",
" $(\".tooltip\").hide();\n",
" return null;\n",
" }\n",
"\n",
" // Are we in the new part of the log which should be animated?\n",
" var animate = (a.show && step >= logStart);\n",
"\n",
" // Update IO counter\n",
" $(\"#chart-\" + chartNum + \"-bufferReads\").html(a.ioCount.bufferReads);\n",
" $(\"#chart-\" + chartNum + \"-bufferWrites\").html(a.ioCount.bufferWrites);\n",
" $(\"#chart-\" + chartNum + \"-diskReads\").html(a.ioCount.diskReads);\n",
" $(\"#chart-\" + chartNum + \"-diskWrites\").html(a.ioCount.diskWrites);\n",
"\n",
" // Show tooltip! -> @ old location, or new if old is null\n",
" if (animate) {\n",
" var tooltip = $(\"#chart-\" + chartNum + \"-tooltip\");\n",
" var pageId = \"#\" + chartNum + \"-\" + a.file + \"-\" + a.page + \"-\";\n",
" pageId += (a.oldLocation != null) ? a.oldLocation : a.newLocation;\n",
" placeRelativeTo(tooltip, $(pageId), 0.7*pageHeight, 0.3*pageWidth);\n",
" tooltip.html(a.operation);\n",
" tooltip.show();\n",
" }\n",
"\n",
" // Check for tooltip animations & handle here\n",
" if (a.operation.slice(0, 7) == 'TOOLTIP') {\n",
" // TODO: Clean this up\n",
" if (a.operation == 'TOOLTIP-2') {\n",
" var tooltip = $(\"#chart-\" + chartNum + \"-tooltip-2\");\n",
" var page = $(\"#\" + chartNum + \"-\" + a.file + \"-\" + a.page + \"-\" + a.oldLocation);\n",
" placeRelativeTo(tooltip, page, -(tooltip.outerHeight() + 1.5*pagePadding), 0.1*pageWidth);\n",
" tooltip.html(a.tooltipContent);\n",
" tooltip.show();\n",
" }\n",
" takeActions(step + 1);\n",
" return true;\n",
" }\n",
"\n",
" // Check for file actions & handle here\n",
" if (a.operation === \"NEWFILE\") {\n",
" addFile(a.file, DURATION, function() { takeActions(step + 1); });\n",
" return true;\n",
" } else if (a.operation === \"DELETEFILE\") {\n",
" removeFile(a.file);\n",
" takeActions(step + 1); // TODO: Do proper animation here\n",
" return true;\n",
" }\n",
"\n",
" // old -> new location animation\n",
" if (a.newLocation != null) {\n",
"\n",
" // Delete the old element\n",
" if (!a.keepOld && a.oldLocation != null) {\n",
" $(\"#\" + chartNum + '-' + a.file + '-' + a.page + '-' + a.oldLocation).remove();\n",
" }\n",
"\n",
" // ANIMATION: If animation is called for, we create a new copy of page *which is above all \n",
" // other SVG g layers* to do the animation, then delete it once the animation is done\n",
" // The point is to do the animation so nice & visible (not hidden behind g layers)\n",
" // But so that the end object is appended to the proper g\n",
" var newE = (a.newLocation === \"BUFFER\") ? buffer : files[a.file];\n",
" var newI = (a.newLocation === \"BUFFER\") ? a.bufferIndex : a.page;\n",
" var newX = (newI + 1) * frameWidth + pagePadding;\n",
" var newY = 2 * pagePadding;\n",
"\n",
" // Animate only if we are in the new part of the log\n",
" if (animate) {\n",
" var newPos = (a.newLocation === \"BUFFER\") ? getBufferPageAbsPos(a.bufferIndex) : getFilePageAbsPos(a.file, a.page);\n",
" if (a.oldLocation != null) {\n",
" var oldPos = (a.oldLocation === \"BUFFER\") ? getBufferPageAbsPos(a.bufferIndex) : getFilePageAbsPos(a.file, a.page);\n",
" var oldE = (a.oldLocation === \"BUFFER\") ? buffer : files[a.file];\n",
" } else {\n",
" var oldPos = newPos;\n",
" var oldE = (a.newLocation === \"BUFFER\") ? buffer : files[a.file];\n",
" }\n",
"\n",
" // Append the dummpy page for animation\n",
" var p = appendPageTo(svg, a, oldPos[0], oldPos[1], 0, pageColor);\n",
"\n",
" // Animation! Place the new page and proceed to next action at end\n",
" p.transition()\n",
" .style('opacity', 1).duration(DURATION)\n",
" .transition()\n",
" .attr('transform', 'translate(' + newPos[0] + ',' + newPos[1] + ')').duration(DURATION)\n",
" .each(\"end\", function() { \n",
" var page = appendPageTo(newE, a, newX, newY, 1, pageColor);\n",
" takeActions(step + 1);\n",
" p.remove();\n",
" });\n",
"\n",
" // Else if no animation we just place the new element and proceed\n",
" } else {\n",
" var page = appendPageTo(newE, a, newX, newY, 1, pageColor);\n",
" takeActions(step + 1);\n",
" }\n",
"\n",
" // If no new location: simple fade out animation\n",
" } else {\n",
" if (!a.keepOld && a.oldLocation != null) {\n",
" var page = $(\"#\" + chartNum + '-' + a.file + '-' + a.page + '-' + a.oldLocation);\n",
" if (animate) {\n",
" page.fadeOut(DURATION, function() {\n",
" page.remove();\n",
" takeActions(step + 1);\n",
" });\n",
" } else {\n",
" page.remove();\n",
" takeActions(step + 1);\n",
" }\n",
" } else {\n",
" takeActions(step + 1);\n",
" }\n",
" }\n",
" return true;\n",
"}\n",
"});\n",
"});\n",
"});\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Create the buffer + two random sorted files\n",
"N = 4\n",
"M = 5\n",
"b = io.Buffer(page_size=2, buffer_size=3)\n",
"fids = [\n",
" new_rand_file(b, N, sorted=True),\n",
" new_rand_file(b, M, sorted=True)\n",
"]\n",
"b.display_set_mark() # Don't animate this setup part\n",
"\n",
"# Merge!\n",
"merged_fid = external_merge(b, fids, erase=True)\n",
"\n",
"# NOTE THAT YOU CAN ADJUST THE ANIMATION SPEED!!!\n",
"b.display(speed=1000)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We see that for two files and three buffer pages, this algorithm is\n",
"\n",
"$O(2*(M+N))$,\n",
"\n",
"since for each page read in from disk ($M+N$ total), there is a page flushed out. Furthermore, since the algorithm above is fully generalizable to $K$ files and a buffer of size $K+1$ or greater, we see that if file $i$ has $N_i$ pages, then we again are\n",
"\n",
"$O\\left(2*\\sum_{i=1}^KN_i\\right)$\n",
"\n",
"which is just the above in more general form."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"External Merge Sort\n",
"------------------\n",
"\n",
"Now that we can merge two sorted files, sorting is easy: we just\n",
"\n",
"1. Split our data up into files small enough to _sort in memory_ (i.e. small enough to fit in the buffer)\n",
"2. Repeatedly _merge these files_ with the above algorithm until they are again one (now sorted) file!\n",
"\n",
"Since space is an issue for the visualizations, we'll just show a simple sort-merge with one round of merging below, however the algorithm is fully extendable to larger files with more merges:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Helper functions for (1) splitting and sorting in-memory:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"import itertools\n",
"def flatten(x): return list(itertools.chain.from_iterable(x))\n",
"\n",
"def split_and_sort_in_mem(b, fid_in):\n",
" fids_out = []\n",
" N = b.get_file_len(fid_in)\n",
" B = b.buffer_size\n",
" P = b.page_size\n",
" for i in range(0, N, B):\n",
" fid_out = b.new_file()\n",
" \n",
" # Read in enough pages to fill the buffer if possible\n",
" pages = [b.read(fid_in, pid) for pid in range(i, min(i+B, N))]\n",
" \n",
" # Collect & sort the data from the pages in memory\n",
" vals = sorted(flatten([page.get_data_copy() for page in pages]), key=str)\n",
" \n",
" # Write out to the new file\n",
" for j,page in enumerate(pages):\n",
" b.release(page)\n",
" page = b.new_page(fid_out)\n",
" page.set_all(vals[j*P:(j+1)*P])\n",
" b.flush(page)\n",
" fids_out.append(fid_out)\n",
" \n",
" # Delete original file & return new file ids\n",
" b.delete_file(fid_in)\n",
" return fids_out"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Helper function for (2) recursively applying external merge:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"def recursive_external_merge(b, fids, erase=False):\n",
" L = len(fids)\n",
" if L < b.buffer_size:\n",
" return external_merge(b, fids, erase=erase)\n",
" else:\n",
" mid_1 = recursive_external_merge(b, fids[:(L/2)], erase=erase)\n",
" mid_2 = recursive_external_merge(b, fids[(L/2):], erase=erase)\n",
" return external_merge(b, [mid_1, mid_2], erase=erase)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"_The full **external merge-sort** algorithm:_"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"def external_merge_sort(b, fid, erase=False, concise=False):\n",
" \n",
" # Split into size-B chunks, sort in memory, save to new files\n",
" fids = split_and_sort_in_mem(b, fid)\n",
" \n",
" # Optionally set mark here so as not to animate the previous two steps\n",
" if concise:\n",
" b.display_set_mark()\n",
" \n",
" # Merge the sorted files recursively\n",
" return recursive_external_merge(b, fids, erase=erase)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" IO Counts | R | W |
\n",
" \n",
" To/from Buffer | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" To/from Disk | \n",
" 0 | \n",
" 0 | \n",
"
\n",
"
\n",
"
\n",
" TOOLTIP!
\n",
" TOOLTIP!
\n",
" \n",
" "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/javascript": [
"$.getScript(\"https://d3js.org/d3.v3.min.js\", function () {\n",
"$.getScript(\"https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js\", function () {\n",
"$.getScript(\"https://ajax.googleapis.com/ajax/libs/jqueryui/1.11.4/jquery-ui.min.js\", function () {\n",
"var DURATION = 1000\n",
"var chartNum = \"1-0\"\n",
"var numBufferPages = 3\n",
"var pageSize = 2\n",
"var numDiskPages = 5\n",
"var logStart = 69\n",
"// Load data from json files\n",
"var pagesLog;\n",
"var getPagesLog = function() {\n",
" d3.json(\"pagesLog.json\", function(error, json) {\n",
" if (error) return console.warn(error);\n",
" pagesLog = json;\n",
"\n",
" // Make the animation\n",
" makeComp();\n",
"\n",
" // Bind to window resize\n",
" //$(window).resize(function() {\n",
" // makeComp();\n",
" //});\n",
" });\n",
"}\n",
"getPagesLog();\n",
"\n",
"// Set some globals\n",
"var pageWidth = 50 + 20*pageSize; // NOTE: this is hard-coding the font-width!!!\n",
"var pageHeight = 30;\n",
"var pagePadding = 5;\n",
"var frameWidth = pageWidth + 2*pagePadding;\n",
"var frameHeight = pageHeight + 2*pagePadding;\n",
"var framePadding = 2*pagePadding;\n",
"\n",
"var bufferFrameFill = '#3366FF';\n",
"var diskFrameFill = '#B685FF';\n",
"var frameStroke = 'white'; // '#C1C0C0'\n",
"var diskBackground = '#ABACB0';\n",
"var pageColor = 'blue';\n",
"var pageHighlightColor = 'red';\n",
"\n",
"var maxDiskSize = 4; // MAX # of files that can fit in disk\n",
"var fileSize;\n",
"var height = framePadding + (maxDiskSize + 2) * (framePadding + frameHeight);\n",
"var width;\n",
"\n",
"// Create a row of frames\n",
"var appendFrameRowTo = function(e, x, y, len, name, label, fill, duration, callback) {\n",
" var frameRow = e.append('g')\n",
" .attr({\n",
" 'transform': 'translate(' + x + ',' + y + ')',\n",
" 'id': name + '-g'})\n",
" .style('opacity', 0);\n",
" var frameRange = new Array(len);\n",
" frameRow.append('text')\n",
" .text(label)\n",
" .attr({\n",
" 'text-anchor': 'middle',\n",
" 'dominant-baseline': 'middle', // NOTE: doesn't work in IE?\n",
" 'dx': 0.5*frameWidth,\n",
" 'dy': 0.5*frameHeight});\n",
" frameRow.selectAll('rect.' + name + '-frame')\n",
" .data(frameRange)\n",
" .enter()\n",
" .append('rect')\n",
" .attr({\n",
" 'id': function(d,i) { return name + '-frame-' + i; },\n",
" 'class': name + '-frame',\n",
" 'width': frameWidth,\n",
" 'height': frameHeight,\n",
" 'x': function(d,i) { return (i+1)*frameWidth; },\n",
" 'y': pagePadding,\n",
" 'fill': fill,\n",
" 'stroke': frameStroke});\n",
" if (duration != null) {\n",
" frameRow.transition()\n",
" .style('opacity', 1)\n",
" .duration(DURATION)\n",
" .each(\"end\", function() { if (callback != null) { callback(); } });\n",
" } else {\n",
" frameRow.style('opacity', 1);\n",
" if (callback != null) { callback(); }\n",
" }\n",
" return frameRow;\n",
"}\n",
"\n",
"var diskHeight = function(nFiles) {\n",
" return 2*framePadding + nFiles * (framePadding + frameHeight)\n",
"}\n",
"\n",
"// Create the chart\n",
"var bufferPages;\n",
"var files = [];\n",
"var svg;\n",
"var buffer;\n",
"var disk;\n",
"var makeComp = function() {\n",
" $(\"#chart-\"+chartNum).html('');\n",
" \n",
" // Make sure the tooltips are hidden\n",
" $(\".tooltip\").hide();\n",
"\n",
" // viewer size\n",
" width = $('.output_subarea').width() - 20;\n",
"\n",
" // Make the files the max possible size\n",
" // TODO: make files dynamically resizeable?\n",
" fileSize = Math.floor((width - framePadding) / (framePadding + frameWidth) - 2);\n",
"\n",
" // append svg\n",
" svg = d3.select(\"#chart-\"+chartNum)\n",
" .append('svg')\n",
" .attr({\n",
" 'width': width,\n",
" 'height': height})\n",
" .style('border', '1px solid black');\n",
"\n",
" // append buffer\n",
" buffer = appendFrameRowTo(svg, pagePadding, pagePadding, numBufferPages, 'buffer', 'BUFFER', bufferFrameFill, null, null);\n",
"\n",
" // append disk container\n",
" var diskY = 2*pagePadding + 2*frameHeight;\n",
" disk = svg.append('g')\n",
" .attr({\n",
" 'id': 'disk-g',\n",
" 'transform': 'translate(' + pagePadding + ',' + diskY + ')'});\n",
" disk.append('text')\n",
" .text('DISK')\n",
" .attr({\n",
" 'text-anchor': 'middle',\n",
" 'dominant-baseline': 'middle', // NOTE: doesn't work in IE?\n",
" 'dx': 0.5*frameWidth,\n",
" 'dy': 0.5*frameHeight});\n",
" disk.append('rect')\n",
" .attr({\n",
" 'id': chartNum + '-disk-box',\n",
" 'width': width - 2*frameWidth,\n",
" 'height': diskHeight(1), // Start w/ space for one file\n",
" 'x': frameWidth,\n",
" 'y': 0,\n",
" 'fill': diskBackground,\n",
" 'stroke': 'none'});\n",
"\n",
" // Animate the page transistions according to the log diff in pagesLog\n",
" var success = takeActions(0);\n",
"}\n",
"\n",
"var addFile = function(fileId, duration, callback) {\n",
" var n = files.length;\n",
" if (fileId >= maxDiskSize) {\n",
" return null;\n",
" }\n",
" if (n > 0 && fileId >= n) {\n",
" $(\"#\" + chartNum + \"-disk-box\").height(diskHeight(n+1));\n",
" }\n",
" var file = appendFrameRowTo(disk, 2*framePadding + frameWidth, framePadding + fileId * (framePadding + frameHeight), fileSize, 'file-' + fileId, 'File ' + fileId, diskFrameFill, duration, callback);\n",
" if (fileId < n) {\n",
" files[fileId] = file;\n",
" } else {\n",
" files.push(file);\n",
" }\n",
"}\n",
"\n",
"var removeFile = function(fileId) {\n",
" var file = files[fileId];\n",
" file.remove();\n",
" files[fileId] = null;\n",
" var n = files.length;\n",
" var nh = n;\n",
" if (n > 0) {\n",
" for (var j=n-1; j > 0; j--) {\n",
" if (files[j] != null) {\n",
" break;\n",
" } else {\n",
" nh = j;\n",
" }\n",
" }\n",
" $(\"#\" + chartNum + \"-disk-box\").height(diskHeight(nh));\n",
" }\n",
"}\n",
"\n",
"// For some reason in certain situations e.g. d3.select(\"#0-0-1\")\n",
"// d3.select is not working here?? Workaround...\n",
"var getElement = function(elementType, filterFn) {\n",
" var e = d3.select('#chart-'+chartNum)\n",
" .selectAll(elementType)[0]\n",
" .filter(filterFn)[0];\n",
" return e;\n",
"}\n",
"\n",
"var pageText = function(fid, pid, data) {\n",
" var s = fid + '/' + pid + ' : ';\n",
" for (var i=0; i < data.length; i++) {\n",
" s += (data[i] != null) ? data[i] : '_';\n",
" if (i < data.length - 1) {\n",
" s += ', ';\n",
" }\n",
" }\n",
" return s;\n",
"}\n",
"\n",
"var appendPageTo = function(e, a, x, y, opacity, color) {\n",
" var page = e.append('g')\n",
" .attr({\n",
" 'id': chartNum + '-' + a.file + '-' + a.page + '-' + a.newLocation,\n",
" 'transform': 'translate(' + x + ',' + y + ')',\n",
" 'class': 'page'})\n",
" .style('opacity', opacity);\n",
" page.append('rect')\n",
" .attr({\n",
" 'id': 'page-' + a.file + '-' + a.page + '-rect',\n",
" 'class': 'page',\n",
" 'width': pageWidth,\n",
" 'height': pageHeight,\n",
" 'fill': color,\n",
" 'stroke': '#444'});\n",
" page.append('text')\n",
" .attr({\n",
" 'dx': pagePadding,\n",
" 'dy': 0.7*pageHeight,\n",
" 'font-size': \"15\",\n",
" 'fill': 'white'})\n",
" .text(pageText(a.file, a.page, a.pageData));\n",
" return page;\n",
"}\n",
"\n",
"var getBufferPageAbsPos = function(bufferIndex) {\n",
" var bufferPos = d3.transform(buffer.attr(\"transform\")).translate;\n",
" var x = bufferPos[0] + pagePadding + (bufferIndex + 1) * frameWidth;\n",
" var y = bufferPos[1] + 2 * pagePadding;\n",
" return [x, y];\n",
"}\n",
"\n",
"var getFilePageAbsPos = function(file, page) {\n",
" var diskPos = d3.transform(disk.attr(\"transform\")).translate;\n",
" var filePos = d3.transform(files[file].attr(\"transform\")).translate;\n",
" var x = diskPos[0] + filePos[0] + pagePadding + (page + 1) * frameWidth;\n",
" var y = diskPos[1] + filePos[1] + 2 * pagePadding;\n",
" return [x, y];\n",
"}\n",
"\n",
"var placeRelativeTo = function(a, b, dTop, dLeft) {\n",
" var bOffset = b.offset();\n",
" if (bOffset != null) {\n",
" a.offset({'top' : bOffset.top + dTop, 'left' : bOffset.left + dLeft});\n",
" }\n",
"}\n",
"\n",
"var takeActions = function(step) {\n",
"\n",
" // Get the next action to execute\n",
" if (step < pagesLog.length) {\n",
" var a = pagesLog[step];\n",
" //console.warn(a.operation);\n",
" } else {\n",
" $(\".tooltip\").hide();\n",
" return null;\n",
" }\n",
"\n",
" // Are we in the new part of the log which should be animated?\n",
" var animate = (a.show && step >= logStart);\n",
"\n",
" // Update IO counter\n",
" $(\"#chart-\" + chartNum + \"-bufferReads\").html(a.ioCount.bufferReads);\n",
" $(\"#chart-\" + chartNum + \"-bufferWrites\").html(a.ioCount.bufferWrites);\n",
" $(\"#chart-\" + chartNum + \"-diskReads\").html(a.ioCount.diskReads);\n",
" $(\"#chart-\" + chartNum + \"-diskWrites\").html(a.ioCount.diskWrites);\n",
"\n",
" // Show tooltip! -> @ old location, or new if old is null\n",
" if (animate) {\n",
" var tooltip = $(\"#chart-\" + chartNum + \"-tooltip\");\n",
" var pageId = \"#\" + chartNum + \"-\" + a.file + \"-\" + a.page + \"-\";\n",
" pageId += (a.oldLocation != null) ? a.oldLocation : a.newLocation;\n",
" placeRelativeTo(tooltip, $(pageId), 0.7*pageHeight, 0.3*pageWidth);\n",
" tooltip.html(a.operation);\n",
" tooltip.show();\n",
" }\n",
"\n",
" // Check for tooltip animations & handle here\n",
" if (a.operation.slice(0, 7) == 'TOOLTIP') {\n",
" // TODO: Clean this up\n",
" if (a.operation == 'TOOLTIP-2') {\n",
" var tooltip = $(\"#chart-\" + chartNum + \"-tooltip-2\");\n",
" var page = $(\"#\" + chartNum + \"-\" + a.file + \"-\" + a.page + \"-\" + a.oldLocation);\n",
" placeRelativeTo(tooltip, page, -(tooltip.outerHeight() + 1.5*pagePadding), 0.1*pageWidth);\n",
" tooltip.html(a.tooltipContent);\n",
" tooltip.show();\n",
" }\n",
" takeActions(step + 1);\n",
" return true;\n",
" }\n",
"\n",
" // Check for file actions & handle here\n",
" if (a.operation === \"NEWFILE\") {\n",
" addFile(a.file, DURATION, function() { takeActions(step + 1); });\n",
" return true;\n",
" } else if (a.operation === \"DELETEFILE\") {\n",
" removeFile(a.file);\n",
" takeActions(step + 1); // TODO: Do proper animation here\n",
" return true;\n",
" }\n",
"\n",
" // old -> new location animation\n",
" if (a.newLocation != null) {\n",
"\n",
" // Delete the old element\n",
" if (!a.keepOld && a.oldLocation != null) {\n",
" $(\"#\" + chartNum + '-' + a.file + '-' + a.page + '-' + a.oldLocation).remove();\n",
" }\n",
"\n",
" // ANIMATION: If animation is called for, we create a new copy of page *which is above all \n",
" // other SVG g layers* to do the animation, then delete it once the animation is done\n",
" // The point is to do the animation so nice & visible (not hidden behind g layers)\n",
" // But so that the end object is appended to the proper g\n",
" var newE = (a.newLocation === \"BUFFER\") ? buffer : files[a.file];\n",
" var newI = (a.newLocation === \"BUFFER\") ? a.bufferIndex : a.page;\n",
" var newX = (newI + 1) * frameWidth + pagePadding;\n",
" var newY = 2 * pagePadding;\n",
"\n",
" // Animate only if we are in the new part of the log\n",
" if (animate) {\n",
" var newPos = (a.newLocation === \"BUFFER\") ? getBufferPageAbsPos(a.bufferIndex) : getFilePageAbsPos(a.file, a.page);\n",
" if (a.oldLocation != null) {\n",
" var oldPos = (a.oldLocation === \"BUFFER\") ? getBufferPageAbsPos(a.bufferIndex) : getFilePageAbsPos(a.file, a.page);\n",
" var oldE = (a.oldLocation === \"BUFFER\") ? buffer : files[a.file];\n",
" } else {\n",
" var oldPos = newPos;\n",
" var oldE = (a.newLocation === \"BUFFER\") ? buffer : files[a.file];\n",
" }\n",
"\n",
" // Append the dummpy page for animation\n",
" var p = appendPageTo(svg, a, oldPos[0], oldPos[1], 0, pageColor);\n",
"\n",
" // Animation! Place the new page and proceed to next action at end\n",
" p.transition()\n",
" .style('opacity', 1).duration(DURATION)\n",
" .transition()\n",
" .attr('transform', 'translate(' + newPos[0] + ',' + newPos[1] + ')').duration(DURATION)\n",
" .each(\"end\", function() { \n",
" var page = appendPageTo(newE, a, newX, newY, 1, pageColor);\n",
" takeActions(step + 1);\n",
" p.remove();\n",
" });\n",
"\n",
" // Else if no animation we just place the new element and proceed\n",
" } else {\n",
" var page = appendPageTo(newE, a, newX, newY, 1, pageColor);\n",
" takeActions(step + 1);\n",
" }\n",
"\n",
" // If no new location: simple fade out animation\n",
" } else {\n",
" if (!a.keepOld && a.oldLocation != null) {\n",
" var page = $(\"#\" + chartNum + '-' + a.file + '-' + a.page + '-' + a.oldLocation);\n",
" if (animate) {\n",
" page.fadeOut(DURATION, function() {\n",
" page.remove();\n",
" takeActions(step + 1);\n",
" });\n",
" } else {\n",
" page.remove();\n",
" takeActions(step + 1);\n",
" }\n",
" } else {\n",
" takeActions(step + 1);\n",
" }\n",
" }\n",
" return true;\n",
"}\n",
"});\n",
"});\n",
"});\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Create buffer & random file larger than buffer\n",
"b = io.Buffer(buffer_size=3, page_size=2)\n",
"fid = new_rand_file(b, 11)\n",
"b.display_set_mark()\n",
"\n",
"# Use the algorithm\n",
"external_merge_sort(b, fid, erase=True, concise=True)\n",
"b.display(buffer_num=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What is the cost of this algorithm?\n",
"\n",
"Initial part (splitting and in-memory sorting): $2*N$\n",
"\n",
"Recursive external merge: $\\text{ceil}\\left(\\log_B\\left(\\frac{N}{B+1}\\right)\\right)$ passes, each involving $2*N$ IO operations\n",
"\n",
"$\\implies O\\left(2N*\\left(\\text{ceil}\\left(\\log_B\\left(\\frac{N}{B+1}\\right)\\right) + 1\\right)\\right)$"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 1
}