{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2\n", "import io_backend_v3 as io" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note: this notebook requires `io_backend.py` and `compModel.js` to be in the same directory**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "External Merge Sort\n", "------------------\n", "\n", "Sorting big data is easy!" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 522 ms, sys: 5.92 ms, total: 528 ms\n", "Wall time: 533 ms\n" ] } ], "source": [ "import random\n", "x = list(range(1000000)); random.shuffle(x)\n", "%time x.sort()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Except that sorting a million integers isn't cool... you know what's cool? Sorting a billion! Try it!\n", "\n", "Just kidding! (Hint: if you took the above seriously, select \"Kernel >> Restart\" in the notebook menu bar and then select \"Clear all outputs & restart\" to save your computer)\n", "\n", "What actually happens when someone- or more pertinently, some DBMS- wants to sort data that is bigger than main memory (e.g. RAM)? For example, the rows of a very large table in a database?\n", "\n", "In this notebook we'll look at one classic algorithm for doing this efficently: **external merge sort**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Step 1: External Merge\n", "---------------------\n", "\n", "First, we'll look at a simple way to efficiently merge a file larger than memory. Suppose we have the following scenario:\n", "* _Input:_ Two sorted lists of length $N$ and $M$\n", "* _Output:_ One sorted list of length $N+M$\n", "\n", "We will see a way to do this that requires at least 3 buffer pages, and is $O(2*(M+N))$:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def min_non_null_idx(a):\n", " arr = filter(lambda x : x[1] is not None, enumerate(a))\n", " return min(arr, key=lambda x : x[1])\n", "\n", "def external_merge(b, file_in_ids, erase=False):\n", " P = b.page_size\n", " if len(file_in_ids) + 1 > b.buffer_size:\n", " raise Exception(\"Too many files for buffer size.\")\n", "\n", " # A FileIterator will iterate through elements of a file's pages\n", " # reading in & releasing pages so as to take up one page in buffer\n", " fis = [io.FileIterator(b, fid) for fid in file_in_ids]\n", " elements = [fi.get_next() for fi in fis]\n", "\n", " # A FileWriter object will append elements to a file, creating\n", " # & flushing pages so as to take up one page in buffer\n", " fid_out = b.new_file()\n", " fw = io.FileWriter(b, fid_out)\n", " \n", " # Successively choose the smallest of the B files' smallest elements\n", " while any([e is not None for e in elements]):\n", " \n", " # Erase & then output the smallest element from the pages currently in buffer\n", " min_i, min_e = min_non_null_idx(elements)\n", " if erase:\n", " fis[min_i].erase_current()\n", " fw.append(min_e)\n", " \n", " # Get the next element from the file we just picked one from\n", " elements[min_i] = fis[min_i].get_next()\n", " fw.close()\n", " return fid_out" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "def new_rand_file(b, l, sorted=False):\n", " vals = random.sample(range(10*l), l)\n", " if sorted:\n", " vals.sort()\n", " fid = b.new_file()\n", " fw = io.FileWriter(b, fid)\n", " for v in vals:\n", " fw.append(v)\n", " fw.close()\n", " return fid" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IO CountsRW
To/from Buffer00
To/from Disk00
\n", "
\n", "
TOOLTIP!
\n", "
TOOLTIP!
\n", "
\n", " " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/javascript": [ "$.getScript(\"https://d3js.org/d3.v3.min.js\", function () {\n", "$.getScript(\"https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js\", function () {\n", "$.getScript(\"https://ajax.googleapis.com/ajax/libs/jqueryui/1.11.4/jquery-ui.min.js\", function () {\n", "var DURATION = 1000\n", "var chartNum = \"0-0\"\n", "var numBufferPages = 3\n", "var pageSize = 2\n", "var numDiskPages = 5\n", "var logStart = 21\n", "// Load data from json files\n", "var pagesLog;\n", "var getPagesLog = function() {\n", " d3.json(\"pagesLog.json\", function(error, json) {\n", " if (error) return console.warn(error);\n", " pagesLog = json;\n", "\n", " // Make the animation\n", " makeComp();\n", "\n", " // Bind to window resize\n", " //$(window).resize(function() {\n", " // makeComp();\n", " //});\n", " });\n", "}\n", "getPagesLog();\n", "\n", "// Set some globals\n", "var pageWidth = 50 + 20*pageSize; // NOTE: this is hard-coding the font-width!!!\n", "var pageHeight = 30;\n", "var pagePadding = 5;\n", "var frameWidth = pageWidth + 2*pagePadding;\n", "var frameHeight = pageHeight + 2*pagePadding;\n", "var framePadding = 2*pagePadding;\n", "\n", "var bufferFrameFill = '#3366FF';\n", "var diskFrameFill = '#B685FF';\n", "var frameStroke = 'white'; // '#C1C0C0'\n", "var diskBackground = '#ABACB0';\n", "var pageColor = 'blue';\n", "var pageHighlightColor = 'red';\n", "\n", "var maxDiskSize = 4; // MAX # of files that can fit in disk\n", "var fileSize;\n", "var height = framePadding + (maxDiskSize + 2) * (framePadding + frameHeight);\n", "var width;\n", "\n", "// Create a row of frames\n", "var appendFrameRowTo = function(e, x, y, len, name, label, fill, duration, callback) {\n", " var frameRow = e.append('g')\n", " .attr({\n", " 'transform': 'translate(' + x + ',' + y + ')',\n", " 'id': name + '-g'})\n", " .style('opacity', 0);\n", " var frameRange = new Array(len);\n", " frameRow.append('text')\n", " .text(label)\n", " .attr({\n", " 'text-anchor': 'middle',\n", " 'dominant-baseline': 'middle', // NOTE: doesn't work in IE?\n", " 'dx': 0.5*frameWidth,\n", " 'dy': 0.5*frameHeight});\n", " frameRow.selectAll('rect.' + name + '-frame')\n", " .data(frameRange)\n", " .enter()\n", " .append('rect')\n", " .attr({\n", " 'id': function(d,i) { return name + '-frame-' + i; },\n", " 'class': name + '-frame',\n", " 'width': frameWidth,\n", " 'height': frameHeight,\n", " 'x': function(d,i) { return (i+1)*frameWidth; },\n", " 'y': pagePadding,\n", " 'fill': fill,\n", " 'stroke': frameStroke});\n", " if (duration != null) {\n", " frameRow.transition()\n", " .style('opacity', 1)\n", " .duration(DURATION)\n", " .each(\"end\", function() { if (callback != null) { callback(); } });\n", " } else {\n", " frameRow.style('opacity', 1);\n", " if (callback != null) { callback(); }\n", " }\n", " return frameRow;\n", "}\n", "\n", "var diskHeight = function(nFiles) {\n", " return 2*framePadding + nFiles * (framePadding + frameHeight)\n", "}\n", "\n", "// Create the chart\n", "var bufferPages;\n", "var files = [];\n", "var svg;\n", "var buffer;\n", "var disk;\n", "var makeComp = function() {\n", " $(\"#chart-\"+chartNum).html('');\n", " \n", " // Make sure the tooltips are hidden\n", " $(\".tooltip\").hide();\n", "\n", " // viewer size\n", " width = $('.output_subarea').width() - 20;\n", "\n", " // Make the files the max possible size\n", " // TODO: make files dynamically resizeable?\n", " fileSize = Math.floor((width - framePadding) / (framePadding + frameWidth) - 2);\n", "\n", " // append svg\n", " svg = d3.select(\"#chart-\"+chartNum)\n", " .append('svg')\n", " .attr({\n", " 'width': width,\n", " 'height': height})\n", " .style('border', '1px solid black');\n", "\n", " // append buffer\n", " buffer = appendFrameRowTo(svg, pagePadding, pagePadding, numBufferPages, 'buffer', 'BUFFER', bufferFrameFill, null, null);\n", "\n", " // append disk container\n", " var diskY = 2*pagePadding + 2*frameHeight;\n", " disk = svg.append('g')\n", " .attr({\n", " 'id': 'disk-g',\n", " 'transform': 'translate(' + pagePadding + ',' + diskY + ')'});\n", " disk.append('text')\n", " .text('DISK')\n", " .attr({\n", " 'text-anchor': 'middle',\n", " 'dominant-baseline': 'middle', // NOTE: doesn't work in IE?\n", " 'dx': 0.5*frameWidth,\n", " 'dy': 0.5*frameHeight});\n", " disk.append('rect')\n", " .attr({\n", " 'id': chartNum + '-disk-box',\n", " 'width': width - 2*frameWidth,\n", " 'height': diskHeight(1), // Start w/ space for one file\n", " 'x': frameWidth,\n", " 'y': 0,\n", " 'fill': diskBackground,\n", " 'stroke': 'none'});\n", "\n", " // Animate the page transistions according to the log diff in pagesLog\n", " var success = takeActions(0);\n", "}\n", "\n", "var addFile = function(fileId, duration, callback) {\n", " var n = files.length;\n", " if (fileId >= maxDiskSize) {\n", " return null;\n", " }\n", " if (n > 0 && fileId >= n) {\n", " $(\"#\" + chartNum + \"-disk-box\").height(diskHeight(n+1));\n", " }\n", " var file = appendFrameRowTo(disk, 2*framePadding + frameWidth, framePadding + fileId * (framePadding + frameHeight), fileSize, 'file-' + fileId, 'File ' + fileId, diskFrameFill, duration, callback);\n", " if (fileId < n) {\n", " files[fileId] = file;\n", " } else {\n", " files.push(file);\n", " }\n", "}\n", "\n", "var removeFile = function(fileId) {\n", " var file = files[fileId];\n", " file.remove();\n", " files[fileId] = null;\n", " var n = files.length;\n", " var nh = n;\n", " if (n > 0) {\n", " for (var j=n-1; j > 0; j--) {\n", " if (files[j] != null) {\n", " break;\n", " } else {\n", " nh = j;\n", " }\n", " }\n", " $(\"#\" + chartNum + \"-disk-box\").height(diskHeight(nh));\n", " }\n", "}\n", "\n", "// For some reason in certain situations e.g. d3.select(\"#0-0-1\")\n", "// d3.select is not working here?? Workaround...\n", "var getElement = function(elementType, filterFn) {\n", " var e = d3.select('#chart-'+chartNum)\n", " .selectAll(elementType)[0]\n", " .filter(filterFn)[0];\n", " return e;\n", "}\n", "\n", "var pageText = function(fid, pid, data) {\n", " var s = fid + '/' + pid + ' : ';\n", " for (var i=0; i < data.length; i++) {\n", " s += (data[i] != null) ? data[i] : '_';\n", " if (i < data.length - 1) {\n", " s += ', ';\n", " }\n", " }\n", " return s;\n", "}\n", "\n", "var appendPageTo = function(e, a, x, y, opacity, color) {\n", " var page = e.append('g')\n", " .attr({\n", " 'id': chartNum + '-' + a.file + '-' + a.page + '-' + a.newLocation,\n", " 'transform': 'translate(' + x + ',' + y + ')',\n", " 'class': 'page'})\n", " .style('opacity', opacity);\n", " page.append('rect')\n", " .attr({\n", " 'id': 'page-' + a.file + '-' + a.page + '-rect',\n", " 'class': 'page',\n", " 'width': pageWidth,\n", " 'height': pageHeight,\n", " 'fill': color,\n", " 'stroke': '#444'});\n", " page.append('text')\n", " .attr({\n", " 'dx': pagePadding,\n", " 'dy': 0.7*pageHeight,\n", " 'font-size': \"15\",\n", " 'fill': 'white'})\n", " .text(pageText(a.file, a.page, a.pageData));\n", " return page;\n", "}\n", "\n", "var getBufferPageAbsPos = function(bufferIndex) {\n", " var bufferPos = d3.transform(buffer.attr(\"transform\")).translate;\n", " var x = bufferPos[0] + pagePadding + (bufferIndex + 1) * frameWidth;\n", " var y = bufferPos[1] + 2 * pagePadding;\n", " return [x, y];\n", "}\n", "\n", "var getFilePageAbsPos = function(file, page) {\n", " var diskPos = d3.transform(disk.attr(\"transform\")).translate;\n", " var filePos = d3.transform(files[file].attr(\"transform\")).translate;\n", " var x = diskPos[0] + filePos[0] + pagePadding + (page + 1) * frameWidth;\n", " var y = diskPos[1] + filePos[1] + 2 * pagePadding;\n", " return [x, y];\n", "}\n", "\n", "var placeRelativeTo = function(a, b, dTop, dLeft) {\n", " var bOffset = b.offset();\n", " if (bOffset != null) {\n", " a.offset({'top' : bOffset.top + dTop, 'left' : bOffset.left + dLeft});\n", " }\n", "}\n", "\n", "var takeActions = function(step) {\n", "\n", " // Get the next action to execute\n", " if (step < pagesLog.length) {\n", " var a = pagesLog[step];\n", " //console.warn(a.operation);\n", " } else {\n", " $(\".tooltip\").hide();\n", " return null;\n", " }\n", "\n", " // Are we in the new part of the log which should be animated?\n", " var animate = (a.show && step >= logStart);\n", "\n", " // Update IO counter\n", " $(\"#chart-\" + chartNum + \"-bufferReads\").html(a.ioCount.bufferReads);\n", " $(\"#chart-\" + chartNum + \"-bufferWrites\").html(a.ioCount.bufferWrites);\n", " $(\"#chart-\" + chartNum + \"-diskReads\").html(a.ioCount.diskReads);\n", " $(\"#chart-\" + chartNum + \"-diskWrites\").html(a.ioCount.diskWrites);\n", "\n", " // Show tooltip! -> @ old location, or new if old is null\n", " if (animate) {\n", " var tooltip = $(\"#chart-\" + chartNum + \"-tooltip\");\n", " var pageId = \"#\" + chartNum + \"-\" + a.file + \"-\" + a.page + \"-\";\n", " pageId += (a.oldLocation != null) ? a.oldLocation : a.newLocation;\n", " placeRelativeTo(tooltip, $(pageId), 0.7*pageHeight, 0.3*pageWidth);\n", " tooltip.html(a.operation);\n", " tooltip.show();\n", " }\n", "\n", " // Check for tooltip animations & handle here\n", " if (a.operation.slice(0, 7) == 'TOOLTIP') {\n", " // TODO: Clean this up\n", " if (a.operation == 'TOOLTIP-2') {\n", " var tooltip = $(\"#chart-\" + chartNum + \"-tooltip-2\");\n", " var page = $(\"#\" + chartNum + \"-\" + a.file + \"-\" + a.page + \"-\" + a.oldLocation);\n", " placeRelativeTo(tooltip, page, -(tooltip.outerHeight() + 1.5*pagePadding), 0.1*pageWidth);\n", " tooltip.html(a.tooltipContent);\n", " tooltip.show();\n", " }\n", " takeActions(step + 1);\n", " return true;\n", " }\n", "\n", " // Check for file actions & handle here\n", " if (a.operation === \"NEWFILE\") {\n", " addFile(a.file, DURATION, function() { takeActions(step + 1); });\n", " return true;\n", " } else if (a.operation === \"DELETEFILE\") {\n", " removeFile(a.file);\n", " takeActions(step + 1); // TODO: Do proper animation here\n", " return true;\n", " }\n", "\n", " // old -> new location animation\n", " if (a.newLocation != null) {\n", "\n", " // Delete the old element\n", " if (!a.keepOld && a.oldLocation != null) {\n", " $(\"#\" + chartNum + '-' + a.file + '-' + a.page + '-' + a.oldLocation).remove();\n", " }\n", "\n", " // ANIMATION: If animation is called for, we create a new copy of page *which is above all \n", " // other SVG g layers* to do the animation, then delete it once the animation is done\n", " // The point is to do the animation so nice & visible (not hidden behind g layers)\n", " // But so that the end object is appended to the proper g\n", " var newE = (a.newLocation === \"BUFFER\") ? buffer : files[a.file];\n", " var newI = (a.newLocation === \"BUFFER\") ? a.bufferIndex : a.page;\n", " var newX = (newI + 1) * frameWidth + pagePadding;\n", " var newY = 2 * pagePadding;\n", "\n", " // Animate only if we are in the new part of the log\n", " if (animate) {\n", " var newPos = (a.newLocation === \"BUFFER\") ? getBufferPageAbsPos(a.bufferIndex) : getFilePageAbsPos(a.file, a.page);\n", " if (a.oldLocation != null) {\n", " var oldPos = (a.oldLocation === \"BUFFER\") ? getBufferPageAbsPos(a.bufferIndex) : getFilePageAbsPos(a.file, a.page);\n", " var oldE = (a.oldLocation === \"BUFFER\") ? buffer : files[a.file];\n", " } else {\n", " var oldPos = newPos;\n", " var oldE = (a.newLocation === \"BUFFER\") ? buffer : files[a.file];\n", " }\n", "\n", " // Append the dummpy page for animation\n", " var p = appendPageTo(svg, a, oldPos[0], oldPos[1], 0, pageColor);\n", "\n", " // Animation! Place the new page and proceed to next action at end\n", " p.transition()\n", " .style('opacity', 1).duration(DURATION)\n", " .transition()\n", " .attr('transform', 'translate(' + newPos[0] + ',' + newPos[1] + ')').duration(DURATION)\n", " .each(\"end\", function() { \n", " var page = appendPageTo(newE, a, newX, newY, 1, pageColor);\n", " takeActions(step + 1);\n", " p.remove();\n", " });\n", "\n", " // Else if no animation we just place the new element and proceed\n", " } else {\n", " var page = appendPageTo(newE, a, newX, newY, 1, pageColor);\n", " takeActions(step + 1);\n", " }\n", "\n", " // If no new location: simple fade out animation\n", " } else {\n", " if (!a.keepOld && a.oldLocation != null) {\n", " var page = $(\"#\" + chartNum + '-' + a.file + '-' + a.page + '-' + a.oldLocation);\n", " if (animate) {\n", " page.fadeOut(DURATION, function() {\n", " page.remove();\n", " takeActions(step + 1);\n", " });\n", " } else {\n", " page.remove();\n", " takeActions(step + 1);\n", " }\n", " } else {\n", " takeActions(step + 1);\n", " }\n", " }\n", " return true;\n", "}\n", "});\n", "});\n", "});\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Create the buffer + two random sorted files\n", "N = 4\n", "M = 5\n", "b = io.Buffer(page_size=2, buffer_size=3)\n", "fids = [\n", " new_rand_file(b, N, sorted=True),\n", " new_rand_file(b, M, sorted=True)\n", "]\n", "b.display_set_mark() # Don't animate this setup part\n", "\n", "# Merge!\n", "merged_fid = external_merge(b, fids, erase=True)\n", "\n", "# NOTE THAT YOU CAN ADJUST THE ANIMATION SPEED!!!\n", "b.display(speed=1000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that for two files and three buffer pages, this algorithm is\n", "\n", "$O(2*(M+N))$,\n", "\n", "since for each page read in from disk ($M+N$ total), there is a page flushed out. Furthermore, since the algorithm above is fully generalizable to $K$ files and a buffer of size $K+1$ or greater, we see that if file $i$ has $N_i$ pages, then we again are\n", "\n", "$O\\left(2*\\sum_{i=1}^KN_i\\right)$\n", "\n", "which is just the above in more general form." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "External Merge Sort\n", "------------------\n", "\n", "Now that we can merge two sorted files, sorting is easy: we just\n", "\n", "1. Split our data up into files small enough to _sort in memory_ (i.e. small enough to fit in the buffer)\n", "2. Repeatedly _merge these files_ with the above algorithm until they are again one (now sorted) file!\n", "\n", "Since space is an issue for the visualizations, we'll just show a simple sort-merge with one round of merging below, however the algorithm is fully extendable to larger files with more merges:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Helper functions for (1) splitting and sorting in-memory:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "import itertools\n", "def flatten(x): return list(itertools.chain.from_iterable(x))\n", "\n", "def split_and_sort_in_mem(b, fid_in):\n", " fids_out = []\n", " N = b.get_file_len(fid_in)\n", " B = b.buffer_size\n", " P = b.page_size\n", " for i in range(0, N, B):\n", " fid_out = b.new_file()\n", " \n", " # Read in enough pages to fill the buffer if possible\n", " pages = [b.read(fid_in, pid) for pid in range(i, min(i+B, N))]\n", " \n", " # Collect & sort the data from the pages in memory\n", " vals = sorted(flatten([page.get_data_copy() for page in pages]), key=str)\n", " \n", " # Write out to the new file\n", " for j,page in enumerate(pages):\n", " b.release(page)\n", " page = b.new_page(fid_out)\n", " page.set_all(vals[j*P:(j+1)*P])\n", " b.flush(page)\n", " fids_out.append(fid_out)\n", " \n", " # Delete original file & return new file ids\n", " b.delete_file(fid_in)\n", " return fids_out" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Helper function for (2) recursively applying external merge:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "def recursive_external_merge(b, fids, erase=False):\n", " L = len(fids)\n", " if L < b.buffer_size:\n", " return external_merge(b, fids, erase=erase)\n", " else:\n", " mid_1 = recursive_external_merge(b, fids[:(L/2)], erase=erase)\n", " mid_2 = recursive_external_merge(b, fids[(L/2):], erase=erase)\n", " return external_merge(b, [mid_1, mid_2], erase=erase)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_The full **external merge-sort** algorithm:_" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "def external_merge_sort(b, fid, erase=False, concise=False):\n", " \n", " # Split into size-B chunks, sort in memory, save to new files\n", " fids = split_and_sort_in_mem(b, fid)\n", " \n", " # Optionally set mark here so as not to animate the previous two steps\n", " if concise:\n", " b.display_set_mark()\n", " \n", " # Merge the sorted files recursively\n", " return recursive_external_merge(b, fids, erase=erase)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IO CountsRW
To/from Buffer00
To/from Disk00
\n", "
\n", "
TOOLTIP!
\n", "
TOOLTIP!
\n", "
\n", " " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/javascript": [ "$.getScript(\"https://d3js.org/d3.v3.min.js\", function () {\n", "$.getScript(\"https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js\", function () {\n", "$.getScript(\"https://ajax.googleapis.com/ajax/libs/jqueryui/1.11.4/jquery-ui.min.js\", function () {\n", "var DURATION = 1000\n", "var chartNum = \"1-0\"\n", "var numBufferPages = 3\n", "var pageSize = 2\n", "var numDiskPages = 5\n", "var logStart = 69\n", "// Load data from json files\n", "var pagesLog;\n", "var getPagesLog = function() {\n", " d3.json(\"pagesLog.json\", function(error, json) {\n", " if (error) return console.warn(error);\n", " pagesLog = json;\n", "\n", " // Make the animation\n", " makeComp();\n", "\n", " // Bind to window resize\n", " //$(window).resize(function() {\n", " // makeComp();\n", " //});\n", " });\n", "}\n", "getPagesLog();\n", "\n", "// Set some globals\n", "var pageWidth = 50 + 20*pageSize; // NOTE: this is hard-coding the font-width!!!\n", "var pageHeight = 30;\n", "var pagePadding = 5;\n", "var frameWidth = pageWidth + 2*pagePadding;\n", "var frameHeight = pageHeight + 2*pagePadding;\n", "var framePadding = 2*pagePadding;\n", "\n", "var bufferFrameFill = '#3366FF';\n", "var diskFrameFill = '#B685FF';\n", "var frameStroke = 'white'; // '#C1C0C0'\n", "var diskBackground = '#ABACB0';\n", "var pageColor = 'blue';\n", "var pageHighlightColor = 'red';\n", "\n", "var maxDiskSize = 4; // MAX # of files that can fit in disk\n", "var fileSize;\n", "var height = framePadding + (maxDiskSize + 2) * (framePadding + frameHeight);\n", "var width;\n", "\n", "// Create a row of frames\n", "var appendFrameRowTo = function(e, x, y, len, name, label, fill, duration, callback) {\n", " var frameRow = e.append('g')\n", " .attr({\n", " 'transform': 'translate(' + x + ',' + y + ')',\n", " 'id': name + '-g'})\n", " .style('opacity', 0);\n", " var frameRange = new Array(len);\n", " frameRow.append('text')\n", " .text(label)\n", " .attr({\n", " 'text-anchor': 'middle',\n", " 'dominant-baseline': 'middle', // NOTE: doesn't work in IE?\n", " 'dx': 0.5*frameWidth,\n", " 'dy': 0.5*frameHeight});\n", " frameRow.selectAll('rect.' + name + '-frame')\n", " .data(frameRange)\n", " .enter()\n", " .append('rect')\n", " .attr({\n", " 'id': function(d,i) { return name + '-frame-' + i; },\n", " 'class': name + '-frame',\n", " 'width': frameWidth,\n", " 'height': frameHeight,\n", " 'x': function(d,i) { return (i+1)*frameWidth; },\n", " 'y': pagePadding,\n", " 'fill': fill,\n", " 'stroke': frameStroke});\n", " if (duration != null) {\n", " frameRow.transition()\n", " .style('opacity', 1)\n", " .duration(DURATION)\n", " .each(\"end\", function() { if (callback != null) { callback(); } });\n", " } else {\n", " frameRow.style('opacity', 1);\n", " if (callback != null) { callback(); }\n", " }\n", " return frameRow;\n", "}\n", "\n", "var diskHeight = function(nFiles) {\n", " return 2*framePadding + nFiles * (framePadding + frameHeight)\n", "}\n", "\n", "// Create the chart\n", "var bufferPages;\n", "var files = [];\n", "var svg;\n", "var buffer;\n", "var disk;\n", "var makeComp = function() {\n", " $(\"#chart-\"+chartNum).html('');\n", " \n", " // Make sure the tooltips are hidden\n", " $(\".tooltip\").hide();\n", "\n", " // viewer size\n", " width = $('.output_subarea').width() - 20;\n", "\n", " // Make the files the max possible size\n", " // TODO: make files dynamically resizeable?\n", " fileSize = Math.floor((width - framePadding) / (framePadding + frameWidth) - 2);\n", "\n", " // append svg\n", " svg = d3.select(\"#chart-\"+chartNum)\n", " .append('svg')\n", " .attr({\n", " 'width': width,\n", " 'height': height})\n", " .style('border', '1px solid black');\n", "\n", " // append buffer\n", " buffer = appendFrameRowTo(svg, pagePadding, pagePadding, numBufferPages, 'buffer', 'BUFFER', bufferFrameFill, null, null);\n", "\n", " // append disk container\n", " var diskY = 2*pagePadding + 2*frameHeight;\n", " disk = svg.append('g')\n", " .attr({\n", " 'id': 'disk-g',\n", " 'transform': 'translate(' + pagePadding + ',' + diskY + ')'});\n", " disk.append('text')\n", " .text('DISK')\n", " .attr({\n", " 'text-anchor': 'middle',\n", " 'dominant-baseline': 'middle', // NOTE: doesn't work in IE?\n", " 'dx': 0.5*frameWidth,\n", " 'dy': 0.5*frameHeight});\n", " disk.append('rect')\n", " .attr({\n", " 'id': chartNum + '-disk-box',\n", " 'width': width - 2*frameWidth,\n", " 'height': diskHeight(1), // Start w/ space for one file\n", " 'x': frameWidth,\n", " 'y': 0,\n", " 'fill': diskBackground,\n", " 'stroke': 'none'});\n", "\n", " // Animate the page transistions according to the log diff in pagesLog\n", " var success = takeActions(0);\n", "}\n", "\n", "var addFile = function(fileId, duration, callback) {\n", " var n = files.length;\n", " if (fileId >= maxDiskSize) {\n", " return null;\n", " }\n", " if (n > 0 && fileId >= n) {\n", " $(\"#\" + chartNum + \"-disk-box\").height(diskHeight(n+1));\n", " }\n", " var file = appendFrameRowTo(disk, 2*framePadding + frameWidth, framePadding + fileId * (framePadding + frameHeight), fileSize, 'file-' + fileId, 'File ' + fileId, diskFrameFill, duration, callback);\n", " if (fileId < n) {\n", " files[fileId] = file;\n", " } else {\n", " files.push(file);\n", " }\n", "}\n", "\n", "var removeFile = function(fileId) {\n", " var file = files[fileId];\n", " file.remove();\n", " files[fileId] = null;\n", " var n = files.length;\n", " var nh = n;\n", " if (n > 0) {\n", " for (var j=n-1; j > 0; j--) {\n", " if (files[j] != null) {\n", " break;\n", " } else {\n", " nh = j;\n", " }\n", " }\n", " $(\"#\" + chartNum + \"-disk-box\").height(diskHeight(nh));\n", " }\n", "}\n", "\n", "// For some reason in certain situations e.g. d3.select(\"#0-0-1\")\n", "// d3.select is not working here?? Workaround...\n", "var getElement = function(elementType, filterFn) {\n", " var e = d3.select('#chart-'+chartNum)\n", " .selectAll(elementType)[0]\n", " .filter(filterFn)[0];\n", " return e;\n", "}\n", "\n", "var pageText = function(fid, pid, data) {\n", " var s = fid + '/' + pid + ' : ';\n", " for (var i=0; i < data.length; i++) {\n", " s += (data[i] != null) ? data[i] : '_';\n", " if (i < data.length - 1) {\n", " s += ', ';\n", " }\n", " }\n", " return s;\n", "}\n", "\n", "var appendPageTo = function(e, a, x, y, opacity, color) {\n", " var page = e.append('g')\n", " .attr({\n", " 'id': chartNum + '-' + a.file + '-' + a.page + '-' + a.newLocation,\n", " 'transform': 'translate(' + x + ',' + y + ')',\n", " 'class': 'page'})\n", " .style('opacity', opacity);\n", " page.append('rect')\n", " .attr({\n", " 'id': 'page-' + a.file + '-' + a.page + '-rect',\n", " 'class': 'page',\n", " 'width': pageWidth,\n", " 'height': pageHeight,\n", " 'fill': color,\n", " 'stroke': '#444'});\n", " page.append('text')\n", " .attr({\n", " 'dx': pagePadding,\n", " 'dy': 0.7*pageHeight,\n", " 'font-size': \"15\",\n", " 'fill': 'white'})\n", " .text(pageText(a.file, a.page, a.pageData));\n", " return page;\n", "}\n", "\n", "var getBufferPageAbsPos = function(bufferIndex) {\n", " var bufferPos = d3.transform(buffer.attr(\"transform\")).translate;\n", " var x = bufferPos[0] + pagePadding + (bufferIndex + 1) * frameWidth;\n", " var y = bufferPos[1] + 2 * pagePadding;\n", " return [x, y];\n", "}\n", "\n", "var getFilePageAbsPos = function(file, page) {\n", " var diskPos = d3.transform(disk.attr(\"transform\")).translate;\n", " var filePos = d3.transform(files[file].attr(\"transform\")).translate;\n", " var x = diskPos[0] + filePos[0] + pagePadding + (page + 1) * frameWidth;\n", " var y = diskPos[1] + filePos[1] + 2 * pagePadding;\n", " return [x, y];\n", "}\n", "\n", "var placeRelativeTo = function(a, b, dTop, dLeft) {\n", " var bOffset = b.offset();\n", " if (bOffset != null) {\n", " a.offset({'top' : bOffset.top + dTop, 'left' : bOffset.left + dLeft});\n", " }\n", "}\n", "\n", "var takeActions = function(step) {\n", "\n", " // Get the next action to execute\n", " if (step < pagesLog.length) {\n", " var a = pagesLog[step];\n", " //console.warn(a.operation);\n", " } else {\n", " $(\".tooltip\").hide();\n", " return null;\n", " }\n", "\n", " // Are we in the new part of the log which should be animated?\n", " var animate = (a.show && step >= logStart);\n", "\n", " // Update IO counter\n", " $(\"#chart-\" + chartNum + \"-bufferReads\").html(a.ioCount.bufferReads);\n", " $(\"#chart-\" + chartNum + \"-bufferWrites\").html(a.ioCount.bufferWrites);\n", " $(\"#chart-\" + chartNum + \"-diskReads\").html(a.ioCount.diskReads);\n", " $(\"#chart-\" + chartNum + \"-diskWrites\").html(a.ioCount.diskWrites);\n", "\n", " // Show tooltip! -> @ old location, or new if old is null\n", " if (animate) {\n", " var tooltip = $(\"#chart-\" + chartNum + \"-tooltip\");\n", " var pageId = \"#\" + chartNum + \"-\" + a.file + \"-\" + a.page + \"-\";\n", " pageId += (a.oldLocation != null) ? a.oldLocation : a.newLocation;\n", " placeRelativeTo(tooltip, $(pageId), 0.7*pageHeight, 0.3*pageWidth);\n", " tooltip.html(a.operation);\n", " tooltip.show();\n", " }\n", "\n", " // Check for tooltip animations & handle here\n", " if (a.operation.slice(0, 7) == 'TOOLTIP') {\n", " // TODO: Clean this up\n", " if (a.operation == 'TOOLTIP-2') {\n", " var tooltip = $(\"#chart-\" + chartNum + \"-tooltip-2\");\n", " var page = $(\"#\" + chartNum + \"-\" + a.file + \"-\" + a.page + \"-\" + a.oldLocation);\n", " placeRelativeTo(tooltip, page, -(tooltip.outerHeight() + 1.5*pagePadding), 0.1*pageWidth);\n", " tooltip.html(a.tooltipContent);\n", " tooltip.show();\n", " }\n", " takeActions(step + 1);\n", " return true;\n", " }\n", "\n", " // Check for file actions & handle here\n", " if (a.operation === \"NEWFILE\") {\n", " addFile(a.file, DURATION, function() { takeActions(step + 1); });\n", " return true;\n", " } else if (a.operation === \"DELETEFILE\") {\n", " removeFile(a.file);\n", " takeActions(step + 1); // TODO: Do proper animation here\n", " return true;\n", " }\n", "\n", " // old -> new location animation\n", " if (a.newLocation != null) {\n", "\n", " // Delete the old element\n", " if (!a.keepOld && a.oldLocation != null) {\n", " $(\"#\" + chartNum + '-' + a.file + '-' + a.page + '-' + a.oldLocation).remove();\n", " }\n", "\n", " // ANIMATION: If animation is called for, we create a new copy of page *which is above all \n", " // other SVG g layers* to do the animation, then delete it once the animation is done\n", " // The point is to do the animation so nice & visible (not hidden behind g layers)\n", " // But so that the end object is appended to the proper g\n", " var newE = (a.newLocation === \"BUFFER\") ? buffer : files[a.file];\n", " var newI = (a.newLocation === \"BUFFER\") ? a.bufferIndex : a.page;\n", " var newX = (newI + 1) * frameWidth + pagePadding;\n", " var newY = 2 * pagePadding;\n", "\n", " // Animate only if we are in the new part of the log\n", " if (animate) {\n", " var newPos = (a.newLocation === \"BUFFER\") ? getBufferPageAbsPos(a.bufferIndex) : getFilePageAbsPos(a.file, a.page);\n", " if (a.oldLocation != null) {\n", " var oldPos = (a.oldLocation === \"BUFFER\") ? getBufferPageAbsPos(a.bufferIndex) : getFilePageAbsPos(a.file, a.page);\n", " var oldE = (a.oldLocation === \"BUFFER\") ? buffer : files[a.file];\n", " } else {\n", " var oldPos = newPos;\n", " var oldE = (a.newLocation === \"BUFFER\") ? buffer : files[a.file];\n", " }\n", "\n", " // Append the dummpy page for animation\n", " var p = appendPageTo(svg, a, oldPos[0], oldPos[1], 0, pageColor);\n", "\n", " // Animation! Place the new page and proceed to next action at end\n", " p.transition()\n", " .style('opacity', 1).duration(DURATION)\n", " .transition()\n", " .attr('transform', 'translate(' + newPos[0] + ',' + newPos[1] + ')').duration(DURATION)\n", " .each(\"end\", function() { \n", " var page = appendPageTo(newE, a, newX, newY, 1, pageColor);\n", " takeActions(step + 1);\n", " p.remove();\n", " });\n", "\n", " // Else if no animation we just place the new element and proceed\n", " } else {\n", " var page = appendPageTo(newE, a, newX, newY, 1, pageColor);\n", " takeActions(step + 1);\n", " }\n", "\n", " // If no new location: simple fade out animation\n", " } else {\n", " if (!a.keepOld && a.oldLocation != null) {\n", " var page = $(\"#\" + chartNum + '-' + a.file + '-' + a.page + '-' + a.oldLocation);\n", " if (animate) {\n", " page.fadeOut(DURATION, function() {\n", " page.remove();\n", " takeActions(step + 1);\n", " });\n", " } else {\n", " page.remove();\n", " takeActions(step + 1);\n", " }\n", " } else {\n", " takeActions(step + 1);\n", " }\n", " }\n", " return true;\n", "}\n", "});\n", "});\n", "});\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Create buffer & random file larger than buffer\n", "b = io.Buffer(buffer_size=3, page_size=2)\n", "fid = new_rand_file(b, 11)\n", "b.display_set_mark()\n", "\n", "# Use the algorithm\n", "external_merge_sort(b, fid, erase=True, concise=True)\n", "b.display(buffer_num=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What is the cost of this algorithm?\n", "\n", "Initial part (splitting and in-memory sorting): $2*N$\n", "\n", "Recursive external merge: $\\text{ceil}\\left(\\log_B\\left(\\frac{N}{B+1}\\right)\\right)$ passes, each involving $2*N$ IO operations\n", "\n", "$\\implies O\\left(2N*\\left(\\text{ceil}\\left(\\log_B\\left(\\frac{N}{B+1}\\right)\\right) + 1\\right)\\right)$" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 1 }