{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load_ext sql\n", "%sql sqlite:///flights.db" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Problem Set 1\n", "=======\n", "\n", "### Instructions / Notes:\n", "\n", "**_Read these carefully_**\n", "\n", "\n", "* You will need to install the `prettytable` module to run the scripts. (eg. `pip install --user prettytable`)\n", "* Download `flights.db` from http://web.stanford.edu/class/cs145/files/flights.db and make sure it is in the same directory as this Jupyter notebook \n", "* Run the top cell above to load the database `flights.db`.\n", "* You **may** create new Jupyter notebook cells to use for e.g. testing, debugging, exploring, etc.- this is encouraged in fact!\n", "* When you see `In [*]:` to the left of the cell you are executing, this means that the code / query is _running_.\n", " * **If the cell is hanging- i.e. running for too long: to restart the SQL connection, you must restart the entire python kernel**\n", " * To restart the kernel using the menu bar: \"Kernel >> Restart & Clear Output\"), then re-execute the sql connection cell at the top\n", " * You will also need to restart the connection if you want to load a different version of the database file\n", "* Remember:\n", " * `%sql [SQL]` is for _single line_ SQL queries\n", " * `%%sql \n", " [SQL]` is for _multi line_ SQL queries\n", "* Running `submit.py` will run the queries and print them. \n", "* We have provided correct output from our solution in `correct_output.txt`.\n", " * You can diff your output with the correct output by running `python sanity_check.py` or `python submit.py > my_output; diff my_output correct_output.txt`\n", " * **Your `submit.py` should match this output exactly.** This means:\n", " * the columns should have the **same names** as `correct_output.txt`\n", " * the columns should be in the **same order** as `correct_output.txt`\n", "\n", "### Submission Instructions:\n", " * Do _NOT_ submit your iPython notebook.\n", " * Instead, copy each query into the provided `submit.py` script in the corresponding variable.\n", " * Do _not_ copy the `%sql` or `%%sql` commands, only the raw SQL.\n", " * We will autograde your queries on a database with identical schema but arbitrarily-chosen values. Therefore, do not hard-code any constants in your queries -- all of your queries should use the tables in the database.\n", " * **Follow the instructions in `submission_instructions.txt` to upload your answers to our autograders.**\n", "\n", "_Have fun!_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Introduction: Travel Delays\n", "------------------------\n", "\n", "There's nothing I dislike more than travel delays -- how about you?\n", "\n", "In fact, I'm always scheming new ways to avoid travel delays, and I just found an amazing dataset that will help me understand some of the causes and trade-offs when traveling.\n", "\n", "I wonder if you can use SQL to help me!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Not surprisingly... you can!\n", "\n", "In this homework, we'll use SQL to explore airline travel delays that occurred in July 2017.\n", "\n", "To start, let's look at the primary relation in the database we've prepared for you:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearquartermonthday_of_monthday_of_weekfl_dateunique_carrierairline_idcarriertail_numfl_numorigin_airport_idorigin_airport_seq_idorigin_city_market_idoriginorigin_city_nameorigin_state_abrorigin_state_fipsorigin_state_nmorigin_wacdest_airport_iddest_airport_seq_iddest_city_market_iddestdest_city_namedest_state_abrdest_state_fipsdest_state_nmdest_waccrs_dep_timedep_timedep_delaydep_delay_newdep_del15dep_delay_groupdep_time_blktaxi_outwheels_offwheels_ontaxi_incrs_arr_timearr_timearr_delayarr_delay_newarr_del15arr_delay_grouparr_time_blkcancelledcancellation_codedivertedcarrier_delayweather_delaynas_delaysecurity_delaylate_aircraft_delayunnamed: 55
201737162017-07-01AS19930ASN559AS111278112780330852DCAWashington, DCVA51Virginia3814747147470330559SEASeattle, WAWA53Washington93800750.0-10.00.00.0-1.00800-085917.0807.01010.011.010531021.0-32.00.00.0-2.01000-10590.0None0.0NoneNoneNoneNoneNoneNone
" ], "text/plain": [ "[(2017, 3, 7, 1, 6, u'2017-07-01', u'AS', 19930, u'AS', u'N559AS', 1, 11278, 1127803, 30852, u'DCA', u'Washington, DC', u'VA', 51, u'Virginia', 38, 14747, 1474703, 30559, u'SEA', u'Seattle, WA', u'WA', 53, u'Washington', 93, 800, 750.0, -10.0, 0.0, 0.0, -1.0, u'0800-0859', 17.0, 807.0, 1010.0, 11.0, 1053, 1021.0, -32.0, 0.0, 0.0, -2.0, u'1000-1059', 0.0, None, 0.0, None, None, None, None, None, None)]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "SELECT * \n", "FROM flight_delays \n", "LIMIT 1;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Cool, there are so many columns! How many rows are there?" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", "
num_rows
509070
" ], "text/plain": [ "[(509070,)]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "SELECT COUNT(*) AS num_rows\n", "FROM flight_delays" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wow, that's a lot of data! Good thing you don't have to answer all of my questions by hand...\n", "\n", "You don't need to import more data into the database. However, you can find a description of each field online at [https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236](https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236).\n", "\n", "We've pre-loaded a number of additional tables that will help you decode important fields like `airline_id`, `airport_id`, and `day_of_week` into human-readable form for the queries below.\n", "\n", "Please use the following cell to explore these the `airlines` and `weekdays` tables:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%sql\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alright -- let's get started!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part I: SQL Queries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Query 1: How long are flights delayed on average? (10 points)\n", "------------------------\n", "Just to get a sense of the data, let's start with a simple query.\n", "\n", "In the cell below, write a SQL query that returns the average arrival delay for the entire month of July 2017 (i.e., the whole dataset)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%sql\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Query 2: What was the worst flight delay? (10 points)\n", "------------------------\n", "Hmm, the average doesn't look too bad! What about the _worst_ delay?\n", "\n", "In the cell below, write a SQL query that returns the maximum arrival delay for the entire month of July 2017 (i.e., the whole dataset)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%sql\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Query 3: What flight am I happiest I didn't take? (10 points)\n", "------------------------\n", "Yikes! What flight was so late?\n", "\n", "In the cell below, write a SQL query that returns the carrier (i.e., `carrier`), flight number, origin city name, arrival city name, and flight date for the flight with the maximum arrival delay for the entire month of July 2017. Do not hard-code the arrival delay you found above. Hint: use a subquery." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%sql\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Query 4: Which are the worst days to travel? (10 points)\n", "------------------------\n", "Since CS145 just started, I don't have time to head to Kona anytime soon. However, I'm headed out of town for a trip next week! What day is worst for booking my flight?\n", "\n", "In the cell below, write a SQL query that returns the average arrival delay time for each day of the week, in descending order. The schema of your relation should be of the form (`weekday_name`, `average_delay`).\n", "\n", "**Note: do _not_ report the weekday ID.** (Hint: look at the `weekdays` table and perform a join to obtain the weekday name.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%sql\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Query 5: Which airlines that fly out of SFO are delayed least?\n", "------------------------\n", "Now that I know which days to avoid, I'm curious which airline I should fly out of SFO. Since I haven't been told where I'm flying, please just compute the average for the airlines that fly from SFO.\n", "\n", "In the cell below, write a SQL query that returns the average arrival delay time (across _all_ flights) for each carrier that flew out of SFO at least once in July 2017 (i.e., in the current dataset), in descending order.\n", "\n", "**Note: do _not_ report the airlines ID.** (Hint: a subquery is helpful here; also, look at the `airlines` table and perform a join.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%sql\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Query 6: What proportion of airlines are regularly late?\n", "------------------------\n", "Yeesh, there are a lot of late flights! How many airlines are regularly late?\n", "\n", "In the cell below, write a SQL query that returns the proportion of airlines (appearing in `flight_delays`) whose flights are on average at least 10 minutes late to arrive. Do not hard-code the total number of airlines, and make sure to use at least one `HAVING` clause in your SQL query.\n", "\n", "Note: sqlite `COUNT(*)` returns integer types. Therefore, your query should likely contain at least one `SELECT CAST (COUNT(*) AS float)` or a clause like `COUNT(*)*1.0`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%sql\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Query 7: How do late departures affect late arrivals?\n", "------------------------\n", "It sure looks like my plane is likely to be delayed. I'd like to know: if my plane is delayed in taking off, how will it affect my arrival time?\n", "\n", "The [sample covariance](https://en.wikipedia.org/wiki/Covariance) provides a measure of the joint variability of two variables. The higher the covariance, the more the two variables behave similarly, and negative covariance indicates the variables indicate the variables tend to be inversely related. We can compute the sample covariance as:\n", "$$\n", "Cov(X,Y) = \\frac{1}{n-1} \\sum_{i=1}^n (x_i-\\hat{x})(y_i-\\hat{y})\n", "$$\n", "where $x_i$ denotes the $i$th sample of $X$, $y_i$ the $i$th sample of $Y$, and the mean of $X$ and $Y$ are denoted by $\\bar{x}$ and $\\bar{y}$.\n", "\n", "In the cell below, write a single SQL query that computes the covariance between the departure delay time and the arrival delay time.\n", "\n", "*Note: we could also compute a statistic like the [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) here, which provides a normalized measure (i.e., on a scale from -1 to 1) of how strongly two variables are related. However, sqlite doesn't natively support square roots (unlike commonly-used relational databases like PostgreSQL and MySQL!), so we're asking you to compute covariance instead.*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%sql\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Query 8: It was a bad week...\n", "------------------------\n", "Which airlines had the largest absolute increase in average arrival delay in the last week of July (i.e., flights on or after July 24th) compared to the previous days (i.e. flights before July 24th)?\n", "\n", "In the cell below, write a single SQL query that returns the airline name (_not_ ID) with the maximum absolute increase in average arrival delay between the first 23 days of the month and days 24-31. Report both the airline name and the absolute increase.\n", "\n", "Note: due to [sqlite's handling of dates](http://www.sqlite.org/lang_datefunc.html), it may be easier to query using `day_of_month`.\n", "\n", "Note 2: This is probably the hardest query of the assignment; break it down into subqueries that you can run one-by-one and build up your answer subquery by subquery.\n", "\n", "Hint: You can compute two subqueries, one to compute the average arrival delay for flights on or after July 24th, and one to compute the average arrival delay for flights before July 24th, and then join the two to calculate the increase in delay. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%sql\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Query 9: Of Hipsters and Anarchists\n", "------------------------\n", "I'm keen to visit both Portland (PDX) and Eugene (EUG), but I can't fit both into the same trip. To maximize my frequent flier mileage, I'd like to use the same flight for each. Which airlines fly both SFO -> PDX and SFO -> EUG?\n", "\n", "In the cell below, write a single SQL query that returns the distinct airline names (_not_ ID, and with no duplicates) that flew both SFO -> PDX and SFO -> EUG in July 2017." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%sql\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Query 10: Decision Fatigue and Equidistance\n", "------------------------\n", "I'm flying back to Stanford from Chicago later this month, and I can fly out of either Midway (MDW) or O'Hare (ORD) and can fly into either San Francisco (SFO), San Jose (SJC), or Oakland (OAK). If this month is like July, which leg will have the shortest arrival delay for flights leaving Chicago after 2PM local time?\n", "\n", "In the cell below, write a single SQL query that returns the average arrival delay of flights departing either MDW or ORD after 2PM local time (`crs_dep_time`) and arriving at one of SFO, SJC, or OAK. Group by departure and arrival airport and return results descending by arrival delay.\n", "\n", "Note: the `crs_dep_time` field is an integer formatted as hhmm (e.g. 4:15pm is 1615)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%sql\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## You're done! Now submit!\n", " * Refer to the top of this notebook for submission instructions." ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 2 }