NYC OpenData publishes data on 311 service requests every day, with records dating back to 2010.
Each day, NYC311 receives thousands of requests covering several hundred types of non-emergency services, including noise complaints, plumbing issues, and illegally parked cars. NYC311 forwards each request to the relevant agency, such as the Police, Buildings, or Transportation department. The agency responds to the request and addresses it, and the request is then closed.
The data is accessible online, but the file is so large that it's difficult to work with. We built a pipeline to organize, analyze, and visualize that data, so that anyone can interactively explore both short-term and long-term trends in New York City service requests.
Here's the process we used:
First, we fired up an AWS EC2 instance with Hadoop and Hive. Then, we loaded gigabytes of 311 data as a CSV.
Second, we moved the CSV into HDFS, and we imposed a relational schema on the data in Hive.
Third, once the data was loaded into Hive, we performed transformations via HiveQL to extract insights.
Fourth, we pulled fresh data from NYC OpenData on a daily basis using the dataset's Socrata API. With a few Python scripts, our data visualizations stay fully up to date!
Finally, we loaded the data from Hive into Tableau using Cloudera's driver, and we created a few visualizations.
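The schema step (the second item above) can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual DDL: the table name, the HDFS directory, and the column list are hypothetical, and the statement is built as a Python string only so it can be passed to whichever Hive client you use. It relies on Hive's OpenCSVSerde, which reads every column as a string, so dates would still need casting in later queries.

```python
# Hypothetical, shortened column list. Hive maps CSV columns by position,
# so a real table must list every column of the export in file order.
COLUMNS = [
    "unique_key",
    "created_date",
    "closed_date",
    "agency",
    "complaint_type",
    "borough",
]


def build_create_table(table, hdfs_dir, columns=COLUMNS):
    """Return HiveQL that lays a schema over CSV files already sitting in HDFS."""
    cols = ",\n  ".join(f"`{name}` STRING" for name in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        "ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'\n"
        f"LOCATION '{hdfs_dir}'\n"
        # Skip the CSV header row so it is not read as data.
        "TBLPROPERTIES ('skip.header.line.count'='1')"
    )


# Example with an assumed table name and HDFS directory.
ddl = build_create_table("requests_311", "/data/311")
print(ddl)
```

An external table leaves the CSV where it is in HDFS, so dropping the table in Hive does not delete the underlying file.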
NYC's 311 dataset dates back to 2010, and it's about 20GB as a raw CSV. That's far too large to play with on your laptop--which is why this is an interesting project.
When working with a dataset of that size, it's sometimes helpful to just look at a small piece first. The dashboard below is 311 data from the month of July 2017.
Feel free to filter and drill down using the controls on the right.
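If you want to cut a one-month slice like this yourself, a streaming filter avoids loading the whole file into memory. The sketch below is a minimal example, assuming the CSV has a "Created Date" column formatted like 07/15/2017 02:30:00 PM; the function name and the sample rows are made up for illustration.

```python
import csv
import io
from datetime import datetime

# Assumed timestamp layout of the CSV export, e.g. 07/15/2017 02:30:00 PM.
DATE_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def rows_in_month(lines, year, month, date_column="Created Date"):
    """Yield rows created in the given month, reading one line at a time."""
    for row in csv.DictReader(lines):
        created = datetime.strptime(row[date_column], DATE_FORMAT)
        if created.year == year and created.month == month:
            yield row


# Tiny made-up sample standing in for the real file.
sample = io.StringIO(
    "Unique Key,Created Date,Borough\n"
    "1,07/15/2017 02:30:00 PM,BROOKLYN\n"
    "2,08/01/2017 09:00:00 AM,QUEENS\n"
)
july_rows = list(rows_in_month(sample, 2017, 7))
```

For the real file, you would pass an open file handle (opened with newline="") instead of the in-memory sample.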
What have been the most common service requests over the last seven and a half years? Do they differ by borough? This viz shows the most common complaints from 2010 through mid-August 2017.
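The grouping behind a chart like this is a count of complaint types within each borough. The project ran its aggregations in Hive; the sketch below shows the same idea in plain Python on a handful of made-up records, with the field names borough and complaint_type assumed for illustration.

```python
from collections import Counter, defaultdict


def top_complaints_by_borough(records, n=3):
    """Return the n most frequent complaint types for each borough."""
    counts = defaultdict(Counter)
    for rec in records:
        counts[rec["borough"]][rec["complaint_type"]] += 1
    return {borough: c.most_common(n) for borough, c in counts.items()}


# Made-up records, not real 311 data.
records = [
    {"borough": "BROOKLYN", "complaint_type": "Noise"},
    {"borough": "BROOKLYN", "complaint_type": "Noise"},
    {"borough": "BROOKLYN", "complaint_type": "Illegal Parking"},
    {"borough": "QUEENS", "complaint_type": "Plumbing"},
]
top = top_complaints_by_borough(records, n=1)
```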
Here are some of the more bizarre complaints in the database. They're fun to explore!
While the raw dataset on NYC OpenData is huge, it's also constantly changing. To get a sense for the very most recent data, we plugged into the dataset's Socrata API via a few Python scripts.
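A request for the newest records might look something like the sketch below. It is not the project's actual script: the resource URL (including the dataset ID) and the created_date field name are assumptions that you should verify on NYC OpenData, and the function names are hypothetical. The $where, $order, and $limit parameters are standard Socrata (SODA) query options.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed SODA endpoint for the 311 dataset; confirm the ID on NYC OpenData.
BASE_URL = "https://data.cityofnewyork.us/resource/erm2-nwe9.json"


def build_query_url(since, limit=1000, base_url=BASE_URL):
    """Build a SODA URL for requests created on or after `since` (ISO 8601)."""
    params = {
        "$where": f"created_date >= '{since}'",
        "$order": "created_date DESC",
        "$limit": limit,
    }
    return f"{base_url}?{urlencode(params)}"


def fetch_recent(since, limit=1000):
    """Download matching records as a list of dicts (makes a network call)."""
    with urlopen(build_query_url(since, limit)) as resp:
        return json.load(resp)


url = build_query_url("2017-08-01T00:00:00", limit=5)
print(url)
```

Keeping the URL construction separate from the download makes the query easy to inspect and test without touching the network.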
Looking at 311 data at a daily level calls for some different metrics. Foremost among them: how long does it take for service requests to close? Is the Department of Buildings taking too long to fix your elevator? Explore the interactive tables and charts below to see whether neighboring boroughs are getting better service.
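One simple way to measure time to close is the median number of hours between a request's creation and its closure, grouped by borough. The sketch below assumes each record is a dict with created_date, closed_date, and borough fields holding ISO 8601 timestamps; those names, the choice of the median, and the sample records are illustrative assumptions rather than the project's actual method.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def median_hours_to_close(records):
    """Return the median hours from creation to closure for each borough."""
    hours = defaultdict(list)
    for rec in records:
        if not rec.get("closed_date"):
            continue  # request is still open, so it has no closing time yet
        opened = datetime.fromisoformat(rec["created_date"])
        closed = datetime.fromisoformat(rec["closed_date"])
        hours[rec["borough"]].append((closed - opened).total_seconds() / 3600)
    return {borough: median(values) for borough, values in hours.items()}


# Made-up records, not real 311 data.
records = [
    {"borough": "BRONX", "created_date": "2017-08-01T08:00:00",
     "closed_date": "2017-08-01T10:00:00"},
    {"borough": "BRONX", "created_date": "2017-08-01T09:00:00",
     "closed_date": "2017-08-01T15:00:00"},
    {"borough": "BRONX", "created_date": "2017-08-02T09:00:00"},
]
result = median_hours_to_close(records)
```

The median is less sensitive than the mean to the occasional request that stays open for months.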
Check out our repository on GitHub. It includes all of the instructions you need to launch this project yourself!