OpenBlock

Overview & Initial Experience

Karen Tracey, Colin Copeland

Presenter Notes

Talk Outline

  • What is it?
  • Our experience with it
  • Extensions inspired by our experience
  • Future

Presenter Notes

What is OpenBlock?

Presenter Notes

What Kinds of Sites Might Use OpenBlock?

  • Sites with a local focus,
  • which can benefit from automated news item generation,
  • where news item sources already exist online,
  • in some user-unfriendly, but regular, fashion

Presenter Notes

OpenBlock History

Presenter Notes

  • Let's take a moment to learn where OpenBlock came from

EveryBlock.com

static/everyblock-logo.jpg
  • Adrian Holovaty founded a web startup, EveryBlock, with a team of six
  • March 2007: Won Knight News Challenge program grant
  • Jan. 2008: Site launch with Chicago, New York, San Francisco
  • June 2008: Charlotte and Philadelphia added
  • Aug. 2009: Acquired by MSNBC
  • Today in 16 cities, 3 more coming soon

Presenter Notes

  • EveryBlock was originally funded by a two-year grant from the Knight Foundation through its Knight News Challenge program.

EveryBlock.com

static/example-everyblock.png

Presenter Notes

  • Browse by neighborhood, street, or ZIP code, or draw your own location
  • Lots of public record information as well as community neighbor content
  • Lots of community activity, especially in Chicago

EveryBlock Source Code

  • July 2009: the EveryBlock team open sourced core functionality on Google Code
  • 7 tarballs available at http://code.google.com/p/ebcode/
  • And then what happened?

Presenter Notes

  • Great codebase, lots of potential, but no community around the code yet
  • Hard to configure project requirements
  • Code was fairly complex, hard for beginners to jump into the project

OpenBlock

static/openblock-logo.png
  • June 2010: Knight Foundation launches OpenBlock Initiative grant
    • OpenPlans: streamline and extend OpenBlock over 2 years
    • The Columbia Daily Tribune: install, test, and add new features in the context of a smaller newspaper
    • The Boston Globe: install, test, and add new features in the context of a larger newspaper

Presenter Notes

  • Limited adoption a year after being open sourced
  • Very little traffic on the ebcode mailing list
  • Grant goal to simplify and accelerate adoption of the open sourced EveryBlock code

OpenBlock Today

static/openblock-logo.png

Presenter Notes

  • Two years later, this is what OpenBlock looks like today
  • Most important are ebpub and ebdata as they contain the geocoding, scraping, and display logic
  • Take a moment to talk briefly about the architecture

OpenBlock Architecture

static/openblock-components.png

Presenter Notes

  • The OpenBlock architecture comprises 4 main components
  • Touch briefly on data model

Data Model

  • Primary News Models
    • Schema: description of a particular data set, like "Restaurant Inspection"
    • NewsItem: individual piece of news associated with a schema
  • Primary Geocoder Models
    • Street: a street with a unique name
    • Intersection: a point representing the meeting of two streets
    • Block: segment of a single street between two intersecting streets
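
A rough sketch of how these models relate, written as plain Python dataclasses rather than the Django models OpenBlock actually uses. All field names and the coordinates are illustrative assumptions, not OpenBlock's real schema; the street names come from the example city in the next slides.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, Optional, Tuple

Point = Tuple[float, float]  # (x, y); arbitrary units in this sketch


@dataclass
class Schema:
    """Describes one data set, e.g. "Restaurant Inspection"."""
    slug: str
    name: str


@dataclass
class NewsItem:
    """One piece of news belonging to a schema."""
    schema: Schema
    title: str
    item_date: date
    location_name: str = ""
    location: Optional[Point] = None  # None until the item is geocoded
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class Street:
    """A street with a unique name."""
    name: str


@dataclass
class Intersection:
    """The point where two streets meet."""
    street_a: Street
    street_b: Street
    point: Point


@dataclass
class Block:
    """The segment of one street between two intersecting streets."""
    street: Street
    start: Intersection
    end: Intersection


# Example: the block of Main St between 1st St and 2nd St.
main, first, second = Street("Main St"), Street("1st St"), Street("2nd St")
block = Block(
    street=main,
    start=Intersection(main, first, (0.0, 0.0)),
    end=Intersection(main, second, (1.0, 0.0)),
)

inspections = Schema(slug="restaurant-inspections", name="Restaurant Inspection")
item = NewsItem(
    schema=inspections,
    title="Routine inspection",
    item_date=date(2012, 9, 1),
    location_name="100 Main St",
)
```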

Presenter Notes

  • Highlight the most important OpenBlock models, divided into two categories
  • Blocks are a fundamental piece of the OpenBlock system
  • Let's look at a few diagrams to illustrate the block model

Example City Streets

static/data-model-city.png

Presenter Notes

  • Example city
  • Main St divided by 1st and 2nd street

Street Model

static/data-model-street.png

Presenter Notes

  • Street model represents an entire street
  • So you can see the entire length of Main St highlighted here

Block Model

static/data-model-block.png

Presenter Notes

  • One segment of a street, including the left and right address ranges for that segment
  • Blocks are a fundamental piece of the OpenBlock system
  • They're core to geocoding and are browsable on the Web UI
  • We'll talk more about blocks later, but I wanted to familiarize you with them now
  • Now, let's look at some OpenBlock sites

OpenBlock Sites

Presenter Notes

OpenBlock Demo: Boston

static/example-boston.png

Presenter Notes

  • Flagship demo for OpenBlock in Boston, MA
  • Ideal example for OpenBlock (large city, similar to EveryBlock)
  • Very recent data, including restaurant inspections and police reports

openCampus Kent

static/example-kent.png

Presenter Notes

  • Kent State University in Ohio
  • Simple site only using a few OpenBlock views (no detail views)
  • Crime reports, reviews from Yelp, News feed from campus newspaper

LarryvilleKU

static/example-larryvilleku.png

Presenter Notes

  • University of Kansas
  • Twitter integration and accident reports
  • Joint venture of the School of Journalism and the student newspaper
  • Newspaper partnership is related to what we've been doing with OpenRural

OpenRural

Presenter Notes

  • Taking OpenBlock and using it in rural North Carolina communities
  • Small towns and small news organizations
  • Newspapers don't have a lot of digital resources
  • And they lack the resources to make public data digestible on the web
  • Quite different from the typical OpenBlock setup in a big city with larger infrastructure

OpenRural

static/unc.png
  • June 2011: OpenRural funded by a three-year Knight News Challenge grant
  • Ryan Thornburg, professor at School of Journalism and Mass Communication at UNC
  • Caktus is helping develop and deploy OpenRural for these NC communities

Presenter Notes

  • Goals:
    • Apply same OpenBlock tools to rural North Carolina communities
    • Increase access to local public records
    • Do this by helping local newspapers leverage OpenBlock
    • "Help Rural Newspapers Get Access to Public Data"

Columbus County, North Carolina

static/nc-columbus-county.png

Presenter Notes

  • Our initial focus is on Columbus County, NC
  • Small county in the southeastern part of the state with 50k residents
  • Working with a local newspaper to incorporate public records onto their site

The News Reporter

static/whiteville-com.png

Presenter Notes

  • The online version of the paper serving Whiteville and Columbus County

Columbus County, NC

static/columbus-county-map.png

Presenter Notes

Sources for Street/Block Data

  • Shapefiles contain location data and metadata
    • Census (TIGER/Line)
    • County
    • State
  • How to measure accuracy & completeness?
    • Columbus County GIS has an addresses file
    • ~38,000 valid addresses in the county
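
One way to put a number on accuracy and completeness is to run every known-valid address through the geocoder and count how many resolve. The sketch below assumes a `geocode` callable that returns a point or `None`; that interface, the toy geocoder, and the coordinates are stand-ins for illustration, not OpenBlock's actual API or real data.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Point = Tuple[float, float]


def geocode_success_rate(
    addresses: Iterable[str],
    geocode: Callable[[str], Optional[Point]],
) -> Tuple[float, List[str]]:
    """Return (fraction of addresses geocoded, list of addresses that failed)."""
    total = 0
    failures: List[str] = []
    for address in addresses:
        total += 1
        try:
            result = geocode(address)
        except Exception:
            # Treat any geocoder error as a failed lookup for this address.
            result = None
        if result is None:
            failures.append(address)
    if total == 0:
        return 0.0, failures
    return (total - len(failures)) / total, failures


# Toy geocoder backed by a lookup table; the coordinates are made up.
KNOWN = {"100 MAIN ST": (-78.70, 34.33)}


def toy_geocode(address: str) -> Optional[Point]:
    return KNOWN.get(address.strip().upper())


rate, failed = geocode_success_rate(["100 Main St", "999 Nowhere Rd"], toy_geocode)
```

Keeping the list of failed addresses, not just the rate, makes it possible to see which kinds of addresses a given data source handles badly.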

Presenter Notes

"Cities" in Columbus County

static/nc-columbus-county-cities.png

Presenter Notes

Challenging Characteristics of Columbus County

  • Multiple "cities"
    • Supported by OpenBlock, but not "default"
    • Different urlpatterns for single- vs. multi-city setups
    • Multi-city urlpatterns include "city slug"
  • Unincorporated areas
    • Lots of space not in any town/city
    • These places need names to be navigable
    • Can use census "county subdivision" names
    • ...but these are not meaningful to residents
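
To illustrate the single- vs. multi-city difference, here is a sketch using plain regular expressions instead of Django urlpatterns. The patterns, the `streets/` prefix, and the slugs are hypothetical and only show where a city slug would sit in the path; they are not OpenBlock's actual URL configuration.

```python
import re

# Hypothetical patterns, for illustration only.
SINGLE_CITY = re.compile(r"^streets/(?P<street_slug>[-\w]+)/$")
MULTI_CITY = re.compile(
    r"^streets/(?P<city_slug>[-\w]+)/(?P<street_slug>[-\w]+)/$"
)


def resolve(path, multi_city):
    """Return the captured URL parameters, or None if the path does not match."""
    pattern = MULTI_CITY if multi_city else SINGLE_CITY
    match = pattern.match(path)
    return match.groupdict() if match else None


single = resolve("streets/main-st/", multi_city=False)
multi = resolve("streets/whiteville/main-st/", multi_city=True)
# A single-city path does not resolve under the multi-city pattern.
mismatch = resolve("streets/main-st/", multi_city=True)
```

Because the two layouts are not interchangeable, every view and template link that builds a URL has to know which mode the site runs in.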

Presenter Notes

1st Approach: Census Files for OpenBlock Data

  • Advantages
    • Code already exists in OpenBlock to use these files
    • Generalizable to other NC counties
  • Disadvantages
    • Incomplete/incorrect data
    • 70% success rate geocoding ~38,000 Columbus County addresses

Presenter Notes

Missing Addresses

static/bad-data-missing-addresses.png

Presenter Notes

Changing Names

static/bad-data-primary-names.png

Presenter Notes

2nd Approach: County GIS Department Data

  • Advantages
    • More complete/accurate
    • Geocoding success rate for the ~38,000 addresses improved to 93%
  • Disadvantages
    • Custom code to load this data (custom BlockImporter)
    • Not generalizable to other counties
    • This data not available for all counties

Presenter Notes

Custom Data Availability in NC

static/Street_Centerline_Download_County.jpg

Presenter Notes

Geocoding is Still Difficult

  • Geocoding is a hard problem to solve
  • String parsing
    • number
    • predir
    • street name
    • street type
    • postdir
  • Streets can have multiple names (misspellings table can help)
  • 3rd-party geocoder fallback?
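
A minimal sketch of the string-parsing step, splitting an address into the five parts listed above and applying a misspellings table to the street name. The direction list, street-type list, and misspellings entries are small assumed samples, and the token-based approach is a simplification for illustration, not OpenBlock's parser.

```python
import re

DIRECTIONS = {"N", "S", "E", "W", "NE", "NW", "SE", "SW"}
STREET_TYPES = {"ST", "AVE", "RD", "DR", "LN", "BLVD", "HWY", "CT"}
# Misspellings table: maps known bad street names to the canonical name.
MISSPELLINGS = {"MIAN": "MAIN"}


def parse_address(text):
    """Split an address string into number, predir, street, type, postdir."""
    tokens = re.sub(r"[.,]", "", text).upper().split()
    parsed = {"number": None, "predir": None, "street": None,
              "type": None, "postdir": None}
    if tokens and tokens[0].isdigit():
        parsed["number"] = int(tokens.pop(0))
    # Peel off optional parts from the outside in, always leaving at least
    # one token to serve as the street name.
    if len(tokens) > 1 and tokens[-1] in DIRECTIONS:
        parsed["postdir"] = tokens.pop()
    if len(tokens) > 1 and tokens[-1] in STREET_TYPES:
        parsed["type"] = tokens.pop()
    if len(tokens) > 1 and tokens[0] in DIRECTIONS:
        parsed["predir"] = tokens.pop(0)
    name = " ".join(tokens)
    parsed["street"] = MISSPELLINGS.get(name, name) or None
    return parsed


full = parse_address("123 N. Main St. SW")
typo = parse_address("45 Mian Rd")
ambiguous = parse_address("100 N St")
```

The last example shows why this is hard: in "100 N St" the token "N" is the street name, not a direction, and only the surrounding tokens reveal that.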

Presenter Notes

Scrapers: What are they?

  • Scripts that extract information from online data sources
  • The process is conceptually simple:
    • Download some data from the web
    • Create one or more NewsItems whose fields are populated with that data
    • Save the NewsItem(s) to the database
  • The grunt work is in extracting the data you need
  • Scrapers sometimes require more than a single data source
    • CSV/Excel/Navy DIF
    • Shapefile
    • Download multiple files and stitch them together locally
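
The three-step process can be sketched with the standard library alone. Here an in-memory CSV stands in for the downloaded data and a plain list stands in for the database; the function names, field names, and sample rows are invented for illustration and do not reflect OpenBlock's scraper base classes.

```python
import csv
import io
from datetime import datetime

# Stand-in for data downloaded from the web. A real scraper would fetch
# this over HTTP; these rows are invented for illustration.
RAW_CSV = """name,address,date,score
Example Diner,100 Main St,2012-09-01,96.5
Sample Cafe,200 Main St,2012-09-03,91.0
"""


def fetch():
    """Step 1: download some data (here, just wrap the sample text)."""
    return io.StringIO(RAW_CSV)


def parse(fileobj):
    """Step 2: turn each row into a dict of NewsItem-like fields."""
    for row in csv.DictReader(fileobj):
        yield {
            "title": "Inspection: %s" % row["name"],
            "item_date": datetime.strptime(row["date"], "%Y-%m-%d").date(),
            "location_name": row["address"],
            "attributes": {"score": float(row["score"])},
        }


def save(items, database):
    """Step 3: persist the items; a list plays the role of the database."""
    count = 0
    for item in items:
        database.append(item)
        count += 1
    return count


database = []
added = save(parse(fetch()), database)
```

Almost all of the scraper-specific effort lands in `parse`; `fetch` and `save` tend to look the same from one scraper to the next.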

Presenter Notes

Scrapers for The News Reporter

  • Corporation Filings: scraped from the NC Secretary of State website
  • Restaurant Inspections: scraped from large Crystal Report exports from the NC Department of Health and Human Services
  • Property Transactions: scraped from the Columbus County Tax and GIS offices
  • Geocoded News Articles: scraped from whiteville.com
  • Notably missing: police incident reports

Presenter Notes

  • Working with newspaper and government staff to scrape and collect online data
  • Local staff has been very helpful

The News Reporter: Public Records

static/whiteville-com-openrural.png

Presenter Notes

  • Plan to launch production environment by Nov. 1, 2012

Property Transactions Scraper

static/scrapers-property.png

Presenter Notes

OpenRural Stack

  • Automated Fabric server provisioning and deployment. Testable with Vagrant.
  • Using Celery and RabbitMQ for asynchronous tasks (scrapers and maintenance tasks)
  • Modified fork of OpenBlock that includes staticfiles changes.
  • Production runs nginx and gunicorn on a small Amazon EC2 instance.
  • Most issues fixed on OpenBlock core are pushed back to the official repository
  • Everything is completely open source
  • https://github.com/openrural

Presenter Notes

  • Atypical OpenBlock setup
  • Local development instructions are included

Extensions

Presenter Notes

  • So we've highlighted our experience with OpenBlock and how we've used it for OpenRural
  • Now we'll cover how we've extended and added features to OpenBlock
  • OpenBlock handles scraping and public viewing, but is missing review and analysis

The Missing Piece: Data Review and Analysis

  • How successful was the geocoder?
  • How many news items were added?
  • Why is my scraper failing to run?
  • Why did this address fail to geocode? How can I correct it?

Presenter Notes

  • We found ourselves asking...

Data Dashboard

static/datadashboard-list.png

Presenter Notes

  • We created what we call the Data Dashboard
  • Simple extension to the OpenBlock scraper architecture
  • Provides statistics related to each run

Data Dashboard

static/datadashboard-runs.png

Presenter Notes

  • Keeps track of each run for every scraper, including execution time and status
  • Since this scraper runs multiple times a day, it doesn't always ingest new data
  • Filtered here to only show the runs that updated data
  • 2 min run was a full import after resetting the NewsItems
  • 5 sec run was for when it found new news items a few days later

Data Dashboard

static/datadashboard-stats.png

Presenter Notes

  • High level statistics for each run
  • Includes geocoding exceptions
  • Support for custom counters
  • Optional field to record comments
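
A sketch of the kind of per-run record these statistics imply: status, duration, named counters, geocoding failures, and a free-form comment. The class and method names are hypothetical and this is not the Data Dashboard's actual implementation.

```python
import time
from collections import Counter


class RunStats:
    """Collects statistics for a single scraper run (illustrative only)."""

    def __init__(self, scraper_name):
        self.scraper_name = scraper_name
        self.counters = Counter()   # built-in and custom counters alike
        self.failures = []          # (location string, exception name)
        self.comment = ""
        self.status = "running"
        self.duration = None
        self._started = time.time()

    def increment(self, name, amount=1):
        """Bump a named counter; unknown names start from zero."""
        self.counters[name] += amount

    def record_geocode_failure(self, location, exc):
        """Remember which location failed and what kind of error it raised."""
        self.failures.append((location, type(exc).__name__))
        self.increment("geocode_failures")

    def finish(self, status="success", comment=""):
        """Close out the run and record how long it took."""
        self.status = status
        self.comment = comment
        self.duration = time.time() - self._started


run = RunStats("corporations")
run.increment("added", 3)
run.increment("skipped_duplicates")  # a custom counter
run.record_geocode_failure("999 Nowhere Rd", ValueError("no match"))
run.finish(comment="Full import after resetting NewsItems")
```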

Data Dashboard

static/datadashboard-failures.png

Presenter Notes

  • Detailed list of failures
  • Date of failure, location or string that failed to geocode
  • Geocoding exception, and a link to the admin to fix the error

Data Dashboard

from openrural.data_dashboard.scrapers import DashboardMixin
from openrural.retrieval.base.scraperwiki import ScraperWikiScraper

class CorporationsScraper(DashboardMixin, ScraperWikiScraper):

    # scraper settings
    logname = 'corporations'
    schema_slugs = ('corporations',)

Presenter Notes

  • Simple mixin class to use Data Dashboard
  • Handles all stats and metrics by default, but you can add more
  • Nice addition to the OpenBlock suite of tools

What's Next?

Presenter Notes

Columbus County

static/nc-columbus-county.png

Presenter Notes

  • Currently in Columbus County
  • Grant stipulates scaling up to multiple counties

Many Counties

static/nc-14-counties.png

Presenter Notes

  • We're hoping to expand into a dozen or more counties in NC
  • Grant also stipulates that we develop a profitable solution
  • So we have to weigh options moving forward

Considerations

  • Improving the geocoder is tough and, therefore, expensive
    • Possibly fall back to a 3rd-party geocoder
  • Web UI code is hard to use and extend
    • JavaScript libraries for interacting with slippy maps have come a long way
    • Rewrite would make our lives easier in the future
  • Sustainability as we scale
    • Would it be more efficient to build a single system to power all counties?
    • In our case, each OpenBlock install will be very similar

Presenter Notes

OpenBlock Community

  • OpenBlock has largely been developed through grant funding
  • Paul Winkler of OpenPlans has been very helpful and active in the community
  • However, Knight funding has ended and OpenPlans is no longer actively working on the project
  • Future of the community is unknown
  • OpenBlock needs an organic online community to survive
  • If you're interested in OpenBlock, come speak to us!

Presenter Notes

Questions?

Presenter Notes