So Long, THOMAS - First Branch Forecast

The Library of Congress announced that the legislative information website THOMAS is scheduled to stop functioning on July 5, with Congress.gov replacing it. This will allow the Library to focus all its energy on Congress.gov instead of also having to maintain a very awkward, 21-year-old website.

I’m sure that many news reports will give credit to Newt Gingrich for THOMAS. It is true that he was largely responsible for the political lifting required to build the site, which is a big deal. It was not his brainchild, however, and a fair amount of technical work was previously performed under the Democrats, who lost power in 1994/5. There were prior efforts to make legislative information available to the general public, including a wrongheaded effort by GPO to sell the data and the clever use of Gopher to release data, which I dare not try to describe.

The semi-official story behind the naming of THOMAS is that it was named after Thomas Jefferson. There’s an unconfirmed story that it was also a nod to Bill Thomas, who in 1995 ended up chairing the Committee on House Administration, which has oversight over the legislative branch, including the Library of Congress, which built THOMAS. Later on, some people created a backronym for the site: “The House [of Representatives] Open Multimedia Access System.” I’ve asked the people who were present at the creation: they just smile. Here’s Newt Gingrich recounting its origin story.

While THOMAS was fairly cutting edge when it launched, it quickly became outdated. The Library of Congress failed to invest in the infrastructure to modernize the site, and it refused to release the data behind it. This led to efforts by civic hackers to scrape the website, that is, to pull down the data from all of its pages. This is akin to assembling a carton of eggs from their broken shells. Scraping was also, controversially, prohibited by the website, whose robots.txt file says that people can read the website but are not supposed to use automated machines to do so.
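For readers who have never looked at one, here is a minimal sketch of how a well-behaved scraper might consult a robots.txt file before fetching a page. It uses Python's standard `urllib.robotparser`. The rules, the `ExampleScraper` user agent, and the `example.gov` URL are all hypothetical illustrations, not the actual contents of the THOMAS file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt bodies, for illustration only.
# The first asks every automated client to stay away from the whole site.
BLOCK_ALL = "User-agent: *\nDisallow: /"
# The second places no restriction on automated clients.
ALLOW_ALL = "User-agent: *\nDisallow:"


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.gov/bills/hr1"  # hypothetical URL
    print(is_allowed(BLOCK_ALL, "ExampleScraper", page))
    print(is_allowed(ALLOW_ALL, "ExampleScraper", page))
```

Note that robots.txt is a request, not an enforcement mechanism: nothing stops a scraper from skipping this check, which is part of why the prohibition was controversial.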

WashingtonWatch.com may have been the first public-facing effort to scrape THOMAS and republish it as a website, although there were paid services, which are not a service to the general public, as well as an earlier effort to scrape a GPO website by the indomitable Carl Malamud (which led to this hilarious-in-retrospect letter to desist in grabbing legislative data). The most successful republication effort was undertaken by Josh Tauberer, who built GovTrack.us and made the data he scraped available to everyone for free.

At the same time, there was a Congress-only version of THOMAS, known as the Legislative Information System (LIS), which was better maintained than THOMAS and had far more useful information. Oftentimes congressional staff didn’t realize that the public only had access to a much less useful version.

A coalition of organizations called for releasing THOMAS data in 2008, and some of my time at the Sunlight Foundation was spent building a coalition and lobbying for public access to the underlying data. When I left Sunlight, I formed the Congressional Data Coalition, which continues to advocate for public access to legislative data. Earlier this year, Congress began publishing the text of bills, their summaries, and their status online in bulk, so it is now possible for anyone to easily gather virtually all of the data on THOMAS. In a parallel effort, many civic hackers built a GitHub project known as the United States Project, which contains the scrapers for THOMAS and for other legislative information as well.

Pretty much all of the information you see in the newspapers or on websites about Congress comes from these efforts. For example, nearly a million unique users each month use GovTrack. Programmers at major news outlets participate in the United States Project, and some of the fee-for-service websites reuse its code.

The new congressional website, Congress.gov, was inspired by the work of outside organizations and civic hackers in its design and capabilities. It is worth remembering that Congress.gov and its predecessor THOMAS run off of downstream data: the source information is heavily processed by the Clerk, the Secretary of the Senate, GPO, committees, and others before it gets onto Congress.gov, and a lot of information about congressional activities is missing. Some of this information exists on the Congress-only side of Congress.gov, which is the successor to LIS, although much of it is either available elsewhere or not available at all. We outlined some of the data we think Congress should focus on next here.

There’s also been a big change from 1995 to now. It took a long time, but civil society and Congress eventually developed a working relationship. The House of Representatives created the Bulk Data Task Force in 2012 to work on the issue of public access to legislative data. The House also included in its rules a directive to make legislative information available to the public in bulk and in structured data formats, hosted two hackathons, and five years ago began holding annual Legislative Data and Transparency Conferences. The Senate is coming along, but we are still building relationships there.

There are too many specific accomplishments of this collaboration to list, but highlights include bulk access to bill information, the release of the U.S. Code in XML, the creation of a congressional documents and hearings repository (docs.house.gov), the publication of information about House floor activities (including the text of bills before they are voted on) at rules.house.gov, and most recently the publication of the House Rules and Jefferson’s Manual as XML. (I am sure I am forgetting several other major changes.)

Just yesterday, the Bulk Data Task Force held a public meeting. It was open to everyone and was webcast so that people could engage in the conversation in real time. Several major announcements were made, which I will write about later. But the real story continues to be the productive working relationship between civil society and congressional stakeholders. Not everyone in Congress is on board, of course, but we have reached a critical mass where real progress is possible. Most notably, Democratic and Republican leadership continue to work with one another and with us to bring Congress to the public. This bipartisan story is often untold but is the heart of what makes all this possible.

I am not sorry to see THOMAS go. It served its purpose, but it is time to focus on something new, something better. I am excited to see where this path leads.

An additional thought: THOMAS should have its website archived for future generations, and should turn off its robots.txt file so it can be captured by the Wayback Machine.

— Written by Daniel Schuman
