Large Text Data Sets¶
Names¶
Names for the whole U.S. are in the names folder. Names broken down on a state-by-state basis are in the names_by_state folder.
The original source is the Social Security Administration.
FEC¶
Contributions to political campaigns are in political/individual_contributions.txt (field definitions).
Automotive¶
Complaints, defect investigations, and recalls for different cars. The source is the National Highway Traffic Safety Administration.
A list of fields and descriptions for the ‘complaints’ table.
Medicare¶
Medicare claims from the Centers for Medicare & Medicaid Services.
Weather¶
All of this data is in the weather folder.
The original source for the weather data is NOAA (Site, FTP).
File info:
- weather/ghcnd-stations.txt – List of stations. Use egrep to find the station you want.
- weather/*.dly – Weather files. Format description is here
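As a minimal sketch of looking up a station, the snippet below searches a small stand-in for the station list. The three rows are invented for illustration, and the column positions (station ID in characters 1-11, state in characters 39-40) are an assumption based on the GHCN-Daily station file layout, so check them against the format description before relying on them. `grep -E` is the current spelling of `egrep`.

```shell
# Build a tiny stand-in for weather/ghcnd-stations.txt.
# These rows are invented for illustration; they are not real stations.
tmpdir=$(mktemp -d)
cat > "$tmpdir/ghcnd-stations.txt" <<'EOF'
USC00000001  41.0000  -93.0000  300.0 IA EXAMPLE TOWN
USC00000002  42.0000  -94.0000  310.0 IA SAMPLE CITY 2W
USC00000003  35.0000  -80.0000  200.0 NC EXAMPLE FALLS
EOF

# Case-insensitive search by station name; each matching line starts with the station ID.
matches=$(grep -E -i 'example' "$tmpdir/ghcnd-stations.txt")
echo "$matches"

# Narrow the matches to one state using the fixed-width state column (assumed: characters 39-40)
# and print only the 11-character station ID.
ids=$(echo "$matches" | awk 'substr($0, 39, 2) == "IA" { print substr($0, 1, 11) }')
echo "$ids"
```

The station ID printed at the end is also the base name of that station's .dly file.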
Example processing weather data:
egrep TMAX USC00134063.dly | sed 's/^.\{11\}//' | sed 's/\([0-9]\) .*/\1/' | sed 's/TMAX//' | sed 's/^\([0-9]\{4\}\)/\1 /' | sed 's/^\(.\{11\}\)\([0-9]\)/\1.\2/' > ../craven.csv
The same pipeline, one step at a time:
egrep TMAX USC00134063.dly
# Search the text file for ‘TMAX’, which is the maximum temperature
sed 's/^.\{11\}//'
# Use sed and regular expressions. From the beginning of the line, remove 11 characters (the station ID)
sed 's/\([0-9]\) .*/\1/'
# Use sed to remove the rest of the line after the first temperature reading (the first digit that is followed by a space)
sed 's/TMAX//'
# Remove the TMAX from the line
sed 's/^\([0-9]\{4\}\)/\1 /'
# Add a space after the year so it doesn’t run into the month
sed 's/^\(.\{11\}\)\([0-9]\)/\1.\2/'
# Add a decimal point before the last digit, because 345 is actually 34.5
> ../craven.csv
# Redirect to a file. Do it one directory up because there are way too many files here
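The steps above can also be sketched with awk, which reads the fixed-width columns directly and keeps every day of the month rather than a single reading. This is an alternative sketch, not the pipeline shown above. It assumes the GHCN-Daily layout (ID in characters 1-11, year in 12-15, month in 16-17, element in 18-21, then 8-character day slots made of a 5-character value and three flag characters) and that -9999 marks a missing value. The sample record and the file name sample.dly are invented for illustration.

```shell
# One invented TMAX record with three day slots (the third is a missing value).
tmpdir=$(mktemp -d)
printf '%s%s%s%s%s%s\n' 'USC00000001' '189301' 'TMAX' '  345  6' '  -11  6' '-9999   ' \
    > "$tmpdir/sample.dly"

# For every TMAX line, walk the day slots and print date,value with the value in whole degrees.
result=$(awk 'substr($0, 18, 4) == "TMAX" {
    year  = substr($0, 12, 4)
    month = substr($0, 16, 2)
    for (day = 1; day <= 31; day++) {
        raw = substr($0, 22 + (day - 1) * 8, 5)         # 5-character value field for this day
        if (raw == "" || raw + 0 == -9999) continue      # skip short lines and missing values
        printf "%s-%s-%02d,%.1f\n", year, month, day, raw / 10   # values are stored in tenths
    }
}' "$tmpdir/sample.dly")
echo "$result"
```

Replacing "$tmpdir/sample.dly" with a real station file and redirecting the output, as in the pipeline above, would give one date,value row per recorded day.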