Downloading California pesticide usage data
Why would anyone want to download pesticide data???
If that went through your head, this article probably isn’t super relevant, but I’ve got some cool python code to help automate the process of getting usable data!
California has an incredibly stringent Pesticide management and regulation agency known as California Department of Pesticide Regulation (CDPR). This department requires users of Pesticides to report their usage to ensure that chemicals are being used according to label as well as for environmental protection and research.
Thankfully CDPR produces a daily public instance of the current year’s data along with historic archives. This data lives in a public FTP server although it’s not the easiest to interpret! The documentation is sparse and the format is strange for a database.
After some digging, I found that the .dat files are essentially fixed width rows and the data can be extracted by selecting each line between a certain number of characters.
Okay so the data is in there, now we just need to know what the columns are and what their widths are. This was what got me stuck for a really long time! I had no clue where to even start. I scoured the CDPR website for any hint and there are broken links to documentation. Thanks to the incredible power of the Way Back Machine, I was able to find the most recent public documentation on the database!
So now to just whip up some code to pull it all together.
First, I scripted out some code to download all the .dat files from the FTP site so we can go ahead and process them. This code is a bit longer than absolutely required because it checks to make sure that the file on the FTP site is newer than the file on the computer so it doesn’t over-download although that’s entirely unnecessary.
The next step was to actually parse the .dat files. To do this, I relied on pandas’ function called read_fwf or read fixed width file. Now that I look at the documentation a bit closer, I realize that the function can actually read directly from the FTP site!
So essentially you pass the column widths and the column names to the function and it produces a dataframe in which the script then saves to disk in the form of a CSV!
For those of you who are here and want to actually take a look at the code, I put everything into this Colab document!