Friday, February 5, 2010

datapyning (tool release)

okok, i'm always writing stuff and never getting it released, so this time i've kludged up a tool and dropped it on google code:

http://code.google.com/p/datapyning

just a little python script that runs search queries (against google atm, others in the next rev) and pulls down all the returned results. the idea is to let you collect files/data en masse and store it away for further analysis later...
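
the basic flow is just 'run the query, then stash whatever comes back under a directory named for the search date'. roughly like this (a simplified sketch in python 3, not the actual script; run_search here is just a stub standing in for the google query/scrape):

import os
import time
import urllib.request

def run_search(phrase, filetype=None, timeframe="d", max_results=100):
    # placeholder: the real tool builds the google query from the cli flags and
    # scrapes result urls out of the responses; stubbed here so the sketch runs
    return []

def archive(phrase, base_dir="results"):
    # everything from one run lands in a sub-directory named for the search date
    day_dir = os.path.join(base_dir, time.strftime("%Y-%m-%d"))
    os.makedirs(day_dir, exist_ok=True)
    for n, url in enumerate(run_search(phrase)):
        local = os.path.join(day_dir, os.path.basename(url) or "result_%d" % n)
        urllib.request.urlretrieve(url, local)  # pull each hit down for later analysis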

[purpose]

so is there any security relevance here? well, i built the tool to archive data for a security research project i've been kicking around. i see it being useful for a variety of research and information-discovery tasks, but i dunno if anyone else will.

ultimately, the idea came from me trying to find some info i'd seen before and coming to the conclusion that the data had poofed into the aether. if you aren't archiving information you care about, is anyone else???

this tool might help you archive some of that data for your purposes...


[examples]

~ grab up to 20 PDFs posted in the last week w/ the search phrase 'free', verbosely

[user@box datapyning]$ ./datapyning.py -S ./null.list -n 20 -f pdf -t w -s free -v

~ grab up to 100 .xls files from the last year on .com, .org, and .net domains w/ the search phrase 'profit', quietly, into a dir called foo

[user@box datapyning]$ ./datapyning.py -f xls -t y -S ./small.list -s profit -q -d foo

~ grab up to 100 results from the last 24 hrs for each tld w/ the search phrase 'default.password'

[user@box datapyning]$ ./datapyning.py -s "default.password"
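
if it helps, here's roughly how those flags map out. this is a simplified sketch of the option parsing (written with argparse here, defaults and help text inferred from the examples above, not necessarily exactly what the script ships with):

import argparse

def build_parser():
    # option names come from the examples above; defaults are illustrative guesses
    p = argparse.ArgumentParser(description="query a search engine and archive the results")
    p.add_argument("-s", dest="phrase", required=True, help="search phrase")
    p.add_argument("-S", dest="site_list", help="file of domains/tlds to search, one per line")
    p.add_argument("-n", dest="max_results", type=int, default=100,
                   help="max results per search (capped at 100)")
    p.add_argument("-f", dest="filetype", help="restrict results to a file type, e.g. pdf or xls")
    p.add_argument("-t", dest="timeframe", default="d", choices=["d", "w", "y"],
                   help="time range: d(ay), w(eek), or y(ear)")
    p.add_argument("-d", dest="out_dir", default=".", help="directory to store results in")
    p.add_argument("-v", dest="verbose", action="store_true", help="verbose output")
    p.add_argument("-q", dest="quiet", action="store_true", help="quiet mode")
    return p

if __name__ == "__main__":
    # e.g. the bare 'default.password' run from the last example
    print(build_parser().parse_args(["-s", "default.password"]))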


[limitations]

* searching for -s 'foo bar' makes google barf, but -s 'foo.bar' works... wtf, mah bad, def on the list to get fixed :(
* other 'advanced' search features (intext:, etc) aren't accessible via the cli and mostly not through the search phrase
* currently the tool kinda expects search frequencies >= 1 per day (the result dir contains sub-dirs named by search date)
* search domains/sites aren't handled on the cli (files w/ crlf delimiters only)
* max of 100 records per search
* no status bar for larger downloads (it will time out, make a note, and move on if a d/l fails)
* no rate limiting, sooo it will use whatever bandwidth it can
* not sure if the way download file names are genericized and logged makes sense (a rough sketch of what i mean follows this list)
* tied to google (but potential for either modularized search providers or maybe going search-agnostic)
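
to make the download bullets above a little more concrete, the per-file handling boils down to something like this (again a rough python 3 sketch, not the actual internals; the naming scheme and log format here are just illustrative):

import os
import urllib.request

def fetch(url, index, day_dir, log, timeout=30):
    # genericized local name: a sequence number plus the url's extension (if any)
    ext = os.path.splitext(url.split("?")[0])[1] or ".dat"
    local = os.path.join(day_dir, "%04d%s" % (index, ext))
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp, open(local, "wb") as f:
            f.write(resp.read())
        log.write("%s\t%s\n" % (local, url))        # remember which url a file came from
    except Exception as err:
        log.write("FAILED\t%s\t%s\n" % (url, err))  # note the failure and move on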
