Friday, March 11, 2011

More NSRL Talk

I've spent a bit more time studying the NSRL problem.  It is a fantastic but unwieldy database of hash values that is underutilized (at least in my experience).  I spoke in my last post about importing the NSRL database into SQLite so the data can be better managed.  I added at the end of the post that I had created a pair of shell scripts, 'getnsrl' and 'mknsrl', to automate SQLite creation.

I am happy to report that I have extended mknsrl to fully handle the NSRL database.  My first version only imported the file table, which contains the hash values.  I thought, incorrectly, that this would be sufficient for generating hash sets.  I had not fully understood the NSRL Reference Data Set (RDS) construction, which actually consists of five CSV files:

  • NSRLFile.txt
  • NSRLMfg.txt
  • NSRLOS.txt
  • NSRLProd.txt
  • version.txt

The NSRLFile.txt is broken into four parts that need to be concatenated together.  Currently, this database contains over 62 million lines, far exceeding the capabilities of the spreadsheet programs for which it is formatted.  The database consists of 'known' files, not to be confused with 'known good' files.  The database contains hash values for well-known hacker programs, for example, and eliminating them through hashing may be problematic to the investigation.  There needed to be a way to export specific hash sets from the RDS for the data to be truly useful.

getnsrl

I wrote getnsrl to make processing the four ISO files containing the NSRL RDS easy.  The files must be mounted, and the database files extracted from a zip file.  getnsrl does this automatically, appending to the names of the NSRLFile.txt CSV files to distinguish them and prevent overwriting.  On my system, this takes about four minutes.  The user need only put the ISO files in a common directory (not required, but easier on the fingers) and issue 'getnsrl RDS*'.  The tool provides feedback and start/stop times.
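The extract-and-rename step can be sketched in Python for readers who want to see the idea without the shell script.  This is not getnsrl itself: it skips the mounting step entirely, the function name extract_disc and the disc_label suffix are hypothetical, and it assumes the supporting files are identical on every disc so that overwriting them is harmless.

```python
import pathlib
import shutil
import zipfile


def extract_disc(zip_path, dest_dir, disc_label):
    """Extract one disc's zip file into dest_dir.

    NSRLFile.txt gets disc_label appended to its name so the four
    parts do not overwrite each other.  Other files keep their names
    (assumption: they are the same on every disc).
    """
    dest = pathlib.Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # Keep only the base name; ignore any folders inside the zip.
            name = pathlib.PurePosixPath(info.filename).name
            if name == "NSRLFile.txt":
                name = "NSRLFile.txt." + disc_label
            target = dest / name
            with archive.open(info) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(target.name)
    return written
```

Calling extract_disc once per disc, with a different label each time, leaves four distinctly named NSRLFile.txt parts side by side in one directory.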

mknsrl


I wrote mknsrl to allow for easy importation of the CSV files to a standalone SQLite database.  It creates virtual tables from the CSV files and then imports them into regular tables.  The NSRLFile.txt files are not (and should not be) concatenated before processing as this is done by mknsrl while inserting the data into the database.  Though importing the files to a standard table is not required to use the CSV files with SQLite, I favored this approach because the database can be used independently of the CSV files.  I just found it simpler to manage, and it has the added benefit of indexing if the user so chooses, since indexing virtual tables is not possible with SQLite.  mknsrl reports its progress, start/stop times, and some database statistics.  It no longer does any indexing (as in the first version).  I leave that to the user to decide what to index if indexing is desired.  On my system, it took about 12 minutes to import the NSRL RDS data.  The statistical analysis can be commented out to save a few minutes, if desired.
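The idea of concatenating the NSRLFile.txt parts during the insert, rather than beforehand, can be sketched in Python.  This is not how mknsrl works internally (it uses virtual tables from the shell); it is a rough equivalent using the standard csv and sqlite3 modules.  The column names below are my assumption for illustration, and the sketch assumes each part begins with a header row.

```python
import csv
import sqlite3

# Assumed column layout for the 'file' table; names are illustrative.
FILE_COLUMNS = ("sha1", "md5", "crc32", "filename", "filesize",
                "productcode", "opsystemcode", "specialcode")


def import_file_parts(db, part_paths):
    """Load several NSRLFile.txt parts into one 'file' table.

    Returns the total number of rows in the table afterwards.
    """
    db.execute("CREATE TABLE IF NOT EXISTS file (%s)" % ", ".join(FILE_COLUMNS))
    placeholders = ", ".join("?" for _ in FILE_COLUMNS)
    insert = "INSERT INTO file VALUES (%s)" % placeholders
    for path in part_paths:
        # latin-1 accepts any byte, so odd file names do not abort the import.
        with open(path, newline="", encoding="latin-1") as handle:
            reader = csv.reader(handle)
            next(reader, None)  # skip the header row (assumed present)
            db.executemany(insert, reader)
    db.commit()
    return db.execute("SELECT count(*) FROM file").fetchone()[0]
```

Because the result is a regular table, an index can be added afterwards if desired, for example with "CREATE INDEX file_sha1 ON file (sha1)".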

Why use mknsrl?

The chief advantage of mknsrl is database control.  Hash sets can be produced for any product in the database, which is not easy to do with the traditional methods for handling the RDS data.  NSRL has recommended that the four NSRLFile.txt files be concatenated and then reduced to unique hash values, leaving about 18 million records.  The problem is, you can't reliably know the source of the file this way, and you may hash out a program file of interest.
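To see why reducing to unique hashes loses the source, it helps to list hashes that appear under more than one product.  The following Python sketch does that against the imported database; the column names sha1 and productcode are my assumption, and the function name is hypothetical.

```python
import sqlite3


def products_per_hash(db, limit=10):
    """Return (sha1, product_count) for hashes tied to several products."""
    return db.execute(
        "SELECT sha1, count(DISTINCT productcode) "
        "FROM file "
        "GROUP BY sha1 "
        "HAVING count(DISTINCT productcode) > 1 "
        "ORDER BY count(DISTINCT productcode) DESC, sha1 "
        "LIMIT ?",
        (limit,),
    ).fetchall()
```

Every hash this returns would collapse to a single record in a deduplicated list, with no way to tell which product it came from.  Without an index on the hash column, the grouping may be slow on the full table.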

mknsrl allows any hash set or group of hash sets to be extracted from the database with a sqlite query in the following form:

$ sqlite3 -csv -header nsrl.db "select * from file where [expression];"

I used the following expression to create a hash set of all "Windows" operating systems in about 5 minutes (note the nested SELECT statement that makes this possible):

$ sqlite3 -csv -header nsrl.db "select * from file where opsystemcode in (select opsystemcode from os where opsystemname like '%Windows%');"
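The same nested SELECT can be run from Python if the hash set needs further processing.  This is a minimal sketch, assuming the file and os tables and the opsystemcode and opsystemname columns used in the query above; the function name export_os_hash_set is hypothetical.  The search pattern is passed as a bound parameter, so it can be changed without editing the SQL.

```python
import csv
import io
import sqlite3


def export_os_hash_set(db, name_pattern, out):
    """Write all 'file' rows whose OS name matches name_pattern as CSV.

    Returns the number of data rows written (header not counted).
    """
    cursor = db.execute(
        "SELECT * FROM file WHERE opsystemcode IN "
        "(SELECT opsystemcode FROM os WHERE opsystemname LIKE ?)",
        (name_pattern,),
    )
    writer = csv.writer(out)
    # Header row, like the sqlite3 -header option.
    writer.writerow([column[0] for column in cursor.description])
    rows = 0
    for row in cursor:
        writer.writerow(row)
        rows += 1
    return rows
```

For example, export_os_hash_set(sqlite3.connect("nsrl.db"), "%Windows%", open("windows.csv", "w", newline="")) would write the Windows hash set to a file; the output file name here is only an illustration.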

Your ability to create hash sets is limited only by your ability to craft SQLite queries.

If you would like to look at or use 'getnsrl' or 'mknsrl', you can find them at https://sites.google.com/site/slosleuth/.
