Hash Selection

SELECTION OF HASHING ALGORITHMS

Tim Boland
Gary Fisher
JUNE 30, 2000

INTRODUCTION

The National Software Reference Library (NSRL) Reference Data Set (RDS) is built on file signature generation technology that is used primarily in cryptography. The selection of the specific file signature generation routines is based on customer requirements and the necessity to provide a level of confidence in the reference data that will allow it to be used in the U.S. Courts. This document gives an overview of the various hashing algorithms considered, as well as implementations of those algorithms. It also gives factors regarding their selection and use.

Hashing is an extremely good way to verify the integrity of a sequence of data bits (e.g., to make sure the contents of the sequence haven't been changed inadvertently). The sequence might make up a character string, a file, a directory, or a message representing data (binary 1s or 0s) stored in a computer system. The word "hash" means to "chop into small pieces" (REF1). A hashing algorithm is a mathematical function (or a series of functions) taking as input the aforementioned sequence of bits and generating as output a code (value) produced from the data bits and possibly including both code and data bits.

Two files with exactly the same bit patterns should hash to the same code using the same hashing algorithm. If a hash for a file stays the same, there is only an extremely small probability that the file has been changed. On the other hand, if the hashes for the files do not match, then the files are not the same. Thus, hashes could be used as a primary verification tool to find identical files. The output code of the hash function should have a "random" property, so that different sequences of bits hash to different values as much as possible.
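
As a rough illustration of the comparison described above, the following Python sketch hashes three byte sequences and compares the resulting codes. It assumes Python's standard zlib.crc32 as the hashing function purely for convenience; the helper name crc32_hex and the sample data are hypothetical and are not taken from the report.

```python
import zlib


def crc32_hex(data: bytes) -> str:
    """Return the CRC-32 of data as 8 hexadecimal digits (hypothetical helper)."""
    return format(zlib.crc32(data) & 0xFFFFFFFF, "08x")


# Sample data, assumed for illustration only.
original = b"The quick brown fox jumps over the lazy dog"
copy = bytes(original)                        # identical bit pattern
altered = original.replace(b"dog", b"cog")    # one byte changed

# Identical bit patterns hash to the same code.
print(crc32_hex(original) == crc32_hex(copy))

# A changed byte produces a different code here, which exposes the change.
print(crc32_hex(original) != crc32_hex(altered))
```

A matching hash only makes an undetected change very unlikely; it does not rule one out, which is why the choice of algorithm matters.
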
Hashes are used in "scatter storage" systems, in digital signature applications, and recently in computer forensics applications, to determine whether the contents of a suspect machine have been modified maliciously. Hashing algorithms can be efficiently implemented on modern computers.

Hashing algorithms fall within the realm of error detection techniques. In a general sense, the aim of an error detection technique is to enable the receiver of a message transmitted through a noisy (error-introducing) channel to determine whether the message has been corrupted. Some hashing algorithms perform complex transformations on the message to inject it with redundant information, while others leave the data intact and append a hash value on the end of a message. In any case, the transmitter may construct a hash value that is a function of the message. The receiver can then use the same hashing algorithm to compute the hash value of the received message and compare it with the transmitted hash value to see if the message was correctly received. If the hash values match, then the message was, with high probability, correctly received; if not, then there must have been an error in one or more of the transmitted bits.

The National Institute of Standards and Technology (NIST) of the U.S. Department of Commerce has been asked to investigate commonly used hashing algorithms in support of the National Software Reference Library (NSRL). There are several algorithms available, differing in complexity, robustness, ease of use, and machine efficiency. Generally, "hardware (physical device)" methods of computing hashes involve extensive bit manipulations, and are relatively inefficient for a software (programmatic) computation, so the algorithms discussed contain software methodologies to streamline the hashing process. What is involved in all the algorithms is a method to break up the input into manageable portions, and manipulate the input in a systematic way over and over (iteratively).
The algorithms generally differ in the degree to which they do this, and the number of iterations involved.

Given the above, NIST investigated available implementations of four different hashing algorithms and tested the algorithm output on some test data. The purpose of this exercise was to define some "reference" implementations for verifying the correctness of entries in the NSRL Reference Data Set (RDS). Multiple algorithms were considered because of the need for "double-checking" results and because many facilities use multiple hashing algorithms simultaneously. Performance metrics mentioned above were used in evaluating candidate implementations.

For each algorithm, the authorities (official sources and sanctions) of the source programs and tests used for testing accuracy will be described. Algorithms mentioned in this report may have limitations (clashes found, performance, etc.), which will be mentioned as appropriate. All implementations evaluated are freely available from the Internet. Each of the four algorithms will be described below, starting with a high-level overview and progressing to more detail as appropriate.

CRC32

The cyclic redundancy code (CRC) algorithm is the simplest of the four hashing algorithm choices, but also the least robust. The name means that the algorithm operates in repetitive (cyclic) redundant cycles to produce an output hash code. The "32" indicates the number of bits being considered to produce the hash code (explained below). The CRC algorithm is a key component in the error-detecting capabilities of many communications protocols. In a CRC algorithm, the transmitter of a message constructs a value (called the checksum) and appends it to the message. The receiver can then use the same function to compute the checksum of the received message and compare it with the appended checksum to see if the message was correctly received.
For example, if we chose a checksum function which was the sum of the decimal numbers in a message, it might go something as follows:

  Message:                     1 2 3
  Message with checksum:       1 2 3 6   (6 is the sum of 1, 2, and 3)
  Message after transmission:  1 2 4 6

Here the third decimal number was corrupted from 3 to 4, and the receiver can detect this by computing the checksum (7=1+2+4) from the message and comparing it with the transmitted checksum of 6. Obviously, both sender and receiver must be using the same algorithm to be consistent.

If the checksum itself is corrupted, a correctly transmitted message might be incorrectly identified as a corrupted one. However, this is a safe-side failure. A dangerous-side failure occurs where the message and/or checksum is corrupted in a manner that results in a transmission that is internally consistent. Unfortunately, this possibility is completely unavoidable and the best that can be done is to minimize its probability by increasing the amount of information in the checksum (REF2).

The above example is obviously very simple, and would not suffice for rigorous error detection. A more complex checksum function is needed. While addition is clearly not strong enough to form an effective checksum, it turns out that division is, so long as the divisor (number to divide by) is about as wide as the checksum register (place to store the checksum value).

The basic idea behind CRC algorithms is simply to treat the message as an enormous binary number, to divide it by another fixed binary number, get a quotient, and make the remainder from this division the checksum. Upon receipt of the message, the receiver can perform the same division and compare the remainder with the "checksum" (transmitted remainder).
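
The additive checksum example given earlier in this section can be sketched in Python as follows. The function names (sum_checksum, append_checksum, verify) are hypothetical and chosen only for illustration. The last case is an assumed example of a corruption that leaves the transmission internally consistent, so the additive check cannot see it.

```python
def sum_checksum(digits):
    """Checksum function from the example: the sum of the decimal numbers."""
    return sum(digits)


def append_checksum(digits):
    """Transmitter side: append the checksum to the message."""
    return digits + [sum_checksum(digits)]


def verify(received):
    """Receiver side: recompute the checksum and compare with the appended one."""
    *digits, checksum = received
    return sum_checksum(digits) == checksum


sent = append_checksum([1, 2, 3])    # message 1 2 3 with checksum 6
corrupted = [1, 2, 4, 6]             # third number corrupted from 3 to 4
bad_checksum = [1, 2, 3, 7]          # message intact, checksum corrupted
swapped = [2, 1, 3, 6]               # two numbers exchanged, sum unchanged

print(verify(sent))          # True: message accepted
print(verify(corrupted))     # False: 1 + 2 + 4 = 7, not 6
print(verify(bad_checksum))  # False: intact message rejected (the harmless failure)
print(verify(swapped))       # True: corruption goes undetected (the dangerous failure)
```

The swapped case shows why addition alone is too weak: many different corruptions leave the sum unchanged.
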
or e&ample# when diiding decimal  (message) by = (diisor) we get a alue of 9 (quotient) with a remainder of 8.5ith R diision# instead of iewing the numbers mentioned aboe as positie integers# they are iewed as polynomials with binary coefficients. This is done by treatingeach number as a bit1string whose bits are the coefficients of a polynomial. or e&ample# the ordinary number 98 (decimal) is * (binary) and so it corresponds to the  polynomial &CC= @ &CC9 @ &CC @ &CC*. 7olynomials are used because they proide useful mathematical machinery in the calculations. R arithmetic is primarily about 0Ring (e&clusie10Ring) particular alues at arious shifting offsets# which has the effect of doing the binary diision. /n e&clusie10R function produces  if the two input  bits are different3 otherwise it produces *. The R algorithm can be applied to messages of different widths (9# <# or 89 bits). 5e are considering the 891bit ( R 89) algorithm here because it is the most robust.$n this case the polynomial is 89 bits wide and the R 89 chec'sum is also 89 bits.This also simplifies the calculation on most modern computers. 0ther R polynomials used besides R 89 are R 9# R <# and R 1 $TT# from the onsultatie ommittee for Telephone and Telegraph ( $TT). 0n 7 s one can deal with binary numbers of only 89 bits or fewer# so one must brea' up the enormous binary number mentioned aboe into manageable chun's. Thats e&actly what the two R algorithms mentioned below do. $n order to speed up the process# the algorithms use a pre1calculated loo'1up table3 the table contains a R for each character code between * and 9EE# so that the calculation doesnt need to be repeated as the te&t strings are processed. This process has the effect of performing the diision of the enormous binary number by the generator polynomial# but in increments# due to the limitations of modern computing. 
Instead of computing the CRC bit by bit, a 256-element lookup table can be used to perform the equivalent of 8 bit operations at a time.

To perform a CRC calculation, the user needs to choose a divisor. Generally the divisor is called the "generator polynomial" or simply the "polynomial", and is a key parameter of any CRC algorithm. One can choose any polynomial and come up with a CRC algorithm. However, some polynomials are better than others. An example of a polynomial used might be 79,764,919 decimal, or 0x04c11db7 hexadecimal. Theoretical mathematicians have calculated certain polynomials to provide the least duplications in remainders.

CRC Implementation

To implement a CRC algorithm is to implement CRC division. There are two reasons why the divide instruction of the host machine cannot be used. The first is that the division must be in CRC arithmetic. The second is that the dividend might be ten megabytes (1 byte=8 bits) long, and today's processors do not have registers large enough to hold a dividend of this size. To implement CRC division, we have to feed the message in smaller chunks through a division register.

Originally there were seven candidate CRC32 implementations (using C or C++ high-level programming languages) under consideration (which represents nearly all of the researched CRC32 implementations publicly available). Performance metrics used to evaluate these implementations were the following (in no particular order of importance): speed of execution, ease of use, accuracy, ability to operate on entire files, and choice of generator polynomial. One implementation was rejected because it did not produce accurate results, two were not set up to operate on entire files (only text strings), and two were slow (because they were not "table-driven"). Only two were reasonably fast, produced accurate information, were table-driven, and used generally accepted generator polynomials.
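
A table-driven CRC-32 of the kind described in this section can be sketched in Python as follows. This is not the code of either evaluated implementation. It assumes the bit-reflected form of the 0x04c11db7 polynomial together with an all-ones initial register and a final inversion, which is the convention used by zip-style CRC-32 and by Python's zlib.crc32; all function and variable names are hypothetical.

```python
import zlib

POLY = 0x04C11DB7  # generator polynomial quoted in the text


def reflect32(value: int) -> int:
    """Reverse the order of the 32 bits in value."""
    result = 0
    for _ in range(32):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


# Assumption: process bits least-significant first, so use the reflected polynomial.
REFLECTED_POLY = reflect32(POLY)


def build_table():
    """Pre-calculate the CRC of every byte value from 0 to 255."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):  # the 8 bit-by-bit steps that one table lookup replaces
            crc = (crc >> 1) ^ REFLECTED_POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table


TABLE = build_table()


def crc32_table(data: bytes) -> int:
    """Feed the message through a 32-bit register one byte at a time."""
    crc = 0xFFFFFFFF                  # assumed initial value
    for b in data:
        crc = (crc >> 8) ^ TABLE[(crc ^ b) & 0xFF]
    return crc ^ 0xFFFFFFFF           # assumed final inversion


sample = b"123456789"
print(format(crc32_table(sample), "08x"))
print(crc32_table(sample) == (zlib.crc32(sample) & 0xFFFFFFFF))
```

Checking the result against an independent, widely used implementation mirrors the report's approach of verifying candidates against the CRCs produced by common archive tools.
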
Hoth are table1drien# but one uses a polynomial is from an /merican  National Standards $nstitute (/NS$) 8 ommittee# while the other polynomial is not e&plicitly specified in code# but the table entries are the same as compared to the other. The two implementations are about the same number of programming statements. Slight  preference was gien for the algorithm that computes alues for directories of files as well as indiidual files.The test data used to erify the routines was from commonly used 7IJ$7 (R-8) and 5$NJ$7 (R-=) products# and other arious test character strings and file directories. Since these products are commonly used and routinely generate R s# they would be alid benchmar's of accuracy. The R outputs are in he&. Hoth implementations erified correctly against the data. There are no apparent limitations in the implementations# other than the inherent R 89 limitations# although one implementation produces more cursory output on only one file at a time. 6ore information on each of these implementations is gien below.The first candidate R program (using the language) computes the 891bit R used as the frame chec' (error1detection) sequence in $7S > (R-E) This source code is from the Snippets file collection (R-<). $t consists of a header file# crc.h# and a main