Hashing files in Python
This article explains how to calculate the hash of a file in Python. A hash function takes a sequence of bytes of any length and generates a fixed-length value from it. This value is called a hash, digest or checksum.
Hashing files is useful in the following situations:
1. Comparing two files by generating their hashes and checking the values. If the hashes of two files are the same, the files are considered to be identical.
2. Ensuring the integrity of a file when it is transferred over a network by comparing the hash generated at the sending end with the hash calculated at the receiving end.
If the file has been tampered with, the hash will be different.
3. Comparing the hash of a downloaded file with its published hash to confirm that it downloaded completely.
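The use cases listed above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the helper names file_md5, files_match and matches_published_hash are made up for this example, and the whole file is read at once for brevity.

```python
import hashlib


def file_md5(path):
    """Return the md5 hex digest of the file at path (hypothetical helper)."""
    with open(path, 'rb') as file:
        return hashlib.md5(file.read()).hexdigest()


def files_match(path_a, path_b):
    """Return True when both files produce the same digest."""
    return file_md5(path_a) == file_md5(path_b)


def matches_published_hash(path, expected_hex):
    """Return True when the file's digest equals a published checksum."""
    # Published checksums may differ in letter case or carry a trailing newline
    return file_md5(path) == expected_hex.strip().lower()
```

A call such as files_match('a.png', 'b.png') compares only the contents of the two files, so files with different names but identical bytes are reported as matching.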
There are many different hashing algorithms available, such as 'md5', 'sha1', 'sha224', 'sha256', 'sha384', 'sha512', 'sha3_224', 'sha3_256', 'sha3_384', 'sha3_512' etc. Two commonly used ones are md5 and sha1, since they are fast. Both are fine for detecting accidental corruption, but neither is collision-resistant, so a stronger algorithm such as sha256 is a better choice when deliberate tampering is a concern. Note that while calculating the hash of a file, its contents are used and not its name.
In this article, we will get the hash of a file using the md5 and sha1 algorithms in Python.
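To see which algorithms are available, hashlib exposes the set algorithms_guaranteed, and hashlib.new() creates a hash object from an algorithm name. The short sketch below uses the sample bytes b'hello', chosen only for illustration, and prints the length of each hex digest.

```python
import hashlib

# Names of the algorithms that hashlib supports on every platform
print(sorted(hashlib.algorithms_guaranteed))

# hashlib.new() builds a hash object from an algorithm name
data = b'hello'
for name in ('md5', 'sha1', 'sha256'):
    digest = hashlib.new(name, data).hexdigest()
    print(name, len(digest), digest)
```

Selecting the algorithm by name is handy when the name comes from configuration or from a checksum file.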
Python hashlib module
Python's hashlib module provides functions to calculate the hash of given data. To use hashlib, we need to import it and create a hash object according to the algorithm to be used.
To create a hash object, hashlib has functions such as md5(), sha1(), sha224() etc., which compute hashes using the md5, sha1 and sha224 algorithms respectively.
Once we have a hash object, the following methods are useful for calculating the hash value.
1. update()
This method accepts data as bytes and feeds it into the hash object. Repeated calls are equivalent to a single call with all the data concatenated.
This is usually done while reading a large file in chunks.
2. hexdigest()
Returns the hash or digest as a string of hexadecimal digits.
3. digest()
Returns the hash or digest as bytes.
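The behaviour of these three methods can be checked with a small experiment. The sketch below feeds the sample bytes b'hello world' (an arbitrary choice) to one hash object in two pieces and to another in a single call, then compares the results.

```python
import hashlib

# Feed the data in two pieces
chunked = hashlib.md5()
chunked.update(b'hello ')
chunked.update(b'world')

# Feed the same data in one call
whole = hashlib.md5()
whole.update(b'hello world')

# Both objects have seen the same bytes, so the digests agree
same_hash = chunked.hexdigest() == whole.hexdigest()
print(same_hash)

raw = whole.digest()      # bytes object
text = whole.hexdigest()  # str of hexadecimal digits
print(type(raw).__name__, len(raw))
print(type(text).__name__, len(text))
# hexdigest() is the hexadecimal form of digest()
print(raw.hex() == text)
```

Each byte of digest() becomes two characters in hexdigest(), which is why the string is twice as long as the bytes object.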
md5 File hash
Let’s look at how to get the hash of a file using the md5 algorithm in Python. An example program is given below.
import hashlib as hs

hash_md5 = hs.md5()
with open('painting.png', 'rb') as file:
    buffer = file.read()
    hash_md5.update(buffer)
print('Hash of file is:', hash_md5.hexdigest())
Following is the stepwise explanation of the code.
1. As a first step, the hashlib module is imported into the program.
2. Next, we get a hash object using its md5() function.
3. The file whose hash we want to calculate is opened for reading. Since a hash is calculated over raw bytes, and this file is an image (a binary file), the mode of opening is 'rb'.
4. The contents of the file are read into a variable using the read() method.
5. This variable containing the file data is passed to the update() method of the hash object.
6. Finally, we get the hash of the file contents as a string with the hexdigest() method of the hash object.
Output is
Hash of file is: bf500cbe32570d158f8cf4373106944c
Length of md5 digest is 128 bits (16 bytes), which hexdigest() returns as 32 hexadecimal characters.
Finding the hash of a file using the sha1 algorithm is similar to md5, except that we need to call the sha1() function of the hashlib module to get the hash object, instead of md5(). Example,
import hashlib as hs

hash_sha1 = hs.sha1()
with open('painting.png', 'rb') as file:
    buffer = file.read()
    hash_sha1.update(buffer)
print('Hash of file is:', hash_sha1.hexdigest())
This prints
Hash of file is: 223279b5ca089821291c3c33bbf7a455250a012e
Length of sha1 digest is 160 bits (20 bytes), which hexdigest() returns as 40 hexadecimal characters.
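The digest lengths can be confirmed from the hash objects themselves through their digest_size attribute, which reports the size in bytes. The following is a small sketch that hashes no data at all, since the digest length does not depend on the input.

```python
import hashlib

for name in ('md5', 'sha1'):
    hash_obj = hashlib.new(name)
    size_bytes = hash_obj.digest_size      # digest length in bytes
    size_bits = size_bytes * 8             # same length in bits
    hex_chars = len(hash_obj.hexdigest())  # two hex characters per byte
    print(name, size_bytes, size_bits, hex_chars)
# md5 16 128 32
# sha1 20 160 40
```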
md5 hash of large files
In the last example, we read the file using the read() method, which reads the entire file at once and loads it into memory.
For large files, this method is not suitable since it could use up the available memory.
So, we read large files in chunks or blocks and pass each block to the update() method of the hash object. Multiple calls to update() have the same effect as passing all the data in a single call. Example,
import hashlib as hs

hash_md5 = hs.md5()
# define the bytes to be read at once
BLOCK_SIZE = 8192
with open('doc.pdf', 'rb') as file:
    buffer = file.read(BLOCK_SIZE)
    # loop till entire file is read
    while len(buffer) > 0:
        hash_md5.update(buffer)
        # read next block
        buffer = file.read(BLOCK_SIZE)
print('Hash of file is:', hash_md5.hexdigest())
Note that we are reading the file using a Python while loop. Each iteration of the loop reads up to 8192 bytes and passes them to the update() method.
The loop continues till the number of bytes read becomes 0, which means that the end of the file has been reached.
Finding the sha1 hash of a large file is similar to the previous method, except that we need to call the sha1() function to get the hash object instead of md5(). Example,
import hashlib as hs

hash_sha1 = hs.sha1()
# define the bytes to be read at once
BLOCK_SIZE = 8192
with open('doc.pdf', 'rb') as file:
    buffer = file.read(BLOCK_SIZE)
    # loop till entire file is read
    while len(buffer) > 0:
        hash_sha1.update(buffer)
        # read next block
        buffer = file.read(BLOCK_SIZE)
print('Hash of file is:', hash_sha1.hexdigest())
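Since the md5 and sha1 versions differ only in the hash object, the chunked loop can be wrapped in one reusable function. The sketch below is one possible way to do that: the function name hash_file and its parameters are assumptions made for this example, and it uses iter() with a sentinel value in place of the while loop.

```python
import hashlib


def hash_file(path, algorithm='md5', block_size=8192):
    """Return the hex digest of a file, reading it block by block.

    hash_file is a hypothetical helper; algorithm is any name accepted
    by hashlib.new(), such as 'md5' or 'sha1'.
    """
    hash_obj = hashlib.new(algorithm)
    with open(path, 'rb') as file:
        # iter() keeps calling file.read(block_size) until it returns b'',
        # which is what read() gives at the end of the file
        for block in iter(lambda: file.read(block_size), b''):
            hash_obj.update(block)
    return hash_obj.hexdigest()
```

It could be called as hash_file('doc.pdf') for md5 or hash_file('doc.pdf', 'sha1') for sha1. On Python 3.11 and later, the standard library also provides hashlib.file_digest(), which takes an open binary file object and an algorithm name and performs the chunked reading itself.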
That is all on finding the hash of a file in Python. Hope the article was useful.