Hashing files in Python

This article explains how to calculate the hash of a file in Python. A hash function takes an input of arbitrary length in bytes and generates a fixed-length value from it. This value is called a hash, digest, or checksum.

Hashing files is useful in the following situations:
1. Comparing two files by generating their hashes and checking the values. If the hashes of two files are the same, the files are considered identical.
2. Ensuring the integrity of a file transferred over a network by comparing the hash generated at the sending end with the hash calculated at the receiving end. If the file has been tampered with, the hash will be different.
3. Comparing the hash of a downloaded file with its expected hash to confirm that it downloaded completely.
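
As a rough sketch of the first use case, the snippet below compares files by their md5 hashes using Python's hashlib module, which is covered later in this article. The helper names file_md5 and same_contents and the temporary file names are hypothetical, chosen only for illustration; the files are created on the fly so that the example runs anywhere.

```python
import hashlib
import os
import shutil
import tempfile


def file_md5(path):
    """Return the md5 hex digest of the file at `path` (hypothetical helper)."""
    with open(path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()


def same_contents(path_a, path_b):
    """Return True when both files produce the same hash (hypothetical helper)."""
    return file_md5(path_a) == file_md5(path_b)


# Create three small throwaway files: two with equal contents, one different.
tmp_dir = tempfile.mkdtemp()
paths = {}
for name, data in [('a.txt', b'hello'), ('b.txt', b'hello'), ('c.txt', b'hello!')]:
    paths[name] = os.path.join(tmp_dir, name)
    with open(paths[name], 'wb') as f:
        f.write(data)

identical = same_contents(paths['a.txt'], paths['b.txt'])
different = same_contents(paths['a.txt'], paths['c.txt'])
print(identical)  # True: same contents, different names
print(different)  # False: contents differ by one byte

# Remove the throwaway files again.
shutil.rmtree(tmp_dir)
```

Note that a.txt and b.txt have different names but the same hash, since only the contents are hashed.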

Hashing algorithms
There are many hashing algorithms available, such as 'md5', 'sha1', 'sha224', 'sha256', 'sha384', 'sha512', 'sha3_224', 'sha3_256', 'sha3_384', 'sha3_512', etc.
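md5 and sha1 are among the most commonly used for checksums since they are fast. Both are no longer considered safe against deliberately crafted collisions, so an algorithm such as sha256 is a better choice when the hash must resist tampering.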
Note that the hash of a file is calculated from its contents, not its name.
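
To see which algorithms a given Python installation offers, something like the following can be used. It relies on hashlib.algorithms_guaranteed and hashlib.new() from the standard library; the selection of names in the loop is only an illustrative assumption, and some restricted builds may refuse md5.

```python
import hashlib

# Algorithms that hashlib guarantees to support on all platforms.
print(sorted(hashlib.algorithms_guaranteed))

# hashlib.new() creates a hash object from an algorithm name.
for name in ('md5', 'sha1', 'sha256', 'sha3_256'):
    h = hashlib.new(name)
    print(h.name, name in hashlib.algorithms_guaranteed)
```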

In this article, we will get the hash of a file using the md5 and sha1 algorithms in Python.

Python hashlib module
Python's hashlib module provides functions to calculate the hash of given data. To use hashlib, we need to import it and create a hash object for the algorithm to be used.

To create a hash object, hashlib has functions such as md5(), sha1(), sha224(), etc., which compute hashes using the md5, sha1, and sha224 algorithms respectively.
Once we have a hash object, the following methods are the important ones for calculating the hash value.
1. update()
This method accepts data as bytes and updates the hash object with it. Repeated calls are equivalent to a single call with all the data concatenated.
This is usually done while reading a large file in chunks.
2. hexdigest()
Returns the hash or digest as a string of hexadecimal digits.
3. digest()
Returns the hash or digest as bytes.
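
The three methods above can be tried out on a small byte string. This is a minimal sketch; the sample data b'hello world' and the variable names are assumptions made for illustration.

```python
import hashlib

# Feed the data in two pieces...
h_parts = hashlib.md5()
h_parts.update(b'hello ')
h_parts.update(b'world')

# ...and in one piece.
h_whole = hashlib.md5()
h_whole.update(b'hello world')

# Both objects have seen the same byte stream, so the digests match.
print(h_parts.hexdigest() == h_whole.hexdigest())  # True

raw = h_whole.digest()      # bytes object
text = h_whole.hexdigest()  # str of hexadecimal digits
print(type(raw).__name__, type(text).__name__)  # bytes str
print(raw.hex() == text)    # True: hexdigest() is the hex form of digest()
```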

md5 file hash
Let’s look at how to get the hash of a file using the md5 algorithm in Python. An example program is given below.

import hashlib as hs

hash_md5 = hs.md5()
with open('painting.png', 'rb') as file:
    buffer = file.read()
    hash_md5.update(buffer)
print('Hash of file is:', hash_md5.hexdigest())

The code works step by step as follows.
1. First, the hashlib module is imported into the program.
2. Next, we create a hash object using the md5() function.
3. The file whose hash we want to calculate is opened for reading. Since it is an image (a binary file), it is opened in 'rb' mode.
4. The contents of the file are read into a variable using the read() method.
5. This variable containing the file data is passed to the update() method of the hash object.
6. Finally, we get the hash of the file contents as a string with the hexdigest() method of the hash object.

The output is

Hash of file is: bf500cbe32570d158f8cf4373106944c

The md5 digest is 128 bits (16 bytes) long, which hexdigest() returns as 32 hexadecimal characters.

sha1 file hash
Finding the hash of a file using the sha1 algorithm is similar to md5, except that we call the sha1() function of the hashlib module to get the hash object instead of md5(). Example:

import hashlib as hs

hash_sha1 = hs.sha1()
with open('painting.png', 'rb') as file:
  buffer = file.read()
  hash_sha1.update(buffer)
print('Hash of file is:', hash_sha1.hexdigest())

This prints

Hash of file is: 223279b5ca089821291c3c33bbf7a455250a012e

The sha1 digest is 160 bits (20 bytes) long, which hexdigest() returns as 40 hexadecimal characters.
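
The digest lengths can be checked from Python itself through the digest_size attribute of a hash object, which reports the size in bytes. A small sketch, with arbitrary sample data:

```python
import hashlib

for name in ('md5', 'sha1'):
    h = hashlib.new(name)
    h.update(b'example data')
    # digest_size is in bytes; each byte becomes two hexadecimal characters.
    print(name, h.digest_size, 'bytes,', h.digest_size * 8, 'bits,',
          len(h.hexdigest()), 'hex characters')

# md5 16 bytes, 128 bits, 32 hex characters
# sha1 20 bytes, 160 bits, 40 hex characters
```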

md5 hash of large files
In the last example, we read the file using the read() method, which reads the entire file at once and loads it into memory.
For large files, this approach is not suitable since the whole file has to fit in memory.
Instead, we read large files in chunks or blocks and pass each block to the update() method of the hash object. Multiple calls to update() have the same effect as a single call with all the data concatenated. Example:

import hashlib as hs

hash_md5 = hs.md5()
# number of bytes to read at once
BLOCK_SIZE = 8192
with open('doc.pdf', 'rb') as file:
  buffer = file.read(BLOCK_SIZE)
  # loop until the entire file is read
  while len(buffer) > 0:
    hash_md5.update(buffer)
    # read the next block
    buffer = file.read(BLOCK_SIZE)
print('Hash of file is:', hash_md5.hexdigest())

Note that we read the file using a Python while loop. Each iteration of the loop reads up to 8192 bytes and passes them to the update() method.
The loop continues until the number of bytes read becomes 0, which means that the end of the file has been reached.
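
One way to convince yourself that chunked reading gives the same result as reading the file at once is to hash a test file both ways. The sketch below does this with a temporary file of random bytes; the helper name md5_in_chunks is hypothetical, and the loop is written with iter() and a sentinel, which is an alternative spelling of the while loop above rather than a different technique.

```python
import hashlib
import os
import tempfile

BLOCK_SIZE = 8192


def md5_in_chunks(path, block_size=BLOCK_SIZE):
    """Hash a file block by block (hypothetical helper)."""
    h = hashlib.md5()
    with open(path, 'rb') as f:
        # iter() keeps calling f.read(block_size) until it returns b'' (end of file).
        for block in iter(lambda: f.read(block_size), b''):
            h.update(block)
    return h.hexdigest()


# Write a temporary file that spans several blocks.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(os.urandom(3 * BLOCK_SIZE + 100))

# Hash it in chunks and in one go.
chunked = md5_in_chunks(path)
with open(path, 'rb') as f:
    whole = hashlib.md5(f.read()).hexdigest()

print(chunked == whole)  # True
os.remove(path)
```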

sha1 hash of large files
This is similar to the previous method, except that we call the sha1() function to get the hash object instead of md5(). Example:

import hashlib as hs

hash_sha1 = hs.sha1()
# number of bytes to read at once
BLOCK_SIZE = 8192
with open('doc.pdf', 'rb') as file:
  buffer = file.read(BLOCK_SIZE)
  # loop until the entire file is read
  while len(buffer) > 0:
    hash_sha1.update(buffer)
    # read the next block
    buffer = file.read(BLOCK_SIZE)
print('Hash of file is:', hash_sha1.hexdigest())
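
Since the md5 and sha1 versions differ only in the function used to create the hash object, the two can be folded into one function that takes the algorithm name. This is a sketch under assumptions: the function name file_hash and its parameters are hypothetical, and hashlib.new() is used to build the hash object from the name.

```python
import hashlib
import os
import tempfile


def file_hash(path, algorithm='sha1', block_size=8192):
    """Return the hex digest of a file, read in blocks (hypothetical helper)."""
    h = hashlib.new(algorithm)
    with open(path, 'rb') as f:
        while True:
            block = f.read(block_size)
            if not block:  # empty bytes means end of file
                break
            h.update(block)
    return h.hexdigest()


# Try it on a small temporary file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'hello world')

sha1_value = file_hash(path, 'sha1')
md5_value = file_hash(path, 'md5')
print('sha1:', sha1_value)
print('md5: ', md5_value)
os.remove(path)
```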

That is all on finding the hash of a file in Python. Hope the article was useful.