Recognize dates from an image using Sliding Window Algorithm & Python OCR.

In this article, we'll see how to recognize dates in different formats from an image of a document using Python.

Dipankar Medhi
4 min read · Jan 7, 2023


Hey there 👋! Today, let's solve a text processing problem: finding any "date" present in a string.

We are using "easyocr", a Python OCR library, to extract the text from the images. Let's move on to the code.

Extracting text from images | Setting up easyocr

  1. We start by creating a data-extraction.py module.
  2. Install the easyocr library.

$ pip install easyocr

  3. Create a DataExtraction class and initialize the easyocr model.

# data-extraction.py

from datetime import datetime
import easyocr
import re


class DataExtraction:
    def __init__(self) -> None:
        # Map three-letter month abbreviations to two-digit month numbers.
        self.months = {
            "JAN": "01",
            "FEB": "02",
            "MAR": "03",
            "APR": "04",
            "MAY": "05",
            "JUN": "06",
            "JUL": "07",
            "AUG": "08",
            "SEP": "09",
            "OCT": "10",
            "NOV": "11",
            "DEC": "12",
        }
        # Load the English OCR model once and reuse it for every image.
        self.reader = easyocr.Reader(["en"])

4. Called with detail=0, the reader's readtext method gives us a list of the strings found in the image.

Converting date strings to DateTime objects

There are many date formats in use, and handling every one of them would take a lot of time and work. So in this example, we'll consider only a few well-known forms.

We'll try to identify "dd mmm yyyy" date formats in a string.

For example, if the given string is "15 sd f may 2019", then the output should be "15-05-2019".

We are going to use a sliding window to detect whether a month name is present between two groups of digits.

The string can include digits, letters, and other characters. For example, consider "gs 15 mai may 2019 sgf s". The date should be 15th May 2019.

[Figure: the sliding window in progress]

  1. The first step is to implement a sliding window that converts "MMM" to a number, for example "may" to "05".
  2. We create a function that takes in a string and checks whether it contains any month from the months dictionary above.
    # Sliding window implementation
    def month_to_num(self, s: str) -> str:
        """Return the month number for the first month abbreviation in s, or ""."""
        res = ""
        for end in range(len(s)):
            # Grow the window by one character on the right.
            res += s[end]
            if len(res) == 3:
                if res.upper() in self.months:
                    return self.months[res.upper()]
                # No match: drop the leftmost character and keep sliding.
                res = res[1:]
        return ""
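As a quick way to experiment with the idea outside the class, here is a minimal standalone sketch of the same fixed-width window. The names `MONTHS`, `WINDOW`, and `first_month_number` are made up for this illustration and are not part of the class above; the table simply mirrors `self.months`.

```python
# Month lookup table, mirroring self.months from the DataExtraction class.
MONTHS = {
    "JAN": "01", "FEB": "02", "MAR": "03", "APR": "04",
    "MAY": "05", "JUN": "06", "JUL": "07", "AUG": "08",
    "SEP": "09", "OCT": "10", "NOV": "11", "DEC": "12",
}

WINDOW = 3  # every key in MONTHS is three letters long


def first_month_number(s: str) -> str:
    """Return the number of the first month abbreviation found in s, or ""."""
    upper = s.upper()
    # Slide a three-character window over the string, one step at a time.
    for start in range(len(upper) - WINDOW + 1):
        window = upper[start:start + WINDOW]
        if window in MONTHS:
            return MONTHS[window]
    return ""


print(first_month_number("mai-may"))   # 05
print(first_month_number("sd f may"))  # 05
print(first_month_number("summary"))   # 03
```

The last call shows a limitation worth keeping in mind: any three letters that happen to spell a month abbreviation match, so "summary" is read as March because it contains "mar".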

3. Next, we create a function that takes in a string and gives us the desired format.

    def find_date_string(self, s: str) -> list:  # s = "gs 15 mai may 2019 sgf "
        # Separate letters glued to digits, e.g. "may2019" -> " may 2019 ".
        s1 = " ".join(re.split(r"([a-zA-Z]+)([0-9]+)", s))
        s2 = " ".join(re.split(r"([0-9]+)([a-zA-Z]+)", s1))
        # Replace separators with "-" and pad both ends: "-gs-15-mai-may-2019-sgf--"
        text = "-" + "-".join(re.split(r"[-;,.\s]\s*", s2)) + "-"
        # Two digits, anything, then four digits: "-15-mai-may-2019-"
        dates_type_1 = re.findall(r"-[0-9][0-9]-.*?-[0-9][0-9][0-9][0-9]-", text)
        date_objects = []
        # in case any candidate in the desired format was found
        if len(dates_type_1) > 0:
            date_objects.extend(self.get_date_object(dates_type_1))
        return date_objects

    def get_date_object(self, date_type_1_list: list) -> list:
        dates = []
        # there might be more than one date in a single string
        for date_str in date_type_1_list:
            day_str = date_str[1:3]
            month_str = date_str[4:-6]  # text between the day and the year
            year_str = date_str[-5:-1]

            month_number = self.month_to_num(month_str)
            if month_number == "":
                continue  # no month found; skip this candidate

            result_date_str = f"{day_str}-{month_number}-{year_str}"
            try:
                date_object = datetime.strptime(result_date_str, "%d-%m-%Y")
            except ValueError:
                continue  # e.g. day 45 is not a valid date
            dates.append(date_object)

        return dates
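To see what the normalization step produces, the sketch below traces a string through it and then applies the "dd ... yyyy" pattern. It is an illustration under assumptions, not the class code: the helper name `normalize` is made up, and it uses `re.sub` plus a filtered `re.split` in place of the split-and-join calls above, so the padded string has no doubled dashes.

```python
import re


def normalize(s: str) -> str:
    """Turn free text into a dash-delimited, dash-padded token string."""
    # Insert a space wherever a letter touches a digit, in either order.
    s = re.sub(r"([a-zA-Z])([0-9])", r"\1 \2", s)
    s = re.sub(r"([0-9])([a-zA-Z])", r"\1 \2", s)
    # Split on separators and drop the empty tokens left at the edges.
    tokens = [t for t in re.split(r"[-;,.\s]+", s) if t]
    return "-" + "-".join(tokens) + "-"


text = normalize("gs 15 mai may 2019 sgf ")
print(text)  # -gs-15-mai-may-2019-sgf-

# Same pattern as dates_type_1: two digits, anything, then four digits.
candidates = re.findall(r"-[0-9][0-9]-.*?-[0-9][0-9][0-9][0-9]-", text)
print(candidates)  # ['-15-mai-may-2019-']

print(normalize("15may2019"))  # -15-may-2019-
```

Padding both ends with a dash is what lets the pattern require a delimiter on each side of the day and the year, even when the date sits at the very start or end of the string.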

4. Now we just need to pass the extracted strings into the above functions.

    def get_date_from_img(self, img_path: str) -> list:
        result = []

        # extract the text strings from the image
        text_strings = self.reader.readtext(img_path, detail=0)

        # check every string for dates
        for s in text_strings:
            date_obj_list = self.find_date_string(s)
            if len(date_obj_list) > 0:
                # extend keeps the result a flat list of datetime objects
                result.extend(date_obj_list)
        return result

5. That's it. We have the datetime objects for the dates present in a document image.
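If you want the "15-05-2019" style string from the earlier example rather than datetime objects, `strftime` can convert them back. The short sketch below uses made-up sample values in place of real OCR results.

```python
from datetime import datetime

# Made-up sample values standing in for dates extracted from an image.
dates = [datetime(2019, 5, 15), datetime(2016, 9, 12)]

# Format each datetime as day-month-year with zero padding.
formatted = [d.strftime("%d-%m-%Y") for d in dates]
print(formatted)  # ['15-05-2019', '12-09-2016']
```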

This method can be used on any kind of document, provided the date format matches the defined type. Many date formats are used throughout the world, and different countries have different conventions. Parsing each one of them takes more effort, but it is definitely achievable.

Here are some regular expressions that can be used for other "date" formats.

"""
1. 1 mai/may 2019
2. 1 mai/may 19
3. 12 09 2016
4. 2 09 2016
5. 12 09 16
6. 2 09 16
"""
dates_type_2 = re.findall(r"-[0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]-", text)
dates_type_3 = re.findall(r"-[0-9][0-9]-[0-9][0-9]-[0-9][0-9]-", text)
dates_type_4 = re.findall(r"-[0-9][0-9]-.*?-[0-9][0-9]-", text)
dates_type_5 = re.findall(r"-[0-9]-.*?-[0-9][0-9]-", text)
dates_type_6 = re.findall(r"-[0-9]-.*?-[0-9][0-9][0-9][0-9]-", text)
dates_type_7 = re.findall(r"-[0-9]-[0-9][0-9]-[0-9][0-9]-", text)
dates_type_8 = re.findall(r"-[0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]-", text)
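The all-numeric patterns can be turned into datetime objects in much the same way as the month-name pattern. The sketch below is one possible approach, not part of the original class: the helper `parse_numeric_date` and the sample `text` are made up for illustration, and it assumes day-month-year order. Two-digit years are parsed with `%y`, which Python maps to 2000-2068 for the values 00-68 and to 1969-1999 for 69-99.

```python
import re
from datetime import datetime


def parse_numeric_date(token: str) -> datetime:
    """Parse a dash-padded token such as "-12-09-2016-" or "-2-09-16-"."""
    day, month, year = token.strip("-").split("-")
    # Pick the year directive based on how many digits the year has.
    year_format = "%Y" if len(year) == 4 else "%y"
    return datetime.strptime(f"{day}-{month}-{year}", f"%d-%m-{year_format}")


# Hypothetical normalized string containing two numeric dates.
text = "-paid-12-09-2016-due-2-09-16-"

# Same patterns as dates_type_2 and dates_type_7 above.
four_digit = re.findall(r"-[0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]-", text)
two_digit = re.findall(r"-[0-9]-[0-9][0-9]-[0-9][0-9]-", text)

for token in four_digit + two_digit:
    print(parse_numeric_date(token))
# 2016-09-12 00:00:00
# 2016-09-02 00:00:00
```

Note that a purely numeric date such as 12 09 2016 is ambiguous between day-month and month-day order, so the order has to be chosen based on where the documents come from.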

That's all folks!

We'll meet in my next blog post. Till then, take care!

Happy Coding 🤟
