Laravel: How to extract invoices from PDF Format

In web development, you can generate an invoice as a PDF using third-party tools available online. However, extracting data from a PDF is a tricky task. In this article, we will cover reading invoices from uploaded PDF files. We assume that you have background knowledge of the Laravel file structure and of creating models, controllers and views.

Requirements:

  • Pdftotext - a command-line application that must be installed on your machine to read PDF files.
  • spatie/pdf-to-text - composer package

Installing Laravel

  1. Open terminal
  2. Run the following command:
    • composer create-project laravel/laravel pdf-parser
  3. Open your newly created Laravel project with your preferred code editor.

Installing Pdftotext and the spatie package

Pdftotext is available on most Linux distributions, such as Ubuntu. You can install it through

apt-get install poppler-utils

Install the spatie/pdf-to-text package by running
composer require spatie/pdf-to-text

File Structure


app
   /Helpers
      /Prediction
         Prediction.php
      /PDFReader
         Invoice.php
         InvoicePDFTextReader.php
         
   /Http
     /Controllers
        PdfExtractorController.php

Now that we have pdftotext in our project, we can start!

PdfExtractorController Class

Create the PdfExtractorController:
php artisan make:controller PdfExtractorController

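//PdfExtractorController.php
use App\Helpers\PDFReader\Invoice;
use App\Helpers\PDFReader\InvoicePDFTextReader;
use Illuminate\Http\Request;
use Illuminate\Support\Str;

public function uploadPDF(Request $request)
{
   $this->files = [];
   $files = $request->file('files');
   // Guard against a request without files: count(null) throws a TypeError on PHP 8.
   if (! empty($files)) {
      $sub_dir = Str::afterLast(Str::uuid(), '-');
      foreach ($files as $file) {
         // Only accept files whose extension is in the configured allow-list.
         if (in_array($file->getClientOriginalExtension(), config('services.allowed_mime_types'))) {
            $filename = Str::lower( Str::ascii($file->getClientOriginalName()) );
            $path = $file->storeAs('./private/pdfs/'. $sub_dir, $filename);
            array_push($this->files, storage_path(str_replace('./', 'app/', $path)));
         }
      }
   }
   return $this->extractPDF();
}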

The code above uploads the files to our /storage/app/private/pdfs directory.
First, we set an empty files array, which will be used by the extractPDF() method:
$this->files = [];

Get the files from our request:
 $files = $request->file('files');

Set the sub-directory name using a randomly generated key:
$sub_dir = Str::afterLast(Str::uuid(), '-');

Let's make sure that we only process PDF files:
 if (in_array($file->getClientOriginalExtension(), config('services.allowed_mime_types'))) {
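
The check above reads its allow-list from config('services.allowed_mime_types'), which this article does not show. A minimal sketch of what that entry in config/services.php might look like, assuming only PDF uploads are accepted (the value is an assumption; note that, despite the key name, the controller compares it against the file extension and not the MIME type):

```php
<?php
// config/services.php (fragment; the 'allowed_mime_types' value is an assumption)

return [
    // ... other service settings ...

    // Compared against getClientOriginalExtension() in uploadPDF().
    'allowed_mime_types' => ['pdf'],
];
```

Because the extension is supplied by the client, a stricter setup could additionally validate the upload with Laravel's mimes:pdf validation rule before storing it.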

Store the uploaded file in our directory:
$filename = Str::lower( Str::ascii($file->getClientOriginalName()) );
$path = $file->storeAs('./private/pdfs/'. $sub_dir, $filename);

Add the path of the stored file to our files array, which will be processed by our extractPDF() method:
array_push($this->files, storage_path(str_replace('./', 'app/', $path)));

Now, the extractPDF() method will handle the files saved in our directory.
Let's instantiate an Invoice class and pass it our InvoicePDFTextReader:
$this->Invoice = new Invoice(new InvoicePDFTextReader);

We then call the extractPDF() method of the Invoice class and take the data from our invoice:
$this->Invoice->extractPDF($file);
$this->pdfInvoice[] = $this->Invoice->getData();

public function extractPDF()
{
   // Start from an empty result so the method always returns an array.
   $this->pdfInvoice = [];
   foreach ($this->files as $file) {
      $this->Invoice = new Invoice(new InvoicePDFTextReader);
      $this->Invoice->extractPDF($file);
      $this->pdfInvoice[] = $this->Invoice->getData();
   }
   return $this->pdfInvoice;
}

Invoice Interface

We are now done with our PdfExtractorController.
Next, we define an InvoiceInterface. Because the Invoice class depends on this interface rather than on a concrete reader, the implementation can be injected at run time.
//InvoiceInterface.php
namespace App\Helpers\PDFReader\Interfaces;

interface InvoiceInterface
{
   public function processInvoice();
}
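
To illustrate why the interface is useful, here is a minimal, self-contained sketch of constructor injection. The names FakeReader and InvoiceSketch are hypothetical and exist only for this example; the real classes are built in the sections that follow.

```php
<?php
// Hypothetical sketch: any class implementing the interface can be injected.

interface InvoiceInterface
{
    public function processInvoice();
}

// A stand-in reader (not part of the article's code), handy for quick tests.
class FakeReader implements InvoiceInterface
{
    public array $lines = [];
    public int $processed = 0;

    public function processInvoice()
    {
        // Just record how many lines were handed over.
        $this->processed = count($this->lines);
    }
}

// Simplified consumer: it only knows about the interface.
class InvoiceSketch
{
    private InvoiceInterface $reader;

    public function __construct(InvoiceInterface $reader)
    {
        $this->reader = $reader;
    }

    public function run(array $lines): void
    {
        $this->reader->lines = $lines;
        $this->reader->processInvoice();
    }
}

$reader = new FakeReader();
(new InvoiceSketch($reader))->run(['Invoice #1', 'Total 120.00']);
echo $reader->processed, PHP_EOL; // 2
```

Swapping FakeReader for another implementation requires no change to the consumer, which is the point of type-hinting the interface in the constructor.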

Utilities Class

We created a Utilities class for common operations, such as splitting the text into lines and removing the empty lines found in our invoice.
//Utilities.php
namespace App\Helpers\PDFReader\Tools;

abstract class Utilities
{
   public static function splitByLine(string $string): array {
      return explode("\n", $string);
   }
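   public static function removeEmptyLines(array $lines): array
   {
      $output = [];
      foreach ($lines as $line) {
         $line = trim($line);
         // Compare against '' explicitly: empty() would also discard a line containing only "0".
         if ($line !== '') {
            array_push($output, $line);
         }
      }
      return $output;
   }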
}
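
As a quick illustration of what these two helpers do to raw pdftotext output, the sketch below applies the same steps to a small text sample. The functions are standalone versions written for this example, and the invoice text is invented.

```php
<?php
// Standalone versions of the two helpers, for illustration only.

function splitByLine(string $string): array
{
    return explode("\n", $string);
}

function removeEmptyLines(array $lines): array
{
    $output = [];
    foreach ($lines as $line) {
        $line = trim($line);
        if ($line !== '') {
            $output[] = $line;
        }
    }
    return $output;
}

// Invented sample resembling text extracted from an invoice.
$text = "ACME Ltd\n\n   Invoice No: 1001  \n\nTotal 120.00\n";

$lines = removeEmptyLines(splitByLine($text));
print_r($lines);
// Array
// (
//     [0] => ACME Ltd
//     [1] => Invoice No: 1001
//     [2] => Total 120.00
// )
```

The result is a compact, re-indexed list of trimmed lines, which is the form the reader class works with.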

Invoice Class

We will now create the Invoice class that we instantiated in our extractPDF() method above.
//Invoice.php
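namespace App\Helpers\PDFReader;

use App\Helpers\PDFReader\Interfaces\InvoiceInterface;
use App\Helpers\PDFReader\Tools\Utilities as Utils;
use Spatie\PdfToText\Pdf;

class Invoice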
{
 
   public function __construct(InvoiceInterface $InvoiceInterface)
   {
      $this->InvoiceService = $InvoiceInterface;
      $this->setDataHeaders();
   }
   public function extractPDF($file)
   {
      $this->lines = $this->getLines(Pdf::getText($file));
      $this->InvoiceService->file = $file;
      $this->InvoiceService->lines = $this->lines;
      $this->InvoiceService->processInvoice();
   }
   public function getData()
   {
      return $this->InvoiceService->getInvoiceData();
   }

   public function getLines($content)
   {
      $lines = Utils::splitByLine($content);
      $lines = Utils::removeEmptyLines($lines);
      return $lines;
   }
}
Now, let's create the InvoicePDFTextReader class that we passed in when we instantiated the Invoice class.
This class is responsible for reading each part of our invoice. It processes the supplier name, address, product items, etc.

InvoicePDFTextReader Class

//InvoicePDFTextReader.php
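namespace App\Helpers\PDFReader;

use App\Helpers\PDFReader\Interfaces\InvoiceInterface;
use App\Helpers\PDFReader\Prediction\Prediction;

class InvoicePDFTextReader implements InvoiceInterface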
{
   public function processInvoice()
   {
      $this->prediction = new Prediction($this->data_headers);
      $this->initializeItems();
      $this->processLineItems();
      foreach($this->staticLines as $key => $line){
         $this->processInvoiceData($key, $line);
      }
   }
   private function processLineItems()
   {
      $lineItemClass = new LineItems($this->itemLines, $this->productHeaders);
      $lineItemClass->processLineItems();
      $this->product_values = $lineItemClass->getLineItems();
      $this->product_count_estimate = $lineItemClass->productCountEstimate();
   }
   private function initializeItems()
   {
      $this->getItemLines($this->lines);
   }
   private function getItemLines($lines)
   {
      $range = $this->getItemRangeKey();
      $itemKeys = $this->getItemKeys($range);
      foreach($lines as $key => $line){
         if($line){
            if(in_array($key, $itemKeys)){
               array_push($this->itemLines, $line);
            }else{
               array_push($this->staticLines, $line);
            }
         }
      }
   }
   public function getInvoiceData()
   {
      $this->organizeStaticData();
      $products = $this->organizeProductItems();
      $products = $this->processFinalProducts($products);
      return [
         'items' => $this->product_values,
         'invoice_static_data' => $this->invoiceDetails,
         'products' => $products,
         'product_count_estimate' => $this->product_count_estimate,
         'item_lines' => $this->itemLines,
         'item_static_lines' => $this->staticLines,
         'lines' => $this->lines,
         'file' => $this->file
     ];
   }
   private function organizeStaticData()
   {
      foreach($this->invoiceDetails as $key => $details){
         if(isset($details['options'])){
            $options = $this->organizeOptions($details);
            if($this->isReferenceFound){
              $referenceKeys = json_decode($this->pdfReference['keys']);
              if(isset($referenceKeys->$key)){
                 $this->invoiceDetails[$key]['prediction'] = $this->prediction->getOptionByKey($referenceKeys->$key, $details['options']);
              }
            }else{
               $this->invoiceDetails[$key]['prediction'] = $this->prediction->staticData($key, $details['options']);
            }
            $this->invoiceDetails[$key]['options'] = $options;
         }
      }
   }
   private function processFinalProducts($products)
   {
      $output = [];
      if(isset($this->product_count_estimate['count'])){
         foreach($products as $key => $product){
            if($key < $this->product_count_estimate['count']){
               array_push($output, $product);
            }
         }
      }
      return $output;
   }

   private function organizeProductItems()
   {
      $temp = $this->product_values;
      $products = array();
      $totalctr = count($temp);

      for ($i=0; $i < $totalctr; $i++) {
         foreach ($temp as $key => $product) {
            if (isset($product[$i])) {
               $products[$i][$key] = $product[$i];
               $products[$i][$key]['options'] = $temp[$key];
               $products[$i][$key]['selected'] = $product[$i]['key'];
            }else{
               $products[$i][$key][] = $product;
               $products[$i][$key]['options'] = $temp[$key];
               $products[$i][$key]['selected'] = $temp[$key][0]['key'] ?? '';
            }
        }
     }
     return $products;
  }
   private function organizeOptions($details)
   {
      $options = [];
      if(isset($details['options'])){
        foreach($details['options'] as $option){
           if(! $this->isInOptions($options, $option['value'])){
              array_push($options, $option);
           }
        }
     }
     usort($options, function ($a, $b) {
        return strcmp($a["value"], $b["value"]);
     });
     return $options;
   }
   public function processInvoiceData($key, $line)
   {
      foreach ($this->data_headers as $header) {
         foreach ($header->terms as $term) {
            if (! $this->matchPattern("/$term->term/i", mb_strtolower($line))) {
               continue;
            }
            // A header term matched this line; gather candidate options for the field.
            $options = $this->assignOptions($key - 5, count($this->staticLines));
            if (empty($options) || ! isset($header->patterns)) {
               continue;
            }
            foreach ($options as $option) {
               foreach ($header->patterns as $pattern) {
                  if ($pattern->type == PdfPattern::REGULAR_PATTERN
                     && $this->matchPattern($pattern->pattern, $option['value'])
                     && ! $this->isStaticValueExist($header->field, $option)) {
                     $this->invoiceDetails[$header->field]['options'][] = $option;
                  }
               }
            }
         }
      }
   }
}
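
The matching idea in processInvoiceData() can be sketched in isolation. The example below is a simplified stand-in, not the class above: the $header array, its terms and its pattern are invented, and taking candidates from the matched line and the one that follows it is only an assumption about what assignOptions() might return.

```php
<?php
// Simplified, hypothetical version of the term-then-pattern matching.

$header = [
    'field'   => 'invoice_number',
    'terms'   => ['invoice no', 'invoice number'],
    'pattern' => '/\b\d{3,}\b/',   // a value must contain a number of 3+ digits
];

$lines = [
    'ACME Ltd',
    'Invoice No:',
    'INV 1001',
    'Total 120.00',
];

$options = [];
foreach ($lines as $key => $line) {
    foreach ($header['terms'] as $term) {
        // Step 1: does this line mention one of the field's terms?
        if (! preg_match('/' . preg_quote($term, '/') . '/i', $line)) {
            continue;
        }
        // Step 2: check this line and the next one for candidate values.
        foreach (array_slice($lines, $key, 2, true) as $optionKey => $candidate) {
            if (preg_match($header['pattern'], $candidate)) {
                $options[] = ['key' => $optionKey, 'value' => $candidate];
            }
        }
    }
}

print_r($options);
// Array
// (
//     [0] => Array
//         (
//             [key] => 2
//             [value] => INV 1001
//         )
// )
```

Note that "Total 120.00" also contains a three-digit number, but it is never considered because no term matched near it. The term narrows down where to look, and the pattern decides what counts as a value.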

Prediction Class

For each field, InvoicePDFTextReader collects a list of possible values (options). The supplier name field, for example, will have a list of candidate values.
The Prediction class tries to pick the most likely value from the given options.
//Prediction.php
namespace App\Helpers\PDFReader\Prediction;
class Prediction
{
   public function __construct($header_data)
   {
      $this->headerData = $header_data;
   }
   public function getOptionByKey($key, $options)
   {
      if ($key === '') {
         return;
      }
      if (! empty($options)) {
         foreach ($options as $option) {
            if (isset($option['key']) && $option['key'] == $key) {
               return $option;
            }
         }
      }
   }
   public function customerSupplierData($key, $options)
   {
      return $this->matchStrictPattern($key, $options);
   }
   public function staticData($key, $options)
   {
      $maximum = $this->getMaximum($options);
      if($total = $this->getTotal($key, $maximum)) return $total;
      if($tax = $this->getTax($key, $maximum, $options)) return $tax;
      if($subTotal = $this->getSubTotal($key, $maximum, $options)) return $subTotal;
      return $this->processStrictPatterns($key, $options);
   }
   private function processStrictPatterns($key, $options)
   {
      if($matchTerm = $this->matchTerm($key, $options)){
         return $matchTerm;
      }
     return $this->matchStrictPattern($key, $options);
   }
   private function matchStrictPattern($key, $options)
   {
      foreach ($options as $option) {
         foreach ($this->headerData as $header) {
            if ($header->field != $key || ! isset($header->patterns)) {
               continue;
            }
            foreach ($header->patterns as $pattern) {
               if ($pattern->type == PdfPattern::STRICT_PATTERN
                  && preg_match($pattern->pattern, $option['value'])) {
                  return $option;
               }
            }
         }
      }
   }
   private function matchTerm($key, $options)
   {
      $output = [];
      if (! isset($this->headerData[$key]['strict_patterns'])) {
         return $output;
      }
      foreach ($options as $option) {
         foreach ($this->headerData[$key]['strict_patterns'] as $pattern) {
            if (! str_contains($pattern, '{{term}}')) {
               continue;
            }
            foreach ($this->headerData[$key]['terms'] as $term) {
               // Build the pattern per term; overwriting $pattern would drop the
               // {{term}} placeholder after the first replacement.
               $termPattern = str_replace('{{term}}', $term, $pattern);
               if (preg_match($termPattern, trim($option['value']))) {
                  $output = $option;
               }
            }
         }
      }
      return $output;
   }
   private function getTotal($key, $maximum)
   {
       if($key == 'total'){
           if($maximum){
               return $maximum;
           }
      }
   }
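   private function getTax($key, $maximum, $options)
   {
      // $maximum is null when no positive value was found, so check it first.
      if ($key == 'tax' && isset($maximum['value'])) {
         // The tax is assumed to be the option closest to 20% of the largest value.
         $target = (int) $maximum['value'] * .20;
         if ($tax = $this->getClosest($target, $options)) {
            return $tax;
         }
      }
   }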
   private function getSubTotal($key, $maximum, $options)
   {
       if($key == 'sub_total'){
          if(isset($maximum['value'])){
              $target = (int) $maximum['value'] * .20;
              if($tax = $this->getClosest($target, $options)){
                  $target = (int) $maximum['value'] - (int) $tax['value'];
                  if($subTotal = $this->getClosest($target, $options)){
                     return $subTotal;
                  }
              }
           }
       }
   }
   private function getMaximum($options)
   {
       $max = 0;
       $prediction = [];
       foreach($options as $option){
          if((int) $option['value'] > $max){
              $max = (int) $option['value'];
              $prediction = $option;
          }
       }
       if($max > 0){
           return $prediction;
       }
    }

    private function getClosest($search, $options) {
        $closest = null;
        $prediction = [];
        foreach ($options as $option) {
            $value = (int) $option['value'];
            $target = (int) $search;
            if ($closest === null || abs($target - $closest) > abs($value - $target)) {
                $closest = $value;
                $prediction = $option;
             }
        }
         return $prediction;
    }
}
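
To make the total, tax and sub-total heuristic concrete, here is a small worked example. It uses standalone versions of getMaximum() and getClosest() written for this sketch, and the option values are invented. As in the class above, values are cast to integers (so decimals are dropped) and a 20% rate is assumed.

```php
<?php
// Standalone versions of two Prediction helpers, for illustration only.

function getMaximum(array $options): ?array
{
    $max = 0;
    $prediction = null;
    foreach ($options as $option) {
        if ((int) $option['value'] > $max) {
            $max = (int) $option['value'];
            $prediction = $option;
        }
    }
    return $prediction;
}

function getClosest($search, array $options): array
{
    $closest = null;
    $prediction = [];
    $target = (int) $search;
    foreach ($options as $option) {
        $value = (int) $option['value'];
        if ($closest === null || abs($target - $closest) > abs($value - $target)) {
            $closest = $value;
            $prediction = $option;
        }
    }
    return $prediction;
}

// Invented numeric candidates found near the totals section of an invoice.
$options = [
    ['key' => 10, 'value' => '100.00'],
    ['key' => 11, 'value' => '20.00'],
    ['key' => 12, 'value' => '120.00'],
];

// Total: the largest value among the options.
$total = getMaximum($options);
// Tax: the option nearest to 20% of the total (target 24, nearest is 20).
$tax = getClosest((int) $total['value'] * .20, $options);
// Sub-total: the option nearest to total minus tax (target 100).
$subTotal = getClosest((int) $total['value'] - (int) $tax['value'], $options);

echo $total['value'], ' ', $tax['value'], ' ', $subTotal['value'], PHP_EOL;
// 120.00 20.00 100.00
```

The tax target here is 24 rather than the actual 20, because 20% is taken from the gross total. The closest-match step absorbs that difference in this sample, but invoices with other tax rates or many similar amounts may need a different rule.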

That's it. I hope this helps somebody.