Extract code and comments from reviewed documents

Automating companies with RPA and AI

https://mccminnovations.com - info@mccminnovations.com

Challenges

  • Extract code numbers and optional comments from documents with different formats and backgrounds.
  • The code and comments can be located anywhere in the document.
  • Comments are written by humans in different colors, font sizes and font types.
  • Return the result as structured data.

Case 1

Input document

Out[2]:
<matplotlib.image.AxesImage at 0x7f417547d470>

Apply OCR to the original image

Out[3]:
'© MCCM Innovations Ltd. MCCM Innovations C\n\nSecurity Descriptor: Project Demo\n\nPhase Demo C\nINNOVATIONS\n\n \n\n \n\n \n\nData Sheet for AA007 Laboratory\n\nCODE 4 C\n\nRyan Bright C\n4/21/2017 08:18 PM = INNOVATIONS\n\n \n\n[os meee | oe “ee | oe ‘eee\n\nA02 Issued for Review 11-MAR-2018\nA0L Issued for IDC 20-FEB-2018\n\n \n\n \n \n \n \n \n\n  \n\nContractor / Subcontractor Document Number & Revision\n\nThis document was created on behalf of MCCM Innovations and contains information that is confidential and\n\nProject Document Number\nproprietary to MCCM Innovations. This document and the information contained herein are the property of\n\nMCCM Innovations and shall not be reproduced, disclosed, duplicated, used or made public in any manner prior DEMO-26032019-001\nto express written consent of MCCM Innovations. Copyright 2019 © MCCM Innovations Ltd. All rights reserved.\n\n  \n   \n\nUncontrolled when printed or stored locally'

Image processing: convert from the RGB (red, green, blue) color space to the HSV (hue, saturation, value) color space

Out[4]:
<matplotlib.image.AxesImage at 0x7f4174c18a58>
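
The conversion matters because in HSV the color itself is isolated in the hue channel, which makes colored annotations easier to separate from black print than in RGB. The notebook's own conversion code is not shown, so the following is only a minimal per-pixel sketch using Python's standard `colorsys` module; the helper name `rgb_to_hsv_pixel` is hypothetical. On full images, an array-based routine such as OpenCV's `cv2.cvtColor` would be the usual choice.

```python
import colorsys


def rgb_to_hsv_pixel(r, g, b):
    """Convert one 8-bit RGB pixel to HSV.

    Returns hue in degrees [0, 360) and saturation/value in [0, 1].
    """
    # colorsys works on floats in [0, 1], so scale the 8-bit channels first.
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return h * 360.0, s, v


# A red pixel, a blue pixel, and a gray pixel (illustrative values only)
for pixel in [(255, 0, 0), (0, 0, 255), (128, 128, 128)]:
    print(pixel, rgb_to_hsv_pixel(*pixel))
```

Gray and black pixels come out with zero saturation, which is what lets a later threshold discard ordinary printed text.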

Mask red pixels

Out[5]:
<matplotlib.image.AxesImage at 0x7f4174b80c88>
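
As a rough illustration of this step, the sketch below keeps red pixels and paints everything else white, so that OCR only sees the red annotation. It is not the notebook's actual code: the function names and the thresholds (`hue_tol`, `min_sat`, `min_val`) are assumptions chosen for the example, and it works on a nested list of RGB tuples to stay dependency-free. An array-based version would typically use something like OpenCV's `cv2.inRange` on the HSV image.

```python
import colorsys


def is_red(r, g, b, hue_tol=15.0, min_sat=0.4, min_val=0.3):
    """Return True if an 8-bit RGB pixel counts as red.

    The thresholds are illustrative assumptions, not tuned values.
    """
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    hue = h * 360.0
    # Red sits at both ends of the hue circle, so test both sides of 0/360.
    near_red = hue <= hue_tol or hue >= 360.0 - hue_tol
    # Low saturation or value means gray, white, or black, not red.
    return near_red and s >= min_sat and v >= min_val


def mask_red(image):
    """Keep red pixels and replace every other pixel with white."""
    white = (255, 255, 255)
    return [[px if is_red(*px) else white for px in row] for row in image]


# Tiny 2x2 test image: red, black, blue, white
image = [
    [(200, 20, 30), (0, 0, 0)],
    [(30, 40, 220), (255, 255, 255)],
]
masked = mask_red(image)
```

Only the red pixel survives in `masked`; the black, blue, and white pixels all become white.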

Apply OCR to the red masked image

Out[6]:
'CODE 4\n\nRyan Bright\n4/21/2017 08:18 PM'

Finally, get the structured data as the result

{'code': 4, 'has_comments': 0, 'comments': ''}
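
One way to get from the OCR text of the masked image to this dictionary is a pair of regular expressions. This is a sketch under stated assumptions rather than the notebook's implementation: `parse_review` is a hypothetical name, the timestamp pattern is inferred from the two samples in this document, and comments are assumed to be whatever text follows the timestamp line.

```python
import re

CODE_RE = re.compile(r"CODE\s+(\d+)")
# Assumed timestamp layout, inferred from the samples: M/D/YYYY HH:MM AM|PM
STAMP_RE = re.compile(r"\d{1,2}/\d{1,2}/\d{4}\s+\d{1,2}:\d{2}\s+[AP]M")


def parse_review(ocr_text):
    """Turn OCR text from the masked image into the structured result."""
    code_match = CODE_RE.search(ocr_text)
    if code_match is None:
        raise ValueError("no review code found in OCR text")

    # Assumption: comments, if any, come after the reviewer's timestamp.
    stamp_match = STAMP_RE.search(ocr_text)
    tail = ocr_text[stamp_match.end():] if stamp_match else ""
    # Drop blank lines left by OCR and join the rest into one string.
    comments = " ".join(line.strip() for line in tail.splitlines() if line.strip())

    return {
        "code": int(code_match.group(1)),
        "has_comments": int(bool(comments)),
        "comments": comments,
    }


print(parse_review("CODE 4\n\nRyan Bright\n4/21/2017 08:18 PM"))
# prints {'code': 4, 'has_comments': 0, 'comments': ''}
```

If OCR misreads the timestamp, this sketch reports no comments instead of guessing, so a production version would need a fallback for that case.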

Case 2

Input document

Out[8]:
<matplotlib.image.AxesImage at 0x7f4174af4128>

Apply OCR to the original image

Out[9]:
'POY,| £00-90100-1h2-C1E-13-POOMD nas\n\nVSGE-674051\nHULL 480V EMERGENCY SWITCHGEAR\n\n \n\n \n\n \n\nD E F 6\nVGPE-663025\nEMERGENCY GENERATOR\nREMOTE CONTROL PANEL\n- GENERATOR CB CLOSED STATUS\n4C#14+G _T8-0.6/1-CLTBS-14(4C+E)NK ” GENERATOR CB OVERCURRENT\n3C-663025-01 TRIPPED\nACH2+G _T8-0.6/1-CLTBS-12(4C+E)NK - MAIN CT(METERING) / SPEED CONTROL\n3¢-663025-02\n4PR #16 78-0.6/1-1P(|-O)LTBS-16(4P)NK - TRIP BREAKER (94)\n3c-663006-03 = CLOSE BREAKER (98)\n- REMOTE STOP FROM PLATFORM\n9C #1446 T8-0.6/1-CLTBS-14(9C+E)NK HOLD ~ REMOTE SHUTDOWN FROM PLATFORM\npo60302504 +) = REMOTE START FROM PLATFORM\nee = 86 LOCKOUT FROM SWITCHGEAR\n3C#10+G _T8-0.6/1-CLTBS-10(3C+E)NK\n\n- VRPT(FOR VOLTAGE REGULATOR,\n\n3C-663025-05 AND METERING\n\nU4414, HDH-MAIN DECK\n\nMC\nCI\n\nINNOVATIONS\n\nCODE 2\n\nRyan Bright\n8/18/2018 02:54 AM\n\nSame comments in DEMO-26032019-02 apply here\n\n \n\n| J k\nNOTES:\n1\nFOR GENERAL NOTES, ABBREVIATION, LEGENDS, CABLE LEGENDS AND GENERAL 2\nREFERENCE DOCUMENTS REFER SHEET 01 OF 05.\nREFERENCE DRAWINGS\nDRAWING NO. 
TE\n\nZz\né\n\n(GM004-241MU009-£03-30007-002_| _ VGE-663012 EMERGENCY GEN CONTROL PANEL- AC SCHEMATIC\n\n \n\n‘GM004-241MU009-C\'17-30000-001__ | BLOCK WIRING DIAGRAM - 2Z2-663010\n\n \n\n‘GMO04-241MU008-£03-30001-001 | VGE-663012 EMERGENCY GEN CONTROL PANEL - ONE - LINE\n\n \n\n(G004-241EL002-C02-00063-001 | THREE LINE DIAGRAM UNITS 1-2 - VSGE-674051 HULL 480V EMERGENCY SWGR\n\n \n\n \n\n‘GMO04-241EL002-C02-00064-001__| THREE LINE DIAGRAM UNITS 3-4 - VSGE-674051 HULL 480V EMERGENCY SWGR\n\n \n\nof alale|y)=\n\n(GMO04EL-SLD-010-00146-001 ELECTRICAL ONE LINE DIAGRAM 480V SWGR VSGE-674051\n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n    \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n4\n5\nU4414, HDH-MAIN DECK a\n3\nNOTES:\nMC\nCM MC Phase Demo\nINNOVATIONS CM cr0-cot deveooments\naromas SMSEONG KD PROJNO INNOVATIONS —_00kN0:780 ocsesnmno ig\n{approved ty) 16:08:44 +0900" ENGMEER BLOCK DIAGRAM 2\nFA cates EMERGENCY GENERATOR g\n‘APPROVED 3\nCK {IDC CK 5 SK (CLIENT END DRAWING NO REV\nDESCRIPTION cak | ENGR | APPROVED GENT SALE SECTORCODE: 50 | SYSTENNO: DEMO-26032019-02 A04 ¢'

Mask red pixels

Out[10]:
<matplotlib.image.AxesImage at 0x7f4174ad2860>

Mask blue pixels

Out[11]:
<matplotlib.image.AxesImage at 0x7f4174a36ef0>

Combine the red and blue masks

Out[12]:
<matplotlib.image.AxesImage at 0x7f41749a06a0>
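
Merging two color masks amounts to an element-wise OR: a pixel is kept if either mask selected it. The sketch below shows the idea on boolean masks built from hue ranges. The helper names, hue ranges, and thresholds are all assumptions for illustration, not values taken from the notebook. With array-based masks, an operation such as OpenCV's `cv2.bitwise_or` would play the same role.

```python
import colorsys


def hue_mask(image, hue_ranges, min_sat=0.4, min_val=0.3):
    """Boolean mask: True where a pixel's hue lies in any (low, high) range.

    Hue ranges are in degrees; the thresholds are illustrative assumptions.
    """
    mask = []
    for row in image:
        out = []
        for r, g, b in row:
            h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
            hue = h * 360.0
            in_range = any(lo <= hue <= hi for lo, hi in hue_ranges)
            # Require enough saturation and brightness to exclude gray/black.
            out.append(in_range and s >= min_sat and v >= min_val)
        mask.append(out)
    return mask


def combine_masks(mask_a, mask_b):
    """Element-wise OR of two boolean masks with the same shape."""
    return [[a or b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(mask_a, mask_b)]


# Assumed hue ranges; red needs two because it wraps around 0/360 degrees.
RED_RANGES = [(0.0, 15.0), (345.0, 360.0)]
BLUE_RANGES = [(200.0, 260.0)]

# Tiny 2x2 test image: red, black, blue, white
image = [
    [(200, 20, 30), (0, 0, 0)],
    [(30, 40, 220), (255, 255, 255)],
]
red = hue_mask(image, RED_RANGES)
blue = hue_mask(image, BLUE_RANGES)
combined = combine_masks(red, blue)
```

Here `combined` is True for the red and blue pixels and False for the black and white ones, so both annotation colors reach the OCR step together.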

Apply OCR to the combined image

Out[13]:
'CODE 2\n\nRyan Bright\n8/18/2018 02:54 AM\n\n \n\nSame comments in DEMO-26032019-02 apply here'

Finally, get the structured data as the result

Out[14]:
{'code': 2,
 'has_comments': 1,
 'comments': 'Same comments in DEMO-26032019-02 apply here'}
