AWS Announces General Availability of Amazon Textract

Amazon Textract uses machine learning to automatically extract text and data from virtually any document – with no machine learning experience required

The Globe and Mail, Met Office, PwC, Healthfirst, UiPath, TeraDact, Ripcord, Kablamo, Vidado, Blue Prism, and Alfresco among customers and partners using Amazon Textract

SEATTLE – Today, Amazon Web Services, Inc. (AWS), an Amazon.com company (NASDAQ: AMZN), announced the general availability of Amazon Textract. This fully managed service uses machine learning to automatically extract text and data, including from tables and forms, in virtually any document without manual review, custom code, or ML experience. Amazon Textract goes beyond simple OCR to identify the contents of form fields, information stored in tables, and the context of the information presented, such as a name or SSI number from a tax form, or a product SKU or warehouse quantity from an inventory report.

The extracted text and data can be used to build smart searches on large archives of documents or loaded into a database for use by applications such as accounting, auditing, and compliance software. Amazon Textract’s API supports multiple input formats, including scans, PDFs, and photos. Customers can use it with database and analytics services like Amazon Elasticsearch Service, Amazon DynamoDB, and Amazon Athena.

Customers can also use it with other machine learning services like Amazon Comprehend, Amazon Comprehend Medical, Amazon Translate, and Amazon SageMaker to derive deeper meaning from the extracted text and data. To get started with Amazon Textract, visit

Many companies extract text and data from files such as contracts, expense reports, mortgage guarantees, fund prospectuses, tax documents, hospital claims, and patient forms through manual data entry or simple OCR software. This process is time-consuming and often inaccurate, and its output requires extensive post-processing before it is usable by other applications.

Because existing OCR technologies are unable to recognize common layouts like forms and tables, they only generate a lengthy and often inaccurate text dump. What organizations want is the ability to accurately identify and extract text and data from forms and tables in documents of any format and from a variety of file types and templates.

Amazon Textract analyzes virtually any type of document, automatically generating highly accurate text, form, and table data. It identifies text in documents – such as line items and totals from a photographed receipt, tax information from a W-2, or values from a table in a scanned inventory report – and recognizes a range of document formats, including those specific to financial services, insurance, and healthcare, without requiring any customization or human intervention.

Amazon Textract makes it easy for customers to accurately process millions of pages in a few hours, significantly lowering document processing costs, and allowing customers to focus on deriving business value from their data instead of wasting time on post-processing. Results are delivered via an API that can be easily accessed and used without requiring any machine learning experience.

“The power of Amazon Textract is that it accurately extracts text and structured data from virtually any document requiring no machine learning experience. Subsequently, developers can analyze and query the extracted text and data using our database and analytics services like Amazon Elasticsearch Service, Amazon DynamoDB, and Amazon Athena and integrate with other machine learning services like Amazon Comprehend, Amazon Comprehend Medical, Amazon Translate, and Amazon SageMaker to help customers derive deeper meaning from the extracted text and data,” said Swami Sivasubramanian, Vice President, Amazon Machine Learning.

“In addition to the integration with other AWS services, the rich partner community developing around Amazon Textract makes it possible for customers to gain real meaning from their file collections, operate more efficiently, improve security compliance, automate data entry, and facilitate faster business decisions.”

Amazon Textract takes scanned files stored in an Amazon S3 bucket, reads them, and returns data. The data is presented as JSON text annotated with the page number, section, form labels, and data types. This data can be used for various applications, such as generating smart search indexes, redacting text in a massive collection of forms, creating automated loan approval workflows, supporting regulatory compliance, and flagging fraud risk for insurance claims.
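As a rough illustration of the flow described above, the following Python sketch reads a document from Amazon S3 and groups the returned text lines by page number. It assumes the boto3 SDK, configured AWS credentials, and a single-page image as input. The function names, the bucket and key arguments, and the sample data are hypothetical and are not part of this announcement. The sample data imitates only a few fields of the JSON response so the grouping step can be tried without calling the service.

```python
from collections import defaultdict


def fetch_blocks(bucket, key):
    """Ask Textract to read one image stored in S3 (needs AWS credentials)."""
    import boto3  # imported lazily so the parsing helper runs without boto3

    client = boto3.client("textract")
    response = client.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return response["Blocks"]


def lines_by_page(blocks):
    """Group the text of LINE blocks by page number, in the order returned."""
    pages = defaultdict(list)
    for block in blocks:
        if block.get("BlockType") == "LINE":
            # Treat a missing page number as page 1 (an assumption of this sketch).
            pages[block.get("Page", 1)].append(block["Text"])
    return dict(pages)


# Hand-written sample that imitates the shape of the JSON response.
sample_blocks = [
    {"BlockType": "PAGE", "Page": 1},
    {"BlockType": "LINE", "Page": 1, "Text": "Invoice 1001"},
    {"BlockType": "WORD", "Page": 1, "Text": "Invoice"},
    {"BlockType": "LINE", "Page": 2, "Text": "Total: 42.00"},
]
print(lines_by_page(sample_blocks))
```

With real data, the hypothetical call lines_by_page(fetch_blocks("my-bucket", "scan.png")) would replace the sample list.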

Customers can load the data into business software, such as spreadsheets, databases, and payroll systems, or they can analyze and query the data using Elasticsearch, DynamoDB, Redshift, or Athena. Amazon Textract is available in US East (Ohio), US East (N. Virginia), US West (Oregon), and EU (Ireland). Availability will expand to additional regions in the coming year.
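To make the spreadsheet use case more concrete, here is a minimal Python sketch that turns table data into CSV text. It assumes the table arrives as a list of blocks in which a TABLE block points to CELL blocks, and each CELL block carries a row index, a column index, and references to its WORD blocks. The helper names and the sample blocks are hypothetical, and a real response contains many more fields than are shown here.

```python
import csv
import io


def child_ids(block):
    """Return the ids of all CHILD blocks referenced by a block."""
    return [
        block_id
        for rel in block.get("Relationships", [])
        if rel["Type"] == "CHILD"
        for block_id in rel["Ids"]
    ]


def table_to_rows(blocks):
    """Rebuild the first TABLE block as a list of rows of cell text."""
    by_id = {b["Id"]: b for b in blocks}
    table = next(b for b in blocks if b["BlockType"] == "TABLE")
    cells = {}
    for cell_id in child_ids(table):
        cell = by_id[cell_id]
        if cell["BlockType"] != "CELL":
            continue
        words = [
            by_id[i]["Text"]
            for i in child_ids(cell)
            if by_id[i]["BlockType"] == "WORD"
        ]
        cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
    n_rows = max(row for row, _ in cells)
    n_cols = max(col for _, col in cells)
    # Row and column indexes start at 1; missing cells become empty strings.
    return [
        [cells.get((r, c), "") for c in range(1, n_cols + 1)]
        for r in range(1, n_rows + 1)
    ]


def rows_to_csv(rows):
    """Serialize rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def cell(cell_id, row, col, word_ids):
    """Build a hypothetical CELL block for the sample below."""
    block = {"Id": cell_id, "BlockType": "CELL", "RowIndex": row, "ColumnIndex": col}
    if word_ids:
        block["Relationships"] = [{"Type": "CHILD", "Ids": word_ids}]
    return block


sample_table = [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    cell("c1", 1, 1, ["w1"]),
    cell("c2", 1, 2, ["w2"]),
    cell("c3", 2, 1, ["w3"]),
    cell("c4", 2, 2, []),  # an empty cell
    {"Id": "w1", "BlockType": "WORD", "Text": "SKU"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Quantity"},
    {"Id": "w3", "BlockType": "WORD", "Text": "A-100"},
]
print(rows_to_csv(table_to_rows(sample_table)))
```

The resulting CSV text can be written to a file and opened in a spreadsheet or loaded into a database.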

What Customers are Saying

The Globe and Mail is a national icon and Canada’s most recognized media brand. “As a news media company, we rely on PDF or scanned-source documents such as FOIs (freedom of information requests). They can contain important information in tables that we previously couldn’t access,” said Michael O’Neill, MD, Digital and Data Science at The Globe and Mail.

“These documents have been under-utilized because journalists couldn’t access them easily or didn’t know they existed. Using Amazon Textract, we can extract information from tables in PDFs and easily output it to CSV. This offers easy access to these documents by making them available for search queries by our journalists. This increases efficient access to information for our journalists tenfold.”

Met Office is the UK’s national weather service, and a world leader in providing weather and climate services. “We’ll use Textract to digitize millions of historical weather observations from document archives,” said Philip Brohan, Climate Scientist, Met Office. “Making these observations available to science will improve our understanding of climate variability and change.”

PwC helps organizations and individuals create value by delivering quality in assurance, tax, and advisory services. “At PwC, we provide our customers with intelligent automation tools that help transform previously manual processes. We’ve integrated Textract into our pharmaceutical solution to automate document processing for various FDA forms,” said Siddhartha Bhattacharya of PwC.

“Previously, people would manually review, edit, and process these forms, each one taking hours. Amazon Textract is the most efficient and accurate OCR solution for these forms. It extracts all of the relevant information for review and processing, reducing time spent from hours to minutes.”

Healthfirst is a not-for-profit managed care organization and one of the fastest-growing health plans in New York. They have over 1.4M diverse members and a network of 35,000+ providers and 4,500 employees. “At Healthfirst, we’re building data pipelines to turn scanned medical charts into useful clinical information. This improves care coordination, drives quality outcomes, and ensures appropriate reimbursement for members,” said Steve Prewitt, Chief Analytics Officer, Healthfirst.

“We use Amazon Textract and Amazon Comprehend Medical to glean real value from unstructured data efficiently. This resulted in revenue savings 10-20 times more than our usual downstream operation. By scaling to analyze 50,000+ charts, we’ll find undocumented diagnoses and refer 5,000 members for needed care management.”

Informed, Inc. automates how financial institutions originate loans and open bank accounts.

“We’ve already used Amazon Textract to analyze tens of thousands of loan documents on behalf of financial institutions, and our software has been enhanced by the service. We identify 95% of defects in loan packages, helping banks reduce manual data entry,” said Justin Wickett, CEO, Informed.IQ.

“Using Textract, we provide lenders real-time visibility into applicants’ income based on their paystubs, bank statements, tax returns, and more. We’re expanding the document types we analyze, enabling FIs to leverage our machine learning models and achieve real-time decision-making efficiency.”

Additional Testimonials

Candor’s mission is to transform archaic, time-consuming processes that burden the mortgage industry. “We use OCR to extract data from lender-required documents to verify income, assets, property value, and more. Until now, OCR read one page in 38.4 seconds. But Amazon Textract reads in a fraction of that,” said Tom Showalter, Founder & CEO of Candor.

“We’ve been able to use Textract to accurately read complex, diverse documents such as bank statements, pay stubs, and tax documents without additional training or machine learning expertise, allowing our clients to underwrite and close a loan in days, as opposed to weeks.”

UiPath is a leading Robotic Process Automation vendor providing a complete software platform to help organizations efficiently automate business processes. “Amazon Textract further differentiates UiPath’s RPA platform by enhancing UiPath’s document understanding capabilities. This enables customers to unlock critical data from documents, transform data into actionable insights, and deliver those insights to business and operational systems,” said Param Kahlon, CPO of UiPath.

TeraDact allows customers to transform stored images and paper documents into privacy-compliant, usable digital formats at scale. “Amazon Textract’s platform feeds TeraDact’s patented redaction services to automatically remove and secure sensitive data. TeraDact customers can either permanently remove this data so that it can never be recovered or replace it with patented tokens that can be recovered with the appropriate permissions. This is useful in complying with government mandates surrounding individual data privacy such as GDPR,” said Tom Trobridge, COO, TeraDact.

Ripcord’s mission is to digitize and extract knowledge from paper documents using vision-guided robotics, machine learning, and advanced AI. This knowledge automates business processes and workflows. “We’ve had tremendous success utilizing Amazon Textract to augment our advanced entity extraction. This benefits many industries and uncovers $4 billion in new pay. We’re expanding our use of Textract across financial and government services, healthcare, and legal,” said Alex Fielding, CEO of Ripcord.

Blue Prism develops RPA software to provide businesses and organizations with a more agile virtual workforce. “Blue Prism’s connected-RPA automates and performs mission-critical processes, allowing staff to focus on more creative, meaningful work. With Amazon Textract, we’ve given our digital workforce another powerful tool for automation. Amazon Textract accurately analyzes data from various document types using machine learning, enhancing the digital transformation journey for our customers. Using AWS AI services like Amazon Comprehend and Amazon Rekognition, we can tackle challenges from secure authentication to fraud detection. The intelligence and flexibility of Textract’s data extraction elevates OCR to new levels in industries like financial services, retail, manufacturing, and transportation,” said Dave Moss, CTO and Co-Founder of Blue Prism.
