From Hallucination to Validation: Optimizing ChatGPT for Employer Name Matching

In a previous blog post, we discussed leveraging ChatGPT to improve employer name matching, a critical task in employment verification.

Now, we are thrilled to share further insights into the intricacies of employer name matching. By leveraging the power of large language models (LLMs) like ChatGPT, we have achieved even greater performance in identifying employer matches. As we explore this technology, we continuously improve our systems, ensuring our clients have a secure and reliable lending process. 

In this installment, we take a closer look at the challenges associated with employer name matching and how advanced AI models can help overcome them. We explore the complexities of deciphering variations in employer names, such as when an individual’s paystub or bank statement reflects a parent company rather than the specific subsidiary. Our applied machine learning engineers have addressed these intricacies, and we are excited to present their solutions.

Introduction

Informed leverages LLMs to determine entity relationships between user-reported information (what the human provides) and extracted information (what the “machine” derives from the document). The goal is to determine whether extracted information can be used to validate reported information. This is challenging, even for humans, as the legal entity issuing the document can seem completely unrelated to the entity listed on the document.

Initial methodology

Informed’s methodology for determining relationships between entities focuses on handling spelling/parsing mistakes, as well as straightforward relationships between companies. The heuristic is primarily composed of:

  • Direct/fuzzy matching of names against each other (e.g., “Seatle” and “Seattle”; a minimal sketch follows this list)
  • Direct/fuzzy comparison of names against Informed’s proprietary database
  • Leveraging spell check and a search engine to determine if there are relationships between the entities
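
Below is a minimal sketch of the direct/fuzzy comparison step, using Python’s standard-library difflib. The 0.85 threshold is illustrative, not Informed’s production value.

```python
from difflib import SequenceMatcher

def fuzzy_match(reported: str, extracted: str, threshold: float = 0.85) -> bool:
    """Return True when two employer names are close enough to treat as a match."""
    a, b = reported.strip().lower(), extracted.strip().lower()
    if a == b:
        return True  # direct match
    # Fuzzy match: "seatle" vs. "seattle" scores ~0.92, so a simple typo still matches
    return SequenceMatcher(None, a, b).ratio() >= threshold
```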

While these heuristics perform well in identifying typos or entities with similar names, there are still relationships which are difficult to derive. Examples of such relationships are:

  • Companies that employ external payroll providers. In this case, the individual may technically be employed by the payroll company rather than their actual employer.
  • Staffing companies have a similar issue, where the employer may be ambiguous and context-dependent, as the individual could be paid by either the staffing company or the company they are working for.
  • Companies that operate under the same umbrella can seem unrelated until someone traces the corporate structure of parent and subsidiaries.

Examples of non-trivial relationships between entities

One example of a non-trivial relationship GPT identified is between the Atlantic City Municipality and PrimePoint LLC. In this case GPT identified that PrimePoint LLC was a payroll provider for Atlantic City. A press release from the municipality corroborated the relationship.

Another example GPT identified was between Advanced Imaging Partners and Radnet, where GPT identified that Advanced Imaging Partners is a subsidiary of Radnet. This information was corroborated by an SEC filing listing Radnet’s subsidiaries.

Prompt implementation and performance

We backtested the GPT prompt on a curated dataset of paired names and found:

  • Both GPT-3.5 and GPT-4 had approximately the same accuracy of ~70%
  • GPT-4 had half as many false positives 
  • GPT-4 had twice as many false negatives 

However, GPT-4 is clearly more conservative than GPT-3.5. Beyond the statistics above, GPT-4 declines to answer when it is uncertain, while GPT-3.5 guesses. This became more evident as we looked at specific examples.

One example is trying to find a relationship between a healthcare staffing provider and a regional medical center. GPT-3.5 reported them as the same entity, while GPT-4 asserted that the healthcare staffing provider did not exist!

Caveats

GPT’s performance is reasonable given the limited context provided and the initial lack of fine-tuning. However, there are a number of important caveats; the main one is that GPT can be right for the wrong reasons. There is active research on improving LLM factuality, but for most off-the-shelf models, determining the truth of a statement is difficult.

Handling (almost) correct outputs

In backtests, one notable example is the relationship between Value City Corp and American Signature Inc. GPT responded: “Value City Corp is the parent company of American Signature, Inc.” when in fact the reverse is true. Although the individual facts are correct (both entities are real, and one is the parent of the other), GPT’s statement as a whole is false. Examples like this necessitated a validation layer, because LLM logic is very difficult to debug and audit.

In the case of correct outputs, validation (for example, through a search engine) provides references for the statements. It also helps correct the “almost” correct statements made by LLMs. In the case above, the validation layer determined, from the companies’ own web pages, that American Signature Inc is in fact the parent company of Value City Corp.
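
As an illustration, the direction check might look like the sketch below. Here `search` is a stand-in for whatever search-engine client the validation layer uses (not an actual Informed API), and counting result snippets is a deliberately simplified evidence signal.

```python
from typing import Callable, List, Optional

def check_parent_direction(parent: str, child: str,
                           search: Callable[[str], List[str]]) -> Optional[str]:
    """Sanity-check a GPT-claimed 'parent of' relationship in both directions.

    `search(query)` stands in for a search-engine client returning result snippets.
    """
    forward = search(f'"{child}" is a subsidiary of "{parent}"')
    reverse = search(f'"{parent}" is a subsidiary of "{child}"')
    if forward and not reverse:
        return f"{parent} is the parent company of {child}"
    if reverse and not forward:
        # GPT stated the relationship backwards, as in the Value City example
        return f"{child} is the parent company of {parent}"
    return None  # evidence missing or conflicting; escalate for review
```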

Productionizing an LLM-driven application

Although Informed does not build or deploy its own LLMs, there were still engineering issues to address. Some of the major questions were:

  • How do you get structured output from an LLM that usually operates as a chatbot?
  • How do you control costs when using third-party APIs?
  • How much will the application SLA be impacted by LLM performance? (Especially since Azure does not offer a response-time SLA for GPT as of June 2023.)
  • How do you ensure that the application’s accuracy remains stable over time?

Structured Outputs

Although OpenAI now has function-calling capabilities, it was initially a challenge to obtain an easily interpretable output from the prompt. One way around this is to ask for a specific format, such as: Output JSON: <json with affiliated, affiliation_type, explanation>. In this schema, `affiliated` is a boolean; `affiliation_type` is a value from an enumeration; and free text goes in the `explanation` field. It’s worth noting that although the output adhered to the JSON standard, prompt engineering is still required to ensure the output is correctly interpreted.
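
A sketch of how such a response might be parsed and validated follows. The field names come from the prompt above; the enumeration members and the prose-stripping fallback are illustrative assumptions, not the exact production schema.

```python
import json
from enum import Enum

class AffiliationType(str, Enum):
    # Illustrative members; the real enumeration is defined by the prompt
    PARENT = "parent"
    SUBSIDIARY = "subsidiary"
    PAYROLL_PROVIDER = "payroll_provider"
    STAFFING = "staffing"
    NONE = "none"

def parse_gpt_output(raw: str) -> dict:
    """Parse the JSON requested by the prompt, rejecting malformed output."""
    # GPT occasionally wraps the JSON in prose, so isolate the outermost braces
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError(f"No JSON object found in: {raw!r}")
    data = json.loads(raw[start:end + 1])
    return {
        "affiliated": bool(data["affiliated"]),
        "affiliation_type": AffiliationType(data["affiliation_type"]),
        "explanation": str(data.get("explanation", "")),
    }
```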

Cost and latency

Azure-hosted GPT endpoints take a few seconds to return a response, and the service is charged per token. As the service can be called multiple times, potentially requesting the same information, you can improve cost and latency by reducing redundant calculations. Using DynamoDB as a low-latency cache allows us to accomplish this by caching the raw responses we receive from GPT and other Azure services.
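
A minimal sketch of that caching pattern with boto3 follows; the table name, key schema, and `call_gpt` helper are hypothetical.

```python
import hashlib
import boto3

# Hypothetical table with a string hash key named "prompt_hash"
table = boto3.resource("dynamodb").Table("gpt-response-cache")

def cached_gpt_call(prompt: str, call_gpt) -> str:
    """Return a cached GPT response when available; otherwise call and cache.

    `call_gpt(prompt)` stands in for the actual Azure OpenAI client call.
    """
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    item = table.get_item(Key={"prompt_hash": key}).get("Item")
    if item:
        return item["response"]  # cache hit: no token cost, no API latency
    response = call_gpt(prompt)
    table.put_item(Item={"prompt_hash": key, "response": response})
    return response
```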

Observability

In addition to cost and latency optimizations, there is the issue of correctness, particularly when dealing with complex LLMs. We need reliable audit results, and must preserve process stability when making changes. 

For an audit trail, we developed an analytics pipeline that stores debugging information (one illustrative record is sketched after this list), including:

  • Inputs
  • Latency of API calls
  • Raw API responses (such as those stored in DynamoDB)
  • Calculated results
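
One such record might look like the sketch below; the field names are illustrative assumptions, not Informed’s actual schema.

```python
import time
import uuid

def build_audit_record(inputs: dict, raw_response: str,
                       latency_ms: float, result: dict) -> dict:
    """Assemble a single audit-trail record for the analytics pipeline."""
    return {
        "request_id": str(uuid.uuid4()),   # unique key for tracing a request
        "timestamp": time.time(),
        "inputs": inputs,                  # reported and extracted employer names
        "latency_ms": latency_ms,          # latency of the external API call
        "raw_response": raw_response,      # e.g. the raw GPT output cached in DynamoDB
        "result": result,                  # the calculated affiliation decision
    }
```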

This information helps monitor the application’s performance and allows us to inspect issues such as input drift, which impacts accuracy.

To check whether the application remains stable across changes to the API version or underlying logic, we constructed a regularly run suite of test applications. As we utilize a number of different Azure APIs, these recurring test results provide insight into drift from the APIs and/or our custom application logic.
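
A simplified version of such a stability check might look like this; `pipeline` and the golden-results file are stand-ins for the real test applications.

```python
import json

def find_regressions(pipeline, golden_path: str = "golden_results.json") -> list:
    """Replay pinned test cases and report any whose output has drifted."""
    with open(golden_path) as f:
        golden = json.load(f)  # [{"reported": ..., "extracted": ..., "expected": ...}]
    regressions = []
    for case in golden:
        actual = pipeline(case["reported"], case["extracted"])
        if actual != case["expected"]:
            regressions.append({**case, "actual": actual})
    return regressions  # a non-empty list signals drift in an API or in our logic
```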

Conclusion

Incorporating ChatGPT into performance-sensitive pipelines requires careful scrutiny and testing. Because LLMs can hallucinate, validation layers are necessary to suppress false positives and verify outputs. Robust infrastructure for distributed processing and reliable service delivery is crucial, so addressing challenges such as structured outputs, cost control, SLA impact, and stability is paramount. An analytics pipeline enables auditing and monitoring, while regular testing ensures stability during API or logic changes. By addressing these considerations, you can harness the power of LLMs while maintaining accuracy and stability in performance-sensitive applications.

Bill Huang and Harshil Prajapati, Staff Data Scientist and Senior Machine Learning Engineer
Bill Huang is a Staff Data Scientist at Informed.IQ with over 5 years of experience building machine learning models and MLOps infrastructure at fintech companies. Harshil Prajapati is a Senior Machine Learning Engineer at Informed.IQ, where he develops machine learning models for classification and named entity extraction. He has a Master’s degree in Machine Learning from Boston University.
