Medical Animal Practice Standards (MAPS): Broad Evaluation of AI-Generated Records (BEAR)
Authors:
PupPilot: Gary Peters, Nora Peters, Alec Coston
AWX Consultancy: Adele Williams-Xavier
The MAPS series is a collection of papers put forward by AI and medical experts with the goal of providing a practitioner/clinician guide to veterinary AI tooling and usage. This specific paper is a guide to assessing the holistic quality of clinical notes generated by AI.
Introduction
The advent of large language models (LLMs) such as ChatGPT has opened up new possibilities for enhancing the accuracy and quality of medical records. In veterinary medicine (VetMed), leveraging LLMs to summarize medical records and generate clinical notes from audio recordings of consults could lead to significant improvements in record-keeping, data structuring, billing, and communication, a reduction in veterinary burnout, and ultimately better patient care. Typically these tools perform ambient listening during a consult, convert the audio into dictated text, and then produce a typed summary, usually standardizing the data to an extent into the subjective, objective, assessment, and plan (SOAP) elements of the conversation.
There is little to no regulation of veterinary AI tools globally, meaning that AI tools produced for the veterinary space can be released without interrogation of the quality, accuracy, or safety of their outputs. Despite the lack of legal regulatory frameworks, there remain ethical and professional obligations around accurate representation of the facts of a consult that a veterinarian needs to consider when creating clinical notes for their patients. Therefore, before these systems are implemented, we need a holistic framework to assess how well generative AI drafts clinical notes for veterinarians.
While the goal is not to replace clinical practice and individualized healthcare, there are use cases where human-in-the-loop AI tools can hugely improve efficiency and decrease the administrative burden on clinicians. As the ultimate responsibility for the content of clinical notes lies with the clinician, a clinician should be able to discern the value and potential errors of any tools they are using. Where AI software can hallucinate, creating inaccuracies in its summarisation of a consult (i.e., stating false information or omitting important pieces of information), this potential needs to be recognised and risk assessed.
In this paper, we examine the criteria that need to be evaluated to assess the quality of a veterinary consult transcription. We utilize real-world data retrieved from a purpose-built veterinary transcribing software (PupPilot) to help inform the criteria for the formulation of the assessment tool, and then use novel data from PupPilot to give an example of testing the model.
This document will explore how to evaluate a clinical note generated by generative AI; it will not explore how to use generative AI to create a clinical note. This will provide a standard for veterinary medicine when broadly evaluating clinical notes generated by AI.
Adapting HELM for Veterinary clinical notes
This BEAR (Broad Evaluation of AI-Generated Records) standard is primarily driven by adapting the methodology of Stanford’s HELM (Holistic Evaluation of Language Models)[0] project to the veterinary domain, focusing on one specific multi-dimensional task: the generation of a veterinarian’s clinical notes from audio recordings of consults or consult-summary dictation. The HELM project states there are three elements needed for a holistic/broad evaluation:
1. Broad coverage and recognition of incompleteness
It is as important to state what is covered as what is not covered by the evaluation standard set forward.
2. Multi-Metric Measurement
Holistic and broad evaluation requires “plural desiderata”; in other words, in the context of clinical notes, the AI is asked to perform many different tasks at the same time. A single-dimensional output measurement like “exact match accuracy” is incredibly narrow, and the corresponding accuracy metric would not properly represent what the practitioner is looking for when evaluating the clinical note. Thus, the outputs measured should be reflective of the tasks AND have multiple dimensions: multiple categories of metrics (e.g. accuracy, robustness), and multiple metrics within each category (e.g. within accuracy: exact match, F1, ROUGE). Based on this broad evaluation, veterinary practitioners can assess the overall “quality” of the clinical note.
3. Standardization
Since there are now multiple different tools which provide “automated clinical notes” it is important to provide a reproducible methodology for assessing broad evaluations of generated clinical notes.
Who is this Paper for?
Ideally this paper will be valuable to veterinary practitioners and those within the veterinary medicine industry. It covers topics ranging from the medical domain to the technical domain. The goal of this paper and others in the MAPS series is to democratize a standard for evaluating generative AI, as it becomes increasingly important for practitioners and those within veterinary medicine to become familiar with new tools, which often come without guardrails.
Process of Evaluation
This is “v1”, the first version of MAPS for BEARs, and as such it does not put forward an automated or even semi-automated process of evaluation, but rather a manual one. Within the world of machine learning and AI it is important to state whether the ‘evaluator’ is a human or a computer (or some mix of both); in our v1 we assume a human domain expert (i.e. a veterinarian) is the evaluator. The v2 of MAPS for BEARs will put forward an option of ‘computer-assisted’ evaluation, and ultimately v3 will be fully automated. See the “Evaluation Approach” section, and specifically the “Rubric” sub-section, for more information.
Additionally, as a v1, this evaluation process is primarily focused on veterinary general practice. In future iterations of MAPS for BEARs the scope will be expanded to include specific considerations for emergency and speciality practices. That said, the goal of this paper is ultimately to put forward a high-level framework which, even if not targeting emergency or speciality practice, should still provide significant value for practitioners in those domains assessing these tools.
Schema
Overview of Inputs and Outputs
HELM principles suggest it is important to break down the inputs to and outputs from an LLM to properly measure and assess the different aspects that together constitute the “quality” of the produced medical record. The input dimensions are determined at the time of the consultation and are largely controlled by the clinician. The output dimensions are governed by the software and the models within it.
To this end we must identify, categorize and define the parts or dimensions within the input as it pertains to automated scribing. This entails identifying the important aspects to consider when evaluating audio data (the input dimensions data, Table 1). Secondarily, we must do the same for the LLM’s outputs, which in this instance entails categorization of the elements of a SOAP style medical record (the output data).
It is critical for veterinary surgeons to realize that they will have control over the quality of the input data, which in turn can have an impact on the output data. On the theoretical assumption of a perfect input audio recording, the remainder of the assessment is based around the outputs of the scribing software model(s).
Input Schema Overview
Input Dimensions Data
Since automated scribing is ultimately a recording of a consultation (and/or a clinician dictation) there are two main attributes to consider: the quality of the recording itself (Voice Quality) and the actual content of the recording (Information Quality).
When considering the quality of the recording we need to look at the audio itself and assess if it is of good quality (Audio Quality) and secondarily if the recording fully captures the appointment/ information needed for the clinical note (Audio Capture). If either of these are below a certain threshold it will greatly negatively impact the quality of the expected output. In other words, if only 10 minutes of a one-hour appointment are recorded, there cannot be high expectations placed on the final output, irrespective of the AI scribe's abilities.
When looking at the quality of the information, an important aspect is how well the data can be compressed. This is generally determined by the length of the audio recording, but is more accurately represented by the total volume of medical content within a recording (i.e. the medical density in relation to the length of the appointment). If an appointment is 1.5 hours, this is generally considered a long appointment, but if very little medical information is discussed within those 1.5 hours (e.g. the doctor discusses personal matters with the client, or there are periods when the animal is being examined without additional talking), the compression quality may still be quite strong despite the length. Conversely, for a 50-minute recording that is very medically dense, compression may struggle even though the audio is shorter: it is more difficult to succinctly summarize (or “compress”) a higher quantity of pertinent medical information.
Additionally, when looking at the quality of the information it is important to consider the semantic quality from a medical and non-medical perspective. Are there medical inconsistencies such as a doctor accidentally entering the wrong room and recording two appointments together? This should be distinguished from non-medical issues like submitting a casual conversation as if it were a medical discussion.
Input Dimensions Table
Primary Dimension | Secondary Dimension |
Voice Quality (the recording itself) | Audio Quality (are the spoken words audible?) |
Voice Quality (the recording itself) | Audio Capture (is full information captured in the recording?) |
Information Quality (the content) | Compression Quality (% of time of the audio recording that medical information is discussed) |
Information Quality (the content) | Semantic Medical Quality (proportion of the medical conversation relevant to the patient being examined in the consult) |
Information Quality (the content) | Semantic Non-Medical Quality (conversation not relevant to the medical examination) |
Table 1: Input Dimensions Table. The primary dimensions of input data to an automated scribing AI tool, broken down into the secondary dimension components that contribute to each primary dimension.
Output Dimension Breakdown
The output data of any scribing software is a typed, summarized note of the audio recording. These summaries tend to be given structure, grouping the information pertinent to each portion of the exam separately, which adds a semi-structured dimension to medical records; the widely used SOAP structure is the most common example. Therefore, this paper has scoped its discussion to a general practice’s standard SOAP note. A simple primary dimensional analysis could be the separate sections: subjective, objective, assessment, and plan. PupPilot additionally includes procedure and diagnostic tests in this primary layer automatically, due to their uniqueness. Definitions for these sections are provided in the primary output dimensions in Table 2. Each section also has multiple subsections which need to be considered; these are defined in Table 3.
Output Dimensions Table
Primary Dimension Definitions
Primary Output Dimension SOAP Section | Definition |
Subjective | Information about the patient's history, symptoms, and any observations reported by the owner or client. This could include how the animal has been acting, changes in behavior, appetite, or energy level, and any other relevant information provided by the client. |
Objective | Objective findings: physical exam findings and parameters that can be measured (temperature, heart rate, respiration rate, etc.). |
Assessment | The veterinarian's evaluation of the animal's condition based on the subjective and objective findings, including the most likely diagnoses (differential diagnoses) and interpretation of the overall situation. |
Procedure | Any medical procedures, treatments, or interventions performed on the animal during the encounter. |
Diagnostic | Any diagnostic tests that were performed on the animal during the encounter such as blood tests, imaging, or other laboratory tests, and their results. |
Plan | Recommended and planned courses of action based on the assessment, including any medications, diet and lifestyle changes, treatments, procedures, diagnostics, follow-up visits, or other actions to be taken by the veterinarian or owner. This also includes any recommendations made by the veterinarian during the visit that were declined by the owner. |
Table 2: The definitions for the Primary Output Dimensions in the standard SOAP format, with the additional inclusion of Procedure and Diagnostic, as per PupPilot’s standard clinical note production
Secondary Dimension Definitions
Primary Output Dimension SOAP Section | Secondary Dimension SOAP Sub Section | Definition |
Subjective Note: While not all sub-sections are strictly subjective information, these sections are generally included in the Subjective section of a SOAP note. | Signalment | The identifying information about the animal patient, such as species, breed, age, and sex. |
Subjective | Primary Complaint | The primary reason or concern that prompted the owner to bring the animal in for veterinary care. |
Subjective | History | The owner’s observations of the animal's health, including previous illnesses, treatments, and current symptoms. |
Objective/ Physical examination (PE) | Vitals | The animal's measured vital signs, such as temperature, heart rate, respiratory rate, and blood pressure. |
Objective/ PE | Body Systems | The findings from the physical examination of the animal's various body systems, such as the cardiovascular, respiratory, and gastrointestinal systems. |
Assessment | Problem List | The significant findings from the subjective and objective sections that contribute to the animal's diagnosis. |
Assessment | Differential diagnoses | The possible diagnoses or conditions that could explain the animal's symptoms and findings. |
Assessment | Interpretation | The veterinarian's analysis and understanding of the animal's condition based on the findings and differentials. |
Procedure Note: This section may also be included as a subcategory under the plan. | Completed | The medical procedures or treatments that were performed on the animal during the consult. |
Diagnostics Note: This section may also be included as a subcategory under the plan. | Resulted | The diagnostic tests conducted during the consult, including blood work, X-rays, and ultrasounds, along with their respective results. |
Plan | Discussed | Recommended and planned courses of action that were discussed with the owner, including treatment options, diet and lifestyle changes, diagnostics, procedures, medications, follow-up etc. This also includes any recommendations made by the veterinarian during the visit that were declined by the owner. |
Plan | Completed | The parts of the plan that were implemented or completed during the visit, including treatment, medications and optionally diagnostics and procedures performed during the consult etc. |
Plan | Medication | Take home medications prescribed for the animal, including dosage, frequency, trade name, generic name, concentration (mg/ml), volume per administration, and the route of administration. |
Table 3: Detailed definitions for the Secondary component sub sections that constitute the primary output dimensions of a SOAP style clinical note.
Note on Incompleteness
Each section of a clinical note encompasses multiple tasks that require precise definitions for accurate evaluation. While the potential tasks and dimensions within clinical notes are extensive and varied, this paper does not aim to provide an exhaustive list of all possibilities. Instead, we focus on presenting a framework that represents the typical SOAP (Subjective, Objective, Assessment, Plan) structure commonly utilized in veterinary practice. This approach allows us to establish a foundational schema that captures the essential elements necessary for evaluating AI-generated clinical notes.
Furthermore, we have intentionally not subdivided the input information dimension based on the specific content of appointments, for example, differentiating between consultations for vaccinations versus dental examinations. While the nature of the appointment is an important factor that practitioners must consider during their evaluations, incorporating such specificity adds an additional layer of complexity that falls outside the scope of this initial framework. Our goal is to provide a broad and general schema that offers practical value without necessitating intricate categorizations of appointment types.
We acknowledge that these nuances are significant and can impact the evaluation process. Practitioners are encouraged to consider these factors in their assessments. We plan to address these additional layers of complexity in future iterations of this work, where we will expand the schema to include more detailed considerations for different clinical scenarios.
Inputs - Further investigation
Input Types: Voice and Information
HELM’s evaluation of Language Models assumes consistent input quality. In the context of automated scribing, input quality can highly vary and needs to be considered and evaluated. First it is important to consider the mediums of input: voice and text.
It is important to evaluate voice separate from text because, from an input variability standpoint, there are independent risks associated with both.
VOICE QUALITY
At a high level, voice has risks associated with (1) audio quality and (2) audio capture (i.e. was the full conversation heard) depending on the medium of capture (i.e. phone call vs in-person appointment). We have assigned risk for voice quality based on assessment of data from real-life submissions to the PupPilot transcribing software, on the assumption that these will be representative of general submissions for veterinarians. Below is a simple chart highlighting the risk associated with each dimension based on PupPilot data.
Risk by Dimension Chart
Method of Capture | Risk: Audio Quality | Risk: Audio Capture |
Phone/ Voice Call: 1 Side | LOW | MEDIUM |
Phone/ Voice Call: 2 Sided | LOW | LOW |
In-Person Appointment | HIGH Initially then MEDIUM | MEDIUM |
Dictation | LOW | LOW |
Table 4: Risks to audio quality and audio capture for each method of audio recording. Recordings can be one- or two-sided phone calls, a recording of the in-person two-way conversation during the consult, or a one-way dictation by the veterinarian after the consult. These different circumstances of audio capture each carry their own risk to audio quality and capture, as outlined in the table.
Notes on Risk
One-Sided Phone Calls
Phone calls have audio-capture issues because, by definition, the audio recording only captures one side of the conversation. From PupPilot’s internal data, even though this creates risk, it is generally easily mitigated by veterinary practitioner awareness. If the veterinarian understands that only one side of the phone call is being recorded, they can mitigate the risk by either (1) repeating aloud anything of medical value the client says or (2) dictating additional information after the phone call. Without adapting what the veterinarian verbalizes, or without subsequent post-call dictation, the risk is innately much higher.
In-Person Appointments
In-person appointments generally produce the largest data volume for a typical practice and require the greatest focus when assessing risk. According to PupPilot’s internal data, the highest proportion of poor-quality voice data comes within the first one to five sessions in which a doctor uses the automated scribing tool. PupPilot’s internal data also shows a strong pattern of improved voice-data quality when the practitioner is given feedback on data quality. This initial poor-quality data suggests a learning curve around proper microphone placement, positioning, and usage strategy, along with remembering to use the microphone at all.
Practitioner Notes
During this initial trial period it is important for the veterinary practitioner to receive clear feedback on audio quality and audio capture. PupPilot’s internal data has shown that the clearer and faster this information is communicated to the practitioner, the faster overall voice quality improves.
Even though in-person appointments carry a higher risk to overall voice quality, if downstream data enrichment exists the overall risk is relatively low.
INFORMATION QUALITY
PupPilot’s internal data shows two important input considerations with regard to information quality: (1) Compression Quality and (2) Semantic Quality.
Sub Dimension 1: Compression Quality
This is the risk associated with compressing the content of the audio recording, once transcribed, into the LLM’s context window; in other words, the risk associated with shortening/abbreviating the transcription so that an AI model can read it all at once. This is often required because AI models are limited in how many words they can process at one time. From analysis of PupPilot data, compression quality varies across clinical settings, which in turn can impact the quality of summarisation when considering whether all medical facts are retained, as explained below.
Compression Quality Low Risk for General Practice
For most general practice veterinarians, the average consult length is 15 minutes, with the vast majority of appointments being under 30 minutes in length. These consults may not have significant medical density (e.g. the first third of the consult may be small talk unrelated to the medical appointment). This means standard methods of language compression (e.g. LLM map-reduce) are sufficient to capture the medical entities within the appointment without loss of medical facts.
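To make the compression step concrete, the sketch below illustrates the general shape of an LLM map-reduce pass: split the transcript into chunks, summarize each chunk, then merge the partial summaries. The `call_llm` helper, the chunk size, and the prompts are hypothetical placeholders for illustration, not a description of PupPilot’s implementation.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around whichever LLM client is in use."""
    raise NotImplementedError

def chunk_transcript(transcript: str, max_words: int = 1500) -> list[str]:
    """Split a transcript into word-bounded chunks that fit within the model's context."""
    words = transcript.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def map_reduce_summary(transcript: str) -> str:
    """Map: summarize each chunk, keeping only medical facts.
    Reduce: merge the chunk summaries into one condensed text for note generation."""
    partial_summaries = [
        call_llm("Summarize the medical facts in this consult excerpt, omitting small talk:\n" + chunk)
        for chunk in chunk_transcript(transcript)
    ]
    return call_llm(
        "Merge these partial summaries into a single non-redundant summary, "
        "preserving every medical fact:\n" + "\n---\n".join(partial_summaries)
    )
```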
Compression Quality High Risk for ER and Speciality Clinics
Conversely, many specialist consults (e.g. radiology) typically last 90 minutes, with conversation of high medical density (i.e. the appointment itself is very complex), meaning that standard language-compression methodologies are incredibly risky and will more than likely result in lost information. Even more advanced methods like naive RAG will still result in lost information. From our internal data, the only methodologies (as of August 2024) that can capture and compress high-density, large-volume medical information for LLM usage are optimized RAG[6] and MoME. The MoME (Mixture of Model Experts) architecture, while incredibly powerful, is in its current form unfeasible for high-volume, low-cost, high-speed environments like clinical-note scribing. This means that optimized RAG is currently the only methodology that can properly compress high-density information for use by generative AI while meeting the speed, cost, and accuracy requirements.
Sub-Dimension 2: Semantic Quality: Medical
PupPilot has noticed that (1) during testing or (2) due to the recorder accidentally being left on, a submitted recording may make no medical sense. For example, from PupPilot’s records:
Test Note that Doesn’t Make Medical Sense
“Ahhh… this is Fido… and uhh he came in today because he is really sad and his parents want to see if we could make him happy”
This test-recording risk is very low in an actual medical setting, but such recordings may cause unpredictable results and may set incorrect expectations of generated clinical notes.
Medical Inconsistency
“So for the allergies we are going to prescribe an antihistamine, prednisone*”
*prednisone is not an antihistamine
For medical inconsistencies, PupPilot’s stance is that generative AI should not interfere with the act of scribing. If the doctor states a medical inconsistency, it should still be written down by the AI scribe, but tooling should be allowed to bring the inconsistency to the doctor’s attention. PupPilot has found this functionality helpful for differential diagnoses in particular.
Record Being Left On and Accidentally Walking Into Another Appointment
“(1) [a full transcription discussing the plan for neutering a canine], (2) [the doctor walking into another exam room picking up the audio in a new room] “Ok… so we just finished the Spay of the 1 year old feline…”
The recording being left on has a low likelihood of occurrence according to PupPilot’s internal data, but it poses a high level of risk if the scribing tool does not have “jailbreak detection” (an industry term for system guardrails against misuse of the generative AI product). If you have questions on how to implement proper semantic jailbreak detection, reach out to the authors.
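As a rough illustration of what such a guardrail might look like, the sketch below asks a model whether a transcript appears to span more than one patient before a note is generated. The `call_llm` helper and the prompt wording are hypothetical assumptions, not a prescription for how jailbreak detection must be built.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around whichever LLM client is in use."""
    raise NotImplementedError

def detect_multiple_patients(transcript: str) -> dict:
    """Ask the model whether the transcript appears to describe more than one patient.
    A positive result should hold the note for manual review instead of generating
    a single merged record."""
    response = call_llm(
        "Does the following veterinary consult transcript appear to describe more than "
        "one distinct patient (different species, signalment, or presenting problem)? "
        "Answer as JSON with keys 'multiple_patients' (true/false) and 'evidence' "
        "(a short quote).\n\n" + transcript
    )
    return json.loads(response)
```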
Sub-Dimension 3: Semantic Quality - Non - Medical
PupPilot has noticed that a number of submitted audio files have no (or relatively minimal) medical facts relative to personal conversation. This does not mean a clinical note cannot be generated, or generated at high quality, but it may affect quality expectations, especially if the ratio of non-medical to medical content exceeds 10:1 and the total length is over 1 hour.
Input Metrics for Assessment
The metric used to assess an input is based not just on the input type but also on the type of risk being assessed. Generally speaking, the subtypes of the input (e.g. phone call versus in-person appointment) are most useful in highlighting the risks to consider, but when assessing the metric we only need to score the associated risk. The chart below delineates this.
Input Dimension | Input Sub Dimension | Metric for Assessment |
Voice Quality | Audio Quality | Manual Binary: Good (1) or Bad (0) |
Voice Quality | Audio Capture | Manual Select: Full (1) Missing Info (.5) or Severely Missing Info (0) |
Information Quality | Compression Quality | Manual Binary: 1 if a low-density recording is under 45 minutes or a high-density recording is under 25 minutes; 0 otherwise |
Information Quality | Semantic Quality - Medical | Manual Binary: No Medical Inconsistencies (1) or Any Medical Inconsistencies (0) |
Information Quality | Semantic Quality - Non Medical | Manual Binary: Medical Information Greater than Non Medical Information (1) or Any More Non Medical Information than Medical (0) |
Table 5: Metrics for assessing each input dimension and sub-dimension.
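For teams that want to log these manual assessments alongside each audited note, a minimal sketch of Table 5 as a data structure is shown below. The field names and thresholds mirror the table; the structure itself, including the `acceptable` cut-off, is only illustrative.

```python
from dataclasses import dataclass

def compression_quality(duration_minutes: float, high_density: bool) -> int:
    """1 if the recording is within the Table 5 limits (low density under 45 minutes,
    high density under 25 minutes), otherwise 0."""
    limit = 25 if high_density else 45
    return 1 if duration_minutes < limit else 0

@dataclass
class InputAssessment:
    audio_quality: int          # 1 = good, 0 = bad
    audio_capture: float        # 1 = full, 0.5 = missing info, 0 = severely missing
    compression_quality: int    # 1 = within density/length limits, 0 = over
    semantic_medical: int       # 1 = no medical inconsistencies, 0 = any inconsistency
    semantic_non_medical: int   # 1 = more medical than non-medical content, 0 = otherwise

    def acceptable(self) -> bool:
        """Flag recordings whose inputs are too degraded to expect a high-quality note."""
        return (
            self.audio_quality == 1
            and self.audio_capture >= 0.5
            and self.compression_quality == 1
        )
```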
Outputs Diving Deeper
Output Types
On a high level there is only one output type being assessed: a standard clinical note. But peeling back the top layer, the clinical note represents a rich and complex array of outputs which need to be evaluated differently to gain a broad understanding of output quality.
As delineated in HELM, it is important to tie the assessment of the output back not only to the input but also to the overall task being accomplished. Below is a basic SOAP output outline broken out by its associated tasks[3]. To reduce complexity we will only be assessing one voice type: the in-office appointment.
Output Task Type Definitions
Data Extraction versus Exact Data Extraction[3]
In natural language processing, data extraction is a crucial element in understanding and processing text. Traditionally this has been accomplished with NLP models, but it has increasingly become a task handled well by LLMs[4]. That said, within the scope of veterinary clinical notes, different levels of data extraction are required: depending on the task at hand, a more semantic data extraction is acceptable, while in other cases an exact data extraction is required.
Medical Reasoning
Medical reasoning involves applying medical knowledge and clinical judgment to analyze patient information, connect relevant findings, and draw appropriate diagnostic or treatment conclusions. In veterinary notes, LLMs can engage in medical reasoning by synthesizing data from multiple sections to generate assessments, differentials, and plans. Typically this methodology requires Chain-of-Thought Reasoning[5].
Summarization[3]
Summarization condenses a longer passage of text into a shorter version that captures the key information. For clinical notes, LLM-based summarization can distill detailed histories, discussions, and interpretations into a concise synopsis while preserving the essential clinical details.
Contextual Medical Question Answering
This involves an LLM understanding a medical query in the context of a specific patient record and extracting the relevant information to provide an accurate answer. It requires comprehending the clinical context and focusing the answer appropriately.
Paraphrasing[3]
Paraphrasing expresses the meaning of text in alternative words while preserving the original semantics. LLMs can use paraphrasing to normalize or standardize variable descriptions of the same clinical signs, symptoms or findings in a clinical note.
Relation Extraction[3]
Relation extraction identifies semantic relationships between entities mentioned in text. In a veterinary note, this could mean determining links between a finding and a body system, a medication and the condition it's meant to treat, or a diagnostic result and its clinical significance.
Stance Detection[3]
Stance detection discerns the attitude or position that a text segment takes towards a specified target, such as supporting, opposing, or being neutral about it. LLMs can use stance detection in notes to characterize how definitive a clinical finding is or the level of certainty expressed for a diagnosis.
Medical Natural Language Inference
This task determines if a hypothesis (e.g. a proposed diagnosis or interpretation) is true, false or undetermined given a premise (e.g. clinical findings or test results). LLMs can perform NLI to assess whether a diagnosis is supported by the objective data or if a treatment plan follows from the clinical assessment.
Output Task Type Grouped - Table
SOAP Section | SOAP Sub Section | Task Type |
Subjective | Signalment | Exact Data Extraction |
Subjective | Chief Complaint | Medical Reasoning, Summarization |
Subjective | History | Summarization |
Objective/ PE | Vitals | Exact Data Extraction |
Objective/ PE | Body Systems | Contextual Medical Question Answering, Medical Reasoning, Paraphrasing, Relation Extraction |
Assessment | Findings | Stance Detection, Relation Extraction, Medical Reasoning |
Assessment | Differentials | Stance Detection, Medical Natural Language Inference, Data Extraction, Medical Reasoning |
Assessment | Interpretation | Summarization, Data Extraction |
Procedure | Completed | Medical Natural Language Inference, Data Extraction |
Procedure | Discussed | Summarization |
Diagnostics | Resulted | Medical Natural Language Inference, Exact Data Extraction |
Diagnostics | Discussed | Summarization |
Plan | Discussed | Summarization |
Plan | Completed | Medical Natural Language Inference, Data Extraction |
Plan | Medication | Relation Extraction, Exact Data Extraction |
Table 6: Output task types grouped by SOAP section and sub-section.
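When wiring a per-subsection evaluation (or generation) pipeline, Table 6 can be carried as a simple lookup structure. The sketch below is a direct transcription of the table into code rather than an additional requirement of the standard.

```python
# Table 6 rendered as a lookup structure, keyed by (SOAP section, sub-section).
TASKS_BY_SUBSECTION: dict[tuple[str, str], list[str]] = {
    ("Subjective", "Signalment"): ["Exact Data Extraction"],
    ("Subjective", "Chief Complaint"): ["Medical Reasoning", "Summarization"],
    ("Subjective", "History"): ["Summarization"],
    ("Objective/PE", "Vitals"): ["Exact Data Extraction"],
    ("Objective/PE", "Body Systems"): ["Contextual Medical Question Answering",
                                       "Medical Reasoning", "Paraphrasing", "Relation Extraction"],
    ("Assessment", "Findings"): ["Stance Detection", "Relation Extraction", "Medical Reasoning"],
    ("Assessment", "Differentials"): ["Stance Detection", "Medical Natural Language Inference",
                                      "Data Extraction", "Medical Reasoning"],
    ("Assessment", "Interpretation"): ["Summarization", "Data Extraction"],
    ("Procedure", "Completed"): ["Medical Natural Language Inference", "Data Extraction"],
    ("Procedure", "Discussed"): ["Summarization"],
    ("Diagnostics", "Resulted"): ["Medical Natural Language Inference", "Exact Data Extraction"],
    ("Diagnostics", "Discussed"): ["Summarization"],
    ("Plan", "Discussed"): ["Summarization"],
    ("Plan", "Completed"): ["Medical Natural Language Inference", "Data Extraction"],
    ("Plan", "Medication"): ["Relation Extraction", "Exact Data Extraction"],
}
```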
Output Type Conclusion
In summary, these NLP task types work together to enable LLMs to thoroughly understand and extract useful structured information from the complex natural language data in veterinary clinical notes. The specific techniques applied depend on the section of the SOAP note and the clinical information required from that section.
Why do we need to measure more than accuracy?
Dimension Overview
It is not uncommon to hear “What is the accuracy of the clinical note?” when talking to a clinician evaluating a medical record generated by AI.
This is a very important question, but in terms of large language models, “accuracy” is considered one dimension within the overall evaluation. The goal of this section is to highlight all the different dimensions that should be considered when evaluating generative clinical notes. In line with the ethos of this paper, not each dimension is applied to each section of the clinical note. Rather the dimensions of both the input and output need to be assessed when evaluating this holistic quality.
The goal of this methodology is not to ignore accuracy, but rather to strive for a Broader Evaluation of AI-generated Records (BEARs), thereby pushing for a higher quality medical record.
To highlight why this selection process is valuable, let's look at a simple real world example.
Example
Scenario
A client came in for a visit, because the patient, a 6 month old puppy, had a small cut on their paw. The client was not sure where the patient received the injury, so when asked the client went into a 20 minute story outlining: recent travels, recent outings, recent diet changes, recent new toys, etc.
After the appointment, two clinical notes could be generated: (1) one with high accuracy, and (2) one with high accuracy and high relevance. The first note has 2-3 pages of material within the subjective section, capturing every single item the owner stated. The second note contains only the information related to the minor injury, but is still completely accurate in its recall. In this thought experiment, the second note will be preferred by almost all providers.
Analysis
In generative AI, “accuracy” is rarely the whole picture and can often be a poor indicator of overall quality. Most doctors will want the second clinical note, which only contains information relevant to the appointment at hand. This means that more than one dimension is important in assessing the overall quality of the clinical note.
Outline of Dimensions
Generative clinical notes should be evaluated against 2 groups of dimensions: primary dimensions and secondary dimensions. By definition primary dimensions are critical for assessing clinical note quality; if any one of these dimensions is poor, the note quality is more than likely very poor. Secondary dimensions are valuable, but one can still generate high quality notes even if some secondary dimensions are lacking.
Evaluation Metrics
Introduction
Accuracy in clinical note generation is a complex topic and is essential for monitoring and comparing model performance. The AI must perform multiple different tasks to generate a clinical note, so the corresponding evaluation suite must be able to assess performance in a multifaceted way. While capturing all key information and avoiding hallucinations are critical, it is also essential to assess other criteria important to a clinician, such as level of detail, organization/flow, and appropriateness. In this write-up, we outline our multidimensional evaluation approach, how we adapt it based on the unique needs of different note sections, and how grading is performed.
Key Evaluation Dimensions
The primary dimensions, briefly defined below, are the key metrics to optimize for a useful and reliable clinical note.
Term | Definition |
Recall | Measures how well all relevant information is captured, ensuring no important details are missed. |
Precision | Assesses whether any false information is introduced into the output. This monitors for hallucination. |
F1 score (accuracy) | The harmonic mean of recall and precision, used here as the headline accuracy measure. |
Completeness | Evaluates whether all necessary details are provided to form a comprehensive understanding, avoiding answers that are too brief. |
Relevance | The appropriateness of the information included in the note relative to the clinical context, avoiding info unimportant to task at hand. |
Classification | The ability of the AI to correctly categorize elements of the encounter into structured data fields such as symptoms, diagnosis, physical exam, and procedures. |
AI-Overinterpretation | Instances where the AI system inappropriately introduces conclusions, assumptions, or details not explicitly derived from the veterinarian’s input or clinical findings. |
Table 7
These metrics are assessed on a scale of 0 to 1. Each distinct subsection of the note receives its own independent score across all dimensions to allow for granular assessment.
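As a concrete illustration of how recall, precision, and F1 relate at the claim level, the sketch below performs the arithmetic once the evaluator has decided which gold-standard claims were captured and which generated claims are supported by the recording; the function itself is only a convenience.

```python
def recall_precision_f1(gold_captured: int, gold_total: int,
                        generated_supported: int, generated_total: int) -> tuple[float, float, float]:
    """Claim-level recall (coverage of gold claims), precision (supported generated claims),
    and F1 (their harmonic mean) for one note subsection."""
    recall = gold_captured / gold_total if gold_total else 1.0
    precision = generated_supported / generated_total if generated_total else 1.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return recall, precision, f1

# Example: 19 of 20 gold claims captured and 24 of 25 generated claims supported
# gives recall 0.95, precision 0.96, and F1 of roughly 0.955.
```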
Secondary metric dimensions, when optimized, also improve the clinician’s overall approval of the note. However, these are not as critical as the primary dimensions. Each is assessed with a letter grade for the entire generated note, giving a holistic, subjective assessment of the quality of these dimensions. The secondary dimensions are defined below.
Term | Definition |
Coherence | The logical flow and readability of the text generated by the AI. Coherent notes are easily understandable, well-structured, professional, and logically ordered. |
Consistency | The uniformity and agreement of information within a single clinical note. Consistency checks whether all sections and statements in a document logically correlate, avoiding contradictory or mismatched details. |
Table 8
Dimension Weights
Once scores have been generated for each primary dimension within each section of the note, the scores are averaged across sections using predefined weights. The quality of each dimension impacts each section to a varying degree. For instance, we weight relevance lower in the narrative portion of the history but more heavily in the chief complaint. It is acceptable for the history narrative to contain some owner-reported information not specific to the goal of the visit, but less relevant information bloating the chief complaint sentence would be more disruptive. Each note subsection therefore has its own set of weights based on this varying level of dimension impact on note quality.
Some sections require exact matching of extracted elements (e.g. vital signs), while others only need to be semantically similar (e.g. the assessment problem list). Each subsection requires either an exact match, a loose semantic scope, or a tight semantic scope, and the evaluator keeps this in mind for each subsection.
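One possible reading of this weighted roll-up is sketched below: each subsection carries its own dimension-weight profile, subsection scores are combined with those weights, and a single dimension can also be averaged across subsections into a note-level score. The example weights and subsection names are placeholders, not PupPilot’s production values.

```python
# Illustrative weight profiles per subsection (placeholders, not production values).
SUBSECTION_WEIGHTS = {
    "chief_complaint": {"recall": 1.0, "precision": 1.0, "relevance": 1.0},
    "history_narrative": {"recall": 1.0, "precision": 1.0, "relevance": 0.5},
}

def weighted_subsection_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine one subsection's 0-1 dimension scores using its weight profile."""
    total_weight = sum(weights.values())
    return sum(scores[dim] * w for dim, w in weights.items()) / total_weight

def note_dimension_average(per_subsection: dict[str, dict[str, float]], dimension: str) -> float:
    """Average a single dimension (e.g. relevance) across all graded subsections."""
    values = [scores[dimension] for scores in per_subsection.values()]
    return sum(values) / len(values)
```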
Evaluation Approach
Currently, in v1 of MAPS for BEARs, this evaluation is performed manually by doctors on our team. A ‘gold standard’ list of claims is generated from a reference note written by a doctor. This claims list, organized by section, is then compared against the generated note using the grading rubric; it is recommended that a different doctor perform the comparison to avoid bias. In v1 of MAPS for BEARs this comparison is manual. The rubric guides the evaluator through grading each of the primary and secondary dimensions, outlined above, for each note subsection; the medical evaluator manually grades each subsection of the note in this way.
Automated evaluation metrics are under constant research and experimentation by many groups, and there does not currently exist an adequate automated system. Classic evaluation metrics such as ROUGE, BLEU, and MEDCON do not perform well in the medical domain. Even newer metrics harnessing GPT-4 only agree with manual human evaluation approximately 78% of the time[1]. In an effort to lean on exceptionally high-quality evaluation, we will rely on manual evaluation until a more dependable automated approach is designed. Having doctors on our team with technical expertise makes this approach feasible and much more desirable than relying on the above metrics.
In that line of thought, we are in the process of automating note evaluation ourselves. The process will use a human domain expert generated reference note, the model output note, and our custom rubric with dimensions outlined above. Currently scoring is performed manually, but soon this scoring will be automated. The ultimate goal will be to allow this evaluation to occur without a human generated reference note, which is often the most time consuming step. This will allow for large scale evaluation and could be done in real time at the point of output note generation. This will be a novel approach, and we are collaborating with leading universities’ medical and AI branches to pioneer a solution.
Rubric
The clinical note is separated into distinct sections (history, physical exam, etc.) and, when necessary, further divided into subsections. For example, the history section is divided into signalment, chief complaint, narrative history components, and technical history components, and each of these subsections is graded separately. Every subsection has its own priming questions for each of the primary dimensions, which are answered on a scale of 0 to 1. These scores are averaged using the predefined weights to determine a final 0-to-1 score for each primary dimension. You may obtain the rubric by reaching out to PupPilot directly.
The detailed grading rubric used in this study is available upon request. Interested veterinarians or official bodies may contact PupPilot directly to obtain a copy for non-commercial, clinical use. Please note that the rubric is not available for public distribution or commercialization.
Methodology for Determining Sample Size
In the evaluation of AI scribing solutions, selecting an appropriately large sample size is crucial, particularly due to the significant dimensionality of the inputs and outputs identified in our framework. This methodology intentionally omits the 'appointment type' as a dimension, leaving it to practitioners to incorporate this variable based on specific use cases. The focus here is on the essential attributes such as 'Voice Quality' and 'Information Quality' within the input schema, alongside the structural components of a SOAP note in the output schema.
To ensure the reliability of our assessment, the sample must not only be large but also diverse, encompassing a broad range of scenarios to fully capture the inherent variations in the data. It is imperative that the selection process is stochastic, providing a sample that is truly representative of the possible permutations within the defined dimensions. This strategy is essential for a robust evaluation of the AI tool, ensuring that the quality measurements of the scribed notes are both accurate and meaningful, in alignment with the principles outlined by the HELM project.[0][2]
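As a simple illustration of stochastic, stratified selection, the sketch below groups candidate notes by a couple of input attributes and draws a random subset from each group. The strata keys and sample counts are placeholder assumptions that a practitioner would adapt to the dimensionality of their own data.

```python
import random
from collections import defaultdict

def stratified_sample(notes: list[dict], per_stratum: int = 10, seed: int = 0) -> list[dict]:
    """Group candidate notes by (capture method, density) and draw a random subset
    from each stratum so no single permutation dominates the audit set."""
    rng = random.Random(seed)
    strata: dict[tuple, list[dict]] = defaultdict(list)
    for note in notes:
        strata[(note["capture_method"], note["high_density"])].append(note)
    sample: list[dict] = []
    for group in strata.values():
        rng.shuffle(group)
        sample.extend(group[:per_stratum])
    return sample
```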
Example PupPilot Results
Below are the averaged results across a random sample of audited notes.
Primary Dimensions | F1 (Accuracy) | Completeness | Relevance | Classification | AI Overinterpretation |
Average | 98.7% | 96.8% | 91.8% | 100% | 98.6% |
Table 9A
Secondary Dimensions | Consistency | Coherency |
Average | 100% | 100% |
Table 9B
As practitioners gain experience using our service, they have become more accustomed to optimizing audio-recording positioning. However, we are still occasionally faced with poor-quality incoming audio files. The metrics above exclude poor audio files (thus lowering dimensionality within the input). Approximately 5% of notes have enough audio discrepancies to cause a small but tangible drop in accuracy.
F1 is the core accuracy measure, quantifying the level of missed data and hallucinated data. We continue to strive to bring this number to 1.0 (100%). We also know that, even as this number approaches 1.0, AI models have a tendency to overfit to their training data sets; thus it is crucial for developers to test on a wide variety of samples and in varied clinical settings. AI software that does not recognize this often finds a wide discrepancy between its reported near-1.0 accuracy and real-world results with customers.
Assessment of Metrics
Scoring Brackets
F1 - Accuracy Brackets | |
99.5 - 100% | AAA |
99 - 99.49% | AA |
98 - 98.9% | A |
95 - 97.9% | B |
92.5 - 94.9% | C |
90 - 92.4% | D |
< 90% | FAIL |
Table 10A
The bounds for accuracy are high because accuracy inside the medical record is critical.
Completeness Brackets | |
99 - 100% | AAA |
97 - 98.9% | AA |
95 - 96.9% | A |
92.5 - 94.9% | B |
90 - 92.49% | C |
85 - 89.9% | D |
< 85% | Fail |
Table 10B
The bounds for completeness are relatively high because it is important that the AI responds with the appropriate level of thoroughness (or, alternatively, the appropriate brevity). That said, it is not as stringent a dimension as accuracy.
Relevance Brackets | |
95 - 100% | AAA |
92.5 - 94.9% | AA |
90 - 92.4% | A |
85 - 89.9% | B |
80 - 84.9% | C |
75 - 79.9% | D |
< 75% | Fail |
Table 10C
Relevance is a dimension that does not need the exacting standards of accuracy, but it still requires a standard to ensure the medical record is not left with material that can be misleading or irrelevant.
Classification | |
99.5 - 100% | AAA |
99 - 99.49% | AA |
98 - 98.9% | A |
97 - 97.9% | B |
96 - 96.9% | C |
95 - 95.9% | D |
< 95% | Fail |
Table 10D
Classification is an incredibly important dimension and will create significant issues if incorrect.
AI-Overinterpretation | |
99.5 - 100% | AAA |
99 - 99.49% | AA |
98 - 98.9% | A |
97 - 97.9% | B |
96 - 96.9% | C |
95 - 95.9% | D |
< 95% | Fail |
Table 10E
If the AI were to overinterpret or over-diagnose, even if technically correct, it would be highly dangerous for the medical practitioner; thus this dimension has a very high bar for quality purposes.
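For convenience, the bracket tables above can be expressed as a small lookup function. The sketch below copies the bounds from Tables 10A and 10C as examples and returns the corresponding letter grade; it is a helper for readers, not part of the standard itself.

```python
F1_BRACKETS = [(99.5, "AAA"), (99.0, "AA"), (98.0, "A"), (95.0, "B"), (92.5, "C"), (90.0, "D")]
RELEVANCE_BRACKETS = [(95.0, "AAA"), (92.5, "AA"), (90.0, "A"), (85.0, "B"), (80.0, "C"), (75.0, "D")]

def grade(score_pct: float, brackets: list[tuple[float, str]]) -> str:
    """Return the first letter grade whose lower bound the score meets, else FAIL."""
    for lower_bound, letter in brackets:
        if score_pct >= lower_bound:
            return letter
    return "FAIL"

# Example from Table 11: grade(98.7, F1_BRACKETS) -> "A"; grade(91.8, RELEVANCE_BRACKETS) -> "A".
```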
Example Score Using PupPilot Data
Primary Dimensions | F1 (Accuracy) | Completeness | Relevance | Classification | AI Overinterpretation |
Average | 98.7% | 96.8% | 91.8% | 100% | 98.6% |
Scoring | A | A | A | AAA | A |
Table 11
Rules of Scoring
All scoring is additive
Examples:
2 B’s means +1, +1, thus the total score is 2
B and C is +1, +4 thus the total score is 5
Scoring System
Any dimension score of B = +1
Any score of C = +4
Any score of D = +6
Any score of Fail = Automatic Fail
Top Tier is Total Score = 0
This is the recommended level for clinical usage
Secondary Tier is Total Score: 1-2
Usable by clinic, but corrections will be needed
Third Tier is Total Score: 3-7
Not recommended, the amount of time correcting will more than likely outweigh any perceived benefit. Significant corrections will be required by practitioners.
Fourth Tier is Total Score: 8+
Do not use; significant errors will be present, significant corrections will be required, and more than likely not all corrections will be caught.
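The additive rules above translate directly into a short scoring helper. The sketch below assigns the stated penalty points per grade, treats any FAIL as an automatic fail, and maps the total onto the four tiers.

```python
GRADE_POINTS = {"AAA": 0, "AA": 0, "A": 0, "B": 1, "C": 4, "D": 6}

def total_score(grades: list[str]) -> int | None:
    """Sum penalty points across dimension grades; None signals an automatic fail."""
    if "FAIL" in grades:
        return None
    return sum(GRADE_POINTS[g] for g in grades)

def tier(score: int | None) -> str:
    if score is None:
        return "Automatic Fail"
    if score == 0:
        return "Top Tier (recommended for clinical usage)"
    if score <= 2:
        return "Secondary Tier (usable, corrections needed)"
    if score <= 7:
        return "Third Tier (not recommended)"
    return "Fourth Tier (do not use)"

# Example from Table 11: ["A", "A", "A", "AAA", "A"] -> total_score 0 -> Top Tier.
```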
Analysis of Score Using PupPilot Data
So, based on the scoring system above, PupPilot’s assessment is class A: every dimension scores A or above, giving a total score of 0 (Top Tier).
Anything A or above is considered acceptable for medical use.
Anything ranked B is more than likely not ready for medical use, but will more than likely not cause harm to the practitioner; they will just need to make sure to monitor the system closely.
Systems with a C ranking are considered unreliable and a cause for concern. The practitioner will more than likely spend more time reviewing the work, and if not, inaccuracies will persist within their records.
Systems ranked D or below are considered dangerous and should not be used. The practitioner is better off writing notes themselves.
Systems ranked FAIL are considered highly dangerous and should be actively avoided. The practitioner will more than likely not catch all of the mistakes and will open themselves up to significant liability.
Looking Forward
Semi-Automated and Fully Automated Evaluation Using a RAG
As we look to the future of evaluating AI-generated clinical notes in veterinary medicine, it is clear that transitioning from manual to semi-automated or fully automated evaluation methods will be crucial for ensuring robustness and scalability. However, this transition presents significant challenges, as it requires the development of highly accurate and independently developed Retrieval-Augmented Generation (RAG) engines specifically tailored for the veterinary domain.[2]
RAG techniques involve augmenting language models with the ability to retrieve relevant information from external knowledge sources. By enabling LLMs to access vetmed-specific knowledge bases, RAG engines can potentially catch errors and inconsistencies in medical records with higher accuracy. Additionally, they can provide real-time evaluation of clinical note accuracy by directly linking back to the source material using a deterministic methodology that is independent of the methodology used to generate the original note.
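To illustrate the idea, the sketch below checks a single note claim against the source transcript by retrieving the most similar transcript chunks and asking a model whether they support the claim. The `embed` and `call_llm` helpers are hypothetical placeholders for whichever embedding model and LLM client are used, and this is one possible design rather than a prescribed evaluation engine.

```python
def embed(text: str) -> list[float]:
    """Hypothetical embedding call for whichever embedding model is in use."""
    raise NotImplementedError

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around whichever LLM client is in use."""
    raise NotImplementedError

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0

def verify_claim(claim: str, transcript_chunks: list[str], top_k: int = 3) -> str:
    """Retrieve the transcript chunks most similar to a note claim, then ask whether
    they support it ('supported', 'contradicted', or 'not found')."""
    claim_vec = embed(claim)
    ranked = sorted(transcript_chunks, key=lambda c: cosine(claim_vec, embed(c)), reverse=True)
    evidence = "\n---\n".join(ranked[:top_k])
    return call_llm(
        "Given these consult transcript excerpts:\n" + evidence +
        "\n\nIs the following note claim supported, contradicted, or not found? "
        "Answer with one word.\nClaim: " + claim
    )
```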
The Importance of Optimized RAG Engines
Optimizing RAG configurations is crucial for building high-quality systems in veterinary medicine. The numerous interacting components and parameters in a RAG system can significantly impact its performance, and finding the optimal configuration for a specific task and dataset is essential.[6] By automating the optimization process and exploring a wide range of configurations, teams can ensure that their RAG systems consistently deliver accurate, relevant, and reliable results when generating veterinary clinical notes. This optimization process helps to identify the best settings for chunking strategies, embedding models, similarity search metrics, and prompt structures, among other factors, tailoring the system to the unique requirements of the veterinary domain.[6] Investing in RAG optimization is not just recommended, but is a necessity for any team serious about deploying production-grade RAG systems for veterinary medicine.
Key Values of Optimized RAG Engines
Here are some key reasons why an independent RAG optimized engine is essential for robust evaluation:
Avoiding bias: By using a RAG engine that is developed independently from the one used in the main system, you can ensure that the evaluation process is not influenced by the same design choices or assumptions. This helps to minimize bias and provides a more accurate assessment of the system's performance.
Comprehensive evaluation: An independent RAG engine can be designed to focus specifically on evaluation tasks, such as measuring retrieval quality, relevance, and overall system performance. This allows for a more thorough and targeted evaluation process, helping to identify areas for improvement and ensuring the system meets the desired quality standards.
Scalability and automation: Just like the main RAG system, an independent evaluation engine can benefit from scalable infrastructure and automated processes. This enables efficient and repeated evaluations, making it easier to track performance over time and identify potential issues or regressions.
Continuous improvement: By incorporating an independent RAG evaluation engine into the development lifecycle, teams can establish a feedback loop that drives continuous improvement. Regular evaluations can help identify opportunities for optimization, model updates, or changes to the data pipeline, ensuring the system remains at peak performance as it evolves.
Investing in an independently developed RAG engine for evaluation purposes is a critical step in building robust, production-grade RAG systems. By combining this approach with automated configuration optimization, teams can ensure their systems consistently deliver accurate, relevant, and reliable results.
RAG Synopsis
RAG systems are powerful tools for enhancing LLM applications with domain-specific knowledge, but building a high quality RAG system can be challenging.[6] The complexity of these systems, with their numerous interacting components and parameters, means teams often want optimization but might not have enough engineering resources to do it well.
Automated RAG optimization is a necessity for any team serious about deploying production-grade RAG systems.[6]
Conclusion
In this white paper, we have outlined a comprehensive, standardized approach for evaluating the accuracy and quality of clinical notes generated by AI systems in veterinary medicine. By adapting the HELM methodology to the veterinary domain and to a specific scenario, clinical note generation, we have established a framework that emphasizes broad coverage, multi-dimensional measurement, and reproducibility.
Our evaluation metrics go beyond simple accuracy measures to assess critical aspects such as recall, precision, completeness, relevance, classification accuracy, and the risk of AI overinterpretation. This multifaceted approach ensures that the AI-generated clinical notes are not only accurate but also clinically useful and appropriate for the specific context.
By providing clear definitions, scoring brackets, and rules for aggregating scores across dimensions, we aim to make this evaluation framework accessible and practical for veterinary practitioners and industry stakeholders. The goal is to empower them to critically assess the performance of AI tools and make informed decisions about their implementation in practice.
Moving forward, we are committed to refining and automating this evaluation process. Our data-centric deep learning approach allows us to efficiently incorporate new note formats, identify weaknesses, and continuously improve the AI models. As we progress from manual evaluation to computer-assisted and fully automated methods, we will maintain our focus on delivering reliable, high-quality AI solutions that enhance veterinary care.
Ultimately, by establishing a robust, standardized evaluation framework, we aim to promote the responsible and effective use of AI in veterinary medicine. This will help to improve record-keeping, reduce veterinarian burnout, and optimize patient care, while ensuring that the AI tools meet the highest standards of accuracy, completeness, and clinical relevance.
Since the goal of this paper is predominantly to be a practical guide, please do not hesitate to reach out to the principal authors:
Gary Peters: gary@puppilot.co
Nora Peters, DVM: nora@puppilot.co
Alec Coston, MD: alec@puppilot.co
Adele Williams-Xavier, BVSc, PhD: adelewxvet@gmail.com
If you would like to contribute to future MAPS papers or would like a MAPS for a specific topic please reach out.
References
[0] Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, Yuta Koreeda - Holistic Evaluation of Language Models – https://arxiv.org/abs/2211.09110
[1] Yiqing Xie, Sheng Zhang, Hao Cheng, Zelalem Gero, Cliff Wong, Tristan Nauman, Hoifung Poon – Enhancing Medical Text Evaluation with GPT-4 - https://ar5iv.labs.arxiv.org/html/2311.09581
[2] Wen-wai Yim, Yujuan Fu, Asma Ben Abacha, Neal Snider, Thomas Lin, and Meliha Yetisgen - A Novel Ambient Clinical Intelligence Dataset For Benchmarking Automatic Visit Note Generation - https://arxiv.org/pdf/2306.02022
[3] Shubham Vatsal & Harsh Dubey - A Survey Of Prompt Engineering Methods In Large Language Models For Different NLP Tasks - https://arxiv.org/pdf/2407.12994
[4] Lena Schmidt, Kaitlyn Hair, Sergio Graziozi, Fiona Campbell, Claudia Kapp, Alireza Khanteymoori, Dawn Craig, Mark Engelbert, and James Thomas – Exploring the use of a Large Language Model for data extraction in systematic reviews: a rapid feasibility study – https://arxiv.org/pdf/2405.14445
[5] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems – https://arxiv.org/abs/2201.11903
[6] Andrew Maas, Scott Wey, Mike Wu, & The Pointable Team - Why you need RAG optimization - https://www.pointable.ai/blog/why-you-need-rag-optimization