Over the past few months at Bloom, we’ve devoted considerable effort to building AI-powered features. Our team has experimented with different approaches to find viable ways of using large language models (LLMs), such as OpenAI’s GPT models, in our app.
We have come to understand that leveraging this technology to build superior experiences requires an entirely new skill set and a fresh approach to creating products that users love.
Navigating the Ambiguity of Language
The most significant challenge we encountered while working with these models is ambiguity. Tasks can be interpreted in multiple ways, leading to uncertainty about the output. This is largely because we interact with the models in natural language, which is far less precise than the programming languages traditionally used in software development. Additionally, these models generate responses by making statistical best guesses, and because they are guessing, inaccuracies and unexpected failures are inevitable.
Today, many AI user experiences are designed to be highly fault-tolerant, meaning they are built with the expectation that errors will occur. For example, image generation tools such as Midjourney create multiple low-resolution images for users to select from, rather than assuming they fully understand the user’s desired image. Similarly, chat applications let users steer follow-up responses, whether by asking for more detail or modifying the initial request. These interfaces thus create a feedback loop, increasing the chances that users will ultimately receive the outcome they desire.
We believe that in many use cases, particularly those related to health, an accuracy rate of 80%-90% is unlikely to satisfy the user’s expectations.
In this article, we share our key takeaways from the past six months on how to fight that ambiguity.
6 tactics to fight ambiguity
1. Split it into smaller tasks
Much like giving instructions to another person, the likelihood of success increases when tasks are broken down into smaller steps. For instance, instead of instructing children to get ready for bed, the directive could be more specific, including tasks such as tidying up their toys, brushing their teeth, and putting on their pajamas. Similarly, telling an AI exactly what we expect can be equally beneficial. These subtasks can either be included in a single prompt, or the result of one prompt can be repurposed as input for the next in a sequence of chained prompts, as in the sketch below.
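To make this concrete, here is a minimal sketch of the chained variant, using an invented journal-summary task. It assumes the OpenAI Python SDK, but the call_llm helper could wrap any chat-completion API; later sketches in this article reuse this helper.

```python
from openai import OpenAI

client = OpenAI()  # assumes the OpenAI Python SDK; any chat-completion API works

def call_llm(prompt: str) -> str:
    """Send a single prompt and return the model's text reply."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def summarize_entry(entry: str) -> str:
    # Step 1: extract the concrete facts before asking for interpretation.
    facts = call_llm(
        "List the concrete facts (symptoms, dates, measurements) "
        f"mentioned in this journal entry:\n{entry}"
    )
    # Step 2: the output of the first prompt becomes the input of the next.
    return call_llm(
        "Write a two-sentence summary for a clinician based only on "
        f"these facts:\n{facts}"
    )
```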
2. Chain of Thought or Few-shot Examples
Let’s use another human metaphor. If you show someone an example of what you want from a task, it can make things a lot clearer. It helps set expectations without going into every detail. This is especially helpful when the output needs a certain format or structure, which can be hard to explain and might not always come out right. By using examples, we can make the results more consistent and the prompt simpler, because not every detail needs to be spelled out. In fact, some researchers found that adding worked examples roughly tripled the solve rate for grade-school math problems with LLMs.
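A minimal sketch of a few-shot prompt, using an invented mood-log task: the two worked examples pin down the expected output format, so the instruction itself can stay short. (call_llm is the hypothetical helper from the first sketch.)

```python
# Two worked examples demonstrate the exact output format we want.
FEW_SHOT_PROMPT = """Rewrite the user's note as a structured mood log.

Note: Slept badly, skipped the gym, argued with my sister.
Log: mood=low; sleep=poor; exercise=none; social=conflict

Note: Great run this morning and a calm evening with friends.
Log: mood=high; sleep=unknown; exercise=run; social=positive

Note: {note}
Log:"""

def log_mood(note: str) -> str:
    # call_llm is the chat-completion wrapper from the first sketch.
    return call_llm(FEW_SHOT_PROMPT.format(note=note))
```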
3. Control Flows
Instead of decomposing a task into smaller steps within a single prompt, another approach is to separate tasks into different prompts and chain these commands: the process waits for the first task to complete, takes its output, and uses it to initiate new tasks. This approach can be refined into more complex logic using classic programming elements like IF-ELSE statements and loops, or even by letting the LLM design the next prompt based on the input. Control flows are beneficial when a broad use case needs to be refined into more specific instructions for more detailed output. However, a drawback is that every additional step increases the application’s latency and cost. The skill lies in maintaining simplicity while still achieving specificity and depth in the output.
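As an illustrative sketch, a first prompt classifies the question and an IF-ELSE statement routes it to a specialized follow-up prompt. The topics and wording are invented for illustration. (call_llm is the hypothetical helper from the first sketch.)

```python
# call_llm is the chat-completion helper from the first sketch.
def answer(question: str) -> str:
    # Step 1: a cheap classification prompt decides the route.
    topic = call_llm(
        "Classify this question as exactly one of: NUTRITION, SLEEP, OTHER.\n"
        f"Question: {question}"
    ).strip().upper()

    # Step 2: classic IF-ELSE control flow picks a specialized prompt.
    if topic == "NUTRITION":
        prompt = f"As a nutrition coach, answer briefly:\n{question}"
    elif topic == "SLEEP":
        prompt = f"As a sleep coach, answer briefly:\n{question}"
    else:
        # Fallback route; note that every extra hop adds latency and cost.
        prompt = f"Answer briefly, noting any limits of your knowledge:\n{question}"
    return call_llm(prompt)
```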
4. Model Tuning
Fine-tuning involves training a pre-existing model with your own data. This process is significantly more time-consuming than optimizing and iterating on the prompt. Moreover, there are limited empirical examples illustrating how to effectively fine-tune in practice, given that every use case has unique elements. The general consensus suggests that at least 100-200 examples are required to see an impact on model performance. For certain use cases, the requirement could be closer to 1,000 examples. Eventually, the effect of additional examples reaches a plateau, and the effort invested in adding more examples yields minor or no improvements in performance.
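As a rough sketch, fine-tuning data is typically prepared as one JSON object per line (JSONL). The prompt/completion shape below follows a common hosted fine-tuning format, but field names vary by provider, and the example content here is invented.

```python
import json

# Each training example is one JSON object per line (JSONL). Check your
# provider's documentation for the exact schema; content is illustrative.
examples = [
    {
        "prompt": "User note: slept 4 hours, anxious before a presentation.\nCoach reply:",
        "completion": " Acknowledge the short night and suggest a brief breathing exercise.",
    },
    # ...repeat; the rule of thumb above suggests at least 100-200 pairs.
]

with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```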
Truth be told, we believe it’s worth investing in the fine-tuning process, as the outcomes can be a game-changer. A significant part of the success of ChatGPT can be attributed to fine-tuning GPT-3.5 with Reinforcement Learning from Human Feedback (RLHF). This underscores how crucial targeted adjustments can be in enhancing the performance of AI models and transforming the overall user experience.
5. Long-term Memory
Enhancing the quality of output often requires more than simply improving the model, especially if the input data is ambiguous or lacks contextual information. In a health context, this could mean information like age, relationship status, or profession. In settings such as therapy, the patient’s situation plays a critical role when interventions are introduced. For instance, having children can significantly change how relationship conflicts are addressed: both the patient and the therapist need to work together to find what’s best not only for the patient but also for the children.
When we interact with people, we can usually assume they remember our past conversations. Large Language Models (LLMs), on the other hand, don't inherently have this long-term memory capability. This feature needs to be designed into the system. This design process involves both storing the information and retrieving the correct data when needed for a specific task.
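Here is a deliberately naive sketch of such a memory: facts are stored as plain strings and retrieved by keyword overlap before being prepended to the prompt. Production systems would typically use a database or embedding-based similarity search instead. (call_llm is the hypothetical helper from the first sketch.)

```python
# call_llm is the chat-completion helper from the first sketch.
memory: list[str] = []  # in practice: a database or vector store

def remember(fact: str) -> None:
    memory.append(fact)

def answer_with_memory(question: str) -> str:
    # Naive retrieval by keyword overlap; real systems typically use
    # embedding similarity search to find the relevant facts.
    words = set(question.lower().split())
    relevant = [fact for fact in memory if words & set(fact.lower().split())]
    context = "\n".join(relevant)
    return call_llm(
        f"Known about the user:\n{context}\n\nQuestion: {question}"
    )

remember("The user has two children, ages 4 and 7.")
remember("The user works night shifts as a nurse.")
```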
6. Connecting more data sources
Additional context can be provided not only through long-term memory but also by integrating other data sources into the system. For instance, in a diabetes AI application, it would be beneficial to incorporate information about physical activity, continuous glucose monitoring, and insulin doses for the day or month. This additional data can significantly enhance the system's ability to provide meaningful and personalized support.
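A sketch of what this could look like for the diabetes example; the fetch_* functions are hypothetical stand-ins for whatever CGM, fitness, and insulin-pump integrations the app has. (call_llm is the hypothetical helper from the first sketch.)

```python
from datetime import date

# call_llm is the chat-completion helper from the first sketch. The
# fetch_* functions are hypothetical stand-ins for real integrations.
def daily_checkin(user_id: str) -> str:
    today = date.today()
    glucose = fetch_glucose_readings(user_id, today)   # e.g. CGM export
    activity = fetch_activity_minutes(user_id, today)  # e.g. fitness tracker API
    insulin = fetch_insulin_doses(user_id, today)      # e.g. insulin pump log
    return call_llm(
        "Write a short, supportive daily check-in for a person with diabetes.\n"
        f"Glucose readings (mg/dL): {glucose}\n"
        f"Active minutes: {activity}\n"
        f"Insulin doses (units): {insulin}"
    )
```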
The Challenge of Testing and Monitoring AI Systems
Let’s say we have successfully tackled the ambiguity issues and implemented the above strategies. A new challenge promptly presents itself: how do we build reliable testing systems that ensure these complex AI models don’t malfunction without our noticing? And how do we monitor alignment to verify that our continuous efforts are indeed enhancing the system? This becomes increasingly crucial as systems grow in complexity.
Testing domain expertise
Since the output is generated in natural language and may draw on specific domain knowledge, verifying the results can be challenging. For instance, if you’re building an AI application that interprets blood test results to propose medical interventions, the validity of the results would be difficult to confirm without a medical professional’s expertise. This highlights the need for interdisciplinary collaboration in developing and assessing complex AI applications.
There could be AI agents designed to add an additional layer of quality control by verifying some of the results. They could potentially identify and correct hallucinations and other inaccuracies. However, even these AI agents would require oversight and periodic review to ensure their effectiveness and accuracy. Thus, a combination of AI and human expertise can be used to ensure the highest level of quality and accuracy in AI applications.
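One minimal sketch of such a verifier: a second model call checks the draft answer against its source material and escalates anything unsupported to a human reviewer. (call_llm is the hypothetical helper from the first sketch; escalate_to_human is a hypothetical hook into a human review queue.)

```python
# call_llm is the chat-completion helper from the first sketch;
# escalate_to_human is a hypothetical hook into a human review queue.
def verify(draft: str, source: str) -> bool:
    verdict = call_llm(
        "Does the ANSWER make any claim not supported by the SOURCE? "
        "Reply with exactly SUPPORTED or UNSUPPORTED.\n"
        f"SOURCE:\n{source}\n\nANSWER:\n{draft}"
    )
    return verdict.strip().upper().startswith("SUPPORTED")

def answer_with_review(question: str, source: str) -> str:
    draft = call_llm(f"Answer using only this source:\n{source}\n\nQ: {question}")
    if not verify(draft, source):
        return escalate_to_human(draft)
    return draft
```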
The Future of AI Testing and Monitoring
Unlike traditional code, a prompt that is accidentally altered won’t trigger an error, because there is no compiler to detect the deviation. The LLM will simply process the altered prompt and produce an output based on it. This underscores the need for careful prompt design and routine checks to ensure that the prompt remains aligned with the intended task. It’s a unique challenge of AI systems that they do not inherently understand our intentions; they operate strictly on the information provided to them.
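One cheap routine check we can borrow from traditional testing is a snapshot test on the prompt text itself, so that an accidental edit fails the test suite instead of silently changing behavior. The prompt text and digest value below are illustrative.

```python
import hashlib

# The prompt our application ships with (text illustrative).
SUMMARY_PROMPT = (
    "Write a two-sentence summary for a clinician based only on these facts:"
)

# Digest recorded when the prompt was last reviewed (value illustrative).
EXPECTED_SHA256 = "..."

def test_summary_prompt_unchanged():
    digest = hashlib.sha256(SUMMARY_PROMPT.encode()).hexdigest()
    assert digest == EXPECTED_SHA256, (
        "Prompt text changed; re-review its outputs before shipping."
    )
```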
In the long-term, we anticipate the emergence of more sophisticated tools designed to enhance the testing and monitoring of our AI systems. These innovations will undoubtedly streamline the integration of AI technologies into our products, ensuring they function optimally and deliver the desired outcomes.
However, until such advancements materialize, we are in an exciting position of invention and discovery. We’re effectively tasked with “reinventing the wheel” to move forward with our AI integration endeavors.
Conclusion
While it remains essential to strive for the most robust AI systems possible, it's unsurprising that Large Language Models (LLMs), trained on human reasoning, embody not only our capacity to make sense of the world but also the same flaws intrinsic to human nature. Recognizing this dual inheritance of strengths and weaknesses can help us set realistic expectations for AI performance and guide our ongoing efforts to improve these systems.
In developing AI applications with potentially lower robustness, we believe we need to cultivate a new mindset concerning software applications. Most of the tasks we envision introducing AI to in the future are currently performed by humans. When considering these human tasks, we must acknowledge that errors, inconsistencies, variations in robustness, and biases are inherent to the process. For instance, consulting ten different doctors could potentially yield ten distinct diagnoses.
Given these various strategies for addressing challenges in building AI applications and features, it becomes imperative for product teams to learn how to effectively allocate their limited resources to achieve the best results. Decisions, such as when to apply fine-tuning or iterate the prompt, become critical. Estimating the Return on Investment (ROI) for specific engineering work becomes increasingly complex as systems grow, and the effectiveness of certain techniques may diminish over time. Balancing these considerations will be central to successfully deploying AI in product development.