Aimpoint Digital: AI Agent Systems for Building Travel Itineraries
Summary
This blog explores how Aimpoint Digital leveraged AI agent systems and Databricks to build travel itineraries. By combining Retrieval-Augmented Generation (RAG) frameworks and AI agent systems, the architecture integrates up-to-date data from vector databases for places, restaurants, and events. The prompts were optimized for high-quality outputs, ensuring precise, personalized itineraries are generated within seconds. This design simplifies travel planning and lets our customers plan their vacations with ease.
Going on vacation is an enjoyable experience, but planning the trip takes time and effort for most people. There are numerous places to visit, countless restaurants to dine at, and endless reviews to sift through before making decisions. According to a recent research poll by Expedia, travelers spend over 5 hours researching and planning trips. People often visit around 270 web pages before finalizing their trip activities, and this process can start as early as 45 days before the trip. Planning trips can be overwhelming due to the sheer number of choices. Could we leverage GenAI to streamline this process and produce an itinerary in 30 seconds or less? What if travelers had a personal agent to tailor and customize the activities in their itineraries? In this blog, we dive into the details of an AI agent system we developed with the Databricks Data Intelligence Platform to build travel itineraries.
Approach
Generative AI has dramatically shaped the travel industry in the past few years. Standalone GenAI tools like ChatGPT can generate travel itineraries. Still, the itineraries can be misleading or incorrect because they are based on LLMs that lack up-to-date information. For example, imagine a traveler planning a trip to Morocco in December 2024. An LLM last trained in December 2023 is unlikely to be aware of a restaurant that closed in July 2024 and may incorrectly recommend it to a traveler. Most LLMs are not trained or fine-tuned with recent data and suffer from this “recency issue.” Another challenge is that LLMs may be prone to hallucinating or making up inaccurate information.
Using Retrieval-Augmented Generation (RAG) allows an LLM to augment its training data with recent information, which addresses both the recency and hallucination issues. RAG frameworks overcome the recency issue by maintaining regularly updated databases containing the latest information. These vector databases (one example is Databricks Mosaic AI Vector Search) store relevant data as vectorized embeddings. In our solution, the vector database is updated nightly with data on attractions, including their opening and closing hours. This RAG framework can power a GenAI application that retrieves the most relevant places based on a traveler's interests and formulates an accurate itinerary.
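To make the retrieval step concrete, here is a minimal sketch of similarity search over an in-memory collection of embeddings. The records, embeddings, and dimensionality are invented for illustration; a production system would use a learned embedding model and a managed index such as Databricks Mosaic AI Vector Search.

```python
import math

# Toy in-memory stand-in for a vector database. The three records and their
# 3-dimensional embeddings are made up for illustration only.
VECTOR_DB = [
    {"name": "Louvre Museum",    "embedding": [0.9, 0.1, 0.0]},
    {"name": "Le Petit Bistro",  "embedding": [0.1, 0.9, 0.1]},
    {"name": "Seine Jazz Night", "embedding": [0.0, 0.2, 0.9]},
]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_embedding, k=2):
    """Return the names of the k records most similar to the query embedding."""
    ranked = sorted(
        VECTOR_DB,
        key=lambda rec: cosine_similarity(query_embedding, rec["embedding"]),
        reverse=True,
    )
    return [rec["name"] for rec in ranked[:k]]

# A query embedding leaning toward museums/attractions.
print(retrieve([1.0, 0.2, 0.0], k=2))
```

A managed vector search service performs the same nearest-neighbor lookup, but over millions of embeddings with an approximate index rather than a linear scan.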
AI Agent Systems
An itinerary is seldom complete with just places of interest; travelers may seek information on restaurants and events happening at their destination. To solve this problem, we combined multiple RAGs in our architecture (one for places, one for restaurants and one for events) in an AI agent system.
AI agent systems represent the evolution of GenAI architecture from reliance on a single LLM to integrating multiple components: retrievers, models, and tools. Systems incorporating multiple interacting components have been shown to outperform standalone AI models on a wide range of standard benchmarks. In a recent research paper from June 2024, researchers showed that allowing LLMs with predefined roles to interact with each other enables them to produce quality code for software engineering tasks. These LLMs have detailed role descriptions (Developer, Senior Developer, Project Manager, etc.) and take turns writing, reviewing, and testing software code. This is an excellent example of an AI agent system in which a group of LLMs (or agents) performs better than any standalone LLM. Given the clear advantages of these systems, we decided to build an AI agent system for our itinerary generation tool.
User Query for Itinerary Generation
We need to collect information on a traveler’s plans and interests to generate relevant and valuable itineraries. Some of these parameters are destination city, destination country, dates of travel, travel purpose (business, leisure, recreation, etc.), travel companion(s) (friends, partner, solo, etc.) and budget. Inputs generated from the user query are passed through the embedding model and used to retrieve the places, restaurants and events that closely align with the traveler profile.
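A minimal sketch of how these inputs might be collected and flattened into a single query string for the embedding model. The `TravelerProfile` fields and the wording of the query are hypothetical, not the exact schema used in the system.

```python
from dataclasses import dataclass

# Hypothetical traveler profile; field names are illustrative.
@dataclass
class TravelerProfile:
    city: str
    country: str
    start_date: str
    end_date: str
    purpose: str       # e.g. "leisure", "business"
    companions: str    # e.g. "partner", "solo", "friends"
    budget: str        # e.g. "moderate"
    interests: list

def profile_to_query(profile: TravelerProfile) -> str:
    """Flatten the structured inputs into one query string for the embedding model."""
    return (
        f"{profile.purpose} trip to {profile.city}, {profile.country} "
        f"from {profile.start_date} to {profile.end_date}, traveling with "
        f"{profile.companions}, {profile.budget} budget, interested in "
        f"{', '.join(profile.interests)}"
    )

profile = TravelerProfile(
    city="Paris", country="France",
    start_date="2024-12-20", end_date="2024-12-23",
    purpose="leisure", companions="partner",
    budget="moderate", interests=["museums", "jazz", "bakeries"],
)
print(profile_to_query(profile))
```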
Motivating the Architecture
To generate itineraries with places, restaurants, and events, our architecture consists of three RAGs configured in parallel. The user query is converted to a vector using the embedding model, and the retriever in each RAG retrieves the top matches from its respective Vector Search index. The number of retrieved matches corresponds to the length of the trip: shorter trips require fewer activities, while longer trips require more. Our system is configured to retrieve three places or events and three restaurants per day (breakfast, lunch, and dinner).
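The sizing rule described above can be sketched as a small helper. The constants follow the three-places/three-meals-per-day figure from the text; the function name is our own.

```python
from datetime import date

# Sizing rule from the text: roughly three places or events and three
# restaurants (breakfast, lunch, dinner) per day of the trip.
PLACES_PER_DAY = 3
MEALS_PER_DAY = 3

def retrieval_counts(start: date, end: date) -> dict:
    """Derive how many matches each retriever should return for a trip."""
    days = (end - start).days + 1  # inclusive of both travel days
    return {
        "places_and_events": days * PLACES_PER_DAY,
        "restaurants": days * MEALS_PER_DAY,
    }

# 4-day trip -> 12 places/events and 12 restaurants
print(retrieval_counts(date(2024, 12, 20), date(2024, 12, 23)))
```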
Our solution utilizes two Vector Search indexes, providing flexibility to support future expansion to hundreds of European cities. We collected data on ~500 restaurants in Paris, with plans to scale to nearly 50,000 citywide. Each Vector Search index is deployed to a standalone Databricks Vector Search Endpoint, ensuring easy querying at runtime. Moreover, all our source tables containing raw information about attractions, restaurants, and events are Delta tables with Change Data Feed enabled. This ensures that any changes to the raw data automatically update the Vector Search indexes without manual intervention. Calls are made to the three RAGs in parallel to gather recommendations.
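The parallel fan-out to the three RAGs could look like the following sketch, with trivial stand-ins for each retriever; `ThreadPoolExecutor` is one simple way to issue the three calls concurrently. The retriever bodies and result values are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

# Simplified stand-ins for the three retrievers; each would normally
# query its own Vector Search index with the embedded user query.
def retrieve_places(query):      return ["Louvre Museum", "Musée d'Orsay"]
def retrieve_restaurants(query): return ["Le Petit Bistro"]
def retrieve_events(query):      return ["Seine Jazz Night"]

def gather_recommendations(query):
    """Fan out to the three retrievers in parallel and collect their results."""
    retrievers = {
        "places": retrieve_places,
        "restaurants": retrieve_restaurants,
        "events": retrieve_events,
    }
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in retrievers.items()}
        return {name: future.result() for name, future in futures.items()}

print(gather_recommendations("leisure trip to Paris"))
```

Because the three retrievers are independent, running them concurrently means end-to-end retrieval latency is roughly that of the slowest call rather than the sum of all three.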
The final call in the sequence is made to an LLM to synthesize the responses. Once the RAGs have retrieved places, restaurants, and events, the LLM combines the recommendations into a cohesive itinerary. We use open source LLMs like DBRX Instruct and Meta-Llama-3.1-405B-Instruct on Databricks, served via Provisioned Throughput Endpoints with built-in guardrails to prevent misuse of the AI agent system.
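A sketch of how the retrieved recommendations might be assembled into a single prompt for the synthesis LLM. The prompt wording and function name are illustrative; the production prompt was tuned separately (see the prompt-optimization section below).

```python
def build_synthesis_prompt(profile_summary, places, restaurants, events):
    """Assemble the final prompt sent to the synthesis LLM."""
    return "\n".join([
        "You are a travel planner. Build a day-by-day itinerary.",
        f"Traveler: {profile_summary}",
        "Places: " + "; ".join(places),
        "Restaurants: " + "; ".join(restaurants),
        "Events: " + "; ".join(events),
        "Schedule breakfast, lunch, and dinner each day "
        "and keep nearby activities together.",
    ])

prompt = build_synthesis_prompt(
    "4-day leisure trip to Paris with partner",
    ["Louvre Museum"], ["Le Petit Bistro"], ["Seine Jazz Night"],
)
print(prompt)
```

The resulting string would then be sent to the model serving endpoint; that call is omitted here since it requires live credentials.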
Retrieval Metrics
We used a collection of metrics to evaluate the performance of our retrievers for restaurants, places of attraction and events.
- Precision at k: Simply put, precision at k tells us what fraction of the top k retrieved documents are actually relevant to a given query. If no documents are retrieved, precision at k is 0. See the MLflow documentation for the exact definition of precision at k.
- Recall at k: Recall at k is the fraction of all relevant documents in the population that appear among the k retrieved documents. If no ground truth documents are specified, recall at k is defined as 1. See the MLflow documentation on recall at k.
- NDCG at k: Normalized Discounted Cumulative Gain (NDCG) at k uses a relevance score to evaluate the retriever. A binary score is assigned: relevance = 1 for retrieved documents in the ground truth, and relevance = 0 for retrieved documents not in the ground truth. Given these relevance scores, NDCG builds on the concept of cumulative gain (CG), which measures the total number of relevant documents retrieved at a set threshold (k). For example, if your retriever retrieves the top 10 documents (k = 10) and only 7 of them are part of the ground truth, then CG is 7.
CG, however, does not capture where those 7 correctly retrieved documents rank: a relevant document at position 2 of 10 should count for more than one at position 9 of 10. To account for this, Discounted Cumulative Gain (DCG) applies a logarithmic discount to relevant documents retrieved at lower ranks. Normalizing the DCG by the ideal DCG (the DCG of a perfect ranking) gives NDCG. See the MLflow documentation on NDCG at k.
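The three retrieval metrics above can be computed from scratch in a few lines. This is a sketch using binary relevance; MLflow's built-in implementations may differ in edge-case handling.

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc in top_k if doc in relevant) / len(top_k)

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents found in the top-k results."""
    if not relevant:
        return 1.0  # no ground truth documents specified
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG: discount each hit by the log of its rank,
    then normalize by the DCG of a perfect ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so rank + 2
        for rank, doc in enumerate(retrieved[:k])
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

retrieved = ["a", "b", "c", "d"]
relevant = {"a", "c"}
print(precision_at_k(retrieved, relevant, 4))  # 0.5
print(recall_at_k(retrieved, relevant, 4))     # 1.0
```

Note how NDCG rewards rank: moving the hit at position 3 up to position 2 would raise the score even though precision and recall stay the same.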
LLM-as-a-Judge
We used an LLM to evaluate travel itineraries for professionalism. This is an automated way to evaluate responses from AI agent solutions without ground truths. The LLM requires the following inputs to do a good job of evaluating the itineraries.
- Metric Definition: A clear definition for the metric that the LLM is evaluating. This definition will tell the LLM what aspect of the response needs to be evaluated.
- Rubric: A well-defined rubric that acts as a scoring guide for the LLM. Our scoring guide used a 1-5 scale with a clear description of the level of professionalism required for each score. To avoid confusing the LLM, it is important that the different score levels be as distinct as possible.
- Few-Shot Examples: Example itineraries of varying levels of professionalism, which guide the LLM to assign the correct score.
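Putting the three ingredients together, a judge prompt might be assembled as in the sketch below. The metric definition, rubric wording, and few-shot examples are invented placeholders, not the project's actual rubric.

```python
# Illustrative ingredients for an LLM-as-a-judge prompt.
METRIC_DEFINITION = (
    "Professionalism: the itinerary uses polite, clear, well-organized "
    "language appropriate for a customer-facing travel product."
)

RUBRIC = {
    1: "Rude, disorganized, or incomplete.",
    3: "Adequate but with casual phrasing or formatting lapses.",
    5: "Polished, courteous, and consistently well structured.",
}

FEW_SHOT_EXAMPLES = [
    ("Day 1: uh just go see stuff near the river lol", 1),
    ("Day 1: Morning visit to the Louvre; lunch at a nearby bistro.", 5),
]

def build_judge_prompt(itinerary: str) -> str:
    """Combine the metric definition, rubric, and few-shot examples
    into one prompt for the judge LLM."""
    rubric_lines = [f"Score {s}: {desc}" for s, desc in sorted(RUBRIC.items())]
    example_lines = [
        f"Example (score {score}): {text}" for text, score in FEW_SHOT_EXAMPLES
    ]
    return "\n".join(
        [METRIC_DEFINITION, *rubric_lines, *example_lines,
         f"Now score this itinerary from 1 to 5:\n{itinerary}"]
    )

print(build_judge_prompt("Day 1: Morning at the Musée d'Orsay."))
```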
The following are some of our responses evaluated by the LLM-as-judge along with justifications on why responses were scored a certain way.
Optimizing the Prompt
The prompts to the LLMs in our architecture are critical to the quality and format of the final synthesized itinerary. We observed that minor changes to a prompt can have significant, unintended consequences on the output. To mitigate this, we used a package called DSPy. DSPy uses an LLM-as-a-judge in conjunction with a custom-defined metric to evaluate responses against a ground truth dataset. Our custom metric used the following rubric to assess responses:
- Is the itinerary complete? Does it match what the traveler has indicated in the prompt?
- Can the traveler reasonably commute between the places on the itinerary via public transportation, etc.?
- Is the response using polite and cordial language?
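As a sketch, a custom metric with DSPy's usual `(example, prediction, trace)` signature might encode the rubric as simple heuristic checks. The checks below are deliberately naive stand-ins for the real metric, and all names are our own.

```python
# Simplified stand-in for the custom metric passed to a DSPy optimizer.
# DSPy metrics take (example, prediction, trace) and return a score; the
# heuristic checks below are illustrative, not the production logic.
POLITE_MARKERS = ("please", "enjoy", "welcome", "recommend")

def itinerary_metric(example, prediction, trace=None) -> float:
    """Score an itinerary on completeness, feasibility hints, and tone."""
    text = prediction["itinerary"].lower()
    score = 0.0
    # Completeness: every requested day appears in the output.
    if all(f"day {d}" in text for d in range(1, example["num_days"] + 1)):
        score += 1.0
    # Feasibility hint: the itinerary mentions how to get between stops.
    if any(word in text for word in ("metro", "walk", "bus", "train")):
        score += 1.0
    # Tone: polite, cordial language.
    if any(word in text for word in POLITE_MARKERS):
        score += 1.0
    return score / 3.0

example = {"num_days": 2}
prediction = {"itinerary": "Day 1: Louvre, then walk to lunch. Day 2: enjoy the Marais."}
print(itinerary_metric(example, prediction))  # 1.0
```

In practice, each rubric item would itself be judged by an LLM rather than by string matching; the point is that the metric reduces the rubric to a single number the optimizer can maximize.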
We noticed that using DSPy to optimize prompts yielded precise prompts that were hyper-focused on the outcomes. Any additional language to force the LLM to respond in a particular manner was eliminated. It is important to note that the quality of the optimized prompt depends significantly on the custom metric defined and the quality of the ground truths.
A Note on Tool Calling
Our architecture utilizes an AI agent system that makes three parallel calls to retrieve recommendations for places, restaurants, and events. Once the top options are collected, a final call is made to an LLM to synthesize these recommendations into a cohesive itinerary. The sequence in which the components of our AI system are invoked remains fixed, and we found that this consistently produced reliable itineraries.
An alternative approach would involve using another LLM to dynamically determine which tools to call and in what order, based on the traveler's preferences. For example, if the traveler is not interested in events, the Events RAG would not be triggered. This method, known as tool calling, can tailor the itinerary more effectively to the traveler’s needs. It may also improve latency by skipping unnecessary tools. However, we observed that the itineraries generated using tool calling were less consistent, and the LLM responsible for selecting the appropriate tools occasionally made errors.
While this approach did not align with our application, it is worth highlighting that using LLMs for tool calling is still an emerging area of research with significant potential for future development.
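For completeness, the tool-calling alternative can be sketched with a rule-based stand-in for the tool-selection step; in a real system the `select_tools` decision would be made by an LLM given the traveler's preferences, and the tool bodies would be the actual retrievers.

```python
# Placeholder tools returning canned results for illustration.
TOOLS = {
    "places": lambda q: ["Louvre Museum"],
    "restaurants": lambda q: ["Le Petit Bistro"],
    "events": lambda q: ["Seine Jazz Night"],
}

def select_tools(preferences: dict) -> list:
    """Stand-in for the LLM's tool-selection step: skip the Events RAG
    when the traveler is not interested in events."""
    selected = ["places", "restaurants"]
    if preferences.get("wants_events", True):
        selected.append("events")
    return selected

def plan_with_tools(query: str, preferences: dict) -> dict:
    """Invoke only the selected tools, in the chosen order."""
    return {name: TOOLS[name](query) for name in select_tools(preferences)}

print(plan_with_tools("trip to Paris", {"wants_events": False}))
```

Skipping a tool saves a retrieval call, which is the latency benefit noted above; the trade-off is that the selection step itself can make mistakes.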
Conclusion
The AI-driven itinerary generation tool has demonstrated transformative potential in the travel industry. During development, the tool received overwhelmingly positive feedback from stakeholders, who appreciated the seamless planning experience and the accuracy of recommendations. The solution's scalability also ensures it can cater to a diverse range of travel destinations, making it adaptable for broader implementations. As this AI agent system evolves, we anticipate deeper integrations with dynamic pricing tools, enhanced contextual understanding of diverse travel preferences, and support for real-time itinerary adjustments.
About Aimpoint Digital
Aimpoint Digital is a market-leading analytics firm at the forefront of solving the most complex business and economic challenges through data and analytical technology. From integrating self-service analytics to implementing AI at scale and modernizing data infrastructure environments, Aimpoint Digital operates across transformative domains to improve the performance of organizations. Learn more by visiting: https://www.aimpointdigital.com/
This blog post was jointly authored by Elizabeth Khan (Aimpoint Digital), Vishaal Venkatesh (Aimpoint Digital) and Maria Zervou (Databricks).