Data Preparation for RAG - Part 1

Akash Deep
5 min read · Mar 2, 2024
Retrieval Augmented Generation

In this blog, we aim to thoroughly dissect and understand the entire RAG (Retrieval-Augmented Generation) process, with a special emphasis on the Data Preparation phase. This stage is crucial for setting the groundwork for effective data utilisation in later stages. Following this, we will advance our discussion to explore the subsequent phases of the RAG journey, focusing on Data Retrieval. We’ll examine strategies for leveraging the retrieved data efficiently, ensuring it’s optimally formatted and ready for submission to Large Language Models (LLMs) for insightful answers.

Moreover, I will introduce and share several Python helper classes and functions specifically designed to facilitate our understanding and manipulation of file data. These tools are invaluable for distinguishing between different data types within a file, such as identifying the number of tables it contains, distinguishing normal text from structured data, and recognising headers and footers. These resources aim to significantly streamline the process of data extraction and preparation, making it more accessible and less time-consuming for everyone involved.

What Is RAG?

RAG, which stands for Retrieval-Augmented Generation, hinges on two pivotal elements. The first is the preparation of your data to ensure its readiness for efficient retrieval. This involves a crucial preprocessing step, where the focus is on constructing a pipeline capable of handling any type of data. The objective is to process this data in an optimized manner by extracting relevant information from files and then creating chunks from the extracted data. Thus, the process is twofold: 1) extracting the content, and 2) creating chunks from the extracted content. The second element is retrieval itself: fetching the most relevant chunks at query time and passing them to an LLM to generate an answer. Fortunately, there is a Python library named Unstructured that handles both extraction and chunking seamlessly.
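Here is a minimal sketch of those two steps with the Unstructured library. The file name, strategy, and chunk size are illustrative placeholders, not values from this post:

```python
from unstructured.partition.pdf import partition_pdf
from unstructured.chunking.title import chunk_by_title

# Step 1: extract the content as a list of typed elements (Title, NarrativeText, Table, ...)
elements = partition_pdf(
    filename="example.pdf",       # placeholder path, use your own document
    strategy="hi_res",            # layout-model-based extraction
    infer_table_structure=True,   # keep an HTML rendering of detected tables
)

# Step 2: create chunks from the extracted elements
chunks = chunk_by_title(elements, max_characters=1000)

for chunk in chunks[:3]:
    print(chunk.metadata.page_number, chunk.text[:80])
```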

In this blog, our primary focus will be on delving into the Data Preparation process. In forthcoming blogs, I will explore further topics, including how to create an Embedding from these chunks and how to store them in a Vector Database. Additionally, I will discuss the intricacies of indexing logic on Vector databases and more. This series aims to provide a comprehensive understanding of the entire RAG process, from data preparation to the advanced techniques of data utilisation and optimization.

Unstructured

Unstructured is a Python package designed to streamline the loading and processing of files, and it offers two chunking strategies: basic and by_title. Additionally, Unstructured leverages advanced deep learning models, such as YOLOX and Meta's Detectron2, to categorise document content. By attaching a metadata label named ‘category’, it classifies the content types within a document, including Table, Narrative Text, Header, Footer, etc. For each categorisation, Unstructured assigns a ‘detection_class_prob’ value, which indicates the model’s confidence in its classification of that specific element. This approach not only enhances the precision of data extraction and classification but also significantly improves the efficiency of document processing workflows.

basic
by_title
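The sketch below shows both strategies side by side, assuming a recent version of the unstructured library in which both chunkers are importable; the file name and size limits are illustrative:

```python
from unstructured.partition.auto import partition
from unstructured.chunking.basic import chunk_elements   # the "basic" strategy
from unstructured.chunking.title import chunk_by_title   # the "by_title" strategy

elements = partition(filename="example.pdf")  # placeholder file

# basic: packs consecutive elements into chunks up to a size limit
basic_chunks = chunk_elements(elements, max_characters=800)

# by_title: additionally starts a new chunk whenever a new section title appears
title_chunks = chunk_by_title(elements, max_characters=800, combine_text_under_n_chars=200)

# Each element carries a 'category' label; with the hi_res strategy the layout model
# also records its confidence as 'detection_class_prob' in the element metadata.
for el in elements[:5]:
    print(el.category, getattr(el.metadata, "detection_class_prob", None))
```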

For those interested in delving deeper into the capabilities and features of the Unstructured package, the API documentation provides a wealth of information. You can access it through the following link: Unstructured API Documentation. This resource is invaluable for understanding the full range of functionalities offered by Unstructured, including detailed explanations on how to utilize the package for data processing and categorisation tasks effectively.

I have developed a custom class built on top of the Unstructured package, offering easy-to-use methods as detailed below.

The get_category_counts() method provides the count of all categories identified within a specific document. It returns a dictionary with the name of the category as the key and the count as the value. This method is particularly useful for gaining a quick overview of the content distribution across different categories within the document.
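As a hedged sketch (not the exact implementation from the repository), such a method can be built by counting the category labels on the elements that partitioning returns:

```python
from collections import Counter
from unstructured.partition.pdf import partition_pdf

def get_category_counts(elements):
    """Return {category_name: count} for a list of unstructured elements."""
    return dict(Counter(el.category for el in elements))

elements = partition_pdf(filename="example.pdf", strategy="hi_res")
print(get_category_counts(elements))
# e.g. {'Title': 12, 'NarrativeText': 87, 'Table': 4, 'Header': 6, 'Footer': 6}  (illustrative values)
```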

Following that, the fetch_content_by_category('Table') method retrieves content for a category identified with get_category_counts(). This targeted approach lets users efficiently access and manipulate the data of interest within the document, streamlining content extraction by category.
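A minimal sketch of this kind of filter, again assuming elements produced by unstructured's partitioning:

```python
from unstructured.partition.pdf import partition_pdf

def fetch_content_by_category(elements, category):
    """Return the elements whose category matches, e.g. 'Table' or 'NarrativeText'."""
    return [el for el in elements if el.category == category]

elements = partition_pdf(filename="example.pdf", strategy="hi_res", infer_table_structure=True)
for table in fetch_content_by_category(elements, "Table"):
    # With infer_table_structure=True, tables also carry an HTML rendering in their metadata.
    print(table.metadata.text_as_html or table.text)
```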

Lastly, the enhance_table_content() method is designed to merge table content. For instance, if you are processing a document containing a large table that spans multiple pages, this method automatically identifies such tables using a parent_id. It then identifies candidates for merging and provides the final merged content along with the parent_id. After obtaining the final content, it is up to the user to decide how to break the table content into chunks, whether to keep the table content as is for ingestion, or to apply advanced retrieval techniques.
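The sketch below illustrates the general idea: group Table elements that share a parent_id and merge their text so a table split across pages can be treated as one piece of content. The grouping key and merge rule here are assumptions for illustration, not the exact logic of the class described above:

```python
from collections import defaultdict
from unstructured.partition.pdf import partition_pdf

def enhance_table_content(elements):
    """Return {parent_id: merged_table_text} for tables that share a parent_id."""
    groups = defaultdict(list)
    for el in elements:
        if el.category == "Table":
            groups[el.metadata.parent_id].append(el.text)
    # Only groups with more than one fragment are candidates for merging.
    return {pid: "\n".join(parts) for pid, parts in groups.items() if len(parts) > 1}

elements = partition_pdf(filename="example.pdf", strategy="hi_res", infer_table_structure=True)
merged_tables = enhance_table_content(elements)
```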

In the upcoming part of this blog, I will delve into post-processing, as well as the best practices for embedding and chunking techniques. For now, let’s focus on gaining a thorough understanding of your document before making any decisions regarding chunking and embedding strategies. This foundational knowledge is crucial for optimising document processing and enhancing the efficiency of data retrieval and utilisation.

Additionally, as we delve deeper into the data across various categories, we’ll uncover numerous duplicates within different categories, which need to be addressed appropriately. In future segments of this series, I plan to discuss strategies for managing such instances. Furthermore, I will introduce some practical automatic functions designed to tackle these cases effectively. It’s important to note that not all duplicates can be simply deleted; we must carefully consider their positions within the document. This crucial information is obtained from the metadata of the processed document after utilising the CategoryContentFetcher class. Such insights will enable us to make informed decisions about how to handle duplicates, ensuring that our data cleaning and preparation efforts are both efficient and tailored to the specific structure and content of our documents.
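As a rough illustration of why position matters, the sketch below flags repeated text and records the pages on which each copy appears, using the page_number field from the element metadata; it is a simplified stand-in for the metadata-driven handling described above, not the actual helper functions:

```python
from collections import defaultdict
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(filename="example.pdf", strategy="hi_res")

# Group identical (category, text) pairs and remember the pages they appear on.
occurrences = defaultdict(list)
for el in elements:
    occurrences[(el.category, el.text.strip())].append(el.metadata.page_number)

duplicates = {key: pages for key, pages in occurrences.items() if len(pages) > 1}
for (category, text), pages in duplicates.items():
    print(f"{category} repeated on pages {pages}: {text[:60]}")
```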

Below is the link to the GitHub repository containing the usage.ipynb notebook, which demonstrates how to use the class effectively. To execute the notebook, ensure you have a valid PDF document in the directory. Your feedback is invaluable, so please do not hesitate to post suggestions or comments. If you require a more detailed explanation, find anything unclear, or have any confusion regarding the repository, feel free to reach out.

If you want to dive deeper into deep learning topics, consider checking out my other blogs. Below are the links:
