Author's Commentary (Hong Kong China, 20/02/2026): The A.I. hackathon is getting closer, and I hope my experiences with these A.I. prototypes will be helpful. Next week, possibly on a Monday, I may demonstrate another prototype, perhaps with a focus on compound A.I. - the existence of many Python libraries makes such prototyping at a high level feasible.
0 Introduction
LLMs have become extremely hot recently, in part due to the influence of Silicon Valley and the corresponding "bubbles". Whatever the case, such topics cannot be ignored, and even though a recent report from MIT found that 95% of organizations that adopted LLMs have seen zero returns, the "hype", as they say, continues.
Yet LLMs clearly do have potential, as these are revolutionary models that provide state-of-the-art performance, specifically in natural language processing (NLP). So, as long as their strengths and weaknesses are appraised, they can be useful nonetheless.
So, this blog post presents a mini-project, still belonging to "machine learning" (the models are predominantly deep learning models, called via Python libraries), demonstrating the feasibility of interesting applications of RAG architectures utilizing LLMs. One key point must be remembered here: for the vast majority of specialized tasks, it is not the LLM doing the specialized work; rather, the LLM is responsible for smooth communication.
1 The RAG Architecture: Detecting Traffic Information and Relaying to a Chatbot
The prototyped architecture is simple. We begin with a pipeline in "vision.py", which loads .mp4 videos of vehicles travelling along highways and applies a computer vision model to extract various information from them. One video is tagged as "highway a", the other as "highway b". Information such as vehicle type, average speed, and color is extracted, along with video frame information.
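For the curious, a minimal sketch of what such a detection loop might look like with the "ultralytics" YOLO library. The class-id mapping follows the COCO labels that the pretrained YOLO models use; the model file name ("yolov8n.pt") and the tuple shape yielded are my illustrative assumptions, not the exact contents of vision.py:

```python
# COCO class ids for the vehicle classes we care about
# (pretrained ultralytics YOLO models are trained on COCO labels)
VEHICLE_CLASSES = {2: "car", 3: "motorcycle", 5: "bus", 7: "truck"}

def label_vehicle(class_id):
    """Map a YOLO class id to a vehicle type, or None for non-vehicles."""
    return VEHICLE_CLASSES.get(class_id)

def detect_vehicles(video_path):
    """Run YOLO tracking over a video and yield (frame_idx, vehicle_type, bbox)."""
    from ultralytics import YOLO  # heavy import kept local to the function
    model = YOLO("yolov8n.pt")
    for frame_idx, result in enumerate(model.track(video_path, stream=True)):
        for box in result.boxes:
            vtype = label_vehicle(int(box.cls))
            if vtype is not None:  # skip pedestrians, traffic lights, etc.
                yield frame_idx, vtype, box.xyxy[0].tolist()
```

The streaming generator keeps memory use flat even for long clips, which matters when everything runs on a CPU.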
The extracted information is then written to .txt and .json files: the .txt is useful for human inspection, and the .json, more so for automated extraction later.
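The dual output stage might look like the sketch below; the record field names ("highway", "vehicle_id", and so on) are my assumptions about the schema, not the pipeline's exact one:

```python
import json

def detections_to_records(detections, highway_id):
    """Flatten raw per-frame detections into records for .txt/.json output.

    `detections` is a list of dicts like
    {"frame": 12, "vehicle_id": 3, "type": "car", "speed_kmh": 62.0, "color": "blue"}.
    """
    return [{
        "highway": highway_id,
        "frame": d["frame"],
        "vehicle_id": d["vehicle_id"],
        "type": d["type"],
        "speed_kmh": round(d["speed_kmh"], 1),
        "color": d["color"],
    } for d in detections]

def write_outputs(records, json_path, txt_path):
    # .json for automated retrieval later, .txt for human inspection
    with open(json_path, "w") as f:
        json.dump(records, f, indent=2)
    with open(txt_path, "w") as f:
        for r in records:
            f.write(f"[{r['highway']}] frame {r['frame']}: "
                    f"{r['color']} {r['type']} (id {r['vehicle_id']}) "
                    f"at ~{r['speed_kmh']} km/h\n")
```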
Next, there is "retriever.py", which supports the RAG architecture. Specifically, this simple program opens the .json file produced by vision.py, then utilizes the "all-mpnet-base-v2" embedding model, a sentence transformer (a deep learning model) from the "sentence_transformers" library, to aid retrieval. Source attribution and confidence filtering were also implemented.
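A sketch of how embedding-based retrieval with confidence filtering can be wired up using the "sentence_transformers" library; the threshold value and the hit dictionary shape are my assumptions:

```python
def filter_hits(hits, threshold=0.3):
    """Confidence filtering: keep only documents whose cosine score clears
    the threshold, preserving source attribution for display later."""
    return [h for h in hits if h["score"] >= threshold]

def retrieve(query, documents, top_k=3):
    """Embed the query and documents with all-mpnet-base-v2 and return
    the top-k matches with scores and source attribution."""
    from sentence_transformers import SentenceTransformer, util  # heavy import kept local
    model = SentenceTransformer("all-mpnet-base-v2")
    doc_emb = model.encode([d["text"] for d in documents], convert_to_tensor=True)
    q_emb = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, doc_emb)[0]
    ranked = sorted(zip(documents, scores.tolist()),
                    key=lambda pair: pair[1], reverse=True)
    return [{"source": d["source"], "text": d["text"], "score": s}
            for d, s in ranked[:top_k]]
```

In a real pipeline the document embeddings would be computed once and cached, not re-encoded per query.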
Then, there is "rag.py", which contains the LLM, along with a simple user interface, using "tkinter", that was not separated into its own module. The LLM is "Qwen2.5-3B-Instruct", and there is also a verifier LLM, "deepseek-coder-1.3b-instruct".
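The generation step of rag.py can be sketched as follows: retrieved records are packed into a grounded prompt so the LLM answers from the vision pipeline's output rather than from memory. The prompt wording is my own illustrative assumption, and the Hugging Face model id "Qwen/Qwen2.5-3B-Instruct" is assumed:

```python
def build_prompt(question, retrieved):
    """Assemble a grounded prompt from retrieved traffic records,
    each tagged with its source for attribution."""
    context = "\n".join(f"[{h['source']}] {h['text']}" for h in retrieved)
    return ("Answer the question using only the traffic records below.\n\n"
            f"Records:\n{context}\n\n"
            f"Question: {question}\nAnswer:")

def generate(prompt):
    """Run the prompt through the small instruct model on CPU."""
    from transformers import pipeline  # heavy import kept local
    llm = pipeline("text-generation", model="Qwen/Qwen2.5-3B-Instruct")
    out = llm(prompt, max_new_tokens=128)
    # strip the echoed prompt, keep only the continuation
    return out[0]["generated_text"][len(prompt):].strip()
```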
Incidentally, I must state that the prototyped pipeline was run on a CPU; a fast CPU, to be sure, but still a CPU. I actually tried to access my AMD GPU, but it turned out that some of the libraries have not been updated for Python 3.13, and it was too much of a hassle. Over many iterations of prototyping, I've noticed that in RAG, the quality of the LLM, including its number of parameters, is of exceptional importance. The two models, Qwen2.5-3B-Instruct and deepseek-coder-1.3b-instruct, have 3 billion and 1.3 billion parameters, respectively. These are tiny models, but they proved sufficiently good for simple tasks, and achieve a compromise between efficiency, computational resources, accuracy, and convenience (for instance, I don't need to "log in"). In fact, I even tried extremely rudimentary models like GPT-2, which was illuminating to say the least (RAG performance is heavily dependent on LLM performance, which should be unsurprising).
Below is a depiction of the architecture,
2 Computer Vision with the YOLO Library
The computer vision simply utilizes the YOLO library. Now, it was tempting for me to develop a computer vision model from scratch, and I was thinking about some combination of CNN and RNN, where the RNN supplements temporal resolution (after all, it is known that, before the introduction of transformers and LLMs, the RNN was almost the temporal counterpart to the CNN). But this would have added significantly more time for debugging and iterating during rapid prototyping. Sure, CNNs were implemented almost from scratch in Machine Learning Mini-Project Series III, but the point of that blog was also to experience, to make mistakes, and to become more intimate with certain aspects of CNNs (which we will return to in the future, particularly on methods of automated hyperparameter optimization - that is why, obviously, MNIST was utilized to provide the datapoints).
Whatever the case, YOLO is reasonably useful, using bounding boxes. For color detection, we used the "cv2" library, which proved remarkably weak. Obviously, if this were a serious project, one that others relied on no less, these issues would have to be resolved, but it's just prototyping.
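To illustrate why cv2-style color detection is weak: it typically reduces a vehicle crop to a mean BGR triple and applies crude thresholds, which shadows and lighting easily fool. A minimal sketch of such a heuristic (the threshold values are my assumptions, not the project's exact ones):

```python
def classify_color(bgr):
    """Very rough color naming from a mean BGR triple.

    This is exactly the kind of heuristic that mislabels a black car
    as blue under certain lighting.
    """
    b, g, r = bgr
    if max(bgr) < 60:         # everything dark -> black
        return "black"
    if min(bgr) > 190:        # everything bright -> white
        return "white"
    if b > r and b > g:       # dominant channel decides the hue
        return "blue"
    if r > b and r > g:
        return "red"
    if g > r and g > b:
        return "green"
    return "gray"
```

A mean BGR of (60, 55, 50), a dark car in shadow, lands near the black/blue boundary, which is where the misclassifications come from.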
Additionally, a rudimentary speed estimation is provided using moving bounding-box centers. I must say, it was exceptionally basic and not particularly accurate, and there are many methods which would be far more accurate. For one, the fact that cameras are angled means that perspective transformations would have to be accounted for. And, it goes without saying, given developments in satellite technology in both spatial and temporal resolution, isometric "bird's-eye views" would be fantastically suited to speed estimation. Clearly, YOLO specializes in vehicle identification and tracking with bounding boxes in these simple contexts. For better accuracy, more specialized computer vision models would have to be used, and YOLO might be abandoned.
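The moving-center estimate amounts to a pixel displacement divided by elapsed time, scaled by an assumed metres-per-pixel factor. A minimal sketch (the fixed scale factor is precisely the simplification that ignores camera perspective):

```python
def estimate_speed_kmh(c1, c2, frame_gap, fps, metres_per_pixel):
    """Naive speed estimate from two bounding-box centres (in pixels).

    Assumes a constant metres-per-pixel scale across the whole frame,
    i.e. it ignores the camera angle -- the main source of inaccuracy.
    """
    dx, dy = c2[0] - c1[0], c2[1] - c1[1]
    pixels = (dx * dx + dy * dy) ** 0.5      # Euclidean displacement
    metres = pixels * metres_per_pixel
    seconds = frame_gap / fps                # elapsed time between the frames
    return metres / seconds * 3.6            # m/s -> km/h
```

For example, a centre moving 100 pixels over 30 frames at 30 fps with a 0.1 m/pixel scale gives 10 m in 1 s, i.e. 36 km/h.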
Below are some examples,
3 The RAG: Results and Commentary
When running the RAG, one result is as follows,
The interface makes the results a little more convenient to look at, but actually, one reason the interface was introduced was to show potential onlookers, like you, dear reader, "implementation feasibility"; in fact, that was one of the reasons I even prototyped a RAG in the first place (the market for LLMs is hot). If you click on the image and zoom in, additional information is provided in the "Retrieved Documents" window, for sanity checks. Additionally, the LLM was successful in answering simple questions. The bar on top just shows the stage of the RAG, where generation typically takes the most time.
Finally, there's a "Play Clip (5s)" button; when you click on it, the clip frames that identify the unique ID of the vehicle play. If you look at the above image closely (click on it and zoom in), you can see that the appropriate frames are retrieved on the left, corresponding to the car at the bottom right of the video (others may describe the color as black rather than blue, but that's an issue with the cv2-based color detection).
So, the RAG answers the questions correctly, retrieves the frames where the vehicle was identified, and shows them.
In fact, the LLM can even engage in normal conversation.
The generation time is also shown at the bottom left.
4 Future Directions
I remind the reader again that the point is prototyping, a "proof of concept", which will no doubt aid a coming hackathon. If, however, I were to pursue such a project professionally, then the scope would be massively enhanced, the LLMs, far more powerful, and the hardware, of appropriate GPUs (or even NPUs), instead of running on an overclocked Intel i5-13600K.
Whatever the case, the flexibility of LLMs, especially their strengths in natural language processing, will no doubt mean that they become integrated into a greater and greater diversity of applications, if only to communicate with people, non-technical people included, such as law enforcement (if we do not use LLMs for convincing communication, can we use other models, like RNNs? Yes, but they have their issues).
On improvements, there are actually two simple approaches forward. One approach is to introduce a database, perhaps an SQL database, that assists in answering structured questions, freeing the RAG and lowering the possibility of hallucinations for complex queries. But then, human beings pose so many queries, with so many possibilities and so many modes of expression; does that mean that the SQL database, or a simple hand-crafted rules-based system on top of it, would grow to contain millions of lines? Well, perhaps, although millions of lines is not exactly unfeasible compared to what was done in the 20th century. But could such a database grow?
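To make the database idea concrete, here is a minimal sketch using the standard-library "sqlite3" module: the vision pipeline's JSON records are loaded into a table, so structured questions ("how many trucks on highway a?") become deterministic queries instead of LLM generations. The table schema mirrors the record fields I assumed earlier and is itself an assumption:

```python
import sqlite3

def load_records(records):
    """Load vision-pipeline records into an in-memory SQL table."""
    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE sightings
                   (highway TEXT, frame INTEGER, vehicle_id INTEGER,
                    type TEXT, speed_kmh REAL, color TEXT)""")
    con.executemany(
        "INSERT INTO sightings VALUES "
        "(:highway, :frame, :vehicle_id, :type, :speed_kmh, :color)",
        records)
    return con

def count_vehicles(con, highway, vtype):
    """Count distinct vehicles of a given type on a given highway --
    a structured question the RAG no longer has to hallucinate about."""
    row = con.execute(
        "SELECT COUNT(DISTINCT vehicle_id) FROM sightings "
        "WHERE highway = ? AND type = ?",
        (highway, vtype)).fetchone()
    return row[0]
```

The LLM's role then shrinks to translating the user's phrasing into one of these queries and phrasing the answer back, which is exactly the "smooth communication" role argued for in the introduction.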
Of course it could, via a so-called "human-in-the-loop", where human counterparts are introduced, possibly to aid in prompt engineering so as to reduce hallucinations (i.e., rephrasing other people's questions into clearer prompts), or to aid in the expansion of query engines (it goes without saying, automated expansion of query engines, perhaps aided by human counterparts, must be anticipated).
In fact, this is the approach of Adobe's recent "Summit Concierge", for which an interesting paper is available on arXiv. "Human-in-the-loop" is something of a cop-out (humans have general intelligence), but the efficiency of teams can be improved, which is the point of automation and integration of non-trivial A.I.
Incidentally, unless I resolve the GPU issue, I am in no position to realistically prototype and build an LLM transformer model from scratch.