Tags: Selfhosting
Written by: traumweh

Immich's Smart Search #

I’ve been using Immich for years at this point, but I have never used its search feature. My images are sorted into albums, so I usually know where to find things and don’t really need to search.

But while updating Immich today, I read about new search models being added, so I thought: why not, let’s give it a try.

Machine Learning #

But before I get into my impressions of this search system, we first need to talk about the machine learning models it uses and what that entails.

The search uses a pre-trained machine learning model which runs locally and is applied locally to one’s images: it scores how strongly a text query is associated with each image in a set, to determine which ones fit best.

One can choose from a couple of pre-trained CLIP models. CLIP stands for Contrastive Language-Image Pre-training and was originally created by OpenAI in 2021. Nowadays there exist multiple implementations of CLIP, e.g. the OpenCLIP or Multilingual-CLIP projects. They are all trained with different compute budgets and datasets. As far as I can tell, all of these datasets are based on some subset of the public web. I have read through a couple of the datasets’ research papers, and all of them use some subset of Common Crawl.

Common Crawl is a gigantic web-crawling project which has existed since 2008 and produces multiple datasets per year. As per the project’s website, “The corpus contains raw web page data, metadata extracts, and text extracts.” To create a dataset for CLIP models, this data gets filtered down to images with a large enough resolution and sufficiently long alt text. Duplicates get removed and the images get further classified, for example by resolution, the language of the text, or whether they are safe for work. But this step also includes keyword tagging and transformation: let’s say the image description is “Photograph of a pink building in Tokyo.” Then the extracted tags could be photograph, pink, building and Tokyo, and transformations of the image and the extracted tags could additionally yield realistic image, aesthetic image, architecture and Japan.
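
To make that curation step a bit more concrete, here is a toy sketch in Python of what such a filter might look like. The record fields and thresholds are entirely made up for illustration; none of these datasets publish their pipeline in exactly this form.

```python
# Toy sketch of the alt-text/resolution filtering described above.
# Field names and thresholds are made-up examples, not the real pipeline.

def keep_sample(record: dict) -> bool:
    """Decide whether a crawled (image, alt-text) pair is kept."""
    width, height = record["width"], record["height"]
    alt_text = record["alt_text"].strip()

    if width < 256 or height < 256:    # drop tiny images
        return False
    if len(alt_text.split()) < 5:      # drop near-empty alt text
        return False
    return True

samples = [
    {"width": 1024, "height": 768,
     "alt_text": "Photograph of a pink building in Tokyo"},
    {"width": 64, "height": 64, "alt_text": "logo"},
]

kept = [s for s in samples if keep_sample(s)]
print(len(kept))  # -> 1; deduplication and NSFW classification would follow
```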

These associations can then be used to train a CLIP model, which learns to score how strongly an image is associated with a given text query. That score can then be used to sort a list of images by how well they match the query.
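
To illustrate what such a scoring step looks like, here is a minimal sketch using the OpenCLIP Python package (open_clip_torch). The checkpoint, file names and query are placeholders I picked for this example; Immich wraps these models in its own machine-learning service, so this shows the underlying idea rather than how Immich itself calls them.

```python
# Minimal sketch: rank a few local images against a text query with OpenCLIP.
# File names, the query and the chosen checkpoint are illustrative only.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

paths = ["cat.jpg", "mountain.jpg", "birthday.jpg"]  # hypothetical files
images = torch.stack([preprocess(Image.open(p)) for p in paths])
query = tokenizer(["a red hat"])

with torch.no_grad():
    img_emb = model.encode_image(images)
    txt_emb = model.encode_text(query)
    # Normalise, then take the dot product = cosine similarity
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    scores = (img_emb @ txt_emb.T).squeeze(1)

# Higher score = stronger association between image and query
for path, score in sorted(zip(paths, scores.tolist()), key=lambda p: -p[1]):
    print(f"{score:.3f}  {path}")
```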

My Problem With This #

This is a really great application for machine learning, because it would be very difficult to write a classical algorithm for this task. But I have one big problem with it: these datasets rely on publicly available data from the open web without taking the data’s licenses into account. Not everything that is publicly available is also public domain or under an open license.

I heavily oppose this methodology when it comes to Large Language Models (LLMs) and Generative AI (GenAI), because they hallucinate ‘new’ things directly from the stolen work of others, require enormous amounts of energy to train and run, and steal the jobs of many while actually doing a worse job.

But for this type of algorithm (i.e. image classification), I am split. On the one hand, yes, people’s works are used without asking for permission; on the other hand, the data isn’t used to create anything new, and running many of these models requires only a very small amount of energy. It doesn’t steal anyone’s job either. Quite the opposite: it makes one’s life a lot easier by removing the need to manually tag an entire collection of thousands of photographs and hundreds of videos.

And if we ignore OpenAI and Google, most alternative CLIP models reuse existing web-crawl datasets instead of constantly re-crawling every single webpage there is. Common Crawl also actually respects one’s robots.txt, both the Disallow and the Crawl-delay directives.
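
As a concrete example: Common Crawl’s crawler identifies itself as CCBot, so a site can opt out of (or slow down) its crawling with a few lines of robots.txt. The path and the delay below are arbitrary values I chose for illustration:

```
# Example robots.txt entries for Common Crawl's crawler (CCBot);
# the path and the delay value are arbitrary illustrations.
User-agent: CCBot
Disallow: /photos/
Crawl-delay: 10
```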

In the end, I think I have decided for myself that I am okay with Common Crawl’s crawling, as well as with models built on top of it that do not try to hallucinate something new but instead try to make mundane tasks such as image tagging easier and more accessible.

Search Results #

Anyway, let’s talk about my actual experience with using the so-called Smart Search feature of Immich. I used the model ViT-B-32__laion2b-s34b-b79k: an OpenCLIP model trained on the English-language LAION-2B dataset, which contains 2.3 billion samples from Common Crawl, although I could not find out which snapshot of Common Crawl was used.
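
As far as I can tell, that identifier is just the OpenCLIP architecture and pretrained tag joined by a double underscore (with hyphens in place of underscores), so the same weights can also be loaded outside of Immich, roughly like this (my reading of the naming scheme, nothing official):

```python
# Assumed mapping of Immich's model name to OpenCLIP identifiers:
# "ViT-B-32__laion2b-s34b-b79k" -> architecture "ViT-B-32",
# pretrained tag "laion2b_s34b_b79k".
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
```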

The system runs on a decade-old Intel i7-4790 CPU with 24 GB of DDR3 RAM and no acceleration beyond the CPU’s integrated graphics.

My Immich instance contains many thousands of photographs of family, friends and personal events and trips. I specifically used queries which I thought should have some chance of finding something. I won’t include every query I tried (for privacy reasons), and I did not follow any scientific methodology either. This is just about me wanting to try out a feature and see whether I’d deem it to work well.

I started with single-word queries such as red, cat, hat, mountain, cliff or railing. As long as I had enough images that could be associated with these words, the Smart Search feature was able to find them, with pretty much no false positives.

After I couldn’t think of any more single words to try, I moved on to more abstract concepts: things like a birthday, school trip, train ride, sleepover or animal park. And once again, the first couple dozen or so images fit the criteria exactly. I do have photographs of a few childhood birthdays, sleepovers and school trips from different years in Immich, and the search results spanned across those, correctly classifying young people of varying ages as school students.

Satisfied, I decided to try some more complex queries: sentences combining different criteria to describe a specific type of event or trip. I was pretty surprised when it successfully found photographs from a ride on an old steam train, or distinguished between a children’s birthday indoors and one outdoors or in the summer. It was also able to show me photographs of a field in front of a forest on a cloudy day, or the ruin of a concrete building in the forest.

Overall I am really impressed by this. I wouldn’t be able to tag this number of photographs and videos at the level of detail required to support queries this specific. I don’t know how much I will use this in the future, but I am nonetheless happy that it is possible, completely locally, on a decade-old CPU.