One of the first things many people do online is search for information. A single search may not seem like sensitive information, but the sum of everyone’s searches, on every topic, all the time, certainly contains a great deal of sensitive information. If that wasn’t bad enough, it’s relatively simple for those collecting that data to draw inferences from it and learn incredibly sensitive things about individuals.

A wide variety of search engines have attempted to build their own indexes or to position themselves as privacy-respecting alternatives to mainstream search.

To figure out how to protect our information online, we have to take a serious look at one of the biggest culprits of not only invasions of privacy, but also online manipulation and censorship: search engines. It is incredibly tempting to want to replace a known-malicious part of one’s tech life (like Google) with a “better” drop-in replacement recommended online. Unfortunately, solving the serious problems of dragnet surveillance and censorship requires more consideration and effort than that.

To many, web search is synonymous with “Google”. Google certainly isn’t the only search engine out there, but it is by far the largest and the most used for English-language searches. Despite it still being the de facto standard for many, the concerns with Google Search are numerous. From banning particular results and queries to significant privacy problems, there are plenty of reasons someone might want to switch to a different search engine.

But the friction is quite high. Many report that no matter what engine they switch to, they never get quite as good results. This can sometimes be a bit of a “skill issue”, given that searching is something one can get better at over time. Still, there is definite truth to people’s trouble. Any competing search engine has a phenomenal amount of work to do before it can even begin to be comparable. An existing search engine has the advantage of time and hindsight; a new project effectively has to out-innovate both of those advantages while building almost entirely from scratch.

The most valuable and important parts of any search engine are the crawler and the index. These are the two crucial components any serious attempt at building a new search engine has to get right.

The Crawler

To allow others to search the entire web, you have to actually visit the entire web at some point, if for no other reason than to load each site and store what’s on it. Building a bot (or set of bots) to scan the entire internet and record its contents is a massive undertaking. This alone sets a very high minimum on the resources required to build and maintain a search engine. Presumably you want your search engine to be reasonably up to date, so this also requires constant upkeep.
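To make the scale of the problem concrete, here is a minimal sketch of a breadth-first crawler using only Python’s standard library. The seed URL and page limit are placeholders, and a real crawler would also need robots.txt handling, politeness delays, parallel fetching, deduplication at scale, and far more robust parsing.

```python
# A minimal breadth-first crawler sketch (Python stdlib only).
# The seed URL and limits are hypothetical; this omits robots.txt,
# rate limiting, retries, and everything else a real crawler needs.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=100):
    frontier = deque([seed])        # URLs waiting to be visited
    seen = {seed}                   # avoid revisiting the same URL
    pages = {}                      # url -> raw HTML we stored
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue                # unreachable or non-HTML content
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute, _ = urldefrag(urljoin(url, href))
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages

# pages = crawl("https://example.org")  # hypothetical seed
```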

Loading so many websites and downloading the resources they need is a huge burden in both bandwidth and storage. Some smaller search engines focus on a particular niche. This comes with significant downsides, however: crawling less creates blind spots for your search engine, limiting its usefulness. That can still be a perfectly reasonable trade-off when aiming for specialization. Wiby, for example, primarily indexes non-commercial, hobbyist-style websites.

While crawling the web, there are effectively two big questions that need answers: 1) how wide do you cast your net over the web to discover resources, and 2) how much information do you collect on each resource? The second is harder: one could store just links and their titles, but a more sophisticated engine will store keywords or even the full text to have the most accurate information possible. With the greater accessibility of machine transcription, it may even make sense to download and transcribe audio/video content for still more accuracy.
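As a rough illustration of that second question, the sketch below extracts three tiers of detail from a fetched page: title only, a keyword profile, and the full text. It assumes pages arrive as HTML strings from a crawler like the one above; a real engine would use far more sophisticated parsing and language handling.

```python
# Sketch of the "how much do we store?" trade-off: title only, keywords,
# or full text. Stdlib only; `html` is assumed to be a fetched page.
import re
from collections import Counter
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.chunks = []
        self._skip = 0              # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        elif not self._skip:
            self.chunks.append(data)

def summarize(html, top_keywords=20):
    parser = TextExtractor()
    parser.feed(html)
    full_text = " ".join(" ".join(parser.chunks).split())
    words = re.findall(r"[a-z]{3,}", full_text.lower())
    keywords = [w for w, _ in Counter(words).most_common(top_keywords)]
    return {
        "title": parser.title.strip(),   # cheapest: links and titles only
        "keywords": keywords,            # middle ground: a keyword profile
        "full_text": full_text,          # most storage, most accurate search
    }
```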

If all that wasn’t challenging enough, there will always be content online that is inaccessible to your crawler. Any information that requires an account or payment will automatically be refused to your crawler, at least until it becomes well known and approved by those sites. The more people create and share information behind these walled gardens, the worse off all search engines inevitably become.

The Index

After you’ve crawled all over the web, your work has only just begun. A proper search engine needs to be fast, complete, and accurate. All of this relies on the information you’ve stored being intelligently analyzed and organized for retrieval. For fast retrieval you’ll probably want to precompute scores for common queries, or compute ranking functions over a carefully curated subset of your data. A search engine without a crawler is frozen in time, but a search engine without an index is just a massive, unsorted collection of online resources.
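As a toy illustration (not how any particular engine actually works), the sketch below builds an inverted index with precomputed tf-idf-style weights over the `pages` mapping from the crawler sketch above, so that answering a query becomes a handful of cheap dictionary lookups.

```python
# A toy inverted index with precomputed term weights, sketching why the
# index (not the crawl) is what makes retrieval fast. `pages` is assumed
# to map URL -> extracted text.
import math
import re
from collections import defaultdict

def build_index(pages):
    postings = defaultdict(dict)            # term -> {url: term frequency}
    for url, text in pages.items():
        for term in re.findall(r"[a-z]{3,}", text.lower()):
            postings[term][url] = postings[term].get(url, 0) + 1
    # Precompute tf-idf style weights once, so queries are cheap lookups.
    n_docs = len(pages)
    index = {}
    for term, docs in postings.items():
        idf = math.log(n_docs / len(docs)) + 1.0
        index[term] = {url: tf * idf for url, tf in docs.items()}
    return index

def search(index, query, limit=10):
    scores = defaultdict(float)
    for term in re.findall(r"[a-z]{3,}", query.lower()):
        for url, weight in index.get(term, {}).items():
            scores[url] += weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:limit]
```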

Properly building a search index requires a great deal of forethought and maintenance. There will always be ways to improve accuracy and speed. The algorithms built on top of your index may be specifically designed for particular kinds of information. There is ultimately no limit to what can be done to craft the perfect index of the entire web. I would argue that the index is the most important resource of a search engine, big or small.

The Problem of Surveillance

Understanding all of this is important for making sense of the privacy risks of search engines. When you use a web search, it’s very difficult to hide the fact that you’re using it, especially from the provider itself. While tools like Tor and VPNs can mitigate the risk somewhat, the vast majority of people won’t even take these (far from bullet-proof) measures. This creates a wide variety of concerns, especially as the searches of huge numbers of people become massively aggregated. Search data about a particular group of people is also very useful information to hold against that specific group.

There have been many “private web search” providers that have entered the space. Ultimately, the first thing they offer is a promise to collect less data about you than Google. While that’s good, it’s not a high bar. Even when a search engine genuinely collects very little, the aggregated information as a whole is still a treasure trove of power that can be used against individuals and groups alike. And even if no user data is collected at all, that does nothing to stop governments that snoop on their own citizens.

In a time when people running tech companies are placed under arrest, it’s worth asking how independent a search engine can truly be. Even with no data collection and no snooping involved, every operator will face pressure to leave inconvenient information out of their index, if only as a cost-saving measure. The sheer difficulty of running a large search engine makes it almost impossible for a truly independent organization to do so free of external interference.

To make matters worse, given the difficulty in hiding that one is using a particular public search engine, it’s entirely possible that using an obscure search engine itself reveals a fair amount about a person.

Alternative Strategies

Metasearch

One strategy is to use a privacy-protecting front-end that routes your search queries through a variety of public search engines. SearXNG is a notable, highly configurable example. Using a metasearch engine is a meaningful improvement over querying public search engines directly. A community of people sharing a metasearch instance gives everyone some “safety in numbers”, at least from the search engines themselves.
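Conceptually, a metasearch front-end fans a single query out to several upstream engines and merges the results. The sketch below illustrates the idea only; the endpoints and JSON response format are hypothetical placeholders, not SearXNG’s actual implementation or any engine’s real API.

```python
# Conceptual sketch of what a metasearch front-end does: fan one query out
# to several upstream engines, then merge and de-duplicate the results.
# The endpoints and response format are hypothetical placeholders.
import json
from urllib.parse import quote
from urllib.request import urlopen

UPSTREAM_ENGINES = [
    "https://engine-a.example/search?q={q}",   # hypothetical endpoints
    "https://engine-b.example/search?q={q}",
]

def metasearch(query):
    merged, seen_urls = [], set()
    for template in UPSTREAM_ENGINES:
        try:
            raw = urlopen(template.format(q=quote(query)), timeout=10).read()
            results = json.loads(raw)          # assume a JSON list of results
        except Exception:
            continue                           # one engine failing is fine
        for result in results:
            if result["url"] not in seen_urls: # drop duplicates across engines
                seen_urls.add(result["url"])
                merged.append(result)
    return merged
```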

Of course, the danger of using a metasearch engine is that one is placing trust in the operator to be responsible, trustworthy, and competent. Another point to consider is that a metasearch engine cannot find information that is concealed (or simply not indexed) by the search engines it queries.

The Pirate Bay

The Pirate Bay is an example of a web service that has stayed online despite massive legal and technical opposition. It’s important to note that this was possible because The Pirate Bay exists to index a narrow set of content (torrents) and stores the bare minimum of information needed to work. This allows the service to be mirrored or migrated elsewhere incredibly easily. The larger and more complicated a service is, the harder it is to make resilient to external pressure and influence.

Searching Particular Sites

Many seem to forget that websites often have their own search function. This amnesia is sometimes warranted because, sadly, many sites’ own search is less effective than just using Google and specifying the site. When looking for specific information, it’s often quite effective to search specialized or related wikis or forums to get a specific answer. This lost art is becoming part of many people’s habits again now that general search engines are gradually getting worse.
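For example, most major engines support the `site:` operator, which can stand in for a weak on-site search. The snippet below simply builds such a query URL; the DuckDuckGo endpoint is just one example of an engine that understands the operator.

```python
# Building a site-restricted query with the widely supported `site:` operator.
from urllib.parse import quote_plus

def site_search_url(site, query, engine="https://duckduckgo.com/?q={q}"):
    return engine.format(q=quote_plus(f"site:{site} {query}"))

print(site_search_url("en.wikipedia.org", "inverted index"))
# https://duckduckgo.com/?q=site%3Aen.wikipedia.org+inverted+index
```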

Of course, searching a specific site is only as private as the site itself. It is sad to hear that “Gen Z” is allegedly notorious for using TikTok as a go-to search engine, which is a condemnation of our current situation.

Can We Do Better?

It’s hard to believe that yet another private search engine is a panacea for all our privacy concerns. While many of them provide non-trivial benefits, those aiming to make dragnet surveillance impossible need to go even further. Breaking away from the cloud works for various kinds of software, so why can’t it work for search?

Usually, your device has a built-in search function that can search files and even settings. This is a very useful thing to have. It doesn’t immediately sound like a crazy idea, then, to distribute indexes to devices for people to search on their own. A narrow but useful search engine would be a powerful thing to include on one’s own machine. Kiwix is a fascinating project that aims to provide (searchable) online resources for offline use.
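As a small sketch of what local, offline search can look like, the example below uses SQLite’s FTS5 full-text extension (available in most Python builds of sqlite3) to index and query a couple of placeholder documents entirely on-device. The file name and documents are hypothetical.

```python
# A minimal local, offline search index using SQLite FTS5.
# Assumes your sqlite3 build includes the FTS5 extension.
import sqlite3

conn = sqlite3.connect("local_index.db")
conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS docs USING fts5(url, title, body)")

conn.executemany(
    "INSERT INTO docs (url, title, body) VALUES (?, ?, ?)",
    [
        ("https://example.org/a", "Example A", "notes about offline search"),
        ("https://example.org/b", "Example B", "notes about rss archives"),
    ],
)
conn.commit()

# Ranked full-text query, entirely on your own machine: no query ever
# leaves the device.
for url, title in conn.execute(
    "SELECT url, title FROM docs WHERE docs MATCH ? ORDER BY rank", ("offline",)
):
    print(url, title)
```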

Personally, I do have my own locally hosted search engine of sorts. One of my favorite RSS readers, FreshRSS, stores every article from each feed I’m subscribed to. This means that when I want to find relevant commentary on a particular story, I can easily search for related articles on topics I’m interested in. Having done this for quite some time, I’ve found it a very valuable tool for looking back as issues develop.

Indexing the entire Internet is an immense affair. It’s entirely possible that the best search engine for an individual covers only a tiny fraction of the whole. Where possible, it would be nice if such a search engine could be dynamically extended with additional indexes. That would mean that, provided the device one is searching on is sound, all searches would be safe from prying eyes, and even from censorship!
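Sticking with the SQLite sketch above, “extending” a local index could be as simple as attaching an additional, separately distributed database and querying both together. The file names below are hypothetical, and both databases are assumed to share the same FTS5 schema as the previous sketch.

```python
# Sketch of extending a local search setup with an extra, separately
# distributed index. Both database files are hypothetical and assumed
# to use the same FTS5 `docs` table as the previous sketch.
import sqlite3

conn = sqlite3.connect("local_index.db")
conn.execute("ATTACH DATABASE 'community_index.db' AS extra")

def search_everywhere(conn, query):
    results = []
    for schema in ("main", "extra"):
        # Each attached database exposes an identical `docs` table.
        rows = conn.execute(
            f"SELECT url, title FROM {schema}.docs WHERE docs MATCH ?",
            (query,),
        ).fetchall()
        results.extend(rows)
    return results

print(search_everywhere(conn, "offline"))
```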

Information Management

What ultimately makes one search engine better than another is relevance. This is what makes LLM-generated search summaries convincing, even when they’re obviously wrong. If the way to make offline search viable is sharing small but useful indexes, then a wide variety of information-management tools are needed. One such tool I’ve started using is Logseq, and I’ve been quite impressed.

Software like this, which lets users leverage their own computing and knowledge in a harmonious way, is clearly how we build not only better privacy, but a much better digital experience for people. Those interested in learning how to build software should consider building tools that enable that experience, rather than ones that fit neatly within the cloud paradigm. More and more people are learning to break free from the cloud for their workflows, but they often find themselves looking for tools. There are many niches worth filling, so find one you find interesting!

Building High-quality Resources

Condensing information and resources is incredibly valuable. By maintaining a public online resource where you curate useful knowledge, you can greatly assist others in saving time. This is where the Free and Open Web really shines. High quality independent websites are the backbone of a truly free cyberspace and can do innumerable things to tip the scales in favor of people and communities.

As long as your content is available in accessible and/or machine-readable formats, a great deal can be done to make those resources easier to index. With a large collection of valuable information to share, you reduce the odds that snoopers have a clear idea of which specific information people are accessing. This is why I would highly encourage any website to include a full-text RSS archive of all its posts. You never know when such a thing may be seriously helpful.
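As one sketch of what “machine-readable” can mean in practice, the snippet below writes a bare-bones full-text RSS 2.0 archive where each item carries the complete post body rather than a teaser. The post data is a placeholder; a static site generator would normally emit this for you.

```python
# Bare-bones sketch of a full-text RSS archive: every post's complete body
# goes into the feed, not just a summary. Post data is a placeholder.
from xml.sax.saxutils import escape

posts = [
    {"title": "Example post", "url": "https://example.org/post",
     "body": "<p>The complete article text, not a summary.</p>"},
]

items = "".join(
    "<item>"
    f"<title>{escape(p['title'])}</title>"
    f"<link>{escape(p['url'])}</link>"
    f"<description>{escape(p['body'])}</description>"
    "</item>"
    for p in posts
)

feed = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<rss version="2.0"><channel>'
    "<title>Example site</title><link>https://example.org</link>"
    "<description>Full-text archive</description>"
    f"{items}</channel></rss>"
)

with open("archive.xml", "w", encoding="utf-8") as f:
    f.write(feed)
```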

Conclusion

Private web search is almost impossible in our current environment where large institutions are fighting over control of people’s very own knowledge. It is crucial to build solutions that broaden people’s ability to learn, create & share independently. Information control is a terrifying force that can be incredibly difficult to reverse. For many, search is where they acquire most of their information online, and it’s getting increasingly controlled. Thinking about ways to acquire information online without systems acting as gatekeepers is going to become only more important as time goes on.

Gabriel


Published: Sep 11 2024
Tags: Search, Information Management, Privacy, Information Control
