macOS File Indexing Using Python
Spotlight is great until you can’t quite remember a filename — you recall it was something like a config, on an external drive, from last year. This project is a small Python tool that fixes exactly that: it indexes every file on a macOS volume, stores the metadata in SQLite, and then lets you search by meaning (semantic similarity) and by near-spelling (fuzzy matching), not just exact names.
Download the script here: https://github.com/earthinversion/macos-file-indexing
The one mental model
The tool is a four-stage pipeline:
crawl the volume → store metadata in SQLite → embed filenames into a FAISS index → search (semantic + fuzzy).
Indexing happens once (and caches); searching then runs against the stored metadata and the vector index, so lookups are fast even across a whole disk.
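The build-once-then-reuse idea can be sketched as a small load-or-build helper. This is a minimal illustration, not the repository's actual caching code: `load_or_build`, `expensive_build`, and the `search_cache.pkl` filename are hypothetical, and it assumes the cached object can be pickled (a real FAISS index would normally be saved with FAISS's own `write_index` and `read_index` functions instead).

```python
import os
import pickle
import tempfile


def load_or_build(cache_path, build_fn, rebuild=False):
    """Return the cached object if it exists, otherwise build it and cache it."""
    if not rebuild and os.path.exists(cache_path):
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)
    result = build_fn()
    with open(cache_path, "wb") as fh:
        pickle.dump(result, fh)
    return result


calls = []  # records how many times the expensive step actually ran


def expensive_build():
    """Stand-in for the slow step (crawling, embedding, indexing)."""
    calls.append(1)
    return {"filenames": ["config.yaml", "run_info.yml"]}


cache_file = os.path.join(tempfile.mkdtemp(), "search_cache.pkl")
first = load_or_build(cache_file, expensive_build)                 # builds
second = load_or_build(cache_file, expensive_build)                # reads the cache
third = load_or_build(cache_file, expensive_build, rebuild=True)   # forced rebuild
print(len(calls))
```

The `rebuild` flag plays the same role as a "rebuild the cache?" prompt: say yes after the volume's contents have changed, no otherwise.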
What it does
- Index all files on a specified macOS volume.
- Store metadata (file path, file kind, size, volume name, modified time) in SQLite.
- Perform semantic search using FAISS and Sentence Transformers.
- Use fuzzy matching to find similar filenames.
- Show formatted search results in a human-readable table.
- Improve speed using FAISS caching.
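The first two items in the list above (crawl, then store) can be sketched with the standard library alone. This is an assumption-laden sketch rather than the project's code: the `files` table layout, the `crawl` and `build_db` helpers, and the `DEMO` volume label are made up for illustration, and the file-kind column is left out because detecting it needs an external tool.

```python
import os
import sqlite3
import tempfile
from datetime import datetime


def crawl(root, volume):
    """Yield one metadata tuple per file found under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # unreadable or vanished file: skip it
            modified = datetime.fromtimestamp(st.st_mtime).strftime("%Y-%m-%d %H:%M:%S")
            yield (path, name, st.st_size, volume, modified)


def build_db(conn, root, volume):
    """Create the (hypothetical) files table and fill it from a crawl."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT PRIMARY KEY, name TEXT, size INTEGER, volume TEXT, modified TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?, ?)", crawl(root, volume)
    )
    conn.commit()


# Demo on a throwaway directory instead of a real volume.
root = tempfile.mkdtemp()
with open(os.path.join(root, "config.yaml"), "wb") as fh:
    fh.write(b"key: value\n")
os.makedirs(os.path.join(root, "data"))
with open(os.path.join(root, "data", "run_info.yml"), "wb") as fh:
    fh.write(b"run: 1\n")

conn = sqlite3.connect(":memory:")
build_db(conn, root, volume="DEMO")
rows = conn.execute("SELECT name, size FROM files ORDER BY name").fetchall()
print(rows)
```

Using the path as the primary key with `INSERT OR REPLACE` means re-running the indexer updates existing rows instead of duplicating them.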
How it works
The search step actually combines two different notions of “similar”:
- Semantic search uses Sentence Transformers to turn each filename into a vector, and FAISS to find the nearest vectors to your query. This matches by meaning, so location_info.yaml can surface run_info.yml and other config-like files even without a shared spelling.
- Fuzzy matching uses edit distance (Levenshtein) to find the closest spelling, which is great for typos and near-misses on the exact name.
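To make "nearest vectors" concrete, here is a pure-Python stand-in for the exhaustive nearest-neighbour search that a flat FAISS index performs. Nothing here calls FAISS or Sentence Transformers: the three-dimensional vectors are made-up toy "embeddings" (real models produce hundreds of dimensions), and `cosine` and `nearest` are hypothetical helper names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(query_vec, vectors, k=3):
    """Return the k (name, score) pairs most similar to query_vec."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in vectors.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Made-up vectors; the axes loosely mean (config-like, notes-like, image-like).
toy_vectors = {
    "run_info.yml": [0.9, 0.1, 0.0],
    "config.yaml": [1.0, 0.0, 0.1],
    "meeting-notes.markdown": [0.1, 0.9, 0.0],
    "holiday.jpg": [0.0, 0.1, 1.0],
}

query = [0.95, 0.05, 0.05]  # pretend embedding of "location_info.yaml"
top = nearest(query, toy_vectors, k=2)
for name, score in top:
    print(f"{name}: {score:.3f}")
```

The config-like files rank highest even though their names share few characters with the query, which is the behaviour the semantic layer is meant to provide. A vector index exists to do this same comparison quickly over a whole disk's worth of filenames.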
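You search for notes_2024.md but the file is actually named meeting-notes.markdown. Which layer is most likely to surface it? Most likely the semantic layer: the two names share a meaning (notes) but little of their spelling, so edit distance alone would rank the real file poorly.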
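Edit distance is easy to compute by hand, which helps when reasoning about the question above. The sketch below is a plain dynamic-programming Levenshtein implementation, not the code or scoring formula that fuzzywuzzy uses. The `similarity` helper and its 0 to 100 scaling are assumptions chosen for illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes, or substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Scale the distance to 0-100, where higher means closer spelling."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100
    return round(100 * (1 - levenshtein(a, b) / longest))


typo_pair = ("location_info.yaml", "location_info.yml")
quiz_pair = ("notes_2024.md", "meeting-notes.markdown")

print(levenshtein(*typo_pair), similarity(*typo_pair))
print(levenshtein(*quiz_pair), similarity(*quiz_pair))
```

The near-miss pair differs by a single deleted character, while the pair from the question needs many edits, so a spelling-only score would place the real file well down the list.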
Installation
Install dependencies
pip install faiss-cpu sentence-transformers fuzzywuzzy pandas tabulate tqdm numpy python-Levenshtein
Note: fuzzywuzzy still installs and works, but it is no longer actively maintained. Its
successor is thefuzz (same API, pip install thefuzz); if you start fresh, prefer thefuzz.
With fuzzywuzzy, keeping python-Levenshtein installed makes the fuzzy matching much faster
than the pure-Python fallback. Recent thefuzz releases rely on rapidfuzz for speed instead,
so check the dependencies of the version you install.
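Because the two packages expose the same `process` module, code can be written to work with whichever one is installed. The sketch below is one way to do that, not something taken from the repository. `best_match` is a hypothetical helper, and the final `difflib` fallback is an addition that uses a different similarity measure than either package.

```python
import difflib

try:
    from thefuzz import process  # maintained successor
except ImportError:
    try:
        from fuzzywuzzy import process  # older package, same API
    except ImportError:
        process = None  # neither package is installed


def best_match(query, choices):
    """Return the choice spelled most like query, or None if there are no choices."""
    if not choices:
        return None
    if process is not None:
        result = process.extractOne(query, choices)  # (choice, score) for a list
        return result[0] if result else None
    # Standard-library fallback; its ratio is not Levenshtein-based.
    matches = difflib.get_close_matches(query, choices, n=1, cutoff=0.0)
    return matches[0] if matches else None


candidates = ["location_info.yml", "holiday.jpg", "wpa_supplicant.conf"]
print(best_match("location_info.yaml", candidates))
```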
Clone the repository
git clone https://github.com/earthinversion/macos-file-indexing.git
cd macos-file-indexing
Set up the database
- Edit the configuration in the config.yaml file.
- Run the following command to build the indexing database:
python file_indexer.py
Searching for files
- To search for a file, use:
python search_files.py
Example output:
Do you want to rebuild the search cache? (yes/no): no
Enter filename to search: location_info.yaml
Exact match not found. Suggested files:
+---+-------------------------------------------------------------------+------------------------------------+--------------+-----------+---------------------+
| | Path | File Kind | Size (bytes) | Volume | Modified Time |
+---+-------------------------------------------------------------------+------------------------------------+--------------+-----------+---------------------+
| 0 | /Volumes/QSIS_DISK/event_data_download_waveform_api/._config.yaml | AppleDouble encoded Macintosh file | 4.00 KB | QSIS_DISK | 2025-01-26 14:18:00 |
| 1 | /Volumes/QSIS_DISK/QSIS-Server-run/run_info.yml | ASCII text | 101 B | QSIS_DISK | 2022-06-27 23:55:45 |
| 2 | /Volumes/QSIS_DISK/qsis-server-inspect/data/run_info.yml | ASCII text | 2.46 KB | QSIS_DISK | 2023-03-18 02:08:30 |
| 3 | /Volumes/QSIS_DISK/line-bot-qsis/config.yaml | ASCII text | 140 B | QSIS_DISK | 2023-01-14 17:21:30 |
| 4 | /Volumes/QSIS_DISK/event_data_download_waveform_api/config.yaml | Unicode text, UTF-8 text | 511 B | QSIS_DISK | 2025-01-25 12:51:35 |
+---+-------------------------------------------------------------------+------------------------------------+--------------+-----------+---------------------+
Best fuzzy match:
+---+----------------------------------------------------------+------------+--------------+-----------+---------------------+
| | Path | File Kind | Size (bytes) | Volume | Modified Time |
+---+----------------------------------------------------------+------------+--------------+-----------+---------------------+
| 0 | /Volumes/QSIS_DISK/qsis-server-inspect/location_info.yml | ASCII text | 1.10 KB | QSIS_DISK | 2023-04-07 22:32:39 |
+---+----------------------------------------------------------+------------+--------------+-----------+---------------------+
Enter filename to search: wpa_supplicant.conf
Exact match found:
+---+----------------------------------------+------------+--------------+-----------+---------------------+
| | Path | File Kind | Size (bytes) | Volume | Modified Time |
+---+----------------------------------------+------------+--------------+-----------+---------------------+
| 0 | /Volumes/QSIS_DISK/wpa_supplicant.conf | ASCII text | 161 B | QSIS_DISK | 2022-03-30 20:18:02 |
+---+----------------------------------------+------------+--------------+-----------+---------------------+
The script prints exact matches, suggested files, and the best fuzzy match — each with path and metadata.
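The size and time columns in the tables above are formatted for reading rather than shown as raw numbers. A minimal sketch of that kind of formatting follows. `format_size` and `format_mtime` are hypothetical helpers, and the 1024-byte unit step is an assumption, since the output does not show which convention the script uses.

```python
from datetime import datetime


def format_size(num_bytes):
    """Render a byte count like '101 B' or '4.00 KB' (1024-based, an assumption)."""
    if num_bytes < 1024:
        return f"{num_bytes} B"
    size = float(num_bytes)
    for unit in ("KB", "MB", "GB", "TB"):
        size /= 1024
        if size < 1024 or unit == "TB":  # TB is the catch-all for anything larger
            return f"{size:.2f} {unit}"


def format_mtime(timestamp):
    """Render a POSIX timestamp as local time, 'YYYY-MM-DD HH:MM:SS'."""
    return datetime.fromtimestamp(timestamp).strftime("%Y-%m-%d %H:%M:%S")


for raw in (101, 4096, 1536, 5 * 1024 ** 2):
    print(raw, "->", format_size(raw))
```

Storing the raw integer size and timestamp in the database and formatting only at display time keeps sorting and filtering by size or date straightforward.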
macOS gotcha: notice the ._config.yaml entry — those ._ “AppleDouble” files are metadata
sidecars macOS scatters on non-native volumes. They show up in a raw file crawl, so don’t be surprised
to see them alongside the real files in your results.
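If those sidecars clutter your results, they can be filtered by name. The snippet below is a small sketch of one way to do it. `is_appledouble` is a hypothetical helper that checks only the `._` filename prefix, so it would also hide any genuine file that happens to start with those two characters.

```python
import os


def is_appledouble(path):
    """True for '._name' sidecar files, judged by filename prefix only."""
    return os.path.basename(path).startswith("._")


paths = [
    "/Volumes/QSIS_DISK/event_data_download_waveform_api/._config.yaml",
    "/Volumes/QSIS_DISK/event_data_download_waveform_api/config.yaml",
    "/Volumes/QSIS_DISK/line-bot-qsis/config.yaml",
]

real_files = [p for p in paths if not is_appledouble(p)]
for p in real_files:
    print(p)
```

The same check could run during the crawl (so sidecars never reach the database) or at search time (so they stay indexed but hidden).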
Recap
Without scrolling up — can you name the four stages? The tool:
- Crawls a macOS volume and reads each file’s metadata,
- Stores that metadata (path, kind, size, volume, mtime) in SQLite,
- Indexes filenames as embeddings in a FAISS vector index (cached for speed),
- Searches with two complementary layers — semantic (meaning, via FAISS) and fuzzy (spelling, via Levenshtein) — falling back gracefully when there’s no exact match.
Where to go next
- The source, to run and adapt: earthinversion/macos-file-indexing.
- FAISS: the similarity-search index.
- Sentence Transformers: the embedding models behind the semantic search.
- thefuzz: the maintained successor to fuzzywuzzy.