🏛️ Mining Companies House for startups
One of the more creative things I’ve built is a pipeline that scrapes the UK’s Companies House registry to find recently incorporated startups that match our investment thesis — before they show up on anyone’s radar.
The idea
Companies House is a goldmine of public data. Every UK company has to register, declare its SIC codes (industry classification), and list its officers and persons with significant control (PSCs). If you know what you’re looking for, you can use this data to find newly formed tech companies operating in your target sectors.
Our fund focuses on “real asset technologies” — PropTech, ConTech, EnergyTech, ClimaTech, InsurTech, and WorkforceTech. I curated 24 SIC codes across two groups: “tech SICs” (software development, data processing, R&D) and “real asset SICs” (property, construction, energy, insurance). The sweet spot is a company registered under both.
How it works
The system is a five-stage CLI pipeline:
- Scrape: Query the Companies House Advanced Search API by SIC code and incorporation date. I iterate over all 24 SIC codes in 30-day date chunks to stay under the API’s 2,000-result cap per query. For each hit, fetch the full company profile and active officers.
- Enrich: For each scraped company, fetch PSC data (persons with significant control, i.e. ownership) and filing history. This is the key differentiator: PSC data reveals whether a company is founder-owned or a corporate shell.
- Score: Apply a rules-based scoring model. The strongest signal (+40 points) is having both a “strong tech” SIC code (e.g. 62010 Computer Programming) AND a “real asset” SIC code (e.g. 68310 Real Estate). Other positive signals include company age, number of individual PSCs (founders), officer count, UK tech hub location, and filing activity. Negative signals penalise property-only companies, corporate-only ownership, and names containing “HOLDINGS” or “LETTINGS”. Companies scoring 80+ are hot, 35+ are review, below 35 are pass.
- Discover: Find websites for hot/review companies using Clearbit’s free autocomplete API and brute-force domain guessing (slugify the company name and probe common TLDs like .com, .co.uk, .io, .ai, .tech).
- Export: Dump filtered results to CSV for review.
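Two of the stages above reduce to small pure functions, which makes them easy to illustrate. The following is a minimal sketch rather than the pipeline's actual code: it shows how the 30-day date chunks for the Scrape stage and the candidate domains for the Discover stage could be generated. The function names, the list of legal suffixes to strip, and the exact slug rule (lowercase, alphanumeric only, words joined without separators) are assumptions made for illustration.

```python
import re
from datetime import date, timedelta

# TLDs named in the text; the suffix list below is an assumption.
TLDS = (".com", ".co.uk", ".io", ".ai", ".tech")
LEGAL_SUFFIXES = ("limited", "ltd", "llp", "plc")


def date_chunks(start: date, end: date, days: int = 30):
    """Yield inclusive (date_from, date_to) windows of at most `days` days."""
    cursor = start
    while cursor <= end:
        chunk_end = min(cursor + timedelta(days=days - 1), end)
        yield cursor, chunk_end
        cursor = chunk_end + timedelta(days=1)


def candidate_domains(company_name: str) -> list[str]:
    """Slugify a company name and pair the slug with each TLD to probe."""
    words = re.findall(r"[a-z0-9]+", company_name.lower())
    words = [w for w in words if w not in LEGAL_SUFFIXES]
    slug = "".join(words)
    return [slug + tld for tld in TLDS] if slug else []
```

Each window from date_chunks would become one API query per SIC code, and each candidate domain would be probed with a lightweight HTTP request. Keeping the network calls out of these helpers makes them easy to unit test.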
What makes it interesting
The whole thing is remarkably minimal — about 400 lines of Python, two pip dependencies (requests and python-dotenv), and a single SQLite file. Yet it constitutes a functional dealflow pipeline.
The scoring model is entirely hand-tuned heuristics, not ML — and the logic is specific enough to be effective. For example, companies whose SIC codes are a subset of {68100, 68209, 68310, 68320} are almost certainly landlords, not startups, so they get penalised.
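To make the heuristics concrete, here is a rough sketch of how a few of these rules could be expressed. Only the +40 dual-SIC bonus, the landlord SIC set, the name flags, and the 80/35 thresholds come from the description above. The penalty values, the function names, and the one-item example SIC sets are placeholders I've assumed for illustration, and the other signals (company age, PSCs, officers, location, filings) are left out.

```python
# Example codes from the text; the real lists are longer.
STRONG_TECH_SICS = {"62010"}
REAL_ASSET_SICS = {"68310"}
LANDLORD_SICS = {"68100", "68209", "68310", "68320"}
NAME_RED_FLAGS = ("HOLDINGS", "LETTINGS")

DUAL_SIC_BONUS = 40       # from the text
LANDLORD_PENALTY = -30    # assumed value
NAME_PENALTY = -20        # assumed value


def score_company(name, sic_codes):
    """Apply a subset of the rules to one company and return its score."""
    sics = set(sic_codes)
    score = 0
    # Strongest signal: at least one tech SIC and one real asset SIC.
    if sics & STRONG_TECH_SICS and sics & REAL_ASSET_SICS:
        score += DUAL_SIC_BONUS
    # Property-only companies: every declared SIC is in the landlord set.
    if sics and sics <= LANDLORD_SICS:
        score += LANDLORD_PENALTY
    if any(flag in name.upper() for flag in NAME_RED_FLAGS):
        score += NAME_PENALTY
    return score


def tier(score):
    """Map a score to the hot / review / pass buckets."""
    if score >= 80:
        return "hot"
    if score >= 35:
        return "review"
    return "pass"
```

The subset check (`sics <= LANDLORD_SICS`) is what separates a landlord from a PropTech company: a company with 68310 plus a software SIC is not a subset of the landlord set, so it avoids the penalty and can still earn the dual-SIC bonus.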
The pipeline is also fully idempotent. A searched_ranges table records every (SIC code, date_from, date_to) chunk that’s been fetched. Re-running the scraper skips completed chunks. Enrichment skips companies already enriched. You can safely re-run the whole thing whenever you want.
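The chunk-skipping logic can be sketched with nothing but the standard library's sqlite3 module. The table name searched_ranges and its three-part key come from the description above. The column types, the helper names, and the fetch callback are assumptions for the sake of a self-contained example.

```python
import sqlite3


def init_db(conn):
    """Create the bookkeeping table if it does not exist yet."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS searched_ranges (
               sic_code  TEXT NOT NULL,
               date_from TEXT NOT NULL,
               date_to   TEXT NOT NULL,
               PRIMARY KEY (sic_code, date_from, date_to)
           )"""
    )


def already_searched(conn, sic_code, date_from, date_to):
    row = conn.execute(
        "SELECT 1 FROM searched_ranges "
        "WHERE sic_code = ? AND date_from = ? AND date_to = ?",
        (sic_code, date_from, date_to),
    ).fetchone()
    return row is not None


def scrape_chunk(conn, sic_code, date_from, date_to, fetch):
    """Run `fetch` for one chunk unless it is already recorded.

    Returns True if the chunk was fetched, False if it was skipped.
    """
    if already_searched(conn, sic_code, date_from, date_to):
        return False
    fetch(sic_code, date_from, date_to)  # hypothetical callback doing the API work
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO searched_ranges VALUES (?, ?, ?)",
            (sic_code, date_from, date_to),
        )
    return True
```

Because the chunk is only recorded after fetch returns, a crash mid-fetch leaves it unrecorded and the next run retries it. That only stays safe if the company inserts themselves tolerate duplicates, for example by using the company number as a primary key.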
Tech: Python, Companies House API, Clearbit API, SQLite.