At the beginning of the automobile era, Michelin, a tire company, created a travel guide that included a restaurant guide.
Over the years, Michelin stars have become highly prestigious thanks to the guide's high standards and its strict, anonymous inspectors. Michelin stars are incredibly coveted: gaining just one can change a chef's life, and losing one can change it just as much.
The dataset is curated using Go Colly.
This software is intended for research purposes only. Users must abide by the relevant laws and regulations of their jurisdiction; please do not use it for illegal purposes. The user bears all consequences of any illegal use.
The dataset contains a list of restaurants along with additional details (e.g. address, price range, cuisine type, longitude, latitude, etc.) curated from the MICHELIN Restaurants guide. The culinary distinctions (i.e. the ‘Award’ column) of the restaurants included are:
| Content | Link | Description |
| --- | --- | --- |
| CSV | CSV | Good ol' comma-separated values |
| Kaggle | Kaggle | Data science community |
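As a minimal sketch, the CSV can be read with Go's standard `encoding/csv` package. The column names and the sample row below are illustrative assumptions; check the actual CSV header for the real schema:

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// parseCSV reads comma-separated records from raw text,
// handling quoted fields (e.g. addresses containing commas).
func parseCSV(raw string) ([][]string, error) {
	return csv.NewReader(strings.NewReader(raw)).ReadAll()
}

func main() {
	// Hypothetical header and row, for illustration only.
	raw := "Name,Address,Cuisine,Award\n" +
		"Noma,\"Refshalevej 96, Copenhagen\",Creative,3 Stars\n"
	rows, err := parseCSV(raw)
	if err != nil {
		panic(err)
	}
	for _, rec := range rows[1:] { // skip the header row
		fmt.Printf("%s (%s): %s\n", rec[0], rec[2], rec[3])
	}
	// → Noma (Creative): 3 Stars
}
```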
Inspired by this Reddit post, my initial intention in creating this dataset was to map all Michelin Guide restaurants from around the world on Google My Maps (see an example).
NOTE: Check out the `Makefile`, or run `make help`.
To crawl, run:
```sh
make crawl # go run cmd/mym/mym.go
```
Alternatively, you can install this directly via `go install`:

```sh
go install github.com/ngshiheng/michelin-my-maps/v2/cmd/mym
rm michelin.db # remove any existing database before a fresh scrape
mym -log debug
```
Because many websites use JavaScript to generate content dynamically, that content may not be present in the initial HTML response. Disabling JavaScript in your browser lets you see the underlying HTML structure of the page, making it easier to identify the elements you want to scrape.
To extract relevant information from the site's HTML, we use XPath as our selector language. You can make use of this XPath cheat sheet.
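For a feel of what such selectors look like, here are a couple of sketch XPath expressions; the class names are hypothetical placeholders, not the scraper's actual selectors:

```
//div[contains(@class, "card__menu")]//a/@href       (links to restaurant detail pages on a listing page)
//h1[@class="restaurant-details__heading--title"]    (the restaurant name on a detail page)
```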
To run all tests locally, run:
```sh
make test # go test ./... -v -count=1
```
Caching is enabled by default to avoid hammering the targeted site with too many unnecessary requests during development. After your first run, a `cache/` folder (~6GB) will be created. Subsequent runs are served from this cache and should take less than a minute to finish scraping the entire site.
To clear the cache, simply delete the `cache/` folder.
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
1. Create your feature branch (`git checkout -b feature/bar`)
2. Commit your changes (`git commit -am 'feat: add some bar'`, make sure that your commits are semantic)
3. Push to the branch (`git push origin feature/bar`)