ML Data Sets

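To estimate how well a model will generalize to new cases, the data is split into a training set, a validation set and a test set. The model is fit on the training set, the validation set is used to compare candidate models and tune hyperparameters, and the test set is used only at the end to estimate the error on new cases.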

It is common to use 80% of the data for training and hold out 20% for testing.
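
A minimal sketch of such a split, using only the Python standard library. The function name train_test_split, the test_ratio and seed parameters, and the toy data are assumptions made for illustration, not part of any particular library.

```python
import random

def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of rows and hold out a test_ratio fraction for testing."""
    rng = random.Random(seed)       # fixed seed so the split is reproducible
    shuffled = list(rows)           # copy, so the caller's data is not reordered
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    split_at = len(shuffled) - n_test
    return shuffled[:split_at], shuffled[split_at:]

# Toy data: 100 placeholder samples.
data = list(range(100))
train_set, test_set = train_test_split(data)
print(len(train_set), len(test_set))  # 80 20
```

Fixing the seed keeps the same samples in the test set across runs, so the model is never accidentally evaluated on data it was trained on in an earlier run.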

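In holdout validation, a validation set is randomly held out from the training set and used to evaluate candidate models. In cross-validation, the training set is instead split into several subsets (folds): the model is trained on all folds but one and evaluated on the remaining fold, and this is repeated so that each fold serves once as the validation set. Averaging the results gives a more reliable estimate than a single held-out set, at the cost of training the model several times.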
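
A minimal sketch of k-fold cross-validation, one common form of cross-validation, in which every sample is held out for validation exactly once. The helper name k_fold_indices and its parameters are assumptions made for illustration. It only produces index lists; training and scoring a model on each split is left out.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) pairs, one pair per fold."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx

# With 10 samples and 5 folds, each split trains on 8 samples and validates on 2.
for fold_number, (train_idx, val_idx) in enumerate(k_fold_indices(10, k=5)):
    print(fold_number, len(train_idx), len(val_idx))
```

In use, a model would be trained on train_idx and scored on val_idx for each pair, and the k scores averaged.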

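The No Free Lunch Theorem (Wolpert, 1996) states that if you make no assumptions about the data, there is no reason to prefer one model over any other: no model is guaranteed a priori to work best on a given dataset. The only way to know for sure which model is best is to evaluate them all. Since that is not feasible, in practice you make reasonable assumptions about the data and evaluate only a few reasonable models.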
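
A minimal sketch of choosing between candidate models by evaluating each on held-out data. The two toy models, the synthetic data (y = 2x plus noise) and all names here are assumptions made for illustration. Which model scores best depends on the data, which is the point of evaluating rather than assuming.

```python
import random

def mean_model(train):
    """Baseline: always predict the mean training target."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def nearest_neighbor_model(train):
    """Predict the target of the training point whose x is closest."""
    return lambda x: min(train, key=lambda pair: abs(pair[0] - x))[1]

def mean_squared_error(model, rows):
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)

# Synthetic data, assumed for illustration only: y = 2x plus small uniform noise.
rng = random.Random(0)
rows = [(x, 2 * x + rng.uniform(-1, 1)) for x in range(50)]
rng.shuffle(rows)
train, held_out = rows[:40], rows[40:]

# Fit every candidate on the training rows and score it on the held-out rows.
candidates = {
    "mean baseline": mean_model,
    "1-nearest neighbor": nearest_neighbor_model,
}
scores = {name: mean_squared_error(fit(train), held_out)
          for name, fit in candidates.items()}
best = min(scores, key=scores.get)
print(scores)
print("lowest held-out error:", best)
```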

References

The Unreasonable Effectiveness of Data (googleusercontent.com)
The Lack of A Priori Distinctions Between Learning Algorithms - Google Scholar
UCI Machine Learning Repository
Kaggle
AWS
Data Portals
Open Data Monitor
Retail Trading Activity Tracker: Keep track of retail sentiment (nasdaq.com)
Wikipedia
Where can I find large datasets open to the public? - Quora
Datasets (reddit.com)