Venue photos are a rapidly growing type of multimedia content on the Internet, as users like to photograph the venues where they spend time and share with friends what impressed them there. Discovering a venue from a social photo is a useful complement to venue retrieval and recommendation. However, little research has focused on fine-grained venue discovery that leverages multimodal venue datasets. In this paper, we present the first multimodal dataset built specifically for venue discovery, comprising venue photos, descriptions, and categories. Using this dataset, we propose a novel framework for fine-grained venue discovery that correlates venue photos and descriptions, aiming to learn VenueNet, a knowledge base that associates venues with their properties across modalities. In the training phase, visual and textual features of the same venues are mapped by two sub-networks into a shared semantic space, where canonical correlation analysis (CCA) is applied to these features to train the two sub-networks. In the query phase, given a photo, its correlation with the textual features in the dataset is analyzed to find the most similar venue. Experimental results verify the practicability of the Deep CCA model for fine-grained venue discovery from a large-scale multimodal dataset.
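The training and query phases described above can be illustrated with a minimal sketch of classical (linear) CCA followed by cross-modal retrieval. This is not the paper's Deep CCA model — the two sub-networks are replaced here by identity features, and all function names, dimensions, and the synthetic data are illustrative assumptions; the sketch only shows the underlying principle of projecting two modalities into a shared space and ranking by similarity there.

```python
import numpy as np

def cca(X, Y, k=2, reg=1e-3):
    """Classical CCA: find per-view projections Wx, Wy that maximize the
    correlation between the projected views. X and Y hold one row per venue
    (e.g. a photo feature and a description feature for the same venue).
    `reg` is a small ridge term for numerical stability (an assumption)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    # Eigenvectors of Cxx^{-1} Cxy Cyy^{-1} Cyx give the X-side directions;
    # the eigenvalues are the squared canonical correlations.
    M = np.linalg.solve(Cxx, Cxy) @ np.linalg.solve(Cyy, Cxy.T)
    vals, vecs = np.linalg.eig(M)
    order = np.argsort(-vals.real)[:k]
    Wx = vecs[:, order].real
    Wy = np.linalg.solve(Cyy, Cxy.T) @ Wx  # matching Y-side directions
    return Wx, Wy

def retrieve(photo_feat, text_feats, Wx, Wy):
    """Query phase: project one photo feature and all venue text features
    into the shared space, then rank venues by cosine similarity."""
    q = photo_feat @ Wx
    T = text_feats @ Wy
    q = q / np.linalg.norm(q)
    T = T / np.linalg.norm(T, axis=1, keepdims=True)
    return np.argsort(-(T @ q))  # venue indices, best match first

# Synthetic two-view data sharing a latent "venue identity" factor.
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 4))                                  # shared latent
X = Z @ rng.normal(size=(4, 10)) + 0.1 * rng.normal(size=(200, 10))  # "visual"
Y = Z @ rng.normal(size=(4, 8)) + 0.1 * rng.normal(size=(200, 8))    # "textual"
Wx, Wy = cca(X, Y, k=2)
ranking = retrieve(X[0] - X.mean(axis=0), Y - Y.mean(axis=0), Wx, Wy)
```

In the deep variant, the correlation objective stays the same, but `X` and `Y` are the outputs of the two trainable sub-networks rather than raw features, so the mapping into the shared space is learned end to end.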