Konstantinos Skoularikis

Machine Learning Platform Engineer

Book Search Engine

• Team Project • Information Retrieval

Summary

A semantic search engine designed to find books through their descriptions using three distinct vectorization methodologies. The project demonstrates the progression from traditional information retrieval techniques to modern deep learning approaches.

Users can query the system with any keywords, and the results from all three methods are returned side by side for comparison. The evaluation shows that the deep learning approach (Universal Sentence Encoder) outperforms the traditional methods, because it can match a query to a book description even when the query words do not appear in the text.

Vectorization Methodologies

1. TF-IDF (Term Frequency-Inverse Document Frequency)

Traditional statistical approach that weights terms based on their frequency in documents and rarity across the corpus.
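
The weighting described above can be sketched in plain Python. This is a minimal illustration and not the project's implementation: the whitespace tokenizer, the smoothed IDF variant, and the function names (`tfidf_vectors`, `search_tfidf`) are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption for this sketch).
    return text.lower().split()


def tfidf_vectors(docs):
    """Return (idf, vectors): an IDF table and one sparse TF-IDF dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF, one common variant; rarer terms get larger weights.
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (cnt / len(toks)) * idf[t] for t, cnt in tf.items()})
    return idf, vectors


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def search_tfidf(query, docs):
    """Rank documents by cosine similarity to the query; returns (score, index) pairs."""
    idf, vectors = tfidf_vectors(docs)
    # Query terms that never occur in the corpus are ignored.
    q_tf = Counter(t for t in tokenize(query) if t in idf)
    q_len = sum(q_tf.values())
    q_vec = {t: (c / q_len) * idf[t] for t, c in q_tf.items()} if q_len else {}
    scores = [(cosine(q_vec, v), i) for i, v in enumerate(vectors)]
    return sorted(scores, reverse=True)
```

Note that a query whose words never occur in any description scores zero against every document, which is the limitation the semantic approach addresses.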

2. BM25 (Elasticsearch)

Probabilistic ranking function based on the bag-of-words model, considered a more sophisticated evolution of TF-IDF.
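
Elasticsearch computes BM25 internally, so the project would not need to implement it by hand. To illustrate what the ranking function does, here is a minimal sketch of the standard Okapi BM25 formula. The function name `bm25_scores` and the whitespace tokenizer are assumptions for the example; the defaults `k1=1.2` and `b=0.75` match Elasticsearch's default BM25 settings.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    if n == 0:
        return []
    # Average document length, used for length normalization.
    avgdl = sum(len(toks) for toks in tokenized) / n
    # Document frequency: number of documents that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # IDF with +1 inside the log so it never goes negative.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Compared with plain TF-IDF, repeated occurrences of a term contribute less and less to the score, and long descriptions are penalized relative to the corpus average.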

3. Universal Sentence Encoder (Google)

Deep learning model that captures semantic meaning, enabling matches based on conceptual similarity rather than exact keyword matching. Best-performing method of the three.
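
Once descriptions and queries are embedded, retrieval reduces to a nearest-neighbour search over vectors. The sketch below shows that ranking step with cosine similarity. It is an illustration under stated assumptions: the function names are hypothetical, and the three-dimensional vectors are hand-made stand-ins, not real encoder outputs (the Universal Sentence Encoder produces 512-dimensional vectors).

```python
import math

# In a real pipeline the vectors would come from the encoder, for example
# (assumed model version, requires tensorflow and tensorflow_hub):
#   import tensorflow_hub as hub
#   embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
#   doc_vecs = embed(descriptions).numpy()


def cosine_similarity(u, v):
    # Cosine of the angle between two dense vectors; 0.0 if either is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def rank_by_embedding(query_vec, doc_vecs, top_k=3):
    """Return the top_k (similarity, index) pairs, most similar first."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    return sorted(scored, reverse=True)[:top_k]


# Toy stand-in embeddings, one per book description (not real model output).
doc_vecs = [
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.0, 0.2, 0.9],
]
query_vec = [0.8, 0.2, 0.1]  # toy stand-in for an embedded query
best_score, best_index = rank_by_embedding(query_vec, doc_vecs)[0]
```

Because similarity is computed between vectors and not between word sets, a query can rank a description highly without sharing any words with it.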

Data Source

Dataset acquired from Kaggle's Goodreads Best Books collection:

📊 View Dataset on Kaggle

Personal Contributions

  • Collected and preprocessed book description data from Kaggle into an appropriate format for analysis
  • Loaded and integrated the Universal Sentence Encoder from TensorFlow Hub to generate semantic embeddings for book descriptions
  • Led a team of three students, delegating tasks based on each member's technical strengths
  • Integrated all three vectorization models with an Elasticsearch backend
  • Built a Flask web application with an intuitive UI to demonstrate the system's capabilities to both technical and non-technical stakeholders

Team Members

Technology Stack

Python TensorFlow Keras Flask Elasticsearch Pandas Scikit-Learn NumPy

Code & Resources

💻 View on GitHub