Khalifeh AlJadda holds a Ph.D. in computer science from the University of Georgia (UGA), with a specialization in machine learning. He has experience implementing large-scale, distributed machine learning algorithms to solve challenging problems in domains ranging from bioinformatics to search and recommendation engines. He is the lead data scientist on the search data science team at CareerBuilder, one of the largest job boards in the world, where he leads the data science effort to design and implement the backend of CareerBuilder's language-agnostic semantic search engine, leveraging Apache Spark and the Hadoop ecosystem.
"If you cannot measure it, you cannot improve it." The relevance of the documents retrieved by search and recommendation engines is crucial to end users. Exposing end users to irrelevant documents is costly, since those users will turn away; therefore, companies that rely on search services strive to improve their search algorithms. Whenever an existing algorithm is tweaked or a new algorithm is implemented, an assessment is required. Most existing techniques rely on running an A/B test: a portion of the end users is exposed to the new search algorithm, and the click-through rate (CTR) of the existing algorithm is compared against that of the new one to measure the quality of each. In this talk we introduce a fully automated QA system for search and recommendation engines, which leverages implicit user feedback. The proposed system has been used successfully to assess CareerBuilder's search engine. CareerBuilder operates the largest job board in the U.S. and has an extensive and growing global presence, with millions of job postings, more than 60 million actively searchable resumes, over one billion searchable documents, and more than a million searches per hour. We implemented this system using Apache Spark, which enables us to derive implicit user feedback from about 19 million search logs and then compute the normalized discounted cumulative gain (NDCG) for different algorithms in less than 2 hours. As a result, we can report the estimated impact of proposed changes in a few hours instead of running an A/B test and waiting days to learn the impact. Given the volume of search logs that we collect every day, running this system in reasonable time requires a powerful distributed platform, and we found Apache Spark to be the best fit for our needs.
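To make the evaluation metric concrete, here is a minimal, self-contained sketch of the NDCG computation described above. The graded relevance labels and their meanings (e.g. 2 = applied, 1 = clicked, 0 = ignored) are hypothetical illustrations of labels that might be derived from implicit feedback; the actual labeling scheme and the Spark pipeline are not specified in the abstract.

```python
import math

def dcg(relevances):
    # Discounted cumulative gain: each graded relevance label is
    # discounted by the log of its rank (ranks start at 1).
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(ranked_relevances):
    # Normalize by the DCG of the ideal ordering (labels sorted descending),
    # so a perfect ranking scores 1.0.
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

# Hypothetical relevance labels for one query's result list,
# derived from implicit feedback (2 = applied, 1 = clicked, 0 = ignored).
score = ndcg([2, 0, 1, 0])
print(score)
```

In a system like the one described, this per-query score would be averaged over many queries (e.g. as a distributed aggregation in Spark) to compare two ranking algorithms.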
Understanding the types of entities expressed in a search query (company, skill, job title, etc.) enables more intelligent information retrieval based upon those entities than a traditional keyword-based search. Because search queries are typically very short, leveraging a traditional bag-of-words model to identify entity types would be inappropriate due to the lack of contextual information. We implemented a novel entity type recognition system which combines clues from sources of varying complexity in order to collect real-world knowledge about query entities. We employ distributional semantic representations of query entities through two models: 1) contextual vectors generated from encyclopedic corpora like Wikipedia, and 2) high-dimensional word embedding vectors generated from millions of job postings using Spark MLlib. To enable real-time recognition of entity types, we use Apache Solr to cache the embedding vectors generated by Spark MLlib. This approach enables us to recognize the types of entities expressed in search queries in under 60 milliseconds, making the system suitable for real-time entity type recognition.
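The classification step implied by this approach can be sketched as a nearest-neighbor comparison between an entity's embedding vector and per-type reference vectors. Everything below is illustrative: the three-dimensional vectors, the type prototypes, and the function names are invented for the example (real embeddings from Spark MLlib word2vec would be high-dimensional and served from the Solr cache).

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical reference vectors for each entity type. In production these
# would be learned embeddings cached in Solr, not hand-written 3-d vectors.
type_prototypes = {
    "job_title": [0.9, 0.1, 0.0],
    "skill":     [0.1, 0.8, 0.2],
    "company":   [0.0, 0.2, 0.9],
}

def recognize_type(entity_vector):
    # Assign the entity type whose prototype is most similar in
    # embedding space to the query entity's vector.
    return max(type_prototypes,
               key=lambda t: cosine(entity_vector, type_prototypes[t]))

print(recognize_type([0.2, 0.7, 0.1]))  # → "skill"
```

Because the similarity computation is a handful of dot products over pre-computed, cached vectors, a lookup like this comfortably fits within the sub-60-millisecond budget mentioned above.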