The Expertise Locator

The company wants to improve collaboration across its departments and reduce attrition during onboarding.

The organization is distributed across an entire continent, where connectivity is sometimes poor.

The organization’s public websites host thousands of PDFs (project TARs and RRPs, briefs and papers, …), blog posts, author biographies, and more.

We converted unstructured and semi-structured data into structured data, extracting publication dates, author names, countries, and body text. We then mapped the recognized author names to a primary source of truth.
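
The name-mapping step can be illustrated with a minimal sketch. It is not the project's actual implementation: it assumes Latin-script names, a hypothetical `reference` dictionary standing in for the source of truth, and an arbitrary similarity cutoff of 0.85 that would need tuning on real data. It uses only the Python standard library (`unicodedata`, `difflib`).

```python
import difflib
import re
import unicodedata
from typing import Dict, Optional


def normalize_name(name: str) -> str:
    """Reduce a name to a comparable key: no accents, no punctuation, sorted tokens."""
    # Split accented characters into base letter + combining mark, then drop the marks.
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Assumption: Latin-script names only; any other character is discarded here.
    tokens = re.findall(r"[a-z]+", without_marks.casefold())
    # Sorting makes "Garcia, Jose" and "Jose Garcia" produce the same key.
    return " ".join(sorted(tokens))


def match_author(raw_name: str, reference: Dict[str, str],
                 cutoff: float = 0.85) -> Optional[str]:
    """Return the author ID for raw_name, or None if nothing is close enough.

    `reference` is a hypothetical mapping from normalized key to author ID.
    """
    key = normalize_name(raw_name)
    if key in reference:
        return reference[key]
    # Fall back to fuzzy matching for typos and minor spelling variants.
    close = difflib.get_close_matches(key, list(reference), n=1, cutoff=cutoff)
    return reference[close[0]] if close else None


# Illustrative data only; the names and IDs are invented for this example.
reference = {
    normalize_name("José García"): "A1",
    normalize_name("Anna Müller"): "A2",
}
```

With this setup, `match_author("GARCIA, Jose", reference)` resolves through the exact normalized key, while a misspelling such as `"Anna Muler"` falls through to the fuzzy match. Names that match nothing return `None` and can be queued for manual review.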

Challenges:

map the orthographic variants of an author's name to a single entry in a normalized database
determine the best paragraph size before converting each paragraph to an embedding
maintain consistent quality across a corpus of thousands of documents published over the last 30 years
keep the vector database fast enough to serve queries from thousands of users
keep the bundle as small as possible so that it loads reliably from all around the globe, even over poor connections
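
For the paragraph-size challenge, one common approach is to split text into fixed-size, overlapping windows and compare retrieval quality across several sizes. The sketch below only illustrates that idea: the function name `chunk_words` and the default values of 200 words and 40 words of overlap are assumptions, not the values used in this project.

```python
from typing import List


def chunk_words(text: str, max_words: int = 200, overlap: int = 40) -> List[str]:
    """Split text into chunks of at most max_words words.

    Consecutive chunks share `overlap` words, so a sentence cut at a
    boundary still appears whole in at least one chunk.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks
```

Keeping the size and overlap as parameters makes it straightforward to embed the same documents at several candidate sizes and measure which one returns the most relevant passages.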