Objective

Determine whether we can identify the concepts represented in a delimited portion of a high-dimensional embedding space.

Context

We are analyzing a corpus of 600k sentences drawn from public documents of the Asian Development Bank. The documents have been split into sentences and vectorized using the OpenAI embeddings API. The resulting vectors, along with the sentence text and the paper uid, are stored in a table called sentencev.
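
For reference, here is a minimal sketch of what the sentencev table could look like; the pgvector extension, the column types, and the 1536-dimensional embedding are assumptions, only the column names come from the queries below.

-- sketch of the sentencev table; types and vector dimension are assumptions
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE sentencev (
    uid       uuid PRIMARY KEY,
    text      text NOT NULL,
    paper_uid uuid NOT NULL,
    vector    vector(1536)
);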

Process

We select a random point in the space, compute the closest points in the vector space to constitute a sub-corpus, compute the frequency of words in that sub-corpus, and evaluate the effectiveness of the process.

Conclusion

Using L2 distance, it is possible to get a list of the closest sentences to a given sentence, even in a high-dimensional space. For research purposes, we arbitrarily chose a distance of 0.6 as the threshold for similarity. Experimental analysis of the L2 distance in pgvector shows that the distance between a vector and itself is 0, while all other vectors lie at a distance greater than 0.5. Visual inspection confirms that the sentences are semantically close, even at the larger distances (i.e. near the 0.6 threshold). The sentences are scattered across a set of papers, which is expected. Once stopwords have been removed, the word frequencies over the sub-corpus show that the most frequent words capture the semantics of the sentences. The performance of the L2 distance queries can be improved.
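
One way the L2 queries could be sped up, as a sketch rather than a tested recommendation, is an approximate nearest-neighbour index; pgvector provides IVFFlat indexes for L2 distance (the lists value below is an illustrative assumption).

-- sketch: approximate nearest-neighbour index for L2 distance;
-- the lists parameter is an assumption and should be tuned on the real data
CREATE INDEX ON sentencev USING ivfflat (vector vector_l2_ops) WITH (lists = 100);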

Next steps

Cluster all the points in the vector space. Randomly choose a point in the space that has not yet been categorized. Analyze the distribution of points across the clusters. Determine the number of clusters. Determine a threshold below which it is not worth continuing the clustering.
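
As a sketch of the assignment step, assuming a hypothetical centroids table (uid, vector) produced by an external clustering pass, each sentence could be assigned to its nearest centroid with the same L2 operator.

-- sketch: assign each sentence to its nearest centroid by L2 distance;
-- the centroids table is hypothetical and would be produced by the clustering step
SELECT
    s.uid,
    (SELECT c.uid
     FROM centroids AS c
     ORDER BY c.vector <-> s.vector
     LIMIT 1) AS cluster_uid
FROM sentencev AS s;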

References

L2 distance is meaningless in higher-dimensional spaces (where n > 9)

How to cluster in higher dimensions

SQL Instructions

-- `sentencev` table structure: uid, text, paper_uid, vector
-- pick a random sentence to serve as the seed of the sub-corpus
SELECT uid FROM sentencev
ORDER BY RANDOM()
LIMIT 1;

-- analyze the distance of all other vectors from the current one
SELECT 
    other_vectors.uid,
    target_vector.vector <-> other_vectors.vector AS distance
FROM 
    sentencev AS other_vectors,
    (SELECT vector FROM sentencev WHERE uid = 'a4d09d88-6e93-565f-02aa-b2ca230216a1') AS target_vector
ORDER BY 
    distance
LIMIT 100;

-- how many sentences and papers are included in the given subspace?
SELECT count(*), count(DISTINCT paper_uid)
FROM sentencev
WHERE vector <-> (SELECT vector FROM sentencev where uid = '8af34aad-512c-2d77-d9bd-c5cc25224494') < 0.6;

-- visual inspection: are the sentences semantically close?
WITH 
    vector1 AS (SELECT uid, vector FROM sentencev WHERE uid = '8af34aad-512c-2d77-d9bd-c5cc25224494')
SELECT text, paper_uid, uid, vector <-> (SELECT vector FROM vector1) as distance
FROM sentencev
WHERE vector <-> (SELECT vector FROM vector1) < 0.6
ORDER BY distance;

-- count the most frequent words across the sentences; defect: stopwords
WITH
    texts AS (SELECT uid, text
    FROM sentencev
    WHERE vector <-> (SELECT vector FROM sentencev where uid = '8af34aad-512c-2d77-d9bd-c5cc25224494') < 0.6
)
SELECT 
    word,
    COUNT(*) AS frequency
FROM 
    (SELECT regexp_split_to_table(text, E'\\s+') AS word FROM texts) AS words
GROUP BY 
    word
ORDER BY 
    frequency DESC;

-- materialize the sub-corpus as a table for further analysis
CREATE TABLE texts_1 AS (
    SELECT uid, text, paper_uid
    FROM sentencev
    WHERE vector <-> (SELECT vector FROM sentencev WHERE uid = '8af34aad-512c-2d77-d9bd-c5cc25224494') < 0.6
);

-- advantage: stop words are removed; defect: stemming turns 'training' into 'train'
SELECT 
    word,
    COUNT(*) AS frequency
FROM 
    (SELECT unnest(tsvector_to_array(to_tsvector('english', text))) AS word FROM texts_1) AS words
GROUP BY 
    word
ORDER BY 
    frequency DESC;
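
-- alternative sketch (not part of the original run): Postgres's built-in ts_stat
-- computes lexeme frequencies directly from the tsvector; nentry is the total
-- number of occurrences, ndoc the number of sentences containing the lexeme
SELECT word, ndoc, nentry
FROM ts_stat('SELECT to_tsvector(''english'', text) FROM texts_1')
ORDER BY nentry DESC;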

Sample

word,frequency
train,579
employ,444
worker,398
skill,335
educ,308
job,239
inform,221
women,216
work,191
least,175
labor,134
sector,125
survey,122
increas,115
peopl,109
school,107
countri,106
report,105
baselin,105
particip,104
level,101
self,100
formal,98
graduat,96
program,95
industri,95
year,92
higher,91
10,90
base,89
also,84
figur,84
develop,84
employe,83
learn,82
high,82
workforc,82
famili,80
provid,80
self-employ,80
household,79
tvet,79
technic,78
receiv,78
occup,77
20,76
share,74
busi,74
firm,73

Involving GitHub Copilot

Represent the first 50 terms on the list with a sentence