Daily Practice | Data Scientist & Business Analyst & Leetcode Interview Question 1068

2021-03-02 大數據應用 (Big Data Applications)

Latent semantic indexing:

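Latent semantic indexing (LSI), also known as latent semantic analysis (LSA), is closely related to Principal Component Analysis (PCA) applied to document analysis. It computes a singular value decomposition (SVD) of the term-document matrix X, which amounts to PCA on the variance-covariance matrix of X when X is mean-centered (LSA usually skips the centering to keep X sparse). The principal directions (eigenvectors) then define topics.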
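
The eigenvectors referred to above can be related to a singular value decomposition (SVD) of X. The following is a sketch, assuming X has one row per term and one column per document; the symbols U, Σ and V are introduced here for illustration and do not appear in the original text.

```latex
X = U \Sigma V^{\top}, \qquad
X X^{\top} = U \,\Sigma \Sigma^{\top}\, U^{\top}, \qquad
X^{\top} X = V \,\Sigma^{\top} \Sigma\, V^{\top}
```

Under this assumption, the columns of U are eigenvectors of X X^T (term-by-term) and the columns of V are eigenvectors of X^T X (document-by-document), with eigenvalues equal to the squared singular values. X X^T matches the variance-covariance matrix of the terms, up to a constant factor, only if the mean of each row of X has been subtracted first.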

It uses a term-document matrix X that describes the occurrences of terms in documents. Rows correspond to terms (the vocabulary) and columns correspond to documents. Elements of X are typically weights proportional to the number of times a term appears in a document, with rare terms upweighted to reflect their relative importance (tf-idf is a common choice). The matrix X is usually large and sparse.
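
A minimal sketch of how such a matrix could be built, assuming whitespace tokenization and the plain weighting count × log(N / document frequency), which is one of several tf-idf variants. The function name `term_document_matrix` and the toy corpus are hypothetical, and a dense list of lists is used only for readability; a real corpus would need a sparse format.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Return (vocab, X) where X[i][j] is the weight of term i in document j."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for doc in tokenized for term in doc})
    n_docs = len(docs)

    # Document frequency: in how many documents each term occurs at least once.
    df = Counter(term for doc in tokenized for term in set(doc))
    counts = [Counter(doc) for doc in tokenized]

    # Rows are terms, columns are documents. Rare terms get a larger idf factor.
    X = [
        [counts[j][term] * math.log(n_docs / df[term]) for j in range(n_docs)]
        for term in vocab
    ]
    return vocab, X


# Hypothetical toy corpus, for illustration only.
docs = ["cat sat mat", "cat chased dog", "dog sat"]
vocab, X = term_document_matrix(docs)
```

With this weighting, "mat" (which occurs in one document) receives a larger weight in the first document than "cat" (which occurs in two), and a term that occurs in every document would receive weight zero.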

LSA finds a low-rank approximation of the original term-document matrix, typically by keeping only the k largest singular values of its singular value decomposition, which merges the dimensions of terms that have similar meanings.
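
The low-rank approximation can be sketched with a truncated singular value decomposition in NumPy. This is an illustration under stated assumptions rather than a production implementation: the helper name `low_rank_approximation`, the choice k = 2, and the toy matrix with its term labels are all hypothetical, and a large sparse X would call for a sparse solver instead of a dense SVD.

```python
import numpy as np


def low_rank_approximation(X, k):
    """Return (U_k, s_k, Vt_k, X_k), where X_k is the rank-k truncation of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    X_k = U_k @ np.diag(s_k) @ Vt_k
    return U_k, s_k, Vt_k, X_k


# Hypothetical toy term-document matrix (4 terms x 4 documents, raw counts).
X = np.array([
    [2.0, 1.0, 0.0, 0.0],  # "car"
    [1.0, 2.0, 0.0, 0.0],  # "automobile"
    [0.0, 0.0, 1.0, 2.0],  # "banana"
    [0.0, 0.0, 2.0, 1.0],  # "fruit"
])

U_k, s_k, Vt_k, X_k = low_rank_approximation(X, k=2)

# Each row is one term expressed in the 2-dimensional latent space.
term_vectors = U_k * s_k
```

In this toy example the rows of `term_vectors` for "car" and "automobile" coincide, as do those for "banana" and "fruit", which is the merging of similar terms described above.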

What is it used for:

LSA can be used to compare documents in the low-dimensional space (document clustering and classification), find relations between terms (synonym identification), and find matching documents by translating a query of terms into the low-dimensional space (information retrieval), among other tasks.
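
The retrieval use case can be sketched as follows: a query is treated as a short document, projected into the latent space, and compared with the document vectors by cosine similarity. The projection q̂ = Σ_k⁻¹ U_kᵀ q is the standard folding-in formula; the function names, the toy matrix and the query are hypothetical and serve only as an illustration.

```python
import numpy as np


def fit_lsa(X, k):
    """Return U_k, s_k and the document vectors (one row per document)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :].T


def fold_in(query_counts, U_k, s_k):
    """Project a term-count vector into the k-dimensional latent space."""
    return (U_k.T @ query_counts) / s_k


def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def rank_documents(query_counts, U_k, s_k, doc_vectors):
    """Return (document indices sorted by decreasing similarity, scores)."""
    q = fold_in(query_counts, U_k, s_k)
    scores = [cosine(q, d) for d in doc_vectors]
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order, scores


# Hypothetical toy term-document matrix (rows: car, automobile, banana, fruit).
X = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0, 1.0],
])
U_k, s_k, doc_vectors = fit_lsa(X, k=2)

# Query containing only the term "automobile".
query = np.array([0.0, 1.0, 0.0, 0.0])
order, scores = rank_documents(query, U_k, s_k, doc_vectors)
```

Here the two vehicle documents rank ahead of the two fruit documents. The same `cosine` helper could be applied to pairs of rows in `doc_vectors` to compare documents with each other.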

Limitations include: 

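The resulting dimensions can be difficult to interpret.

LSA cannot capture multiple meanings of a word (polysemy), because each term is mapped to a single point in the latent space.

Word order is ignored: each document is represented as an unordered bag of words.

Eigenvectors can have negative components, which have no natural interpretation as term weights in a topic.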

Reference:

https://en.wikipedia.org/wiki/Latent_semantic_analysis
