
"""This module implements word vectors, and more generally sets of vectors keyed by lookup tokens/ints,
 and various similarity look-ups.

Since trained word vectors are independent from the way they were trained (:class:`~gensim.models.word2vec.Word2Vec`,
:class:`~gensim.models.fasttext.FastText` etc), they can be represented by a standalone structure,
as implemented in this module.

The structure is called "KeyedVectors" and is essentially a mapping between *keys*
and *vectors*. Each vector is identified by its lookup key, most often a short string token, so this is usually
a mapping between {str => 1D numpy array}.

The key is, in the original motivating case, a word (so the mapping maps words to 1D vectors),
but for some models, the key can also correspond to a document, a graph node etc.

(Because some applications may maintain their own integral identifiers, compact and contiguous
starting at zero, this class also supports use of plain ints as keys – in that case using them as literal
pointers to the position of the desired vector in the underlying array, and saving the overhead of
a lookup map entry.)
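
The int-as-key behaviour can be sketched with plain numpy — a hypothetical ``vectors`` array and
``key_to_index`` map below (names chosen to mirror this class's attributes) are illustrative, not
the real implementation:

```python
import numpy as np

# Backing 2D array: one row per vector, standing in for KeyedVectors.vectors.
vectors = np.arange(12, dtype=np.float32).reshape(4, 3)

# String keys go through a lookup map (cf. KeyedVectors.key_to_index)...
key_to_index = {"apple": 0, "banana": 1}

def lookup(key):
    # ...while plain ints act as literal positions in the backing array,
    # skipping the lookup-map entry (and its memory overhead) entirely.
    if isinstance(key, (int, np.integer)):
        return vectors[key]
    return vectors[key_to_index[key]]

assert (lookup("banana") == lookup(1)).all()
```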

Why use KeyedVectors instead of a full model?
=============================================

+---------------------------+--------------+------------+-------------------------------------------------------------+
|        capability         | KeyedVectors | full model |                               note                          |
+---------------------------+--------------+------------+-------------------------------------------------------------+
| continue training vectors | ❌           | ✅         | You need the full model to train or update vectors.         |
+---------------------------+--------------+------------+-------------------------------------------------------------+
| smaller objects           | ✅           | ❌         | KeyedVectors are smaller and need less RAM, because they    |
|                           |              |            | don't need to store the model state that enables training.  |
+---------------------------+--------------+------------+-------------------------------------------------------------+
| save/load from native     |              |            | Vectors exported by the Facebook and Google tools           |
| fasttext/word2vec format  | ✅           | ❌         | do not support further training, but you can still load     |
|                           |              |            | them into KeyedVectors.                                     |
+---------------------------+--------------+------------+-------------------------------------------------------------+
| append new vectors        | ✅           | ✅         | Add new-vector entries to the mapping dynamically.          |
+---------------------------+--------------+------------+-------------------------------------------------------------+
| concurrency               | ✅           | ✅         | Thread-safe, allows concurrent vector queries.              |
+---------------------------+--------------+------------+-------------------------------------------------------------+
| shared RAM                | ✅           | ✅         | Multiple processes can re-use the same data, keeping only   |
|                           |              |            | a single copy in RAM using                                  |
|                           |              |            | `mmap <https://en.wikipedia.org/wiki/Mmap>`_.               |
+---------------------------+--------------+------------+-------------------------------------------------------------+
| fast load                 | ✅           | ✅         | Supports `mmap <https://en.wikipedia.org/wiki/Mmap>`_       |
|                           |              |            | to load data from disk instantaneously.                     |
+---------------------------+--------------+------------+-------------------------------------------------------------+

TL;DR: the main difference is that KeyedVectors do not support further training.
On the other hand, by shedding the internal data structures necessary for training, KeyedVectors offer a smaller RAM
footprint and a simpler interface.
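
The "shared RAM" and "fast load" rows in the table rely on memory-mapping: the OS maps the vector
file into each process's address space and keeps a single physical copy of the pages. The same
mechanism can be seen with numpy alone (the file name below is illustrative; with Gensim you would
typically pass ``mmap='r'`` to ``KeyedVectors.load``):

```python
import os
import tempfile

import numpy as np

# Write a vector array to disk, then map it back without reading it into RAM.
path = os.path.join(tempfile.mkdtemp(), "vectors.npy")
np.save(path, np.ones((1000, 100), dtype=np.float32))

# mmap_mode="r" returns a read-only np.memmap backed by the file's pages;
# multiple processes mapping the same file share one physical copy.
vecs = np.load(path, mmap_mode="r")
print(vecs.shape)  # (1000, 100)
```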

How to obtain word vectors?
===========================

Train a full model, then access its `model.wv` property, which holds the standalone keyed vectors.
For example, using the Word2Vec algorithm to train the vectors

.. sourcecode:: pycon

    >>> from gensim.test.utils import lee_corpus_list
    >>> from gensim.models import Word2Vec
    >>>
    >>> model = Word2Vec(lee_corpus_list, vector_size=24, epochs=100)
    >>> word_vectors = model.wv

Persist the word vectors to disk with

.. sourcecode:: pycon

    >>> from gensim.models import KeyedVectors
    >>>
    >>> word_vectors.save('vectors.kv')
    >>> reloaded_word_vectors = KeyedVectors.load('vectors.kv')

The vectors can also be instantiated from an existing file on disk
in the original Google's word2vec C format as a KeyedVectors instance

.. sourcecode:: pycon

    >>> from gensim.test.utils import datapath
    >>>
    >>> wv_from_text = KeyedVectors.load_word2vec_format(datapath('word2vec_pre_kv_c'), binary=False)  # C text format
    >>> wv_from_bin = KeyedVectors.load_word2vec_format(datapath("euclidean_vectors.bin"), binary=True)  # C bin format

What can I do with word vectors?
================================

You can perform various syntactic/semantic NLP word tasks with the trained vectors.
Some of them are already built-in

.. sourcecode:: pycon

    >>> import gensim.downloader as api
    >>>
    >>> word_vectors = api.load("glove-wiki-gigaword-100")  # load pre-trained word-vectors from gensim-data
    >>>
    >>> # Check the "most similar words", using the default "cosine similarity" measure.
    >>> result = word_vectors.most_similar(positive=['woman', 'king'], negative=['man'])
    >>> most_similar_key, similarity = result[0]  # look at the first match
    >>> print(f"{most_similar_key}: {similarity:.4f}")
    queen: 0.7699
    >>>
    >>> # Use a different similarity measure: "cosmul".
    >>> result = word_vectors.most_similar_cosmul(positive=['woman', 'king'], negative=['man'])
    >>> most_similar_key, similarity = result[0]  # look at the first match
    >>> print(f"{most_similar_key}: {similarity:.4f}")
    queen: 0.8965
    >>>
    >>> print(word_vectors.doesnt_match("breakfast cereal dinner lunch".split()))
    cereal
    >>>
    >>> similarity = word_vectors.similarity('woman', 'man')
    >>> similarity > 0.8
    True
    >>>
    >>> result = word_vectors.similar_by_word("cat")
    >>> most_similar_key, similarity = result[0]  # look at the first match
    >>> print(f"{most_similar_key}: {similarity:.4f}")
    dog: 0.8798
    >>>
    >>> sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
    >>> sentence_president = 'The president greets the press in Chicago'.lower().split()
    >>>
    >>> similarity = word_vectors.wmdistance(sentence_obama, sentence_president)
    >>> print(f"{similarity:.4f}")
    3.4893
    >>>
    >>> distance = word_vectors.distance("media", "media")
    >>> print(f"{distance:.1f}")
    0.0
    >>>
    >>> similarity = word_vectors.n_similarity(['sushi', 'shop'], ['japanese', 'restaurant'])
    >>> print(f"{similarity:.4f}")
    0.7067
    >>>
    >>> vector = word_vectors['computer']  # numpy vector of a word
    >>> vector.shape
    (100,)
    >>>
    >>> vector = word_vectors.get_vector('office', norm=True)
    >>> vector.shape
    (100,)
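
Under the hood, the default ``similarity()`` measure is plain cosine similarity between the two
stored vectors, which is easy to reproduce with numpy alone (toy vectors below, not trained
embeddings):

```python
import numpy as np

def cosine_similarity(a, b):
    # What similarity('woman', 'man') computes for the two stored vectors:
    # the dot product of their L2-normalized forms.
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(round(cosine_similarity([1.0, 0.0], [1.0, 1.0]), 4))  # 0.7071
```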

Correlation with human opinion on word similarity

.. sourcecode:: pycon

    >>> from gensim.test.utils import datapath
    >>>
    >>> similarities = model.wv.evaluate_word_pairs(datapath('wordsim353.tsv'))

And on word analogies

.. sourcecode:: pycon

    >>> analogy_scores = model.wv.evaluate_word_analogies(datapath('questions-words.txt'))

and so on.
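
The analogy evaluation builds on simple vector arithmetic (the "offset method"). A minimal sketch
with hand-picked toy vectors — a real model learns its vectors from a corpus; the tiny vocabulary
here is purely illustrative:

```python
import numpy as np

# Hand-picked toy "embeddings"; a trained model would learn these from data.
vocab = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.1, 0.0]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "apple": np.array([0.9, 0.0, 0.0]),
}

def unit(v):
    return v / np.linalg.norm(v)

# The offset method behind most_similar(positive=['woman', 'king'], negative=['man']):
# add the positive unit vectors, subtract the negative ones, then rank the
# remaining words by cosine similarity against the combined vector.
target = unit(vocab["king"]) - unit(vocab["man"]) + unit(vocab["woman"])
candidates = [w for w in vocab if w not in {"king", "man", "woman"}]
best = max(candidates, key=lambda w: float(np.dot(unit(vocab[w]), unit(target))))
print(best)  # queen
```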

    N)Integral)Iterable)dotfloat32doublezerosvstackndarraysumprodargmaxdtypeascontiguousarray
frombuffer)stats)cdist)utilsmatutils)
Dictionary)
deprecatedc                    | g S t          | t                    s-t          | t                    rt          | j                  dk    r| gS t          | t                    r't          | j                  dk    rt          |           S | S )zEnsure that the specified value is wrapped in a list, for those supported cases
    where we also accept a single key or vector.N      )
isinstance
_KEY_TYPESr
   lenshapelist)values    :lib/python3.11/site-packages/gensim/models/keyedvectors.py_ensure_listr!      s      	%$$ E7)C)C EKHXHX\]H] w%!! c%+&6&6!&; E{{L    c            	       d    e Zd Zdej        dfdZd Z fdZd ZdTdZ	d Z
d	 ZdUd
Zd Zd ZdVdZdWdZ ed          d             ZdXdZd ZdYdZd Zd Zd Zd Zd Z ed          d             Zd Zed             Zej        d             Zd  Z dWd!Z!ed"             Z"e"j        d#             Z"ed$             Z#e#j        d%             Z#ed&             Z$e$j        d'             Z$d( Z% fd)Z&	 	 dZd+Z'd[d,Z(d[d-Z)d[d.Z*d\d/Z+	 d]d0Z,d\d1Z-d2 Z.e/d3             Z0d^d5Z1d6 Z2d7 Z3d8 Z4e/d9             Z5	 	 d_d<Z6e/d=             Z7e/d>             Z8	 	 d`dAZ9 edB          dWdC            Z:dD Z;dadEZ<	 	 dbdHZ=e>ddd@dIde?dfdJ            Z@dcdLZA	 	 dddMeBdNeCdOeCdPd fdQZDdR ZEdS ZF xZGS )eKeyedVectorsr   Nc                     || _         dg|z  | _        d| _        i | _        t	          ||f|          | _        d| _        i | _        || _        dS )a  Mapping between keys (such as words) and vectors for :class:`~gensim.models.Word2Vec`
        and related models.

        Used to perform operations on the vectors such as vector lookup, distance, similarity etc.

        To support the needs of specific models and other downstream uses, you can also set
        additional attributes via the :meth:`~gensim.models.keyedvectors.KeyedVectors.set_vecattr`
        and :meth:`~gensim.models.keyedvectors.KeyedVectors.get_vecattr` methods.
        Note that all such attributes under the same `attr` name must have compatible `numpy`
        types, as the type and storage array for such attributes is established by the 1st time such
        `attr` is set.

        Parameters
        ----------
        vector_size : int
            Intended number of dimensions for all contained vectors.
        count : int, optional
            If provided, vectors will be pre-allocated for at least this many vectors. (Otherwise
            they can be added later.)
        dtype : type, optional
            Vector dimensions will default to `np.float32` (AKA `REAL` in some Gensim code) unless
            another type is provided here.
        mapfile_path : string, optional
            Currently unused.
        Nr   r   )	vector_sizeindex_to_key
next_indexkey_to_indexr   vectorsnormsexpandosmapfile_path)selfr'   countr   r.   s        r    __init__zKeyedVectors.__init__   s`    4 '!FUNe[1???
 (r"   c                 P    | j         j         d| j         dt          |            dS )Nz<vector_size=, z keys>)	__class____name__r'   r   r/   s    r    __str__zKeyedVectors.__str__  s0    .)]]8H]]CPTII]]]]r"   c                     t          t          |           j        |i | t          | d          r|                                  t          | d          s9| j                            d| j                            dd                    | _        t          | d          s7| j                            dd          | _        | j        j	        d         | _
        t          | d	          sd| _        t          | d
          si | _        d| j        vr|                                  t          | d          st          |           | _        dS dS )zXHandle special requirements of `.load()` protocol, usually up-converting older versions.doctagsr(   
index2wordindex2entityNr+   syn0r   r,   r-   r*   r)   )superr$   _load_specialshasattr_upconvert_old_d2vkv__dict__popr(   r+   r   r'   r,   r-   _upconvert_old_vocabr   r)   r/   argskwargsr4   s      r    r>   zKeyedVectors._load_specials  sG   0lD!!0$A&AAA4## 	(%%'''t^,, 	i $ 1 1,@Q@QR`bf@g@g h hDtY'' 	5=,,VT::DL#|1!4DtW%% 	DJtZ(( 	DM. 	(%%'''t\** 	(!$iiDOOO	( 	(r"   c                    | j                             dd          }i | _        |                                D ]\}||         }|j        | j        |<   |j                                         D ])}|                     |j        ||j         |                    *]d| j        v r4| j        d                             t          j	                  | j        d<   dS dS )z\Convert a loaded, pre-gensim-4.0.0 version instance that had a 'vocab' dict of data objects.vocabN
sample_int)
rA   rB   r*   keysindexset_vecattrr-   astypenpuint32)r/   	old_vocabkold_vattrs        r    rC   z!KeyedVectors._upconvert_old_vocab  s    M%%gt44	!! 	J 	JAaLE#(;Da ++-- J J  dEN44HIIIIJ 4=( 	X*.-*E*L*LRY*W*WDM,'''	X 	Xr"   c           	          |4t           j                                                  } fd|D             }t           j                  }t          ||          D ]\  }}|t          u rt          j        }|t          u rt          }| j        vrt          j        ||           j        |<   S j        |         }t          j        ||j                  st          d| d| d|j                   t          |          |k    rt          |          }t          j        ||j                   j        |<   |dt          ||          f          j        |         dt          ||          f<   dS )aA  Ensure arrays for given per-vector extra-attribute names & types exist, at right size.

        The length of the index_to_key list is canonical 'intended size' of KeyedVectors,
        even if other properties (vectors array) hasn't yet been allocated or expanded.
        So this allocation targets that size.

        Nc                 4    g | ]}j         |         j        S  )r-   r   ).0rS   r/   s     r    
<listcomp>z2KeyedVectors.allocate_vecattrs.<locals>.<listcomp>6  s#    AAA4T]4(.AAAr"   r&   zCan't allocate type z for attribute z#, conflicts with its existing type )r   r-   rJ   r   r(   zipintrN   int64strobjectr   
issubdtyper   	TypeErrormin)r/   attrstypestarget_sizerS   tprev_expando
prev_counts   `       r    allocate_vecattrszKeyedVectors.allocate_vecattrs+  s     	B++--..EAAAA5AAAE$+,,5%(( 	s 	sGD!Cx HCx  4=( &(h{!&D&D&Dd#=.L=L$677 M1 M MT M M8D8JM M   <  K/ \**J"$(;l>P"Q"Q"QDM$DPQoSVWacnSoSoQoQpDrDM$ >#j+">"> > ?AA)	s 	sr"   c                     |                      |gt          |          g           |                     |          }|| j        |         |<   dS )a  Set attribute associated with the given key to value.

        Parameters
        ----------

        key : str
            Store the attribute for this vector key.
        attr : str
            Name of the additional attribute to store for the given key.
        val : object
            Value of the additional attribute to store for the given key.

        Returns
        -------

        None

        )ra   rb   N)rg   type	get_indexr-   )r/   keyrS   valrK   s        r    rL   zKeyedVectors.set_vecattrN  sO    & 	dVDII;???s##%(dE"""r"   c                 R    |                      |          }| j        |         |         S )a  Get attribute value associated with given key.

        Parameters
        ----------

        key : str
            Vector key for which to fetch the attribute value.
        attr : str
            Name of the additional attribute to fetch for the given key.

        Returns
        -------

        object
            Value of the additional attribute fetched for the given key.

        )rj   r-   )r/   rk   rS   rK   s       r    get_vecattrzKeyedVectors.get_vecattre  s'    $ s##}T"5))r"   c                     t          | j                  | j        f}t          || j        |          | _        |                                  d| _        dS )zPMake underlying vectors match index_to_key size; random-initialize any new rows.)prior_vectorsseedN)r   r(   r'   prep_vectorsr+   rg   r,   )r/   rq   target_shapes      r    resize_vectorszKeyedVectors.resize_vectorsz  sP    D-..0@A#LSWXXX   


r"   c                 *    t          | j                  S N)r   r(   r6   s    r    __len__zKeyedVectors.__len__  s    4$%%%r"   c                      t          |t                    r                     |          S t           fd|D                       S )ab  Get vector representation of `key_or_keys`.

        Parameters
        ----------
        key_or_keys : {str, list of str, int, list of int}
            Requested key or list-of-keys.

        Returns
        -------
        numpy.ndarray
            Vector representation for `key_or_keys` (1D if `key_or_keys` is single key, otherwise - 2D).

        c                 :    g | ]}                     |          S rV   
get_vectorrW   rk   r/   s     r    rX   z,KeyedVectors.__getitem__.<locals>.<listcomp>  s%    CCCts++CCCr"   )r   r   r{   r	   )r/   key_or_keyss   ` r    __getitem__zKeyedVectors.__getitem__  sL     k:.. 	0??;///CCCC{CCCDDDr"   c                     | j                             |d          }|dk    r|S t          |t          t          j        f          r$d|cxk    rt          | j                  k     rn n|S ||S t          d| d          )zReturn the integer index (slot/position) where the given key's vector is stored in the
        backing vectors array.

        r   NKey 'z' not present)	r*   getr   rZ   rN   integerr   r(   KeyError)r/   rk   defaultrl   s       r    rj   zKeyedVectors.get_index  s    
 ##C,,!8 	7Jc2:.// 	7A 	7 	7 	7 	7s4CT?U?U 	7 	7 	7 	7 	7J 	7N53555666r"   Fc                     |                      |          }|r0|                                  | j        |         | j        |         z  }n| j        |         }|                    d           |S )a  Get the key's vector, as a 1D numpy array.

        Parameters
        ----------

        key : str
            Key for vector to return.
        norm : bool, optional
            If True, the resulting vector will be L2-normalized (unit Euclidean length).

        Returns
        -------

        numpy.ndarray
            Vector for the specified key.

        Raises
        ------

        KeyError
            If the given key doesn't exist.

        F)write)rj   
fill_normsr+   r,   setflags)r/   rk   normrK   results        r    r{   zKeyedVectors.get_vector  sl    0 s## 	)OO\%(4:e+<<FF\%(Fe$$$r"   zUse get_vector insteadc                      | j         |i |S )z_Compatibility alias for get_vector(); must exist so subclass calls reach subclass get_vector().rz   r/   rE   rF   s      r    word_veczKeyedVectors.word_vec  s     t////r"   Tc                 l   t          |          dk    rt          d          t          |t                    rt	          j        |          }|!t	          j        t          |                    }t          |          |j        d         k    rt          d          t	          j        | j	        | j
        j                  }d}t          |          D ]\  }}	t          |	t                    r'|||         |	z  z  }|t          ||                   z  }A|                     |	          r>|                     |	|          }
|||         |
z  z  }|t          ||                   z  }|st#          d|	 d          |dk    r||z  }|r,t%          j        |                              t*                    }|S )a  Get the mean vector for a given list of keys.

        Parameters
        ----------

        keys : list of (str or int or ndarray)
            Keys specified by string or int ids or numpy array.
        weights : list of float or numpy.ndarray, optional
            1D array of same size of `keys` specifying the weight for each key.
        pre_normalize : bool, optional
            Flag indicating whether to normalize each keyvector before taking mean.
            If False, individual keyvector will not be normalized.
        post_normalize: bool, optional
            Flag indicating whether to normalize the final mean vector.
            If True, normalized mean vector will be return.
        ignore_missing : bool, optional
            If False, will raise error if a key doesn't exist in vocabulary.

        Returns
        -------

        numpy.ndarray
            Mean vector for the list of keys.

        Raises
        ------

        ValueError
            If the size of the list of `keys` and `weights` doesn't match.
        KeyError
            If any of the key doesn't exist in vocabulary and `ignore_missing` is false.

        r   z!cannot compute mean with no inputNz8keys and weights array must have same number of elementsr   r   z' not present in vocabulary)r   
ValueErrorr   r   rN   arrayonesr   r   r'   r+   r   	enumerater
   abs__contains__r{   r   r   unitvecrM   REAL)r/   rJ   weightspre_normalizepost_normalizeignore_missingmeantotal_weightidxrk   vecs              r    get_mean_vectorzKeyedVectors.get_mean_vector  s   D t99> 	B@AAAgt$$ 	(hw''G 	)gc$ii((Gt99a(( 	J   x($,*<==!$ 		I 		IHC#w'' Is**GCL 1 11""3'' Iooco>>s**GCL 1 11# IGsGGGHHHI ! 	',&D 	7#D))0066Dr"   c                 p   | j         }|t          |           k    s| j        |         `t          |           }t          j        dt
                     |                     |g|g           |                                  |dz   | _         n.|| j        |<   || j        |<   || j	        |<   | xj         dz  c_         |S )a  Add one new vector at the given key, into existing slot if available.

        Warning: using this repeatedly is inefficient, requiring a full reallocation & copy,
        if this instance hasn't been preallocated to be ready for such incremental additions.

        Parameters
        ----------

        key: str
            Key identifier of the added vector.
        vector: numpy.ndarray
            1D numpy array with the vector values.

        Returns
        -------
        int
            Index of the newly added vector, so that ``self.vectors[result] == vector`` and
            ``self.index_to_key[result] == key``.

        NzAdding single vectors to a KeyedVectors which grows by one each time can be costly. Consider adding in batches or preallocating to the required size.r   )
r)   r   r(   warningswarnUserWarningadd_vectorsrg   r*   r+   )r/   rk   vectortarget_indexs       r    
add_vectorzKeyedVectors.add_vector  s    * 3t99$ 	!(9,(G 	!t99LMT   cUVH---""$$$*Q.DOO /2Dl+%1Dc")/DL&OOq OOr"   c                 b    t          t                    r,gt          j        |                              dd          }n)t          |t
                    rt          j        |          }i                                                      fd                                D                        t          j        t                    t                    }t                    D ]\  }}| j        v rd||<   t          j        |           d         D ]@}|         }t           j                   j        |<    j                            |           At!           j        ||                               j        j                  f           _        D ]5\  }}	t          j         j        |         |	|          f           j        |<   6|rU fdt          j        |          d         D             }
||          j        |
<   D ]\  }}	|	|          j        |         |
<   dS dS )	a^  Append keys and their vectors in a manual way.
        If some key is already in the vocabulary, the old vector is kept unless `replace` flag is True.

        Parameters
        ----------
        keys : list of (str or int)
            Keys specified by string or int ids.
        weights: list of numpy.ndarray or numpy.ndarray
            List of 1D np.array vectors or a 2D np.array of vectors.
        replace: bool, optional
            Flag indicating whether to replace vectors for keys which already exist in the map;
            if True - replace vectors, otherwise - keep old vectors.

        r   r   Nc                 *    g | ]}|         j         S rV   r&   )rW   rQ   extrass     r    rX   z,KeyedVectors.add_vectors.<locals>.<listcomp>O  s    .V.V.V1vay.V.V.Vr"   r&   Tr   c                 F    g | ]}                     |                   S rV   rj   )rW   r   rJ   r/   s     r    rX   z,KeyedVectors.add_vectors.<locals>.<listcomp>c  s)    ___3T^^DI66___r"   )r   r   rN   r   reshaper   rg   rJ   r   r   boolr   r*   nonzeror(   appendr	   r+   rM   r   r-   )r/   rJ   r   r   replacein_vocab_maskr   rk   rS   extrain_vocab_idxss   `` `       r    r   zKeyedVectors.add_vectors6  sX    dJ'' 	(6Dhw''//266GG&& 	(hw''G 	F 	v{{}}.V.V.V.V.V.V.VWWWT$777!$ 	* 	*HCd'' *%)c" :}n--a0 	* 	*Cs)C%():%;%;Dc"$$S)))) t|Wm^-D-K-KDLL^-_-_`aa! 	Z 	ZKD%"$)T]4-@%BW,X"Y"YDM$  	J_____"*]B[B[\]B^___M*1-*@DL'% J Je5:=5Id#M22		J 	JJ Jr"   c                     t          |t                    s|g}|                    dd          }|                     ||d           dS )a  Add keys and theirs vectors in a manual way.
        If some key is already in the vocabulary, old vector is replaced with the new one.

        This method is an alias for :meth:`~gensim.models.keyedvectors.KeyedVectors.add_vectors`
        with `replace=True`.

        Parameters
        ----------
        keys : {str, int, list of (str or int)}
            keys specified by their string or int ids.
        weights: list of numpy.ndarray or numpy.ndarray
            List of 1D np.array vectors or 2D np.array of vectors.

        r   r   T)r   N)r   r   r   r   )r/   rJ   r   s      r    __setitem__zKeyedVectors.__setitem__h  sQ     $%% 	-6Dooa,,Gw55555r"   c                 6    |                      |d          dk    S )a9  Can this model return a single index for this key?

        Subclasses that synthesize vectors for out-of-vocabulary words (like
        :class:`~gensim.models.fasttext.FastText`) may respond True for a
        simple `word in wv` (`__contains__()`) check but False for this
        more-specific check.

        r   r   r   r/   rk   s     r    has_index_forzKeyedVectors.has_index_for}  s     ~~c2&&!++r"   c                 ,    |                      |          S rv   )r   r   s     r    r   zKeyedVectors.__contains__  s    !!#&&&r"   c                 J     |t           fd|D                                S )z6Get the `key` from `keys_list` most similar to `key1`.c                 <    g | ]}                     |          S rV   
similarity)rW   rk   key1r/   s     r    rX   z6KeyedVectors.most_similar_to_given.<locals>.<listcomp>  s'     Q Q Qs!;!; Q Q Qr"   )r   )r/   r   	keys_lists   `` r    most_similar_to_givenz"KeyedVectors.most_similar_to_given  s/     Q Q Q Q Qy Q Q QRRSSr"   c                                            |          }                     |                               |          }t          j        |||         k               d         } fd|D             S )z@Get all keys that are closer to `key1` than `key2` is to `key1`.r   c                 6    g | ]}|k    j         |         S rV   r(   )rW   rK   e1_indexr/   s     r    rX   z,KeyedVectors.closer_than.<locals>.<listcomp>  s,    ___UUV^M^_!%(___r"   )	distancesrj   rN   where)r/   r   key2all_distancese2_indexcloser_node_indicesr   s   `     @r    closer_thanzKeyedVectors.closer_than  su    t,,>>$''>>$'' h}}X7N'NOOPQR_____6I____r"   zUse closer_than insteadc                 .    |                      ||          S rv   )r   )r/   word1word2s      r    words_closer_thanzKeyedVectors.words_closer_than  s    u---r"   c                 N    t          |                     ||                    dz   S )z]Rank of the distance of `key2` from `key1`, in relation to distances of all keys from `key1`.r   )r   r   )r/   r   r   s      r    rankzKeyedVectors.rank  s%    4##D$//00144r"   c                      t          d          )NzThe `.vectors_norm` attribute is computed dynamically since Gensim 4.0.0. Use `.get_normed_vectors()` instead.
See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4AttributeErrorr6   s    r    vectors_normzKeyedVectors.vectors_norm  s    b
 
 	
r"   c                     d S rv   rV   )r/   _s     r    r   zKeyedVectors.vectors_norm  s    r"   c                 l    |                                   | j        | j        dt          j        f         z  S )a  Get all embedding vectors normalized to unit L2 length (euclidean), as a 2D numpy array.

        To see which key corresponds to which vector = which array row, refer
        to the :attr:`~gensim.models.keyedvectors.KeyedVectors.index_to_key` attribute.

        Returns
        -------
        numpy.ndarray:
            2D numpy array of shape ``(number_of_keys, embedding dimensionality)``, L2-normalized
            along the rows (key vectors).

        .)r   r+   r,   rN   newaxisr6   s    r    get_normed_vectorszKeyedVectors.get_normed_vectors  s.     	|djbj999r"   c                 r    | j         |r-t          j                            | j        d          | _         dS dS )z
    def fill_norms(self, force=False):
        """
        Ensure per-vector norms are available.

        Any code which modifies vectors should ensure the accompanying norms are
        either recalculated or 'None', to trigger a full recalculation later on-request.

        """
        if self.norms is None or force:
            self.norms = np.linalg.norm(self.vectors, axis=1)
    @property
    def index2entity(self):
        raise AttributeError(
            "The index2entity attribute has been replaced by index_to_key since Gensim 4.0.0.\n"
            "See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4"
        )

    @index2entity.setter
    def index2entity(self, value):
        self.index_to_key = value

    @property
    def index2word(self):
        raise AttributeError(
            "The index2word attribute has been replaced by index_to_key since Gensim 4.0.0.\n"
            "See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4"
        )

    @index2word.setter
    def index2word(self, value):
        self.index_to_key = value

    @property
    def vocab(self):
        raise AttributeError(
            "The vocab attribute was removed from KeyedVector in Gensim 4.0.0.\n"
            "Use KeyedVector's .key_to_index dict, .index_to_key list, and methods "
            ".get_vecattr(key, attr) and .set_vecattr(key, attr, new_val) instead.\n"
            "See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4"
        )

    @vocab.setter
    def vocab(self, value):
        self.vocab()  # trigger above AttributeError

    def sort_by_descending_frequency(self):
        """Sort the vocabulary so the most frequent words have the lowest indexes."""
        if not len(self):
            return  # noop if empty
        count_sorted_indexes = np.argsort(self.expandos['count'])[::-1]
        self.index_to_key = [self.index_to_key[idx] for idx in count_sorted_indexes]
        for k in self.expandos:
            self.expandos[k] = self.expandos[k][count_sorted_indexes]
        if len(self.vectors):
            logger.warning("sorting after vectors have been allocated is expensive & error-prone")
            self.vectors = self.vectors[count_sorted_indexes]
        self.key_to_index = {word: i for i, word in enumerate(self.index_to_key)}
    def save(self, *args, **kwargs):
        """Save KeyedVectors to a file.

        Parameters
        ----------
        fname : str
            Path to the output file.

        See Also
        --------
        :meth:`~gensim.models.keyedvectors.KeyedVectors.load`
            Load a previously saved model.

        """
        super(KeyedVectors, self).save(*args, **kwargs)
    def most_similar(
            self, positive=None, negative=None, topn=10, clip_start=0, clip_end=None,
            restrict_vocab=None, indexer=None,
        ):
        """Find the top-N most similar keys.
        Positive keys contribute positively towards the similarity, negative keys negatively.

        This method computes cosine similarity between a simple mean of the projection
        weight vectors of the given keys and the vectors for each key in the model.
        The method corresponds to the `word-analogy` and `distance` scripts in the original
        word2vec implementation.

        Parameters
        ----------
        positive : list of (str or int or ndarray) or list of ((str,float) or (int,float) or (ndarray,float)), optional
            List of keys that contribute positively. If tuple, second element specifies the weight (default `1.0`)
        negative : list of (str or int or ndarray) or list of ((str,float) or (int,float) or (ndarray,float)), optional
            List of keys that contribute negatively. If tuple, second element specifies the weight (default `-1.0`)
        topn : int or None, optional
            Number of top-N similar keys to return, when `topn` is int. When `topn` is None,
            then similarities for all keys are returned.
        clip_start : int
            Start clipping index.
        clip_end : int
            End clipping index.
        restrict_vocab : int, optional
            Optional integer which limits the range of vectors which
            are searched for most-similar values. For example, restrict_vocab=10000 would
            only check the first 10000 key vectors in the vocabulary order. (This may be
            meaningful if you've sorted the vocabulary by descending frequency.) If
            specified, overrides any values of ``clip_start`` or ``clip_end``.

        Returns
        -------
        list of (str, float) or numpy.array
            When `topn` is int, a sequence of (key, similarity) is returned.
            When `topn` is None, then similarities for all keys are returned as a
            one-dimensional numpy array with the size of the vocabulary.

        """
        if isinstance(topn, Integral) and topn < 1:
            return []

        # allow passing a single string-key or vector for the positive/negative arguments as a convenience
        positive = _ensure_list(positive)
        negative = _ensure_list(negative)

        self.fill_norms()
        clip_end = clip_end or len(self.vectors)

        if restrict_vocab:
            clip_start = 0
            clip_end = restrict_vocab

        # add weights for each key, if not already present; default to 1.0 for positive and -1.0 for negative keys
        keys = []
        weight = np.concatenate((np.ones(len(positive)), -1.0 * np.ones(len(negative))))
        for idx, item in enumerate(positive + negative):
            if isinstance(item, _EXTENDED_KEY_TYPES):
                keys.append(item)
            else:
                keys.append(item[0])
                weight[idx] = item[1]

        # compute the weighted mean of all the keys
        mean = self.get_mean_vector(keys, weight, pre_normalize=True, post_normalize=True, ignore_missing=False)
        all_keys = [
            self.get_index(key) for key in keys if isinstance(key, _KEY_TYPES) and self.has_index_for(key)
        ]

        if indexer is not None and isinstance(topn, int):
            return indexer.most_similar(mean, topn)

        dists = dot(self.vectors[clip_start:clip_end], mean) / self.norms[clip_start:clip_end]
        if not topn:
            return dists
        best = matutils.argsort(dists, topn=topn + len(all_keys), reverse=True)
        # ignore (don't return) keys from the input
        result = [
            (self.index_to_key[sim + clip_start], float(dists[sim]))
            for sim in best if (sim + clip_start) not in all_keys
        ]
        return result[:topn]

    def similar_by_word(self, word, topn=10, restrict_vocab=None):
        """Compatibility alias for similar_by_key()."""
        return self.similar_by_key(word, topn, restrict_vocab)
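At its core, the query above is just "cosine similarity against a normalized mean of the query vectors, with the query keys excluded". A self-contained sketch of that idea, using a made-up 5-word vocabulary engineered so that king - man + woman lands on queen (illustration only, not gensim's exact code path):

```python
import numpy as np

# Made-up toy vectors, engineered so that king - man + woman == queen.
index_to_key = ["king", "man", "woman", "queen", "apple"]
key_to_index = {k: i for i, k in enumerate(index_to_key)}
vectors = np.array([
    [1.0, 0.0, 1.0],  # king
    [1.0, 0.0, 0.0],  # man
    [0.0, 1.0, 0.0],  # woman
    [0.0, 1.0, 1.0],  # queen
    [0.3, 0.2, 0.1],  # apple (distractor)
])

def most_similar(positive, negative, topn=1):
    unit = lambda v: v / np.linalg.norm(v)
    # Mean of the unit-normalized query vectors, positives minus negatives.
    mean = sum(unit(vectors[key_to_index[k]]) for k in positive)
    mean -= sum(unit(vectors[key_to_index[k]]) for k in negative)
    mean = unit(mean)
    # Cosine similarity of the mean against every row, then drop the query keys.
    dists = vectors @ mean / np.linalg.norm(vectors, axis=1)
    exclude = {key_to_index[k] for k in positive + negative}
    best = [i for i in np.argsort(-dists) if i not in exclude]
    return [(index_to_key[i], float(dists[i])) for i in best[:topn]]

result = most_similar(positive=["king", "woman"], negative=["man"], topn=1)
```

The exclusion step matters: without it, the query words themselves would usually dominate the top of the ranking.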
    def similar_by_key(self, key, topn=10, restrict_vocab=None):
        """Find the top-N most similar keys.

        Parameters
        ----------
        key : str
            Key
        topn : int or None, optional
            Number of top-N similar keys to return. If topn is None, similar_by_key returns
            the vector of similarity scores.
        restrict_vocab : int, optional
            Optional integer which limits the range of vectors which
            are searched for most-similar values. For example, restrict_vocab=10000 would
            only check the first 10000 key vectors in the vocabulary order. (This may be
            meaningful if you've sorted the vocabulary by descending frequency.)

        Returns
        -------
        list of (str, float) or numpy.array
            When `topn` is int, a sequence of (key, similarity) is returned.
            When `topn` is None, then similarities for all keys are returned as a
            one-dimensional numpy array with the size of the vocabulary.

        """
        return self.most_similar(positive=[key], topn=topn, restrict_vocab=restrict_vocab)
    def similar_by_vector(self, vector, topn=10, restrict_vocab=None):
        """Find the top-N most similar keys by vector.

        Parameters
        ----------
        vector : numpy.array
            Vector from which similarities are to be computed.
        topn : int or None, optional
            Number of top-N similar keys to return, when `topn` is int. When `topn` is None,
            then similarities for all keys are returned.
        restrict_vocab : int, optional
            Optional integer which limits the range of vectors which
            are searched for most-similar values. For example, restrict_vocab=10000 would
            only check the first 10000 key vectors in the vocabulary order. (This may be
            meaningful if you've sorted the vocabulary by descending frequency.)

        Returns
        -------
        list of (str, float) or numpy.array
            When `topn` is int, a sequence of (key, similarity) is returned.
            When `topn` is None, then similarities for all keys are returned as a
            one-dimensional numpy array with the size of the vocabulary.

        """
        return self.most_similar(positive=[vector], topn=topn, restrict_vocab=restrict_vocab)
  }|t          |          z
  }|dk    s|dk    rt                              d||           |r|s)t                              d           t          d          S t          ||g          t                    d	k    rd
S t          t          |                    }	t          t          |                    }
t          j         fd|	D                       }t          j         fd|
D                       }                    |	          }                    |
          }t          ft                    }t          ||          |t          j        ||          <   t#          t%          |                    dk     r)t                              d           t          d          S fd} ||          } ||          } ||||          S )u  Compute the Word Mover's Distance between two documents.

        When using this code, please consider citing the following papers:

        * `Rémi Flamary et al. "POT: Python Optimal Transport"
          <https://jmlr.org/papers/v22/20-451.html>`_
        * `Matt Kusner et al. "From Word Embeddings To Document Distances"
          <http://proceedings.mlr.press/v37/kusnerb15.pdf>`_.

        Parameters
        ----------
        document1 : list of str
            Input document.
        document2 : list of str
            Input document.
        norm : boolean
            Normalize all word vectors to unit length before computing the distance?
            Defaults to True.

        Returns
        -------
        float
            Word Mover's distance between `document1` and `document2`.

        Warnings
        --------
        This method only works if `POT <https://pypi.org/project/POT/>`_ is installed.

        If one of the documents have no words that exist in the vocab, `float('inf')` (i.e. infinity)
        will be returned.

        Raises
        ------
        ImportError
            If `POT <https://pypi.org/project/POT/>`_  isn't installed.

        r   )emd2c                     g | ]}|v |	S rV   rV   rW   tokenr/   s     r    rX   z+KeyedVectors.wmdistance.<locals>.<listcomp>  "    CCCuUd]CUCCCr"   c                     g | ]}|v |	S rV   rV   r  s     r    rX   z+KeyedVectors.wmdistance.<locals>.<listcomp>  r  r"   zARemoved %d and %d OOV words from document 1 and 2 (respectively).zGAt least one of the documents had no words that were in the vocabulary.inf)	documentsr           c                 >    g | ]}                     |           S r   rz   rW   r  r   r/   s     r    rX   z+KeyedVectors.wmdistance.<locals>.<listcomp>  )    OOOUtu488OOOr"   c                 >    g | ]}                     |           S r$  rz   r%  s     r    rX   z+KeyedVectors.wmdistance.<locals>.<listcomp>  r&  r"   r&   g:0yE>z;The distance matrix is all zeros. Aborting (returning inf).c                     t          t                    }                    |           }t          |           }|D ]\  }}|t	          |          z  ||<   |S )Nr&   )r   r   doc2bowr   r  )documentdnbowdoc_lenr   freq
dictionary	vocab_lens         r    r,  z%KeyedVectors.wmdistance.<locals>.nbow  sa    iv...A%%h//D(mmG! / /	Tg.#Hr"   )otr  r   r   infor   r  r   r   setrN   r   doc2idxr   r   r   ix_r   np_sum)r/   	document1	document2r   r  len_pre_oov1len_pre_oov2diff1diff2doclist1doclist2v1v2doc1_indicesdoc2_indicesdistance_matrixr,  d1d2r/  r0  s   `  `               @@r    
wmdistancezKeyedVectors.wmdistance  ss   N 	 9~~9~~CCCC	CCC	CCCC	CCC	s9~~-s9~~-19 	k	 	kKK[]bdijjj 	 	 	 NNdeee<<9i*@AAA

OO	> 	3I''I''XOOOOOhOOOPPXOOOOOhOOOPP!))(33!))(33  I 6fEEE>CBmm|\::;vo&&''$. 	 KKUVVV<<	 	 	 	 	 	 T)__T)__ tBO,,,r"   c                    	
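The actual transport solve is delegated to POT's `emd2`; what `wmdistance` itself builds are the two nBOW probability distributions and the pairwise distance matrix. A small self-contained sketch of those inputs, with made-up 2-D "embeddings" (no gensim or POT required to run it):

```python
import numpy as np

# Made-up 2-D word embeddings, for illustration only.
emb = {"obama": [1.0, 0.0], "speaks": [0.0, 1.0],
       "president": [0.9, 0.1], "talks": [0.1, 0.9]}
doc1, doc2 = ["obama", "speaks"], ["president", "talks"]

vocab = sorted(set(doc1) | set(doc2))        # shared dictionary over both docs
idx = {w: i for i, w in enumerate(vocab)}

def nbow(doc):
    # Normalized bag-of-words: each document becomes a probability distribution.
    d = np.zeros(len(vocab))
    for w in doc:
        d[idx[w]] += 1.0 / len(doc)
    return d

d1, d2 = nbow(doc1), nbow(doc2)

# Pairwise euclidean distances between all word vectors in the shared vocab.
V = np.array([emb[w] for w in vocab])
distance_matrix = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=-1)

# With these inputs, POT's emd2(d1, d2, distance_matrix) would return the WMD.
```

Note how related words ("obama"/"president") end up much closer in the matrix than unrelated ones, which is exactly what lets WMD score paraphrases as near even when they share no tokens.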
    def most_similar_cosmul(self, positive=None, negative=None, topn=10, restrict_vocab=None):
        """Find the top-N most similar words, using the multiplicative combination objective,
        proposed by `Omer Levy and Yoav Goldberg "Linguistic Regularities in Sparse and Explicit Word Representations"
        <http://www.aclweb.org/anthology/W14-1618>`_. Positive words still contribute positively towards the similarity,
        negative words negatively, but with less susceptibility to one large distance dominating the calculation.
        In the common analogy-solving case, of two positive and one negative examples,
        this method is equivalent to the "3CosMul" objective (equation (4)) of Levy and Goldberg.

        Additional positive or negative examples contribute to the numerator or denominator,
        respectively - a potentially sensible but untested extension of the method.
        With a single positive example, rankings will be the same as in the default
        :meth:`~gensim.models.keyedvectors.KeyedVectors.most_similar`.

        Allows calls like most_similar_cosmul('dog', 'cat'), as a shorthand for
        most_similar_cosmul(['dog'], ['cat']) where 'dog' is positive and 'cat' negative

        Parameters
        ----------
        positive : list of str, optional
            List of words that contribute positively.
        negative : list of str, optional
            List of words that contribute negatively.
        topn : int or None, optional
            Number of top-N similar words to return, when `topn` is int. When `topn` is None,
            then similarities for all words are returned.
        restrict_vocab : int or None, optional
            Optional integer which limits the range of vectors which are searched for most-similar values.
            For example, restrict_vocab=10000 would only check the first 10000 node vectors in the vocabulary order.
            This may be meaningful if vocabulary is sorted by descending frequency.

        Returns
        -------
        list of (str, float) or numpy.array
            When `topn` is int, a sequence of (word, similarity) is returned.
            When `topn` is None, then similarities for all words are returned as a
            one-dimensional numpy array with the size of the vocabulary.

        """
        if isinstance(topn, Integral) and topn < 1:
            return []

        # allow passing a single string-key for the positive/negative arguments as a convenience
        if isinstance(positive, str):
            positive = [positive]
        if isinstance(negative, str):
            negative = [negative]

        self.init_sims()

        all_words = {
            self.get_index(word) for word in positive + negative
            if not isinstance(word, ndarray) and word in self.key_to_index
        }

        positive = [
            self.get_vector(word, norm=True) if isinstance(word, str) else word
            for word in positive
        ]
        negative = [
            self.get_vector(word, norm=True) if isinstance(word, str) else word
            for word in negative
        ]

        if not positive:
            raise ValueError("cannot compute similarity with no input")

        # equation (4) of Levy & Goldberg "Linguistic Regularities in Sparse and Explicit Word Representations",
        # with distances shifted to [0,1] per footnote (7)
        pos_dists = [((1 + dot(self.vectors, term) / self.norms) / 2) for term in positive]
        neg_dists = [((1 + dot(self.vectors, term) / self.norms) / 2) for term in negative]
        dists = prod(pos_dists, axis=0) / (prod(neg_dists, axis=0) + 0.000001)

        if not topn:
            return dists
        best = matutils.argsort(dists, topn=topn + len(all_words), reverse=True)
        # ignore (don't return) words from the input
        result = [(self.index_to_key[sim], float(dists[sim])) for sim in best if sim not in all_words]
        return result[:topn]
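Equation (4) of Levy & Goldberg is easy to reproduce in isolation: shift every cosine from [-1, 1] into [0, 1], multiply the shifted similarities to the positive terms, and divide by those to the negative terms. A toy sketch with the same made-up king/man/woman/queen vectors as before (illustration only):

```python
import numpy as np

# Made-up toy vectors (rows) for a 4-word vocabulary, unit-normalized.
words = ["king", "man", "woman", "queen"]
V = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
])
V = V / np.linalg.norm(V, axis=1, keepdims=True)

def shifted_cos(term):
    # Cosine shifted from [-1, 1] into [0, 1], per footnote (7) of the paper.
    return (1.0 + V @ term) / 2.0

positive = [V[0], V[2]]   # king, woman
negative = [V[1]]         # man

# 3CosMul: product of positive similarities over product of negative ones,
# with a small epsilon to avoid division by zero.
dists = np.prod([shifted_cos(t) for t in positive], axis=0) \
    / (np.prod([shifted_cos(t) for t in negative], axis=0) + 1e-6)

ranked = [words[i] for i in np.argsort(-dists) if i not in (0, 1, 2)]
```

Because the terms multiply instead of add, one very dissimilar term can no longer be "paid for" by another very similar one, which is the point of the 3CosMul objective.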
    def rank_by_centrality(self, words, use_norm=True):
        """Rank the given words by similarity to the centroid of all the words.

        Parameters
        ----------
        words : list of str
            List of keys.
        use_norm : bool, optional
            Whether to calculate centroid using unit-normed vectors; default True.

        Returns
        -------
        list of (float, str)
            Ranked list of (similarity, key), most-similar to the centroid first.

        """
        self.fill_norms()
        used_words = [word for word in words if word in self]
        if len(used_words) != len(words):
            ignored_words = set(words) - set(used_words)
            logger.warning("vectors for words %s are not present in the model, ignoring these words", ignored_words)
        if not used_words:
            raise ValueError("cannot select a word from an empty list")
        vectors = vstack([self.get_vector(word, norm=use_norm) for word in used_words]).astype(REAL)
        mean = matutils.unitvec(vectors.mean(axis=0)).astype(REAL)
        dists = dot(vectors, mean)
        return sorted(zip(dists, used_words), reverse=True)
    def doesnt_match(self, words):
        """Which key from the given list doesn't go with the others?

        Parameters
        ----------
        words : list of str
            List of keys.

        Returns
        -------
        str
            The key further away from the mean of all keys.

        """
        return self.rank_by_centrality(words)[-1][1]
    @staticmethod
    def cosine_similarities(vector_1, vectors_all):
        """Compute cosine similarities between one vector and a set of other vectors.

        Parameters
        ----------
        vector_1 : numpy.ndarray
            Vector from which similarities are to be computed, expected shape (dim,).
        vectors_all : numpy.ndarray
            For each row in vectors_all, similarity to vector_1 is computed, expected shape (num_vectors, dim).

        Returns
        -------
        numpy.ndarray
            Contains cosine similarity between `vector_1` and each row in `vectors_all`, shape (num_vectors,).

        """
        norm = np.linalg.norm(vector_1)
        all_norms = np.linalg.norm(vectors_all, axis=1)
        dot_products = dot(vectors_all, vector_1)
        similarities = dot_products / (norm * all_norms)
        return similarities
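The whole computation is one vectorized expression: a batch of dot products divided by the product of the norms. A standalone numpy sketch on tiny hand-picked vectors:

```python
import numpy as np

# One query vector against a small batch of rows, mirroring cosine_similarities().
vector_1 = np.array([1.0, 0.0])
vectors_all = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])

# Batched dot products, scaled by both norms.
sims = (vectors_all @ vector_1) / (np.linalg.norm(vector_1) * np.linalg.norm(vectors_all, axis=1))
```

The parallel row gets similarity 1.0, the orthogonal one 0.0, and the 45-degree one 1/sqrt(2), regardless of the vectors' magnitudes.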
    def distances(self, word_or_vector, other_words=()):
        """Compute cosine distances from given word or vector to all words in `other_words`.
        If `other_words` is empty, return distance between `word_or_vector` and all words in vocab.

        Parameters
        ----------
        word_or_vector : {str, numpy.ndarray}
            Word or vector from which distances are to be computed.
        other_words : iterable of str
            For each word in `other_words` distance from `word_or_vector` is computed.
            If None or empty, distance of `word_or_vector` from all words in vocab is computed (including itself).

        Returns
        -------
        numpy.array
            Array containing distances to all words in `other_words` from input `word_or_vector`.

        Raises
        ------
        KeyError
            If either `word_or_vector` or any word in `other_words` is absent from vocab.

        """
        if isinstance(word_or_vector, _KEY_TYPES):
            input_vector = self.get_vector(word_or_vector)
        else:
            input_vector = word_or_vector
        if not other_words:
            other_vectors = self.vectors
        else:
            other_indices = [self.get_index(word) for word in other_words]
            other_vectors = self.vectors[other_indices]
        return 1 - self.cosine_similarities(input_vector, other_vectors)
    def distance(self, w1, w2):
        """Compute cosine distance between two keys.
        Calculate 1 - :meth:`~gensim.models.keyedvectors.KeyedVectors.similarity`.

        Parameters
        ----------
        w1 : str
            Input key.
        w2 : str
            Input key.

        Returns
        -------
        float
            Distance between `w1` and `w2`.

        """
        return 1 - self.similarity(w1, w2)
    def similarity(self, w1, w2):
        """Compute cosine similarity between two keys.

        Parameters
        ----------
        w1 : str
            Input key.
        w2 : str
            Input key.

        Returns
        -------
        float
            Cosine similarity between `w1` and `w2`.

        """
        return dot(matutils.unitvec(self[w1]), matutils.unitvec(self[w2]))
    def n_similarity(self, ws1, ws2):
        """Compute cosine similarity between two sets of keys.

        Parameters
        ----------
        ws1 : list of str
            Sequence of keys.
        ws2: list of str
            Sequence of keys.

        Returns
        -------
        numpy.ndarray
            Similarities between `ws1` and `ws2`.

        """
        if not (len(ws1) and len(ws2)):
            raise ZeroDivisionError('At least one of the passed list is empty.')
        mean1 = self.get_mean_vector(ws1, pre_normalize=False)
        mean2 = self.get_mean_vector(ws2, pre_normalize=False)
        return dot(matutils.unitvec(mean1), matutils.unitvec(mean2))
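Set-to-set similarity reduces to a single cosine between the two mean vectors. A minimal standalone sketch with made-up vectors (not gensim's exact code path):

```python
import numpy as np

# n_similarity() compares two sets of vectors via the cosine of their means.
set1 = np.array([[1.0, 0.0], [1.0, 0.2]])   # e.g. vectors for ['sushi', 'shop']
set2 = np.array([[0.9, 0.3]])               # e.g. vector for ['restaurant']

unit = lambda v: v / np.linalg.norm(v)
sim = float(np.dot(unit(set1.mean(axis=0)), unit(set2.mean(axis=0))))
```

Averaging first means each set contributes one direction, so a single outlier vector in a large set is dampened rather than dominating the score.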
    @staticmethod
    def _log_evaluate_word_analogies(section):
        """Calculate score by section, helper for
        :meth:`~gensim.models.keyedvectors.KeyedVectors.evaluate_word_analogies`.

        Parameters
        ----------
        section : dict of (str, (str, str, str, str))
            Section given from evaluation.

        Returns
        -------
        float
            Accuracy score if at least one prediction was made (correct or incorrect).

            Or return 0.0 if there were no predictions at all in this section.

        """
        correct, incorrect = len(section['correct']), len(section['incorrect'])
        if correct + incorrect == 0:
            return 0.0
        score = correct / (correct + incorrect)
        logger.info("%s: %.1f%% (%i/%i)", section['section'], 100.0 * score, correct, correct + incorrect)
        return score
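The per-section score is just correct / (correct + incorrect) over the section's recorded 4-tuples. A tiny worked example with a hypothetical section dict in the same shape the evaluator builds:

```python
# Hypothetical section dict, in the shape evaluate_word_analogies() produces:
# lists of (a, b, c, expected) 4-tuples under 'correct' and 'incorrect'.
section = {
    'section': 'capital-common-countries',  # hypothetical section name
    'correct': [('ATHENS', 'GREECE', 'OSLO', 'NORWAY')] * 3,
    'incorrect': [('ATHENS', 'GREECE', 'PARIS', 'FRANCE')] * 1,
}

correct, incorrect = len(section['correct']), len(section['incorrect'])
# Guard against empty sections, which score 0.0 rather than dividing by zero.
score = correct / (correct + incorrect) if correct + incorrect else 0.0
```

With 3 correct and 1 incorrect prediction the section scores 0.75.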
    def evaluate_word_analogies(
            self, analogies, restrict_vocab=300000, case_insensitive=True,
            dummy4unknown=False, similarity_function='most_similar'):
        """Compute performance of the model on an analogy test set.

        The accuracy is reported (printed to log and returned as a score) for each section separately,
        plus there's one aggregate summary at the end.

        This method corresponds to the `compute-accuracy` script of the original C word2vec.
        See also `Analogy (State of the art) <https://aclweb.org/aclwiki/Analogy_(State_of_the_art)>`_.

        Parameters
        ----------
        analogies : str
            Path to file, where lines are 4-tuples of words, split into sections by ": SECTION NAME" lines.
            See `gensim/test/test_data/questions-words.txt` as example.
        restrict_vocab : int, optional
            Ignore all 4-tuples containing a word not in the first `restrict_vocab` words.
            This may be meaningful if you've sorted the model vocabulary by descending frequency (which is standard
            in modern word embedding models).
        case_insensitive : bool, optional
            If True - convert all words to their uppercase form before evaluating the performance.
            Useful to handle case-mismatch between training tokens and words in the test set.
            In case of multiple case variants of a single word, the vector for the first occurrence
            (also the most frequent if vocabulary is sorted) is taken.
        dummy4unknown : bool, optional
            If True - produce zero accuracies for 4-tuples with out-of-vocabulary words.
            Otherwise, these tuples are skipped entirely and not used in the evaluation.
        similarity_function : str, optional
            Function name used for similarity calculation.

        Returns
        -------
        score : float
            The overall evaluation score on the entire evaluation set
        sections : list of dict of {str : str or list of tuple of (str, str, str, str)}
            Results broken down by each section of the evaluation set. Each dict contains the name of the section
            under the key 'section', and lists of correctly and incorrectly predicted 4-tuples of words under the
            keys 'correct' and 'incorrect'.

        """
        ok_keys = self.index_to_key[:restrict_vocab]
        if case_insensitive:
            ok_vocab = {k.upper(): self.get_index(k) for k in reversed(ok_keys)}
        else:
            ok_vocab = {k: self.get_index(k) for k in reversed(ok_keys)}
        oov = 0
        logger.info("Evaluating word analogies for top %i words in the model on %s", restrict_vocab, analogies)
        sections, section = [], None
        quadruplets_no = 0
        with utils.open(analogies, 'rb') as fin:
            for line_no, line in enumerate(fin):
                line = utils.to_unicode(line)
                if line.startswith(': '):
                    # a new section starts => store the old section
                    if section:
                        sections.append(section)
                        self._log_evaluate_word_analogies(section)
                    section = {'section': line.lstrip(': ').strip(), 'correct': [], 'incorrect': []}
                else:
                    if not section:
                        raise ValueError("Missing section header before line #%i in %s" % (line_no, analogies))
                    try:
                        if case_insensitive:
                            a, b, c, expected = [word.upper() for word in line.split()]
                        else:
                            a, b, c, expected = [word for word in line.split()]
                    except ValueError:
                        logger.info("Skipping invalid line #%i in %s", line_no, analogies)
                        continue
                    quadruplets_no += 1
                    if a not in ok_vocab or b not in ok_vocab or c not in ok_vocab or expected not in ok_vocab:
                        oov += 1
                        if dummy4unknown:
                            logger.debug('Zero accuracy for line #%d with OOV words: %s', line_no, line.strip())
                            section['incorrect'].append((a, b, c, expected))
                        else:
                            logger.debug("Skipping line #%i with OOV words: %s", line_no, line.strip())
                        continue
                    original_key_to_index = self.key_to_index
                    self.key_to_index = ok_vocab
                    ignore = {a, b, c}  # input words to be ignored
                    predicted = None
                    # find the most likely prediction using 3CosAdd (vector offset) method
                    sims = self.most_similar(positive=[b, c], negative=[a], topn=5, restrict_vocab=restrict_vocab)
                    self.key_to_index = original_key_to_index
                    for element in sims:
                        predicted = element[0].upper() if case_insensitive else element[0]
                        if predicted in ok_vocab and predicted not in ignore:
                            if predicted != expected:
                                logger.debug("%s: expected %s, predicted %s", line.strip(), expected, predicted)
                            break
                    if predicted == expected:
                        section['correct'].append((a, b, c, expected))
                    else:
                        section['incorrect'].append((a, b, c, expected))
        if section:
            # store the last section, too
            sections.append(section)
            self._log_evaluate_word_analogies(section)

        total = {
            'section': 'Total accuracy',
            'correct': list(itertools.chain.from_iterable(s['correct'] for s in sections)),
            'incorrect': list(itertools.chain.from_iterable(s['incorrect'] for s in sections)),
        }

        oov_ratio = float(oov) / quadruplets_no * 100
        logger.info('Quadruplets with out-of-vocabulary words: %.1f%%', oov_ratio)
        if not dummy4unknown:
            logger.info(
                'NB: analogies containing OOV words were skipped from evaluation! '
                'To change this behavior, use "dummy4unknown=True"'
            )
        analogies_score = self._log_evaluate_word_analogies(total)
        sections.append(total)
        # Return the overall score and the full lists of correct and incorrect analogies
        return analogies_score, sections
 
 #JJ/#5	F	RRR 	KKD   ;;EBB((s9   B-L71A	E;:L7;&F%!L7$F%%FL77L;>L;c                     t          | d                   t          | d                   }}||z   dk    r2t                              d| d         d|z  ||z   z  |||z              d S d S )Nr  r  r   r  r  r  r  r  s      r    log_accuracyzKeyedVectors.log_accuracy~  s     !344c'+:N6O6OY" 	KK$	"EGOw7J$KWV]`iVi    	 	r"   c                     t                               d|| d                    t                               d||d                    t                               d|           d S )Nz0Pearson correlation coefficient against %s: %.4fr   z<Spearman rank-order correlation coefficient against %s: %.4fz&Pairs with unknown words ratio: %.1f%%)r   r2  )pearsonspearmanr  pairss       r    log_evaluate_word_pairsz$KeyedVectors.log_evaluate_word_pairs  sX    FwWXzZZZRTY[cde[fggg<cBBBBBr"   	utf8c                      j         d|         }|r fdt          |          D             }n fdt          |          D             }g }	g }
d} j        |c} _        	 t          j        ||          5 }t          |          D ]\  }}|r|                    d          r	 |r$d |                    |          D             \  }}}n#d |                    |          D             \  }}}t          |          }n3# t          t          f$ r t                              d	||           Y w xY w||vs||vr|d
z  }|rYt                              d||                                           |
                    d           |	                    |           n.t                              d||                                           C|	                    |           |
                                         ||                     	 ddd           n# 1 swxY w Y   | _        n# | _        w xY wt#          |	          t#          |
          k    sJ |	st          d| d            t%          j        |	|
          }t%          j        |	|
          }|r#t          |          t#          |	          z  dz  }n%t          |          t#          |	          |z   z  dz  }t                              d||d         |d
                    t                              d||d         |d
                    t                              d|                                ||||           |||fS )a  Compute correlation of the model with human similarity judgments.

        Notes
        -----
        More datasets can be found at:
        * http://technion.ac.il/~ira.leviant/MultilingualVSMdata.html
        * https://www.cl.cam.ac.uk/~fh295/simlex.html

        Parameters
        ----------
        pairs : str
            Path to file, where lines are 3-tuples, each consisting of a word pair and a similarity value.
            See `test/test_data/wordsim353.tsv` as example.
        delimiter : str, optional
            Separator in `pairs` file.
        restrict_vocab : int, optional
            Ignore all word pairs containing a word not in the first `restrict_vocab` words.
            This may be meaningful if you've sorted the model vocabulary by descending frequency (which is standard
            in modern word embedding models).
        case_insensitive : bool, optional
            If True - convert all words to their uppercase form before evaluating the performance.
            Useful to handle case-mismatch between training tokens and words in the test set.
            In case of multiple case variants of a single word, the vector for the first occurrence
            (also the most frequent if vocabulary is sorted) is taken.
        dummy4unknown : bool, optional
            If True - produce zero similarities for word pairs with out-of-vocabulary words.
            Otherwise, these pairs are skipped entirely and not used in the evaluation.

        Returns
        -------
        pearson : tuple of (float, float)
            Pearson correlation coefficient with 2-tailed p-value.
        spearman : tuple of (float, float)
            Spearman rank-order correlation coefficient between the similarities from the dataset and the
            similarities produced by the model itself, with 2-tailed p-value.
        oov_ratio : float
            The ratio of pairs with unknown words.

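The two statistics returned above can be sketched without scipy: Pearson is plain linear correlation between the human judgements and the model's similarities, and Spearman is Pearson applied to the ranks. A stdlib-only sketch with hypothetical helpers (no handling of tied ranks, which scipy's versions do handle, and no p-values):

```python
import math

def pearson(xs, ys):
    """Pearson linear correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def spearman(xs, ys):
    """Spearman rank-order correlation: Pearson computed on the ranks."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=vs.__getitem__)
        out = [0.0] * len(vs)
        for rank, i in enumerate(order):
            out[i] = float(rank)
        return out
    return pearson(ranks(xs), ranks(ys))
```

Both coefficients lie in [-1, 1]; when the model orders the pairs exactly as the humans did, Spearman is 1 even if the raw scores are on different scales.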
        Nc                 `    i | ]*}|                                                     |          +S rV   r  r  s     r    r   z4KeyedVectors.evaluate_word_pairs.<locals>.<dictcomp>  r  r"   c                 <    i | ]}|                     |          S rV   r   r  s     r    r   z4KeyedVectors.evaluate_word_pairs.<locals>.<dictcomp>  r  r"   r   encoding#c                 6    g | ]}|                                 S rV   r  r  s     r    rX   z4KeyedVectors.evaluate_word_pairs.<locals>.<listcomp>  s     (X(X(X$(X(X(Xr"   c                     g | ]}|S rV   rV   r  s     r    rX   z4KeyedVectors.evaluate_word_pairs.<locals>.<listcomp>  s    (P(P(P$(P(P(Pr"   zSkipping invalid line #%d in %sr   z/Zero similarity for line #%d with OOV words: %sr"  z$Skipping line #%d with OOV words: %sz(No valid similarity judgements found in z8: either invalid format or all are out-of-vocabulary in r  z>Pearson correlation coefficient against %s: %f with p-value %fzJSpearman rank-order correlation coefficient against %s: %f with p-value %fzPairs with unknown words: %d)r(   r  r*   r   r  r   r  r  r  r   r_   r   r2  r  r  r   r   r   r   	spearmanrpearsonrr  )r/   r  	delimiterr  r  r  r  r  r  similarity_goldsimilarity_modelr  r  r  r  r  r  r  r  r  r  r  s   `                     r    evaluate_word_pairsz KeyedVectors.evaluate_word_pairs  s>   V #O^O4 	IPPPPhw>O>OPPPHHHHHHhw6G6GHHHH373Dh0t0	6EH555 C%.s^^ C CMGT !4??3#7#7 ! !+ Q(X(X$**YBWBW(X(X(XIAq##(P(P$**Y:O:O(P(P(PIAq##Cjj&	2 ! ! !$EwPUVVV ! ( !AX,= !q( g"LL)Z\ceieoeoeqeqrrr,33C888+2237777"KK(NPWY]YcYcYeYefff #**3///$++DOOAq,A,ABBBB/CC C C C C C C C C C C C C C C4 !6D 5D5555?##s+;'<'<<<<< 	
75 7 7047 7   ??4DEE.2BCC 	Hc

S%9%99C?IIc

c/&:&:S&@ACGIUW\^efg^hjqrsjtuuuX8A;	
 	
 	
 	3S999$$Wh	5III)++sV    H) 6.H%AC>=H>-D.+H-D..CH
H) HH) HH) )	H2zmUse fill_norms() instead. See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4c                     |                                   |r0t                              d           |                                  dS dS )a(  Precompute data helpful for bulk similarity calculations.

        :meth:`~gensim.models.keyedvectors.KeyedVectors.fill_norms` now preferred for this purpose.

        Parameters
        ----------

        replace : bool, optional
            If True - forget the original vectors and only keep the normalized ones.

        Warnings
        --------

        You **cannot sensibly continue training** after doing a replace on a model's
        internal KeyedVectors, and a replace is no longer necessary to save RAM. Do not use this method.

        zXdestructive init_sims(replace=True) deprecated & no longer required for space-efficiencyN)r   r   r   unit_normalize_all)r/   r   s     r    rY  zKeyedVectors.init_sims  sQ    , 	 	&NNuvvv##%%%%%	& 	&r"   c                     |                                   | xj        | j        dt          j        f         z  c_        t          j        t          | j                  f          | _        dS )z{Destructively scale all vectors to unit-length.

        You cannot sensibly continue training after such a step.

        .N)r   r+   r,   rN   r   r   r   r6   s    r    r  zKeyedVectors.unit_normalize_all  sS     	
3
?33Wc$,//122


r"   c                     |                      ||          }|st          d          t          |                     ||                    t	          d |D                       z  }|S )a  Compute the relative cosine similarity between two words given top-n similar words,
        by `Artuur Leeuwenberg, Mihaela Vela, Jon Dehdari, Josef van Genabith "A Minimally Supervised Approach
        for Synonym Extraction with Word Embeddings" <https://ufal.mff.cuni.cz/pbml/105/art-leeuwenberg-et-al.pdf>`_.

        To calculate relative cosine similarity between two words, equation (1) of the paper is used.
        For WordNet synonyms, if rcs(topn=10) is greater than 0.10, then `wa` and `wb` are more similar than
        an arbitrary pair of words.

        Parameters
        ----------
        wa : str
            Word for which the top-n similar words are looked up.
        wb : str
            Word whose relative cosine similarity with `wa` is evaluated.
        topn : int, optional
            Number of most similar words to `wa` to consider.

        Returns
        -------
        numpy.float64
            Relative cosine similarity between wa and wb.

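Equation (1) of the paper divides cos(wa, wb) by the sum of the cosine similarities of `wa`'s top-n neighbours, so the relative scores of those neighbours sum to 1. A toy sketch over a plain {word: vector} dict (hypothetical helper, not the gensim API, which reuses `most_similar` for the neighbour lookup):

```python
import numpy as np

def relative_cosine_similarity(vecs, wa, wb, topn=10):
    """rcs(wa, wb): cos(wa, wb) divided by the sum of wa's top-n neighbour similarities."""
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    sims = sorted(
        (cos(vecs[wa], vec) for word, vec in vecs.items() if word != wa),
        reverse=True,
    )[:topn]
    if not sims:
        raise ValueError("cannot compute rcs without any similar words")
    return cos(vecs[wa], vecs[wb]) / sum(sims)
```

If `wb` dominates `wa`'s neighbourhood, the ratio approaches 1; for an unrelated `wb` it stays near 0.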
        zFCannot calculate relative cosine similarity without any similar words.c              3       K   | ]	\  }}|V  
d S rv   rV   )rW   r   r  s      r    r  z:KeyedVectors.relative_cosine_similarity.<locals>.<genexpr>6  s&      3K3KFAsC3K3K3K3K3K3Kr"   )r  r   r  r   r   )r/   wawbr   r  rcss         r    relative_cosine_similarityz'KeyedVectors.relative_cosine_similarity  sm    0 ##B-- 	gefffDOOB++,,3K3Kd3K3K3K0K0KL
r"    r0   c	                     |t           j                  }|sdnd}	 j        v r-t           j                                         fd          }
n8|t          d d          t                              d             j        }
|t          	                    d	|           t          j        ||	          5 }|
D ]F}|                    | | d
                     |           d                    d                     G	 ddd           n# 1 swxY w Y   t          	                    d| j        |           t           j                   j        f j        j        k    sJ d}t%           j                  D ]\  }}||k    r n|dz  }t'          j        t+          d|          |
          }t          j        ||	          5 }|r3|                    | d
 j         d                    d                     |D ]} |         }|r[|                    | | d
                    d          |                    t.                                                    z              g|                    | | d
d
                    d |D                        d                    d                     	 ddd           dS # 1 swxY w Y   dS )a  Store the input-hidden weight matrix in the same format used by the original
        C word2vec-tool, for compatibility.

        Parameters
        ----------
        fname : str
            File path to save the vectors to.
        fvocab : str, optional
            File path to save additional vocabulary information to. `None` to not store the vocabulary.
        binary : bool, optional
            If True, the data will be saved in binary word2vec format, else it will be saved in plain text.
        total_vec : int, optional
            Explicitly specify total number of vectors
            (in case word vectors are appended with document vectors afterwards).
        write_header : bool, optional
            If False, don't write the 1st line declaring the count of vectors and dimensions.
            This is the format used by e.g. GloVe vectors.
        prefix : str, optional
            String to prepend in front of each stored word. Default = no prefix.
        append : bool, optional
            If set, open `fname` in `ab` mode instead of the default `wb` mode.
        sort_attr : str, optional
            Sort the output vectors in descending order of this attribute. Default: most frequent keys first.

        Nr  abc                 2                         |            S rv   )rn   )rQ   r/   	sort_attrs    r    <lambda>z3KeyedVectors.save_word2vec_format.<locals>.<lambda>\  s    UYUeUefgirUsUsTs r"   )rk   zCannot store vocabulary with 'z'' because that attribute does not existzIattribute %s not present in %s; will store in internal index_to_key orderzstoring vocabulary in %s 
r  z(storing %sx%s projection weights into %sr   r   c              3   4   K   | ]}t          |          V  d S rv   )repr)rW   rl   s     r    r  z4KeyedVectors.save_word2vec_format.<locals>.<genexpr>  s(      8Y8Ysc8Y8Y8Y8Y8Y8Yr"   )r   r(   r-   ra  r*   rJ   r   r   r   r2  r   r  r   rn   encoder'   r+   r   r   r  r  rangerM   r   tobytesjoin)r/   fnamefvocabbinary	total_vecwrite_headerprefixr   r  modestore_order_vocab_keysvoutr   index_id_countr   rl   keys_to_writefoutrk   
key_vectors   `       `           r    save_word2vec_formatz!KeyedVectors.save_word2vec_format:  s   :  	/D-..I!+ttt% 	7%+D,=,B,B,D,DJsJsJsJsJs%t%t%t""
  v !t)!t!t!tuuuNN[4   &*%6" 	gKK2F;;;FD)) gT2 g gDJJ&V$VV1A1A$	1R1RVVV]]^deeffffgg g g g g g g g g g g g g g g 	>	4K[]bcccD%&&(89T\=OOOOO  122 	  	 FAsCx aNN!a(@(@BXYY Zt$$ 	n O

i>>$*:>>>EEfMMNNN$ n n!#Y
 nJJ&0#00077??*BSBSTXBYBYBaBaBcBccddddJJ&]#]]8Y8Yj8Y8Y8Y0Y0Y]]]ddekllmmmmn	n 	n 	n 	n 	n 	n 	n 	n 	n 	n 	n 	n 	n 	n 	n 	n 	n 	ns&   ?A
DDDC-K

KKstrictc	                 2    t          | ||||||||	  	        S )a  Load KeyedVectors from a file produced by the original C word2vec-tool format.

        Warnings
        --------
        The information stored in the file is incomplete (the binary tree is missing),
        so while you can query for word similarity etc., you cannot continue training
        with a model loaded this way.

        Parameters
        ----------
        fname : str
            The file path to the saved word2vec-format file.
        fvocab : str, optional
            File path to the vocabulary. Word counts are read from `fvocab` filename, if set
            (this is the file generated by `-save-vocab` flag of the original C tool).
        binary : bool, optional
            If True, the data is in binary word2vec format.
        encoding : str, optional
            If you trained the C model using non-utf8 encoding for words, specify that encoding in `encoding`.
        unicode_errors : str, optional
            default 'strict', is a string suitable to be passed as the `errors`
            argument to the unicode() (Python 2.x) or str() (Python 3.x) function. If your source
            file may include word tokens truncated in the middle of a multibyte unicode character
            (as is common from the original word2vec.c tool), 'ignore' or 'replace' may help.
        limit : int, optional
            Sets a maximum number of word-vectors to read from the file. The default,
            None, means read all.
        datatype : type, optional
            (Experimental) Can coerce dimensions to a non-default float type (such as `np.float16`) to save memory.
            Such types may result in much slower bulk operations or incompatibility with optimized routines.
        no_header : bool, optional
            Default False means a usual word2vec-format file, with a 1st line declaring the count of
            following vectors & number of dimensions. If True, the file is assumed to lack a declaratory
            (vocab_size, vector_size) header and instead start with the 1st vector, and an extra
            reading-pass will be used to discover the number of vectors. Works only with `binary=False`.

        Returns
        -------
        :class:`~gensim.models.keyedvectors.KeyedVectors`
            Loaded model.

        )r  r  r  unicode_errorslimitdatatype	no_header)_load_word2vec_format)	clsr  r  r  r  r  r  r  r  s	            r    load_word2vec_formatz!KeyedVectors.load_word2vec_format  s1    ^ %vfxXf(i
 
 
 	
r"   r"  c           	         d}t                               d|           t          j        |d          5 }t          j        |                                |          }d |                                D             \  }	}
|
| j        k    st          d|
|fz            |r
t          t                    j        |
z  }t          |	          D ]}g }	 |                    d          }|d	k    rn|d
k    r|                    |           8t          j        d                    |          ||          }t!          j        |                    |          t                    }|| j        v r?|dz  }|| j        |                     |          <   || j        |                     |          <   nt-          |          D ]\  }}t          j        |                                ||                              d          }t1          |          |
dz   k    rt          d|z            |d         d |dd         D             }}|| j        v r?|dz  }|| j        |                     |          <   || j        |                     |          <   ddd           n# 1 swxY w Y   |                     dd| d| j        j         d|            dS )a  Merge in an input-hidden weight matrix loaded from the original C word2vec-tool format,
        where it intersects with the current vocabulary.

        No words are added to the existing vocabulary, but intersecting words adopt the file's weights, and
        non-intersecting words are left alone.

        Parameters
        ----------
        fname : str
            The file path to load the vectors from.
        lockf : float, optional
            Lock-factor value to be set for any imported word-vectors; the
            default value of 0.0 prevents further updating of the vector during subsequent
            training. Use 1.0 to allow further training updates of merged vectors.
        binary : bool, optional
            If True, `fname` is in the binary word2vec C format.
        encoding : str, optional
            Encoding of `text` for `unicode` function (python2 only).
        unicode_errors : str, optional
            Error handling behaviour, used as parameter for `unicode` function (python2 only).

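The merge semantics described above fit in a few lines over plain dicts: keys present in both mappings adopt the loaded weights, and nothing is added or removed; `lockf` then only controls whether later training may move the merged vectors (0.0 freezes them). A hypothetical sketch:

```python
def intersect_vectors(current, loaded):
    """Overwrite vectors for keys present in both mappings; add nothing, drop nothing."""
    merged = 0
    for key, vector in loaded.items():
        if key in current:
            current[key] = vector
            merged += 1
    return merged
```

The return value corresponds to the overlap count reported in the method's lifecycle log message.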
        r   "loading projection weights from %sr  r  c              3   4   K   | ]}t          |          V  d S rv   rZ   rW   xs     r    r  z9KeyedVectors.intersect_word2vec_format.<locals>.<genexpr>  s(      &F&F!s1vv&F&F&F&F&F&Fr"   z&incompatible vector size %d in file %sTr          
r"   r  errorsr&   r  z;invalid vector on line %s (is this really the text format?)c                 ,    g | ]}t          |          S rV   )r   r  s     r    rX   z:KeyedVectors.intersect_word2vec_format.<locals>.<listcomp>  s    .J.J.J1tAww.J.J.Jr"   Nintersect_word2vec_formatzmerged z vectors into z matrix from )msg)r   r2  r   r  r  readliner  r'   r   r   r   itemsizer  readr   r  rN   
fromstringr*   r+   rj   vectors_lockfr   rstripr   add_lifecycle_eventr   )r/   r  lockfr  r  r  overlap_countr  header
vocab_sizer'   
binary_lenr   r   chr   r  r  partss                      r    r  z&KeyedVectors.intersect_word2vec_format  sG   . 8%@@@Zt$$  	I%cllnnxHHHF&F&Fv||~~&F&F&F#J$"22 b !I[Z_L`!`aaa I"4[[1K?
z** I IAD, XXa[[: "!; , KKOOO, !+CHHTNNXVdeeeD mCHHZ,@,@MMMGt00 I%*=DT^^D%9%9:CH*4>>$+?+?@I  &/s^^ I IMGT!,T[[]]XVdeeekkloppE5zz[1_4 r()fip)pqqq$)!H.J.Jabb	.J.J.J'Dt00 I%*=DT^^D%9%9:CH*4>>$+?+?@A 	I  	I  	I  	I  	I  	I  	I  	I  	I  	I  	I  	I  	I  	I  	IB 	  '_-__t|7I__X]__ 	! 	
 	
 	
 	
 	
s   IJJJrJ   allow_inferencecopy_vecattrsreturnc           
         g t                      }}|D ]=}||vr7|                    |           ||r| n| j        v r|                    |           >t	          | j        t          |          | j        j                  }|D ]r}| |         }t          |d||t          |                     |rF| j
        D ]>}		 |                    ||	|                     ||	                     /# t          $ r Y ;w xY ws|S )a  Produce vectors for all given keys as a new :class:`KeyedVectors` object.

        Notes
        -----
        The keys will always be deduplicated. For optimal performance, you should not pass entire
        corpora to the method. Instead, you should construct a dictionary of unique words in your
        corpus:

        >>> from collections import Counter
        >>> import itertools
        >>>
        >>> from gensim.models import FastText
        >>> from gensim.test.utils import datapath, common_texts
        >>>
        >>> model_corpus_file = datapath('lee_background.cor')  # train word vectors on some corpus
        >>> model = FastText(corpus_file=model_corpus_file, vector_size=20, min_count=1)
        >>> corpus = common_texts  # infer word vectors for words from another corpus
        >>> word_counts = Counter(itertools.chain.from_iterable(corpus))  # count words in your corpus
        >>> words_by_freq = (k for k, v in word_counts.most_common())
        >>> word_vectors = model.wv.vectors_for_all(words_by_freq)  # create word-vectors for words in your corpus

        Parameters
        ----------
        keys : iterable
            The keys that will be vectorized.
        allow_inference : bool, optional
            In subclasses such as :class:`~gensim.models.fasttext.FastTextKeyedVectors`,
            vectors for out-of-vocabulary keys (words) may be inferred. Default is True.
        copy_vecattrs : bool, optional
            Additional attributes set via the :meth:`KeyedVectors.set_vecattr` method
            will be preserved in the produced :class:`KeyedVectors` object. Default is False.
            To ensure that *all* the produced vectors will have vector attributes assigned,
            you should set `allow_inference=False`.

        Returns
        -------
        keyedvectors : :class:`~gensim.models.keyedvectors.KeyedVectors`
            Vectors for all the given keys.

        r&   N)r3  addr*   r   r$   r'   r   r+   r   _add_word_to_kvr-   rL   rn   r   )
r/   rJ   r&  r'  rH   seenrk   kvr   rS   s
             r    vectors_for_allzKeyedVectors.vectors_for_all  s+   X #%%t 	& 	&C$ &?I448IJ &LL%%%$*CJJdl>PQQQ 	 	C3iGBc7CJJ???  M  DsD$2B2B32M2MNNNN#   	s   7+C##
C0/C0c                    | j         | _        |                                  | j                                        D ]/}|                     |d          }|| j        z   dz   }|| j        |<   0| j        d= | j        dk    r3t          t          d| j        dz                       | j
        z   | _        n| j
        | _        | j        | _        | ` | `| `| `| `
dS )zXConvert a deserialized older Doc2VecKeyedVectors instance to latest generic KeyedVectorsoffsetr   r   r   N)r9   rH   rC   r*   rJ   rn   
max_rawintr-   r   r  offset2doctagr(   vectors_docsr+   r0   )r/   rQ   
old_offset
true_indexs       r    r@   z!KeyedVectors._upconvert_old_d2vkv;  s    \
!!###"'')) 	. 	.A))!X66J#do59J#-Da  M(#?R 	3 $U1do.A%B%B C CdFX XD $ 2D(LJOr"   c                      t          d          )Nz7Call similarity_unseen_docs on a Doc2Vec model instead.)NotImplementedErrorr   s      r    similarity_unseen_docsz#KeyedVectors.similarity_unseen_docsO  s    !"[\\\r"   )NN)r   rv   )F)NTFT)NF)NNr   r   NNN)r   N)T)NNr   N)rV   )r  TFr  )r  r  r  TF)r   )NFNTr  Fr0   )r"  Fr  r  )TF)Hr5   
__module____qualname__rN   r   r1   r7   r>   rC   rg   rL   rn   rt   rw   r~   rj   r{   r   r   r   r   r   r   r   r   r   r   r   r   propertyr   setterr   r   r;   r:   rH   r   r   r  r  r  r  rF  r\  re  rg  staticmethodrn  r   rz  r   r  r  r  r  r  r  rY  r  r  r  classmethodr   r
  r  r   r   r.  r@   r8  __classcell__)r4   s   @r    r$   r$      s       *+2:D +) +) +) +)Z^ ^ ^( ( ( ( (2X X X!s !s !s !sF) ) ).* * **   & & &E E E&7 7 7 7       D Z())0 0 *)0? ? ? ?B& & &P0J 0J 0J 0Jd6 6 6*	, 	, 	,' ' 'T T T` ` ` Z)**. . +*.5 5 5 
 
 X
   : : : 	> 	> 	> 	> 
 
 X
 " " " 
 
 X
 " " " 
 
 X
 \  \R R R8 8 8 8 8" QU)-R R R Rh? ? ? ?[ [ [ [4^ ^ ^ ^4[- [- [- [-| IMV V V Vp< < < <:5 5 5    \, I  I  I  ID+ + +&K K K$E E E,   \6 FJ5Cw) w) w) w)r   \ C C \C 39HMg, g, g, g,R Z	^ & & &	 &.3 3 3   @ RV/6Ln Ln Ln Ln\ #EFS[1
 1
 1
 [1
f=
 =
 =
 =
~ GK.3> >H >t >'+>8F> > > >@  (] ] ] ] ] ] ]r"   r$   c                        e Zd Zd Zd Zd ZdS )CompatVocabc                 H    d| _         | j                            |           dS )zA single vocabulary item, used internally for collecting per-word frequency/sampling info,
        and for constructing binary trees (incl. both word leaves and inner nodes).

        Retained for now to ease the loading of older models.
        r   N)r0   rA   update)r/   rF   s     r    r1   zCompatVocab.__init__[  s'     
V$$$$$r"   c                 "    | j         |j         k     S rv   )r0   )r/   others     r    __lt__zCompatVocab.__lt__d  s    zEK''r"   c                       fdt           j                  D             } j        j        dd                    |          dS )Nc                 ^    g | ])}|                     d           |dj        |         *S )r   :)r  rA   r|   s     r    rX   z'CompatVocab.__str__.<locals>.<listcomp>h  sA    ppp\_\j\jkn\o\op333c 2 23pppr"   <r3   >)ra  rA   r4   r5   r  )r/   valss   ` r    r7   zCompatVocab.__str__g  sJ    ppppvdm?T?Tppp>222DIIdOOOODDr"   N)r5   r9  r:  r1   rF  r7   rV   r"   r    rA  rA  Y  sF        % % %( ( (E E E E Er"   rA  c                 (   |                      |          rt                              d|           d S |                     ||          }|||z
  }n*||v r	||         }nt                              d|           d }|                     |d|           d S )Nz<duplicate word '%s' in word2vec file, ignoring all but firstz.vocabulary file is incomplete: '%s' is missingr0   )r   r   r   r   rL   )r-  countsr   r   r"  word_id
word_counts          r    r+  r+  r  s    	 UW[\\\mmD'**G 	  ')

	 D\

GNNN
NN4*-----r"   c                    d}d}	|t          t                    j        z  }
|| j        z
  }|dk    sJ t	          |          D ]}|                    d|          }|dz   }|dk    st          |          |z
  |
k     r n}|||                             ||          }|                    d          }t          |||t                    
                    |          }t          | ||||           ||
z   }|	dz  }	|	||d          fS )Nr   r  r   r   r  r  )r0  r0   r   )r   r   r  r)   r  findr   decoder  r   rM   r+  )r-  rN  chunkr"  r'   r  r  r  startprocessed_wordsbytes_per_vector	max_wordsr   i_spacei_vectorr   r   s                    r    _add_bytes_to_kvr\    s/   EO"U4[[%99R]*Iq=9  **T5))Q;b= 	SZZ(26FF 	EU7]#**8N*KK{{4  E(+TRRRYYZbccFD&*===++1E%&&M))r"   utf-8c	           
          d}	d}
|
|k     rR|                      |          }|	|z  }	t          |||	|||||          \  }}	|
|z  }
t          |          |k     rn|
|k     R|
|k    rt          d          d S )Nr"   r   Funexpected end of input; is count incorrect or file otherwise damaged?)r  r\  r   EOFError)r  r-  rN  r"  r'   r  r  binary_chunk_sizer  rU  tot_processed_words	new_chunkrW  s                r    _word2vec_read_binaryrd    s     E


* HH.//	!1z;.RZ"\ "\.y>>-- 	 

*  j( a_```a ar"   c                     t          |          D ]S}|                                 }	|	dk    rt          d          t          |	|||          \  }
}t	          |||
||           Td S )Nr"   r_  )r  r  r`  _word2vec_line_to_vectorr+  )r  r-  rN  r"  r'   r  r  r  r  r  r   r   s               r    _word2vec_read_textrg    s}    $$ ? ?||~~3; 	ecddd0xQYZZgFD':>>>>? ?r"   c                     t          j        |                                 ||                              d          }|d         fd|dd          D             }}||fS )Nr  r  r   c                 &    g | ]} |          S rV   rV   )rW   r  r  s     r    rX   z,_word2vec_line_to_vector.<locals>.<listcomp>  s!    >>>qxx{{>>>r"   r   )r   r  r  r  )r  r  r  r  r%  r   r   s    `     r    rf  rf    se    T[[]]XnUUU[[\_``E!H>>>>E!""I>>>'D=r"   c                     d }t          j                    D ]K}|                                 }|dk    s||k    r n(|r't          ||||          \  }}	t	          |	          }L||fS )Nr"   )r  r0   r  rf  r   )
r  r  r  r  r  r'   r"  r  r   r   s
             r    _word2vec_detect_sizes_textrk    s    Ko'' # #
||~~3; 	*- 	E 	0xQYZZg'll{""r"   Fr  r  i  c
                 v   d}
|t                               d|           i }
t          j        |d          5 }|D ]Q}t          j        ||                                                                          \  }}t          |          |
|<   R	 ddd           n# 1 swxY w Y   t                               d|           t          j        |d          5 }|rQ|rt          d          t          |||||          \  }}|
                                 t          j        |d          }nIt          j        |                                |          }d |                                D             \  }}|rt          ||          } | |||	          }|rt          |||
|||||	|	  	         nt          |||
|||||           ddd           n# 1 swxY w Y   |j        j        d
         t#          |          k    rgt                               d|j        j        d
         t#          |                     t%          |j        dt#          |                             |_        t#          |          |f|j        j        k    sJ |                    dd|j        j         d|j        j         d| ||           |S )a"  Load the input-hidden weight matrix from the original C word2vec-tool format.

    Note that the information stored in the file is incomplete (the binary tree is missing),
    so while you can query for word similarity etc., you cannot continue training
    with a model loaded this way.

    Parameters
    ----------
    fname : str
        The file path to the saved word2vec-format file.
    fvocab : str, optional
        File path to the vocabulary. Word counts are read from `fvocab` filename, if set
        (this is the file generated by `-save-vocab` flag of the original C tool).
    binary : bool, optional
        If True, indicates whether the data is in binary word2vec format.
    encoding : str, optional
        If you trained the C model using non-utf8 encoding for words, specify that encoding in `encoding`.
    unicode_errors : str, optional
        default 'strict'; the error-handling scheme passed as the `errors`
        argument to :meth:`bytes.decode`. If your source
        file may include word tokens truncated in the middle of a multibyte unicode character
        (as is common from the original word2vec.c tool), 'ignore' or 'replace' may help.
    limit : int, optional
        Sets a maximum number of word-vectors to read from the file. The default
        of `sys.maxsize` effectively means read all.
    datatype : type, optional
        (Experimental) Can coerce dimensions to a non-default float type (such as `np.float16`) to save memory.
        Such types may result in much slower bulk operations or incompatibility with optimized routines.
    binary_chunk_size : int, optional
        Read input file in chunks of this many bytes for performance reasons.

    Returns
    -------
    object
        Returns the loaded model as an instance of :class:`cls`.

    Nzloading word counts from %sr  rR  r  z.no_header only available for text-format filesr  c                 ,    g | ]}t          |          S rV   r  r  s     r    rX   z)_load_word2vec_format.<locals>.<listcomp>  s    &F&F&F!s1vv&F&F&Fr"   r&   r   z=duplicate words detected, shrinking matrix size from %i to %ir
  zloaded z matrix of type z from )r  r  r  )r   r2  r   r  r  r  r  rZ   r7  rk  closer  r`   rd  rg  r+   r   r   r   r  r   )r	  r  r  r  r  r  r  r  r  ra  rN  r  r  r   r0   r"  r'   r!  r-  s                      r    r  r    sp   R F *16:::Z%% 	* * *#.tNKKKQQSSYY[[e"5zzt*	* 	* 	* 	* 	* 	* 	* 	* 	* 	* 	* 	* 	* 	* 	*
 KK4e<<<	E4	 	  nC 
	G v)*Z[[[*Ec5RZ\jlt*u*u'
KIIKKK*UD))CC%cllnnxHHHF&F&Fv||~~&F&F&F#J 	0Z//JSj999 	n!R[(NTego     R[(Tbdlmmm+n n n n n n n n n n n n n n n, 
zc"gg% >KJQR	
 	
 	
 'rz)CGG)'<==
GG[!RZ%55555Wbj&WW
8HWWPUWW    
 Is%   ABB BC-GGGc                  $    t          j        | i |S )zPAlias for :meth:`~gensim.models.keyedvectors.KeyedVectors.load_word2vec_format`.)r$   r
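Both branches of the loader above parse the original word2vec.c on-disk layouts: a text format (a "<vocab_size> <vector_size>" header line, then one token-plus-floats line per word) and a binary format (same header, then each token followed by a single space and `vector_size` packed little-endian float32s). As a minimal, self-contained sketch of the binary layout, here is a hypothetical writer/reader pair (stdlib only, not the gensim API; it omits the trailing newline the original C tool appends after each vector, which is why the loader strips leading newlines from tokens):

```python
import struct
from io import BytesIO


def write_word2vec_binary(vectors, fout):
    # Header: ASCII "<vocab_size> <vector_size>\n"; then per word:
    # token bytes, one space, vector_size little-endian float32s.
    vector_size = len(next(iter(vectors.values())))
    fout.write(f"{len(vectors)} {vector_size}\n".encode("utf-8"))
    for word, vec in vectors.items():
        fout.write(word.encode("utf-8") + b" ")
        fout.write(struct.pack(f"<{vector_size}f", *vec))


def read_word2vec_binary(fin):
    vocab_size, vector_size = map(int, fin.readline().split())
    out = {}
    for _ in range(vocab_size):
        # read the token byte-by-byte up to the separating space
        word = bytearray()
        while (ch := fin.read(1)) != b" ":
            word.extend(ch)
        vec = struct.unpack(f"<{vector_size}f", fin.read(4 * vector_size))
        out[word.decode("utf-8")] = list(vec)
    return out


buf = BytesIO()
write_word2vec_binary({"king": [0.5, -0.25], "queen": [1.0, 2.0]}, buf)
buf.seek(0)
roundtrip = read_word2vec_binary(buf)
```

The values chosen here (0.5, -0.25, 1.0, 2.0) are exactly representable in float32, so the roundtrip is bit-exact; arbitrary floats would come back rounded to float32 precision, which is also what the real loader stores unless a wider `datatype` is requested.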
def load_word2vec_format(*args, **kwargs):
    """Alias for :meth:`~gensim.models.keyedvectors.KeyedVectors.load_word2vec_format`."""
    return KeyedVectors.load_word2vec_format(*args, **kwargs)


def pseudorandom_weak_vector(size, seed_string=None, hashfxn=hash):
    """Get a random vector, derived deterministically from `seed_string` if supplied.

    Useful for initializing KeyedVectors that will be the starting projection/input layers of _2Vec models.

    """
    if seed_string:
        once = np.random.Generator(np.random.SFC64(hashfxn(seed_string) & 0xffffffff))
    else:
        once = utils.default_prng
    return (once.random(size).astype(REAL) - 0.5) / size


def prep_vectors(target_shape, prior_vectors=None, seed=0, dtype=REAL):
    """Return a numpy array of the given shape. Reuse prior_vectors object or values
    to the extent possible. Initialize new values randomly if requested.

    """
    if prior_vectors is None:
        prior_vectors = np.zeros((0, 0))
    if prior_vectors.shape == target_shape:
        return prior_vectors
    target_count, vector_size = target_shape
    rng = np.random.default_rng(seed=seed)  # use new instance of numpy's recommended generator/algorithm
    new_vectors = rng.random(target_shape, dtype=dtype)  # [0.0, 1.0)
    new_vectors *= 2.0  # [0.0, 2.0)
    new_vectors -= 1.0  # [-1.0, 1.0)
    new_vectors /= vector_size
    new_vectors[0:prior_vectors.shape[0], 0:prior_vectors.shape[1]] = prior_vectors
    return new_vectors
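`pseudorandom_weak_vector` keys its RNG off `hashfxn(seed_string)`, so the same string always yields the same small-magnitude vector within one process; note, though, that Python's builtin `hash()` on strings is randomized per process unless `PYTHONHASHSEED` is pinned. A hypothetical stdlib-only variant of the same idea, using a SHA-256-derived seed that is stable across processes (a sketch, not the gensim implementation):

```python
import hashlib
import random


def stable_weak_vector(size, seed_string=""):
    # Same contract as pseudorandom_weak_vector above: entries uniform in
    # (-0.5/size, 0.5/size), fully determined by seed_string -- but the seed
    # comes from SHA-256, so results reproduce across interpreter runs.
    seed = int.from_bytes(hashlib.sha256(seed_string.encode("utf-8")).digest()[:8], "little")
    rng = random.Random(seed)
    return [(rng.random() - 0.5) / size for _ in range(size)]


v1 = stable_weak_vector(4, "king")
v2 = stable_weak_vector(4, "king")
v3 = stable_weak_vector(4, "queen")
```

Dividing by `size` keeps the starting vectors tiny, the same scaling the word2vec family uses so that initial projections don't swamp early training updates.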