The TVF file extension is one of the file extensions used by Apache Lucene, a full-featured text search engine library written entirely in Java. The library is particularly useful for cross-platform applications that require full-text search.

Apache Lucene is built on four basic concepts: the Index, which contains a sequence of documents; the Document, which is a sequence of fields; the Field, which is a named sequence of terms; and the Term, which is a string. Each index stores field names, stored field values, the term dictionary, term frequency data, term proximity data, normalization factors, deleted documents, and term vectors.

A term vector is the list of terms in a document field together with their frequencies; one term vector can be stored for each field in the document. Term vectors are stored across three files, each identified by its own extension: the Document Index file uses the TVX file extension, the Document file uses the TVD file extension, and the Field file uses the TVF file extension.
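As a rough illustration of what a term vector holds, the sketch below builds a list of terms and frequencies for the text of a single field. The function name `term_vector` is hypothetical, and lowercasing plus whitespace splitting is only a stand-in for a real Lucene analyzer.

```python
from collections import Counter


def term_vector(field_text):
    """Return sorted (term, frequency) pairs for the text of one field."""
    # Assumption: lowercasing and whitespace splitting replace a real analyzer.
    terms = field_text.lower().split()
    # Count each term, then sort by term text so the output order is stable.
    return sorted(Counter(terms).items())


if __name__ == "__main__":
    print(term_vector("the quick fox jumps over the lazy fox"))
    # [('fox', 2), ('jumps', 1), ('lazy', 1), ('over', 1), ('quick', 1), ('the', 2)]
```

Sorting the terms is a design choice for this sketch; it also keeps neighboring terms similar, which is what makes shared prefixes between consecutive terms worthwhile.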

Files with the .tvf file extension contain a list of terms along with their frequencies and, optionally, position and offset information. Each file has a TVF Version, followed by a Number Term count, a Position/Offset byte, and Term Frequency entries, repeated once per field (Number Field times). The TVF Version is stored as an integer, while the Number Term is stored as a VInt, a variable-length format for positive integers. In a VInt, the high-order bit of each byte indicates whether more bytes follow and the low-order seven bits hold the data, so values from 0 to 127 fit in a single byte.
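The VInt format can be sketched as follows, assuming the usual layout in which each byte carries seven data bits (least significant group first) and the high-order bit marks that another byte follows. The helper names `write_vint` and `read_vint` are chosen for this example and are not Lucene's API.

```python
def write_vint(value):
    """Encode a non-negative integer as VInt bytes."""
    if value < 0:
        raise ValueError("VInt encodes non-negative integers only")
    out = bytearray()
    while value > 0x7F:
        # Emit the low seven bits and set the high bit: more bytes follow.
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)  # Final byte has the high bit clear.
    return bytes(out)


def read_vint(data, pos=0):
    """Decode one VInt from data at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:  # High bit clear: this was the last byte.
            return result, pos
        shift += 7


if __name__ == "__main__":
    print(write_vint(127).hex())   # 7f
    print(write_vint(128).hex())   # 8001
    print(read_vint(b"\x80\x01"))  # (128, 2)
```

Values up to 127 take one byte, and each further seven bits of magnitude cost one more byte, which keeps small counts such as term frequencies compact.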

The Position/Offset byte indicates whether the term vector has position or offset information stored in the .tvf file. Position information is stored as delta-encoded VInts, meaning that only the difference between the current position and the previous one is stored. Offset information is likewise stored as two delta-encoded VInts: the first VInt is the startOffset and the second is the endOffset.
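A minimal sketch of reversing the delta encoding is given below. It works on integers that have already been decoded from VInts. The text does not say what each offset delta is measured against, so the sketch assumes that the start delta is relative to the previous endOffset and the end delta is relative to the current startOffset; treat that as an assumption rather than a statement of the format. The function names are hypothetical.

```python
def decode_positions(deltas):
    """Turn delta-encoded positions back into absolute positions."""
    positions = []
    last = 0
    for delta in deltas:
        last += delta  # Each stored value is the gap from the previous position.
        positions.append(last)
    return positions


def decode_offsets(pairs):
    """Turn (start_delta, end_delta) pairs into absolute (start, end) offsets."""
    offsets = []
    last_end = 0
    for start_delta, end_delta in pairs:
        # Assumption: start is stored relative to the previous end offset,
        # and end is stored relative to this term occurrence's start offset.
        start = last_end + start_delta
        end = start + end_delta
        offsets.append((start, end))
        last_end = end
    return offsets


if __name__ == "__main__":
    print(decode_positions([3, 4, 10]))      # [3, 7, 17]
    print(decode_offsets([(0, 5), (1, 3)]))  # [(0, 5), (6, 9)]
```

Storing gaps instead of absolute values keeps most numbers small, so they usually fit in a single VInt byte.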

Term Text prefixes are shared between consecutive terms. The PrefixLength is the number of initial characters taken from the previous term; those characters have to be prepended to the term's suffix in order to form the new term text in the .tvf file. For example, if the previous term is "bone" and the current term is "boy", the PrefixLength is 2 and the suffix is "y".
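The prefix sharing can be sketched as a pair of helpers that convert between full term texts and (PrefixLength, suffix) entries. The names `encode_terms` and `decode_terms` are hypothetical, and the sketch works on Python strings rather than on the bytes of a real .tvf file.

```python
def encode_terms(terms):
    """Convert term texts into (prefix_length, suffix) entries."""
    entries = []
    previous = ""
    for term in terms:
        # Count the leading characters shared with the previous term.
        prefix_length = 0
        limit = min(len(previous), len(term))
        while prefix_length < limit and previous[prefix_length] == term[prefix_length]:
            prefix_length += 1
        entries.append((prefix_length, term[prefix_length:]))
        previous = term
    return entries


def decode_terms(entries):
    """Rebuild term texts from (prefix_length, suffix) entries."""
    terms = []
    previous = ""
    for prefix_length, suffix in entries:
        # Prepend the shared prefix of the previous term to this suffix.
        term = previous[:prefix_length] + suffix
        terms.append(term)
        previous = term
    return terms


if __name__ == "__main__":
    print(encode_terms(["bone", "boy", "boycott"]))
    # [(0, 'bone'), (2, 'y'), (3, 'cott')]
    print(decode_terms([(0, "bone"), (2, "y"), (3, "cott")]))
    # ['bone', 'boy', 'boycott']
```

Because each term depends on the one before it, the entries have to be read in order; a reader cannot jump to an arbitrary term without decoding the terms that precede it.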

Author: The Apache Software Foundation
Author URL:
Related Applications: Apache Lucene
Common Path: N/A


