Site Text Dataset

Extracted text and site links. Available only in JSON format and as a daily feed.

Name | Format | Description
url | String | Page from which the content was extracted.
domain | String | Normalized domain name (excluding any subdomain).
extraction_date | Date (YYYY-MM-DD) | Date on which the content was extracted.
page | String | Classification of the URL from which the content was extracted. Possible values are ‘home’, ‘about’, ‘careers’, ‘documentation’, ‘location’, ‘policy’, ‘products’, ‘services’.
extracted_text | String | Text extracted from the URL, with all HTML tags and other elements that cannot be used for embeddings removed.
extracted_links | Array[LinkObject] | List of URL links found in <a> tags, each with its corresponding or alternative text.
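
As a rough illustration of how one record from the feed might be read, the Python sketch below parses a single JSON object into typed fields. The top-level field names and the list of page values come from the table above. The keys inside each LinkObject are not specified here, so `url` and `text` are assumptions, as are the sample values and the helper name `parse_record`.

```python
import json
from dataclasses import dataclass
from datetime import date

# Page classifications listed in the field table.
PAGE_TYPES = {
    "home", "about", "careers", "documentation",
    "location", "policy", "products", "services",
}


@dataclass
class Link:
    # Assumed key names: the LinkObject layout is not documented above.
    url: str
    text: str


@dataclass
class SiteTextRecord:
    url: str
    domain: str
    extraction_date: date
    page: str
    extracted_text: str
    extracted_links: list[Link]


def parse_record(raw: str) -> SiteTextRecord:
    """Parse one JSON record (hypothetical helper, not part of the dataset)."""
    obj = json.loads(raw)
    if obj["page"] not in PAGE_TYPES:
        raise ValueError(f"unexpected page classification: {obj['page']!r}")
    return SiteTextRecord(
        url=obj["url"],
        domain=obj["domain"],
        # extraction_date is documented as YYYY-MM-DD.
        extraction_date=date.fromisoformat(obj["extraction_date"]),
        page=obj["page"],
        extracted_text=obj["extracted_text"],
        extracted_links=[
            Link(url=item["url"], text=item["text"])
            for item in obj["extracted_links"]
        ],
    )


# Invented sample record, for illustration only.
sample = json.dumps({
    "url": "https://www.example.com/about",
    "domain": "example.com",
    "extraction_date": "2024-01-15",
    "page": "about",
    "extracted_text": "Example Corp builds widgets.",
    "extracted_links": [
        {"url": "https://www.example.com/careers", "text": "Careers"},
    ],
})

record = parse_record(sample)
print(record.domain, record.extraction_date, len(record.extracted_links))
```

Rejecting unknown `page` values is a design choice for this sketch; a consumer that needs to tolerate new classifications could log them instead of raising.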