Marqo InstantAPI

class marqo_instantapi.instant_api_client.InstantAPIClient(api_key)[source]

Bases: object

A client for the InstantAPI API.

next_pages(webpage_url)[source]

Not yet implemented.

Return type:

dict

retrieve(webpage_url, api_method_name, api_response_structure, api_parameters=None, country_code=None, verbose=False, wait_for_xpath=None, enable_javascript=None, cache_ttl=None, serp_limit=None, serp_site=None, serp_page_num=None)[source]

Implements an interface to the InstantAPI retrieve endpoint.

Parameters:
  • webpage_url (str) – The URL of the webpage to retrieve data from.

  • api_method_name (str) – The name of the API method to use.

  • api_response_structure (Union[str, Dict[str, Any]]) – The structure of the API response.

  • api_parameters (Optional[Union[str, Dict[str, Any]]], optional) – The parameters to pass to the API method. Defaults to None.

  • country_code (Optional[str], optional) – The country code to use for the request. Defaults to None.

  • verbose (bool, optional) – Whether to return verbose output. Defaults to False.

  • wait_for_xpath (Optional[str], optional) – The XPath to wait for before returning the response. Defaults to None.

  • enable_javascript (Optional[bool], optional) – Whether to enable JavaScript in the browser. Defaults to None.

  • cache_ttl (Optional[int], optional) – The time-to-live for the cache. Defaults to None.

  • serp_limit (Optional[int], optional) – The number of results to return for SERP requests. Defaults to None.

  • serp_site (Optional[str], optional) – The site to use for SERP requests. Defaults to None.

  • serp_page_num (Optional[int], optional) – The page number to use for SERP requests. Defaults to None.

Returns:

The response from the InstantAPI retrieve endpoint.

Return type:

Dict[str, Any]
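
As a minimal sketch of how the parameters above fit together, the helper below wraps a single retrieve call. The helper name fetch_product, the method name "getProductDetails", and the field names in the response structure are hypothetical and chosen only for illustration; the exact schema InstantAPI expects is not specified in this reference.

```python
from typing import Any, Dict


def fetch_product(client: Any, webpage_url: str) -> Dict[str, Any]:
    """Call retrieve() with a hypothetical method name and response structure."""
    # Hypothetical structure: the field names and placeholder descriptions
    # are illustrative assumptions, not a schema required by InstantAPI.
    response_structure = {
        "title": "<the product title>",
        "price": "<the product price>",
    }
    return client.retrieve(
        webpage_url=webpage_url,
        api_method_name="getProductDetails",  # hypothetical method name
        api_response_structure=response_structure,
        enable_javascript=True,  # enable JavaScript in the browser
        cache_ttl=3600,  # the unit is not stated in this reference
    )
```

Here client would typically be an InstantAPIClient(api_key) instance imported from marqo_instantapi.instant_api_client. Any object exposing a compatible retrieve method works, which makes the helper easy to exercise with a stub.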

class marqo_instantapi.marqo_instantapi_adapter.InstantAPIMarqoAdapter(marqo_url='http://localhost:8882', marqo_api_key=None, instantapi_key=None)[source]

Bases: object

A class for interfacing with Marqo and InstantAPI.

add_documents(webpage_urls, index_name, api_response_structure, api_method_name, text_fields_to_index=[], image_fields_to_index=[], client_batch_size=8, total_image_weight=0.9, total_text_weight=0.1, enforce_schema=True, instantapi_threads=10)[source]

Add documents to a Marqo index from a list of webpage URLs. Data is extracted using the InstantAPI Retrieve API.

Parameters:
  • webpage_urls (list[str]) – A list of webpage URLs to index.

  • index_name (str) – The name of the index to add documents to.

  • api_response_structure (dict) – The expected structure of the API’s response.

  • api_method_name (str) – The name of the API method to use for data extraction.

  • text_fields_to_index (list[str], optional) – A list of text fields for indexing. Defaults to [].

  • image_fields_to_index (list[str], optional) – A list of image fields for indexing. Defaults to [].

  • client_batch_size (int, optional) – The client batch size for Marqo. Defaults to 8.

  • total_image_weight (float, optional) – The total weight for images. Defaults to 0.9.

  • total_text_weight (float, optional) – The total weight for text. Defaults to 0.1.

  • enforce_schema (bool, optional) – Toggle strict enforcement of InstantAPI responses against the schema. Defaults to True.

  • instantapi_threads (int, optional) – The number of threads to use for InstantAPI requests. Defaults to 10.

Raises:

ValueError – If no fields are provided for indexing.

Returns:

A list of responses for each document added.

Return type:

list[dict]
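
The call can be sketched as follows. The wrapper index_pages, the method name "getPageData", and the generated response structure are hypothetical. The local check only mirrors the documented ValueError so that the mistake surfaces before any request is made; it is not the adapter's own validation.

```python
from typing import Any, Dict, List, Optional


def index_pages(
    adapter: Any,
    urls: List[str],
    index_name: str,
    text_fields: Optional[List[str]] = None,
    image_fields: Optional[List[str]] = None,
) -> List[Dict[str, Any]]:
    """Index the given URLs, failing early if no fields were chosen."""
    text_fields = text_fields or []
    image_fields = image_fields or []
    if not text_fields and not image_fields:
        raise ValueError("Provide at least one text or image field to index.")

    # Hypothetical structure: one placeholder description per requested field.
    structure = {name: f"<{name}>" for name in text_fields + image_fields}

    return adapter.add_documents(
        webpage_urls=urls,
        index_name=index_name,
        api_response_structure=structure,
        api_method_name="getPageData",  # hypothetical method name
        text_fields_to_index=text_fields,
        image_fields_to_index=image_fields,
    )
```

The remaining parameters (client_batch_size, the two weights, enforce_schema, instantapi_threads) are left at their documented defaults in this sketch.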

crawl(initial_webpage_urls, allowed_domains, index_name, api_response_structure, text_fields_to_index=[], image_fields_to_index=[], client_batch_size=8, total_image_weight=0.9, total_text_weight=0.1, enforce_schema=True, max_pages=None)[source]

Crawl a set of webpages and add them to a Marqo index.

Parameters:
  • initial_webpage_urls (list[str]) – A list of initial webpage URLs to start the crawl from.

  • allowed_domains (set[str]) – A set of domains that the crawl is restricted to.

  • index_name (str) – The name of the index to add documents to. If the index does not exist, it will be created based on the fields to index.

  • api_response_structure (dict) – The expected structure of the API’s response, which is passed to InstantAPI.

  • text_fields_to_index (list[str], optional) – A list of text fields for indexing. Defaults to [].

  • image_fields_to_index (list[str], optional) – A list of image fields for indexing. Defaults to [].

  • client_batch_size (int, optional) – The client batch size for Marqo, which controls how many documents are sent at a time. Defaults to 8.

  • total_image_weight (float, optional) – The total weight for images. Applies when both image and text fields are provided. Defaults to 0.9.

  • total_text_weight (float, optional) – The total weight for text. Applies when both image and text fields are provided. Defaults to 0.1.

  • enforce_schema (bool, optional) – Toggle strict enforcement of InstantAPI responses against the schema. Defaults to True.

  • max_pages (Optional[int], optional) – The maximum number of pages to crawl. Defaults to None.

Raises:

ValueError – If no fields are provided for indexing.

Returns:

A list of responses for each document added.

Return type:

list[dict]
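
To illustrate what a domain allow-list means during a crawl, the sketch below filters discovered URLs by hostname. It assumes that the parameter name allowed_domains reflects an allow-list and that matching is done on the exact hostname. Both are assumptions: the adapter's actual matching rule is not documented here, and the helper filter_allowed is hypothetical.

```python
from typing import Iterable, List, Set
from urllib.parse import urlparse


def filter_allowed(urls: Iterable[str], allowed_domains: Set[str]) -> List[str]:
    """Keep only URLs whose hostname exactly matches an allowed domain."""
    kept = []
    for url in urls:
        # hostname is None for strings that do not parse as absolute URLs.
        host = urlparse(url).hostname or ""
        if host in allowed_domains:
            kept.append(url)
    return kept
```

With exact matching, "example.com" does not cover "shop.example.com"; each subdomain to be crawled would need its own entry under this assumption.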

create_index(index_name, multimodal=False, model=None, skip_if_exists=False)[source]

Simplified method for creating a Marqo index. Recommended when fine-grained control is not needed.

Parameters:
  • index_name (str) – The name of the index to create.

  • multimodal (bool, optional) – Toggles image downloading on or off. If model is not provided, this also influences model selection. Defaults to False.

  • model (Optional[str], optional) – The name of a specific model to use. Defaults to None.

  • skip_if_exists (bool, optional) – Skip index creation if the index already exists. Does not check whether the existing index conforms to the provided parameters. Defaults to False.

Returns:

The index creation response.

Return type:

dict

Examples

>>> marqo_adapter = InstantAPIMarqoAdapter()
>>> marqo_adapter.create_index("my-index")

delete_index(index_name, confirm=False, skip_if_not_exists=False)[source]

Delete a Marqo index.

Parameters:
  • index_name (str) – The name of the index to delete.

  • confirm (bool, optional) – Automatically confirms the deletion. Defaults to False.

  • skip_if_not_exists (bool, optional) – Skip deletion if the index does not exist. Defaults to False.

Returns:

The deletion response.

Return type:

dict
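
A common pattern during development is to drop and recreate an index. The sketch below combines delete_index and create_index using only the parameters documented above; the helper name reset_index is hypothetical.

```python
from typing import Any, Dict


def reset_index(adapter: Any, index_name: str, multimodal: bool = False) -> Dict[str, Any]:
    """Delete the index if it exists, then create it again."""
    # confirm=True confirms the deletion automatically;
    # skip_if_not_exists=True makes the first run on a fresh instance a no-op.
    adapter.delete_index(index_name, confirm=True, skip_if_not_exists=True)
    return adapter.create_index(index_name, multimodal=multimodal)
```

Because deletion is confirmed automatically here, a helper like this is best kept away from production index names.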

search(q, index_name, limit=10, offset=0, searchable_attributes=None, search_method='hybrid')[source]

Search a Marqo index via a simplified interface.

Parameters:
  • q (str) – The query string to search for.

  • index_name (str) – The name of the index to search.

  • limit (int, optional) – The number of results to retrieve. Defaults to 10.

  • offset (int, optional) – The offset for the search results. Defaults to 0.

  • searchable_attributes (Optional[list], optional) – The attributes to search. Defaults to None.

  • search_method (Literal["tensor", "lexical", "hybrid"], optional) – The search method to use: tensor uses only vectors, lexical uses only text, and hybrid combines both with reciprocal rank fusion (RRF). Defaults to "hybrid".

Raises:

ValueError – If an invalid search method is provided.

Returns:

The search response from Marqo.

Return type:

dict
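
The limit and offset parameters can be combined to page through results. The generator below is a minimal sketch, assuming the adapter returns Marqo's search response unchanged so that results sit under the "hits" key; the helper name search_pages and its page_size and max_pages arguments are hypothetical.

```python
from typing import Any, Dict, Iterator

VALID_METHODS = ("tensor", "lexical", "hybrid")


def search_pages(
    adapter: Any,
    q: str,
    index_name: str,
    page_size: int = 10,
    max_pages: int = 3,
    search_method: str = "hybrid",
) -> Iterator[Dict[str, Any]]:
    """Yield hits page by page until a short or empty page is returned."""
    if search_method not in VALID_METHODS:
        raise ValueError(f"search_method must be one of {VALID_METHODS}")
    for page in range(max_pages):
        response = adapter.search(
            q,
            index_name,
            limit=page_size,
            offset=page * page_size,  # skip the hits already yielded
            search_method=search_method,
        )
        hits = response.get("hits", [])
        yield from hits
        if len(hits) < page_size:
            break  # last page reached
```

Because this is a generator, the ValueError for an invalid method is raised on first iteration rather than at call time. Whether every search method accepts a non-zero offset may depend on the Marqo version in use.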