Rag user

RAG user agent.

It extends a user agent and adds RAG-related parameters (retrieve_config).

WaldiezRagUserProxy

Bases: WaldiezAgent

RAG user agent.

It extends a user agent and adds RAG-related parameters.

Attributes:

agent_type (Literal['rag_user', 'rag_user_proxy']): The agent type: 'rag_user' (or 'rag_user_proxy') for a RAG user agent.

data (WaldiezRagUserProxyData): The RAG user agent's data. See WaldiezRagUserProxyData for more info.

retrieve_config (WaldiezRagUserProxyRetrieveConfig): The RAG user agent's retrieve config.

retrieve_config property

Get the retrieve config.

Returns:

WaldiezRagUserProxyRetrieveConfig: The RAG user agent's retrieve config.

Waldiez RAG user agent data.

WaldiezRagUserProxyData

Bases: WaldiezUserProxyData

RAG user agent data.

The data for a RAG user agent.

Attributes:

use_message_generator (bool): Whether to use the message generator in the user's chats. Defaults to False.

retrieve_config (WaldiezRagUserProxyRetrieveConfig): The RAG user agent's retrieve config.

RAG user agent retrieve config.

WaldiezRagUserProxyChunkMode module-attribute

WaldiezRagUserProxyChunkMode = Literal[
    "multi_lines", "one_line"
]

Possible chunk modes for the retrieve chat.

WaldiezRagUserProxyRetrieveConfig

Bases: WaldiezBase

RAG user agent retrieve config.

Attributes:

task (Literal['code', 'qa', 'default']): The task of the retrieve chat. Possible values are 'code', 'qa' and 'default'. The system prompt differs per task. The default value is 'default', which supports both code and qa and provides source information at the end of the response.

vector_db (Literal['chroma', 'pgvector', 'mongodb', 'qdrant']): The vector db for the retrieve chat.

db_config (Annotated[WaldiezVectorDbConfig, Field]): The config for the selected vector db.

docs_path (Optional[Union[str, list[str]]]): The path to the docs directory. It can also be the path to a single file, the URL of a single file, or a list of directories, files and URLs. Default is None, which works only if the collection is already created.

new_docs (bool): When True, only adds new documents to the collection; when False, updates existing documents and adds new ones. Default is True. The document id determines whether a document is new or existing. By default, the id is the hash value of the content.

model (Optional[str]): The model to use for the retrieve chat. If not provided, the default model gpt-4 is used.

chunk_token_size (Optional[int]): The chunk token size for the retrieve chat. If not provided, a default size of max_tokens * 0.4 is used.

context_max_tokens (Optional[int]): The context max token size for the retrieve chat. If not provided, a default size of max_tokens * 0.8 is used.

chunk_mode (Optional[str]): The chunk mode for the retrieve chat. Possible values are 'multi_lines' and 'one_line'. If not provided, the default mode 'multi_lines' is used.

must_break_at_empty_line (bool): If True, chunks break only at empty lines. Default is True. Ignored if chunk_mode is 'one_line'.

use_custom_embedding (bool): Whether to use a custom embedding for the retrieve chat. Default is False. If True, embedding_function must be provided.

embedding_function (Optional[str]): The embedding function for creating the vector db. Default is None, in which case SentenceTransformer with the given embedding_model is used. To use OpenAI, Cohere, HuggingFace or other embedding functions, pass one here, following the examples in https://docs.trychroma.com/guides/embeddings.

customized_prompt (Optional[str]): The customized prompt for the retrieve chat. Default is None.

customized_answer_prefix (Optional[str]): The customized answer prefix for the retrieve chat. Default is ''. If it is not '' and the customized_answer_prefix is not in the answer, Update Context is triggered.

update_context (bool): If False, Update Context is not applied for interactive retrieval. Default is True.

collection_name (Optional[str]): The name of the collection. If not provided, the default name autogen-docs is used.

get_or_create (bool): Whether to get the collection if it exists. Default is False.

overwrite (bool): Whether to overwrite the collection if it exists. Default is False. Case 1: if the collection does not exist, it is created. Case 2: if the collection exists and overwrite is True, the collection is overwritten. Case 3: if the collection exists and overwrite is False, the collection is returned when get_or_create is True; otherwise a ValueError is raised.

use_custom_token_count (bool): Whether to use a custom token count function for the retrieve chat. Default is False. If True, custom_token_count_function must be provided.

custom_token_count_function (Optional[str]): A custom function to count the number of tokens in a string. The function should take (text: str, model: str) as input and return the token count (int). retrieve_config['model'] is passed to the function. Default is autogen.token_count_utils.count_token, which uses tiktoken and may not be accurate for non-OpenAI models.

use_custom_text_split (bool): Whether to use a custom text split function for the retrieve chat. Default is False. If True, custom_text_split_function must be provided.

custom_text_split_function (Optional[str]): A custom function to split a string into a list of strings. Default is None, in which case the default function autogen.retrieve_utils.split_text_to_chunks is used.

custom_text_types (Optional[list[str]]): A list of file types to be processed. Default is autogen.retrieve_utils.TEXT_FORMATS. This only applies to files under the directories in docs_path. Explicitly included files and URLs are chunked regardless of their types.

recursive (bool): Whether to search documents recursively in the docs_path. Default is True.

distance_threshold (float): The threshold for the distance score; only results with a distance smaller than it are returned. Ignored if < 0. Default is -1.

embedding_function_string (Optional[str]): The embedding function string (if use_custom_embedding is True).

token_count_function_string (Optional[str]): The token count function string (if use_custom_token_count is True).

text_split_function_string (Optional[str]): The text split function string (if use_custom_text_split is True).

n_results (Optional[int]): The number of results to return. Default is None, which returns all results.
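The three cases described for overwrite and get_or_create can be read as a small decision function. The sketch below is illustrative only: resolve_collection_action is a hypothetical name, not part of Waldiez or AG2, and the returned strings are just labels for the documented outcomes.

```python
def resolve_collection_action(
    exists: bool, overwrite: bool, get_or_create: bool
) -> str:
    """Return a label for the outcome implied by the three documented cases.

    Hypothetical helper for illustration; the real logic lives in the
    vector db layer and is not shown on this page.
    """
    if not exists:
        return "create"  # Case 1: no collection yet
    if overwrite:
        return "overwrite"  # Case 2: collection exists, overwrite=True
    if get_or_create:
        return "get"  # Case 3: collection exists, reuse it
    # Case 3, get_or_create=False: the docs state a ValueError is raised
    raise ValueError("Collection exists; set overwrite or get_or_create.")


print(resolve_collection_action(exists=True, overwrite=False, get_or_create=True))
```

With both flags left at their default of False, an existing collection falls through to the ValueError branch.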

Methods:

validate_custom_embedding_function: Validate the custom embedding function.

validate_custom_token_count_function: Validate the custom token count function.

validate_custom_text_split_function: Validate the custom text split function.

validate_rag_user_data: Validate the RAG user data.

embedding_function_string property

embedding_function_string: Optional[str]

Get the embedding function string.

Returns:

Optional[str]: The embedding function string.

get_custom_embedding_function

get_custom_embedding_function(
    name_prefix: Optional[str] = None,
    name_suffix: Optional[str] = None,
) -> tuple[str, str]

Generate the custom embedding function.

Parameters:

name_prefix (Optional[str], default None): The function name prefix.

name_suffix (Optional[str], default None): The function name suffix.

Returns:

tuple[str, str]: The custom embedding function and the function name.

Source code in waldiez/models/agents/rag_user_proxy/retrieve_config.py
def get_custom_embedding_function(
    self,
    name_prefix: Optional[str] = None,
    name_suffix: Optional[str] = None,
) -> tuple[str, str]:
    """Generate the custom embedding function.

    Parameters
    ----------
    name_prefix : str
        The function name prefix.
    name_suffix : str
        The function name suffix.

    Returns
    -------
    tuple[str, str]
        The custom embedding function and the function name.
    """
    function_name = CUSTOM_EMBEDDING_FUNCTION
    if name_prefix:
        function_name = f"{name_prefix}_{function_name}"
    if name_suffix:
        function_name = f"{function_name}_{name_suffix}"
    return (
        generate_function(
            function_name=function_name,
            function_args=CUSTOM_EMBEDDING_FUNCTION_ARGS,
            function_types=CUSTOM_EMBEDDING_FUNCTION_TYPES,
            function_body=self.embedding_function_string or "",
        ),
        function_name,
    )
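The prefix and suffix handling shared by the three get_custom_* methods can be tried in isolation. This is a minimal sketch: compose_function_name is a hypothetical helper, and the base name used here is an assumed placeholder, since the actual value of CUSTOM_EMBEDDING_FUNCTION is not shown on this page.

```python
from typing import Optional

# Assumed placeholder; the real constant's value is not shown above.
BASE_NAME = "custom_embedding_function"


def compose_function_name(
    base: str,
    name_prefix: Optional[str] = None,
    name_suffix: Optional[str] = None,
) -> str:
    """Mirror the naming steps: prefix first, then suffix, joined by '_'."""
    function_name = base
    if name_prefix:
        function_name = f"{name_prefix}_{function_name}"
    if name_suffix:
        function_name = f"{function_name}_{name_suffix}"
    return function_name


print(compose_function_name(BASE_NAME, "agent1", "v2"))
# agent1_custom_embedding_function_v2
```

Empty or None values leave the base name untouched, which presumably keeps generated names unique per agent only when a prefix or suffix is supplied.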

get_custom_text_split_function

get_custom_text_split_function(
    name_prefix: Optional[str] = None,
    name_suffix: Optional[str] = None,
) -> tuple[str, str]

Generate the custom text split function.

Parameters:

name_prefix (Optional[str], default None): The function name prefix.

name_suffix (Optional[str], default None): The function name suffix.

Returns:

tuple[str, str]: The custom text split function and the function name.

Source code in waldiez/models/agents/rag_user_proxy/retrieve_config.py
def get_custom_text_split_function(
    self,
    name_prefix: Optional[str] = None,
    name_suffix: Optional[str] = None,
) -> tuple[str, str]:
    """Generate the custom text split function.

    Parameters
    ----------
    name_prefix : str
        The function name prefix.
    name_suffix : str
        The function name suffix.

    Returns
    -------
    tuple[str, str]
        The custom text split function and the function name.
    """
    function_name = CUSTOM_TEXT_SPLIT_FUNCTION
    if name_prefix:
        function_name = f"{name_prefix}_{function_name}"
    if name_suffix:
        function_name = f"{function_name}_{name_suffix}"
    return (
        generate_function(
            function_name=function_name,
            function_args=CUSTOM_TEXT_SPLIT_FUNCTION_ARGS,
            function_types=CUSTOM_TEXT_SPLIT_FUNCTION_TYPES,
            function_body=self.text_split_function_string or "",
        ),
        function_name,
    )

get_custom_token_count_function

get_custom_token_count_function(
    name_prefix: Optional[str] = None,
    name_suffix: Optional[str] = None,
) -> tuple[str, str]

Generate the custom token count function.

Parameters:

name_prefix (Optional[str], default None): The function name prefix.

name_suffix (Optional[str], default None): The function name suffix.

Returns:

tuple[str, str]: The custom token count function and the function name.

Source code in waldiez/models/agents/rag_user_proxy/retrieve_config.py
def get_custom_token_count_function(
    self,
    name_prefix: Optional[str] = None,
    name_suffix: Optional[str] = None,
) -> tuple[str, str]:
    """Generate the custom token count function.

    Parameters
    ----------
    name_prefix : str
        The function name prefix.
    name_suffix : str
        The function name suffix.

    Returns
    -------
    tuple[str, str]
        The custom token count function and the function name.
    """
    function_name = CUSTOM_TOKEN_COUNT_FUNCTION
    if name_prefix:
        function_name = f"{name_prefix}_{function_name}"
    if name_suffix:
        function_name = f"{function_name}_{name_suffix}"
    return (
        generate_function(
            function_name=function_name,
            function_args=CUSTOM_TOKEN_COUNT_FUNCTION_ARGS,
            function_types=CUSTOM_TOKEN_COUNT_FUNCTION_TYPES,
            function_body=self.token_count_function_string or "",
        ),
        function_name,
    )

text_split_function_string property

text_split_function_string: Optional[str]

Get the text split function string.

Returns:

Optional[str]: The text split function string.

token_count_function_string property

token_count_function_string: Optional[str]

Get the token count function string.

Returns:

Optional[str]: The token count function string.

validate_custom_embedding_function

validate_custom_embedding_function() -> None

Validate the custom embedding function.

Raises:

ValueError: If the validation fails.

Source code in waldiez/models/agents/rag_user_proxy/retrieve_config.py
def validate_custom_embedding_function(self) -> None:
    """Validate the custom embedding function.

    Raises
    ------
    ValueError
        If the validation fails.
    """
    if self.use_custom_embedding:
        if not self.embedding_function:
            raise ValueError(
                "The embedding_function is required "
                "if use_custom_embedding is True."
            )
        valid, error_or_content = check_function(
            code_string=self.embedding_function,
            function_name=CUSTOM_EMBEDDING_FUNCTION,
            function_args=CUSTOM_EMBEDDING_FUNCTION_ARGS,
        )
        if not valid:
            raise ValueError(error_or_content)
        self._embedding_function_string = error_or_content
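The validators above delegate to check_function, whose implementation is not shown on this page. As a rough illustration of what such a check might involve, the sketch below parses the code string, looks for a function with the expected name and arguments, and returns its body. check_function_sketch, the example function name, and the empty argument list are all assumptions for illustration, not the Waldiez implementation.

```python
import ast


def check_function_sketch(
    code_string: str, function_name: str, function_args: list[str]
) -> tuple[bool, str]:
    """Hypothetical stand-in for check_function.

    Returns (True, function_body) on success or (False, error_message).
    """
    try:
        tree = ast.parse(code_string)
    except SyntaxError as error:
        return False, f"Invalid syntax: {error}"
    for node in tree.body:
        if isinstance(node, ast.FunctionDef) and node.name == function_name:
            args = [arg.arg for arg in node.args.args]
            if args != function_args:
                return False, f"Expected arguments {function_args}, got {args}"
            # Keep only the lines that make up the function body
            lines = code_string.splitlines()
            body = lines[node.body[0].lineno - 1 : node.end_lineno]
            return True, "\n".join(body)
    return False, f"No function named {function_name!r} found"


code = (
    "def custom_embedding_function():\n"
    "    return lambda texts: [[0.0] for _ in texts]\n"
)
valid, content = check_function_sketch(code, "custom_embedding_function", [])
print(valid, repr(content))
```

Returning the body rather than the whole definition matches how the validators store the result and later pass it as function_body when generating the renamed function.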

validate_custom_text_split_function

validate_custom_text_split_function() -> None

Validate the custom text split function.

Raises:

ValueError: If the validation fails.

Source code in waldiez/models/agents/rag_user_proxy/retrieve_config.py
def validate_custom_text_split_function(self) -> None:
    """Validate the custom text split function.

    Raises
    ------
    ValueError
        If the validation fails.
    """
    if self.use_custom_text_split:
        if not self.custom_text_split_function:
            raise ValueError(
                "The custom_text_split_function is required "
                "if use_custom_text_split is True."
            )
        valid, error_or_content = check_function(
            code_string=self.custom_text_split_function,
            function_name=CUSTOM_TEXT_SPLIT_FUNCTION,
            function_args=CUSTOM_TEXT_SPLIT_FUNCTION_ARGS,
        )
        if not valid:
            raise ValueError(error_or_content)
        self._text_split_function_string = error_or_content

validate_custom_token_count_function

validate_custom_token_count_function() -> None

Validate the custom token count function.

Raises:

ValueError: If the validation fails.

Source code in waldiez/models/agents/rag_user_proxy/retrieve_config.py
def validate_custom_token_count_function(self) -> None:
    """Validate the custom token count function.

    Raises
    ------
    ValueError
        If the validation fails.
    """
    if self.use_custom_token_count:
        if not self.custom_token_count_function:
            raise ValueError(
                "The custom_token_count_function is required "
                "if use_custom_token_count is True."
            )
        valid, error_or_content = check_function(
            code_string=self.custom_token_count_function,
            function_name=CUSTOM_TOKEN_COUNT_FUNCTION,
            function_args=CUSTOM_TOKEN_COUNT_FUNCTION_ARGS,
        )
        if not valid:
            raise ValueError(error_or_content)
        self._token_count_function_string = error_or_content

validate_docs_path

validate_docs_path() -> None

Validate the docs path.

Raises:

ValueError: If the validation fails.

Source code in waldiez/models/agents/rag_user_proxy/retrieve_config.py
def validate_docs_path(self) -> None:
    """Validate the docs path.

    Raises
    ------
    ValueError
        If the validation fails.
    """
    if not self.docs_path:
        return

    # Normalize to list
    doc_paths = (
        [self.docs_path]
        if isinstance(self.docs_path, str)
        else self.docs_path
    )

    validated_paths: list[str] = []

    for path in doc_paths:
        # Skip duplicates
        if path in validated_paths:
            continue

        # Check if it's a remote path
        is_remote = is_remote_path(path)
        if is_remote:
            # Remote paths: ensure proper raw string wrapping if needed
            content = extract_raw_string_content(path)
            validated_paths.append(f'r"{content}"')
            continue

        # Handle local paths
        # First remove any file:// scheme
        cleaned_path = remove_file_scheme(path)
        content = extract_raw_string_content(cleaned_path)

        # Determine if it's likely a folder
        is_folder = string_represents_folder(content)

        if is_folder:
            validated_paths.append(f'r"{content}"')
        else:
            # Files: resolve and validate existence
            try:
                resolved_path = resolve_path(cleaned_path, must_exist=True)
                validated_paths.append(resolved_path)
            except ValueError as e:
                raise ValueError(f"Invalid file path '{path}': {e}") from e

    # remove dupes (but keep order)
    validated_paths = list(dict.fromkeys(validated_paths))
    self.docs_path = [path for path in validated_paths if path]
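Two steps of validate_docs_path are easy to try on their own: normalizing a single string to a list, and removing duplicates while keeping the original order. The sketch below covers only those two steps. normalize_docs_path is a hypothetical helper; unlike the method above, it returns an empty list for empty input and does no remote or local path handling.

```python
from typing import Optional, Union


def normalize_docs_path(
    docs_path: Optional[Union[str, list[str]]],
) -> list[str]:
    """Normalize to a list, drop empty entries, dedupe preserving order."""
    if not docs_path:
        return []
    as_list = [docs_path] if isinstance(docs_path, str) else docs_path
    # dict.fromkeys keeps the first occurrence of each key, in order
    return [path for path in dict.fromkeys(as_list) if path]


print(normalize_docs_path(["b.md", "a.md", "b.md", ""]))
# ['b.md', 'a.md']
```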

validate_rag_user_data

validate_rag_user_data() -> Self

Validate the RAG user data.

Raises:

ValueError: If the validation fails.

Returns:

WaldiezRagUserProxyRetrieveConfig: The validated RAG user retrieve config.

Source code in waldiez/models/agents/rag_user_proxy/retrieve_config.py
@model_validator(mode="after")
def validate_rag_user_data(self) -> Self:
    """Validate the RAG user data.

    Raises
    ------
    ValueError
        If the validation fails.

    Returns
    -------
    WaldiezRagUserProxyRetrieveConfig
        The validated RAG user retrieve config.
    """
    self.validate_custom_embedding_function()
    self.validate_custom_token_count_function()
    self.validate_custom_text_split_function()
    self.validate_docs_path()
    if not self.db_config.model:
        self.db_config.model = WaldiezRagUserProxyModels[self.vector_db]
    if isinstance(self.n_results, int) and self.n_results < 1:
        self.n_results = None
    return self

WaldiezRagUserProxyTask module-attribute

WaldiezRagUserProxyTask = Literal['code', 'qa', 'default']

Possible tasks for the retrieve chat.

WaldiezRagUserProxyVectorDb module-attribute

WaldiezRagUserProxyVectorDb = Literal[
    "chroma", "pgvector", "mongodb", "qdrant"
]

Possible vector dbs for the retrieve chat.

extract_raw_string_content

extract_raw_string_content(path: str) -> str

Extract content from potential raw string formats.

Parameters:

path (str, required): The path that might be wrapped in raw string format.

Returns:

str: The actual content of the path, without raw string formatting.

Source code in waldiez/models/agents/rag_user_proxy/retrieve_config.py
def extract_raw_string_content(path: str) -> str:
    """Extract content from potential raw string formats.

    Parameters
    ----------
    path : str
        The path that might be wrapped in raw string format.

    Returns
    -------
    str
        The actual content of the path, without raw string formatting.
    """
    # Handle r"..." and r'...'
    if path.startswith(('r"', "r'")) and len(path) > 3:
        quote = path[1]
        if path.endswith(quote):
            return path[2:-1]
        # Handle malformed raw strings (missing end quote)
        return path[2:]
    return path
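A few sample inputs show how the unwrapping behaves. The function is repeated from the listing above only so the example runs on its own; the sample paths are made up.

```python
def extract_raw_string_content(path: str) -> str:
    """Copied from the listing above so the example is self-contained."""
    if path.startswith(('r"', "r'")) and len(path) > 3:
        quote = path[1]
        if path.endswith(quote):
            return path[2:-1]
        # Malformed raw string (missing end quote)
        return path[2:]
    return path


print(extract_raw_string_content('r"C:\\docs\\guide.md"'))  # C:\docs\guide.md
print(extract_raw_string_content("r'./docs"))  # ./docs (missing end quote)
print(extract_raw_string_content("./docs"))  # ./docs (not wrapped)
```

Note the length check: a wrapper with no content, such as the three-character string r"", is returned unchanged.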

is_remote_path

is_remote_path(path: str) -> bool

Check if a path is a remote path.

Parameters:

path (str, required): The path to check.

Returns:

bool: True if the path is a remote path, False otherwise.

Source code in waldiez/models/agents/rag_user_proxy/retrieve_config.py
def is_remote_path(path: str) -> bool:
    """Check if a path is a remote path.

    Parameters
    ----------
    path : str
        The path to check.

    Returns
    -------
    bool
        True if the path is a remote path, False otherwise.
    """
    content = extract_raw_string_content(path)
    for not_local in NOT_LOCAL:
        if content.startswith((not_local, f'r"{not_local}', f"r'{not_local}")):
            return True
    return False

remove_file_scheme

remove_file_scheme(path: str) -> str

Remove the file:// scheme from a path.

Parameters:

NameTypeDescriptionDefault
pathstr

The path to remove the scheme from.

required

Returns:

TypeDescription
str

The path without the scheme.

Source code in waldiez/models/agents/rag_user_proxy/retrieve_config.py
def remove_file_scheme(path: str) -> str:
    """Remove the file:// scheme from a path.

    Parameters
    ----------
    path : str
        The path to remove the scheme from.

    Returns
    -------
    str
        The path without the scheme.
    """
    content = extract_raw_string_content(path)

    # Remove file:// prefix
    while content.startswith("file://"):
        content = content[len("file://") :]

    return f'r"{content}"'
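The loop in remove_file_scheme strips the prefix repeatedly, so a doubled scheme is also removed. The sketch below isolates that step. strip_file_scheme is a hypothetical helper that works on already unwrapped content and, unlike the function above, does not re-wrap the result in r"...".

```python
def strip_file_scheme(content: str) -> str:
    """Remove every leading 'file://' prefix from an unwrapped path."""
    while content.startswith("file://"):
        content = content[len("file://") :]
    return content


print(strip_file_scheme("file:///home/user/notes.md"))  # /home/user/notes.md
print(strip_file_scheme("file://file://notes.md"))  # notes.md
print(strip_file_scheme("notes.md"))  # notes.md
```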

resolve_path

resolve_path(path: str, must_exist: bool) -> str

Try to resolve a path.

Parameters:

path (str, required): The path to resolve.

must_exist (bool, required): Whether the path must exist.

Returns:

str: The resolved path, wrapped in raw string format.

Raises:

ValueError: If the path is not a valid local path, or if must_exist is True and the path does not exist.

Source code in waldiez/models/agents/rag_user_proxy/retrieve_config.py
def resolve_path(path: str, must_exist: bool) -> str:
    """Try to resolve a path.

    Parameters
    ----------
    path : str
        The path to resolve.
    must_exist : bool
        If the path must exist.

    Returns
    -------
    str
        The resolved path, potentially wrapped in raw string format.

    Raises
    ------
    ValueError
        If the path is not a valid local path.
    """
    # Extract the actual path content
    # if is_raw:
    path_content = extract_raw_string_content(path)
    # else:
    #     path_content = path

    # Handle JSON-escaped backslashes
    if "\\\\" in path_content:  # pragma: no cover
        path_content = path_content.replace("\\\\", "\\")
    # pylint: disable=too-many-try-statements
    try:
        # Try to resolve the path
        resolved = Path(path_content).resolve()

        if must_exist and not resolved.exists():
            raise ValueError(f"Path {path} does not exist.")

        return f'r"{resolved}"'

    except (
        OSError,
        UnicodeDecodeError,
        ValueError,
    ) as error:  # pragma: no cover
        # Fallback: try as raw string for Windows compatibility
        raw_version = f'r"{path_content}"'
        try:
            # Test if the path can be resolved when treated as raw
            resolved = Path(raw_version).resolve()
            if must_exist and not resolved.exists():
                raise ValueError(f"Path {path} does not exist.") from error
            return raw_version
        except Exception:
            raise ValueError(
                f"Path {path} is not a valid local path: {error}"
            ) from error

string_represents_folder

string_represents_folder(path: str) -> bool

Check if a string represents a folder.

Parameters:

path (str, required): The string to check (does not need to exist).

Returns:

bool: True if the path is likely a folder, False if it's likely a file.

Source code in waldiez/models/agents/rag_user_proxy/retrieve_config.py
def string_represents_folder(path: str) -> bool:
    """Check if a string represents a folder.

    Parameters
    ----------
    path : str
        The string to check (does not need to exist).

    Returns
    -------
    bool
        True if the path is likely a folder, False if it's likely a file.
    """
    # Extract actual path content if wrapped
    content = extract_raw_string_content(path)

    # Explicit folder indicators
    if content.endswith(("/", "\\", os.path.sep)):
        return True

    # Check if it actually exists and is a directory
    try:
        if os.path.isdir(content):
            return True
    except (OSError, ValueError):  # pragma: no cover
        pass

    # Heuristic: no file extension likely means folder
    # return not os.path.splitext(content)[1]
    _, ext = os.path.splitext(path.rstrip("/\\"))
    return not ext
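The string-only part of this heuristic can be explored without touching the filesystem. The sketch below is a simplified variant: looks_like_folder is a hypothetical name, and it omits both the os.path.isdir check and the raw string unwrapping done by the function above.

```python
import os


def looks_like_folder(content: str) -> bool:
    """String-only heuristic: trailing separator or no file extension."""
    if content.endswith(("/", "\\", os.path.sep)):
        return True
    _, ext = os.path.splitext(content.rstrip("/\\"))
    return not ext


for sample in ("docs/", "notes.txt", "data", "README"):
    print(sample, looks_like_folder(sample))
```

One consequence worth keeping in mind: an extensionless file name such as README has no extension, so the heuristic classifies it as a folder.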

The vector db config for the RAG user agent.

WaldiezRagUserProxyVectorDbConfig

Bases: WaldiezBase

The config for the vector db.

Attributes:

model (str): The model to use for the vector db embeddings.

use_memory (bool): Whether to use memory for the vector db (if qdrant is used).

use_local_storage (bool): Whether to use local storage for the db (if qdrant or chroma is used).

local_storage_path (Optional[str]): The path to the local storage for the vector db (if qdrant or chroma is used).

connection_url (Optional[str]): The connection url for the vector db.

wait_until_index_ready (Optional[float]): Blocking call to wait until the database indexes are ready (if mongodb is used). None, the default, means no wait.

wait_until_document_ready (Optional[float]): Blocking call to wait until the database documents are ready (if mongodb is used). None, the default, means no wait.

metadata (Optional[dict[str, Any]]): The metadata to use for the vector db. Example: {"hnsw:space": "ip", "hnsw:construction_ef": 30, "hnsw:M": 32}

Methods:

validate_vector_db_config: Validate the vector db config.

validate_vector_db_config

validate_vector_db_config() -> Self

Validate the vector db config.

If local storage is used, make sure the path is provided, and make it absolute if it is not already.

Returns:

WaldiezRagUserProxyVectorDbConfig: The vector db config.

Raises:

ValueError: If the validation fails.

Source code in waldiez/models/agents/rag_user_proxy/vector_db_config.py
@model_validator(mode="after")
def validate_vector_db_config(self) -> Self:
    """Validate the vector db config.

    if local storage is used, make sure the path is provided,
    and make it absolute if not already.

    Returns
    -------
    WaldiezRagUserProxyVectorDbConfig
        The vector db config.

    Raises
    ------
    ValueError
        If the validation fails.
    """
    if self.use_local_storage:
        if self.local_storage_path is None:
            raise ValueError(
                "The local storage path must be provided if local storage is used."
            )
        as_path = Path(self.local_storage_path)
        if not as_path.is_absolute():
            self.local_storage_path = str(as_path.resolve())
    return self
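The same rule can be expressed as a plain function, which makes it easy to try outside a Pydantic model. This is a minimal sketch: normalize_local_storage_path is a hypothetical helper, and the relative path in the usage line is made up.

```python
from pathlib import Path
from typing import Optional


def normalize_local_storage_path(
    use_local_storage: bool, local_storage_path: Optional[str]
) -> Optional[str]:
    """Require a path when local storage is used and make it absolute."""
    if not use_local_storage:
        return local_storage_path
    if local_storage_path is None:
        raise ValueError(
            "The local storage path must be provided if local storage is used."
        )
    as_path = Path(local_storage_path)
    if as_path.is_absolute():
        return local_storage_path
    # Relative paths are resolved against the current working directory
    return str(as_path.resolve())


print(normalize_local_storage_path(True, "storage/chroma"))
```

Because resolution depends on the current working directory, the same relative path can produce different absolute paths depending on where the process is started.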