Categories
Guides

A Tale of Two Repos

Following the release of ChatGPT, Copilot, GPT Engineer, and other new methods of coding, there seems to be a plethora of new applications of all sorts flooding the internet. Although most of them are some form of ChatGPT wrapper, others genuinely offer unique and extremely useful functionality. Given a specific project and its requirements, Software Engineers, Data Scientists, ML Engineers, and Solutions Architects frequently face the critical task of evaluating and selecting appropriate tools for their tech stack. To the untrained eye, the recency and freshness of a given library can be overlooked when a GitHub repo with 47 stars seems to match *exactly* the keywords you put in Google. That’s not to say archived repos don’t have their purpose, but relying on them without refactoring (where the project permits it) is generally not recommended.

I’ll save my spiel on the importance of properly using Google search, and how all of that is changing with LLMs, for another post. If you still suck at googling and don’t want to learn, check out perplexity.ai if you haven’t already.

All that to say, a fundamental part of this evaluation process should include rigorous testing of frameworks against specific use cases before commitment to implementation. Always start with a why and a goal. If you have an example input and desired output, this will help greatly in giving you a framework to, *ahem*, compare your frameworks.
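To make the "example input and desired output" idea concrete, here is a minimal sketch of a scoring harness. It assumes each framework can be wrapped behind the same `extract(page) -> dict` signature; `candidate_a`, `candidate_b`, and the sample records are hypothetical stand-ins, not either library's real API.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]


def field_accuracy(extract: Callable[[str], Record],
                   cases: List[Tuple[str, Record]]) -> float:
    """Fraction of expected fields that `extract` reproduces exactly."""
    matched = total = 0
    for sample_input, expected in cases:
        try:
            actual = extract(sample_input)
        except Exception:
            actual = {}  # a crash counts as every field missed
        for key, want in expected.items():
            total += 1
            matched += int(actual.get(key) == want)
    return matched / total if total else 0.0


# Hypothetical candidates: each would wrap one framework in real use.
def candidate_a(page: str) -> Record:
    name, email = page.split("|")
    return {"name": name.strip(), "email": email.strip()}


def candidate_b(page: str) -> Record:
    return {"name": page.split("|")[0].strip()}  # never finds the email


# Made-up inputs paired with the output we want from them.
CASES = [
    ("Acme Bakery | hello@acme.example",
     {"name": "Acme Bakery", "email": "hello@acme.example"}),
    ("Bolt Cycles | info@bolt.example",
     {"name": "Bolt Cycles", "email": "info@bolt.example"}),
]

scores = {fn.__name__: field_accuracy(fn, CASES)
          for fn in (candidate_a, candidate_b)}
```

Exact-match scoring is deliberately strict; for free-text fields you would probably swap in a fuzzier comparison.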

For this article, I’ll compare two emerging repositories in the LLM-powered web scraping space: LaVague and ScrapeGraph AI. Both repositories approach the traditional challenge of web scraping by leveraging Large Language Models to interpret and extract web content from the soup of HTML. However, their architectures, feature sets, and support differ significantly.

The traditional web scraping approach requires developers to meticulously craft parsing templates, identifying precise XPath or CSS selectors to target specific HTML elements and extract desired data. While this method is reliable and deterministic, it’s also brittle and maintenance-intensive, often breaking when websites undergo even minor structural changes. LLM-powered alternatives promise more resilient solutions by understanding content contextually rather than relying on rigid selectors.
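The brittleness is easy to reproduce. The sketch below uses only Python's standard-library `html.parser` to pull text out of elements with a given class, then runs it against a hypothetical redesign where the class name changed. The HTML snippets and the `ClassTextExtractor` helper are made up for illustration.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collects the text of every element whose class attribute matches exactly.

    Assumes well-nested markup with no unclosed void tags inside the target.
    """

    def __init__(self, target_class: str):
        super().__init__()
        self.target_class = target_class
        self._depth = 0      # > 0 while inside a matching element
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            self._depth += 1
        elif dict(attrs).get("class") == self.target_class:
            self._depth = 1
            self.matches.append("")

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.matches[-1] += data


def extract(html: str, target_class: str) -> list:
    parser = ClassTextExtractor(target_class)
    parser.feed(html)
    return [m.strip() for m in parser.matches]


ORIGINAL = '<div class="contact"><span class="email">hello@acme.example</span></div>'
REDESIGN = '<div class="contact"><span class="contact-email">hello@acme.example</span></div>'

before = extract(ORIGINAL, "email")   # selector still matches
after = extract(REDESIGN, "email")    # same data, renamed class: nothing found
```

The email address is still on the redesigned page; only the selector stopped matching it.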

This comparison will detail my evaluation process and ultimate selection between these two solutions, based on a structured set of criteria that extends beyond surface-level metrics. Through this analysis, I aim to provide a framework for similar technical evaluations while sharing insights from my specific use case in implementing an LLM-powered scraping solution.

Beyond Star Counts: Initial Assessment Criteria

 

When evaluating GitHub repositories, particularly in rapidly evolving spaces like AI/ML, looking beyond star counts is crucial. Here’s how I approached the initial assessment of LaVague and ScrapeGraph AI:

Repository Vitality

 

LaVague presented strong initial metrics with a decent star count (5.5k at time of writing) and forks, indicating community interest. However, a deeper look at the Insights tab showed very little activity over the last month, and their Discord channel looked even worse (more on that later).
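To put a rough number on "activity over the last month", a small sketch like the following counts commits inside a time window. The timestamps are made up for illustration; in practice they would come from `git log` or the hosting platform's API, and the 30-day window is an arbitrary assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable


def commits_in_window(commit_times: Iterable[datetime],
                      now: datetime,
                      days: int = 30) -> int:
    """Count commits whose timestamp falls within the last `days` days."""
    cutoff = now - timedelta(days=days)
    return sum(1 for t in commit_times if cutoff <= t <= now)


# Fixed "now" so the example is reproducible.
now = datetime(2025, 1, 31, tzinfo=timezone.utc)

# Hypothetical commit histories, expressed as "days ago".
quiet_repo = [now - timedelta(days=d) for d in (45, 60, 90)]
busy_repo = [now - timedelta(days=d) for d in (1, 2, 2, 5, 9, 14, 40)]

quiet = commits_in_window(quiet_repo, now)
busy = commits_in_window(busy_repo, now)
```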

ScrapeGraph AI, on the other hand, clocked 15.8k stars, offers a SaaS subscription for lazy people or those who want an API-accessible version, AND its Insights results showed a non-trivial amount of activity.

Strong Engagement for ScrapeGraphAI
👻

Community Health

 

Both repositories maintain Discord channels for community support, but their engagement patterns differ significantly. LaVague’s Discord revealed no support responses in the last month. In contrast, ScrapeGraph AI maintained a more structured support system with *better* response times.

They even bothered to set up a proper community guild 🥲
🩗

Documentation Quality

 

With respect to these two repositories, I found both to have decent documentation and was able to assemble and answer my initial use-case question relatively quickly. In general, what you want to be on the lookout for is:

– Comprehensive getting started guide

– Detailed API reference

– Multiple implementation examples

– Clear troubleshooting section

– HOW TO DISABLE TELEMETRY^^

License Considerations

 

Just look for an MIT or Apache 2.0 license and you’ll be good for most smaller projects. For larger enterprise projects, consult a lawyer.

Learning Curve: Deep Dive Evaluation

 

Armed with my goal, I proceeded to assemble sample code using both of the repos to compare them to each other. I wasn’t looking for numbers at this point but rather raw initial impressions.

Performance and Benchmarks

 

In comparing these solutions to traditional scraping methods, both repositories demonstrated notable advantages in handling dynamic content and structural variations as well as intelligently navigating a set of 3 links with very different page structures. However, their approaches yielded different results:

– LaVague’s agent-based approach showed higher latency due to its recursive decision-making process

– ScrapeGraph AI’s direct parsing approach proved more efficient for single-page extractions

– Neither solution matched the raw speed of traditional selector-based scraping, but offered significantly better adaptability to page changes

Additionally, LaVague powered by an OpenAI model was pretty resistant when I asked it to collect user names and emails, even when I explained these were businesses I wanted to contact, and it refused to comply after a few attempts. I suppose this is due to stricter guardrails in place to protect users.

 

Pretty ironic the data they were trained on was from the exact same places though 🙄

LaVague was also failing at basic tasks, like `switch tab`.

 

 

Setup and Configuration

 

The initial setup experience varied considerably between the two solutions. With LaVague, it was almost a one-liner to get the agent to start collecting information from a list of URLs. ScrapeGraphAI proved to be a bit more challenging, requiring the use of a LangChain agent as the main agent that used ScrapeGraphAI as a tool.
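The division of labor in the second setup (an outer agent that decides where to go next, and a single-page extractor it calls as a tool) can be sketched without either library. In the sketch below a plain breadth-first loop stands in for the agent and `fake_extract` stands in for the single-page tool; neither LangChain nor ScrapeGraphAI is actually called, and the URLs and result shape are assumptions made for illustration.

```python
from typing import Callable, Dict, List

PageResult = Dict[str, list]  # assumed shape: {"contacts": [...], "links": [...]}


def crawl(start_urls: List[str],
          extract_page: Callable[[str], PageResult],
          max_pages: int = 10) -> List[str]:
    """Call the single-page tool on each URL and follow the links it returns."""
    queue, seen, contacts = list(start_urls), set(), []
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        result = extract_page(url)
        contacts.extend(result.get("contacts", []))
        queue.extend(link for link in result.get("links", []) if link not in seen)
    return contacts


# Stand-in for the single-page extractor: a lookup table instead of an LLM call.
FAKE_SITE = {
    "https://acme.example": {
        "contacts": [],
        "links": ["https://acme.example/contact"],
    },
    "https://acme.example/contact": {
        "contacts": ["hello@acme.example"],
        "links": ["https://acme.example"],
    },
}


def fake_extract(url: str) -> PageResult:
    return FAKE_SITE.get(url, {})


found = crawl(["https://acme.example"], fake_extract)
```

Keeping the extractor behind a plain callable is what makes it swappable: the outer loop does not care which library does the per-page work.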

API Integration and Extensibility

 

This particular project was set up to be a locally running script, so no integration or extensibility was necessary. However, if required for the project, keep an eye out for RESTful API endpoints, client libraries, and webhooks.

Platform Compatibility

 

Both were tested on macOS using OpenAI’s API as the backend, so no local GPUs were harmed in the writing of this article 😆

If privacy or offline LLMs are important, you’ll pretty much need a Windows machine with a beefy NVIDIA graphics card to handle the heavy lifting for you.

Many projects will offer Docker support, which makes cross-platform compatibility a breeze.

Pro-tip: If you’re on Windows, get WSL on your system. 

If you don’t have a Windows machine, you can use cloud-based solutions like Paperspace to spin up a virtual desktop and perform testing on an hourly basis for very reasonable prices. 

Project-Specific Considerations

 

After establishing a baseline evaluation of repository health and technical capabilities, the final selection must be driven by your specific project requirements and constraints. Let me share how this played out in my case, then provide a framework for your own assessment.

Use Case Alignment 

 

My project required nested scraping across multiple unique website structures to gather business contact information. Initially, LaVague’s agent-based approach seemed ideal – its ability to autonomously navigate and extract data aligned perfectly with the need to handle varied page structures. However, real-world testing revealed stability issues that made it unsuitable for use.

ScrapeGraphAI, while more limited in scope as a single-page parser, proved more reliable when combined with a LangChain agent. This exemplifies an important principle: sometimes a more focused tool used as part of a larger solution beats an all-in-one approach.

When evaluating repositories for your own use case, consider:

– Does the solution’s core competency align with your primary requirements?

– Are there workarounds available for missing functionality?

– Is the implementation flexible enough to adapt to edge cases?

Scale and System Integration

 

While my project operated at a local machine scale, any production system needs room to grow. The scalability and integration patterns of a repository can make or break its viability in a production environment. Here’s what to consider:

Processing Architecture:

– Real-time vs batch processing requirements

– Rate limits from data sources or API providers

– Parallel processing capabilities and threading models

– Data freshness requirements vs caching strategies

– Error handling and recovery mechanisms

Infrastructure and Costs:

– Hardware and hosting requirements

– Pricing models for paid components (beware: low-code often = )

– Monitoring and maintenance overhead

A repository that works perfectly at development scale may hit unforeseen barriers when deployed to production. In my case, running locally on a single machine was sufficient, but if I needed to scale to thousands of URLs per hour, I would need to carefully consider rate limiting, proxy rotation, and possibly distributed processing – features that neither repository natively supported without significant additional middleware.
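As a rough idea of what the rate-limiting piece of that middleware could look like, here is a minimal sketch of a limiter that enforces a minimum gap between requests. The class name and the injectable `clock` and `sleep` parameters are illustrative choices (they let you test the limiter without real waiting) and are not part of either repository.

```python
import time
from typing import Callable, Optional


class MinIntervalLimiter:
    """Blocks so that successive calls are at least `interval` seconds apart."""

    def __init__(self, interval: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed: Optional[float] = None

    def wait(self) -> float:
        """Sleep if the last call was too recent; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self.interval
        return delay


# Roughly 1,000 requests per hour means one request every 3.6 seconds.
limiter = MinIntervalLimiter(interval=3600 / 1000)
# In a scraping loop you would call limiter.wait() before each request.
```

Proxy rotation and distributed processing are separate problems; this only covers pacing on a single machine.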

Remember: scaling isn’t just about handling more load – it’s about maintaining performance, reliability, and cost-effectiveness as your system grows. Sometimes a solution that seems more complex initially can prove more economical at scale than a “simpler” option that incurs high operational costs.

Team Dynamics and Knowledge Transfer

 

A technically superior solution may still fail if your team can’t effectively work with it. Consider:

Knowledge Distribution: In my case, I was the sole developer, but in team environments, assess how knowledge will be shared and maintained. Does the solution have clear documentation that new team members can easily follow?

Training Requirements: Even with good documentation, estimate the learning curve for your team. A more complex solution might offer better features but could slow down development if extensive training is needed.

Maintenance Outlook

 

Modern development often emphasizes rapid iteration, but don’t discount long-term maintenance costs. A few key considerations:

Version Stability: How often does the repository push breaking changes? Are updates well-documented? You can get a feel for this by looking through any support chat groups, forums, or their major version releases.

Technical Debt: Sometimes a “quick win” solution can lead to significant technical debt. Low-code SaaS tools will bottleneck you to the development speed of the provider’s engineering team. Additionally, these often don’t scale well in terms of operating cost and more often than not will “vendor lock” you into their ecosystem, since there are no unified agreements on how, or whether, to enable exporting from one platform to another. Assess whether the tool’s architecture aligns with your long-term technical strategy.

Security and Data Control

 

In today’s privacy-conscious environment, data handling capabilities are crucial:

Data Flow: Understand where your data goes. Does the repository send data to external services? Is there telemetry anywhere you CANNOT disable?

Model Control: For AI-powered tools, can you use your own models or are you locked into specific providers? Can you self-host if needed?

Compliance: Ensure the solution can meet your regulatory requirements, especially if handling sensitive data for medical or legal applications.

Conclusion

 

In comparing LaVague and ScrapeGraph AI, what began as a simple feature comparison evolved into a comprehensive evaluation framework that reaches far beyond surface-level metrics. While LaVague’s agent-based approach initially seemed more suited to my nested scraping needs, ScrapeGraph AI’s stability and reliability, combined with its active community and robust documentation, ultimately made it the better choice.

This evaluation process highlights a crucial lesson in modern software development: the best tool isn’t always the one that promises to do everything, but rather the one that does its core function exceptionally well and plays nicely with others. In an ecosystem where new AI-powered tools emerge daily, it’s essential to look beyond flashy features and star counts to evaluate fundamental aspects like community health, maintenance patterns, and long-term viability.

Unfortunately there is no shortcut to this. Implement and try it yourself.

The rapid evolution of AI tools also emphasizes the importance of building flexible architectures. Today’s cutting-edge solution might be tomorrow’s technical debt, so choosing tools that offer clean integration patterns and clear upgrade paths becomes crucial. This is particularly relevant in the AI/ML space, where both the underlying models and the tools built upon them are advancing at an unprecedented pace.

For developers and architects evaluating similar choices, remember that the “right” solution often depends more on your specific context than on any absolute measure of technical superiority. Consider your team’s capabilities, your project’s scalability requirements, and your organization’s long-term technical strategy. Sometimes, as in my case, the best approach might be combining simpler, more focused tools rather than adopting an all-encompassing solution that promises to do everything but masters nothing.

In this dynamic landscape of AI-powered development tools, maintaining a structured evaluation process while staying adaptable to change will serve you better than chasing the latest trending repository. After all, the goal isn’t to use the newest or most sophisticated tool, but to build reliable, maintainable solutions that effectively solve real-world problems.

Categories
FAQ Guides

What is Text Embedding?

Textual information surrounds us, from literature and articles to social media posts and customer feedback. However, for artificial intelligence (AI) systems to effectively analyze and comprehend this textual data, it needs to be transformed into a format that these systems can process. This is where the text embedding process, also known as vectorization, comes into play.

Text embedding is a technique for converting textual data into numerical vectors, or arrays of numbers. Each word, phrase, or document is represented as its own vector, and similar texts have similar vector representations. This allows AI systems to work with textual data in a way that they can understand and process.

AI models, particularly deep learning models, operate with numerical data rather than raw text. However, this numerical data is not just a random collection of numbers; it is a carefully crafted numerical representation of the textual data. Text embeddings allow AI models to understand and process textual data by converting it into a meaningful numerical format.

For example, consider the words “dog” and “puppy.” While these words are clearly related in the context of canines, their raw text representations (sequences of letters) don’t convey this similarity. However, through text embedding, these words would be represented as numerical vectors that are close to each other in the vector space, reflecting their semantic similarity.
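"Close to each other in the vector space" is usually measured with cosine similarity. The sketch below computes it by hand on made-up three-dimensional vectors; real embeddings have hundreds or thousands of dimensions, and the numbers here are purely illustrative, not output from any model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, chosen so related words point the same way.
toy = {
    "dog": [0.9, 0.8, 0.1],
    "puppy": [0.85, 0.9, 0.15],
    "car": [0.1, 0.2, 0.95],
}

dog_puppy = cosine_similarity(toy["dog"], toy["puppy"])  # related pair
dog_car = cosine_similarity(toy["dog"], toy["car"])      # unrelated pair
```

With these toy values the related pair scores much higher than the unrelated one, which is the property a trained embedding model is meant to produce on real text.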

How Does Text Embedding Work?

The fundamental concept behind text embeddings is that each word is mapped to a unique set of numbers based on its context and relationships with other words. These word embeddings can then be combined to represent larger pieces of text, such as sentences or documents.
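One simple way to combine word vectors into a vector for a longer text is to average them. The sketch below does that with a tiny hypothetical vocabulary; the two-dimensional values are invented, and contextual models such as BERT build sentence representations differently, so treat this as an illustration of the idea rather than how any particular model works.

```python
def mean_vector(vectors):
    """Element-wise average of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Hypothetical word vectors (made-up values).
word_vectors = {
    "fast": [1.0, 0.0],
    "red": [0.0, 1.0],
    "car": [1.0, 1.0],
}


def embed_sentence(sentence):
    """Average the vectors of the known words in a sentence."""
    tokens = [w for w in sentence.lower().split() if w in word_vectors]
    if not tokens:
        raise ValueError("no known words in sentence")
    return mean_vector([word_vectors[w] for w in tokens])


sentence_vec = embed_sentence("Fast red car")
```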

One way to visualize this is to imagine a list of vehicles (e.g., car, motorcycle, bicycle) and a list of furniture (e.g., table, chair, sofa). In the vector space created by the embedding process, the vehicle vectors would be closer to each other, while the furniture vectors would be further away from the vehicle vectors, reflecting their semantic differences.

Popular text embedding techniques include Word2Vec, GloVe, and BERT. Without going into technical details, these methods use neural networks and machine learning algorithms to learn the vector representations of words and texts from extremely large text datasets, where patterns in how words are used together reveal their meaning and syntax.

Text embeddings enable various natural language processing (NLP) tasks, such as text classification (categorizing texts into different topics or sentiments), machine translation (translating text from one language to another), and language generation (generating human-like text output).

In the real world, text embeddings play a crucial role in applications like chatbots, content recommendation systems, and spam detection. For example, a chatbot powered by a retrieval-augmented generation (RAG) model might use embeddings to efficiently search through a large corpus of documents to find the most relevant information to answer a user’s query, avoiding the “needle in a haystack” problem of poorly embedded data.
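The retrieval step in that chatbot example boils down to "embed the query, then return the documents whose vectors are most similar". Here is a minimal brute-force sketch; the document IDs and three-dimensional vectors are invented, and a production system would typically use a vector index rather than sorting the whole corpus.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))


def top_k(query_vec, corpus, k=2):
    """Return the ids of the k documents most similar to the query vector."""
    ranked = sorted(corpus.items(),
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


# Hypothetical pre-computed document embeddings.
corpus = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "careers": [0.0, 0.1, 0.9],
}

# Hypothetical embedding of a user question about getting money back.
query = [0.8, 0.3, 0.0]

hits = top_k(query, corpus, k=2)
```

The retrieved documents are then passed to the language model as context for its answer.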

Limitations of Text Embedding

While text embeddings have been instrumental in advancing NLP and AI, the embedding process is not without limitations and challenges. One significant challenge is the need for large amounts of high-quality training data to learn accurate vector representations. Additionally, the computational complexity of these methods can be a hurdle, especially for resource-constrained environments.

Another limitation is that text embeddings may not always capture certain nuances or context-specific meanings of language, leading to potential misunderstandings or errors in downstream applications.

Ongoing research aims to address these challenges by developing more efficient and contextually aware embedding techniques, as well as exploring alternative approaches to representing and processing textual data.

Categories
FAQ

What is Vectorization?

Text data is all around us, from books and articles to social media posts and customer reviews. However, for artificial intelligence (AI) systems to effectively process and understand this text data, it needs to be converted into a format that these systems can work with. This is where the process of vectorization, also known as embedding, comes into play.

Embedding, also known as vectorization, is a process of converting text data into numerical vectors or arrays of numbers. Each word, phrase, or document is represented as a unique vector, where similar texts have similar vector representations. This allows AI systems to work with text data in a way that they can understand and process.

AI models, particularly deep learning models, work with numerical data rather than raw text. However, this numerical data is not just a random collection of numbers; it is a carefully crafted numerical representation of the text data. Embeddings/vectorization allow AI models to understand and process text data by converting it into a meaningful numerical format.

For example, consider the words “king” and “queen.” While these words are clearly related in the context of royalty, their raw text representations (sequences of letters) don’t convey this similarity. However, through embeddings/vectorization, these words would be represented as numerical vectors that are close to each other in the vector space, reflecting their semantic similarity.

 

How Does Vectorization Work?

The basic idea behind vectors is that each word is mapped to a unique set of numbers based on its context and relationships with other words. These vectors can then be combined to represent larger pieces of text, such as sentences or documents.

One way to visualize this is to imagine a list of fruits (e.g., apple, banana, orange) and a list of junk food (e.g., candy bar, chips, soda). In the vector space created by the vectorization process, the fruit vectors would be closer to each other, while the junk food vectors would be further away from the fruit vectors, reflecting their semantic differences. However, a “candy apple” may be somewhere in the middle.
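The "somewhere in the middle" intuition can be shown with group centers. The sketch below averages each group into a centroid and measures how far a "candy apple" vector sits from both. All two-dimensional values are made up so that the picture is easy to see; `math.dist` requires Python 3.8 or newer.

```python
import math


def centroid(vectors):
    """Element-wise average of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Hypothetical toy vectors: first axis "fruit-like", second axis "junk-like".
fruit = {"apple": [0.9, 0.1], "banana": [0.8, 0.2], "orange": [0.85, 0.15]}
junk = {"candy bar": [0.1, 0.9], "chips": [0.2, 0.8], "soda": [0.15, 0.85]}
candy_apple = [0.5, 0.5]

fruit_center = centroid(list(fruit.values()))
junk_center = centroid(list(junk.values()))

d_fruit = math.dist(candy_apple, fruit_center)
d_junk = math.dist(candy_apple, junk_center)
apple_to_fruit = math.dist(fruit["apple"], fruit_center)
```

With these values the candy apple is about equally far from both centers, while a plain apple sits right next to the fruit center.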

Popular embedding/vectorization techniques include Word2Vec, GloVe, and BERT. Without going into technical details, these methods use neural networks and machine learning algorithms to learn the vector representations of words and texts from extremely large text datasets, where patterns in how words are used together reveal their meaning and syntax.

Vectorization enables various natural language processing (NLP) tasks, such as text classification (categorizing texts into different topics or sentiments), machine translation (translating text from one language to another), and language generation (generating human-like text output).

In the real world, vectorization plays a crucial role in applications like chatbots, content recommendation systems, and spam detection. For example, a chatbot powered by a retrieval-augmented generation (RAG) model might use vectors to efficiently search through a large corpus of documents to find the most relevant information to answer a user’s query, avoiding the “needle in a haystack” problem of poorly vectorized data.

Limitations

While vectorization has been instrumental in advancing NLP and AI, the process is not without limitations and challenges. One significant challenge is the need for large amounts of high-quality training data to learn accurate vector representations. Additionally, the computational complexity of these methods can be a hurdle, especially for resource-constrained environments.

Another limitation is that vectorization may not always capture certain nuances or context-specific meanings of language, leading to potential misunderstandings or errors in downstream applications.

Ongoing research aims to address these challenges by developing more efficient and contextually aware vectorization techniques, as well as exploring alternative approaches to representing and processing text data.

Categories
FAQ

How do I Save Data for ChatGPT?

To save data for use with ChatGPT or other language models, you typically follow a multi-step process involving raw data collection, storage, and vectorization/embedding.

Raw Data

The first step is to gather the raw data you want to use for training or fine-tuning ChatGPT. This raw data can come from various sources, such as websites, documents, transcripts, or databases. One common technique for collecting raw data is web scraping, which involves programmatically extracting data from websites.

For example, if you want to train ChatGPT on a collection of PDF documents, you can use a web scraper to download those PDFs from various sources on the internet. Alternatively, if you want to use structured data from a database, you can query the database and export the relevant data into a suitable format.

Storage

Once you have collected the raw data, you need to store it in a way that facilitates efficient processing and access for the subsequent steps. The storage approach you choose depends on the format and size of your data, as well as your specific requirements.

  1. File-based Storage: If your raw data consists of individual files (e.g., PDFs, text documents), you can store them in a file-based storage system like cloud object storage (e.g., Amazon S3, Google Cloud Storage) or a local file system. This approach is suitable when you need to process each file individually and can handle the overhead of managing and tracking individual files.

Example: You have a collection of 10,000 PDF documents that you want to use for training ChatGPT. You can upload these PDFs to an Amazon S3 bucket, which will store them as individual objects. This bucket acts as a centralized repository for your raw data files.

  2. Database Storage: If your raw data is structured and can be represented in tabular form, you can store it in a database management system (DBMS). This approach is often preferred when you need to perform complex queries, joins, or transformations on your data.

Example: You have a database containing millions of rows of customer support conversations that you want to use for training ChatGPT. You can export this data from the database into a format like CSV or JSON, and then load it into a new database table specifically designed for storing and processing the raw data for your language model.
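As a small illustration of the database route, the sketch below loads a CSV export into a table using Python's built-in `csv` and `sqlite3` modules. SQLite, the in-memory database, the `raw_conversations` table name, and the sample rows are all assumptions chosen to keep the example self-contained; the same pattern applies to other database systems.

```python
import csv
import io
import sqlite3

# Hypothetical CSV export of support conversations.
CSV_EXPORT = """conversation_id,customer,message
1,alice,"My order arrived damaged."
2,bob,"How do I reset my password?"
"""

# In-memory database for the example; use a file path for real data.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE raw_conversations (
           conversation_id INTEGER PRIMARY KEY,
           customer TEXT NOT NULL,
           message TEXT NOT NULL
       )"""
)

# DictReader yields one dict per row, keyed by the CSV header.
rows = list(csv.DictReader(io.StringIO(CSV_EXPORT)))
conn.executemany(
    "INSERT INTO raw_conversations VALUES (:conversation_id, :customer, :message)",
    rows,
)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM raw_conversations").fetchone()[0]
```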

The choice between file-based storage and database storage depends on factors such as the size and structure of your data, the processing requirements, and the tools and frameworks you plan to use for the subsequent steps.

Vectorization/Embedding

After storing the raw data, the next step is to convert it into a numerical representation that applications built around language models like ChatGPT can search and compare. This process is called vectorization or embedding, and it involves transforming the text data into dense numerical vectors that capture semantic and contextual information.

One popular technique for vectorization/embedding is to use pre-trained embedding models like those offered by OpenAI or Cohere. These models are trained on vast amounts of text data, so the embeddings they generate are high quality without any training on your part.

Example with OpenAI Embeddings: You have a PostgreSQL database containing raw text data for customer support conversations. You can use the OpenAI Python library to compute embeddings for each conversation using the text-embedding-ada-002 model. These embeddings can then be stored in a separate table within the same PostgreSQL database, assuming you have installed the pgvector extension for efficient vector operations.

Example with Cohere Embeddings: Alternatively, you can use Cohere’s embeddings to generate embeddings for your raw data. Cohere provides a simple API for computing embeddings, which you can integrate into your data processing pipeline. Once you have obtained the embeddings, you can store them in a dedicated vector store like Pinecone or Weaviate, which are optimized for storing and querying high-dimensional vectors.

By using pre-trained language models like OpenAI’s text-embedding-ada-002 or Cohere’s embeddings, you can efficiently generate high-quality embeddings for your raw data, without the need to train your own embedding models from scratch.

After obtaining the embeddings, you can store them in a separate database or vector store optimized for efficient retrieval and processing of high-dimensional vectors. This separate storage is often necessary because traditional databases may not be well-suited for storing and querying dense numerical vectors.
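The overall shape of the embed-then-store step can be sketched without any external service. In the sketch below, `fake_embed` is a deliberately fake stand-in for a real embeddings API call (it hashes the text, so the vectors carry no meaning), and SQLite with JSON-encoded vectors stands in for pgvector or a dedicated vector store. Both substitutions are assumptions made so the example runs anywhere.

```python
import hashlib
import json
import sqlite3
from typing import Callable, List


def fake_embed(text: str, dims: int = 8) -> List[float]:
    """Stand-in for a real embeddings call: deterministic but NOT semantic."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:dims]]


def embed_and_store(conn: sqlite3.Connection,
                    texts: List[str],
                    embed: Callable[[str], List[float]]) -> None:
    """Embed each text and save it next to its vector (stored as JSON)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS embeddings ("
        "id INTEGER PRIMARY KEY, text TEXT, vector TEXT)"
    )
    conn.executemany(
        "INSERT INTO embeddings (text, vector) VALUES (?, ?)",
        [(t, json.dumps(embed(t))) for t in texts],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
embed_and_store(
    conn,
    ["My order arrived damaged.", "How do I reset my password?"],
    fake_embed,
)

stored = [
    (text, json.loads(vector))
    for text, vector in conn.execute(
        "SELECT text, vector FROM embeddings ORDER BY id"
    )
]
```

Because the embedder is passed in as a function, swapping the fake for a real provider only changes one argument.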

By following this process of raw data collection, storage, and vectorization/embedding, you can prepare your data for use with ChatGPT or other language models, whether for retrieval or as input to fine-tuning. The specific tools, frameworks, and storage solutions you choose will depend on your data characteristics, computational resources, and project requirements.

 
Categories
Current Trends How It Works Technology

Where do I find my OpenAI API Key?

Many AI applications and tools need their users to obtain their own OpenAI API key. This key enables programmatic access to the OpenAI backend on behalf of the user, essentially “charging” an AI-powered tool.

 

Please note: at the time of writing, new users are given $5 USD in free “tokens” for 3 months. Afterwards, you will need a credit card to continue using any API keys. This is NOT the same as ChatGPT Plus.

Getting Started

To get started, go to the OpenAI website. If you haven’t already created an account to use the ChatGPT UI, you can easily create an OpenAI account by navigating to Developers -> API Reference (don’t worry about the code, we won’t be dealing with that today!)

If you’ve already set up an account and are signed in, you can ignore this part since you should see your profile icon and name in the top-right of the image in place of `Login` and `Sign up`.

Finding Your API Key

To get your API key, click on your name in the top right corner, which will display the drop-down menu. From the menu, select the “View API keys” option.

At this stage, you will see the option to `Create a new secret key` at the centre. If you have any previously created API keys, they will be listed here, but you can only copy a key once, at the moment it is created, so be sure to save it somewhere secure. If you don’t have an API key, click to get one.

Using Your API Key

Now that you have gotten your API key, you can give your applications and tools OpenAI power! Please be aware that some applications will consume more tokens than others. You can read more about how pricing is calculated on the OpenAI pricing page.
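If you ever do end up pasting the key into a script of your own, a common precaution is to read it from an environment variable instead of writing it into the file. The sketch below assumes the conventional `OPENAI_API_KEY` variable name; the `load_api_key` and `mask` helpers are hypothetical names for illustration.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Read the key from the environment so it never lives in source code."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running this tool."
        )
    return key


def mask(key: str) -> str:
    """Shorten a key for logs so the full secret is never printed."""
    return key[:3] + "..." + key[-4:] if len(key) > 8 else "***"
```

This keeps the secret out of anything you might share or commit by accident.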

Categories
Current Trends How It Works Technology

How to Scam People with AI

Just kidding


I thought I would take some time to go over some of the scams (either loosely or tightly) related to AI that have begun to surface. In the dawn of this new and exciting age, scams will likely continue to be on the rise in new forms.

I’ll go over some of the different scams I’ve spotted across the web. If you’ve got any more to share, drop a comment below!

👆Upwork Scamming

đŸ€–AI Proposals

In some cases, the lack of authenticity in the proposal may be especially [random adjective] when they use the AI equivalent of lorem ipsum in the text:

Bonus points if you also used an open-source image generation tool that contains all of the image generator prompts in the file name to make your attachments seem legit.

In another case, I was helping a client hire some UI help on Upwork to assist us with some design work. I usually have a pretty good knack for spotting good proposals, but this week really threw me for a loop.

The first sign to look for is in the responses to the questions that you post with the job. If they don’t answer the questions well or use ChatGPT to generate responses, that’s a red flag:

🔪Account Hijacking?

If the account is having issues, you won’t be able to see the profile of the contractor.

Another sign that the account has been flagged is the contractor withdrawing their proposal.

Lastly, when they come to the meeting, if their appearance doesn’t match the proposal, then you can be sure they are either a fake account, or the account has been compromised and they are fishing to pawn off a cheap project for a large amount:

Our conversation basically entailed an output that was NOT in spec with the proposal at all (I needed a simple automation setup using make.com or another glue tool, and they had obviously not read my proposal and were trying to sell a full-blown application stack 😅).

📺YouTube Scamming

While going over my home feed, a live stream with Elon Musk from OpenAI was airing! Curious to see what it was, I opened the video to find a QR code in the corner that led to a link that would allegedly change my life within minutes, according to a “screenshot” from Elon Musk overlaid on the video stream:

Let’s dig in, shall we?

  • Compromised or fake YouTube account? ✅
  • A single live stream with a relatively large number of viewers? ✅
  • Does QR Code lead to “Tesla bonus dot live” (really)? ✅

I’ve reported 3 of these to YouTube, and they’ve been taken down within roughly 45 minutes. Stay sharp y’all!

🔍tl;dr

  • When hiring help on platforms like Upwork, it’s crucial to keep an eye out for red flags.
  • Examine the responses to your proposal questions; if they seem off or AI-generated, be cautious.
  • Watch for account hijacking signs, such as inaccessible profiles or withdrawn proposals.
  • During meetings, ensure the contractor’s appearance matches their proposal.
  • Watch out for fake live streams on YouTube. Check the authenticity of the channel for signs of account hijacking/bot views.
Courtesy of https://i.redd.it/dcz26dc7jlia1.png

By staying vigilant, you can avoid falling prey to scammers seeking to profit from your hard-earned cash. 😎💼💡

Follow us for more AI shenanigans. If you need help wrangling this new world, call us for help automating the IF and ELSE statements of your workflows.

Categories
Business Current Trends Technology

Demystifying AI for Business Leaders: Current Trends and Challenges Part 1

Everyone and their mother has something to say about AI these days. Whether you’re trying to garner media attention by signing a moratorium or making AI music videos, the entire world is focused on grappling with this horrifically powerful and sometimes downright silly and fun technology. Individuals and organizations alike are looking at how best to integrate AI into an existing project or start a totally new venture in this brave new world.

Here are some insights that can help you navigate the AI landscape with confidence.

Machine Dreams

Without providing an exhaustive list of examples (which can generally be found with a quick Google search), hallucination refers to AI responses that sound plausible but are not supported by the model’s training data or the input it was given. Unfortunately (or fortunately?), at the time of writing, the best practice for detecting hallucinations involves an actual human fact-checking the output of the AI.

Prepare for takeoff

Let me share a personal experience I had with the gpt-4 model while using it for data analysis. I provided it with a CSV file to run calculations, and as I checked the intermediary steps, I discovered that it had generated a completely fake dataset to “answer” my question. At first, it seemed to be on the right track, stating: I need to compare the two datasets to find the rows in the temp_sheet.tsv "Location" column that has corresponding cities in the demo.csv "City" column. However, things quickly went downhill from there.

```python
import pandas as pd

data1 = {'Name': ['John', 'Paul', 'Ringo', 'George'],
         'Age': [20, 21, 22, 23]}

data2 = {'Name': ['John', 'Paul', 'George', 'Ringo'],
         'Height': [180, 170, 175, 165]}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

pd.merge(df1, df2, on='Name').head(3)
```

The Rise of Agent Behavior

With the launch of ChatGPT Plugins (most notably the ability to book flights) came the concept of “agent” behavior. An Intelligent Agent is essentially an AI that can receive environmental data and perform actions based on contextual information. In theory, it’s quite simple to “string” the inputs and outputs of different AI models together in a way that allows a main AI to interact with different assets in an agentic manner. LangChain is a popular open-source library for integrating this kind of behavior into your Python or JS project. There are many pre-built tools, clear examples in the documentation, and a highly active developer community to help you integrate advanced AI usage into your project.
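To make the idea of “stringing” inputs and outputs together a little more concrete, here is a minimal sketch of an agent loop in plain Python. It deliberately does not use LangChain or any real model: `fake_model`, `search_flights`, and `book_flight` are hypothetical stand-ins, and a real agent would replace `fake_model` with an LLM call that chooses the next tool.

```python
# Hypothetical tools; a real agent would call external APIs here.
def search_flights(query):
    return f"3 flights found for '{query}'"

def book_flight(choice):
    return f"booked: {choice}"

TOOLS = {"search": search_flights, "book": book_flight}

def fake_model(observation):
    """Stand-in for an LLM call: maps the latest observation to (tool, input)."""
    if observation.startswith("goal:"):
        return "search", observation[len("goal:"):].strip()
    if "flights found" in observation:
        return "book", "cheapest option"
    return "done", observation

def run_agent(goal, max_steps=5):
    """Feed each tool's output back to the model until it says it is done."""
    observation = f"goal: {goal}"
    history = []
    for _ in range(max_steps):
        tool, tool_input = fake_model(observation)
        if tool == "done":
            break
        observation = TOOLS[tool](tool_input)
        history.append(f"{tool} -> {observation}")
    return history

for step in run_agent("NYC to LAX"):
    print(step)
```

The `max_steps` cap is a design choice worth keeping in any real version, since a model that never returns “done” would otherwise loop forever.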

Bond, James Bond

Interested in implementing this AI solution for your business? Contact us for a consultation!

On a side note, OpenAI, if you happen to come across this article, I’m eagerly awaiting access to the plugin SDK! 😉

Data Leakage

Due to the nature of powering Large Language Models (LLMs) and the advent of Reinforcement Learning from Human Feedback (RLHF), proprietary information has already been “accidentally” leaked into the verification and training data for OpenAI. This trend is likely to get worse before it gets better as researchers and developers seek a balance between powering these feedback- and resource-intensive engines and securing user-submitted data. Unfortunately, the line between AI “power” and privacy is very thin due to the nature of how ML models work.

Data Leakage vs. Data Breaches

One way that organizations (such as Samsung) are combatting this is by developing in-house LLMs, which keep user prompts and interactions on internal servers. The downside is that this approach relies on:

  • A large enough supply of data to fine-tune an open-source model
  • Enough users and verifiers to implement RLHF within organizational processes

Next Steps

Armed with these insights, I hope you can embark on your AI journey with greater confidence, even in this rapidly changing landscape. Remember the timeless machine learning mantra: “Garbage In = Garbage Out.” Always double-check your work and keep humans in the loop to minimize any potential negative side effects of utilizing AI in your project.

📻 Stay tuned for Part 2 of this series, where we’ll dive into privacy and security issues related to AI. In the meantime, follow Automation Architech for more great content!

🧙‍♂️ We are AI application experts! If you want to collaborate on a project, drop an inquiry here, stop by our website, or shoot us a direct email.


Categories
Current Trends How It Works Technology

Bidirectional (2 Way) Sync Using Pipedream

What is Bidirectional Sync?

Bidirectional sync, also known as 2-way sync, is a data synchronization process in which data flows in both directions: information can be transferred from one system to another and vice versa. It ensures that any changes made in either system are reflected in both systems. This type of synchronization is commonly used in cloud-based applications and services, where data needs to be shared across multiple devices and users.

While most automation workflows flow one way (such as in a DAG), there are occasions where you may want to keep data synchronized in a non-hierarchical manner between two different applications. In this article, we will use Trello and Google Sheets as a use case.

In classical software engineering, this is accomplished using a server-client model where the server acts as the “single source of truth” from which the client updates. However, if we are dealing with two applications that each have their own “server”, what is the best way to keep information organized and synchronized between them?

Challenges in Bidirectional Sync

There are a few key challenges that need to be weighed against each other when designing a bidirectional sync system:

Maintaining Data Consistency

Bidirectional sync between two web applications can be challenging when it comes to maintaining data consistency. For example, if one application is updated, the changes must be reflected in the other application to keep the data consistent across both applications. This problem is trickier than most people realize due to issues like how fields are mapped, how individual “rows” or entries are keyed to each other, and how frequently a service can send/receive updates.
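As a rough illustration of field mapping and keying, the sketch below translates a card-like record into a spreadsheet row and indexes rows by a shared key. The attribute names (`name`, `desc`, `list_name`), the column headers, and the `Card ID` key column are assumptions made for this example, not the actual Trello or Google Sheets schema.

```python
# Hypothetical field map: card attribute -> spreadsheet column header.
FIELD_MAP = {"name": "Title", "desc": "Description", "list_name": "Status"}
KEY_COLUMN = "Card ID"  # assumed column that ties a row to its card

def card_to_row(card):
    """Translate a card into a sheet row, carrying the card id as the shared key."""
    row = {column: card.get(field, "") for field, column in FIELD_MAP.items()}
    row[KEY_COLUMN] = card["id"]
    return row

def index_rows(rows):
    """Index rows by the shared key so an update can find its counterpart."""
    return {row[KEY_COLUMN]: row for row in rows}

card = {"id": "c1", "name": "Fix bug", "desc": "Crash on login", "list_name": "Doing"}
rows = index_rows([card_to_row(card)])
print(rows["c1"]["Status"])
```

Storing the other system’s id on every record is what lets an incoming change be matched to exactly one row, even after the title or status has been edited.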

Conflict Resolution

When synchronizing data between two web applications, conflicts can arise in how the data is represented or stored. This can lead to discrepancies between the two applications, which must be resolved in order to keep the data consistent. If the data pipeline is robust and there is a transparent keying system as mentioned above, it is possible to maintain a clean connection.

Depending on the scale and how critical the data is, it may make sense either to have a database external to both services that stores transactions or to have one service act as the “single source of truth” for both. That way, in a conflict, the system falls back to a single service.
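The “single source of truth” fallback can be sketched in a few lines. This is only an illustration under assumed record shapes (plain dictionaries of field values); the function name `resolve` and its parameters are hypothetical.

```python
def resolve(record_a, record_b, source_of_truth="a"):
    """Merge two versions of a record; where they disagree, the source of truth wins."""
    if source_of_truth == "a":
        winner, other = record_a, record_b
    else:
        winner, other = record_b, record_a
    # Fields present in both versions whose values differ.
    conflicts = sorted(k for k in winner.keys() & other.keys() if winner[k] != other[k])
    # Start from the other side, then let the winner overwrite.
    merged = {**other, **winner}
    return merged, conflicts

sheet_row = {"Title": "Fix bug", "Status": "Done"}
trello_card = {"Title": "Fix bug", "Status": "Doing", "Description": "Crash on login"}

merged, conflicts = resolve(sheet_row, trello_card, source_of_truth="a")
print(merged)
print(conflicts)
```

Returning the list of conflicting fields alongside the merged record makes it easy to log disagreements instead of silently overwriting them.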

Security

Bidirectional sync between two web applications can pose a security risk if not implemented properly. It’s important to ensure that both applications are secure and that any data transmitted between them is encrypted. Many low-code/serverless workflow automation pipelines exist that can accomplish this in a secure manner, including Pipedream, Make.com, and Zapier, to name a few. 

However, at scale, these options can become expensive to maintain. In many cases, it’s best to prototype with these serverless workflow automation tools, then invest in engineering a custom cloud solution based on the design of your low-code solution. We can help with the transition from low-code to full-scale cloud builds. Contact us today for details!

Performance

Bidirectional sync can be resource-intensive, as data must be constantly transmitted between the two systems. This can lead to slow performance and a degraded user experience. How severe this is depends on the scale of the data transferred and stored and HOW it’s transmitted. If each operation requires a complete synchronization of both data sources, it will be much more computationally costly than sending single incremental updates as they occur.
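To show the difference between a full resync and incremental updates, here is a minimal sketch that compares two snapshots keyed by record id and returns only what changed. The snapshot shape (a dictionary of id to record) is an assumption for illustration.

```python
def diff_rows(previous, current):
    """Compare two snapshots keyed by record id and return only the changes."""
    shared = current.keys() & previous.keys()
    return {
        "created": {k: current[k] for k in current.keys() - previous.keys()},
        "updated": {k: current[k] for k in shared if current[k] != previous[k]},
        "deleted": sorted(previous.keys() - current.keys()),
    }

previous = {"1": {"Title": "A"}, "2": {"Title": "B"}}
current = {"1": {"Title": "A"}, "2": {"Title": "B2"}, "3": {"Title": "C"}}

changes = diff_rows(previous, current)
print(changes)
```

Only the records in `changes` need to be sent to the other system, rather than every row on every run.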

Strategies for Establishing Bidirectional Sync

The main thing to consider when setting up a bidirectional sync workflow is preventing the “infinite loop” problem in a non-hierarchical system, where each update triggers another update in the opposite direction. The main strategies for preventing this are:

Last Modified

When performing a sync, we would want to include data about when a “row” in our data was last updated/created. If it’s relatively recent (within the past X seconds or so) we would want to ignore an update and stop the workflow from continuing to prevent wasted resources. As an extension of this, we could also check if the values are different from each other and if there is no difference between the existing row and the incoming update, break the workflow cycle. 
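A minimal sketch of this check might look like the following. The ten-second window, the `last_modified` field name, and the list of synced fields are all assumptions chosen for the example.

```python
from datetime import datetime, timedelta, timezone

SYNCED_FIELDS = ("Title", "Description", "Status")
DEBOUNCE = timedelta(seconds=10)  # hypothetical stand-in for "X seconds"

def should_apply(existing, incoming, now):
    """Return True only when an incoming update is worth propagating."""
    # No difference in the synced fields: nothing to do, so break the cycle.
    if all(existing.get(f) == incoming.get(f) for f in SYNCED_FIELDS):
        return False
    # The row was touched moments ago, most likely by the other workflow.
    if now - existing["last_modified"] < DEBOUNCE:
        return False
    return True

row = {"Title": "Write docs", "Status": "Doing",
       "last_modified": datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)}
update = {"Title": "Write docs", "Status": "Done"}

print(should_apply(row, update, datetime(2024, 1, 1, 12, 0, 5, tzinfo=timezone.utc)))  # False
print(should_apply(row, update, datetime(2024, 1, 1, 12, 1, 0, tzinfo=timezone.utc)))  # True
```

The value comparison runs first because it is the cheaper and more reliable signal; the time window is a fallback that can drop a genuine edit made within the window, so it should be kept short.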

Is User

Some servers and automation tools may allow us to check if the incoming change is coming from either a human user or a program/API and react accordingly. If the change is coming from a non-human actor, we could easily close the loop this way and prevent wasted computation resources. 
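Where the event payload does expose who made the change, the check can be a one-line filter. The payload shape below (an `actor` object with `type` and `id`) and the `sync-bot` account are hypothetical; real services name and structure these fields differently, if they provide them at all.

```python
BOT_ACTOR_IDS = {"sync-bot"}  # hypothetical id of the integration's own account

def is_human_change(event):
    """Return True when the change was made by a person rather than our own sync."""
    actor = event.get("actor", {})
    return actor.get("type") == "user" and actor.get("id") not in BOT_ACTOR_IDS

events = [
    {"actor": {"type": "user", "id": "alice"}, "field": "Status"},
    {"actor": {"type": "user", "id": "sync-bot"}, "field": "Status"},
    {"actor": {"type": "api", "id": "integration-42"}, "field": "Title"},
]

to_propagate = [e for e in events if is_human_change(e)]
print(len(to_propagate))
```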

Toggle

In some situations, a toggle could be added as a server-stored value that prevents a workflow or function from starting while something like isUpdating is true. After the update is complete, the value is flipped back to false to allow data to flow between the two resources again.
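The toggle can be sketched with a context manager that guarantees the flag is reset even if the update fails. Here an in-memory dictionary stands in for the server-stored value, and `handle_event` is a hypothetical entry point for a workflow.

```python
from contextlib import contextmanager

state = {"isUpdating": False}  # stand-in for a server-stored value

@contextmanager
def update_lock(store):
    """Set the toggle for the duration of an update, then always clear it."""
    store["isUpdating"] = True
    try:
        yield
    finally:
        store["isUpdating"] = False

def handle_event(store, apply_update):
    """Run the update unless another update is already in flight."""
    if store["isUpdating"]:
        return "skipped"
    with update_lock(store):
        apply_update()
    return "applied"

# Simulate the echo: applying an update triggers a second event mid-update.
echo_results = []
result = handle_event(state, lambda: echo_results.append(handle_event(state, lambda: None)))
print(result, echo_results)
```

With a genuinely shared store, the check and the set would need to be a single atomic operation; otherwise two workflows can both read false and proceed at the same time.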

Bidirectional Sync Use Case: Trello and Google Sheets

Google Sheets is a fantastic spreadsheet application that is popular for its ease of sharing, free access for anyone, plethora of plugins, and the ability to write pseudo-JavaScript to drastically increase its functionality.

Trello, one of the top project management tools, is able to organize tasks and just about anything related to a project’s needs. 

We were approached to test a bidirectional sync system between the two tools since they both have API access, and we decided to give it a go! The architecture we came up with involved using Pipedream as our server to communicate and relay changes between the two applications. For this project, we simply mapped the Title, Description, and Status fields between a Google Sheet and the Trello Cards on a given board, like so:

To do this, we created two workflows in Pipedream: one responsible for processing updates FROM Trello TO Google Sheets, and another FROM Google Sheets TO Trello. Initially, we also wanted to account for the “infinite loop” problem (where one update would create a continuous loop of updates), but it turned out this was not necessary for this task, likely because the onEdit() event on the Google Sheets side only responds to changes initiated by humans (and not the API).

Given that Google Sheets only stores flat data (one value per cell), there were some limitations on how many fields from a Trello Card could be included in Sheets due to parsing and data structure complexity. For full-scale access to all data structures, you could use Airtable or Notion as an alternative database, since they have more complex data structures built in, like lists and the ability to key to other tables.

Challenges

In addition to the data structure issues above, while implementing this workflow in Pipedream, the batch updates from Google Sheets also proved to be problematic as Pipedream does not handle iteration by default in a single “run”. We had to essentially split the workflow into “receiving” and “processing” to be able to handle these edge cases. You can read more about the status of this on their GitHub.
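The general shape of that “receiving” and “processing” split can be sketched in plain Python. This is not Pipedream code: the in-memory `deque` is a stand-in for whatever mechanism hands items from one workflow to the next, and the function names are hypothetical.

```python
from collections import deque

queue = deque()  # stand-in for the hand-off between the two workflows

def receive(batch):
    """Receiving workflow: fan a batch of edited rows out into single items."""
    for row in batch:
        queue.append(row)
    return len(batch)

def process_next(handler):
    """Processing workflow: handle exactly one queued row per run."""
    if not queue:
        return None
    return handler(queue.popleft())

received = receive([{"Title": "A", "Status": "Done"}, {"Title": "B", "Status": "Doing"}])
first = process_next(lambda row: f"synced {row['Title']}")
second = process_next(lambda row: f"synced {row['Title']}")
print(received, first, second)
```

Splitting the work this way means each processing run stays simple and a failure on one row does not block the rest of the batch.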