Semantic Search through the Complete Wikipedia with Weaviate’s GraphQL API

Weaviate uses GraphQL to provide user-friendly data interaction. Weaviate is an open-source vector search engine, and all searches (e.g. semantic, contextual) are done via its GraphQL API. We've put a lot of thought into the design of the GraphQL API, which results in a good user and developer experience. In this talk, I will take you along on the journey of how our GraphQL implementation was shaped according to user needs and software requirements, and show a demo of the current design for Weaviate. The demo will show how Weaviate's GraphQL design enables semantic (vector) search in combination with scalar search through unstructured data. Machine learning models are used in the background, but with the current GraphQL design, users without a technical background can query the vector database easily.

Weaviate has a modular architecture, so users can connect various machine learning models on top of the vector database. Examples are the newly released Question Answering module and the Named Entity Recognition module. Modules can extend the GraphQL schema dynamically, to query the new features intuitively.

This presentation contains a demo where we will query the complete Wikipedia, conduct semantic search queries and more. All through Weaviate’s GraphQL API. No prior knowledge is required.

This talk was presented at GraphQL Galaxy 2021.

FAQ

Weaviate is a vector search engine or database that utilizes a GraphQL API to manage and query data. It's designed to handle unstructured data by converting it into vector representations through machine learning models, allowing for more contextual and nuanced search capabilities beyond simple keyword matching.

Weaviate uses machine learning models to convert data objects into vector representations. Each object is processed through the model to obtain a vector that represents its information in a multi-dimensional space. These vectors allow Weaviate to perform searches based on semantic meanings rather than just matching keywords.

The GraphQL API in Weaviate is designed to facilitate complex queries on the data stored within the database. It allows users to perform operations like 'Get', 'Explore', and 'Aggregate' to retrieve, search through vector space, and summarize data, respectively, using a flexible and powerful query language.

Yes, while the demo focuses on text data, Weaviate is capable of handling various types of data including images and videos. This is facilitated by different machine learning models that can vectorize diverse data types, making it a versatile tool for many use cases.

The core functions of Weaviate's GraphQL API include 'Get', 'Explore', and 'Aggregate'. 'Get' is used to retrieve specific items from the dataset, 'Explore' helps in searching through the complete vector space, and 'Aggregate' is used for summarizing data, such as counting objects.

To start using Weaviate and its GraphQL API, you can visit the semi.technology website, open the Developer section, and follow the installation guide. There are also customization options available to tailor the setup to specific needs.

Weaviate offers benefits such as handling unstructured data, providing semantic search capabilities through vector representations, and supporting complex queries with its GraphQL API. This makes it suitable for searching and managing large datasets with nuanced and contextual search requirements.

Yes, the Wikipedia dataset used in Weaviate's demo is open source and available on GitHub. It contains over 11 million articles, 27 million paragraphs, and 125 million cross-references, and can be used for testing and exploring Weaviate's capabilities.

case study graphql

Bob van Luijt

17 min

10 Dec, 2021

Video Summary and Transcription

Weaviate is a database and search engine that uses a GraphQL API. It supports various machine learning models for data vectorization and search. The core functions of Weaviate are get, explore, and aggregate, which allow users to query and search through the data set. Weaviate provides fast and accurate results, allowing users to find anything in the dataset. The GraphQL API in Weaviate can be used for querying specific data and establishing graph relations.

1. Introduction to Weaviate and Vector Search Engines

Short description:

We will talk about our database and search engine, Weaviate, and its GraphQL API. We will use a demo data set, the complete Wikipedia, to demonstrate how to query it. We will provide context on vector search engines, discuss the design of the GraphQL API, and give a demo of the API on the data set. Lastly, we will show you how to get started with Weaviate and its GraphQL API.

So hello everybody. Thank you for taking the time to listen to this talk. We are going to talk about a few things. So first of all, we're going to talk about our database, our search engine, Weaviate, and we're going to use a demo data set, which is the complete Wikipedia to show how you can query it, and most importantly of course, we're going to talk about the GraphQL API that it has.

So Weaviate is a vector search engine or database, it has a GraphQL API, and we're going to demo it on the data set of the complete Wikipedia. So first I will give a little bit of context about what a vector search engine is, so that you understand what we're talking about if it's new to you. Then we will look at the design of the GraphQL API. Then we'll go into a demo of the API on the data set. And last but not least, I'll show you how you can get started with Weaviate and its GraphQL API yourself.

So again, thanks for listening. So first of all, what is Weaviate and what is a vector search engine? At the core, we're dealing with the problem of unstructured data. If you've ever used a database or a search engine, then you know that the data you're storing, for example text, can only be found if you use the right keywords. So, for example, in a traditional search engine, if you search for this data object with "wine for seafood", you will probably not find it, because except for the key here, the word "wine" appears nowhere in the data. The word "for" is not in there either, and "seafood" is not in there either. But if you use a vector search engine and search for "wine for seafood", it will actually find the data object. The reason it's able to do that is that every data object you add to the search engine is run through a machine learning model. The machine learning model creates vector representations, and that's what you use to search through the database.

Now if this is new to you, then let me give you a little bit of context so that you know what's happening there. Most machine learning models output vectors, and the easiest way to think about vectors is as coordinates. So, for example, our first model had 300 dimensions, and you had all these kinds of words in there. The bulbs here represent words like meat, chicken, fish, etc. When you add a new data object, for example the Chardonnay, all the individual words that you see highlighted in green are looked up in the vector space, and the data object is placed in that same vector space. You can then give a unique centroid position to that data object. So now you can say that in the vector space the data object, in this case the Chardonnay, sits exactly here, in the middle of where all these words sit. If you then search for wine related to seafood or those kinds of things, you will actually be able to find that data object. It is not a 100% match, but it's an approximation of what you're searching for. In a bit, you will see what the value of this actually is. So, as you see here, we have the class Wine with the property Covey Run 2005 Chardonnay. It might be related to a beacon, and it might have certain vector weights.
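To make the centroid idea concrete, here is a minimal sketch in Python. It is not Weaviate's actual implementation: the three-dimensional vectors, the word list, and the helper names are all made up for illustration, and a real model would use hundreds of dimensions.

```python
from math import sqrt

# Toy word vectors (hypothetical values, only 3 dimensions for readability).
WORD_VECTORS = {
    "chardonnay": [0.9, 0.1, 0.3],
    "wine": [0.8, 0.2, 0.2],
    "seafood": [0.7, 0.1, 0.6],
    "fish": [0.6, 0.0, 0.7],
    "car": [0.0, 0.9, 0.1],
}


def centroid(words):
    """Average the vectors of the given words, dimension by dimension."""
    vectors = [WORD_VECTORS[word] for word in words]
    return [sum(dimension) / len(vectors) for dimension in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# The stored object mentions "chardonnay" and "fish", never "wine" or "seafood".
object_vector = centroid(["chardonnay", "fish"])
query_vector = centroid(["wine", "seafood"])
unrelated_vector = centroid(["car"])

# The query lands much closer to the object than an unrelated concept does.
print(cosine_similarity(object_vector, query_vector))
print(cosine_similarity(object_vector, unrelated_vector))
```

The point of the sketch is only that the query and the object share no keywords, yet their centroids end up close together in the vector space.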

2. Data Object Structure and GraphQL API Design

Short description:

We will discuss the data object structure in Weaviate and the database's role in storing objects for vector search and filtering. Weaviate supports various machine learning models for data vectorization and search. The architecture includes modules like text2vec and Q&A, running on your infrastructure. Weaviate's core contains these modules, along with a persistence layer for storing vectors and an API for data search. We will focus on the GraphQL API and its design, which we chose over other options. The design involves classes, properties, and graph-like data models with additional properties for searching.

So this is what the data object looks like when you store it in a Weaviate instance. To help you work with this, we have the database, which you see in the middle, to store your objects, to do vector search, and to do filtering. But of course, there are many, many machine learning models that you can use to either vectorize the data or search through the data.

The demo that I'm going to give today is purely focusing on text. However, you could also do this for images or videos or any other data type. If you go a little bit deeper under the hood, you see how that works from an architectural point of view. So for example, we have text2vec modules or we have Q&A modules. They often run on a GPU. That's all running on your infrastructure.

These modules sit in the Weaviate core. Then there's a persistence layer that takes care of storing the vectors, searching through the vectors, and storing the data objects. But most important, there is an API on top of it. Of course, what we're going to focus on today is the GraphQL API and how you can leverage it to search through your data.

First, before we do that, I want to talk a little bit about the design of the GraphQL API, because you have to know that when we created the database, we didn't have an interface yet. We had to choose which language to use to query data. Will we just have a pure RESTful API? Will we adopt some kind of query language? Will we invent something of our own? Then we decided that the best option for us was actually to use GraphQL.

This is, in a tiny nutshell, our design. At the top, you have a core function within Weaviate. We'll look at that in a bit. You have a class that you can create and add your data to. A class can be anything. Whatever data you have, for example, if you have documents, you can just have a class Document. If you have products, you can have a class Product. Then you have the properties. A property can also be anything. So, for example, if we stay with the class Product, then you might have the property name or the property price. You can, of course, make a cross-reference. Hence, it's a graph-like data model. Then we have these underscore additional (_additional) properties. Those are properties that you get as part of searching for classes, but they are baked into the modules or into Weaviate itself.
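The layering described here (core function, class, properties, underscore additional properties) can be sketched as a small Python helper that assembles a query string. The helper `build_query` is hypothetical and not part of any Weaviate client; the Product class with `name` and `price` follows the example from the talk, and `id` is used as one example of an additional property.

```python
def build_query(function, class_name, properties, additional=None):
    """Assemble a GraphQL query string from the four layers of the design."""
    fields = list(properties)
    if additional:
        # Additional properties are grouped under the _additional field.
        fields.append("_additional { " + " ".join(additional) + " }")
    body = " ".join(fields)
    return "{ " + function + " { " + class_name + " { " + body + " } } }"


# Core function "Get", class "Product", properties "name" and "price".
query = build_query("Get", "Product", ["name", "price"], additional=["id"])
print(query)
```

Reading the output from the outside in gives exactly the nesting from the slide: the function, then the class, then its properties.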

3. Core Functions and Querying Data in Weaviate

Short description:

We have three core functions: get, explore, and aggregate. Get is used to find data in the dataset. Explore is used to search the complete vector space. Aggregate is used to perform operations like counting data objects. We can add search filters to the class name, such as near text filters from the text2vec modules. We will use the Wikipedia dataset in our demo, which contains millions of articles, paragraphs, and cross-references. Let's start with a simple example of querying paragraphs with titles, content, and order.

So, we have three core functions. So, get, explore, and aggregate. Get is the one that we'll be using the most, because that's how we find stuff in our dataset. Explore is to search through the complete vector space. This is often done if you don't know what your classes or properties are called. And aggregate is just, for example, how many data objects do I have, et cetera.
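As a rough illustration, the three core functions could look like the query strings below. This is a sketch assuming the Wikipedia demo schema with a `Paragraph` class; the fields `beacon`, `certainty`, and `className` under Explore and `meta { count }` under Aggregate follow the GraphQL reference as I understand it for the Weaviate version of that time, so check them against the documentation of your version.

```python
# Get: retrieve objects of a known class.
GET_QUERY = """
{
  Get {
    Paragraph(limit: 3) {
      title
    }
  }
}
"""

# Explore: search the whole vector space without naming a class.
EXPLORE_QUERY = """
{
  Explore(nearText: {concepts: ["housing prices"]}) {
    beacon
    certainty
    className
  }
}
"""

# Aggregate: summarize, here by counting the objects of a class.
AGGREGATE_QUERY = """
{
  Aggregate {
    Paragraph {
      meta {
        count
      }
    }
  }
}
"""

for name, query in [("Get", GET_QUERY), ("Explore", EXPLORE_QUERY), ("Aggregate", AGGREGATE_QUERY)]:
    print(name, "query:", " ".join(query.split()))
```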

So, oh, of course, what you can also do is add search filters to the class name. So, here, for example, if you have the class Article with the property title, you can add a near text filter that comes from the text2vec modules, and you say, okay, I want to search for the concepts, in this case, housing prices. So that is at the root what the GraphQL design of Weaviate is like. And I think it's best to actually start to look at the demo, because what's better than actually looking at it in action?
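The filter from this example might be written as follows. This is a sketch: `nearText` with a `concepts` list is the argument name I would expect from the text2vec modules, and the class `Article` with property `title` is taken from the talk.

```python
# Search Article objects whose vectors are close to the concept "housing prices".
NEAR_TEXT_QUERY = """
{
  Get {
    Article(nearText: {concepts: ["housing prices"]}) {
      title
    }
  }
}
"""

# Collapse whitespace to show the query on one line.
print(" ".join(NEAR_TEXT_QUERY.split()))
```

The filter sits in the arguments of the class, so the rest of the query (which properties to return) stays the same as in a plain Get.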

So, for all the demo datasets that we have, you have a console that you can reach at console.semi.technology, and embedded in that console is, among other things, GraphQL. That's something that we'll be using here. Of course, it's a database, so I need to select a dataset. And the dataset that is here is actually Wikipedia. So this is open source. It's on GitHub. And it currently contains a little over 11 million articles, a little over 27 million paragraphs, and a little over 125 million cross-references. And this is the machine that it's running on. In this GitHub repo, you'll find all the information if you want to run the dataset yourself. But we also have a live demo that you can use.

So let's start very simple first. So I can say get. I can say paragraph. And I can say title, content, and order. And the reason that we structured it like this is that a paragraph is part of an article in Wikipedia. And a paragraph can have a title. It doesn't have to. It has content and it has an order.
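Written out, this first query could look like the sketch below. The class and property names are the ones mentioned in the talk; wrapping the query in a JSON object with a `query` key is the usual way to send GraphQL over HTTP and is shown here only as an illustration.

```python
import json

# Get the title, content, and order of Paragraph objects.
SIMPLE_QUERY = """
{
  Get {
    Paragraph {
      title
      content
      order
    }
  }
}
"""

# GraphQL queries are typically sent as a JSON body of the form {"query": "..."}.
payload = json.dumps({"query": SIMPLE_QUERY})
print(payload)
```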

4. Querying Data and Machine Learning Models

Short description:

We can query the data object and retrieve its vector representation. We can also establish graph relations and search through the data set using GraphQL. Let's start by asking a question and limiting the search to the first result. This query aims to find the answer to the question of how many people fit in an Airbus A380.

So it's like, for example, the fourth paragraph. Now, if I run this query, a very boring query, it says as much as: just get me the first 25 paragraphs and show me the title, content, and order. Now, if we limit this, oh, apologies. If we limit this to the first result, I can do something like this. So I can say limit one, first result.

And what's going to be interesting is how the machine learning model represents this data object. So what I'm now doing with this additional property is that I want to see the vector for this specific data object. So if I run it here, it returns the vector, a representation that is coming from the machine learning model. So this data object is run through the machine learning model, and this is the vector representation that it gives us.

Now, if we go one step back, so I just remove the vector representation, we can also make a graph relation. So we can say in article, then on Article, and an article has a title. And now if I run this, you see where it's coming from. So it's coming from a file on Wikipedia. So this is how the data is structured. Or sorry, how the GraphQL API is structured.

But of course, where it becomes very interesting is when you use GraphQL to enable the machine learning models to search through the data set. So let's do something like this. Let's start from the perspective of asking a question. So we can say ask, question. And I could say: how many people fit in an Airbus A380? I'm going to tell it that I want to find the answer in the content. So this is the content. And I'm going to limit that to just the first result, because I'm looking for a specific answer to a question. And then again, I can add the additional property to show me the answer. Yeah, I need result. So this query says: search through the paragraphs, try to answer the question, how many people fit in an Airbus A380? Use the content property to find it, limit it to the first result.
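The two intermediate steps of the demo, requesting the vector and following a graph relation, might look like the sketch below. The reference property name `inArticle` is an assumption based on how it is spoken in the talk; `_additional { vector }` and the `... on Article` inline fragment follow the GraphQL reference as I understand it.

```python
# Step 1: limit to one result and ask for the stored vector of that object.
VECTOR_QUERY = """
{
  Get {
    Paragraph(limit: 1) {
      title
      content
      order
      _additional {
        vector
      }
    }
  }
}
"""

# Step 2: drop the vector and follow the cross-reference to the parent article.
# "inArticle" is an assumed property name for the Paragraph -> Article relation.
CROSS_REFERENCE_QUERY = """
{
  Get {
    Paragraph(limit: 1) {
      title
      content
      order
      inArticle {
        ... on Article {
          title
        }
      }
    }
  }
}
"""

print(" ".join(VECTOR_QUERY.split()))
print(" ".join(CROSS_REFERENCE_QUERY.split()))
```

The inline fragment is needed because a cross-reference can in principle point to more than one class, so the query has to say which class the nested properties belong to.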

5. Weaviate Query Results and Generic Questions

Short description:

Here we demonstrate the results of a query in Weaviate, showing the title, content, order, and source of a paragraph. The query is fast and accurate, allowing users to find anything in Wikipedia. We can also use Weaviate for generic questions and search for concepts like Italian food. The results provide a high level of certainty, but may vary as we move further from the main topic.

Show me the result of the answer. I also want to see the title of the paragraph, the content of the paragraph itself, and the order in which this paragraph is showing on the page. And I want to see the actual Wikipedia page where this is coming from. So now I'm running the query, and you see how fast that was. So it says like 656. And then you see here at launch in December 2006, 156 seat A380, 200, etc, etc. So that's how you see how that works and how it gets these results. You can find anything.
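Put together, the question-answering query described here could be sketched as follows, along with a small helper that reads the answer out of a response. The `ask` argument and the `_additional { answer { result } }` field follow the Q&A module as described in the talk; `inArticle` is an assumed property name, `extract_answer` is a hypothetical helper, and the sample response contains a placeholder rather than real output.

```python
ASK_QUERY = """
{
  Get {
    Paragraph(
      ask: {
        question: "How many people fit in an Airbus A380?"
        properties: ["content"]
      }
      limit: 1
    ) {
      title
      content
      order
      inArticle {
        ... on Article {
          title
        }
      }
      _additional {
        answer {
          result
        }
      }
    }
  }
}
"""


def extract_answer(response):
    """Return the answer text of the first Paragraph hit, or None if absent."""
    hits = response.get("data", {}).get("Get", {}).get("Paragraph") or []
    if not hits:
        return None
    additional = hits[0].get("_additional") or {}
    answer = additional.get("answer") or {}
    return answer.get("result")


# Hypothetical response shape with placeholder values, not real demo output.
sample_response = {
    "data": {
        "Get": {
            "Paragraph": [
                {
                    "title": "<paragraph title>",
                    "_additional": {"answer": {"result": "<answer text>"}},
                }
            ]
        }
    }
}

print(extract_answer(sample_response))
```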

So anything that's in Wikipedia, you can search for. I'm a big music fan, so I could say, for example, what was the name of Frank Zappa's first band? So same type of query. Let's keep everything the same here. Let's run this query. And here you see it's The Mothers of Invention, which happens to be the correct answer. I know. We can not only do that for Q&A, but we can also do that for more generic questions.

So now let's remove the ask and the limit and say near text. I want to search for concepts, and let's go with, for example, Italian food. Now, I do not have an answer anymore, but I can ask for the certainty. OK, let's see what happens when we run this query. So what you see here is ... well, of course it comes from the article List of Italian dishes, that kind of makes sense. It is the first paragraph, so there's no text. And here you see all kinds of things about Italian dishes. And it's almost 90% certain that this is the right result. But if we scroll down, you see that the number goes slightly down. It's still about Italian cuisine. But the further we go down, the more we get removed from the actual topic. So here the culture of Italy, the cuisine. Pasta, of course, pasta.
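The generic concept search from this step might be written as below. As before, this is a sketch: `nearText` with `concepts` and `_additional { certainty }` are the names I would expect from the GraphQL reference, and `inArticle` is an assumed property name.

```python
# Search paragraphs near the concept "Italian food" and show how certain each hit is.
CONCEPT_QUERY = """
{
  Get {
    Paragraph(nearText: {concepts: ["Italian food"]}) {
      title
      content
      order
      inArticle {
        ... on Article {
          title
        }
      }
      _additional {
        certainty
      }
    }
  }
}
"""

print(" ".join(CONCEPT_QUERY.split()))
```

Compared with the question-answering query, only the search argument and the requested additional property change; the returned properties and the graph relation stay the same.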

6. Querying Data and Certainty Levels

Short description:

We can specify a minimum certainty level for the query results. The further we scroll down, the less relevant the results become. We can use any natural language query and any data set for the search.

So one of the things that we could do, for example, is say, well, we want to have at least 85% certainty for the result. So if I run this query again, and let me scroll all the way down, then let's see what the last result is. So here we see Lombard cuisine from Milan, which is like 85% certain. So what you see is that it's still about Italian food. But the further we scroll down, the further the result is removed from the actual query that we have. And you can use any natural language query that you want to use. Important to mention: this demo data set is Wikipedia, but you can use any data that you have.
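A minimum certainty can be added to the same search argument. The helper below is a hypothetical sketch, not part of a Weaviate client; it assumes that `certainty` is accepted inside `nearText` as a value between 0 and 1, so 85% becomes 0.85.

```python
import json


def near_text_paragraphs(concepts, certainty=None, limit=None):
    """Build a Get query on Paragraph with an optional minimum certainty."""
    # A JSON list of plain strings is also a valid GraphQL list literal.
    near_text = "concepts: " + json.dumps(concepts)
    if certainty is not None:
        if not 0.0 <= certainty <= 1.0:
            raise ValueError("certainty must be between 0 and 1")
        near_text += ", certainty: " + str(certainty)
    arguments = "nearText: {" + near_text + "}"
    if limit is not None:
        arguments += ", limit: " + str(limit)
    return (
        "{ Get { Paragraph(" + arguments + ") "
        "{ title content _additional { certainty } } } }"
    )


# Only return paragraphs that are at least 85% certain to match.
query = near_text_paragraphs(["Italian food"], certainty=0.85)
print(query)
```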

7. Querying Italian Food and Documentation

Short description:

I will now demonstrate how to query Italian food from the list of Italian dishes using graph relations. By running the query, you can see the links to related articles. This is how the GraphQL API of Weaviate works. To learn more, visit our website or search for Weaviate. Thank you for listening and consider giving Weaviate a try.

The last one I want to show is something related to the graph relations. So let's just limit this to the first result. So I'm now going to get Italian food from the list of Italian dishes. Now I can even say in article, on Article, title, and I can say has... Oh, no. Sorry. Links to article, on Article. And I can say title again. So what this query does is it finds the data object for the paragraph. In this case, that's the first paragraph of the list of Italian dishes, which is the first graph relation. And then we're going to make another graph relation where we're going to say, OK, show me what data objects this data object is in turn linking to. So let's run that query and see what happens. So if you now go down, you see that the links to the article are pizza margherita, DOC, Italian cuisine, et cetera, et cetera.
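The two chained graph relations could be sketched as the query below. Both reference property names, `inArticle` and `linksToArticles`, are assumptions reconstructed from how they are spoken in the talk; check the schema of the demo dataset for the exact names.

```python
# Find the best-matching paragraph, follow it to its article,
# then follow that article's outgoing links to other articles.
LINKED_ARTICLES_QUERY = """
{
  Get {
    Paragraph(nearText: {concepts: ["Italian food"]}, limit: 1) {
      title
      content
      inArticle {
        ... on Article {
          title
          linksToArticles {
            ... on Article {
              title
            }
          }
        }
      }
    }
  }
}
"""

print(" ".join(LINKED_ARTICLES_QUERY.split()))
```

Each hop is just another nested selection with an inline fragment, which is what makes the cross-references feel like walking a graph.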

So that's how, at the root, the GraphQL API of the Weaviate vector search engine works. What I would like to do now is introduce you to the documentation. The easiest thing you can do is just google for Weaviate or go to our website, semi.technology. You click the Developer section, and you can't miss it. In the installation guide, you can click, for example, Customize Your Weaviate Setup. You can go to the customizer and install it yourself. Of course, last but not least, we have the GraphQL references. So if you scroll there, you see example queries. You can try them out in real time, and you also see the equivalent of these GraphQL queries in different programming languages.
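As one example of what such an equivalent might look like in Python, the sketch below prepares an HTTP request with only the standard library. It assumes a self-hosted instance at the hypothetical address `http://localhost:8080` that exposes the GraphQL endpoint at `/v1/graphql`; the helper `build_graphql_request` is made up for illustration, and the official client libraries are likely the more convenient route.

```python
import json
from urllib.request import Request


def build_graphql_request(base_url, query):
    """Prepare a POST request that carries a GraphQL query as a JSON body."""
    body = json.dumps({"query": query}).encode("utf-8")
    return Request(
        url=base_url.rstrip("/") + "/v1/graphql",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical local instance; replace with the address of your own setup.
request = build_graphql_request(
    "http://localhost:8080",
    "{ Aggregate { Paragraph { meta { count } } } }",
)
print(request.get_method(), request.full_url)

# To actually send it against a running instance:
# from urllib.request import urlopen
# with urlopen(request) as response:
#     print(json.load(response))
```

The request is only built, not sent, so the snippet runs without a server; the commented lines show where the network call would go.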

So, thank you so much for listening to my talk. I hope you liked it. I hope you'll give Weaviate a try. If you come to our website, you will find our Slack. You will find the software documentation. Weaviate itself is on GitHub. If you like what you see, then, of course, a GitHub star is always appreciated. And you can also, of course, simply google for Weaviate, and you will find other videos, blog posts, software documentation, demo datasets, whatever you can think of. Thank you so much for listening, and I hope you'll enjoy the other talks as well.
