Supabase Changelog
Aug 16, 2024
vec2pg

16 Aug 2024 · 4 minute read

vec2pg is a CLI utility for migrating data from vector databases to Supabase, or any Postgres instance with pgvector.

Objective

Our goal with vec2pg is to provide an easy on-ramp for efficiently copying your data from various vector databases into Postgres, along with associated IDs and metadata. The data loads into a new vec2pg schema, into a table named after the source collection, e.g. vec2pg.<collection_name>. The output table uses pgvector's vector type for the embedding and the built-in json type for additional metadata.

Once loaded, the data can be manipulated using SQL to transform it into your preferred schema.
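For instance, you might reshape a loaded collection into a table of your own. A hypothetical sketch, assuming a source collection named docs and that the output table carries id, embedding, and metadata columns (the exact column names are an assumption, not documented here):

```sql
-- Hypothetical: assumes vec2pg loaded a collection named "docs" into
-- vec2pg.docs with id, embedding (vector), and metadata (json) columns.
create table documents (
  id text primary key,
  embedding vector(1536),
  title text
);

insert into documents (id, embedding, title)
select
  id,
  embedding,
  metadata ->> 'title'  -- pull a field out of the json metadata
from vec2pg.docs;
```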

When migrating, be sure to increase your Supabase project's disk size so there is enough space for the vectors.
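As a rough rule of thumb, each float32 vector takes about 4 bytes per dimension, ignoring row overhead, metadata, and indexes, so you can estimate the space needed before migrating. A sketch, assuming one million 1536-dimensional vectors:

```sql
-- rough estimate: rows * dimensions * 4 bytes for float32 vectors,
-- ignoring per-row overhead, metadata, and index size
select pg_size_pretty(1000000::bigint * 1536 * 4);  -- roughly 6 GB
```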

Vendors

At launch we support migrating to Postgres from Pinecone and Qdrant. You can vote for additional providers in the issue tracker and we'll reference that when deciding which vendor to support next.

Throughput when migrating workloads is measured in records-per-second and is dependent on a few factors:

  • the resources of the source database
  • the size of your Postgres instance
  • network speed
  • vector dimensionality
  • metadata size

When throughput is mentioned, we assume a Small Supabase Instance, a 300 Mbps network, 1024 dimensional vectors, and reasonable geographic colocation of the developer machine, the cloud hosted source DB, and the Postgres instance.

Pinecone

vec2pg copies entire Pinecone indexes without the need to manage namespaces. It will iterate through all namespaces in the specified index and has a column for the namespace in its Postgres output table.

Given the conditions noted above, expect 700-1100 records per second.

Qdrant

The qdrant subcommand supports migrating from cloud and locally hosted Qdrant instances.

Again, with the conditions mentioned above, Qdrant collections migrate at between 900 and 2500 records per second.

Why Use Postgres/pgvector?

The main reasons to use Postgres for your vector workloads are the same reasons you use Postgres for all of your other data: Postgres is performant, scalable, and secure. It's a well-understood technology with a wide ecosystem of tools that supports needs from early-stage startups through to large-scale enterprises.

A few game-changing capabilities that are old hat for Postgres but haven't yet made their way to upstart vector DBs include:

Backups

Postgres has extensive support for backups and point-in-time recovery (PITR). If your vectors live in your Postgres instance, you get backup and restore functionality for free. Consolidating the data also means one fewer system to maintain. Moreover, your relational workload and your vector workload are transactionally consistent, with full referential integrity, so you never get dangling records.

Row Security

Row Level Security (RLS) allows you to write a SQL expression to determine which users are allowed to insert/update/select individual rows. Since vector is just another column type in Postgres, you can write policies to ensure e.g. each tenant in your application can only access their own records. That security is enforced at the database level so you can be confident each tenant only sees their own data without repeating that logic all over API endpoint code or in your client application.
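A minimal sketch of such a policy, assuming a hypothetical documents table with a uuid tenant_id column and a custom session setting supplied by the application (both are illustrative assumptions):

```sql
-- Hypothetical schema: documents(tenant_id uuid, embedding vector, ...)
alter table documents enable row level security;

-- each tenant may only see rows matching their own tenant id,
-- read here from a custom session setting set by the application
create policy tenant_isolation on documents
  using (tenant_id = current_setting('app.tenant_id')::uuid);
```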

Performance

pgvector has world-class performance in terms of raw throughput and dominates in performance per dollar. Check out some of our prior blog posts for more information on functionality and performance.

Next Steps

To get started, head over to the vec2pg GitHub page, or, if you're comfortable with CLI help guides, install it with pip:

pip install vec2pg

If your current vector database vendor isn't supported, be sure to weigh in on the vendor support issue.

Aug 15, 2024
pg_graphql 1.5.7: pagination and multi-tenancy support

15 Aug 2024 · 4 minute read

What's new in pg_graphql 1.5.7

Since the last check-in on pg_graphql there have been a few quality of life improvements worth calling out. A quick roundup of the key differences includes:

  • Pagination via First/Offset
  • Schema based multi-tenancy
  • Filtering on array typed columns with contains, containedBy and overlaps

First/Offset pagination

Since the earliest days of pg_graphql, keyset pagination has been supported. Keyset pagination allows for paging forwards and backwards through a collection by specifying a number of records and the unique id of a record within the collection. For example:

{
  blogCollection(
    first: 2,
    after: "Y3Vyc29yMQ=="
  ) {
    ...
  }
}

to retrieve the first 2 records after the record with unique id Y3Vyc29yMQ==.

Starting in version 1.5.0, there is support for offset-based pagination, which skips over the first offset records before returning results.

{
  blogCollection(
    first: 2,
    offset: 5
  ) {
    ...
  }
}

That is roughly equivalent to the SQL

select
  *
from
  blog
limit
  2
offset
  5

In general, as offset values increase, query performance decreases. For that reason it's important to use keyset pagination where possible.
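By contrast, the keyset form shown earlier translates to a query that can use an index on the unique id, so the cost does not grow with page depth. Roughly:

```sql
-- keyset equivalent: fetch the first 2 records after a known id;
-- an index on id keeps this fast no matter how deep you page
select *
from blog
where id > 5
order by id
limit 2;
```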

Schema based multi-tenancy

pg_graphql caches the database schema on first query and rebuilds that cache any time the schema changes. The cache key is a combination of the postgres role and the database schema's version number. Initially, the structure of all schemas was loaded for all roles, and table/column visibility was filtered down within pg_graphql.

In multi-tenant environments with one schema per tenant, that meant every time a tenant updated their schema, every tenant's cache had to be rebuilt. When the number of tenants gets large, that burdens the database when it's already under heavy load.

Since version 1.5.2, each tenant's cache only loads the schemas they have usage permission on, which greatly reduces both the query time in multi-tenant environments and the size of the schema cache. At the time of writing, this solution powers a project with >2,200 tenants.

Filtering array column types

Since 1.5.6, pg_graphql supports the contains, containedBy, and overlaps filter operators for scalar array fields like text[] or int[].

For example, given a table

create table blog (
  id int primary key,
  name text not null,
  tags text[] not null,
  created_at timestamp not null
);

the tags column of type text[] can be filtered on:

{
  blogCollection(filter: { tags: { contains: ["tech", "innovation"] } }) {
    edges {
      cursor
      node {
        name
        tags
        createdAt
      }
    }
  }
}

In this case, the result set is filtered to records where the tags column contains both tech and innovation.
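Under the hood, this maps onto Postgres array operators; the contains filter above corresponds roughly to:

```sql
-- @> is the array "contains" operator; <@ is containedBy, && is overlaps
select name, tags, created_at
from blog
where tags @> array['tech', 'innovation'];
```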

Roadmap

The headline features we aim to launch in coming releases of pg_graphql include support for:

  • Insert on conflict / Upsert
  • Nested inserts

If you want to get started with GraphQL today, check out the Docs or the source code.

Log Drains for exporting product logs is now available under Public Alpha

Aug 14, 2024

Snaplet is closing their business and opening their source code

Use Firebase Auth, Auth0 or AWS Cognito (Amplify) with your Supabase project, secure your users with SMS based MFA, and use send hooks.

Aug 13, 2024

Secure Realtime Broadcast and Presence with Authorization

Aug 12, 2024

Today we're launching a new GitHub Copilot extension for VS Code to make your development with Supabase even more delightful.

Introducing postgres.new, the in-browser Postgres sandbox with AI assistance.

Jul 16, 2024

Supabase is now available on the open source JavaScript Registry (JSR).

Jul 11, 2024

Learn how to use range columns in Postgres to simplify time-based queries and add constraints to prevent overlaps.

Learn how to use range columns in Postgres to simplify time-based queries and add constraints to prevent overlaps.

Jul 4, 2024

Use Supabase Realtime to draw live location data onto the map with MapLibre GL JS.

Jun 26, 2024

Use PostGIS to programmatically generate Mapbox Vector Tiles and render them with MapLibre GL.

Jun 19, 2024

Host Protomaps PMTiles on Supabase Storage and render them with MapLibre GL.

Jun 18, 2024

Cal.com and Supabase team up to build an open-source platform starter kit for developers.

May 9, 2024

How we built a Kahoot alternative for the Supabase community meetups.

May 2, 2024
What's new in pgvector v0.7.0

02 May 2024 · 8 minute read

Real-world embedding datasets often contain redundancy buried within the vector space. By reducing this redundancy, we can achieve memory and performance savings with a minimal impact on precision. pgvector v0.7.0 introduces several approaches to leverage this:

  • float16 vector representation
  • sparse vectors
  • bit vectors

Float16 vectors

An HNSW index is most efficient when it fits into shared memory. pgvector v0.7.0 introduces 16-bit float HNSW indexes which consume exactly half the memory of 32-bit vectors.

Two options when using float16 vectors:

  • Index using float16, but the underlying table continues to use float32
  • Both the index and underlying table use float16, using 50% as much disk space and 50% less shared memory
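For the second option, the table itself stores float16 data via the halfvec type. A sketch, assuming a hypothetical embedding_half table of 1536-dimensional vectors:

```sql
-- store the vectors as float16 (halfvec) to halve on-disk size
create table embedding_half (
  id serial primary key,
  vector halfvec(1536)
);
```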

Example: With 900K OpenAI 1536-dimensional vectors, the table size is 3.5GB (compared to 7GB for float32).

Create a float16 HNSW index:

create index on embedding_half using hnsw (vector halfvec_l2_ops);

HNSW build times improved a further 30% in 0.7.0 with the halfvec feature. Performance measurements on r7gd.16xlarge show float16 HNSW build times are up to 3x faster than float32, while maintaining similar precision and queries per second.

Metric                    Vector    HalfVec
Index size (MB)           7734      3867
Index build time (s)      264       90
Recall @ ef_search=10     0.819     0.809
QPS @ ef_search=10        1231      1219

Sparse vectors

Vectors with many zero components can use sparse vector representation to save significant storage space:

create table embedding_sparse (
  id serial,
  vector sparsevec(1536),
  primary key (id)
);

insert into embedding_sparse (vector) values ('{1:0.1,3:0.2,5:0.3}/1536'), ('{1:0.4,3:0.5,5:0.6}/1536');

select * from embedding_sparse order by vector <-> '{1:3,3:1,5:2}/1536' limit 5;

Bit vectors

Using binary quantization, float vectors can be represented in binary space, dramatically reducing storage. This enables a quick pre-selection pass before performing a more precise search within the subset:

create index on embedding using hnsw ((binary_quantize(vector)::bit(1000)) bit_hamming_ops);

select * from embedding order by binary_quantize(vector)::bit(3) <~> binary_quantize('[1,-2,3]') limit 5;

Two-stage search pattern:

select * from (
  select * from embedding
  order by binary_quantize(vector)::bit(3) <~> binary_quantize('[1,-2,3]')
  limit 20
)
order by vector <=> '[1,-2,3]'
limit 5;

New distance functions

pgvector 0.7.0 added:

  • L1 distance operator <+> with vector_l1_ops index
  • Hamming distance with bit_hamming_ops index
  • Jaccard distance with bit_jaccard_ops index
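For example, the new L1 (taxicab) distance can back both an index and an ordering clause. A sketch reusing the embedding table from the earlier examples:

```sql
-- build an HNSW index using L1 distance, then query with the <+> operator
create index on embedding using hnsw (vector vector_l1_ops);

select * from embedding order by vector <+> '[1,-2,3]' limit 5;
```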

Conclusion

pgvector has achieved over a 100x speedup compared to one year ago, thanks to HNSW indexes, parallel builds, and the new quantization options.

Using v0.7.0 in Supabase

All new projects ship with pgvector v0.7.0 or later. Enable the extension:

create extension if not exists vector with schema extensions;

If using a previous version, upgrade by navigating to the Service Versions section on the Infrastructure page and upgrading your Postgres version to 15.1.1.47 or later.

Apr 30, 2024

Announcing the winners of the Open Source Hackathon 2024!

Apr 26, 2024

26 Apr 2024 · 7 minute read

By Pavel Borisov, Postgres Engineer

How Postgres table data is stored

By default, all table data in Postgres is physically stored using the "heap" method. Every database is a set of 1 GB files ("segments"), and each file is logically split into 8 KB pages. Actual table rows are put into any page with enough free space.

When a row is updated, a new version of the whole row is constructed and written (to any free space). The old version remains because, at the time of the update, the transaction is not yet complete and could still be rolled back. Once the transaction completes, we have two or more versions of the same row in the table. Cleaning up the old versions is done by an asynchronous process called vacuum (and its automatic counterpart, autovacuum).

How does the vacuum work?

Vacuum goes through all table files looking for row versions that were updated or deleted by already completed transactions and frees the space on the pages.

Then it updates the table's free-space-map to reflect that some page has a certain amount of free space for row inserts.

It also updates the visibility map for a page, marking that all remaining rows on it are visible. Index scans can then skip visibility checks, which they cannot do for a modified page before vacuuming. This significantly increases the speed of queries that use indexes.

In many cases vacuum runs automatically, cleans everything, and requires little care. But in some scenarios, we need to go deeper and tune the autovacuum parameters or run the vacuum manually.
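A manual run on a single table looks like this; VERBOSE prints per-step details, and ANALYZE additionally refreshes planner statistics:

```sql
-- vacuum one table, log details, and update planner statistics
VACUUM (VERBOSE, ANALYZE) mytable;
```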

Cleaning relation files

Vacuum marks space in a relation file as free for future row inserts or updates. That is not a problem unless we insert many rows, delete many rows at once, and then make no further inserts or updates: the space stays reserved in the file but goes unused.

In this case, we could free actual filesystem space by running:

VACUUM FULL mytable;

It rebuilds the table from the live rows, placing them compactly, so filesystem space is freed. The downside is that it takes an exclusive lock, and you won't be able to modify the table while VACUUM FULL does its work. It's wise to run it when the database is least accessed, e.g. at night.

An alternative that doesn't need a full lock is the pg_repack extension:

pg_repack --table mytable test

pg_repack operates similarly to VACUUM FULL for database test and table mytable.

Table bloating and autovacuum

To see database bloat via the Supabase CLI run:

$ supabase inspect db bloat

Or query directly:

-- number of dead rows
SELECT n_dead_tup FROM pg_stat_user_tables WHERE relname = 'mytable';

-- number of live rows
SELECT count(*) FROM mytable;

If the numbers differ by more than 2x, chances are that the autovacuum didn't start or hasn't completed for a table. There could be several legitimate reasons for this.

You can see information for the last successful autovacuum by running:

$ supabase inspect db vacuum-stats

Or:

SELECT last_vacuum, last_autovacuum FROM pg_stat_user_tables WHERE relname = 'mytable';

Let's turn on autovacuum logging so that all autovacuum events land in the log:

ALTER TABLE mytable SET (log_autovacuum_min_duration = 0);

There are two ways we can encounter this situation:

  • autovacuum hasn't started
  • autovacuum did start, (possibly multiple times) but never succeeded

Autovacuum hasn't started for a table

Autovacuum starts based on several configuration parameters like timeout and pattern of access to a particular table. Maybe it's even legitimate that it hasn't started.

  • autovacuum_vacuum_threshold - number of rows updated or deleted in a table to invoke autovacuum
  • autovacuum_vacuum_insert_threshold - number of rows inserted into a table to invoke autovacuum
  • autovacuum_vacuum_scale_factor - a fraction of table modified by updates or deletes to invoke autovacuum
  • autovacuum_vacuum_insert_scale_factor - a fraction of table modified by inserts to invoke autovacuum

With all these parameters set, the autovacuum will start if the number of rows updated or deleted exceeds:

autovacuum_vacuum_scale_factor * size_of_table + autovacuum_vacuum_threshold

The same logic applies for inserts. Default scale factors are 20% of a table, which could be too high for big tables. If we want autovacuum to occur on large tables more frequently and take less time each run, decrease the default values for these tables:

ALTER TABLE mytable SET (autovacuum_vacuum_scale_factor = 0.05);

  • autovacuum_naptime (default 1 min) - every minute, the autovacuum daemon checks the state of all tables in the database and decides whether to start autovacuum for each of them. This parameter rarely needs to be modified.

To see global vacuum settings for your cluster run:

SELECT * from pg_settings where category like 'Autovacuum';

To see current settings for a table (that overrides global settings) run:

SELECT relname, reloptions FROM pg_class WHERE relname='mytable';

Autovacuum started but couldn't succeed

The most common reason autovacuum doesn't succeed is a long-running open transaction that still accesses old row versions. In that case, Postgres recognizes that those row versions are still needed, so row versions created after that point can't be marked as dead. A common cause of this problem is an interactive session accidentally left open. When tuples can't be marked as dead, the database begins to bloat.

To see all open transactions run:

SELECT xact_start, state FROM pg_stat_activity;

To close transactions found to be idling:

SELECT pg_cancel_backend(pid) FROM pg_stat_activity WHERE xact_start = '<value from previous query>';

For automatically closing idle transactions in a session:

SET idle_in_transaction_session_timeout TO '10000s';

The same parameter could be set per role or database as needed.
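For example, the timeout can be applied persistently to an application role or a whole database (the role and database names here are hypothetical); new sessions pick the setting up at login:

```sql
-- hypothetical role and database names
ALTER ROLE app_user SET idle_in_transaction_session_timeout = '10min';
ALTER DATABASE mydb SET idle_in_transaction_session_timeout = '10min';
```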

Another, less likely, possibility is that autovacuum can't succeed due to locks. If one of your processes takes a SHARE UPDATE EXCLUSIVE lock, e.g. via an ALTER TABLE statement, that lock prevents vacuum from processing the table. Lock conflicts in your ordinary transactions can cause SHARE UPDATE EXCLUSIVE to be held for a long time. A good recipe when this happens is to cancel the open transactions and run VACUUM on the table manually (or wait until the next autovacuum reaches it).

SELECT pg_cancel_backend(pid) FROM pg_stat_activity WHERE state = 'active';
VACUUM mytable;

Other vacuum optimizations

There could be too few autovacuum workers, or each worker could operate slowly due to a low autovacuum_work_mem setting.

  • autovacuum_max_workers (default 3) - the number of parallel workers doing autovacuum on tables. When you have enough cores, increasing the default is worthwhile. Note that this decreases the number of backends or regular parallel workers that can run at a time.
  • autovacuum_work_mem (default: equal to maintenance_work_mem, or 64 MB) - the working memory used by each autovacuum worker. If your logs show that autovacuum for some tables starts but takes a long time, increasing this value may reduce that time. This parameter can only be modified in the config file.

Conclusion

Vacuum and autovacuum are efficient ways to keep tables free of bloat, and they have several parameters that allow effective tuning. Some insight into what the database is doing can help prevent the cases where autovacuum becomes problematic and bloat increases:

  • Long open transactions
  • Stuck locks
  • Insufficient resources allocated to vacuuming
  • Space not freed on a filesystem level after massive table modifications

Apr 25, 2024

In this post, we explore the support tooling at Supabase, and how we use SLA Buddy to monitor our SLAs.
