DBMS where join+limit works?

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

DBMS where join+limit works?

Tim Starling-2
On thistle with DB=dewiki:

mysql> explain select * from recentchanges
left join tag_summary on ts_rc_id=rc_id
order by rc_timestamp desc limit 50\G

*************************** 1. row ***************************
        table: recentchanges
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 1179921
        Extra: Using temporary; Using filesort
*************************** 2. row ***************************
        table: tag_summary
         type: ALL
possible_keys: ts_rc_id
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 4
        Extra:
2 rows in set (0.00 sec)


Whenever you do a join with a limit, MySQL gets the query plan wrong.
It scans the small table and filesorts the large table. You have to
use FORCE INDEX on the small table to suppress the scan. We've seen
this many times. It's very difficult to detect during code review and
frequently crashes the site.

Does anyone know a DBMS where joining with limits actually works?
Because I'm sick of this crap.

-- Tim Starling


_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Reply | Threaded
Open this post in threaded view
|

Re: DBMS where join+limit works?

Nikola Smolenski
Tim Starling wrote:
> Whenever you do a join with a limit, MySQL gets the query plan wrong.
> It scans the small table and filesorts the large table. You have to
> use FORCE INDEX on the small table to suppress the scan. We've seen
> this many times. It's very difficult to detect during code review and
> frequently crashes the site.
>
> Does anyone know a DBMS where joining with limits actually works?
> Because I'm sick of this crap.

Would it help to do it in a way similar to what you have to do in DBMSes
without limit:

select a.* from (
   select * from recentchanges
   left join tag_summary on ts_rc_id=rc_id
   order by rc_timestamp desc
) a limit 50

or is MySQL's parser too smart for its own good?

_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Reply | Threaded
Open this post in threaded view
|

Re: DBMS where join+limit works?

Domas Mituzas
In reply to this post by Tim Starling-2
Hi!

> Whenever you do a join with a limit, MySQL gets the query plan wrong.


"Wherever you do a unconstrained join with a limit, MySQL may get the  
query plan wrong".
Anyway, this exact case looks like a bug that was fixed later.

>  It scans the small table and filesorts the large table.

It actually makes sense when small table refers to subset of large  
table. I've discussed this with optimizer team quite a few times, it  
is one of gray areas where having better statistics within engine help.

> Does anyone know a DBMS where joining with limits actually works?
> Because I'm sick of this crap.


Every database in the end will want you to provide hints in one way or  
another.
Indexing itself is already a hint :)

Anyway, as you may note, we were up for quite a while, just developers  
end up thinking that "oh, database is too stupid to do X or to do Y".

Every static decision can have its own problems, and this is where you  
end up having humans to do the work.
We don't employ software to write all the code for us? Maybe in  
similar ways we cannot always be sure, that software which does all  
the data management for us (how many developers do actually think  
about indexing and data fetch efficiency?), sometimes may go into  
wrong decisions?

(And yes, maybe it is fixed in later versions, along with other  
problems fixed, and other problems introduced).

> Does anyone know a DBMS where joining with limits actually works?

*shrug*, maybe PG does it properly, maybe MySQL 5.x does it properly,  
maybe sqlite does it properly.

> Because I'm sick of this crap.

Well, FORCE INDEX used to be where it was for a reason, whatever the  
reason was.
Removing it was problem too ;-)

Anyway, one can be sick of lots of crap, I'm sick of fact that 1% of  
our requests that get to backend (which is 0.1% of requests that are  
done to the site) make 50% of site CPU load. The simple fact is, that  
we had forced indexes where needed, but site still has 0.1% of user  
requests causing 50% of CPU load.

Cheers,
--
Domas Mituzas -- http://dammit.lt/ -- [[user:midom]]



_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Reply | Threaded
Open this post in threaded view
|

Re: DBMS where join+limit works?

Greg Sabino Mullane-2
In reply to this post by Tim Starling-2
> Does anyone know a DBMS where joining with limits actually works?
> Because I'm sick of this crap.

FWIW, this is what Postgres gives:

mw=# explain
select * from recentchanges
left join tag_summary on ts_rc_id=rc_id
order by rc_timestamp desc limit 50;

                  QUERY PLAN
------------------------------------------------------------
Limit  (cost=0.00..31.28 rows=50)
->  Nested Loop Left Join  (cost=0.00..475208.34 rows=2940914)
  ->  Index Scan Backward using rc_timestamp on recentchanges
      (cost=0.00..45650.46 rows=1020612)
  ->  Index Scan using tag_summary_rc_id on tag_summary
      (cost=0.00..0.37 rows=4)
      Index Cond: (tag_summary.ts_rc_id = recentchanges.rc_id)


EXPLAIN ANALYZE with about a million (bogus) rows in each table:

mw=# explain analyze
select * from recentchanges
left join tag_summary on ts_rc_id=rc_id
order by rc_timestamp desc limit 50;
                             QUERY PLAN
-------------------------------------------------------------------------
Limit  (cost=0.00..31.28 rows=50)
       (actual time=0.147..0.415 rows=50 loops=1)
->  Nested Loop Left Join
    (cost=0.00..475208.34 rows=2940914)
    (actual time=0.146..0.372 rows=50 loops=1)
  ->  Index Scan Backward using rc_timestamp on recentchanges
      (cost=0.00..45650.46 rows=1020612)
      (actual time=0.110..0.122 rows=4 loops=1)
  ->  Index Scan using tag_summary_rc_id on tag_summary
      (cost=0.00..0.47 rows=4)
      (actual time=0.014..0.043 rows=12 loops=4)
      Index Cond: (tag_summary.ts_rc_id = recentchanges.rc_id)
 Total runtime: 0.559 ms

--
Greg Sabino Mullane [hidden email]
End Point Corporation
PGP Key: 0x14964AC8


_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

signature.asc (233 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: DBMS where join+limit works?

Tim Landscheidt
In reply to this post by Domas Mituzas
Domas Mituzas <[hidden email]> wrote:

> [...]
>> Does anyone know a DBMS where joining with limits actually works?

> *shrug*, maybe PG does it properly, maybe MySQL 5.x does it properly,
> maybe sqlite does it properly.
> [...]

BTW, is it feasible to replicate from MySQL to PostgreSQL
(or SQLite :-)) in a continuous way so that one or two
cuckoo's eggs could be planted in the database farm? Though
probably undesirable from an operations viewpoint, a real-
time performance benchmark would be really cool :-).

Tim


_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Reply | Threaded
Open this post in threaded view
|

Re: DBMS where join+limit works?

Platonides
Tim Landscheidt wrote:
> BTW, is it feasible to replicate from MySQL to PostgreSQL
> (or SQLite :-)) in a continuous way so that one or two
> cuckoo's eggs could be planted in the database farm? Though
> probably undesirable from an operations viewpoint, a real-
> time performance benchmark would be really cool :-).
>
> Tim

The SQL syntax varies between dbs so I don't think it's feasible. Waht
_could_ be would be making MW update several DBs at the same time.
I'm sure db in consitencies will appear, though.


_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Reply | Threaded
Open this post in threaded view
|

Re: DBMS where join+limit works?

Domas Mituzas
In reply to this post by Tim Landscheidt
Hi,
> BTW, is it feasible to replicate from MySQL to PostgreSQL
> (or SQLite :-)) in a continuous way so that one or two
> cuckoo's eggs could be planted in the database farm?

in theory, MySQL's 5.1 replication sends only row changes, so that can  
be used to replicate to different stores.

> Though
> probably undesirable from an operations viewpoint, a real-
> time performance benchmark would be really cool :-).

Yes and yes :)

--
Domas Mituzas -- http://dammit.lt/ -- [[user:midom]]



_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Reply | Threaded
Open this post in threaded view
|

Re: DBMS where join+limit works?

Aryeh Gregor
In reply to this post by Tim Starling-2
On Fri, Feb 27, 2009 at 2:29 AM, Tim Starling <[hidden email]> wrote:

> On thistle with DB=dewiki:
>
> mysql> explain select * from recentchanges
> left join tag_summary on ts_rc_id=rc_id
> order by rc_timestamp desc limit 50\G
>
> *************************** 1. row ***************************
>        table: recentchanges
>         type: ALL
> possible_keys: NULL
>          key: NULL
>      key_len: NULL
>          ref: NULL
>         rows: 1179921
>        Extra: Using temporary; Using filesort
> *************************** 2. row ***************************
>        table: tag_summary
>         type: ALL
> possible_keys: ts_rc_id
>          key: NULL
>      key_len: NULL
>          ref: NULL
>         rows: 4
>        Extra:
> 2 rows in set (0.00 sec)

Almost two months later, but now that I have access to the toolsever
DB, here's some output from zedler on dewiki:

mysql> SELECT VERSION();
+-----------+
| VERSION() |
+-----------+
| 5.1.33    |
+-----------+
1 row in set (0.00 sec)

mysql> explain select * from recentchanges left join tag_summary on
ts_rc_id=rc_id order by rc_timestamp desc limit 50\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: recentchanges
         type: index
possible_keys: NULL
          key: rc_timestamp
      key_len: 16
          ref: NULL
         rows: 50
        Extra:
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: tag_summary
         type: ref
possible_keys: ts_rc_id
          key: ts_rc_id
      key_len: 5
          ref: dewiki.recentchanges.rc_id
         rows: 1
        Extra:
2 rows in set (0.00 sec)

So "MySQL 5.1" is a valid answer.  (EXPLAIN even understands LIMIT now, yay!)

_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l