Есть большой список сделок с недвижимостью примерно 26м строк.
Надо найти всю недвижимость по которой были сделки более одного раза.
Содержимое файла
0 {F887F88E-7D15-4415-804E-52EAC2F10958} 70000 1995-07-07 00:00 MK15 9HP D N F 31 NaN ALDRICH DRIVE WILLEN MILTON KEYNES MILTON KEYNES MILTON KEYNES A A
1 {40FD4DF2-5362-407C-92BC-566E2CCE89E9} 44500 1995-02-03 00:00 SR6 0AQ T N F 50 NaN HOWICK PARK SUNDERLAND SUNDERLAND SUNDERLAND TYNE AND WEAR A A
2 {7A99F89E-7D81-4E45-ABD5-566E49A045EA} 56500 1995-01-13 00:00 CO6 1SQ T N F 19 NaN BRICK KILN CLOSE COGGESHALL COLCHESTER BRAINTREE ESSEX A A
3 {28225260-E61C-4E57-8B56-566E5285B1C1} 58000 1995-07-28 00:00 B90 4TG T N F 37 NaN RAINSBROOK DRIVE SHIRLEY SOLIHULL SOLIHULL WEST MIDLANDS A A
4 {444D34D7-9BA6-43A7-B695-4F48980E0176} 51000 1995-06-28 00:00 DY5 1SA S N F 59 NaN MERRY HILL BRIERLEY HILL BRIERLEY HILL DUDLEY WEST MIDLANDS A A
6 {709FB471-3690-4945-A9D6-4F48CE65AAB6} 58000 1995-04-28 00:00 PE7 3AL D Y F 4 NaN BROOK LANE FARCET PETERBOROUGH PETERBOROUGH CAMBRIDGESHIRE A A
8 {E78710AD-ED1A-4B11-AB99-5A0614D519AD} 20000 1995-01-16 00:00 SA6 5AY D N F 592 NaN CLYDACH ROAD YNYSTAWE SWANSEA SWANSEA SWANSEA A A
9 {1DFBF83E-53A7-4813-A37C-5A06247A09A8} 137500 1995-03-31 00:00 NR2 2NQ D N F 26 NaN LIME TREE ROAD NORWICH NORWICH NORWICH NORFOLK A A
10 {BCD607D8-7698-4C5A-97D8-52EB0A6ED632} 57500 1995-12-20 00:00 WS6 7BQ D N F 6 NaN MERRILL CLOSE WALSALL WALSALL SOUTH STAFFORDSHIRE STAFFORDSHIRE A A
11 {461BAE3D-DD53-40C1-BE48-52EB1511C5F3} 70000 1995-09-20 00:00 GL52 3LH D N F 139B NaN NEW BARN LANE CHELTENHAM CHELTENHAM CHELTENHAM GLOUCESTERSHIRE A A
12 {003DB740-4F22-46E5-AD3C-52EB15407DE5} 62750 1995-10-02 00:00 BR3 4AT T N F 56 NaN EDEN ROAD BECKENHAM BECKENHAM BROMLEY GREATER LONDON A A
14 {8B5FB0C0-01CF-4584-9777-4F4908B02FEA} 78000 1995-09-29 00:00 SW16 3BL S N F 208 NaN GREEN LANE LONDON LONDON CROYDON GREATER LONDON A A
15 {38008828-F33E-4F54-927C-4BB90B83248E} 51000 1995-10-31 00:00 UB5 5LH S N F 57 NaN BENGARTH ROAD NORTHOLT NORTHOLT EALING GREATER LONDON A A
17 {8982E939-2927-457E-B695-5A066720BF59} 41000 1995-02-10 00:00 NE45 5AP T N F 21 NaN FRONT STREET CORBRIDGE CORBRIDGE TYNEDALE NORTHUMBERLAND A A
18 {C6AC299D-A75A-481D-BE03-5A06749177E3} 100000 1995-09-29 00:00 WR5 3EU D N F 8 NaN BARNESHALL AVENUE WORCESTER WORCESTER WORCESTER WORCESTERSHIRE A A
19 {EAFDBDAA-0C4D-4A9B-8CA2-5A06776A1BBB} 123000 1995-09-05 00:00 NG34 7TF D Y F 5 NaN CLAY HILL ROAD SLEAFORD SLEAFORD NORTH KESTEVEN LINCOLNSHIRE A A
20 {3A947F9C-D5C3-4178-8382-566E952D20A1} 47000 1995-10-30 00:00 BS15 9UR S N F 9 NaN AMBLE CLOSE BRISTOL BRISTOL KINGSWOOD AVON A A
21 {EFC499C5-9329-46AF-92FA-4F492E0A8BC3} 58000 1995-11-23 00:00 SN6 8LH S N F THYME COTTAGE NaN NaN IDSTONE SWINDON VALE OF WHITE HORSE OXFORDSHIRE A A
22 {6D414E04-656D-41AB-98F0-4F493154CE27} 85000 1995-06-09 00:00 KT16 8DW S N F 40 NaN EASTWORTH ROAD CHERTSEY CHERTSEY RUNNYMEDE SURREY A A
23 {2AF070FA-D882-4A62-9ABD-4F494D410DB1} 38000 1995-08-31 00:00 LU2 7LZ S N F 36 NaN HERON DRIVE LUTON LUTON LUTON LUTON A A
25 {245992BA-92D4-4529-B5AD-4BB942B6F435} 56500 1995-04-20 00:00 BN3 5AG T N F 15 NaN SHAKESPEARE STREET HOVE HOVE HOVE EAST SUSSEX A A
26 {444EAAD2-A559-4847-9A33-5A06847257AF} 94500 1995-06-01 00:00 LE65 1HF D Y F 8 NaN KELSO CLOSE ASHBY-DE-LA-ZOUCH ASHBY-DE-LA-ZOUCH NORTH WEST LEICESTERSHIRE LEICESTERSHIRE A A
27 {852368A9-09D6-4BA6-B33A-5A068BAD2D48} 89000 1995-11-24 00:00 B75 5PG D N F 28 NaN SLADE ROAD SUTTON COLDFIELD SUTTON COLDFIELD BIRMINGHAM WEST MIDLANDS A A
28 {CB70FDBB-49CC-412F-911C-5A069B724EE2} 110000 1995-12-21 00:00 BS15 3HH D N F 148 NaN HIGH STREET HANHAM BRISTOL KINGSWOOD AVON A A
29 {2A9D6B70-FC5D-46A1-A8EF-5A06AF633839} 51000 1995-02-17 00:00 NN15 6UQ S N F 80 NaN GOTCH ROAD BARTON SEAGRAVE KETTERING KETTERING NORTHAMPTONSHIRE A A
Хочу как можно эффективнее по потреблению ресурсов и времени выполнить задачу.
def main(filename):
st = time.time()
df = pd.read_csv(filename, delimiter=",")
print(df.info(memory_usage='deep'))
print(time.time() - st)
Но чтение файла уже выходит за все рамки.
Data columns (total 16 columns):
# Column Dtype
--- ------ -----
0 Transaction unique identifier object
1 Price int64
2 Date of Transfer object
3 Postcode object
4 Property Type object
5 Old/New object
6 Duration object
7 PAON object
8 SAON object
9 Street object
10 Locality object
11 Town/City object
12 District object
13 County object
14 PPD Category Type object
15 Record Status - monthly file only object
dtypes: int64(1), object(15)
memory usage: 23.6 GB
221.08039593696594 сек.
При разбиении на чанки в 10 ** 5 удалось снизить показатели, но такой подход значительно усложняет дальнейшую работу, так как их все равно надо хранить.
memory usage: 91.4 MB
140.3509292602539 сек.
Dask так же не дал какого либо прироста в скорости.
Цель написать программу, которая должна потреблять как можно меньше памяти и процессорного времени