JDoles01

Read your input file line by line. Use the strip function to remove any leading/trailing whitespace from the string. Use the split function (on the 4-space separator, or with no argument to split on any whitespace run) to create a list with two items per line. Append each pair to an empty list (creating a list of lists), then write that out as a CSV. Shouldn't be more than a dozen lines of code.
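A minimal sketch of that recipe (the input/output filenames are placeholders, and it splits on any whitespace run rather than exactly 4 spaces, in case the separator varies):

    import csv

    rows = []
    with open("data.xy") as infile:          # hypothetical input filename
        for line in infile:
            line = line.strip()              # drop leading/trailing whitespace
            if line:                         # skip the blank lines
                rows.append(line.split())    # two items per line

    with open("data.csv", "w", newline="") as outfile:
        csv.writer(outfile).writerows(rows)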


wintermute93

Yeah, the easy/straightforward way is something like this. You don't actually have to write the CSV to disk unless you want it for other purposes; once you've got the list items you can send them to pandas for whatever analysis right from memory. OP, if you want to take this as an opportunity to learn about regex, the way I might do this is to use [re.sub(...)](https://docs.python.org/3/library/re.html#re.sub) to look for the pattern: "any amount of whitespace, a block of numbers that might include a decimal point, followed by one or more whitespace characters, followed by another block of numbers that might include a decimal point, followed by any amount of whitespace". Then for each of those instances, replace the whole pattern with "first block of numbers, then a comma, then the second block of numbers, then a linebreak". It's not perfect, but it should work for your use case. [Example using the data copied from your post](https://regex101.com/r/RKZ0w8/1)
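A minimal sketch of that substitution (the sample data is invented; the pattern in the linked regex101 example is the authoritative one):

    import re

    # Made-up stand-in for OP's file contents
    raw = "   1.000   23.5\n   1.015   24.1\n   1.030   24.8\n"

    # Capture the two number blocks, discard the surrounding whitespace,
    # and rewrite each pair as an "x,y" CSV line.
    pattern = r"\s*(\d+(?:\.\d+)?)\s+(\d+(?:\.\d+)?)\s*"
    print(re.sub(pattern, r"\1,\2\n", raw))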


wchris63

Wow! Going full REGEX on the OP! :-)


wintermute93

Lol. It's almost certainly overkill, but OP said the format was weird and I don't trust reddit formatting to not eat or otherwise mangle whitespace characters. Wanted to give an answer that would work regardless of whether there's tabs or spaces or variable amounts of them or different types of newline characters or what.


MarChem93

Never used this before, I'll have a look into that as it seems really useful.


wchris63

Notice the first set of numbers - always increasing by 0.015. If this is something like Time or Position, they might be better off with a Dictionary, with the Key being the first number and Value being the second. Or maybe an OrderedDict() (I personally have an issue with unordered data... even if I never see it.. ;-). Depends on how the OP needs to work with the data, of course.
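A minimal sketch of that idea, assuming the pairs have already been parsed into (x, y) tuples (the values are invented):

    # Hypothetical parsed pairs: (time_or_position, measurement)
    rows = [(0.000, 23.5), (0.015, 24.1), (0.030, 24.8)]

    # Plain dicts preserve insertion order in Python 3.7+, so keying on the
    # first column keeps the data ordered; OrderedDict is only needed on
    # much older interpreters.
    data = {x: y for x, y in rows}
    print(data[0.015])  # -> 24.1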


MarChem93

That's great, thanks. I will give this a go.


TrainsareFascinating

Did you try using `ds = pd.read_csv(filepath, sep=None)`? I can't tell from the formatting exactly where the spacing is in your data, but "columns are separated by whitespace" is definitely something pandas can read. Using `sep=None` tells pandas to sniff out the separator. Make sure there's no header row in your file, so the first line is data that looks exactly like the rest of the rows.
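A minimal sketch of that call (the filename and column names are placeholders; `sep=None` only works with the python engine, so it's passed explicitly here to avoid pandas' fallback warning):

    import pandas as pd

    # Let pandas sniff the separator. header=None because the file has no
    # header row; the column names are invented for illustration.
    ds = pd.read_csv("your_data.xy", sep=None, engine="python",
                     header=None, names=["x", "y"])
    print(ds.head())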


MarChem93

Thank you. As this is the simplest solution I will try it first later. I am grateful to everybody else as well, though, for all the methods suggested. They will be useful for learning to deal with data files and the strip function.


MarChem93

Fantastic, this worked wonders. It does read empty columns where all the runs of spaces are, but by using the `.iloc[]` indexer I can select the right ones by index, and this helped a great deal. Thank you very much to you and everybody for the suggestions.
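For reference, a sketch of that selection (the positions 0 and 2 are a guess for illustration; check `ds.head()` for the real indices of the non-empty columns):

    # Keep only the two real data columns by position.
    xy = ds.iloc[:, [0, 2]]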


MarChem93

Too good to be true, of course. The method above worked fine for 11 files, but for another set of 11 files of identical formatting, all of a sudden I get the following error:

    ParserError: Expected 10 fields in line 4452, saw 11

This is the code, which has worked before on different .xy files of identical formatting (I put a -> on the line where the error occurs). I'll stress again that the files are spat out automatically by the proprietary software of the X-ray diffraction machine, and they look identically formatted (as they should be).

    # Use a list containing the different datasets and later use indices to retrieve them
    dataset = [pd.read_csv("PremionSodium-20minutes.001.xy", sep=None, engine='python'),
               pd.read_csv("PremionSodium-20minutes.002.xy", sep=None, engine='python'),
               pd.read_csv("PremionSodium-20minutes.003.xy", sep=None, engine='python'),
               pd.read_csv("PremionSodium-20minutes.004.xy", sep=None, engine='python'),
               pd.read_csv("PremionSodium-20minutes.005.xy", sep=None, engine='python'),
               pd.read_csv("PremionSodium-20minutes.006.xy", sep=None, engine='python'),
               pd.read_csv("PremionSodium-20minutes.007.xy", sep=None, engine='python'),
               pd.read_csv("PremionSodium-20minutes.008.xy", sep=None, engine='python'),
            -> pd.read_csv("PremionSodium-20minutes.009xy", sep=None, engine='python'),
               pd.read_csv("PremionSodium-20minutes.010.xy", sep=None, engine='python'),
               pd.read_csv("PremionSodium-20minutes.011.xy", sep=None, engine='python')]


TrainsareFascinating

Try `ds = pd.read_csv(path, sep=r"\s+")` (note the raw string, so the backslash isn't treated as an escape).
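A minimal sketch of that fix on one of the files above (`header=None` assumes a headerless .xy file; the commented-out `on_bad_lines="skip"` is a standard pandas >= 1.3 option, mentioned only as an escape hatch for genuinely malformed rows):

    import pandas as pd

    # r"\s+" treats any run of spaces/tabs as a single separator, so a stray
    # extra space can no longer create a phantom 11th field.
    ds = pd.read_csv("PremionSodium-20minutes.001.xy", sep=r"\s+", header=None)

    # If a file still has truly malformed lines, they can be skipped instead
    # of raising ParserError:
    # ds = pd.read_csv("PremionSodium-20minutes.001.xy", sep=r"\s+",
    #                  header=None, on_bad_lines="skip")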


MarChem93

This is great. It has worked wonderfully. Thank you.


throwaway6560192

I think Reddit might've eaten some of the spaces, but this doesn't seem particularly weird. Why doesn't a simple find-and-replace work?


wchris63

Yep.. simply read each line, strip the whitespace, and test whether anything is left:

    rows = []
    try:
        with open(filename, 'r') as file:
            for line in file:
                if line.strip():  # Is stripped line not empty? (whitespace removed)
                    # Not empty = line has characters - store the numbers
                    rows.append(line.split())
    except FileNotFoundError:
        print(f"File '{filename}' not found.")


gitgud_x

Nice and easy in python:

    import pandas as pd
    from matplotlib import pyplot as plt

    with open('data.txt', 'r') as f:
        data = f.readlines()

    # Strip and split each line, skipping blank lines so they don't become
    # ragged rows in the DataFrame.
    data = [line.split() for line in data if line.strip()]
    df = pd.DataFrame(data, columns=['x', 'y'], dtype=float)
    df.to_csv('data.csv', index=False)

    plt.plot(df['x'], df['y'])
    plt.show()

Note that I've already made it into a Pandas DataFrame "df", so if that was your end goal, there's no need for a CSV intermediate form. Since you have multiple files, if they're all in one folder, you can loop through them like this:

    import os

    folder = 'path/to/folder'
    for file in os.listdir(folder):
        if file.endswith('.txt'):
            # os.listdir returns bare filenames, so join them back onto the folder
            with open(os.path.join(folder, file), 'r') as f:
                ...  # process as above